Converting Sunflower LoRA Fine-tuned Models to GGUF Quantizations
This guide provides a tutorial for converting LoRA fine-tuned Sunflower models to GGUF format with multiple quantization levels, including experimental ultra-low-bit quantizations.
Table of Contents
- Prerequisites
- Environment Setup
- Model Preparation
- LoRA Merging
- GGUF Conversion
- Quantization Process
- Experimental Quantizations
- Quality Testing
- Ollama Integration
- Distribution
- Troubleshooting
Prerequisites
Hardware Requirements
- RAM: Minimum 32GB (64GB recommended for 14B+ models)
- Storage: 200GB+ free space for intermediate files
- GPU: Optional but recommended for faster processing
Software Requirements
- Linux/macOS (WSL2 for Windows)
- Python 3.9+
- Git and Git LFS
- CUDA toolkit (optional, for GPU acceleration)
Environment Setup
1. Install Dependencies
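A minimal setup sketch; the exact package set depends on your base model, and the virtual-environment name is an assumption:

```bash
# Isolated environment for the conversion tooling
python3 -m venv gguf-env
source gguf-env/bin/activate

# Core libraries for loading the base model and merging the LoRA adapter
pip install --upgrade pip
pip install torch transformers peft accelerate sentencepiece

# Git LFS is needed to pull large model files from Hugging Face
git lfs install
```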
2. Clone and Build llama.cpp
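For example, using the CMake build that llama.cpp documents (add `-DGGML_CUDA=ON` only if the CUDA toolkit is installed):

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# CPU-only build; the binaries land in build/bin/
cmake -B build
cmake --build build --config Release -j

# Python dependencies for the conversion scripts
pip install -r requirements.txt
```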
Model Preparation
1. Download Models
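A sketch using `huggingface-cli`; the pre-merged repo id comes from the LoRA Merging note below, and the base/adapter repo ids are placeholders if you merge yourself:

```bash
# Option A: pre-merged model (skip the LoRA Merging section if you use this)
huggingface-cli download Sunbird/qwen3-14b-sunflower-merged \
  --local-dir models/merged_model

# Option B: separate base model and LoRA adapter (repo ids are placeholders)
huggingface-cli download <base-model-repo> --local-dir models/base_model
huggingface-cli download <lora-adapter-repo> --local-dir models/lora_adapter
```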
LoRA Merging
1. Create Merge Script (Skip if Using Pre-merged Model)
Note: If you downloaded Sunbird/qwen3-14b-sunflower-merged, skip this section and go directly to GGUF Conversion. Create merge_lora.py only if you are using a separate base model and LoRA adapter.
merge_lora.py:
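A minimal sketch of the merge script; the model paths and output directory are assumptions and should match your download locations:

```python
# merge_lora.py - merge a LoRA adapter into its base model and save full weights
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "models/base_model"      # assumption: local path from the download step
LORA_ADAPTER = "models/lora_adapter"  # assumption: local path to the adapter
OUTPUT_DIR = "models/merged_model"

# Load the base model in fp16 and stream the weights in to limit peak RAM
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, low_cpu_mem_usage=True
)

# Attach the LoRA adapter, then fold its weights into the base model
model = PeftModel.from_pretrained(base, LORA_ADAPTER)
merged = model.merge_and_unload()

# Save merged weights and tokenizer for GGUF conversion
merged.save_pretrained(OUTPUT_DIR, safe_serialization=True)
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(OUTPUT_DIR)
print(f"Merged model written to {OUTPUT_DIR}")
```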
2. Run Merge Process
The merged full-precision weights are written to models/merged_model.
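Assuming the script above, the merge is a single command:

```bash
python merge_lora.py
# Expect roughly 2x the model size in peak RAM during the merge (see Troubleshooting)
```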
GGUF Conversion
1. Convert to F16 GGUF
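Using llama.cpp's conversion script; the output path and filename are assumptions:

```bash
mkdir -p models/gguf
python llama.cpp/convert_hf_to_gguf.py models/merged_model \
  --outfile models/gguf/sunflower-14b-f16.gguf \
  --outtype f16
```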
Quantization Process
1. Generate Importance Matrix
The importance matrix (imatrix) significantly improves quantization quality by identifying which weights are most critical to model performance. Adjust -ngl to match your GPU memory (use 0 for CPU-only); see the example command below.
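A sketch of the imatrix run; the calibration file and output path are assumptions:

```bash
./llama.cpp/build/bin/llama-imatrix \
  -m models/gguf/sunflower-14b-f16.gguf \
  -f calibration.txt \
  -o models/gguf/imatrix.dat \
  -ngl 0   # raise this if you have GPU memory to spare
```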
Note: Understanding Importance Matrix (imatrix) The importance matrix is a calibration technique that identifies which model weights contribute most significantly to output quality. During quantization, weights deemed “important” by the matrix receive higher precision allocation, while less critical weights can be more aggressively compressed. This selective approach significantly improves quantized model quality compared to uniform quantization. The imatrix is generated by running representative text through the model and measuring activation patterns. While general text datasets (like WikiText) work well for most models, using domain-specific calibration data (e.g., translation examples for the Sunflower model) can provide marginal quality improvements. The process adds 30-60 minutes to quantization time but is highly recommended for production models, especially when using aggressive quantizations like Q4_K_M and below.
2. Standard Quantizations
Create quantized models with different quality/size trade-offs; example commands follow the reference table below.
3. Quantization Options Reference
| Quantization | Bits per Weight | Quality | Use Case |
|---|---|---|---|
| Q8_0 | ~8.0 | Highest | Production, quality critical |
| Q6_K | ~6.6 | High | Production, balanced |
| Q5_K_M | ~5.5 | Good | Most users |
| Q4_K_M | ~4.3 | Acceptable | Resource constrained |
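A sketch of the standard quantization runs; the file names are assumptions, and passing the imatrix lets the quantizer give extra precision to the weights it flagged:

```bash
QUANTIZE=./llama.cpp/build/bin/llama-quantize
F16=models/gguf/sunflower-14b-f16.gguf
IMATRIX=models/gguf/imatrix.dat

"$QUANTIZE" --imatrix "$IMATRIX" "$F16" models/gguf/sunflower-14b-q8_0.gguf   Q8_0
"$QUANTIZE" --imatrix "$IMATRIX" "$F16" models/gguf/sunflower-14b-q6_k.gguf   Q6_K
"$QUANTIZE" --imatrix "$IMATRIX" "$F16" models/gguf/sunflower-14b-q5_k_m.gguf Q5_K_M
"$QUANTIZE" --imatrix "$IMATRIX" "$F16" models/gguf/sunflower-14b-q4_k_m.gguf Q4_K_M
```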
Experimental Quantizations
Warning: These quantizations achieve extreme compression but may significantly impact model quality.
Ultra-Low Bit Quantizations (example commands follow the reference table below)
Experimental Quantization Reference
| Quantization | Bits per Weight | Compression | Warning Level |
|---|---|---|---|
| IQ2_XXS | 2.06 | 85% smaller | Moderate quality loss |
| TQ1_0 | 1.69 | 87% smaller | High quality loss |
| IQ1_S | 1.56 | 88% smaller | Severe quality loss |
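A sketch of the ultra-low-bit runs; file names are assumptions, and the IQ formats in particular should be generated with the imatrix:

```bash
QUANTIZE=./llama.cpp/build/bin/llama-quantize
F16=models/gguf/sunflower-14b-f16.gguf
IMATRIX=models/gguf/imatrix.dat

"$QUANTIZE" --imatrix "$IMATRIX" "$F16" models/gguf/sunflower-14b-iq2_xxs.gguf IQ2_XXS
"$QUANTIZE" --imatrix "$IMATRIX" "$F16" models/gguf/sunflower-14b-tq1_0.gguf   TQ1_0
"$QUANTIZE" --imatrix "$IMATRIX" "$F16" models/gguf/sunflower-14b-iq1_s.gguf   IQ1_S
```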
Quality Testing
1. Quick Functionality Test
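For example, a short generation with llama-cli; the model file and the translation prompt are illustrative assumptions:

```bash
./llama.cpp/build/bin/llama-cli \
  -m models/gguf/sunflower-14b-q4_k_m.gguf \
  -p "Translate to Luganda: Good morning, how are you?" \
  -n 128 -ngl 0
```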
2. Perplexity Evaluation
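A sketch using llama-perplexity against a held-out text file (eval.txt is a placeholder); compare each quantization against the F16 baseline:

```bash
for f in models/gguf/sunflower-14b-*.gguf; do
  echo "== $f =="
  ./llama.cpp/build/bin/llama-perplexity -m "$f" -f eval.txt -ngl 0 | tail -n 2
done
```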
3. Size Verification
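For example:

```bash
ls -lh models/gguf/*.gguf
```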
Ollama Integration
Ollama provides an easy way to run your quantized models locally with a simple API interface.
Installation and Setup
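On Linux the official install script works; on macOS you can use Homebrew or the desktop app:

```bash
# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Verify the installation
ollama --version
```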
Creating Modelfiles for Different Quantizations
Q4_K_M (Recommended) - Modelfile (see the sketch below):
Importing Models to Ollama
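A sketch covering both steps, writing the Modelfile and importing it; the GGUF path, parameters, system prompt, and the model name sunflower-q4 are assumptions:

```bash
cat > Modelfile.q4 <<'EOF'
FROM ./models/gguf/sunflower-14b-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are Sunflower, a translation assistant for Ugandan languages."""
EOF

# Register the model with Ollama
ollama create sunflower-q4 -f Modelfile.q4
```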
Using Ollama Models
Interactive Chat (see the examples below):
Ollama API Usage
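Sketches for interactive use and for the HTTP API; the model name matches the import step above and the prompt is illustrative:

```bash
# Interactive chat in the terminal
ollama run sunflower-q4

# One-off generation via the REST API (Ollama listens on port 11434 by default)
curl http://localhost:11434/api/generate -d '{
  "model": "sunflower-q4",
  "prompt": "Translate to Luganda: Welcome to our village.",
  "stream": false
}'
```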
Start API Server (see below):
Model Management
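For example (model name as above):

```bash
# Start the Ollama server in the foreground (on Linux it usually runs as a service already)
ollama serve

# List, inspect, and remove installed models
ollama list
ollama show sunflower-q4
ollama rm sunflower-q4
```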
Performance Comparison Script
Create test_models.py:
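A minimal sketch of the comparison script; the model names in MODELS and the prompt are assumptions, and the Ollama server must be running locally:

```python
# test_models.py - compare response latency across quantization levels via the Ollama API
import json
import time
import urllib.request

MODELS = ["sunflower-q8", "sunflower-q4", "sunflower-iq2"]  # assumption: imported model names
PROMPT = "Translate to Luganda: The meeting starts at noon."
API_URL = "http://localhost:11434/api/generate"

for name in MODELS:
    payload = json.dumps({"model": name, "prompt": PROMPT, "stream": False}).encode()
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.1f}s")
    print(f"  {body.get('response', '').strip()[:120]}")
```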
Production Deployment
Distribution
1. Hugging Face Upload
Create upload script upload_models.py:
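A sketch using the huggingface_hub client; the target repo id is an assumption and a write token is required (run huggingface-cli login first):

```python
# upload_models.py - push the GGUF files to a Hugging Face model repo
from pathlib import Path
from huggingface_hub import HfApi

REPO_ID = "Sunbird/sunflower-14b-gguf"  # assumption: target repo id
api = HfApi()

# Create the repo if it does not exist yet
api.create_repo(REPO_ID, repo_type="model", exist_ok=True)

# Upload every quantized file; large files are handled via LFS automatically
for gguf in sorted(Path("models/gguf").glob("*.gguf")):
    print(f"Uploading {gguf.name} ...")
    api.upload_file(
        path_or_fileobj=str(gguf),
        path_in_repo=gguf.name,
        repo_id=REPO_ID,
    )
```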
2. Ollama Integration
The Ollama workflow (installation, Modelfiles, importing, API usage, and the performance comparison script) is covered in the Ollama Integration section above.
Troubleshooting Ollama
Common Issues:
- Model fails to load:
- Out of memory:
- Poor quality with experimental models:
- Ollama service not running:
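The exact fixes depend on your setup; a few generic checks, as a sketch (model names are assumptions from the examples above):

```bash
# Is the Ollama service running and reachable?
ollama list || ollama serve

# Out of memory or failure to load: try a smaller quantization of the same model
ollama run sunflower-q4      # instead of the Q8_0 or F16 build

# Poor quality with experimental builds: fall back to Q4_K_M or higher
ollama rm sunflower-iq1 && ollama run sunflower-q4
```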
Troubleshooting
Common Issues
1. Out of Memory During Merge
- The merge loads the full base model plus the adapter, so peak RAM use is roughly twice the model size (~56GB for a 14B model; see File Size Expectations below)
2. Poor Quality with Extreme Quantizations
- This is expected for extreme quantizations like IQ1_S
- Test with your specific use case
- Consider using IQ2_XXS as the minimum viable quantization
File Size Expectations
For a 14B parameter model:
- Merge process: Requires 2x model size in RAM (~56GB peak)
- F16 GGUF: ~28GB final size
- Quantized models: 3GB-15GB depending on level
- Total storage needed: ~200GB for all quantizations
Performance Notes
- Importance matrix generation: 30-60 minutes on modern hardware
- Each quantization: 5-10 minutes per model
- Upload time: Varies by connection, large files use Git LFS
- Memory usage: Peaks during merge, lower during quantization

