🌻 Sunflower Quantized Inference

The Sunflower models are available in 14B and 32B sizes and support 8-bit and 4-bit quantized inference. They are capable of high-quality translation, text generation, and conversational tasks, and the quantized variants are optimized to run efficiently on GPUs with limited memory.

8-bit Quantization

Balanced memory usage (~16 GB of VRAM for the 14B model) with high accuracy.
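
A minimal 8-bit loading sketch, assuming the weights are hosted on the Hugging Face Hub and that the transformers, accelerate, and bitsandbytes packages are installed. The repo ID Sunbird/Sunflower-14B is a placeholder assumption, not a confirmed checkpoint name.

```python
# Sketch: load a 14B Sunflower checkpoint in 8-bit via transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Sunbird/Sunflower-14B"  # hypothetical repo ID; substitute the real one

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit weights, ~16 GB for 14B

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place layers on the available GPUs
)
```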

4-bit Quantization

Low memory usage (~10 GB of VRAM for the 14B model) and faster inference.
Do not enable both 8-bit and 4-bit modes at the same time; choose exactly one.
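
A matching 4-bit sketch under the same assumptions (placeholder repo ID, transformers + bitsandbytes installed). Note that exactly one of load_in_8bit and load_in_4bit is set.

```python
# Sketch: 4-bit loading. Set load_in_4bit OR load_in_8bit, never both.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Sunbird/Sunflower-14B"  # hypothetical repo ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # common 4-bit weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```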

Model Variants

Sunflower 14B

  • 14B 8-bit: Balanced memory and accuracy; fits GPUs with roughly 16 GB of VRAM or more.
  • 14B 4-bit: Optimized for memory-limited GPUs (roughly 10 GB of VRAM) and faster inference.

Sunflower 32B

  • 32B 8-bit: High accuracy; requires significant VRAM.
  • 32B 4-bit: Reduced memory usage and faster inference, at slightly lower accuracy. A sketch for picking a variant by available VRAM follows below.
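
A hypothetical helper for choosing a variant from free GPU memory. The 14B thresholds come from the figures above; the 32B thresholds and all repo IDs are illustrative assumptions, so adjust them to your own hardware and checkpoints.

```python
# Sketch: pick the largest Sunflower variant that fits in free VRAM.
import torch

# (approx. VRAM needed in GiB, hypothetical repo ID, quantization bits)
# 32B figures are assumptions; 14B figures follow the documentation above.
VARIANTS = [
    (40, "Sunbird/Sunflower-32B", 8),
    (20, "Sunbird/Sunflower-32B", 4),
    (16, "Sunbird/Sunflower-14B", 8),
    (10, "Sunbird/Sunflower-14B", 4),
]

def pick_variant() -> tuple[str, int]:
    """Return (repo_id, bits) for the largest variant that fits in free VRAM."""
    free_bytes, _total = torch.cuda.mem_get_info()
    free_gib = free_bytes / 2**30
    for needed_gib, repo_id, bits in VARIANTS:  # ordered largest-first
        if free_gib >= needed_gib:
            return repo_id, bits
    raise RuntimeError(f"Only {free_gib:.1f} GiB free; no variant fits.")

repo_id, bits = pick_variant()
print(f"Loading {repo_id} in {bits}-bit mode")
```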

Best Practices

Use 4-bit models when GPU memory is constrained; use 8-bit models when you need maximum accuracy and have sufficient VRAM.
  • Monitor GPU memory when processing large inputs or running batched inference.
  • Adjust inference parameters (e.g., maximum input and output sequence length) to fit your hardware limits, as sketched below.
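
The following sketch combines both practices: it truncates the input to a fixed token budget, bounds the generated length, and reports peak GPU memory around generate(). It assumes model and tokenizer were loaded as in the earlier snippets; the length limits are illustrative, not recommended values.

```python
# Sketch: bounded sequence lengths plus peak-memory monitoring.
import torch

prompt = "Translate to French: How are you today?"

# Truncate long inputs so the sequence fits the hardware budget.
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=1024,  # illustrative input cap
).to(model.device)

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)  # bound output length

peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"Peak GPU memory during generation: {peak_gib:.2f} GiB")
```

If the peak figure approaches your GPU's capacity, lower max_length or max_new_tokens, reduce the batch size, or switch to the 4-bit variant.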