🌻 Sunflower Quantized Inference

The Sunflower models are available in 14B and 32B sizes and support 8-bit and 4-bit quantized inference. They are capable of high-quality translation, text generation, and conversational tasks, and the quantized variants are optimized to run efficiently on GPUs with limited memory.

8-bit Quantization

Balanced memory usage (~16 GB of VRAM for the 14B model) with high accuracy.
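
A minimal 8-bit loading sketch, assuming the weights are hosted on the Hugging Face Hub and that the transformers, accelerate, and bitsandbytes packages are installed. The repo ID Sunbird/Sunflower-14B is a placeholder assumption, not a confirmed checkpoint name.

```python
# Sketch: load a 14B Sunflower checkpoint in 8-bit via transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Sunbird/Sunflower-14B"  # hypothetical repo ID; substitute the real one

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit weights, ~16 GB for 14B

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place layers on the available GPUs
)
```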

4-bit Quantization

Low memory usage (~10 GB of VRAM for the 14B model) and faster inference.
Do not enable both 8-bit and 4-bit modes at the same time; choose exactly one.
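
A matching 4-bit sketch under the same assumptions (placeholder repo ID, transformers + bitsandbytes installed). Note that exactly one of load_in_8bit and load_in_4bit is set.

```python
# Sketch: 4-bit loading. Set load_in_4bit OR load_in_8bit, never both.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Sunbird/Sunflower-14B"  # hypothetical repo ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # common 4-bit weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```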

Model Variants

Sunflower 14B

  • 14B 8-bit: Balanced memory and accuracy; fits GPUs with roughly 16 GB of VRAM or more.
  • 14B 4-bit: Optimized for memory-limited GPUs (roughly 10 GB of VRAM) and faster inference.

Sunflower 32B

  • 32B 8-bit: High accuracy; requires significant VRAM.
  • 32B 4-bit: Reduced memory usage and faster inference, at slightly lower accuracy. A sketch for picking a variant by available VRAM follows below.
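
A hypothetical helper for choosing a variant from free GPU memory. The 14B thresholds come from the figures above; the 32B thresholds and all repo IDs are illustrative assumptions, so adjust them to your own hardware and checkpoints.

```python
# Sketch: pick the largest Sunflower variant that fits in free VRAM.
import torch

# (approx. VRAM needed in GiB, hypothetical repo ID, quantization bits)
# 32B figures are assumptions; 14B figures follow the documentation above.
VARIANTS = [
    (40, "Sunbird/Sunflower-32B", 8),
    (20, "Sunbird/Sunflower-32B", 4),
    (16, "Sunbird/Sunflower-14B", 8),
    (10, "Sunbird/Sunflower-14B", 4),
]

def pick_variant() -> tuple[str, int]:
    """Return (repo_id, bits) for the largest variant that fits in free VRAM."""
    free_bytes, _total = torch.cuda.mem_get_info()
    free_gib = free_bytes / 2**30
    for needed_gib, repo_id, bits in VARIANTS:  # ordered largest-first
        if free_gib >= needed_gib:
            return repo_id, bits
    raise RuntimeError(f"Only {free_gib:.1f} GiB free; no variant fits.")

repo_id, bits = pick_variant()
print(f"Loading {repo_id} in {bits}-bit mode")
```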

Best Practices

Use 4-bit models when GPU memory is constrained; use 8-bit models when you need maximum accuracy and have sufficient VRAM.
  • Monitor GPU memory when processing large inputs or running batched inference.
  • Adjust inference parameters (e.g., maximum input and output sequence length) to fit your hardware limits, as sketched below.
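
The following sketch combines both practices: it truncates the input to a fixed token budget, bounds the generated length, and reports peak GPU memory around generate(). It assumes model and tokenizer were loaded as in the earlier snippets; the length limits are illustrative, not recommended values.

```python
# Sketch: bounded sequence lengths plus peak-memory monitoring.
import torch

prompt = "Translate to French: How are you today?"

# Truncate long inputs so the sequence fits the hardware budget.
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=1024,  # illustrative input cap
).to(model.device)

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)  # bound output length

peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"Peak GPU memory during generation: {peak_gib:.2f} GiB")
```

If the peak figure approaches your GPU's capacity, lower max_length or max_new_tokens, reduce the batch size, or switch to the 4-bit variant.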