What is a GGUF?
GGUF (GPT-Generated Unified Format) is a binary file format designed for storing and running large language models efficiently on consumer hardware. It is the standard format for local AI inference, allowing users to run models using a combination of CPU and GPU without requiring massive enterprise servers. GGUF focuses on quantization, compressing the model to reduce memory usage and file size significantly, while packaging everything needed for inference—such as tensors and metadata—into one single, easy-to-deploy file.
Understanding GGUF Model Names & Quantization
Letters in GGUF model names indicate quantization levels (how much the model is compressed to fit your hardware). The format typically looks like Model-Name-Size-Quant.gguf.
The Letter Breakdown:
- Q: Stands for Quantized. This means the model's weights have been compressed into fewer bits to use less memory.
- Number after Q: The bit-depth used for compression (e.g., Q4 is 4-bit). Lower numbers mean smaller files but less accuracy; higher numbers offer better reasoning but require more RAM/VRAM.
- K: Stands for K-quants. It indicates a smarter, hierarchical compression method that allocates bits more efficiently across different layers of the model.
- S, M, L (after K): These are size and quality modifiers for K-quants:
- S (Small): The most aggressive compression, smallest file size, lowest accuracy.
- M (Medium): The community sweet spot. Uses mixed precision to balance file size and reasoning quality.
- L (Large): Prioritizes accuracy over size, resulting in a larger file.
The "A" and Active Parameters (Mixture of Experts)
With the rise of Mixture of Experts (MoE) architectures, you will increasingly see an "A" (Active) or a split parameter count in model documentation and naming (e.g., 80B-3BA or noting "32B active parameters per token").
- Total vs. Active Parameters: MoE architectures route data through specialized sub-networks (experts) rather than activating the entire neural network at once. For example, a model might have 80 billion total parameters, but only 3 billion of them are "active parameters" during any single token generation.
- Why It Matters for Your Hardware: You must have enough system RAM or GPU VRAM to load the entire total parameter size into memory. However, the generation speed (compute cost) is driven by the active parameters. This allows massive, trillion-parameter models to run surprisingly fast locally—provided you have the sheer memory capacity to load the files.
What should you download?
For most users with a standard local setup, the Q4_K_M is the recommended starting point as it provides the best balance of speed, quality, and memory size.
