Distillation vs. quantization vs. pruning
Three different ways to shrink a model — knowledge transfer, precision reduction, and removal — what each one actually changes, and how to stack them into one local-ready pipeline.
Distillation trains a new, smaller model to mimic a larger one (knowledge transfer); quantization stores the same model's weights at lower precision like FP16→INT4 (precision reduction); pruning removes weights, neurons, or layers (removal). They are orthogonal and stack: a common pipeline is prune → distill to recover accuracy → quantize to a Q4_K_M GGUF for local use.
People use "compression," "distillation," and "quantization" almost interchangeably. They're not the same thing — they change different parts of a model, and the best local models often use all three. Getting the distinctions straight is the fastest way to stop being confused by model names and release notes.
The one-line difference
| Technique | What it changes | What you get |
|---|---|---|
| Distillation | Trains a new, smaller architecture to mimic a teacher | Fewer parameters, retained capability |
| Pruning | Removes weights, neurons, heads, or layers | A sparser or structurally smaller model |
| Quantization | Reduces numeric precision of weights (e.g. FP16 → INT4) | Same parameters, fewer bits each |
Said even more briefly: distillation = knowledge transfer. Pruning = removal. Quantization = precision reduction.
Quantization, a little deeper
Quantization is the cheapest, most common way to make a model runnable locally, because it touches only how the weights are stored, not what the model is. A 16-bit weight becomes a 4-bit approximation; the file shrinks by roughly 3.5 to 4× (per-block scale factors add a little overhead), usually with a modest quality hit.
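To make "a 4-bit approximation" concrete, here is a minimal sketch of block-wise quantization. It assumes a simplified symmetric scheme (one float scale per block of 32 weights, signed 4-bit codes); real formats such as the k-quants are more elaborate, and every name here (`quantize_block`, `block_size`, and so on) is illustrative rather than taken from any library.

```python
import random

def quantize_block(weights, bits=4):
    """Symmetric quantization of one block: one float scale plus small ints."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    # Round to the nearest code and clamp into [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights from the scale and the integer codes."""
    return [scale * c for c in codes]

def quantize(weights, block_size=32, bits=4):
    """Split the weights into blocks and quantize each block independently."""
    return [
        quantize_block(weights[start:start + block_size], bits)
        for start in range(0, len(weights), block_size)
    ]

def dequantize(blocks):
    out = []
    for scale, codes in blocks:
        out.extend(dequantize_block(scale, codes))
    return out

# Made-up weights, only to exercise the round trip.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
blocks = quantize(weights)
restored = dequantize(blocks)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
largest_scale = max(scale for scale, _ in blocks)
print(f"max abs error: {max_error:.5f} (bounded by half a step: {largest_scale / 2:.5f})")
```

The rounding error of each weight is at most half of its block's step size, which is why smaller blocks (and smarter scaling) preserve quality better at the same bit width.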
You'll meet these names constantly:
- GGUF: the de facto file format for distributing quantized models for `llama.cpp` and Ollama. The label `Q4_K_M` decodes as: `Q4` ≈ 4-bit, `_K` = k-quant (smarter per-block scaling), `_M` = the medium mixed-precision tier. `Q4_K_M` (roughly 4.5 to 5 bits/weight once block scales and a few higher-precision tensors are counted) is the community's usual quality-vs-size sweet spot.
- GPTQ / AWQ: GPU-oriented post-training quantizers. AWQ ("activation-aware") protects the ~1% of weight channels that matter most, which helps reasoning and code hold up at low bit-rates.
- FP8 / NVFP4: floating-point formats that keep dynamic range for near-FP16 quality; FP8 has native hardware support on recent GPUs, and 4-bit float (NVFP4) is now reaching consumer hardware.
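The size savings behind these formats are simple arithmetic. The sketch below estimates weight storage from a parameter count and an average bits-per-weight figure. The 8B parameter count is a hypothetical example, `model_size_gb` is an illustrative helper, and 4.5 bits/weight stands in for a 4-bit k-quant; the real average varies by file, and metadata is ignored.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring file metadata."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 8e9  # hypothetical 8B-parameter model

# Average bits per weight are assumptions chosen for illustration.
for label, bpw in [("16-bit", 16.0), ("8-bit", 8.0), ("4.5-bit", 4.5)]:
    print(f"{label:>8}: {model_size_gb(n_params, bpw):.1f} GB")

ratio = model_size_gb(n_params, 16.0) / model_size_gb(n_params, 4.5)
print(f"16-bit vs 4.5-bit: {ratio:.2f}x smaller")
```

Under these assumptions, the 16-bit model needs about 16 GB for weights alone and the 4.5-bit version about 4.5 GB, which is the difference between needing a workstation GPU and fitting on a typical laptop.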
Pruning, a little deeper
Pruning removes parts of the network judged unimportant.
- Unstructured pruning zeros individual weights — high compression, but the irregular sparsity is hard for hardware to exploit.
- Structured pruning removes whole groups (neurons, attention heads, channels, or entire layers) — friendlier to hardware, at some accuracy cost.
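- Semi-structured (e.g. 2:4) keeps at most 2 nonzero weights in every group of 4 consecutive weights, a pattern that NVIDIA's sparse tensor cores (Ampere and later) can run at up to ~2× the dense matrix-multiply throughput.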
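The difference between unstructured and 2:4 sparsity is easiest to see on a tiny example. This is a minimal sketch that assumes weight magnitude as the importance score (one common heuristic, not the only one); the function names and the sample row are made up for illustration.

```python
def prune_unstructured(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights, wherever they sit."""
    n_drop = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def prune_2_of_4(weights):
    """Keep the 2 largest-magnitude weights in each consecutive group of 4."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:2])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out

row = [0.9, -0.8, 0.7, -0.6, 0.05, 0.3, -0.01, 0.02]

unstructured = prune_unstructured(row, 0.5)
semi_structured = prune_2_of_4(row)

print(unstructured)     # [0.9, -0.8, 0.7, -0.6, 0.0, 0.0, 0.0, 0.0]
print(semi_structured)  # [0.9, -0.8, 0.0, 0.0, 0.05, 0.3, 0.0, 0.0]
```

Both results are 50% sparse, but the unstructured version is free to keep all four large weights in the first group, while the 2:4 version must give up two of them to satisfy the regular pattern that hardware can exploit.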
The punchline: they stack
These aren't competitors — they're stages of a pipeline. Two canonical recipes:
- Prune → distill → quantize. NVIDIA's Minitron work does exactly the first two: prune Llama-3.1-8B, then distill from the unpruned model to recover the lost accuracy, yielding a strong 4B model. NVIDIA reports up to 40× fewer training tokens per model than training from scratch for this prune-and-distill approach. Finish by quantizing to a GGUF and you have a small, cheap, capable local model.
- Distill, then quantize. Train a small student from a big teacher, then ship it as a `Q4_K_M` GGUF for laptops.
Distillation even shows up inside quantization: quantization-aware training (QAT) can use the full-precision model's outputs as distillation targets so the quantized version stays close to the original. Gemma 3 ships QAT checkpoints built this way.
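Using a model's outputs as distillation targets typically means matching softened probability distributions rather than hard labels. The sketch below assumes the classic temperature-scaled KL-divergence loss on a single token's logits. It is a generic illustration, not the recipe used by Minitron or Gemma 3, and the logit values are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # targets from the teacher
    q = softmax(student_logits, temperature)  # predictions from the student
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

# Made-up logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(close_student, teacher))  # small: student agrees with teacher
print(distillation_loss(far_student, teacher))    # large: student disagrees
```

In the QAT case, the "student" is the quantized model and the "teacher" is its own full-precision version, so minimizing this loss pulls the quantized outputs back toward the original.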
Which do you reach for?
- Just want to run an existing good model on your hardware? Quantize (or download a GGUF someone already quantized).
- Want a smaller model that's genuinely good at your task, not just a compressed giant? Distill.
- Need to squeeze a specific architecture for a specific device? Prune, then distill to recover, then quantize.
The art of building a great local model is knowing how to combine all three. The rest of the knowledge base is about doing exactly that.