Distillation vs. quantization vs. pruning

People use "compression," "distillation," and "quantization" almost interchangeably. They're not the same thing: "compression" is the umbrella term, while distillation, pruning, and quantization each change a different part of a model, and the best local models often use all three. Getting the distinctions straight is the fastest way to stop being confused by model names and release notes.

The one-line difference

Technique	What it changes	What you get
Distillation	Trains a new, smaller architecture to mimic a teacher	Fewer parameters, retained capability
Pruning	Removes weights, neurons, heads, or layers	A sparser or structurally smaller model
Quantization	Reduces numeric precision of weights (e.g. FP16 → INT4)	Same parameters, fewer bits each

Said even more briefly: distillation = knowledge transfer. Pruning = removal. Quantization = precision reduction.

Quantization, a little deeper

Quantization is the cheapest, most common way to make a model runnable locally, because it touches only how the weights are stored, not what the model is. A 16-bit weight becomes a 4-bit approximation; the model gets roughly 3.5 to 4× smaller (the per-block scale factors cost a few tenths of a bit per weight), usually with a modest quality hit.
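To make the "storage only" point concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It assumes symmetric absmax scaling with one scale per 32-weight block. The function names, the block size, and the sample weights are illustrative choices, and real formats such as GGUF k-quants use more elaborate schemes.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block of floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0    # one float scale per block
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from a block's scale and integer codes."""
    return [scale * c for c in codes]


def quantize(weights, block_size=32, bits=4):
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks):
    """Concatenate the dequantized blocks back into one flat list."""
    restored = []
    for scale, codes in blocks:
        restored.extend(dequantize_block(scale, codes))
    return restored


# Hypothetical weights, using a small block size so two blocks are visible.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.05, 0.29]
blocks = quantize(weights, block_size=4)
restored = dequantize(blocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_error:.4f}")
```

Each weight comes back within half a quantization step of its original value, which is where the quality hit comes from. Under these assumptions, storing a 16-bit scale per 32 weights would cost 4 + 16/32 = 4.5 bits per weight.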

You'll meet these names constantly:

GGUF: the de-facto file format for distributing quantized models for llama.cpp and Ollama. The label Q4_K_M decodes as: Q4 ≈ 4-bit, _K = k-quant (smarter per-block scaling), _M = the medium mixed-precision tier. Q4_K_M (roughly 4.5 to 5 bits/weight in practice, because some tensors are kept at higher precision) is the community's quality-vs-size sweet spot.
GPTQ / AWQ: GPU-oriented post-training quantizers. AWQ ("activation-aware") protects the ~1% of weight channels that matter most, which helps reasoning and code hold up at low bit-rates.
FP8 / NVFP4: floating-point formats that keep more dynamic range than integers of the same width, for near-FP16 quality. FP8 runs natively on recent GPUs that have FP8 tensor cores, and 4-bit float (NVFP4) is now reaching consumer hardware.
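The bits-per-weight figures attached to these formats turn into file sizes with simple arithmetic. The sketch below works through it for a 4-bit block format. It assumes a 256-weight block with 16 bytes of scale metadata, which is our reading of llama.cpp's Q4_K block layout and should be treated as an assumption. The helper names and the 8B parameter count are hypothetical.

```python
def bits_per_weight(block_weights, payload_bits, overhead_bytes):
    """Effective bits per weight: payload bits plus per-block metadata, averaged."""
    total_bits = block_weights * payload_bits + overhead_bytes * 8
    return total_bits / block_weights


def model_size_gb(n_params, bpw):
    """Approximate weight storage in gigabytes (10^9 bytes), ignoring file headers."""
    return n_params * bpw / 8 / 1e9


# Assumed layout: 256 weights at 4 bits each plus 16 bytes of scale metadata.
q4_bpw = bits_per_weight(block_weights=256, payload_bits=4, overhead_bytes=16)

n_params = 8e9  # hypothetical 8B-parameter model
fp16_gb = model_size_gb(n_params, 16)
q4_gb = model_size_gb(n_params, q4_bpw)

print(f"{q4_bpw} bits/weight")                            # 4.5 bits/weight
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")   # FP16: 16.0 GB, 4-bit: 4.5 GB
print(f"ratio: {fp16_gb / q4_gb:.2f}x")                   # ratio: 3.56x
```

Real mixed-precision files come out somewhat larger than this estimate, because a few tensors are typically stored at more than 4 bits.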

Pruning, a little deeper

Pruning removes parts of the network judged unimportant.

Unstructured pruning zeros individual weights — high compression, but the irregular sparsity is hard for hardware to exploit.
Structured pruning removes whole groups (neurons, attention heads, channels, or entire layers) — friendlier to hardware, at some accuracy cost.
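Semi-structured (e.g. 2:4) keeps 2 of every 4 consecutive weights nonzero, a regular pattern that sparse tensor cores on recent NVIDIA GPUs can exploit for up to a ~2× speedup on the affected matrix multiplies.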
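The difference between unstructured and 2:4 pruning is easiest to see on a toy weight vector. The following is a minimal sketch that assumes weight magnitude as the importance score, which is one common choice among several. The function names and sample values are illustrative.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the smallest-magnitude fraction of all weights."""
    k = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_2_of_4(weights):
    """Semi-structured 2:4 pruning: keep the 2 largest magnitudes in each group of 4."""
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        ranked = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:2])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weights: the first group of 4 is large, the second is small.
w = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.05, 0.01]

print(magnitude_prune(w, 0.5))  # [0.9, 0.8, 0.7, 0.6, 0.0, 0.0, 0.0, 0.0]
print(prune_2_of_4(w))          # [0.9, 0.8, 0.0, 0.0, 0.1, 0.2, 0.0, 0.0]
```

Both results are 50% sparse, but the unstructured version is free to wipe out the whole second group, while 2:4 must keep two weights in every group. That regularity is what hardware can exploit, and it is also why the pattern can cost some accuracy.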

The punchline: they stack

These aren't competitors — they're stages of a pipeline. Two canonical recipes:

Prune → distill → quantize. NVIDIA's Minitron work covers the first two stages: it prunes Llama-3.1-8B to a smaller network, then distills from the unpruned model to recover the lost accuracy, reaching a strong 4B model with up to 40× fewer training tokens than training from scratch. Finish by quantizing to a GGUF and you have a small, cheap, capable local model.
Distill, then quantize. Train a small student from a big teacher, then ship it as a Q4_K_M GGUF for laptops.
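Both recipes lean on a distillation step, so it helps to see what "mimic the teacher" means as a loss. The sketch below assumes the classic soft-target formulation: a KL divergence between temperature-softened teacher and student distributions, scaled by the temperature squared. The recipes above do not specify their exact loss, so treat this as one common choice. The function names and logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature^2."""
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # the student's current prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical next-token logits over a 3-token vocabulary.
teacher = [4.0, 1.0, 0.5]
student_near = [3.5, 1.2, 0.4]
student_far = [0.5, 1.0, 4.0]

print(f"near student: {distillation_loss(student_near, teacher):.4f}")
print(f"far student:  {distillation_loss(student_far, teacher):.4f}")
```

The student that ranks tokens like the teacher gets a much smaller loss than the one that inverts the ranking. In training, this term is typically minimized over many examples, sometimes combined with the ordinary next-token loss.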

Distillation even shows up inside quantization: quantization-aware training (QAT) can use the full-precision model's outputs as distillation targets so the quantized version stays close to the original. Gemma 3 ships QAT checkpoints built this way.
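As a rough illustration of that idea, the sketch below computes a distillation-style loss between a tiny full-precision linear layer and the same layer with fake-quantized weights. It is forward-only and makes several assumptions: per-row absmax fake quantization, a plain KL objective, and made-up weights and inputs. Real QAT would also backpropagate through the rounding (commonly with a straight-through estimator), which is omitted here, and nothing below is taken from the Gemma 3 recipe.

```python
import math


def fake_quantize(row, bits=4):
    """Round a weight row to a signed integer grid, then map it back to floats."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [scale * max(-qmax, min(qmax, round(w / scale))) for w in row]


def linear(weight_rows, x):
    """One output logit per weight row: the dot product of that row with x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def qat_distill_loss(weight_rows, x, bits=4):
    """KL between full-precision outputs (targets) and fake-quantized outputs."""
    targets = softmax(linear(weight_rows, x))
    quantized_rows = [fake_quantize(row, bits) for row in weight_rows]
    outputs = softmax(linear(quantized_rows, x))
    return sum(p * math.log(p / q) for p, q in zip(targets, outputs) if p > 0)


# Hypothetical 2-output layer and input vector.
rows = [[7.0, -3.0, 1.0, 0.0], [2.0, 4.0, -6.0, 1.0]]
x = [0.1, 0.2, -0.1, 0.3]

print(f"4-bit loss: {qat_distill_loss(rows, x, bits=4):.6f}")
print(f"2-bit loss: {qat_distill_loss(rows, x, bits=2):.6f}")
```

In this toy case the 2-bit version drifts further from the full-precision outputs than the 4-bit one, so its loss is larger. Training against this signal is what nudges the weights toward values that survive rounding.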

Which do you reach for?

Just want to run an existing good model on your hardware? Quantize (or download a GGUF someone already quantized).
Want a smaller model that's genuinely good at your task, not just a compressed giant? Distill.
Need to squeeze a specific architecture for a specific device? Prune, then distill to recover, then quantize.

The art of building a great local model is knowing how to combine all three. The rest of the knowledge base is about doing exactly that.