What is model distillation?

Imagine you could sit a brilliant professor down with a sharp student and have the professor pour decades of intuition into them — not just the answers to a test, but the way they think. Then you send the student out into the world to do the work, at a fraction of the professor's cost, fast enough to keep up with you.

That, in a nutshell, is model distillation.

The problem distillation solves

The most capable AI models are enormous. A frontier model might have hundreds of billions of parameters, demand a cluster of expensive GPUs, and live behind an API where every answer is metered by the token. They are remarkable — and, for most people and most uses, impractical to run yourself.

Meanwhile, a small model — say 1 to 8 billion parameters — can run on a laptop, a phone, or a single consumer GPU. It's fast, private, and yours. The catch: trained on its own, a small model is usually far less capable.

Distillation closes that gap. It transfers the knowledge of a large teacher model into a small student model, so the student keeps much of the capability while shedding most of the size.

Why not just train the small model normally?

You can — but you'll usually get a worse result. Here's the key insight that makes distillation work.

When a normal model trains on data, it learns from hard labels: this image is a "cat" (1) and nothing else (0). That throws away information. A good teacher model, shown a picture of a cat, doesn't just say "cat" — it says probably cat, but a little bit like a dog, and definitely not a car. Those graded probabilities are called soft labels (or soft targets), and they're far richer than a single right answer.

The teacher's uncertainty is itself a form of knowledge. Learning that this cat is "8% dog-like" teaches the student something about the structure of the world that a bare "cat" label never could.
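
To make the difference concrete, here is a minimal sketch comparing a hard label with a soft one. The three classes and the teacher's probabilities are made-up numbers for illustration, not output from any real model.

```python
import math

classes = ["cat", "dog", "car"]

# Hard label: all probability mass on the correct class.
hard_label = [1.0, 0.0, 0.0]

# Soft label: a hypothetical teacher output (illustrative numbers only).
soft_label = [0.91, 0.08, 0.01]

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def rank_wrong_classes(p, true_index):
    """Return the wrong classes ordered from most to least similar to the true one."""
    wrong = [i for i in range(len(p)) if i != true_index]
    return [classes[i] for i in sorted(wrong, key=lambda i: p[i], reverse=True)]

print("hard label entropy:", entropy_bits(hard_label))
print("soft label entropy:", entropy_bits(soft_label))
print("similarity ranking from soft label:", rank_wrong_classes(soft_label, 0))
```

The hard label carries no information beyond the winning class, so "dog" and "car" are indistinguishable. The soft label orders them, and that ordering is the extra signal the student gets to learn from.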

This was the original insight behind distillation, introduced by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper "Distilling the Knowledge in a Neural Network." They called the soft-label information dark knowledge — the hidden structure in a model's probability distribution.

The mechanics, briefly

A classic distillation setup has three moving parts:

The teacher produces soft predictions over its outputs — for a language model, that's a probability distribution over the next token.
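A temperature is applied to soften those probabilities, raising the probabilities of unlikely options so that the differences among them become large enough for the student to learn from. Higher temperature = softer, more revealing distribution (up to a point: at very high temperatures the distribution approaches uniform and the signal washes out).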
The student is trained to match the teacher's softened distribution (often alongside the real labels), using a loss function — typically KL divergence — that measures how far the student's distribution is from the teacher's.
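
The three parts above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a training loop: the logits, the temperature of 2.0, and the weighting parameter alpha are assumptions chosen for the example, and the function names are our own. The T² factor on the soft term follows the scaling suggested in the 2015 paper for when soft and hard targets are combined.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats: how far q (the student) is from p (the teacher)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and an ordinary hard-label term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft term: match the teacher's softened distribution, scaled by T^2.
    soft_term = temperature ** 2 * kl_divergence(p_teacher, p_student)
    # Hard term: cross-entropy against the real label, at temperature 1.
    hard_term = -math.log(softmax(student_logits)[true_index])
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits for the classes cat, dog, car.
teacher_logits = [6.0, 3.0, -2.0]
student_logits = [1.0, 1.0, 1.0]  # an untrained student: no preference yet

print("teacher at T=1:", softmax(teacher_logits, 1.0))
print("teacher at T=4:", softmax(teacher_logits, 4.0))
print("loss:", distillation_loss(teacher_logits, student_logits, true_index=0))
```

Comparing the two printed teacher distributions shows what the temperature does: at T=1 nearly all the mass sits on "cat", while at T=4 "dog" and "car" receive visibly different, non-negligible probabilities.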

The student isn't copying the teacher's weights. It's learning to imitate the teacher's behavior — which is exactly why a completely different, much smaller architecture can still capture the teacher's skill.

Distillation for modern language models

The 2015 picture was about classifiers, such as digit and speech recognizers. Large language models added powerful new ways to distill:

Sequence-level distillation — the student learns to reproduce entire sequences the teacher generates, not just per-token distributions.
Synthetic-data distillation — the teacher generates a large, high-quality training set (questions, answers, explanations) and the student trains on it. Much of today's "distillation" is really this: learning from a teacher's outputs at scale.
Reasoning distillation — the most exciting recent development. Reasoning models produce long chains of thought before answering. Train a small student on those traces and it learns to reason, not just to answer. This is how a modest open model can suddenly tackle hard math and code problems. (We go deep on this in Reasoning distillation.)
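
As a rough sketch of the synthetic-data and reasoning variants, the snippet below builds a small training set from a teacher's outputs. Everything here is hypothetical: toy_teacher is a stub standing in for a call to a real large model, and the prompt/completion record layout is just one plausible format, not a standard.

```python
import json

def toy_teacher(prompt):
    """Hypothetical stand-in for a teacher model.

    Only understands prompts like "What is 2 + 3?". A real pipeline
    would query a large model here instead.
    """
    left, right = prompt.rstrip("?").split(" + ")
    a, b = int(left.split()[-1]), int(right)
    return {"reasoning": f"Add {a} and {b} to get {a + b}.",
            "answer": str(a + b)}

def build_distillation_set(prompts, teacher, include_reasoning=True):
    """Turn teacher outputs into prompt/completion pairs for student training."""
    records = []
    for prompt in prompts:
        out = teacher(prompt)
        if include_reasoning:
            # Reasoning distillation: the student sees the trace and the answer.
            completion = out["reasoning"] + "\nAnswer: " + out["answer"]
        else:
            # Plain synthetic-data distillation: the answer only.
            completion = out["answer"]
        records.append({"prompt": prompt, "completion": completion})
    return records

prompts = ["What is 2 + 3?", "What is 10 + 7?"]
for record in build_distillation_set(prompts, toy_teacher):
    print(json.dumps(record))  # one JSON object per line
```

The student would then be fine-tuned on these records with ordinary next-token training. The only switch between "learn the answer" and "learn the reasoning" in this sketch is whether the trace is kept in the completion.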

What distillation is not

It's easy to conflate distillation with two neighbors:

Quantization shrinks a model by storing its existing weights at lower precision (e.g. 16-bit → 4-bit). Same model, smaller footprint.
Pruning removes parts of a model judged unimportant.

Distillation is different: it trains a new, smaller model to behave like a bigger one. The three are complementary — a common recipe is distill, then quantize to get a model that's both smaller in architecture and cheaper to store. We compare all three in Distillation vs. quantization vs. pruning.
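
To see why the two neighbors are not distillation, here is a toy sketch of each acting on a handful of existing weights. The specific choices are assumptions made for illustration: symmetric round-to-nearest quantization to signed 4-bit codes, and pruning by weight magnitude, which is only one of several ways to judge a weight unimportant. Real toolkits use more sophisticated schemes.

```python
def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization to integer codes in -7..7."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate weights from the stored codes."""
    return [c * scale for c in codes]

def prune_by_magnitude(weights, keep_fraction=0.5):
    """Zero out the smallest-magnitude weights (ties at the threshold are kept)."""
    keep = max(1, int(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

# Hypothetical weights from some already-trained model.
weights = [0.70, -0.04, 0.31, -0.52, 0.02, 0.11]

codes, scale = quantize_4bit(weights)
print("4-bit codes:", codes)
print("dequantized:", dequantize(codes, scale))
print("pruned:     ", prune_by_magnitude(weights))
```

Both functions take the existing weights as input and return a modified copy of the same model. Neither involves a teacher, a student, or any training, which is the step that defines distillation.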

Why this matters now

For years distillation was a quiet compression trick. Two things changed:

Open teachers got extremely good. When a state-of-the-art open model can serve as a teacher, anyone can distill from it — no closed API required.
Reasoning became distillable. The discovery that you can transfer chains of thought meant small models could inherit capabilities that many had assumed required scale.

The result is a Cambrian explosion of small, sharp, runnable models — and a craft that's still being invented. That craft is what this site is about.

Where to go next

How distillation works — the types of distillation, in depth.
The distiller's toolkit — the frameworks people actually use.
Run models locally — put a distilled model on your own machine.
Glossary — every term, defined.