When distillation beats fine-tuning — and when it doesn't

The most common distillation mistake is treating it like a magic upgrade button: take a big model, ask it for answers, train a small model, collect tiny lightning in a bottle.

Sometimes, yes. Sometimes you just built expensive supervised fine-tuning with a fancier invoice.

A new paper posted June 22, 2026 — "Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails" by Xin Liu, Simin Ma, Shujian Liu, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Lu Wang, and Kaiqiang Song — is useful because it asks the impolite question directly: when does knowledge distillation actually beat supervised fine-tuning?

The answer is exactly the kind builders need: distillation is strongest when the teacher contributes information the student would not get from the dataset alone.

For the foundations, start with what model distillation is and how distillation works. This post is the field note: how to decide whether to distill or simply fine-tune.

What the new post-training study tested

Most classic knowledge-distillation work studies task-specific settings: a teacher model transfers behavior to a smaller student on one benchmark or one narrow task. That is valuable, but modern language models are usually built in stages. After pretraining, they go through post-training: supervised instruction tuning, preference optimization, tool-use training, safety tuning, domain adaptation, and the assorted rites by which a raw next-token machine becomes useful.

The June 2026 paper focuses on that post-training phase. The authors study knowledge distillation with the large-scale Tülu 3 instruction-tuning setup from Ai2 and compare it against ordinary supervised fine-tuning (SFT). In plain English: instead of asking "can a student copy a teacher on one toy task?" they ask "does a teacher-generated training signal help when we are trying to make a general instruction-following model?"

That is the right question for people building local models, support agents, domain copilots, and small specialists. Post-training is where the model becomes a product.

Distillation vs. fine-tuning: the useful distinction

Supervised fine-tuning trains the student on target examples: prompt in, desired answer out. Knowledge distillation trains the student to imitate a teacher's behavior, which might include generated answers, soft probability distributions, critiques, rationales, tool traces, or labels over unlabeled data.
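
To make the difference in training signal concrete, here is a minimal sketch of the two losses for a single prediction over a three-way choice. It assumes the "soft probability distribution" flavor of distillation (KL divergence between temperature-softened teacher and student distributions), which is only one of the signals listed above and not necessarily the objective used in the paper. The logits, the temperature, and the function names are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(student_logits, target_index):
    """Cross-entropy against one hard label: the dataset's answer."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )

# Made-up logits for illustration only.
student = [2.0, 1.0, 0.1]
teacher = [3.0, 2.5, -1.0]

print(f"SFT loss (hard label 0): {sft_loss(student, 0):.3f}")
print(f"KD loss (soft targets):  {kd_loss(student, teacher):.3f}")
```

The hard label says only "class 0 is right." The teacher's distribution also says that class 1 is a close second and class 2 is nearly ruled out. Whether that extra information is worth paying for is the question the rest of this post is about.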

The difference is not ceremony. It is where the information comes from.

Method	Training signal	Best when	Failure mode
Supervised fine-tuning	Human or curated target outputs	You have enough clean task data	The student learns only what the dataset shows
Knowledge distillation	Teacher outputs, scores, logits, rationales, or traces	The teacher adds missing knowledge or structure	The student imitates teacher artifacts without gaining real capability
Two-stage KD + SFT	Synthetic teacher data, then human-labeled refinement	Domain data is scarce but correctness matters	Synthetic coverage can drift unless refined on real labels

A blunt rule: fine-tuning transfers a dataset; distillation transfers a teacher-conditioned view of the dataset. That view can be richer. It can also be redundant.

When knowledge distillation helps

The paper's most practical result is that distillation beats supervised fine-tuning most clearly in low-data regimes. That tracks with field intuition. If you only have a small set of task examples, a strong teacher can expand the training signal: generate additional labeled examples, expose reasoning paths, normalize style, or show the student what the sparse dataset only implies.

This is the same reason reasoning distillation became important after DeepSeek-R1-style trace recipes: the answer alone is thin. The path to the answer often carries the transferable skill.

Distillation also pays off beyond the low-data regime when the teacher is meaningfully stronger than the student and contributes capability the student cannot easily infer from more SFT rows. The study's abstract phrases it cleanly: distilling from a stronger instruction-tuned teacher restores substantial gains even with abundant data. Translation: if the teacher knows something, copy the teacher. If the teacher is only restating the label, maybe do not build a shrine around it.

The third useful case is domain-specific scarcity. The authors propose a two-stage strategy: use synthetic teacher-labeled data first, then refine on human annotations. That is a very deployable pattern. Let the teacher flood the zone cheaply, then let scarce human labels correct the distribution and calibrate the last mile.
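
The shape of that two-stage pattern can be sketched with a toy model. This is not the paper's pipeline: it assumes a one-feature logistic classifier, a hypothetical `teacher_label` function whose decision boundary is slightly off, and a handful of invented human labels. The only point is the ordering, with plentiful teacher labels first and scarce human labels second at a gentler learning rate.

```python
import math
import random

def predict(weights, x):
    """Logistic model: probability that x belongs to class 1."""
    z = weights[0] + weights[1] * x
    return 1.0 / (1.0 + math.exp(-z))

def train(weights, examples, epochs, lr):
    """Plain SGD on log loss; returns a new weight list."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in examples:
            error = predict(w, x) - y
            w[0] -= lr * error
            w[1] -= lr * error * x
    return w

def teacher_label(x):
    """Hypothetical teacher: roughly right, but its boundary sits at 0.3."""
    return 1 if x > 0.3 else 0

rng = random.Random(0)

# Stage 1: cheap, plentiful teacher-labeled synthetic data.
unlabeled = [rng.uniform(-3, 3) for _ in range(500)]
synthetic = [(x, teacher_label(x)) for x in unlabeled]
stage1 = train([0.0, 0.0], synthetic, epochs=5, lr=0.1)

# Stage 2: scarce human labels (invented here, true boundary near 0),
# applied with a smaller learning rate to refine rather than overwrite.
human = [(-1.0, 0), (-0.2, 0), (0.1, 1), (0.2, 1), (1.5, 1)]
stage2 = train(stage1, human, epochs=20, lr=0.05)

for name, w in [("after stage 1", stage1), ("after stage 2", stage2)]:
    print(f"{name}: decision boundary near x = {-w[0] / w[1]:.2f}")
```

In a real run the student would be a language model and each stage a full training job, but the design question is the same: how much of the teacher's drift the human-labeled stage is able to correct.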

When distillation fails to earn its keep

As the amount of ordinary training data grows, the paper reports that distillation's advantage can diminish. This is the part everyone should write on the wall before spending a month on a complicated teacher pipeline.

If the student can learn the target behavior directly from enough clean examples, the teacher may add little. Worse, it may add noise: stylistic quirks, overconfident wrong answers, hidden biases, or brittle reasoning patterns. A distilled student is not morally superior to a fine-tuned student. It is just trained through another model's shadow.

That is why distillation should be evaluated against a plain SFT baseline. Not a strawman. A real baseline, same student, same budget discipline, same held-out test set. If KD wins, wonderful. Bottle it. If SFT wins, also wonderful. You found the cheaper path.
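
One way to keep that comparison honest is to write the selection rule down before looking at results. The sketch below assumes every recipe was scored on the same held-out set and that you can attach a rough relative cost to each; `pick_recipe`, the margin, and all the numbers are hypothetical placeholders, not figures from the paper.

```python
def pick_recipe(scores, costs, margin=0.01):
    """Return the cheapest recipe whose held-out score is within `margin` of the best."""
    best = max(scores.values())
    contenders = [name for name, score in scores.items() if best - score <= margin]
    return min(contenders, key=lambda name: costs[name])

# Placeholder held-out scores and relative costs, for illustration only.
scores = {"base": 0.52, "sft": 0.715, "kd": 0.718, "kd_then_sft": 0.72}
costs = {"base": 0.0, "sft": 1.0, "kd": 3.0, "kd_then_sft": 3.5}

print(pick_recipe(scores, costs))  # prints "sft"
```

With these invented numbers the distilled variants score slightly higher, but not by enough to clear the margin, so the cheaper SFT run wins. Tighten the margin, or widen the gap, and the answer flips.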

This connects directly to the trust problem described in "distilled models need receipts": a distilled model should report not only its benchmark score, but also whether it beat the obvious non-distillation alternative.

A practical decision rule for builders

Before distilling a small model, ask four questions.

1. Is the bottleneck data quantity or model capacity?

If you have too little data, a teacher can help synthesize coverage. If you have plenty of clean data and the student is large enough for the task, start with SFT. Boring is a feature when boring works.

2. Does the teacher expose hidden structure?

Distillation is strongest when the teacher gives more than final answers: rationales, tool-call trajectories, critique labels, uncertainty signals, or domain judgments. That is what made our tool-using support agent distillation interesting: the useful object was not a single answer, but a sequence of actions and escalation decisions.

3. Can you measure against SFT?

If the only comparison is "distilled model versus base model," you have not proven distillation was necessary. Compare base, SFT, KD, and — if budget allows — KD followed by human-label refinement.

4. Will the student run where it matters?

The whole point is portable capability. A distilled model that barely fits the target box is only halfway useful. Pair the training decision with the local runtime decision: quantization, memory footprint, latency, and hardware profile. The local models guide and VRAM calculator are the unglamorous half of the still.
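
For question 4, a back-of-the-envelope check is often enough to rule a student in or out before any training starts. The sketch below estimates weight memory as parameters times bits per weight, then adds a flat overhead allowance. The 20% overhead is an assumption standing in for KV cache, activations, and runtime buffers, which in practice vary with context length and inference stack. The function names are illustrative and are not taken from the VRAM calculator mentioned above.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30

def fits(num_params, bits_per_weight, budget_gib, overhead_fraction=0.2):
    """Check weights plus an assumed overhead allowance against a memory budget."""
    needed = weight_memory_gib(num_params, bits_per_weight) * (1 + overhead_fraction)
    return needed <= budget_gib

# Example: a 7-billion-parameter student on an 8 GiB device.
for bits in (16, 8, 4):
    size = weight_memory_gib(7e9, bits)
    verdict = "fits" if fits(7e9, bits, budget_gib=8) else "does not fit"
    print(f"{bits}-bit weights: {size:.1f} GiB, {verdict}")
```

If the student only fits at a quantization level you have not evaluated, measure the quantized model against the same held-out set before calling the distillation a success.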

The takeaway

Knowledge distillation is not the opposite of fine-tuning. It is a bet that a teacher can add useful signal beyond the labels you already have.

The June 2026 post-training study sharpens the bet: use distillation when data is scarce, when the teacher is genuinely stronger, or when synthetic teacher labels can bootstrap a domain before human annotations refine it. Use supervised fine-tuning when clean data is abundant and the teacher is not adding new information.

That distinction matters commercially. Distillation is how frontier capability becomes small enough to own. But ownership does not require mysticism. It requires a student, a teacher, a baseline, and the discipline to admit when the cheaper method won.

The still works best when you know what you are trying to extract.