How distillation works: the three kinds of knowledge
Response, feature, and relation-based distillation — plus self, online, and offline variants. The conceptual map of how knowledge actually moves from teacher to student.
If the primer gave you the intuition, this article gives you the map. "Distillation" is not one technique but a family of them, and knowing the family tree is what lets you read papers, choose tools, and reason about trade-offs.
The standard taxonomy comes from Gou et al.'s 2021 survey, Knowledge Distillation: A Survey. It splits distillation along two questions: what knowledge you transfer, and how teacher and student relate during training.
What you transfer: three kinds of knowledge
1. Response-based — learn from the teacher's answers
The original, simplest form. The student matches the teacher's output distribution — the soft probabilities over classes or next tokens. This is the soft-label idea from the primer, and for language models it's still the workhorse.
The mechanics, made concrete:
- The teacher's logits are passed through a softmax with a temperature T that softens the distribution: q_i = exp(z_i / T) / Σ_j exp(z_j / T).
- A higher T exaggerates the small probabilities on "wrong" answers: the dark knowledge that encodes how the teacher sees similarity.
- The student trains to match that softened distribution, usually with a KL-divergence loss, blended with ordinary cross-entropy on the real labels: L = α·L_distill + β·L_label.
- Because soft-target gradients shrink as 1/T², the distillation term is scaled by T² to keep things balanced.
Smaller students often do better with a lower temperature — they don't have the capacity to absorb a very soft distribution. Temperature is a dial, not a constant.
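The mechanics above can be sketched in a few lines of plain Python. This is a minimal illustration for a single example, not a training loop: the function names, the toy logits, and the default values T=2.0 and α=β=0.5 are assumptions chosen for readability, and a real setup would compute the same quantities on batched tensors in a deep-learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      T=2.0, alpha=0.5, beta=0.5):
    """L = alpha * L_distill + beta * L_label for one example (illustrative defaults)."""
    # Soften both distributions with the same temperature.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Scale by T^2 to offset the 1/T^2 shrinkage of soft-target gradients.
    l_distill = (T ** 2) * kl_divergence(p_teacher, p_student)
    # Ordinary cross-entropy on the true label, at temperature 1.
    l_label = -math.log(softmax(student_logits, 1.0)[label])
    return alpha * l_distill + beta * l_label

# Toy logits for a 3-class problem (made up for illustration).
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [3.0, 1.5, 0.5]

sharp = softmax(teacher_logits, 1.0)  # nearly one-hot
soft = softmax(teacher_logits, 4.0)   # more mass on the "wrong" classes
loss = distillation_loss(student_logits, teacher_logits, label=0)
```

Comparing `sharp` and `soft` shows the dial at work: raising T moves probability mass from the top class onto the runners-up, which is the signal the student is asked to match.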
2. Feature-based — learn from the teacher's intermediate thinking
Instead of only matching final outputs, the student matches the teacher's internal activations. The founding paper is FitNets (Romero et al., 2015): a middle layer of the teacher acts as a "hint" guiding a corresponding student layer, bridged by a small regressor to handle the size mismatch. A famous variant, attention transfer (Zagoruyko & Komodakis, 2017), matches the teacher's attention maps.
Feature-based distillation is more informative — you're teaching how the teacher computes, not just what it concludes — but it's fiddlier: you must choose which layers to align and bridge their differing dimensions.
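To make the "bridge the dimensions" step concrete, here is a small sketch of a hint-style loss in plain Python. It assumes a 4-dimensional teacher feature, a 2-dimensional student feature, and a linear regressor with no bias; all names and numbers are hypothetical. For brevity it updates only the regressor, whereas in a FitNets-style setup the student's own layers would be trained through the same loss.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def hint_loss(teacher_feat, student_feat, regressor):
    """Mean squared error between the teacher hint and the projected student feature."""
    projected = matvec(regressor, student_feat)
    return sum((t - p) ** 2 for t, p in zip(teacher_feat, projected)) / len(teacher_feat)

def regressor_step(teacher_feat, student_feat, regressor, lr=0.1):
    """One in-place gradient-descent step on the regressor weights only."""
    projected = matvec(regressor, student_feat)
    n = len(teacher_feat)
    for i in range(n):
        grad_out = 2.0 * (projected[i] - teacher_feat[i]) / n
        for j in range(len(student_feat)):
            regressor[i][j] -= lr * grad_out * student_feat[j]

# Toy activations for one input (made up for illustration).
teacher_feat = [1.0, -0.5, 0.25, 2.0]  # teacher hint layer, width 4
student_feat = [0.5, -1.0]             # student guided layer, width 2

# The regressor maps student width (2) up to teacher width (4).
regressor = [[0.0, 0.0] for _ in teacher_feat]

loss_before = hint_loss(teacher_feat, student_feat, regressor)
for _ in range(200):
    regressor_step(teacher_feat, student_feat, regressor)
loss_after = hint_loss(teacher_feat, student_feat, regressor)
```

The regressor exists only to make the two feature spaces comparable during training; it is discarded once the student is trained.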
3. Relation-based — learn the relationships, not the values
The most abstract kind transfers relationships rather than raw values: how two layers relate (FSP, "A Gift from Knowledge Distillation," Yim et al. 2017), or how samples relate to each other in representation space (RKD, "Relational Knowledge Distillation," Park et al. 2019, matching distances and angles between examples). The idea: the structure of the teacher's representation can be more transferable than any single activation.
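The sample-to-sample flavor can be sketched as a distance-wise loss in the spirit of RKD: compute all pairwise distances within a batch, normalize them by their mean so that scale does not matter, and penalize the gap between teacher and student with a Huber loss. The function names and toy embeddings below are assumptions for illustration; the angle-wise term is omitted.

```python
import math
from itertools import combinations

def normalized_pairwise_distances(embeddings):
    """Euclidean distances between all pairs, divided by their mean."""
    dists = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    mean = sum(dists) / len(dists)
    return [d / mean for d in dists]

def huber(x, y, delta=1.0):
    """Huber loss: quadratic for small gaps, linear for large ones."""
    diff = abs(x - y)
    if diff <= delta:
        return 0.5 * diff ** 2
    return delta * (diff - 0.5 * delta)

def rkd_distance_loss(teacher_emb, student_emb):
    """Average Huber gap between teacher and student normalized distances."""
    t = normalized_pairwise_distances(teacher_emb)
    s = normalized_pairwise_distances(student_emb)
    return sum(huber(a, b) for a, b in zip(t, s)) / len(t)

# Three examples embedded by a 3-D teacher (toy values).
teacher_emb = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]]

# A 2-D student that preserves the geometry at half the scale...
student_good = [[0.0, 0.0], [1.5, 0.0], [0.0, 2.0]]
# ...and one that places the same examples on a line.
student_bad = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]

loss_good = rkd_distance_loss(teacher_emb, student_good)
loss_bad = rkd_distance_loss(teacher_emb, student_bad)
```

Note that the student never has to match the teacher's dimensionality or scale: only the relative layout of the examples is compared, so no regressor is needed.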
How they relate: offline, online, self
A second axis describes the training relationship:
- Offline distillation — the default. The teacher is pre-trained and frozen; the student trains against its fixed outputs. Simple and stable.
- Online distillation — teacher and student (or a pool of peers) train simultaneously, teaching each other as they go (e.g. Deep Mutual Learning). Useful when no strong teacher exists yet.
- Self-distillation — the student has the same architecture as the teacher. Surprisingly, an identical-capacity student trained to imitate a converged teacher can outperform it (Born-Again Networks, Furlanello et al. 2018). Here distillation isn't compressing anything — the richer training signal alone is the benefit. Exactly why this works is still not fully settled theoretically.
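The difference between the offline and online settings shows up directly in the loss. In the offline case only the student has a loss and the teacher's outputs are constants. In a mutual-learning setup, sketched below for two peers on one example, each network gets its own loss: cross-entropy on the label plus a KL term pulling it toward the other peer's current prediction. The function names and logits are hypothetical, and the sketch computes losses only, without the parameter updates.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_learning_losses(logits_a, logits_b, label):
    """Per-peer losses: own cross-entropy plus KL from the other peer's prediction."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    ce_a = -math.log(p_a[label])
    ce_b = -math.log(p_b[label])
    # Peer A treats B's current prediction as its soft target, and vice versa.
    loss_a = ce_a + kl_divergence(p_b, p_a)
    loss_b = ce_b + kl_divergence(p_a, p_b)
    return loss_a, loss_b

# Two peers disagree on a 3-class example whose true label is 0 (toy values).
loss_a, loss_b = mutual_learning_losses([2.0, 0.5, 0.1], [0.3, 1.8, 0.2], label=0)

# When the peers already agree, the KL terms vanish and only cross-entropy remains.
same_a, same_b = mutual_learning_losses([2.0, 0.5, 0.1], [2.0, 0.5, 0.1], label=0)
```

In practice the peers would be updated in alternation or in parallel, each treating the other's prediction as fixed for the step.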
The LLM-era addition: it's often just data
Much of what's called "distillation" for large language models today is sequence-level: the teacher generates sequences and the student trains on them with ordinary fine-tuning. Kim & Rush introduced this as Sequence-Level KD (SeqKD) in 2016, and it's the direct ancestor of modern synthetic-data distillation — Alpaca, Orca, Phi, and the DeepSeek-R1 distilled models all descend from this idea.
This matters because it lowers the bar enormously: you don't need access to a teacher's logits or internals. If a strong model can generate good data, you can distill from it with nothing more than a fine-tuning loop. That's the bridge to the modern frontier — reasoning distillation and on-policy distillation.
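As a rough sketch of how little machinery this needs, the data-building half can be written as a single loop: send prompts to the teacher, keep the completions that pass a filter, and store prompt-completion pairs for ordinary fine-tuning. Everything named here is hypothetical: `teacher_generate` stands in for whatever call produces text from a strong model, `toy_teacher` is a lookup table used only so the example runs, and the non-empty filter is a placeholder for real quality checks.

```python
from typing import Callable, Dict, List

def build_distillation_dataset(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
    keep: Callable[[str, str], bool] = lambda prompt, completion: bool(completion.strip()),
) -> List[Dict[str, str]]:
    """Collect teacher completions as supervised fine-tuning examples."""
    dataset = []
    for prompt in prompts:
        completion = teacher_generate(prompt)  # only text is needed, no logits
        if keep(prompt, completion):           # drop empty or rejected outputs
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

def toy_teacher(prompt: str) -> str:
    """Stand-in for a real teacher model (hypothetical lookup table)."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "")

prompts = ["2 + 2 =", "Capital of France?", "Unknown question"]
dataset = build_distillation_dataset(prompts, toy_teacher)
```

The resulting list of pairs is what a standard fine-tuning loop would consume; the student never sees the teacher's probabilities, only its text.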
The map, in one breath
Transfer the teacher's outputs (response), internals (feature), or relationships (relation); do it with a frozen teacher (offline), a co-trained one (online), or an identical one (self); and for LLMs, increasingly, do it by having the teacher generate data the student learns from. Everything else is detail.
Next: see how these ideas separate from their cousins in Distillation vs. quantization vs. pruning.