On-policy distillation: learning from your own mistakes

Why letting the student generate its own attempts and having the teacher grade them — rather than imitating fixed teacher data — became the dominant post-training paradigm of 2025–2026.

If reasoning distillation is what the frontier transfers, on-policy distillation is how it transfers best. It's the technique a 2026 survey called "the indispensable post-training paradigm for scaling reasoning," and it's worth understanding precisely.

The problem with imitation

Standard ("off-policy") distillation trains the student on a fixed corpus the teacher generated. This is sequence-level KD, and it works, but it has a structural flaw known as exposure bias, which leads to compounding error:

  • The student only ever sees contexts the teacher would visit.
  • At inference, the student inevitably wanders into states the teacher never would — because it's not as good.
  • It was never trained on those states, so a small early error snowballs into a wrong answer.

You trained it on a perfect driver's route; the moment it drifts off that route, it has never seen the shoulder.
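A back-of-the-envelope calculation shows how quickly this compounds. The sketch below assumes the student slips off the teacher's route independently at each token with a fixed probability and never recovers; both are simplifications made for illustration, and prob_on_route is a hypothetical helper, not a measured result.

```python
def prob_on_route(per_token_slip: float, num_tokens: int) -> float:
    """Probability that all num_tokens steps stay on the teacher's route.

    Assumes each token slips independently with probability per_token_slip
    (an illustrative simplification, not a model of any real system).
    """
    return (1.0 - per_token_slip) ** num_tokens


# With a 1% slip rate per token, longer generations are far more exposed.
for length in (10, 100, 1000):
    print(f"{length:>5} tokens: {prob_on_route(0.01, length):.4f}")
```

Under these assumptions, even a small per-token slip rate leaves long reasoning traces almost certain to pass through states the fixed teacher corpus never covered.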

The fix: train on the student's own trajectories

On-policy distillation flips the data source. The student generates its own outputs, and the teacher scores them — token by token — typically with a mode-seeking reverse-KL objective. Now:

  • The student trains on its own mistakes, learning to recover from the states it actually reaches.
  • Feedback is dense (a signal on every token) rather than sparse, like the single end-of-sequence reward typical of outcome-based RL.
  • It's hard to reward-hack: a low reverse KL essentially means the student behaves like the teacher on the states it visits.
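The loop described above can be sketched in pure Python. This is a toy illustration, not any library's implementation: toy_teacher, toy_student, and the three-token vocabulary are hypothetical stand-ins for real models, and no gradient step is taken. The point is only to show where the data comes from (the student's samples) and where the signal comes from (a per-token reverse KL against the teacher).

```python
import math
import random


def toy_teacher(prefix):
    # Hypothetical teacher: prefers to alternate between tokens 0 and 1.
    if prefix and prefix[-1] == 0:
        return [0.1, 0.8, 0.1]
    return [0.8, 0.1, 0.1]


def toy_student(prefix):
    # Hypothetical weaker student: ignores the prefix entirely.
    return [0.5, 0.3, 0.2]


def reverse_kl(student_p, teacher_p):
    """KL(student || teacher) over the vocabulary at one position."""
    return sum(q * math.log(q / p) for q, p in zip(student_p, teacher_p) if q > 0)


def on_policy_token_losses(student, teacher, num_tokens, rng):
    """Roll out the student and score every position with the teacher."""
    prefix, losses = [], []
    for _ in range(num_tokens):
        q = student(prefix)          # student's next-token distribution
        p = teacher(prefix)          # teacher graded on the SAME prefix
        losses.append(reverse_kl(q, p))
        # The next token is sampled from the student, not the teacher.
        token = rng.choices(range(len(q)), weights=q)[0]
        prefix.append(token)
    return prefix, losses


tokens, losses = on_policy_token_losses(toy_student, toy_teacher, 5, random.Random(0))
print(tokens)
print([round(x, 3) for x in losses])
```

Every position yields its own loss value, which is what makes the feedback dense; in a real setup these per-token values would be averaged and backpropagated through the student.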

The unifying framework: GKD

The clean formalization is Generalized Knowledge Distillation (GKD) (Agarwal et al., Google DeepMind, 2023). GKD generalizes distillation along two dials:

  1. Data source — interpolate between fixed teacher data (off-policy) and the student's own self-generated sequences (on-policy).
  2. Divergence — forward KL (mode-covering), reverse KL (mode-seeking), or a generalized JSD in between.

This framework contains older methods as special cases: SeqKD, token-level KD, and MiniLLM (which introduced reverse-KL distillation for LLMs) all fall out of it. In practice you reach for it through Hugging Face TRL's GKDTrainer, where lmbda sets the fraction of training data drawn from the student's own generations and beta interpolates the generalized JSD between forward KL (beta = 0) and reverse KL (beta = 1).
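To make the two dials concrete, here is a minimal pure-Python sketch under stated assumptions. It uses the generalized JSD in the form JSD(beta) = beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta * P + (1 - beta) * Q, P the teacher and Q the student. Because that expression only approaches forward or reverse KL in the limit after rescaling, this sketch chooses to special-case the endpoints. The function names and the pick_data_source helper are hypothetical and are not TRL's API.

```python
import math
import random


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def generalized_jsd(teacher_p, student_q, beta):
    """Divergence dial: beta=0 gives forward KL, beta=1 gives reverse KL."""
    if beta == 0.0:
        return kl(teacher_p, student_q)   # forward KL (mode-covering)
    if beta == 1.0:
        return kl(student_q, teacher_p)   # reverse KL (mode-seeking)
    mix = [beta * p + (1 - beta) * q for p, q in zip(teacher_p, student_q)]
    return beta * kl(teacher_p, mix) + (1 - beta) * kl(student_q, mix)


def pick_data_source(lmbda, rng):
    """Data dial: with probability lmbda, train on student-generated text."""
    return "student" if rng.random() < lmbda else "fixed"


teacher = [0.7, 0.2, 0.1]
student = [0.2, 0.5, 0.3]
for beta in (0.0, 0.5, 1.0):
    print(beta, round(generalized_jsd(teacher, student, beta), 4))

rng = random.Random(0)
print([pick_data_source(0.5, rng) for _ in range(6)])
```

Setting lmbda to 0 recovers purely off-policy training on the fixed corpus, and setting it to 1 makes every batch on-policy; values in between mix the two.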

Why it won in 2025–2026

The economics are decisive. Reported results:

  • Thinking Machines Lab (Oct 2025) hit ~70–74% on AIME'24 at roughly one-tenth the cost of reinforcement learning, and 9–30× cheaper than off-policy distillation — while also recovering instruction-following lost to domain fine-tuning (a cure for catastrophic forgetting).
  • Qwen3 distilled a 32B teacher's reasoning into an 8B student at ~1/10 the GPU hours of reinforcement learning.
  • On-policy distillation is now reported across DeepSeek-V4, Qwen3, Gemma 2, Nemotron, and MiMo.

The catch: diversity collapse

Reverse-KL is mode-seeking by design, and that's a double-edged sword. Push it too hard and the student drops the high-entropy "branch points" where reasoning legitimately has multiple valid paths. One 2026 study found standard on-policy distillation retained only 6.8% of high-entropy tokens versus 18.5% in the teacher, which shows up as pass@1 improving while pass@k degrades. That's bad news if you rely on best-of-N sampling or inference-time scaling. The emerging fix is to blend mode-seeking reverse-KL with mode-covering forward-KL.
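One way to picture such a blend is a simple weighted sum of the two directions at a single branch point. The sketch below is an assumption made for illustration: the linear weighting with alpha, the blended_loss name, and the toy distributions are hypothetical and are not taken from any specific paper or library.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy in nats; high values mark a 'branch point'."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def blended_loss(teacher_p, student_q, alpha):
    """alpha * reverse KL + (1 - alpha) * forward KL (hypothetical blend)."""
    reverse = kl(student_q, teacher_p)   # mode-seeking term
    forward = kl(teacher_p, student_q)   # penalizes dropping teacher modes
    return alpha * reverse + (1 - alpha) * forward


# Teacher sees two equally valid continuations; student has collapsed to one.
teacher = [0.49, 0.49, 0.02]
collapsed = [0.96, 0.02, 0.02]

print("entropy (teacher, student):", entropy(teacher), entropy(collapsed))
print("reverse KL only:", blended_loss(teacher, collapsed, 1.0))
print("50/50 blend:    ", blended_loss(teacher, collapsed, 0.5))
```

In this toy case the forward term is larger than the reverse term for the collapsed student, so the blend penalizes dropping a valid branch more heavily than reverse KL alone does.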

The takeaway

Off-policy distillation teaches the student the teacher's answers. On-policy distillation teaches the student to be the teacher in the situations the student actually gets itself into. That difference — training on your own mistakes rather than someone else's successes — is why it became the default. Just keep an eye on diversity while you do it.