The knowledge base

Learn model distillation

A living curriculum on the craft of shrinking intelligence. Read it top to bottom, or drop into any level. We update it as the field moves.

Primer

Start here — no background assumed.

01 · 4 min

What is model distillation?

A plain-language primer on knowledge distillation — how a small student model learns to think like a giant teacher, and why it's the key to running AI on your own hardware.

Foundations

The core theory and vocabulary.

02 · 4 min

How distillation works: the three kinds of knowledge

Response, feature, and relation-based distillation — plus self, online, and offline variants. The conceptual map of how knowledge actually moves from teacher to student.

03 · 3 min

Distillation vs. quantization vs. pruning

Three different ways to shrink a model: knowledge transfer, precision reduction, and parameter removal. What each one actually changes, and how to stack them into one local-ready pipeline.

50 · 3 min

The distillation glossary

Every term a newcomer to model distillation needs — soft labels, dark knowledge, reverse KL, GGUF, on-policy distillation, the capacity gap, and more — each in one sentence.

Practitioner

Hands-on: tools, recipes, workflows.

04 · 3 min

The distiller's toolkit

The frameworks people actually use to distill models in 2026 — from Hugging Face TRL and Arcee DistillKit to synthetic-data pipelines and managed cloud services.

05 · 3 min

Run distilled models locally

Put a distilled model on your own machine — Ollama, llama.cpp, MLX, and LM Studio, plus how to read GGUF quant names and pick the right one for your hardware.

Frontier

The research edge and open problems.

06 · 3 min

Reasoning distillation: teaching small models to think

How chain-of-thought traces turned distillation from a compression trick into a way to transfer reasoning itself — the DeepSeek-R1 recipe and why it changed the field.

07 · 3 min

On-policy distillation: learning from your own mistakes

Why letting the student generate its own attempts and having the teacher grade them, rather than having the student imitate fixed teacher data, became the dominant post-training paradigm of 2025–2026.