The distiller's toolkit
The frameworks people actually use to distill models in 2026 — from Hugging Face TRL and Arcee DistillKit to synthetic-data pipelines and managed cloud services.
There's a gap between understanding distillation and actually running one. This is the field guide to the tools that close it. They cluster into five groups: logit/hidden-state trainers, synthetic-data generation, on-policy trainers, prune-then-distill, and managed cloud services.
1. Logit & hidden-state trainers
These do "classical" distillation — matching the teacher's outputs or internals.
- Hugging Face TRL —
GKDTrainer— the mainstream open-source path. It implements Generalized Knowledge Distillation, wraps the familiarSFTTrainer, and takes ateacher_modelargument.lmbdacontrols how much on-policy (student-generated) data to use;betaselects the divergence. If you want one tool to start with, start here. - Arcee AI — DistillKit — an open-source toolkit offering two methods: logit-based (KL on soft targets) and hidden-states-based (aligning intermediate representations, which enables cross-architecture distillation). Arcee used offline top-K logit distillation to build SuperNova-70B from Llama-3.1-405B — and published the logits dataset.
- torchtune — PyTorch/Meta's native fine-tuning library with built-in KD recipes (single-device and distributed via FSDP). Its flagship example distills Llama-3.1-8B into Llama-3.2-1B.
- unsloth — memory- and speed-optimized fine-tuning (up to ~80% less VRAM). It's the popular choice for the DeepSeek-R1 recipe: plain SFT on a teacher's reasoning traces.
A caveat worth knowing: axolotl, despite its popularity, treats distillation as not a first-class feature. People pair its SFT with teacher-generated data rather than doing true logit distillation. Don't reach for it expecting a
teacher_modelknob.
2. Synthetic-data generation
Much modern distillation is really "have the teacher write the training set, then fine-tune." The dedicated tool here is:
- distilabel (Argilla / Hugging Face) — a framework for building synthetic-data pipelines: chains of Steps and Tasks where a teacher LLM generates and annotates data — instructions, preference pairs for DPO, reasoning traces. It's not a trainer; it's the data factory that feeds one.
3. On-policy trainers
For the on-policy frontier, TRL's GKDTrainer (above) is again the primary open path — set lmbda toward the student's own generations. This is where most 2025–2026 reasoning work happens.
4. Prune-then-distill
- NVIDIA NeMo + TensorRT Model Optimizer (ModelOpt) + Minitron — the production prune-then-distill stack. ModelOpt prunes (linear layers, heads, MLP, depth) and then distills against the unpruned teacher to recover accuracy. Minitron is the resulting model family (Llama-3.1-Minitron-4B, Nemotron Nano v2 9B). The recipe reaches a strong small model with up to 40× fewer training tokens than from scratch.
5. Managed cloud distillation-as-a-service
If you'd rather not run training yourself:
- OpenAI Model Distillation (in the API) — turn on Stored Completions to auto-capture a large model's inputs/outputs, fine-tune a smaller model on them, and grade the result with Evals, all in-platform.
- Azure OpenAI mirrors this in Azure AI Foundry.
- Google Vertex AI offers a "Distilling step-by-step" tuning option.
- AWS Bedrock Model Distillation provides managed teacher→student fine-tuning.
Read the terms first. OpenAI, Anthropic, Mistral, and xAI all include clauses restricting the use of their model outputs to train competing models. Distilling within a provider's own service is sanctioned; distilling one vendor's model to build a rival is contractually murky. See Is distilling from GPT-4 legal? for the full picture.
A sensible starting path
- Generate or gather teacher data with distilabel (or just collect traces).
- Run your first distillation with TRL's
GKDTrainer— start off-policy, then turn uplmbda. - Once you have a good small student, quantize it to a GGUF and run it locally.
That's a complete loop, all open-source, all runnable on modest hardware.