The distiller's toolkit

There's a gap between understanding distillation and actually running one. This is the field guide to the tools that close it. They cluster into five groups: logit/hidden-state trainers, synthetic-data generation, on-policy trainers, prune-then-distill, and managed cloud services.

1. Logit & hidden-state trainers

These do "classical" distillation — matching the teacher's outputs or internals.

Hugging Face TRL — GKDTrainer — the mainstream open-source path. It implements Generalized Knowledge Distillation, wraps the familiar SFTTrainer, and takes a teacher_model argument. lmbda controls how much on-policy (student-generated) data to use; beta selects the divergence. If you want one tool to start with, start here.
Arcee AI — DistillKit — an open-source toolkit offering two methods: logit-based (KL on soft targets) and hidden-states-based (aligning intermediate representations, which enables cross-architecture distillation). Arcee used offline top-K logit distillation to build SuperNova-70B from Llama-3.1-405B — and published the logits dataset.
torchtune — PyTorch/Meta's native fine-tuning library with built-in KD recipes (single-device and distributed via FSDP). Its flagship example distills Llama-3.1-8B into Llama-3.2-1B.
unsloth — memory- and speed-optimized fine-tuning (up to ~80% less VRAM). It's the popular choice for the DeepSeek-R1 recipe: plain SFT on a teacher's reasoning traces.

A caveat worth knowing: axolotl, despite its popularity, treats distillation as not a first-class feature. People pair its SFT with teacher-generated data rather than doing true logit distillation. Don't reach for it expecting a teacher_model knob.

2. Synthetic-data generation

Much modern distillation is really "have the teacher write the training set, then fine-tune." The dedicated tool here is:

distilabel (Argilla / Hugging Face) — a framework for building synthetic-data pipelines: chains of Steps and Tasks where a teacher LLM generates and annotates data — instructions, preference pairs for DPO, reasoning traces. It's not a trainer; it's the data factory that feeds one.

3. On-policy trainers

For the on-policy frontier, TRL's GKDTrainer (above) is again the primary open path — set lmbda toward the student's own generations. This is where most 2025–2026 reasoning work happens.

4. Prune-then-distill

NVIDIA NeMo + TensorRT Model Optimizer (ModelOpt) + Minitron — the production prune-then-distill stack. ModelOpt prunes (linear layers, heads, MLP, depth) and then distills against the unpruned teacher to recover accuracy. Minitron is the resulting model family (Llama-3.1-Minitron-4B, Nemotron Nano v2 9B). The recipe reaches a strong small model with up to 40× fewer training tokens than from scratch.

5. Managed cloud distillation-as-a-service

If you'd rather not run training yourself:

OpenAI Model Distillation (in the API) — turn on Stored Completions to auto-capture a large model's inputs/outputs, fine-tune a smaller model on them, and grade the result with Evals, all in-platform.
Azure OpenAI mirrors this in Azure AI Foundry.
Google Vertex AI offers a "Distilling step-by-step" tuning option.
AWS Bedrock Model Distillation provides managed teacher→student fine-tuning.

Read the terms first. OpenAI, Anthropic, Mistral, and xAI all include clauses restricting the use of their model outputs to train competing models. Distilling within a provider's own service is sanctioned; distilling one vendor's model to build a rival is contractually murky. See Is distilling from GPT-4 legal? for the full picture.

A sensible starting path

Generate or gather teacher data with distilabel (or just collect traces).
Run your first distillation with TRL's GKDTrainer — start off-policy, then turn up lmbda.
Once you have a good small student, quantize it to a GGUF and run it locally.

That's a complete loop, all open-source, all runnable on modest hardware.