Run distilled models locally

The whole point of distillation is a model you can run yourself. This is the practical guide to getting a small, distilled model onto your own hardware — and understanding the choices along the way.

Why local

The four reasons to run locally all reward the small models distillation produces: latency (no network round-trip), privacy (your data never leaves the device), cost (inference on hardware you already own), and availability (it works offline). A good distilled 4–8B model is now genuinely useful for real work — that's new, and it's largely thanks to better data plus distillation.

The runtimes

Ollama — the easiest on-ramp. A background daemon with an OpenAI-compatible API; ollama run deepseek-r1:8b and you're talking to a distilled reasoning model. It wraps llama.cpp, and on Apple Silicon (since v0.19) uses MLX under the hood.
llama.cpp — the C++ engine most things are built on. The de-facto GGUF quant format, an OpenAI-compatible llama-server, and CPU+GPU hybrid offload. Reach for it when you want control.
MLX (Apple) — built for Apple Silicon's unified memory; meaningfully faster than llama.cpp for sub-14B models on a Mac, and the only realistic on-device fine-tuning path of the two.
LM Studio — a friendly GUI for non-coders; supports both GGUF and MLX.
vLLM — not for single-user local use, but if you ever serve a distilled model to many users, its throughput (PagedAttention + continuous batching) dwarfs the single-user runtimes.

Reading a quant name

You'll download GGUF files with cryptic suffixes. Here's the decoder:

Label	Meaning
`Q8_0`	8-bit — near-lossless, largest
`Q6_K`	6-bit k-quant — excellent quality
`Q5_K_M`	5-bit medium — high quality, smaller
`Q4_K_M`	~4.5-bit medium — the sweet spot
`Q3_K_M`	3-bit — noticeably degraded
`IQ2_*`	~2-bit importance-matrix — last resort

The pattern: Q{bits} + _K (k-quant, smarter per-block scaling) + _S/_M/_L (small/medium/large mixed-precision tier). For most people on most models, Q4_K_M is the right default — roughly a 75% size cut for a small quality hit.

On the GPU side you'll also meet AWQ and GPTQ (calibration-based, GPU-oriented) and FP8 / NVFP4 (float formats that keep quality high on newer hardware).

Picking by hardware (rough guide)

8 GB RAM/VRAM — a 3–4B model at Q4_K_M (e.g. a distilled Llama 3.2 3B, Gemma 3 4B, Qwen3 4B).
16 GB — a 7–8B at Q4_K_M comfortably (DeepSeek-R1-Distill-Qwen-7B, Llama-3.1-8B distillates), or a 14B tight.
24 GB+ — 14B at good quants, or a 32B at Q4_K_M.
Apple Silicon — unified memory is your friend; use MLX builds and lean on the larger end of what your RAM allows.

A two-minute first run

# Install Ollama, then pull a distilled reasoning model:
ollama run deepseek-r1:8b
 
# Or grab a specific GGUF quant for llama.cpp:
#   download a *.Q4_K_M.gguf, then:
llama-server -m model.Q4_K_M.gguf -c 8192

That's a distilled, frontier-derived reasoning model running entirely on your machine — no API, no tokens metered, no data leaving the building. Which is the whole idea.

Where distilled models live

Most distilled models are published on Hugging Face, often with community GGUF conversions ready to download. Ollama's model library mirrors many of them with one-line pulls. When you build your own with the toolkit, the last step — distill, then quantize to GGUF — is what makes it land here, on your own hardware.