Run distilled models locally
Put a distilled model on your own machine — Ollama, llama.cpp, MLX, and LM Studio, plus how to read GGUF quant names and pick the right one for your hardware.
To run a distilled model locally, pick a runtime (Ollama is easiest, llama.cpp for control, MLX on Apple Silicon) and a quantized GGUF that fits your RAM — Q4_K_M is the quality-vs-size sweet spot. Rough guide: a 3–4B model fits ~8GB, a 7–8B fits ~16GB, and a 32B fits ~24GB. Then `ollama run deepseek-r1:8b` gives you a frontier-derived reasoning model running entirely offline.
The whole point of distillation is a model you can run yourself. This is the practical guide to getting a small, distilled model onto your own hardware — and understanding the choices along the way.
Why local
The four reasons to run locally all reward the small models distillation produces: latency (no network round-trip), privacy (your data never leaves the device), cost (inference on hardware you already own), and availability (it works offline). A good distilled 4–8B model is now genuinely useful for real work — that's new, and it's largely thanks to better data plus distillation.
The runtimes
- Ollama — the easiest on-ramp. A background daemon with an OpenAI-compatible API;
ollama run deepseek-r1:8band you're talking to a distilled reasoning model. It wraps llama.cpp, and on Apple Silicon (since v0.19) uses MLX under the hood. - llama.cpp — the C++ engine most things are built on. The de-facto GGUF quant format, an OpenAI-compatible
llama-server, and CPU+GPU hybrid offload. Reach for it when you want control. - MLX (Apple) — built for Apple Silicon's unified memory; meaningfully faster than llama.cpp for sub-14B models on a Mac, and the only realistic on-device fine-tuning path of the two.
- LM Studio — a friendly GUI for non-coders; supports both GGUF and MLX.
- vLLM — not for single-user local use, but if you ever serve a distilled model to many users, its throughput (PagedAttention + continuous batching) dwarfs the single-user runtimes.
Reading a quant name
You'll download GGUF files with cryptic suffixes. Here's the decoder:
| Label | Meaning |
|---|---|
Q8_0 | 8-bit — near-lossless, largest |
Q6_K | 6-bit k-quant — excellent quality |
Q5_K_M | 5-bit medium — high quality, smaller |
Q4_K_M | ~4.5-bit medium — the sweet spot |
Q3_K_M | 3-bit — noticeably degraded |
IQ2_* | ~2-bit importance-matrix — last resort |
The pattern: Q{bits} + _K (k-quant, smarter per-block scaling) + _S/_M/_L (small/medium/large mixed-precision tier). For most people on most models, Q4_K_M is the right default — roughly a 75% size cut for a small quality hit.
On the GPU side you'll also meet AWQ and GPTQ (calibration-based, GPU-oriented) and FP8 / NVFP4 (float formats that keep quality high on newer hardware).
Picking by hardware (rough guide)
- 8 GB RAM/VRAM — a 3–4B model at
Q4_K_M(e.g. a distilled Llama 3.2 3B, Gemma 3 4B, Qwen3 4B). - 16 GB — a 7–8B at
Q4_K_Mcomfortably (DeepSeek-R1-Distill-Qwen-7B, Llama-3.1-8B distillates), or a 14B tight. - 24 GB+ — 14B at good quants, or a 32B at
Q4_K_M. - Apple Silicon — unified memory is your friend; use MLX builds and lean on the larger end of what your RAM allows.
A two-minute first run
# Install Ollama, then pull a distilled reasoning model:
ollama run deepseek-r1:8b
# Or grab a specific GGUF quant for llama.cpp:
# download a *.Q4_K_M.gguf, then:
llama-server -m model.Q4_K_M.gguf -c 8192That's a distilled, frontier-derived reasoning model running entirely on your machine — no API, no tokens metered, no data leaving the building. Which is the whole idea.
Where distilled models live
Most distilled models are published on Hugging Face, often with community GGUF conversions ready to download. Ollama's model library mirrors many of them with one-line pulls. When you build your own with the toolkit, the last step — distill, then quantize to GGUF — is what makes it land here, on your own hardware.