Distilling a customer-support agent onto a single consumer GPU

Most distillation demos prove a small model can answer a question. That is the easy part. A production support agent has to do something harder: read a ticket, ground its answer in one specific company's policies, and (the part everyone underestimates) know when to stop and hand the conversation to a human.

So we ran the experiment. We distilled a frontier teacher into a 4-billion-parameter student, trained it to be a support agent for ten different fictional companies, and then tested it on two companies it had never seen. This is a write-up of what worked, what the real bottleneck turned out to be, and the number we think actually predicts whether a distilled support agent is safe to deploy.

A support agent is a skill, not a knowledge base

The first design decision is the one that makes everything else work: distill the behavior, not the facts.

A support agent that "knows" a business is really two separate things. There is the knowledge (the company's pricing, refund windows, and policies), which changes constantly and differs per company. And there is the skill: reading a ticket plus some retrieved context and deciding how to respond. (For background on what distillation transfers, see what is model distillation.)

You do not want the knowledge baked into the weights. You want it supplied at runtime through retrieval (RAG), so one model can serve any business by swapping the knowledge base. What the student should actually learn is the transferable skill. We framed that skill as a single structured decision per turn:

{
  "action": "reply | clarify | escalate",
  "reason": "<rationale; for escalate, a policy code>",
  "message": "<what to send the customer>"
}

Use reply when the provided context answers it and it is in self-serve scope. Use clarify when one missing detail blocks a correct answer. Use escalate, with one of seven reason codes (billing exception, security-sensitive, legal/compliance, not-covered-by-KB, churn risk, likely bug, enterprise sales), when a human must take over. The escalation decision is the whole game.

What we built

Teacher: a strong open-weight frontier model (DeepSeek V4 Pro), used to generate synthetic training data. We deliberately chose an open-licensed teacher so the result is reproducible and publishable; the legal reasons for that are their own topic, covered in is distilling legal?.
Student: Qwen/Qwen3-4B, trained with QLoRA via Unsloth. Offline trace-supervised fine-tuning, meaning we train only the small student on the teacher's pre-generated responses, which keeps the whole job on one card.
Hardware: a single NVIDIA RTX 5060 Ti (16GB), a consumer GPU. (If you want to know what fits in 16GB, we built a VRAM calculator; QLoRA fine-tuning of a 4-14B model fits comfortably.)
Data: the teacher wrote a full knowledge base and an escalation policy for each of ten invented companies spanning very different domains (SaaS, e-commerce, fintech, food delivery, an ISP, telehealth, a game, B2B logistics, travel, and music streaming), then generated thousands of realistic tickets grounded in each. Crucially, each company's policy lives in the retrieved context, not in the system prompt, so the model is forced to learn "apply the policy I am given" rather than memorizing one company's rules.

The entire run (generate data, train, evaluate) cost about $6 of teacher API and a few hours on the one GPU.

The experiment: does training on diversity make it generalize?

Here is the question we actually cared about. If you train a support agent on one company, does it learn the general skill, or does it just memorize that company? We compared three models on two held-out companies neither student ever trained on (a travel platform and a streaming service):

Base: the un-distilled Qwen3-4B, as a floor.
Single-company student: distilled on just one company's tickets.
Firebolt (multi-company student): distilled across all ten companies, with policy-in-context.

We scored each on the same retrieved context, measuring action accuracy (did it pick the right reply/clarify/escalate?), over-escalation rate (how often it escalated a ticket it should have handled), and valid-JSON rate (did it even produce a usable structured decision?).

Results

Numbers below are our own measurements on the two held-out companies, the "in the wild" test.

Metric	Base 4B	Single-company	Firebolt (multi-company)
Action accuracy, travel co.	47%	74%	86%
Action accuracy, streaming co.	60%	88%	90%
Over-escalation rate, travel co.	11%	36%	0.0%
Over-escalation rate, streaming co.	3%	9%	1.6%
Valid-JSON rate	59-66%	100%	~100%

Two things jump out. The base model is unusable: it fails to produce a valid structured decision a third to a half of the time. It "knows" the answer but cannot follow the contract. And Firebolt wins action accuracy on both unseen companies. But the headline is the over-escalation row.

What is over-escalation, and why does it matter?

Over-escalation is the rate at which the model kicks a ticket to a human that it should have resolved itself. It is the quiet killer of support automation, and it is invisible if you only look at one number.

Look at the single-company model on the travel company: its raw escalation recall was excellent. It caught nearly every ticket that genuinely needed a human. But it did that by escalating 36% of everything, including refund questions and how-tos it was perfectly capable of answering. Faced with a company it didn't recognize, it stopped trusting the policy in front of it and just punted. A support agent that escalates a third of its tickets is not automation; it is an expensive routing layer that annoys customers and doesn't lighten the queue.

Firebolt, trained across ten companies, escalated 0.0% of the resolvable travel-company tickets while still catching ~90% of the ones that truly needed a human. It learned the skill (apply the provided policy) instead of memorizing one company's reflexes.

This is the same failure mode the distillation literature warns about: a student learns the path through the teacher's maze, not the whole city. Train on one company and the model overfits to that company's shape. Train on ten and it generalizes. (We argued that distilled models need this kind of distribution-robustness testing in distilled models need receipts; this is what that looks like in practice.)

Does a bigger, smarter teacher make a better support agent?

Not the way you'd expect. The base Qwen3-4B already had the world knowledge to understand these tickets. Its problem was never intelligence, it was discipline: producing valid structured output and respecting an escalation boundary it was told about. Distillation fixed the discipline. And the thing that fixed generalization was not a stronger teacher or a bigger student. It was diversity in the training data. A frontier teacher gives you clean labels; training across many companies is what teaches the student to trust the policy in its context window instead of guessing from memory.

That is an encouraging result for anyone working with small models: you do not need the biggest model to get a deployable specialist. You need the right behavior, calibrated, and trained on enough variety to survive contact with a new customer. This is the whole humanist promise of distillation: frontier-grade capability that runs locally, on hardware you own, for the cost of a sandwich.

The honest caveats

Accuracy is this site's whole brand, so here is what these numbers do not prove:

The evaluation is synthetic. Both the training companies and the held-out test companies were generated by the teacher, and the eval labels are the teacher's. This measures whether the skill transfers across our distribution; it is not yet a human-rated, real-world test. That is the honest next milestone.
The single-company baseline is a directional comparison, not a clean ablation. It differed from Firebolt in more than data diversity (it used a different teacher and a company-specific prompt), so the over-escalation gap conflates a few variables. The direction is robust and matches the theory, but the clean version (same teacher, varying only the number of companies) is future work.
Real-world quality is bounded by retrieval. The model is only as grounded as the snippets RAG feeds it. A weak retriever produces a weak agent no matter how good the weights.

None of these undercut the core finding. They are exactly the receipts a distilled model should ship with.

The punchline

You can put a usable customer-support specialist into a 4B model on a single consumer GPU for a few dollars. The hard part isn't making it smart; small models are already smart enough for this. The hard part is making it trustworthy: emit a clean decision every time, and escalate the right things, not everything.

The number that tells you whether you've succeeded is not accuracy. It is over-escalation, and the way you drive it to near-zero is to train on the messy variety of the real world, not one tidy example of it.

Distillation makes frontier capability portable. Diversity is what makes it generalize. And calibration is what makes it safe to deploy.