Distilling a tool-using support agent: the hard part is knowing when to escalate

In an earlier experiment we distilled a single-turn support agent: read one ticket, pick one action. That proved a small model could learn the behavior. But real support is not single-turn. A real agent holds a conversation, looks things up, opens tickets, and decides, turn by turn, whether it can keep going or needs a human.

So we built that version, and it taught us something we did not expect: making the agent more capable made it less safe in a specific way. This is the write-up of what broke, why, and how we fixed it.

What we built

The goal was a support agent that works like an agent, not a classifier. At each step it thinks, then takes exactly one action, emitted as a single JSON object:

{"thought": "...", "action": "search_kb | reply | clarify | escalate | create_ticket | update_ticket",
 "args": {"...": "..."}, "message": "text for the customer, or empty"}

search_kb looks up the company's knowledge base (retrieval as a tool the model decides to call). create_ticket and update_ticket log work. reply and clarify talk to the customer. escalate hands off to a human with a reason code. After a tool call the agent receives the result and continues, so a single customer message can trigger several internal steps before a reply comes back.
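To make the step format and the tool loop concrete, here is a minimal sketch of how one customer message could fan out into several internal steps. It is an illustration under assumptions, not our actual harness: parse_action, run_turn, TOOL_ACTIONS, the max_steps cap, the stub knowledge-base tool, and the scripted model outputs are all hypothetical names and toy data.

```python
import json

ACTIONS = {"search_kb", "reply", "clarify", "escalate", "create_ticket", "update_ticket"}
# Assumption: these three actions return a result to the agent instead of ending the turn.
TOOL_ACTIONS = {"search_kb", "create_ticket", "update_ticket"}


def parse_action(raw: str) -> dict:
    """Parse one agent step and check it against the schema described above."""
    step = json.loads(raw)
    if not isinstance(step, dict):
        raise ValueError("step must be a JSON object")
    missing = {"thought", "action", "args", "message"} - step.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if step["action"] not in ACTIONS:
        raise ValueError(f"unknown action: {step['action']!r}")
    if not isinstance(step["args"], dict):
        raise ValueError("args must be an object")
    return step


def run_turn(model, tools, history, max_steps=6):
    """Step the agent until it emits a customer-facing action or hits max_steps."""
    for _ in range(max_steps):
        step = parse_action(model(history))
        history.append({"role": "agent", "content": step})
        if step["action"] not in TOOL_ACTIONS:
            return step  # reply, clarify, or escalate ends the internal loop
        result = tools[step["action"]](**step["args"])
        history.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not produce a customer-facing action")


# Toy run: a scripted "model" that searches once, then replies (illustrative text only).
scripted = iter([
    '{"thought": "check policy", "action": "search_kb", "args": {"query": "refund window"}, "message": ""}',
    '{"thought": "answer", "action": "reply", "args": {}, "message": "Here is our refund policy."}',
])
tools = {"search_kb": lambda query: f"stub result for {query}"}
history = []
final = run_turn(lambda h: next(scripted), tools, history)
```

In this toy run the history ends up with three entries (agent search, tool result, agent reply) for a single customer message, which is the fan-out the paragraph above describes.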

The build:

Teacher: GLM-5.2, an open-weight frontier model released by Z.ai under an MIT license. MIT matters: it means we can train on its outputs and publish the result cleanly, which a closed model's terms do not allow (see "is distilling legal?").
Student: Qwen3.5-9B, trained with QLoRA. It runs, and trains, on a single NVIDIA RTX 5060 Ti (16GB).
Method: offline trace distillation. The teacher generated full multi-turn conversations grounded in ten fictional companies, we filtered them with rule-based verification, and we computed the training loss only on the agent's own turns, not on customer messages or tool results (the train_on_responses_only trick). The whole project cost about $16 in teacher API calls.
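The idea behind training only on the agent's turns can be sketched in a few lines. This is a simplified illustration, not the library routine we used: build_labels and toy_tokenize are hypothetical names, the tokenizer is a stand-in, and the role names are assumptions. The one real convention it relies on is that PyTorch's cross-entropy loss ignores label -100 by default.

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips this label value by default


def build_labels(turns, tokenize):
    """Return (input_ids, labels) where only agent-turn tokens keep a label."""
    input_ids, labels = [], []
    for turn in turns:
        ids = tokenize(turn["content"])
        input_ids.extend(ids)
        if turn["role"] == "agent":
            labels.extend(ids)  # the student is trained to produce these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # context only, no loss
    return input_ids, labels


def toy_tokenize(text):
    # Stand-in tokenizer: one fake id (the word length) per whitespace-separated word.
    return [len(word) for word in text.split()]


turns = [
    {"role": "customer", "content": "my booking vanished"},
    {"role": "agent", "content": "searching the knowledge base"},
    {"role": "tool", "content": "article 12 found"},
    {"role": "agent", "content": "here is the fix"},
]
input_ids, labels = build_labels(turns, toy_tokenize)
supervised = sum(label != IGNORE_INDEX for label in labels)
```

The model still reads the customer and tool text as context; it just receives no gradient for predicting it. A real pipeline would apply the chat template and let the trainer handle the next-token shift.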

For background on what distillation transfers and how to run the result, see "what is model distillation" and "running distilled models locally".

The experiment

As before, the real question was generalization: we tested on two companies the model never trained on (a travel platform and a streaming service), comparing three models at every agent turn:

base: the un-distilled Qwen3.5-9B.
v2: distilled, multi-turn and tool-using.
v2.1: the same, after one targeted fix (below).

We measured action accuracy, escalation recall (of the turns that should escalate, how many did), and over-escalation rate (of the turns that should not escalate, how many wrongly did).
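The three metrics follow directly from their definitions. The sketch below shows one way to compute them from per-turn gold and predicted actions; the function name escalation_metrics and the six-turn example are made up for illustration and are not our evaluation data.

```python
def escalation_metrics(gold, pred):
    """gold and pred are equal-length lists of action names, one per agent turn."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    def rate(hits, total):
        return hits / total if total else float("nan")

    pairs = list(zip(gold, pred))
    should = [p for g, p in pairs if g == "escalate"]        # turns that should escalate
    should_not = [p for g, p in pairs if g != "escalate"]    # turns that should not
    return {
        "action_accuracy": rate(sum(g == p for g, p in pairs), len(pairs)),
        "escalate_recall": rate(sum(p == "escalate" for p in should), len(should)),
        "over_escalation": rate(sum(p == "escalate" for p in should_not), len(should_not)),
    }


# Toy example: one missed escalation and one unnecessary escalation.
gold = ["search_kb", "reply", "escalate", "escalate", "clarify", "reply"]
pred = ["search_kb", "reply", "escalate", "reply", "escalate", "reply"]
metrics = escalation_metrics(gold, pred)
```

Note that recall and over-escalation use different denominators, so a model can score well on one while failing the other.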

What broke: capability traded against calibration

The first distilled model (v2) was, on paper, good. It produced valid structured output every time, beat the base model on action accuracy, and almost never over-escalated. But look at the escalation recall:

held-out company	metric	base	v2
travel	escalate recall	50%	37.5%
streaming	escalate recall	82.4%	41.2%

The distilled agent escalated less than the raw base model, and missed most of the cases that genuinely needed a human. That is the dangerous failure mode. Over-escalation annoys customers; under-escalation means a security issue or an out-of-policy refund quietly gets handled wrong by a bot.

The cause was structural, and it is worth remembering if you build one of these. When we gave the agent tools and multi-turn memory, a conversation became many turns of search_kb and reply and only occasionally one escalate. So in the training data, escalation was a small minority of all agent turns. The model dutifully learned that distribution: when in doubt, resolve. We made it more capable, and in doing so we taught it to over-trust itself. Adding capability did not improve calibration; it shifted the failure mode.
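A quick way to see this dilution is to compare how often escalate appears per conversation with how often it appears per agent turn. The sketch below uses four invented conversations, not our training set, and action_shares is a hypothetical helper name.

```python
from collections import Counter


def action_shares(conversations):
    """conversations: one list of action names per conversation."""
    turn_counts = Counter(action for conv in conversations for action in conv)
    total_turns = sum(turn_counts.values())
    per_turn = {action: count / total_turns for action, count in turn_counts.items()}
    conv_share = sum("escalate" in conv for conv in conversations) / len(conversations)
    return per_turn, conv_share


# Toy data: tool use lengthens every conversation, but escalate occurs at most once.
conversations = [
    ["search_kb", "reply"],
    ["search_kb", "reply", "search_kb", "reply"],
    ["clarify", "search_kb", "escalate"],
    ["search_kb", "create_ticket", "reply"],
]
per_turn, conv_share = action_shares(conversations)
```

In this toy set one conversation in four ends in an escalation, yet escalate is only one of twelve agent turns. A loss averaged over turns sees the smaller number.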

The fix: rebalance toward the rare action

The fix did not require a better teacher or a bigger model. It required changing what the student saw. For v2.1 we oversampled the conversations that contained an escalation, so the rare-but-critical action carried more weight in training. One lever, and it worked:

held-out company	metric	base	v2	v2.1
travel	action accuracy	64.8%	75.4%	73.8%
travel	escalate recall	50%	37.5%	87.5%
travel	over-escalation	16.7%	3.5%	7.0%
streaming	action accuracy	70.1%	71.7%	78.7%
streaming	escalate recall	82.4%	41.2%	70.6%
streaming	over-escalation	20.9%	4.5%	4.5%

v2.1 is the model you would ship. Escalation recall recovered to 87.5% on travel and 70.6% on streaming (still below the base model's 82.4% there, but far above v2's 41.2%), over-escalation stayed far below the base model, and action accuracy beat the base on both unseen companies. That is the right operating point for a support agent: resolve confidently, escalate the genuinely hard cases, and rarely cry wolf.
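The rebalancing step for v2.1 amounts to repeating escalation conversations in the training set. Here is a minimal sketch of that idea; oversample_escalations is a hypothetical name, and the factor of 3, the fixed seed, and the toy conversations are illustrative assumptions, not the values we used.

```python
import random


def oversample_escalations(conversations, factor=3, seed=0):
    """Include each conversation with an escalate turn `factor` times; others once."""
    out = []
    for conv in conversations:
        has_escalation = any(turn["action"] == "escalate" for turn in conv)
        out.extend([conv] * (factor if has_escalation else 1))
    random.Random(seed).shuffle(out)  # avoid long runs of identical examples
    return out


routine = [{"action": "search_kb"}, {"action": "reply"}]
handoff = [{"action": "search_kb"}, {"action": "escalate"}]
train = oversample_escalations([routine, handoff], factor=3)
n_handoff = sum(conv is handoff for conv in train)
```

Oversampling whole conversations, rather than isolated escalate turns, keeps the context that led up to the handoff. The factor is a knob: too low and recall stays depressed, too high and over-escalation creeps back, so it is worth tuning against both metrics on held-out data.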

Q&A: Is exact-match the right way to score an agent?

No, and this surprised us. Our first quality number was exact-match: did the model pick the same action as the teacher's gold trajectory at each turn? That came out around 74%, which sounds mediocre. But in a multi-turn agent, several actions are often equally valid. Searching the knowledge base before answering, or answering directly when you already know, can both be reasonable. Exact-match punishes the model for choosing a different-but-fine path.

So we re-scored with an LLM judge that asked a fairer question: was this action a reasonable next step? The acceptable-action rate came out around 85%, well above the 74% exact-match. The lesson, which echoes our argument that distilled models need receipts: pick a metric that matches how the model is actually used. For an agent, "did it do something reasonable" beats "did it match one gold answer."
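The gap between the two scores is easy to reproduce on a toy case. In the sketch below the judge is a plain lookup table standing in for the LLM judge, and the contexts, actions, and function names are invented for illustration.

```python
def exact_match_rate(gold, pred):
    """Share of turns where the action equals the single gold action."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def acceptable_rate(contexts, pred, judge):
    """judge(context, action) -> bool; in the write-up an LLM plays this role."""
    return sum(bool(judge(c, p)) for c, p in zip(contexts, pred)) / len(pred)


# Stand-in judge: a table of reasonable next actions per situation (toy data).
ACCEPTABLE = {
    "known answer": {"reply", "search_kb"},
    "needs lookup": {"search_kb"},
    "security issue": {"escalate"},
}


def table_judge(context, action):
    return action in ACCEPTABLE[context]


contexts = ["known answer", "needs lookup", "security issue"]
gold = ["search_kb", "search_kb", "escalate"]
pred = ["reply", "search_kb", "escalate"]

exact = exact_match_rate(gold, pred)
acceptable = acceptable_rate(contexts, pred, table_judge)
```

Here the model answers directly where the gold trajectory searched first. Exact-match counts that as an error; the acceptable-action score does not.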

Q&A: What metric actually decides if a support agent is deployable?

Over-escalation, paired with recall. Here is a trap we walked into and want to flag. On the streaming company, the base model's acceptable-action rate (88%) slightly beat v2.1's (83%). Taken alone, that reads as "base is better." It is not. The judge rates each action in isolation, and escalating is almost always a defensible single action, so the base model's habit of escalating one ticket in five (its 20.9% over-escalation) quietly racks up "acceptable" verdicts. The judge does not see the business cost of escalating a fifth of your tickets: you no longer have automation; you have an expensive routing layer.

v2.1 escalates about 1 in 22 of the streaming tickets it should not (4.5%). That is the number that makes it a product. The point generalizes: never read one evaluation cell in isolation. A high "acceptable" or "recall" number bought with rampant over-escalation is not a good agent.
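The "1 in N" figures and the paired reading can be checked with a few lines of arithmetic. The streaming numbers below come from the results table; the gate thresholds (70% minimum recall, 10% maximum over-escalation) and the function names are purely illustrative assumptions, not a standard we are proposing.

```python
def one_in(rate_percent: float) -> int:
    """Express a percentage as 'about 1 in N'."""
    return round(100 / rate_percent)


def passes_gate(recall, over_escalation, min_recall=70.0, max_over=10.0):
    """Hypothetical gate: both conditions must hold, never just one."""
    return recall >= min_recall and over_escalation <= max_over


# (escalate recall %, over-escalation %) on the held-out streaming company.
streaming = {"base": (82.4, 20.9), "v2": (41.2, 4.5), "v2.1": (70.6, 4.5)}
verdicts = {name: passes_gate(*cells) for name, cells in streaming.items()}

v21_one_in = one_in(4.5)    # v2.1 over-escalation
base_one_in = one_in(20.9)  # base over-escalation
```

Under these example thresholds the base model fails on over-escalation, v2 fails on recall, and only v2.1 clears both, which is why the two numbers have to be read together.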

The honest caveats

The evaluation is synthetic. The companies and the judge are model-generated. This measures whether the skill transfers across our synthetic distribution; it is not yet a human-rated, real-world test. That remains the honest next milestone.
It is human-in-the-loop by design. The point of calibrated escalation is that a person catches what the model hands off. We would not run this fully autonomously, and no narrow 9B support model should be run that way.
Real-world quality is bounded by retrieval. The agent is only as grounded as what search_kb returns.

The takeaway

A tool-using, multi-turn support specialist now fits in a 9B model on a single consumer GPU, distilled from an open frontier teacher for about the price of a few coffees. But the headline is not the size or the cost. It is that the hard part of an agent is not answering questions, which small models already do well. The hard part is calibration: knowing when to act and when to stop. Capability does not give you that for free, and it can quietly take it away. You get it by training for the rare action and by measuring the failure mode that actually costs you, not the one that is easy to count.

Distillation makes frontier capability portable. Calibration is what makes it safe to hand a customer.