Distilled models need receipts

A distilled model is a promise in a small file: this student kept the useful part of the teacher. The awkward question is: how do you know?

A benchmark number on a model card is not enough. A leaderboard rank is not enough. A viral demo is definitely not enough. If distilled models are going to become something people discover, buy, rent, fine-tune, and deploy, the market needs more than models. It needs receipts.

That means evaluation harnesses.

In short

The next serious layer in model distillation is not another clever compression trick. It is the machinery that proves what survived compression: task suites, fresh test sets, reproducible prompts, hardware measurements, license metadata, and failure reports. Without that, a distilled-model marketplace becomes a flea market with nicer typography.

With it, small models become legible products.

Distillation changes the trust problem

When you download a general open model, you usually ask three questions:

Is it capable enough?
Can I run it on my hardware?
Is the license safe for my use?

A distilled model adds three more:

What exactly was transferred? Math reasoning? Code repair? Medical intake? Customer-support style?
What was lost? Long-context robustness? Calibration? Refusal behavior? Multilingual performance?
What was the teacher and data path? Open teacher, closed API, synthetic traces, human data, proprietary corpus?

That is the core marketplace problem. Distillation creates specialized students, and specialization is only valuable if it is measurable.

A 7B student that claims “GPT-4-class legal drafting” is not a product. It is a dare. A 7B student with a reproducible eval pack (tasks, prompts, judge rubric, baseline models, latency, quantization, and known failures) starts to look like something a company can actually assess.

The eval harness is the bottle label

A distilled model without an eval harness is like a bottle from an unlabeled still. Maybe it is excellent. Maybe it makes you blind. Exciting either way, but not enterprise procurement material.

The label should say:

Teacher lineage: what model or models generated the traces, labels, critiques, or logits.
Student base: architecture, size, context length, license, and training method.
Task boundary: what the model is meant to be good at, and what it is not claiming.
Eval suite: exact datasets, prompt templates, judges, scoring scripts, and versions.
Local footprint: quantization used, RAM/VRAM required, throughput, latency, and context-window behavior.
Failure modes: the places where the distilled student diverges from the teacher.

This is not bureaucratic garnish. It is how a buyer tells “useful specialist” from “overfit demo goblin.”
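
One way to keep the label honest is to make it a machine-checkable record rather than a paragraph on a model card. The sketch below mirrors the six label fields listed above; the class name, field names, and example values are all hypothetical, not an existing standard.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class DistillLabel:
    """Hypothetical record mirroring the six label fields above."""
    teacher_lineage: list[str]       # models that produced traces, labels, critiques, or logits
    student_base: str                # architecture, size, context length, license, training method
    task_boundary: str               # what the model claims, and what it does not
    eval_suite: list[str]            # datasets, prompt templates, judges, scoring scripts, versions
    local_footprint: dict[str, str]  # quantization, RAM/VRAM, throughput, latency
    failure_modes: list[str]         # where the student diverges from the teacher


def missing_sections(label: DistillLabel) -> list[str]:
    """Return the names of label sections that were left empty."""
    return [f.name for f in fields(label) if not getattr(label, f.name)]


# Illustrative values only; nothing here describes a real model.
label = DistillLabel(
    teacher_lineage=["example-teacher"],
    student_base="example 7B base, permissive license, SFT on teacher traces",
    task_boundary="contract-clause extraction; no claims about drafting",
    eval_suite=["example-clause-suite v1"],
    local_footprint={"quantization": "Q4_K_M"},
    failure_modes=[],  # nothing reported yet, so the check below flags it
)
print(missing_sections(label))
```

An empty failure-modes list is the interesting case: a label that reports no failures has usually not looked for any.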

The open-source pieces already exist

The good news: we do not need to invent the whole stack.

The open evaluation ecosystem is already rich. EleutherAI's lm-evaluation-harness gives researchers a common way to run language-model benchmarks. Hugging Face LightEval is built for configurable, reproducible model evaluation. Stanford's HELM pushed the idea that models should be measured across scenarios, metrics, and transparency dimensions instead of one leaderboard score.

Those tools are not “distillation marketplace” tools out of the box. But they are the raw material.

The missing layer is packaging them around the questions distillation specifically raises:

Did the student preserve the teacher's reasoning style, or only the final answers?
Does performance survive quantization from BF16 to Q4_K_M?
Does the model stay good when prompts drift away from the teacher-generated training distribution?
Does the small student fail gracefully, or confidently imitate the teacher's tone while losing the teacher's judgment?
Is the model actually better than fine-tuning the same base directly on the task data?

That last one matters. Distillation has to earn its keep. Sometimes the right answer is not “distill harder.” Sometimes it is “your base model plus ordinary supervised fine-tuning was enough.” A good eval harness should be willing to say that. Rude, but useful.
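
A harness can ask that rude question with a paired comparison on the same test items. The sketch below uses a paired bootstrap, which is one reasonable choice among several; the function name and the per-item scores are made up for illustration.

```python
import random


def paired_bootstrap_win_rate(distilled, sft, n_resamples=2000, seed=0):
    """Fraction of resamples in which the distilled model outscores the SFT baseline.

    `distilled` and `sft` are per-item scores on the SAME test items, in the same order.
    """
    if len(distilled) != len(sft) or not distilled:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [d - s for d, s in zip(distilled, sft)]
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping each pair together.
        resampled_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_sum > 0:
            wins += 1
    return wins / n_resamples


# Made-up per-item correctness (1 = right, 0 = wrong) for two students on ten items.
distilled_scores = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
sft_scores = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]
print(paired_bootstrap_win_rate(distilled_scores, sft_scores))
```

A win rate near 1.0 suggests the distilled student really is ahead on this suite. A middling rate means the gap could be resampling noise, and the plain fine-tune may have been enough.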

What a distilled-model scorecard should include

If we were designing the minimum useful scorecard for a distilled model, it would have five sections.

1. Capability retention

Compare teacher, base student, and distilled student on the same task suite.

The key number is not just the student's score. It is the retention ratio: how much of the teacher's advantage over the base model survived distillation.

If the base model scores 40, the teacher scores 90, and the distilled student scores 75, the student recovered 35 of the 50 points that separate base from teacher. It retained 70% of the teacher lift. That tells you more than “75” by itself.
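
As a minimal sketch, the retention ratio is a one-line calculation. The function name is ours, and the scores are assumed to be on the same scale for all three models.

```python
def retention_ratio(base: float, teacher: float, student: float) -> float:
    """Share of the teacher's lift over the base model that the student kept."""
    lift = teacher - base
    if lift <= 0:
        # With no teacher advantage there is nothing to retain.
        raise ValueError("teacher must outscore the base model")
    return (student - base) / lift


# The worked example from the text: base 40, teacher 90, student 75.
print(f"{retention_ratio(40, 90, 75):.0%}")  # prints 70%
```

Note that the ratio can go below zero (distillation made the base worse) or above one (the student beat its teacher on this suite), and both are worth reporting rather than clipping.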

2. Local cost

A distilled model exists to run somewhere smaller than the teacher. So the scorecard should report:

model size and quantization
RAM or VRAM footprint
tokens per second
first-token latency
maximum usable context on realistic hardware

This connects directly to running distilled models locally. A model that is brilliant but too slow for the target hardware missed the point.
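
Two of those numbers, first-token latency and tokens per second, can be measured with a small timing wrapper around any streaming generator. This is a sketch: `generate` stands in for whatever runtime you actually use, and `fake_generate` is a placeholder so the example runs on its own.

```python
import time
from typing import Callable, Iterable


def measure_stream(generate: Callable[[str], Iterable[str]], prompt: str) -> dict:
    """Time a streaming generation: first-token latency and overall tokens per second."""
    start = time.perf_counter()
    first_token_s = None
    count = 0
    for _ in generate(prompt):
        if first_token_s is None:
            first_token_s = time.perf_counter() - start
        count += 1
    total_s = time.perf_counter() - start
    return {
        "tokens": count,
        "first_token_s": first_token_s,
        # Throughput here includes the wait for the first token.
        "tokens_per_s": count / total_s if total_s > 0 else 0.0,
    }


def fake_generate(prompt: str):
    """Placeholder for a real model runtime: yields five tokens with a small delay."""
    for token in ["a", "b", "c", "d", "e"]:
        time.sleep(0.01)
        yield token


print(measure_stream(fake_generate, "hello"))
```

For a scorecard, the same measurement should be repeated across several prompt lengths and reported alongside the hardware and quantization it was run on, since a single timing says little.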

3. Distribution robustness

Distillation often works beautifully on the teacher's distribution and gets weird off it. The student learns the path through the maze, not the whole city.

So the eval needs near-domain, edge-domain, and adversarial-ish tasks:

Same format, new examples.
Same task, different wording.
Same domain, messier user input.
Inputs that expose shortcuts in the synthetic traces.

This is where many shiny distilled models start sweating through their little synthetic suits.
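
In harness terms, this means tagging every test item with the slice it belongs to and reporting the drop relative to the in-distribution slice. A minimal sketch, with hypothetical slice names and made-up results:

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs. Returns accuracy per slice."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, seen]
    for name, ok in results:
        totals[name][0] += int(ok)
        totals[name][1] += 1
    return {name: correct / seen for name, (correct, seen) in totals.items()}


def drop_from(reference, scores):
    """How far each slice falls below the reference slice (positive = worse)."""
    ref = scores[reference]
    return {name: ref - s for name, s in scores.items() if name != reference}


# Made-up results: four items per slice.
results = (
    [("in_distribution", True)] * 4
    + [("reworded", True)] * 3 + [("reworded", False)]
    + [("messy_input", True)] * 2 + [("messy_input", False)] * 2
)
scores = accuracy_by_slice(results)
print(drop_from("in_distribution", scores))
```

The useful comparison is running the same slices on the teacher: a student whose drop is much steeper than the teacher's learned the maze, not the city.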

4. Calibration and abstention

Small students can inherit the teacher's confidence style without inheriting the teacher's competence. That is dangerous.

For real deployments, we should measure whether the model knows when it does not know: refusal quality, uncertainty expression, and whether confidence tracks correctness. A specialized medical triage model, legal assistant, or finance model that sounds certain while being wrong is worse than a weaker model with brakes.
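
Whether confidence tracks correctness can be summarized with expected calibration error (ECE), a standard metric: bin the predictions by confidence, then average the gap between confidence and accuracy in each bin. The sketch below assumes you already have a confidence in [0, 1] for each answer; how you extract that from a model is a separate design decision.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy across confidence bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece


# Made-up case: the model says 0.9 four times and is right three times.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Reporting ECE for teacher and student side by side makes the inherited-confidence problem visible: similar stated confidence with a larger gap on the student is exactly the failure described above.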

5. Provenance and license

Distilled models are commercial objects. The scorecard has to include the boring stuff because the boring stuff is where lawsuits and procurement reviews live:

teacher license
student base license
synthetic data source
whether closed-model outputs were used
redistribution rights
intended commercial-use status

This is the practical follow-on to the legal distillation question. Capability without provenance is not a product. It is a liability wearing a benchmark hat.

Why this matters for a marketplace

A marketplace for distilled models cannot look like a generic model directory with a “buy” button taped on. The unit being sold is not just weights. It is trusted capability in a small runtime envelope.

That means the marketplace object should be something like:

model weights or API access
eval pack
model card
license/provenance packet
recommended hardware profiles
known failure modes
update history

The marketplace should let a buyer ask: “I have 16 GB of RAM, need contract-clause extraction, require commercial-use rights, and care more about precision than recall. Which model has receipts?”
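
That buyer question is, underneath, a filter and a sort over structured listings. The sketch below shows the shape of it; the `Listing` fields, the model names, and every number are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    """Hypothetical marketplace entry; every field name here is an assumption."""
    name: str
    task: str
    min_ram_gb: float
    commercial_use: bool
    precision: float
    recall: float
    has_eval_pack: bool


def find_models(listings, task, ram_gb, need_commercial=True):
    """Listings that fit the task, the hardware, and the license, ranked precision-first."""
    matches = [
        m for m in listings
        if m.task == task
        and m.min_ram_gb <= ram_gb
        and (m.commercial_use or not need_commercial)
        and m.has_eval_pack  # no receipts, no listing
    ]
    return sorted(matches, key=lambda m: (m.precision, m.recall), reverse=True)


# Invented catalog entries.
catalog = [
    Listing("student-a", "contract-clause-extraction", 12, True, 0.94, 0.81, True),
    Listing("student-b", "contract-clause-extraction", 10, True, 0.90, 0.92, True),
    Listing("student-c", "contract-clause-extraction", 24, True, 0.97, 0.88, True),
    Listing("student-d", "contract-clause-extraction", 8, False, 0.95, 0.90, True),
    Listing("student-e", "contract-clause-extraction", 8, True, 0.99, 0.99, False),
]
print([m.name for m in find_models(catalog, "contract-clause-extraction", ram_gb=16)])
```

The point of the sketch is that none of these filters work unless the scorecard fields exist as data, which is the argument for the eval pack in the first place.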

That is a very different product from “sort by likes.”

The punchline

Distillation makes frontier capability portable. Evaluation makes it trustworthy.

The first wave of the field proved that small students can inherit surprising ability from large teachers. The next wave has to prove, cleanly and repeatably, which ability survived, what it costs to run, and where it breaks.

That is how distilled models become more than clever artifacts. That is how they become things people can choose, compare, buy, and deploy.

The still is only half the system. The other half is the label on the bottle.