WhySoGeek.

Model Distillation: Shrinking LLMs Without Losing Smarts

How knowledge distillation transfers a large LLM's behavior to a small, fast student model in 2026, and when it beats fine-tuning.

Sam Carter 7 min read

A frontier model is expensive to run, slow to respond, and overkill for most production tasks. You rarely need a model that can write poetry, prove theorems, and debug Rust just to classify support tickets. Model distillation is how teams capture the part of a big model's intelligence they actually use and pour it into a small model that runs an order of magnitude cheaper. By 2026 it has become standard practice rather than a research flourish.

Quick answer

Model distillation trains a small "student" model to mimic a large "teacher" model's behavior, transferring most of the capability you actually use into a model that runs roughly an order of magnitude cheaper and faster. The student learns from the teacher's soft probability distributions, which carry more information than plain right/wrong labels, so it often generalizes better than a model trained on hard labels alone. It beats fine-tuning when your goal is a smaller, faster model for a narrow task at high volume or on-device, and it pairs well with quantization and RAG. The catch: the student is only as good as the data it learns from, so curate the distillation set to match real traffic.

Key takeaways

  • Distillation trains a small student model to mimic a large teacher model's behavior, transferring capability without retraining from scratch.
  • The student learns from the teacher's soft probabilities, which carry far more information than plain right/wrong labels.
  • The result is faster inference, lower latency, and cheaper deployment, ideal for edge and high-volume workloads.
  • Distillation differs from fine-tuning: fine-tuning adapts a model to a task, distillation compresses a model into a smaller one.
  • It works best when the teacher is genuinely good at the target task and you have or can generate plenty of representative inputs.

Soft labels carry hidden knowledge

The core insight is subtle. When a teacher model processes an input, it does not just output the single correct answer; it produces a full probability distribution over the possible answers. For a sentiment task it might assign 92% to "positive," 6% to "neutral," and 2% to "negative." Those "wrong" probabilities are informative: they tell the student that this example is mostly positive but has a faint neutral undertone.

Training a student on these soft targets teaches it the teacher's nuanced sense of similarity and confidence, not just the hard answer. In effect, the student learns how the teacher weighs the alternatives, which is why a distilled model often generalizes better than one trained on hard labels alone.
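To make the soft-target idea concrete, here is a minimal sketch of a distillation loss in plain Python. It assumes the common recipe of softening both distributions with a temperature, measuring their KL divergence, and blending that with an ordinary hard-label loss. The function names, the logits, and the `temperature` and `alpha` values are illustrative assumptions, not settings from any particular pipeline.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term with a hard-label cross-entropy term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Scaling by temperature**2 is a common convention that keeps the
    # soft term's gradients comparable across temperatures.
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    hard_term = -math.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft_term + (1 - alpha) * hard_term

# Illustrative logits for the classes [positive, neutral, negative].
teacher = [3.0, 0.3, -0.8]
close_student = [2.8, 0.2, -0.9]
far_student = [-1.0, 0.0, 3.0]
print(distillation_loss(teacher, close_student, hard_label=0))
print(distillation_loss(teacher, far_student, hard_label=0))
```

A student whose distribution tracks the teacher's gets a lower loss than one that disagrees, even when both are scored against the same hard label. Real training would compute this with a tensor library over batches; the arithmetic is the same.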


The main flavors of distillation

Distillation comes in several forms, and 2026 production pipelines mix them:

  • Soft-target (response) distillation: the student matches the teacher's output distributions. The classic, general-purpose approach.
  • Feature-based distillation: the student is trained to reproduce the teacher's internal representations, not just its final outputs.
  • Relational distillation: the student learns the relationships the teacher draws between examples, preserving its structural understanding.
  • On-policy distillation: the student generates its own outputs, the teacher critiques them, and the student learns from that feedback, which keeps training aligned with what the student will actually produce at inference time.

For most teams, response distillation on a large set of representative prompts is the practical starting point: run your real inputs through the teacher, collect its outputs, and train the student to match.
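The collection step of that workflow can be sketched in a few lines. This is a minimal illustration, assuming the teacher is reachable as a plain callable that takes a prompt and returns text. `build_distillation_set`, `to_jsonl`, and `fake_teacher` are hypothetical names, and `fake_teacher` is a stand-in so the example runs without any real model.

```python
import json

def build_distillation_set(prompts, teacher):
    """Run each prompt through the teacher and keep (prompt, response) pairs."""
    records = []
    seen = set()
    for prompt in prompts:
        text = prompt.strip()
        if not text or text in seen:  # skip blanks and exact duplicates
            continue
        seen.add(text)
        records.append({"prompt": text, "response": teacher(text)})
    return records

def to_jsonl(records):
    """Serialize records one JSON object per line, a common training-data layout."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def fake_teacher(prompt):
    # Hypothetical stand-in for a real teacher model call.
    return "billing" if "invoice" in prompt.lower() else "other"

prompts = ["My invoice is wrong", "  My invoice is wrong ", "", "App crashes on login"]
records = build_distillation_set(prompts, fake_teacher)
print(to_jsonl(records))
```

In practice you would swap `fake_teacher` for a call to the real teacher and feed the resulting file to whatever training job fits the student. Deduplicating up front keeps repeated prompts from dominating the training signal.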

Distillation vs fine-tuning vs RAG

These get conflated, so be precise:

  • Fine-tuning adapts an existing model to a task or style using labeled examples. The model size stays the same.
  • Distillation compresses a large model's behavior into a smaller model. The point is a size and speed reduction.
  • RAG does not change the model at all; it feeds relevant external context at inference time.

Side by side, here is what each technique changes and when to reach for it:

| Technique | Changes model size? | What it does | Best for |
| --- | --- | --- | --- |
| Distillation | Yes, smaller student | Transfers a teacher's behavior to a small model | Cutting cost/latency at high volume or on-device |
| Fine-tuning | No | Adapts a model to a task or style | Domain accuracy and consistent tone |
| Quantization | No (lower precision) | Stores weights at lower bit-depth | Squeezing the same model smaller/faster |
| RAG | No | Feeds fresh external context at inference | Up-to-date facts without retraining |

They are complementary. You might distill a frontier model down to a fast student, then fine-tune that student on your domain, then wrap it in RAG for fresh facts. The trade-offs among fine-tuning, RAG, and plain prompting are laid out in "Fine-tuning vs RAG vs prompting."

Tip

A distilled model is only as good as the data it learns from. If your teacher rarely sees a certain kind of input during distillation, the student will be weak there. Curate the distillation set to cover the real distribution of production traffic, including the awkward edge cases.
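One rough way to check coverage is to compare how often each kind of input appears in production against how often it appears in the distillation set. The sketch below assumes you already have a category label for each input (for example, from routing rules or a cheap classifier). `coverage_gaps`, the `min_ratio` threshold, and the sample categories are illustrative assumptions.

```python
from collections import Counter

def coverage_gaps(production_labels, distillation_labels, min_ratio=0.5):
    """Return categories whose share of the distillation set falls below
    min_ratio times their share of production traffic."""
    prod = Counter(production_labels)
    dist = Counter(distillation_labels)
    prod_total = sum(prod.values())
    dist_total = sum(dist.values()) or 1  # avoid dividing by zero on an empty set
    gaps = {}
    for category, count in prod.items():
        prod_share = count / prod_total
        dist_share = dist.get(category, 0) / dist_total
        if dist_share < min_ratio * prod_share:
            gaps[category] = (prod_share, dist_share)
    return gaps

# Illustrative traffic: refunds are 10% of production but 1% of the training set.
production = ["billing"] * 60 + ["login"] * 30 + ["refund"] * 10
distillation = ["billing"] * 80 + ["login"] * 19 + ["refund"] * 1
for category, (prod_share, dist_share) in coverage_gaps(production, distillation).items():
    print(f"{category}: {prod_share:.0%} of traffic, {dist_share:.0%} of distillation set")
```

Any category this flags is a candidate for collecting more teacher outputs before training the student.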

When distillation pays off

Distillation is worth the effort when these line up:

  1. High request volume. The per-call savings of a smaller model only matter at scale, but at scale they are enormous.
  2. Latency sensitivity. A distilled student responds faster, which is the difference between a usable and a sluggish interactive product.
  3. Edge or on-device deployment. When a model has to run on a phone or a constrained server, a distilled small model may be the only option.
  4. A narrow, well-defined task. Distillation shines when you need a fraction of the teacher's range. It is poor at preserving the full general-purpose breadth of a frontier model.
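The volume argument is easy to put into numbers. Here is a back-of-the-envelope sketch that assumes a fixed per-call cost for each model and a one-time cost for the distillation project itself. `break_even_requests` is a hypothetical helper and every figure is invented for illustration, not a real price.

```python
import math

def break_even_requests(teacher_cost_per_call, student_cost_per_call, one_time_cost):
    """Number of requests after which the student's savings repay the
    one-time distillation cost, or None if it never pays back."""
    saving = teacher_cost_per_call - student_cost_per_call
    if saving <= 0:
        return None  # the student is not cheaper per call
    return math.ceil(one_time_cost / saving)

# Invented figures: 1 cent per teacher call, 0.1 cent per student call,
# and 5,000 units of currency spent on data collection and training.
requests_needed = break_even_requests(0.01, 0.001, 5000)
print(f"Break-even after about {requests_needed:,} requests")
```

At a few thousand requests a day that payback takes months; at millions a day it takes hours, which is why volume comes first on the list.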

Cloud providers have made this easier: Amazon Bedrock, among others, offers managed distillation where you supply prompts and it produces a tuned student. Many of the strongest small models on the market today, including some open-weight reasoners, were themselves distilled from larger siblings. The on-device economics that make this attractive are covered in "Small language models on-device agents."

Frequently asked questions

How is distillation different from quantization?

Quantization shrinks a model by storing its weights at lower precision, keeping the same architecture. Distillation trains a genuinely smaller model to imitate a larger one. The two are complementary: you can distill a model and then quantize the student for even smaller, faster inference.
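For contrast with distillation, here is a toy sketch of what "lower precision" means. It assumes the simplest scheme, symmetric 8-bit quantization with a single scale for the whole list of weights. Production quantizers are more elaborate, and `quantize_int8` and `dequantize` are illustrative names rather than any library's API.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127].
    Assumes a non-empty list of weights."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.52, -1.27, 0.003, 0.9]  # illustrative values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, worst_error)
```

The number of weights and the architecture are unchanged; only the storage format is coarser, at the price of a small rounding error per weight. Distillation changes the model itself.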

Will a distilled model match the teacher's quality?

On the specific tasks it was distilled for, often very closely. Across the teacher's full general-purpose range, no: distillation trades breadth for size. The art is distilling on data that covers everything your application actually does.

Do I need access to the teacher's internals?

Not for response distillation. You only need the teacher's outputs, which you can collect through its normal API by running your prompts through it. Feature-based distillation does need internal access, so it is limited to models you control.

Is distillation legal with a commercial API model?

It depends entirely on that provider's terms of service. Many commercial APIs explicitly prohibit using their outputs to train competing models. Check the license before distilling from a closed model, and prefer open-weight teachers when the terms are uncertain.
