A model compression technique where a smaller "student" model is trained to replicate the outputs and behavior of a larger, more capable "teacher" model.
Model distillation transfers knowledge from a large, expensive model (the teacher) to a smaller, faster model (the student). Rather than learning only from hard data labels, the student is trained on the teacher's output probability distributions. These "soft targets" carry richer information than hard labels alone, such as how plausible the teacher considers each incorrect answer.
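To make the soft-target idea concrete, here is a minimal sketch of a distillation loss in plain Python. It assumes the common formulation in which teacher and student logits are softened with a temperature and compared with KL divergence, then blended with an ordinary hard-label loss. The `temperature` and `alpha` parameters, the function names, and the example logits are illustrative assumptions, not details taken from this entry.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target loss (match the teacher) with a hard-label loss.

    temperature and alpha are assumed hyperparameters for this sketch.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Scaling by temperature**2 keeps the soft term comparable across temperatures.
    soft_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Standard cross-entropy against the ground-truth class, at temperature 1.
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Illustrative logits for a three-class problem where class 0 is correct.
teacher_logits = [4.0, 2.5, 0.1]
good_student = [3.8, 2.4, 0.2]   # close to the teacher
poor_student = [0.1, 0.2, 4.0]   # confidently wrong

good_loss = distillation_loss(good_student, teacher_logits, hard_label=0)
poor_loss = distillation_loss(poor_student, teacher_logits, hard_label=0)
print(f"good student loss: {good_loss:.4f}")
print(f"poor student loss: {poor_loss:.4f}")
```

A student whose logits track the teacher's gets a lower loss than one that disagrees with it. In a real training loop this quantity would be computed per batch in a deep learning framework and minimized by gradient descent.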
Distillation is one of the most commercially important techniques in AI deployment. It enables organizations to build small, fast, domain-specific models that approach the quality of frontier models on specific tasks at a fraction of the inference cost. For example, smaller models in Meta's Llama series and Microsoft's Phi models were trained with distillation or distillation-inspired techniques, such as learning from teacher-generated synthetic data, and can match or outperform much larger models on targeted benchmarks.
For enterprises, distillation is a practical path to building custom, on-premises AI models. Using GPT-4o or Claude to generate high-quality synthetic training data, then fine-tuning a smaller open-source model on that data, allows deployment on private infrastructure with predictable costs and no ongoing API dependency. In this setting the student typically learns from the teacher's generated text rather than from its full probability distributions.
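The data-generation step of that workflow can be sketched as follows. This is only an outline under stated assumptions: `build_distillation_dataset` and `stub_teacher` are hypothetical names, the prompt/response JSONL layout is one common convention rather than a required format, and `stub_teacher` is a placeholder for whatever client actually calls the teacher model.

```python
import io
import json
from typing import Callable, Iterable, TextIO

def build_distillation_dataset(prompts: Iterable[str],
                               teacher: Callable[[str], str],
                               out: TextIO) -> int:
    """Write one JSON line per prompt/response pair; return how many were written."""
    count = 0
    for prompt in prompts:
        response = teacher(prompt).strip()
        if not response:
            continue  # skip empty teacher outputs instead of training on them
        out.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
        count += 1
    return count

def stub_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a real teacher model call (no API is contacted)."""
    return "" if "skip" in prompt else f"Answer to: {prompt}"

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
written = build_distillation_dataset(
    ["What is distillation?", "skip this one"], stub_teacher, buffer
)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(written, records)
```

The resulting file is the training set for fine-tuning the student. Keeping the teacher behind a plain callable makes it easy to swap in a different model or add filtering of low-quality responses before training.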