The process of running a trained AI model on new input data to generate predictions or outputs, as distinct from the training process itself.
Inference is what happens when an AI model is put to work: given an input (a user's question, an image, a document), the model processes it through its learned parameters and produces an output. This is distinct from training, where the model's parameters are updated to fit the training data; during inference the parameters stay fixed.
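The distinction can be illustrated with a toy linear model. This is a minimal sketch, not how any particular framework works: `predict` and `train_step` are hypothetical names, and the update rule assumes plain gradient descent on half the squared error.

```python
def predict(weights, bias, features):
    """Inference: read the learned parameters, never modify them."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def train_step(weights, bias, features, target, learning_rate=0.1):
    """Training: one gradient-descent step that returns updated parameters."""
    error = predict(weights, bias, features) - target
    # Gradient of 0.5 * error**2 with respect to each weight is error * x.
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


weights, bias = [1.0, 2.0], 0.5
output = predict(weights, bias, [3.0, 4.0])  # parameters unchanged afterwards
weights, bias = train_step(weights, bias, [3.0, 4.0], target=10.5)  # parameters change
```

Calling `predict` any number of times leaves `weights` and `bias` as they were; only `train_step` produces new values.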
Inference efficiency is one of the most commercially important factors in deploying LLMs. Key metrics include latency (often reported as time to first token), throughput (tokens generated per second), and cost per token. Optimizations such as quantization, speculative decoding, KV cache management, and batching aim to make inference faster and cheaper.
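As a rough illustration of how these three metrics relate, the sketch below computes them for a single request from token arrival timestamps. The function name, the field names, and the idea of a single total cost per request are assumptions made for the example, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class InferenceMetrics:
    time_to_first_token_s: float
    tokens_per_second: float
    cost_per_token_usd: float


def summarize_request(request_start_s, token_times_s, total_cost_usd):
    """Summarize one request from the times (in seconds) at which tokens arrived."""
    if not token_times_s:
        raise ValueError("need at least one generated token")
    ttft = token_times_s[0] - request_start_s
    elapsed = token_times_s[-1] - request_start_s
    n_tokens = len(token_times_s)
    # Throughput here counts end-to-end time, including the wait for the first token.
    throughput = n_tokens / elapsed if elapsed > 0 else float("inf")
    return InferenceMetrics(ttft, throughput, total_cost_usd / n_tokens)


metrics = summarize_request(0.0, [0.5, 1.0, 1.5, 2.0], total_cost_usd=0.004)
print(metrics)
```

In practice, throughput is also reported across many concurrent requests, which is where batching changes the picture; this sketch covers only the single-request view.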
Inference infrastructure has become a major industry in itself. Dedicated inference providers like Together AI, Groq, and Fireworks AI specialize in serving open-weight models at high speed. GPUs from NVIDIA and AMD, along with custom silicon from Google (TPUs) and AWS (Inferentia and Trainium), are widely used for inference workloads.