GLOSSARY

Latency

DEFINITION

The time elapsed between submitting a request to an AI model and receiving a response, measured either to the first token or to the complete response. It is a key performance metric for production AI applications.

Latency in AI systems is typically measured in two ways: Time to First Token (TTFT), the delay before the response starts streaming, and total response time, the delay until the final token arrives. For conversational AI, TTFT is often the more important user experience metric, since streaming lets users begin reading while generation continues.
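
The two measurements above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's API: `measure_latency` and `fake_stream` are hypothetical names, and `fake_stream` only simulates a streaming response with `time.sleep`. In a real application the iterable would be the token stream returned by your client library, and the timer should start at the moment the request is submitted.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_latency(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one streamed response."""
    start = clock()  # assumed to coincide with request submission
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = clock() - start  # Time to First Token
        count += 1
    total = clock() - start  # total response time
    if ttft is None:
        ttft = total  # empty response: no first token ever arrived
    return ttft, total, count


def fake_stream(first_delay: float = 0.05, gap: float = 0.01, n: int = 5) -> Iterator[str]:
    """Hypothetical stand-in for a model's token stream (simulation only)."""
    time.sleep(first_delay)  # simulated wait before the first token
    for i in range(n):
        if i:
            time.sleep(gap)  # simulated delay between later tokens
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, count = measure_latency(fake_stream())
    print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, tokens: {count}")
```

Because the clock is passed in as a parameter, the same function can be driven by a fake clock in tests without waiting on real time.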

Latency is affected by many factors: model size, hardware, network distance to the inference server, prompt length, output length, and whether reasoning (thinking) is enabled. Longer prompts mainly raise TTFT, while longer outputs mainly raise total response time. Models with reasoning enabled typically have noticeably higher latency than standard models, because they generate thinking tokens before the visible answer begins. Smaller, distilled models are often deployed specifically to reduce latency in time-sensitive applications.

For production AI applications, latency requirements vary widely by use case. Real-time voice assistants need sub-500ms TTFT. Customer support chatbots can tolerate 1–2 seconds. Batch document processing pipelines may accept minutes per document. Choosing the right model and infrastructure tier for your latency requirements is a critical deployment decision.
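
One way to act on these requirements is to compare measured TTFT samples against a per-use-case budget. The sketch below is illustrative only. The budget values are one reading of the figures above, and it assumes the chatbot figure refers to TTFT and takes its upper bound. The names `TTFT_BUDGET_SECONDS`, `ttft_percentile`, and `meets_ttft_budget` are hypothetical, and checking the 95th percentile rather than the average is an assumption added here, not something the definition prescribes.

```python
import math
from typing import Dict, Sequence

# Illustrative budgets in seconds, read from the use cases described above.
TTFT_BUDGET_SECONDS: Dict[str, float] = {
    "voice_assistant": 0.5,   # "sub-500ms TTFT"
    "support_chatbot": 2.0,   # upper end of "1-2 seconds" (assumed to mean TTFT)
}


def ttft_percentile(samples: Sequence[float], pct: float = 95.0) -> float:
    """Nearest-rank percentile of TTFT samples, in seconds."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_ttft_budget(use_case: str, samples: Sequence[float], pct: float = 95.0) -> bool:
    """True if the chosen percentile of measured TTFT is at or under the budget."""
    return ttft_percentile(samples, pct) <= TTFT_BUDGET_SECONDS[use_case]


if __name__ == "__main__":
    measured = [0.3, 0.6, 1.2]  # made-up sample values for demonstration
    for use_case in TTFT_BUDGET_SECONDS:
        print(use_case, meets_ttft_budget(use_case, measured))
```

A check like this could run against each candidate model or infrastructure tier before deployment. Batch pipelines would compare total response time instead of TTFT.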
