The neural network architecture underlying all modern large language models. Introduced by Google researchers in the 2017 paper "Attention Is All You Need," the Transformer uses self-attention to process an entire sequence in parallel rather than token by token.
Before Transformers, language models processed text word by word using RNNs, making it difficult to capture long-range dependencies. The Transformer's self-attention mechanism allows every token to "attend" to every other token simultaneously, capturing global context in one pass.
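The "every token attends to every other token" idea can be shown in a few lines. Below is a minimal numpy sketch of scaled dot-product self-attention, not a full Transformer layer: the projection matrices `Wq`, `Wk`, `Wv` are random placeholders standing in for learned weights, and multi-head splitting, masking, and normalization are omitted.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # every token scored against every other token
    # softmax over each row turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output vector is a weighted mix of ALL tokens

# Toy example: 4 tokens, 8-dimensional embeddings, random placeholder weights
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one context-aware vector per token, computed in one pass
```

Note that nothing in the loop-free computation depends on token order or distance: token 1 and token 400 interact through the same matrix product, which is why long-range dependencies come for free compared with an RNN.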
The Transformer architecture is the foundation of every major LLM: GPT, Claude, Gemini, Llama, Mistral, and Falcon are all Transformer-based models. Differences between them come from training data, scale, fine-tuning approach, and architectural variations like grouped-query attention.