Reinforcement Learning from Human Feedback — the training technique used to align language models with human preferences. Human raters compare model outputs, and the model is trained to produce the kinds of outputs humans prefer.
RLHF is why ChatGPT, Claude, and Gemini feel different from raw pre-trained models: it shapes tone, instruction-following, and safety behaviors. Without RLHF, a pre-trained model simply continues text and often ignores instructions, despite impressive raw capability.
The RLHF process: (1) Generate model outputs for many prompts. (2) Have human raters rank the outputs from best to worst. (3) Train a reward model on these preferences. (4) Use reinforcement learning (PPO) to fine-tune the LLM to maximize the reward model's score.
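The math behind steps (3) and (4) can be sketched in a few lines. This is a simplified illustration, not any lab's actual implementation: it assumes a scalar reward model trained with the standard Bradley-Terry pairwise loss, and the common PPO formulation in which the reward is penalized by KL divergence from a frozen reference model. The function names and `beta` coefficient are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Step (3): Bradley-Terry pairwise loss for the reward model.

    Minimizing -log P(chosen beats rejected) pushes the reward model
    to score human-preferred outputs higher than rejected ones.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

def kl_penalized_reward(rm_score: float,
                        logp_policy: float,
                        logp_ref: float,
                        beta: float = 0.1) -> float:
    """Step (4): the signal PPO maximizes (illustrative form).

    The reward model's score is discounted by an estimate of KL
    divergence from the frozen reference (pre-RLHF) model, which
    keeps the fine-tuned policy from drifting into degenerate text
    that merely exploits the reward model.
    """
    return rm_score - beta * (logp_policy - logp_ref)
```

When the reward model scores the chosen and rejected outputs equally, the loss is ln 2 ≈ 0.693; as the margin `r_chosen - r_rejected` grows, the loss falls toward zero. The `beta` term trades off reward maximization against staying close to the reference model.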
Anthropic extended RLHF with Constitutional AI (CAI), using a written set of principles to guide model behavior rather than relying entirely on human rater preferences — making Claude's safety behaviors more systematic.