Reinforcement Learning from Human Feedback — the training technique used to align language models with human preferences. Human raters compare model outputs, and the model is trained to produce the kinds of outputs humans prefer.
RLHF is why ChatGPT, Claude, and Gemini feel different from raw pre-trained models: it shapes tone, instruction-following, and safety behaviors. Without RLHF, a pre-trained model simply continues text and often ignores instructions, despite impressive raw capability.
The RLHF process: (1) Generate model outputs for many prompts. (2) Have human raters rank the outputs from best to worst. (3) Train a reward model on these preferences. (4) Use reinforcement learning (PPO) to fine-tune the LLM to maximize the reward model's score.
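The math behind steps (3) and (4) can be sketched in a few lines. This is a simplified illustration, not any lab's actual implementation: it assumes a scalar reward model trained with the standard Bradley-Terry pairwise loss, and the common PPO formulation in which the reward is penalized by KL divergence from a frozen reference model. The function names and `beta` coefficient are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Step (3): Bradley-Terry pairwise loss for the reward model.

    Minimizing -log P(chosen beats rejected) pushes the reward model
    to score human-preferred outputs higher than rejected ones.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

def kl_penalized_reward(rm_score: float,
                        logp_policy: float,
                        logp_ref: float,
                        beta: float = 0.1) -> float:
    """Step (4): the signal PPO maximizes (illustrative form).

    The reward model's score is discounted by an estimate of KL
    divergence from the frozen reference (pre-RLHF) model, which
    keeps the fine-tuned policy from drifting into degenerate text
    that merely exploits the reward model.
    """
    return rm_score - beta * (logp_policy - logp_ref)
```

When the reward model scores the chosen and rejected outputs equally, the loss is ln 2 ≈ 0.693; as the margin `r_chosen - r_rejected` grows, the loss falls toward zero. The `beta` term trades off reward maximization against staying close to the reference model.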
Anthropic extended RLHF with Constitutional AI (CAI), using a written set of principles to guide model behavior rather than relying entirely on human rater preferences — making Claude's safety behaviors more systematic.