AI Codex

RLHF

Also: Reinforcement Learning from Human Feedback

A training method in which humans compare AI responses ("this answer was better than that one") and those comparisons are used to teach the model to give more helpful, less harmful answers. It's a big reason why Claude writes naturally and declines harmful requests rather than just producing whatever text is statistically most likely. Anthropic used RLHF as part of Claude's training, alongside its own Constitutional AI approach.

In practice

Early Claude responses were technically coherent but often unhelpful or tone-deaf. Human trainers compared thousands of response pairs, marking which response in each pair was better, and those preferences were used to train the model toward more helpful, appropriately toned responses. That cycle of collecting human preferences and training on them is RLHF. It's a major reason Claude feels like a thoughtful assistant rather than a text predictor.
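
For developers, the preference step can be sketched in a few lines. The example below is a toy illustration only, not Anthropic's pipeline: it assumes each response has already been reduced to two hypothetical numeric features (helpfulness and harmfulness), and it fits a linear "reward model" with the pairwise Bradley-Terry loss commonly used for preference data. All names (`PAIRS`, `train_reward_model`, the feature values) are invented for illustration. A full RLHF setup would then use the learned reward to fine-tune the language model with reinforcement learning, which this sketch leaves out.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Linear reward: a weighted sum of a response's features."""
    return sum(w * f for w, f in zip(weights, features))


def preference_loss(weights, pairs):
    """Mean Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    total = 0.0
    for chosen, rejected in pairs:
        margin = reward(weights, chosen) - reward(weights, rejected)
        total += -math.log(sigmoid(margin))
    return total / len(pairs)


def train_reward_model(pairs, n_features, lr=0.5, steps=200):
    """Plain gradient descent on the pairwise preference loss."""
    weights = [0.0] * n_features
    for _ in range(steps):
        grads = [0.0] * n_features
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of -log sigmoid(margin) with respect to the margin.
            scale = -(1.0 - sigmoid(margin))
            for i in range(n_features):
                grads[i] += scale * (chosen[i] - rejected[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grads)]
    return weights


# Hypothetical data: each pair is (features of the response the rater
# preferred, features of the response the rater rejected).
# Assumed feature order: [helpfulness, harmfulness].
PAIRS = [
    ([0.9, 0.1], [0.2, 0.1]),
    ([0.7, 0.0], [0.8, 0.9]),
    ([0.6, 0.2], [0.1, 0.3]),
]

weights = train_reward_model(PAIRS, n_features=2)
print("learned weights:", weights)
print("loss before:", preference_loss([0.0, 0.0], PAIRS))
print("loss after: ", preference_loss(weights, PAIRS))
```

After training, the reward model should score each preferred response above the rejected one, with a positive weight on helpfulness and a negative weight on harmfulness. Real systems learn the reward from the response text itself with a neural network rather than from hand-picked features.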

Related concepts