Question 1

What is RLHF?

Accepted Answer

A training method where humans rate AI responses — "this answer was better than that one" — and those ratings are used to teach the model to give more helpful, less harmful answers over time. It's a big reason why Claude writes naturally and declines harmful requests rather than just spitting out whatever is statistically likely. Anthropic used RLHF as part of how Claude was trained, alongside their own Constitutional AI approach.

Question 2

What is another name for RLHF?

Accepted Answer

RLHF is also known as: Reinforcement Learning from Human Feedback.