AI Codex

Why Claude has values instead of just rules

Most AI safety is a list of don'ts. Constitutional AI is the method Anthropic used to teach Claude to reason about right and wrong — the same way you'd want a thoughtful colleague to.


There are two ways to teach someone to behave well.

The first: give them a rulebook. "Don't do X. Don't say Y. If Z happens, do W." This works until they encounter a situation the rulebook didn't anticipate — and then they're lost, or they find a loophole, or they apply the rule so rigidly they miss the point.

The second: teach them why the rules exist. Help them understand the values behind the rules so they can reason their way through new situations, even ones nobody wrote a rule for.

Constitutional AI is Anthropic's approach to the second method. It's the training technique that gave Claude something closer to judgment than a filter.

How it works

The "constitution" in Constitutional AI is a set of principles — plain-language statements about what it means to be helpful, honest, and harmless. Things like: avoid content that could be used to harm people, respect human autonomy, be honest even when the truth is uncomfortable.

During training, Claude was asked to evaluate its own outputs against these principles. Not just "does this break a rule" — but "does this response reflect good values? Would a thoughtful person reading this think it was the right thing to say?"

This self-critique loop ran thousands of times, across a huge variety of situations. The result is a model that has internalized a sense of what good judgment looks like, not just memorized a list of forbidden phrases.

Why this is different from other approaches

Most AI safety systems work by pattern matching. They detect certain words or request types and block them. This creates two problems:

Overcorrection. Legitimate requests get blocked because they superficially resemble harmful ones. You've probably seen this — asking about medication interactions for a safety guide and getting refused because the words triggered a filter.

Undercorrection. Bad actors learn to phrase things differently to bypass the filter. The pattern gets worked around.

Constitutional AI sidesteps both problems because Claude is reasoning about intent and impact, not just pattern-matching on words. It can engage with difficult topics in appropriate contexts while still exercising genuine judgment about harm.

What this means when you build with Claude

Constitutional AI is why Claude behaves differently from models trained only on human approval ratings, which tend to optimize for "sounds good" rather than "is good."

Claude will push back on requests it finds problematic — not because a rule fired, but because it made a judgment call. It will also engage thoughtfully with difficult topics in contexts where that's appropriate.

This creates a more collaborative dynamic. You can discuss why Claude responded a certain way. You can provide context that changes the assessment. Claude's position isn't a hard wall — it's a considered view that can update with new information.

The honest trade-off

Constitutional AI doesn't make Claude perfect. No training method does. Claude makes mistakes, misjudges context, and occasionally declines things it shouldn't.

But it does make Claude's safety behavior more like principled judgment and less like a brittle filter. For anyone building applications where nuance matters — which is everyone — that's a meaningful difference.
