
AI Safety

The field and practice of ensuring AI systems do what they're supposed to do without causing unintended harm, at every scale from today's applications to future, more powerful systems. At the deployment level, this means guardrails, safety testing, and content filtering. At the research level, it means alignment, interpretability, and understanding how AI models make decisions. Anthropic was founded specifically to work on AI safety and publishes extensively on the topic.

In practice

Anthropic deliberately builds Claude to refuse requests that could cause serious harm, even when cleverly framed. AI safety is the field behind those decisions: researchers working to ensure that AI systems do what humans actually want, avoid unintended harm, and remain controllable as they become more capable.

Related concepts