AI Codex
Evaluation & Safety | Claude, Developers, CTOs

Interpretability

Also: mechanistic interpretability

The research challenge of understanding what is actually happening inside an AI model: which internal components activate for which inputs, and why the model makes specific decisions. It differs from explainability, which means getting Claude to explain its output in words; interpretability is about mechanically understanding the model's internals. Anthropic has a dedicated interpretability research team and publishes its findings publicly. The work is foundational for making AI systems trustworthy at scale.

In practice

Claude gives you an answer, but you want to understand why it gave that answer: which parts of your prompt influenced the response and which concepts it activated internally. Interpretability research tries to open that black box. For most users this is academic; for companies deploying Claude in high-stakes decisions, it is about being able to audit and trust the system.
