AI Codex
Evaluation & SafetyDevelopersCTOs

Red Teaming

Deliberately trying to get an AI to do something it shouldn't — produce harmful content, reveal confidential instructions, behave inconsistently — in order to find and fix vulnerabilities before real users encounter them. Borrowed from cybersecurity. In AI, red teaming helps identify failure modes, safety gaps, and unexpected behaviors. Anthropic does extensive red teaming before releasing new Claude models. Companies deploying Claude in sensitive applications should also red team their own system prompts and use cases.

In practice

Before deploying Claude for customer service, your team spends a week trying to make it say harmful things, reveal the system prompt, make false promises, and bypass your guardrails. Everything they find gets fixed before launch. That adversarial testing process is red teaming — finding failure modes before customers do.

Related concepts