AI Codex
Prompt Engineering, Developers, CTOs

Jailbreaking

Techniques people use to get AI models to bypass their safety guidelines, often by framing requests as hypothetical or fictional, or by wrapping them in elaborate roleplay. Jailbreaking is an ongoing challenge: as AI companies tighten safeguards, people find new ways around them. Claude is designed to maintain its values across different framings. Understanding jailbreaking matters for operators because some users will try it on your deployed AI product.

In practice

Someone asks Claude to roleplay as an AI with no restrictions, hoping it will then provide harmful instructions it would normally refuse. That attempt to circumvent Claude's safety training is jailbreaking. Anthropic actively works to make Claude resistant to these attempts, both because the outputs can be harmful and because successful jailbreaks undermine trust in the system.

Related concepts