AI Codex
Evaluation & SafetyDevelopersCTOs

Content Moderation

Filtering AI inputs and outputs to detect and block harmful content — hate speech, graphic violence, personal information, or anything that violates platform rules. In consumer AI products, this happens automatically on every request. For companies building on the Claude API, you're responsible for moderation appropriate to your use case and user base. Anthropic provides default safety filtering; you can add additional layers on top.

In practice

You're building a community platform and want to automatically flag harmful posts. You use Claude to review each submission — checking for hate speech, threats, or spam — before it goes live. Content moderation is using AI to police what's allowed in a system, either as a first pass before human review or as the final gate.

Related concepts