AI Codex
Evaluation & Safety | Developers | CTOs

Benchmark

A standardized test used to compare model capabilities. Benchmark scores are useful for general capability comparison, but they often fail to predict performance on real-world tasks.
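
As a rough illustration of the idea, the sketch below scores two stand-in "models" on the same fixed set of questions using exact-match accuracy. Everything in it is hypothetical: the four test items, the `score` helper, and the lookup-table models `model_a` and `model_b` are invented for the example and do not correspond to any real benchmark or model.

```python
from typing import Callable, List, Tuple

# Hypothetical fixed test set: (prompt, expected answer) pairs.
# Every model is scored on exactly these items, which is what
# makes the comparison "standardized".
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
    ("opposite of hot", "cold"),
]


def score(model: Callable[[str], str], items: List[Tuple[str, str]]) -> float:
    """Return the exact-match accuracy of `model` on `items`."""
    correct = sum(1 for prompt, expected in items if model(prompt).strip() == expected)
    return correct / len(items)


# Stand-in "models" (assumptions for illustration): plain lookup tables.
def model_a(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    answers = {"2 + 2": "4", "opposite of hot": "cold"}
    return answers.get(prompt, "")


if __name__ == "__main__":
    for name, model in [("model_a", model_a), ("model_b", model_b)]:
        print(f"{name}: {score(model, BENCHMARK):.2f}")
```

The resulting numbers rank the two models only on these four items. They say nothing about prompts outside the test set, which is one reason a benchmark score can diverge from real-world task performance.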