Evaluation & Safety | Developers | CTOs
Benchmark
A standardized test used to compare model capabilities. Benchmarks are useful for broad capability comparison, but their scores often do not predict performance on a specific real-world task.
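As a rough illustration of the gap between a benchmark score and task performance, the sketch below scores two toy "models" on a small general benchmark and on an in-house task set. Everything here is an assumption made for the example: the `Model` type, the `accuracy` helper, the exact-match scoring rule, and all of the data are invented, and real benchmarks use far larger item sets and more careful scoring.

```python
from typing import Callable, Sequence

# Hypothetical type: a "model" is any function from a prompt to an answer string.
Model = Callable[[str], str]


def accuracy(model: Model, items: Sequence[tuple[str, str]]) -> float:
    """Return the fraction of (prompt, expected) pairs answered exactly."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = sum(model(prompt).strip() == expected for prompt, expected in items)
    return correct / len(items)


# Toy data, invented for illustration only.
GENERAL_BENCHMARK = [("2+2", "4"), ("capital of France", "Paris")]
IN_HOUSE_TASK = [("refund policy days", "30"), ("support tier for plan B", "gold")]


def model_a(prompt: str) -> str:
    # Strong on the general benchmark, knows nothing about the in-house task.
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


def model_b(prompt: str) -> str:
    # Weaker on the general benchmark, but covers the in-house task.
    known = {"2+2": "4", "refund policy days": "30", "support tier for plan B": "gold"}
    return known.get(prompt, "unknown")


# Map each model name to (general benchmark accuracy, in-house task accuracy).
scores = {
    name: (accuracy(m, GENERAL_BENCHMARK), accuracy(m, IN_HOUSE_TASK))
    for name, m in {"model_a": model_a, "model_b": model_b}.items()
}

for name, (general, task) in scores.items():
    print(f"{name}: general={general:.2f} task={task:.2f}")
```

In this toy setup the model that ranks first on the general benchmark ranks last on the in-house task, which is why a benchmark result is usually best paired with an evaluation built from your own task data.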