What is a Benchmark?

Question

What is a Benchmark?

Accepted Answer

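A benchmark is a standardized test used to compare AI models against each other on specific capabilities such as math, reasoning, coding, or reading comprehension. Because every model answers the same questions and is scored by the same rule, benchmarks like MMLU (multiple-choice questions across many academic subjects) or HumanEval (Python programming problems) give you an apples-to-apples comparison across models. That makes them useful for getting a general sense of capability, but they are often misleading for real-world use: the model that scores highest on a benchmark isn't always the best for your specific task. Evals built from your own data and requirements matter more than general benchmarks.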
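
To make the point about your own evals concrete, here is a minimal sketch of a task-specific eval in Python. Everything in it is an assumption for illustration: the invoice-extraction task, the three test cases, the exact-match scoring rule, and the two stand-in "models" (plain functions that take the place of real model calls). It is not how any particular benchmark is implemented, only the general shape: a fixed set of inputs, expected outputs, and one scoring rule applied to every model.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical eval set: (prompt, expected answer) pairs taken from your own task.
EVAL_CASES: List[Tuple[str, str]] = [
    ("Extract the invoice total: 'Total due: $41.50'", "41.50"),
    ("Extract the invoice total: 'Amount payable 12.00 USD'", "12.00"),
    ("Extract the invoice total: 'Balance: $7'", "7.00"),
]


def accuracy(model: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases where the model output exactly matches the expected answer."""
    correct = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return correct / len(cases)


# Stand-in models (assumptions): in practice these would call real models.
def model_a(prompt: str) -> str:
    # Only recognizes amounts already written with two decimal places.
    match = re.search(r"\d+\.\d{2}", prompt)
    return match.group(0) if match else ""


def model_b(prompt: str) -> str:
    # Accepts any number and normalizes it to two decimal places.
    match = re.search(r"\d+(?:\.\d+)?", prompt)
    return f"{float(match.group(0)):.2f}" if match else ""


if __name__ == "__main__":
    for name, model in [("model_a", model_a), ("model_b", model_b)]:
        print(f"{name}: {accuracy(model, EVAL_CASES):.2f}")
```

The ranking this produces depends entirely on the cases you choose, which is the point: a model that leads on a general benchmark can still lose on the handful of inputs your application actually sees. Exact match is the simplest scoring rule; free-form tasks usually need something more forgiving.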