Benchmark
A standardized test used to compare AI models against each other on specific capabilities such as math, reasoning, coding, and reading comprehension. Benchmarks like MMLU (multiple-choice questions across many academic subjects) or HumanEval (Python coding problems) give you an apples-to-apples comparison across models. They are useful for getting a general sense of capability, but often misleading for real-world use: the model that scores highest on a benchmark isn't always the best for your specific task. Your own evals matter more than general benchmarks.
In practice
You're deciding between Claude and a competitor for your legal document tool. You run both models on 200 real contract clauses and measure accuracy, hallucination rate, and output format compliance. That test is a benchmark of your own: a structured way to compare models on the tasks that actually matter to you, rather than relying on published leaderboards.
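A comparison like this can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended harness: the task (labeling each clause with a clause type), the JSON output format, the names `Example`, `Scores`, and `evaluate`, and the definition of "hallucination" as a label outside the allowed set are all hypothetical. Each model is represented as a plain function from clause text to raw output, so no particular vendor API is assumed.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence, Set


@dataclass
class Example:
    clause: str    # contract clause text shown to the model
    expected: str  # reference label written by a human reviewer


@dataclass
class Scores:
    accuracy: float            # share of examples with the correct label
    hallucination_rate: float  # share with a label outside the allowed set
    format_compliance: float   # share whose output is JSON with a "label" key


def evaluate(
    model: Callable[[str], str],
    examples: Sequence[Example],
    allowed_labels: Set[str],
) -> Scores:
    """Run one model over every example and compute the three metrics."""
    if not examples:
        raise ValueError("need at least one example")

    correct = hallucinated = well_formed = 0
    for ex in examples:
        raw = model(ex.clause)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: not compliant, not correct
        if not isinstance(parsed, dict) or "label" not in parsed:
            continue  # valid JSON, but not the agreed shape
        well_formed += 1

        label = parsed["label"]
        if label not in allowed_labels:
            hallucinated += 1  # assumed definition: an invented clause type
        elif label == ex.expected:
            correct += 1

    n = len(examples)
    return Scores(correct / n, hallucinated / n, well_formed / n)


def compare(
    models: dict,
    examples: Sequence[Example],
    allowed_labels: Set[str],
) -> dict:
    """Score every model on the same examples so results are comparable."""
    return {name: evaluate(fn, examples, allowed_labels)
            for name, fn in models.items()}
```

To use it, you would wrap each model's API call in a function that takes a clause and returns the raw response text, then pass both to `compare` along with the same 200 examples. Running every model on an identical example set is what makes the numbers comparable.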