Benchmarking LLMs: What Leaderboards Actually Measure

The Benchmark Game

Every major LLM release comes with a table of benchmark scores: the model beats the previous state of the art on MMLU, HumanEval, and GSM8K, and the next model's announcement features the same charts. Benchmark scores have become the primary language of LLM comparison, but good model selection depends on understanding what these scores actually measure and where they systematically mislead.

The Major Benchmarks

MMLU (Massive Multitask Language Understanding): 57 subjects, 14,000+ multiple-choice questions. Tests general knowledge across academic and professional domains. Widely reported; widely criticized for being saturated (most frontier models score 85%+) and for having noisy labels.
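HumanEval / MBPP: Code generation benchmarks. HumanEval asks models to complete Python functions from a signature and docstring and scores them by running unit tests; MBPP (Mostly Basic Python Problems) covers roughly 1,000 entry-level Python tasks, also checked by tests. More reliable than MMLU for code-specific use cases, but both are close to saturated for frontier models.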
GSM8K: Grade-school math word problems. Tests multi-step arithmetic reasoning. Performance here differentiates models well at the 7B scale but is saturated for frontier models.
MATH: Hard competition math problems. Still differentiates frontier models; reported scores range from roughly 40% to 90%+ depending on model size and specialization.
Chatbot Arena (LMSYS): Human preference evaluation via blind side-by-side comparisons, with the votes aggregated into a ranking. Often regarded as the most reliable public indicator of real-world chat quality because it uses actual users with real questions.
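
To make the side-by-side idea concrete, here is a minimal sketch of how pairwise votes can be turned into a ranking. It uses a plain Elo update as an assumed stand-in; the text above does not say which rating method the leaderboard uses, and the function name, the vote format, and the constants (k=32, base rating 1000) are all illustrative choices.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise preference votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". This vote format is hypothetical.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins under the Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Zero-sum update: whatever model_a gains, model_b loses.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)
```

Sequential Elo depends on the order of the votes, so a real leaderboard would likely fit all votes jointly or average over shuffles; the sketch only shows why many noisy individual preferences can still produce a stable ordering.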

The Contamination Problem

Benchmark contamination, where training data includes benchmark test questions, is a persistent problem. A model that has seen the test questions during training can score higher without being any more capable. Several studies have shown the effect: the performance of some models drops noticeably on "held-out" versions of MMLU and GSM8K that could not have appeared in training data. The extent of contamination in closed models is unknown; in open models, the pretraining data can be analyzed, but doing so is tedious.
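
For open models, one common way to look for contamination is to check whether long word sequences from benchmark items appear verbatim in the training corpus. The sketch below is a simplified illustration of that idea, not the method of any particular study: the function names, the 8-word window, and the 0.5 overlap threshold are assumptions, and a real corpus would need an index rather than an in-memory set.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(benchmark_items, corpus_docs, n=8, threshold=0.5):
    """Flag benchmark items whose n-grams largely appear in the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    flagged = []
    for item in benchmark_items:
        grams = ngrams(item, n)
        if not grams:
            continue  # item is shorter than n words; cannot be judged
        overlap = len(grams & corpus_grams) / len(grams)
        if overlap >= threshold:
            flagged.append(item)
    return flagged
```

Exact n-gram matching only catches verbatim or near-verbatim copies. Paraphrased or translated test questions would pass this check, which is part of why contamination analysis stays tedious even when the data is available.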

What Benchmarks Don't Measure

Benchmark scores are poor predictors of performance on:

Your specific task and domain (if it's different from the benchmark's distribution)
Long-form generation quality (most benchmarks use short-form outputs)
Instruction following and format compliance
Factual accuracy on niche topics
Latency and throughput at scale
Alignment and safety properties
Consistency across many calls with the same input
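
The last item in the list above is cheap to measure yourself. As a minimal sketch, assuming a hypothetical `generate(prompt)` callable that wraps whatever model API you use, the function below calls the model repeatedly with the same input and reports how often it returns its most common answer.

```python
from collections import Counter


def consistency(generate, prompt, n_calls=10, normalize=str.strip):
    """Return (agreement_rate, most_common_output) over repeated calls.

    generate is a hypothetical callable taking a prompt string and
    returning the model's output string.
    """
    outputs = [normalize(generate(prompt)) for _ in range(n_calls)]
    most_common, count = Counter(outputs).most_common(1)[0]
    return count / n_calls, most_common
```

Exact string agreement suits short answers such as labels or numbers. For long-form outputs, the `normalize` argument would need to extract the part you care about (for example, a final answer) before comparing.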

What You Should Actually Do

The right approach to model selection:

Start with Chatbot Arena rankings for general quality signal
Look at task-specific leaderboards if your use case has one (code, math, languages)
Build a representative eval set from your own data and measure what matters to you
Test your top 3-4 candidates on your eval set
Factor in latency, cost, and reliability requirements
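
The eval-set steps above can be sketched as a small harness. This is an illustration under stated assumptions, not a full evaluation framework: each candidate is a hypothetical `generate(prompt)` callable, the eval set is a list of (prompt, expected answer) pairs drawn from your own data, and exact match is a placeholder scorer that you would replace with whatever measures quality for your task.

```python
import time


def exact_match(output, expected):
    """Placeholder scorer: case-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def evaluate(candidates, eval_set, scorer=exact_match):
    """Score each candidate on the eval set and record mean latency.

    candidates: dict mapping a model name to a generate(prompt) callable
    eval_set: list of (prompt, expected) pairs
    """
    if not eval_set:
        raise ValueError("eval_set must not be empty")

    results = []
    for name, generate in candidates.items():
        correct = 0
        latencies = []
        for prompt, expected in eval_set:
            start = time.perf_counter()
            output = generate(prompt)
            latencies.append(time.perf_counter() - start)
            correct += bool(scorer(output, expected))
        results.append({
            "model": name,
            "accuracy": correct / len(eval_set),
            "mean_latency_s": sum(latencies) / len(latencies),
        })
    # Highest accuracy first; latency is reported alongside for trade-offs.
    return sorted(results, key=lambda r: r["accuracy"], reverse=True)
```

Keeping accuracy and latency in the same result row makes the trade-off in the final step visible: a candidate that is slightly less accurate but much faster or cheaper may still be the better choice for your requirements.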

Benchmark scores are a starting point for narrowing the candidate list, not a final answer.