Benchmarking LLMs: What Leaderboards Actually Measure

The Benchmark Game

Every major LLM release ships with a table of benchmark scores showing that the model beats the previous state of the art on MMLU, HumanEval, and GSM8K, and the next release repeats the exercise with the same charts. Benchmark scores have become the primary language of LLM comparison, but making good model selection decisions requires understanding what these scores actually measure and where they systematically mislead.

The Major Benchmarks

  • MMLU (Massive Multitask Language Understanding): 57 subjects, 14,000+ multiple-choice questions. Tests general knowledge across academic and professional domains. Widely reported; widely criticized for being saturated (most frontier models score 85%+) and for having noisy labels.
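  • HumanEval / MBPP: Code generation benchmarks. HumanEval asks models to write Python functions from a signature and docstring; MBPP (Mostly Basic Python Problems) covers a larger set of entry-level Python tasks. Both score the generated code by running it against test cases. More reliable than MMLU for code-specific use cases, but still relatively saturated.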
  • GSM8K: Grade-school math word problems. Tests multi-step arithmetic reasoning. Performance here differentiates models well at the 7B scale but is saturated for frontier models.
  • MATH: Hard competition math problems. Still differentiates frontier models; scores currently range from 40% to 90%+ depending on model size and specialization.
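  • Chatbot Arena (LMSYS): Human preference evaluation via blind side-by-side comparisons, in which users vote between two anonymous model responses to their own prompts. Often considered the most reliable public indicator of real-world chat quality because the votes come from actual users asking real questions.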
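
To make the last item concrete, here is a minimal sketch of how pairwise votes can be turned into a ranking using an Elo-style update. This article does not describe the leaderboard's actual rating method, so treat Elo here as a familiar illustration only; the function name, the K-factor of 32, the base rating of 1000, and the model names are all assumptions made for the example.

```python
from collections import defaultdict


def elo_ratings(battles, k=32.0, base=1000.0):
    """Compute Elo-style ratings from (model_a, model_b, winner) votes.

    winner is "a", "b", or "tie". Names and constants are illustrative.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        delta = k * (score_a - expected_a)
        # Zero-sum update: what one model gains, the other loses.
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


# Hypothetical votes between three made-up models.
battles = [
    ("model-x", "model-y", "a"),
    ("model-x", "model-z", "a"),
    ("model-y", "model-z", "tie"),
    ("model-z", "model-x", "b"),
]
ratings = elo_ratings(battles)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

Sequential Elo depends on the order in which votes arrive, so with few votes per pair the resulting ranking is noisy. This is one reason to treat small rating gaps between models as ties.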

The Contamination Problem

Benchmark contamination, where training data includes benchmark test questions, is a persistent problem. A model that has seen the test questions during training scores higher without being more capable. Several studies have demonstrated this: performance drops significantly when models are evaluated on "held-out" versions of MMLU and GSM8K whose questions were not in the training data. The extent of contamination in closed models is unknown; for open models with published pretraining data, the data can be analyzed directly, though doing so is tedious.
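
One common way to look for contamination when the training data is available is to check for long word n-grams shared between benchmark items and corpus documents. The sketch below is a toy version of that idea, not the method used by any particular study; the 8-word window, the function names, and the example texts are assumptions chosen for illustration.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Return (fraction flagged, flagged items).

    An item is flagged if it shares at least one n-gram with the corpus.
    Items shorter than n words produce no n-grams and are never flagged.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged


# Made-up corpus document and benchmark questions.
corpus = [
    "Scraped page: A train leaves the station at 3 pm traveling at 60 miles "
    "per hour. How far does it go in 2 hours? Answer: 120",
]
benchmark = [
    "A train leaves the station at 3 pm traveling at 60 miles per hour. "
    "How far does it go in 2 hours?",
    "A baker makes 12 loaves each morning and sells 9. "
    "How many loaves are left after 5 days?",
]
rate, flagged = contamination_rate(benchmark, corpus)
print(f"flagged {len(flagged)} of {len(benchmark)} items ({rate:.0%})")
```

Exact n-gram matching misses paraphrased or translated copies of a question, so a low rate from a check like this is weak evidence of a clean benchmark.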

What Benchmarks Don't Measure

Benchmark scores are poor predictors of performance on:

  • Your specific task and domain (if it's different from the benchmark's distribution)
  • Long-form generation quality (most benchmarks use short-form outputs)
  • Instruction following and format compliance
  • Factual accuracy on niche topics
  • Latency and throughput at scale
  • Alignment and safety properties
  • Consistency across many calls with the same input
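
Some of these properties are cheap to measure yourself. As a rough sketch, the code below calls a model repeatedly with the same prompt and reports how often the outputs agree and how often they parse as JSON. The `generate` callable is a hypothetical stand-in for whatever client you use to call a model, and treating "valid JSON" as the format requirement is an assumption for the example.

```python
import json
import random
from collections import Counter


def consistency_and_format(generate, prompt, n_calls=20):
    """Call generate(prompt) n_calls times; report agreement and JSON validity."""
    outputs = [generate(prompt) for _ in range(n_calls)]
    valid = 0
    for out in outputs:
        try:
            json.loads(out)
            valid += 1
        except json.JSONDecodeError:
            pass
    # Share of calls that returned the single most frequent output.
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return {
        "agreement": most_common_count / n_calls,
        "json_valid": valid / n_calls,
    }


def make_fake_model(seed=0):
    """Hypothetical stand-in for a real model call, with seeded randomness."""
    rng = random.Random(seed)
    choices = [
        '{"label": "positive"}',
        '{"label": "positive"}',
        '{"label": "negative"}',
        "Sure! The label is positive.",
    ]
    return lambda prompt: rng.choice(choices)


report = consistency_and_format(make_fake_model(), "Classify: great product")
print(report)
```

Exact string agreement is a strict measure; for free-form answers you would likely compare a normalized or parsed form of the output instead.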

What You Should Actually Do

The right approach to model selection:

  1. Start with Chatbot Arena rankings for general quality signal
  2. Look at task-specific leaderboards if your use case has one (code, math, languages)
  3. Build a representative eval set from your own data and measure what matters to you
  4. Test your top 3-4 candidates on your eval set
  5. Factor in latency, cost, and reliability requirements
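
Steps 3 to 5 can be sketched as a small harness that runs each candidate over your eval set and records both quality and latency. Everything named here is an assumption for illustration: the candidates are plain callables standing in for real model clients, exact match stands in for whatever metric matters to you, and the three-item eval set is a placeholder for examples drawn from your own data.

```python
import time
from statistics import mean


def evaluate(candidates, eval_set, score):
    """Run each candidate over eval_set; return mean score and mean latency.

    candidates: dict mapping a name to a callable prompt -> output
    eval_set:   list of (prompt, reference) pairs from your own data
    score:      callable (output, reference) -> float in [0, 1]
    """
    results = {}
    for name, generate in candidates.items():
        scores, latencies = [], []
        for prompt, reference in eval_set:
            start = time.perf_counter()
            output = generate(prompt)
            latencies.append(time.perf_counter() - start)
            scores.append(score(output, reference))
        results[name] = {"score": mean(scores), "latency_s": mean(latencies)}
    return results


def exact_match(output, reference):
    """Case- and whitespace-insensitive exact match."""
    return float(output.strip().lower() == reference.strip().lower())


# Placeholder eval set and hypothetical stand-ins for real model calls.
eval_set = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]
candidates = {
    "model-a": lambda p: {"2 + 2 =": "4", "Capital of France?": "Paris"}.get(p, "unknown"),
    "model-b": lambda p: {"2 + 2 =": "4"}.get(p, "unknown"),
}

results = evaluate(candidates, eval_set, exact_match)
for name, r in sorted(results.items(), key=lambda kv: -kv[1]["score"]):
    print(f"{name}: score={r['score']:.2f} latency={r['latency_s'] * 1000:.3f} ms")
```

With an eval set this small, a difference of one or two items is within noise. A real eval set should be large enough that the gap between candidates is bigger than the run-to-run variation.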

Benchmark scores are a starting point for narrowing the candidate list, not a final answer.