Back to the wiki

Benchmarks

The exams used to compare models.

The analogy

To compare cars you look at fuel use, trunk space and safety in standardized tests. Models work the same way: batteries of exams — math, code, reasoning — that every model takes under equal conditions so they can be compared.

In detail

A benchmark is a set of tasks with known answers measuring specific abilities (MMLU for knowledge, HumanEval for code…). Useful but imperfect: models may have “seen” the questions during training (contamination), and a good score doesn't guarantee good performance on your real use case. The best benchmark is always your own.

An example

Two models score similarly on public exams, but when you test them on your real documents, one understands your tables far better. For you, that private benchmark is the one that counts.

Related concepts