How this site defines benchmark saturation, where the data comes from, and the academic work that informs our approach.
We follow the framework established by Akhtar et al. (2026), which defines benchmark saturation as the “loss of reliable discriminative power among state-of-the-art models.” A benchmark is considered saturated when top-performing models cluster so tightly that their performance differences fall within the expected margin of evaluation uncertainty, rendering them statistically indistinguishable.
This means saturation does not require a model to score 100%. A benchmark can saturate well below its theoretical maximum if the remaining gap reflects noise, ambiguous questions, or data errors rather than meaningful capability differences. For example, PubMedQA plateaued around 81% and WMDP-Bio around 87% — both below 100% but with top models no longer meaningfully differentiable.
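To make this criterion concrete, here is a minimal sketch of one way an indistinguishability check could be implemented. It is not the exact test from Akhtar et al. (2026); the `ModelScore` shape, the `stdErr` field, and the 95% z-threshold are illustrative assumptions.

```ts
// Hypothetical result record; field names are illustrative, not from our pipeline.
interface ModelScore {
  model: string;
  score: number;   // e.g. accuracy in [0, 1]
  stdErr: number;  // standard error of the score estimate
}

// Treat a benchmark as saturated when no pair of top models differs by more
// than the combined evaluation uncertainty (two-sided comparison at ~95%
// confidence, z = 1.96; an assumed threshold).
function isSaturated(topModels: ModelScore[], z = 1.96): boolean {
  for (let i = 0; i < topModels.length; i++) {
    for (let j = i + 1; j < topModels.length; j++) {
      const a = topModels[i];
      const b = topModels[j];
      const combinedErr = Math.sqrt(a.stdErr ** 2 + b.stdErr ** 2);
      // Any statistically significant gap means the benchmark still
      // discriminates among top models, so it is not saturated.
      if (Math.abs(a.score - b.score) > z * combinedErr) return false;
    }
  }
  return true;
}
```

For an accuracy benchmark with n questions, one common approximation is the binomial standard error sqrt(p(1 − p) / n), which is why benchmarks with small question sets can saturate in this statistical sense well before scores approach 100%.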
For each benchmark, we compute `activeLifespanMonths` as the number of months from the benchmark's introduction to its approximate saturation date. For benchmarks classified as “nearing” saturation, lifespan is measured from introduction to April 2026; this is a lower bound that keeps growing for as long as the benchmark resists full saturation.
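As a worked illustration of that arithmetic, the sketch below counts whole calendar months between the two dates. The `Benchmark` shape and the `CUTOFF` constant are assumptions for the example, not our actual schema.

```ts
// Hypothetical record shape; field names are illustrative.
interface Benchmark {
  introduced: Date;
  saturated?: Date;                  // undefined while still "nearing"
  status: "saturated" | "nearing";
}

// Snapshot date used as the lower-bound endpoint for "nearing" benchmarks.
const CUTOFF = new Date("2026-04-01");

// Whole calendar months from introduction to saturation (or to the cutoff).
function activeLifespanMonths(b: Benchmark): number {
  const end = b.status === "saturated" && b.saturated ? b.saturated : CUTOFF;
  return (
    (end.getFullYear() - b.introduced.getFullYear()) * 12 +
    (end.getMonth() - b.introduced.getMonth())
  );
}
```

For example, a benchmark introduced in September 2020 and saturated in June 2023 yields a lifespan of 33 months.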
Limitation: Saturation dates are approximate, based on when published analyses identified performance plateaus rather than exact mathematical inflection points. Different researchers may identify slightly different saturation dates for the same benchmark.
Benchmark scores, saturation dates, and metadata are compiled from the following publicly available sources.
| Source | Used for | URL |
|---|---|---|
| Epoch AI Benchmarking Hub | Historical SOTA scores for GPQA Diamond, MMLU, MedQA, HLE | epoch.ai/benchmarks |
| Justen (2025) biology-benchmarks | 27-model evaluation data across 8 bio benchmarks (Nov 2022 - Apr 2025) | github.com/lennijusten/biology-benchmarks |
| CASP Prediction Center | GDT-TS and lDDT scores across 16 rounds (1994-2024) | predictioncenter.org |
| MedHELM / HELM | Clinical benchmark scores across 121 tasks | crfm.stanford.edu/helm/medhelm |
| HuggingFace | WMDP, MMLU, and various benchmark datasets | huggingface.co |
| Therapeutics Data Commons | Drug discovery leaderboards across 29 tasks | tdcommons.ai |
| ProteinGym | Protein fitness prediction baselines, 217 DMS assays, 90+ models | proteingym.org |
| Papers With Code archive | Historical SOTA data across 9,327 benchmarks (frozen July 2025) | github.com/paperswithcode/paperswithcode-data |
| Vals.ai | MedQA saturation tracking (archived as saturated) | vals.ai |
| BioASQ | Annual biomedical semantic QA competition results since 2013 | bioasq.org |
| BLURB | Biomedical NLP leaderboard (13 datasets, 6 task types) | microsoft.github.io/BLURB |
| Nucleotide Transformer Benchmark | Genomic prediction leaderboard (18 tasks) | HuggingFace Space (InstaDeepAI) |
| Scale AI / HLE Leaderboard | Humanity's Last Exam scores | labs.scale.com/leaderboard |
| SecureBio / VCT | Virology Capabilities Test results | virologytest.ai |
| Open Problems in Single-Cell Biology | Single-cell benchmark results | openproblems.bio |
| Open Graph Benchmark | Protein and molecular graph benchmark scores | ogb.stanford.edu |