AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance
Researchers have catalogued 195 AI safety benchmarks released since 2018, revealing that rapid proliferation of evaluation tools has outpaced standardization efforts. The study identifies critical fragmentation: inconsistent metric definitions, limited language coverage, poor repository maintenance, and lack of shared measurement standards across the field.
The AI safety evaluation landscape has experienced explosive growth without corresponding institutional coordination. While 195 benchmarks represent substantial research effort, their fragmentation creates a paradoxical problem: abundance without coherence. Most benchmarks cluster at medium task complexity, yet only seven have achieved widespread adoption, indicating that quantity does not translate into utility or consensus around best practices.
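To make the fragmentation concrete, consider what a metric-aware catalogue entry would have to record just to compare two benchmarks. The sketch below is a minimal illustration, not the study's actual schema; all field names (BenchmarkEntry, metric_definitions, widely_adopted, and so on) are assumptions, with the reported counts noted in comments.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class Complexity(Enum):
    LOW = "low"
    MEDIUM = "medium"  # where most of the 195 catalogued benchmarks cluster
    HIGH = "high"


@dataclass
class BenchmarkEntry:
    """One hypothetical catalogue row; illustrative fields, not the study's schema."""
    name: str
    release_year: int  # the catalogue spans 2018 onward
    metric_names: list[str] = field(default_factory=list)  # e.g. "refusal rate"
    metric_definitions: dict[str, str] = field(default_factory=dict)  # prose definitions, rarely shared verbatim across benchmarks
    languages: list[str] = field(default_factory=lambda: ["en"])  # 165/195 are English-only
    complexity: Complexity = Complexity.MEDIUM
    repository_url: str | None = None
    last_commit: date | None = None  # the basis for any staleness judgment
    peer_reviewed: bool = False  # many exist only as arXiv preprints
    widely_adopted: bool = False  # true for only seven of 195 entries
```

Even this toy schema shows why comparison is hard: when `metric_definitions` holds free-text prose rather than a shared controlled vocabulary, two benchmarks reporting the "same" metric may be measuring different things.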
This fragmentation stems from the decentralized nature of AI safety research and the rapid acceleration of LLM development. As models evolved, researchers created benchmarks for emerging risks independently, without any mechanism for standardization. The concentration on English-only evaluation (165/195 benchmarks) and the reliance on arXiv preprints rather than peer-reviewed venues reflect the speed-over-rigor dynamic dominating the field. Repository decay, with 137 benchmarks hosted on stale GitHub repositories, demonstrates inadequate post-publication stewardship.
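"Stale" has to be operationalized somehow. The study's exact criterion is not given here, so the sketch below assumes a hypothetical threshold of no pushes within twelve months, checked against GitHub's public repository API (the `pushed_at` field is part of GitHub's REST API; the threshold and function name are assumptions).

```python
from datetime import datetime, timedelta, timezone

import requests  # third-party: pip install requests

# Assumed staleness threshold; the study's actual criterion may differ.
STALE_AFTER = timedelta(days=365)


def is_stale(owner: str, repo: str) -> bool:
    """Treat a repository as stale if nothing has been pushed within STALE_AFTER."""
    resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=10)
    resp.raise_for_status()
    # GitHub returns ISO 8601 timestamps like "2021-01-26T19:06:43Z".
    pushed_at = datetime.fromisoformat(resp.json()["pushed_at"].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - pushed_at > STALE_AFTER
```

Note that unauthenticated requests to the GitHub API are heavily rate-limited, so auditing 195 repositories in one pass would realistically require an access token.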
For the industry, this fragmentation creates real risks. Developers cannot easily compare safety claims across models when benchmark standards diverge significantly. Investors evaluating AI companies face inconsistent safety metrics and unclear benchmark reliability. The lack of durable infrastructure means yesterday's benchmarks become unmaintainable artifacts rather than living evaluation tools. Standardization failures could undermine trust in safety evaluations as regulators increasingly demand transparency around model capabilities and limitations.
The field must move toward governance structures like those of other technical standards bodies. This means establishing consensus around metric definitions, creating incentives for benchmark maintenance, and developing criteria for benchmark selection. Without institutional mechanisms supporting standardization, safety evaluation will remain fragmented despite continued proliferation.
- 195 AI safety benchmarks exist but lack standardized metrics and a shared measurement language, creating evaluation fragmentation rather than clarity.
- Most benchmarks (170/195) are evaluation-only resources, and 137 sit on stale GitHub repositories, indicating poor long-term maintenance and sustainability practices.
- English-language dominance (165/195) limits benchmark applicability for global AI deployment and multilingual model evaluation.
- Only 7 benchmarks achieve popular adoption, suggesting most new benchmarks fail to gain traction or to outlast initial publication interest.
- The field needs institutional governance structures and standardized metrics to move beyond benchmark proliferation toward a coherent measurement infrastructure.