COMPOSITE-STEM
Researchers introduced COMPOSITE-STEM, a new benchmark of 70 expert-written scientific tasks spanning physics, biology, chemistry, and mathematics, designed to evaluate AI agents. The top-performing model achieved only 21% accuracy, indicating that the benchmark probes capabilities beyond the reach of current systems and addresses the saturation of existing evaluation frameworks.
COMPOSITE-STEM addresses a critical gap in AI evaluation methodology by moving beyond saturated benchmarks that measure constrained outputs. Traditional expert-written benchmarks have proven valuable for assessing AI reasoning, but most have reached performance ceilings that no longer differentiate between advanced models. This new benchmark introduces a hybrid grading approach combining exact-match metrics with criterion-based rubrics and LLM-as-jury protocols, enabling more nuanced assessment of scientifically meaningful but non-deterministic outputs—a challenge that constrains real-world AI deployment in research contexts.
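The article does not include the benchmark's grading code, but as a rough illustration of how such a hybrid pipeline could be wired together, here is a minimal Python sketch. All names (`Task`, `exact_match`, `score_rubric`, `llm_jury_vote`) and the decision rule are hypothetical placeholders, not the benchmark's actual API:

```python
# Hypothetical sketch of a hybrid grading pipeline: exact match where a
# deterministic reference answer exists, otherwise rubric scoring with an
# LLM jury resolving each criterion. All names and rules are illustrative.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Task:
    prompt: str
    reference: Optional[str]   # exact-match key, if the task has one
    rubric: List[str]          # criterion descriptions for open-ended tasks


def exact_match(prediction: str, reference: str) -> float:
    """Deterministic check: normalize whitespace/case and compare."""
    return float(prediction.strip().lower() == reference.strip().lower())


def llm_jury_vote(criterion: str, prediction: str,
                  judges: List[Callable[[str], bool]]) -> float:
    """Fraction of jury models that judge the criterion satisfied."""
    votes = [judge(f"Criterion: {criterion}\nAnswer: {prediction}") for judge in judges]
    return sum(votes) / len(votes)


def score_rubric(task: Task, prediction: str,
                 judges: List[Callable[[str], bool]]) -> float:
    """Average per-criterion jury score across the rubric."""
    if not task.rubric:
        return 0.0
    return sum(llm_jury_vote(c, prediction, judges) for c in task.rubric) / len(task.rubric)


def grade(task: Task, prediction: str,
          judges: List[Callable[[str], bool]]) -> float:
    """Exact match when a reference answer exists; rubric plus jury otherwise."""
    if task.reference is not None:
        return exact_match(prediction, task.reference)
    return score_rubric(task, prediction, judges)
```

The point of the sketch is only that constrained tasks can keep a hard, reproducible metric while open-ended tasks fall back to rubric criteria adjudicated by multiple judge models.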
The benchmark's design reflects a broader trend toward frontier evaluation in AI research. As models improve, the evaluation infrastructure must evolve in parallel to remain meaningful. By having doctoral-level researchers write and curate the tasks, and by limiting the benchmark to 70 carefully selected problems rather than thousands of generic items, the creators emphasize quality over quantity. This approach mirrors recent shifts in the AI community toward benchmarks that genuinely stress-test reasoning rather than reward pattern matching.
For developers and research institutions, COMPOSITE-STEM provides a more realistic assessment framework for deploying AI agents in scientific discovery workflows. The 21% top score leaves substantial headroom for improvement, making the benchmark useful for longitudinal tracking of AI progress in high-stakes domains. Because all tasks are open-sourced, results can be reproduced and compared across organizations.
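As a sketch of how an organization might use the open-sourced tasks for longitudinal tracking, the snippet below aggregates per-domain and overall accuracy from a results file. The JSONL format and the field names (`domain`, `score`) are assumptions for illustration, not the benchmark's documented release schema:

```python
# Illustrative aggregation of per-domain and overall mean scores over the tasks.
# File name and field names are assumed; adapt them to the actual release format.
import json
from collections import defaultdict
from typing import Dict, List


def aggregate(results_path: str) -> Dict[str, float]:
    """Compute mean score per domain plus an overall mean from a JSONL results file."""
    totals: Dict[str, List[float]] = defaultdict(list)
    with open(results_path) as fh:
        for line in fh:
            record = json.loads(line)  # e.g. {"domain": "physics", "score": 1.0}
            totals[record["domain"]].append(record["score"])
            totals["overall"].append(record["score"])
    return {domain: sum(scores) / len(scores) for domain, scores in totals.items()}


if __name__ == "__main__":
    # Hypothetical usage: one JSONL line per task for a given model run.
    print(aggregate("composite_stem_results.jsonl"))
```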
The immediate impact centers on research methodology rather than commercial applications. As AI agents increasingly integrate into scientific workflows, having robust evaluation frameworks becomes essential for validation and trust-building with domain experts. Future iterations may expand scope or introduce domain-specific variants.
- COMPOSITE-STEM is a new frontier benchmark on which current AI agents struggle: the top model scores only 21% across 70 expert-designed scientific tasks.
- The benchmark uses hybrid grading that combines exact-match metrics, criterion-based rubrics, and LLM-jury protocols to assess scientifically meaningful outputs beyond constrained answers.
- Tasks span physics, biology, chemistry, and mathematics, domains critical to accelerating scientific discovery.
- Open-sourcing all tasks enables reproducible evaluation and lets the broader research community benchmark AI progress.
- The benchmark addresses saturation in existing evaluation frameworks, providing meaningful differentiation between frontier AI models.