AINeutralarXiv โ CS AI ยท 14h ago6/10
๐ง
COMPOSITE-Stem
Researchers introduced COMPOSITE-STEM, a new benchmark containing 70 expert-written scientific tasks across physics, biology, chemistry, and mathematics to evaluate AI agents. The top-performing model achieved only 21% accuracy, indicating the benchmark effectively measures capabilities beyond current AI reach and addresses the saturation of existing evaluation frameworks.