AINeutralarXiv โ CS AI ยท 5h ago1
๐ง
Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
Researchers released the ERI benchmark, a comprehensive dataset spanning 9 engineering fields and 55 subdomains to evaluate large language models' engineering capabilities. The benchmark tested 7 LLMs across 57,750 records, revealing a clear three-tier performance structure with frontier models like GPT-5 and Claude Sonnet 4 significantly outperforming mid-tier and smaller models.