AI · Neutral · Hugging Face Blog · May 27 · 7/10
Artificial Analysis and IBM released ITBench-AA, the first comprehensive benchmark for evaluating frontier AI models on enterprise IT task automation. Leading models score below 50%, exposing significant shortfalls in agentic AI capabilities for real-world business operations and a gap between marketing claims and actual performance.
AI · Bearish · Decrypt – AI · May 27 · 7/10
Huawei has introduced Claw-Anything, a benchmark that tests AI agents' ability to handle complex digital tasks over extended simulated timeframes. GPT-5.5, currently the best-performing model, achieved only 34.5% on the benchmark, highlighting significant limitations in current AI agents' capacity to maintain performance during long-horizon tasks.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
Researchers introduce LMUnit, a new evaluation framework for language models that uses natural language unit tests to assess AI behavior more precisely than current methods. The system breaks down response quality into explicit, testable criteria and achieves state-of-the-art performance on evaluation benchmarks while improving inter-annotator agreement.
AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 2
Researchers have released LiveAgentBench, a comprehensive benchmark featuring 104 real-world scenarios to evaluate AI agent performance across practical applications. The benchmark uses a novel Social Perception-Driven Data Generation method to ensure tasks reflect actual user requirements and includes 374 total tasks for testing various AI models and frameworks.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3
Researchers introduce InnoGym, the first benchmark designed to evaluate AI agents' innovation potential rather than just correctness. The framework measures both performance gains and methodological novelty across 18 real-world engineering and scientific tasks, revealing that while AI agents can generate novel approaches, those approaches are not yet robust enough to deliver significant performance improvements.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7
Researchers have developed Exgentic, a new framework for evaluating general-purpose AI agents that can perform tasks across different environments without domain-specific tuning. The study benchmarked five prominent agent implementations, found that general agents can achieve performance comparable to specialized agents, and established the first Open General Agent Leaderboard.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
Researchers argue that current AI evaluation benchmarks fail to reflect real-world performance in low-resource environments, where factors like noisy inputs, poor connectivity, and low-end hardware significantly impact usability. The paper proposes a new evaluation framework that assesses deployed systems holistically rather than isolated models, with standardized reporting cards designed for policymakers and implementers.
AI · Neutral · arXiv – CS AI · Apr 6 · 6/10
Researchers introduce StructEval, a comprehensive benchmark for evaluating Large Language Models' ability to generate structured outputs across 18 formats, including JSON, HTML, and React. Even state-of-the-art models like o1-mini achieve an average score of only 75.58%, with open-source models scoring approximately 10 points lower.