y0news

#benchmark News & Analysis

253 articles tagged with #benchmark. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

Researchers introduce ARC-AGI-3, a new benchmark for testing agentic AI systems that focuses on fluid adaptive intelligence without relying on language or external knowledge. While humans can solve 100% of the benchmark's abstract reasoning tasks, current frontier AI systems score below 1% as of March 2026.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

Researchers introduced WebTestBench, a new benchmark for evaluating automated web testing using AI agents and large language models. The study reveals significant gaps between current AI capabilities and industrial deployment needs, with LLMs struggling with test completeness, defect detection, and long-term interaction reliability.

AI · Bearish · Decrypt · Mar 26 · 7/10

Is AGI Here? Not Even Close, New AI Benchmark Suggests

A new AI benchmark called ARC-AGI-3 was released the same week Jensen Huang claimed AGI had been achieved, and it exposed dramatically poor performance from leading AI models. While humans scored 100% on the benchmark, advanced models like Gemini and GPT scored less than 0.4%, suggesting artificial general intelligence remains far from reality.

Tags: GPT-5, Gemini
AI · Bearish · arXiv – CS AI · Mar 26 · 7/10

Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments

Researchers introduced EnterpriseArena, the first benchmark testing whether AI agents can function as CFOs by allocating resources in complex enterprise environments over 132 months. Testing on eleven advanced LLMs revealed poor performance, with only 16% of runs surviving the full simulation period, highlighting significant capability gaps in long-term resource allocation under uncertainty.
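
The survival criterion is easy to picture even without the paper's environment: the agent allocates a budget every simulated month, and a run fails the moment it overspends or goes insolvent. The sketch below is a toy version; the revenue dynamics, function names, and policy are all invented for illustration.

```python
import random

def run_cfo_simulation(allocate, months=132, cash=1_000_000.0, seed=0):
    """Toy long-horizon allocation loop: a run survives only if the
    agent neither overspends nor goes insolvent for the full horizon."""
    rng = random.Random(seed)
    for month in range(months):
        budget = allocate(cash, month)       # agent's decision this month
        spend = sum(budget.values())
        if spend > cash:                     # overspending ends the run
            return False, month
        revenue = budget.get("growth", 0.0) * rng.uniform(0.8, 1.3)
        cash = cash - spend + revenue        # invented revenue dynamics
        if cash <= 0.0:                      # insolvency ends the run
            return False, month
    return True, months

def fixed_policy(cash, month):
    """Naive fixed-ratio policy standing in for an LLM agent."""
    return {"growth": 0.5 * cash, "operations": 0.3 * cash}

print(run_cfo_simulation(fixed_policy))      # (survived?, months lasted)
```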

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

Researchers introduced WebCoderBench, the first comprehensive benchmark for evaluating web application generation by large language models, featuring 1,572 real-world user requirements and 24 evaluation metrics. The benchmark tests 12 representative LLMs and shows no single model dominates across all metrics, providing opportunities for targeted improvements.
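
The finding that no single model dominates is a Pareto statement: for every model there is some metric on which another model scores higher. A minimal dominance check over hypothetical scores (model names and numbers invented) makes the claim concrete.

```python
def dominates(a, b):
    """a dominates b: at least as good on every metric, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# Hypothetical scores on three metrics (say functionality, UI fidelity,
# code quality); model names and numbers are invented.
scores = {
    "model_a": (0.71, 0.60, 0.82),
    "model_b": (0.65, 0.74, 0.79),
    "model_c": (0.69, 0.68, 0.75),
}

winners = [
    name for name, s in scores.items()
    if all(dominates(s, t) for n, t in scores.items() if n != name)
]
print(winners or "no single model dominates")  # the paper's reported outcome
```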

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Researchers introduce AVA-Bench, a new benchmark that evaluates vision foundation models (VFMs) by testing 14 distinct atomic visual abilities like localization and depth estimation. This approach provides a more precise assessment than traditional VQA benchmarks and reveals that smaller 0.5B language models can evaluate VFMs as effectively as 7B models while using 8x fewer GPU resources.
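
The atomic-ability framing amounts to reporting one score per ability instead of a single aggregate VQA number. Below is a minimal scorer under that assumption; the two ability names echo the summary, the records are invented.

```python
from collections import defaultdict

def ability_profile(records):
    """records: (ability, is_correct) pairs; returns per-ability accuracy."""
    totals, hits = defaultdict(int), defaultdict(int)
    for ability, ok in records:
        totals[ability] += 1
        hits[ability] += ok
    return {a: hits[a] / totals[a] for a in totals}

records = [("localization", True), ("localization", False),
           ("depth estimation", True), ("depth estimation", True)]
print(ability_profile(records))  # {'localization': 0.5, 'depth estimation': 1.0}
```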

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

Researchers introduced PriCoder, a new approach that improves Large Language Models' ability to generate code using private library APIs by over 20%. The method uses automatically synthesized training data through graph-based operators to teach LLMs private library usage, addressing a key limitation in current AI coding capabilities.
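
The summary does not spell out PriCoder's graph operators, but the general recipe (walk a private library's API dependency graph to assemble a plausible call chain, then turn the chain into a prompt and reference solution) can be sketched as below. The api_graph contents and prompt template are invented.

```python
import random

# Edge api -> [deps]: each API depends on the ones listed (invented example).
api_graph = {
    "Client.connect": [],
    "Client.query": ["Client.connect"],
    "Result.to_frame": ["Client.query"],
}

def sample_call_chain(graph, leaf, rng):
    """Walk dependencies back from a leaf API to build a usage chain."""
    chain, node = [], leaf
    while node is not None:
        chain.append(node)
        deps = graph[node]
        node = rng.choice(deps) if deps else None
    return list(reversed(chain))  # dependency order: connect, query, to_frame

rng = random.Random(0)
chain = sample_call_chain(api_graph, "Result.to_frame", rng)
prompt = f"Write code that calls the private APIs {', '.join(chain)} in order."
print(prompt)  # paired with a reference solution, this is one training example
```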

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

What Matters for Scalable and Robust Learning in End-to-End Driving Planners?

Researchers introduce BevAD, a new lightweight end-to-end autonomous driving architecture that achieves 72.7% success rate on the Bench2Drive benchmark. The study systematically analyzes architectural patterns in closed-loop driving performance, revealing limitations of open-loop dataset approaches and demonstrating strong data-scaling behavior through pure imitation learning.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

CCTU: A Benchmark for Tool Use under Complex Constraints

Researchers introduce CCTU, a new benchmark for evaluating large language models' ability to use tools under complex constraints. The study reveals that even state-of-the-art LLMs achieve less than 20% task completion rates when strict constraint adherence is required, with models violating constraints in over 50% of cases.
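
The two headline numbers (under 20% task completion, violations in over 50% of cases) imply an evaluation that scores each run on both outcome and adherence. Here is a toy harness under that reading, with invented trace and constraint formats.

```python
def violated(trace, constraints):
    """True if any tool call in the trace breaks any declared constraint."""
    return any(not rule(call) for call in trace for rule in constraints)

constraints = [
    lambda call: call["tool"] != "delete_file",           # forbidden tool
    lambda call: call.get("args", {}).get("n", 0) <= 10,  # argument bound
]

runs = [
    {"trace": [{"tool": "search", "args": {"n": 3}}],  "completed": True},
    {"trace": [{"tool": "delete_file", "args": {}}],   "completed": True},
    {"trace": [{"tool": "search", "args": {"n": 50}}], "completed": False},
]

strict = sum(r["completed"] and not violated(r["trace"], constraints)
             for r in runs) / len(runs)
violation = sum(violated(r["trace"], constraints) for r in runs) / len(runs)
print(f"strict completion: {strict:.0%}, violation rate: {violation:.0%}")
```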

AI · Bearish · arXiv – CS AI · Mar 16 · 7/10

MalURLBench: A Benchmark Evaluating Agents' Vulnerabilities When Processing Web URLs

Researchers have released MalURLBench, the first benchmark to evaluate how LLM-based web agents handle malicious URLs, revealing significant vulnerabilities across 12 popular models. The study found that existing AI agents struggle to detect disguised malicious URLs and proposed URLGuard as a defensive solution.
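
URLGuard's internals are not described in the summary; defenses in this space typically begin with cheap lexical checks before any model sees the page. Below is a minimal pre-filter of that kind, with heuristics chosen for illustration rather than taken from the paper.

```python
from urllib.parse import urlparse
import ipaddress

def suspicious_url(url: str) -> list[str]:
    """Return reasons a URL looks disguised; an empty list means no flags."""
    flags = []
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host.startswith("xn--") or ".xn--" in host:
        flags.append("punycode host (possible homoglyph attack)")
    try:
        ipaddress.ip_address(host)
        flags.append("raw IP address instead of a domain")
    except ValueError:
        pass
    if "@" in parsed.netloc:
        flags.append("userinfo trick: text before @ is not the real host")
    if parsed.scheme != "https":
        flags.append(f"non-HTTPS scheme: {parsed.scheme}")
    return flags

print(suspicious_url("http://paypal.com@203.0.113.7/login"))
```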

AI · Bearish · arXiv – CS AI · Mar 16 · 7/10

OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

Researchers introduced OffTopicEval, a benchmark revealing that all major LLMs suffer from poor operational safety, with even top performers like Qwen-3 and Mistral achieving only 77-80% accuracy in staying on-topic for specific use cases. The study proposes prompt-based steering methods that can improve performance by up to 41%, highlighting critical safety gaps in current AI deployment.

Tags: Llama
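
The paper's steering prompts are not reproduced here, but prompt-based steering for operational safety generally means restating the deployment scope and an explicit refusal rule on every turn. A generic sketch, with the banking scope and refusal string invented:

```python
SCOPE = ("You are a banking assistant. Only answer questions about the "
         "user's accounts, cards, and payments.")

def steered_messages(user_msg: str) -> list[dict]:
    """Wrap a user turn with scope-reinforcing system instructions."""
    return [
        {"role": "system", "content": SCOPE},
        {"role": "system", "content": (
            "Before answering, decide whether the request is in scope. "
            "If it is out of scope, reply exactly: "
            "'I can only help with banking questions.'")},
        {"role": "user", "content": user_msg},
    ]

for m in steered_messages("Write me a poem about the sea."):
    print(m["role"], "->", m["content"][:60])
```
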
AI · Bearish · arXiv – CS AI · Mar 16 · 7/10

Large language models show fragile cognitive reasoning about human emotions

Researchers introduced CoRE, a benchmark testing whether large language models can reason about human emotions through cognitive dimensions rather than just labels. The study found that while LLMs capture systematic relations between cognitive appraisals and emotions, they show misalignment with human judgments and instability across different contexts.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Researchers developed a new reinforcement learning approach for training diffusion language models that uses entropy-guided step selection and stepwise advantages to overcome challenges with sequence-level likelihood calculations. The method achieves state-of-the-art results on coding and logical reasoning benchmarks while being more computationally efficient than existing approaches.
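
The core idea of entropy-guided step selection fits in a few lines: score each denoising step by the entropy of its predictive distributions, then concentrate stepwise advantages on the most uncertain steps. This is an illustrative reading of the summary, not the paper's implementation.

```python
import math

def entropy(probs):
    """Shannon entropy of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_steps(step_dists, k=2):
    """Rank denoising steps by mean predictive entropy and keep the top k,
    the uncertain steps where stepwise credit assignment matters most."""
    scores = [sum(entropy(d) for d in dists) / len(dists)
              for dists in step_dists]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Three denoising steps, each with per-position token distributions.
steps = [
    [[0.97, 0.02, 0.01]],  # near-deterministic step
    [[0.40, 0.35, 0.25]],  # high-uncertainty step
    [[0.60, 0.30, 0.10]],
]
print(select_steps(steps, k=2))  # [1, 2]: the two most uncertain steps
```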

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

SATURN: SAT-based Reinforcement Learning to Unleash LLMs Reasoning

Researchers introduce SATURN, a new reinforcement learning framework that uses Boolean Satisfiability (SAT) problems to improve large language models' reasoning capabilities. The framework addresses key limitations in existing RL approaches by enabling scalable task construction, automated verification, and precise difficulty control through curriculum learning.
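
SAT is a natural RL substrate because instances are cheap to generate, candidate solutions are trivial to verify, and difficulty is tunable through the clause-to-variable ratio (random 3-SAT is known to get hard near a ratio of about 4.26), which is what makes a curriculum possible. A minimal generator and verifier under those assumptions, with invented function names:

```python
import random

def random_ksat(num_vars, ratio, k=3, seed=0):
    """Random k-SAT instance; the clause/variable ratio sets difficulty."""
    rng = random.Random(seed)
    num_clauses = int(ratio * num_vars)
    return [
        [rng.choice([-1, 1]) * v
         for v in rng.sample(range(1, num_vars + 1), k)]
        for _ in range(num_clauses)
    ]

def satisfies(clauses, assignment):
    """Automated verification: every clause needs one true literal."""
    return all(
        any((lit > 0) == assignment[abs(lit)] for lit in clause)
        for clause in clauses
    )

clauses = random_ksat(num_vars=10, ratio=3.0)   # easy-regime instance
assignment = {v: True for v in range(1, 11)}    # an LLM's proposed model
print(satisfies(clauses, assignment))           # reward signal for RL
```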

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

Researchers introduce OOD-MMSafe, a new benchmark revealing that current Multimodal Large Language Models fail to identify hidden safety risks up to 67.5% of the time. They also developed the CASPO framework, which reduces risk-identification failure rates to under 8% in consequence-driven safety scenarios.

AI · Neutral · arXiv – CS AI · Mar 9 · 7/10

LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs

Researchers introduced LLMTM, a comprehensive benchmark to evaluate Large Language Models' performance on temporal motif analysis in dynamic graphs. The study tested nine different LLMs and developed a structure-aware dispatcher that balances accuracy with cost-effectiveness for graph analysis tasks.

Tags: GPT-4
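
LLMTM's exact task set is not listed in the summary; a representative temporal-motif query is counting ordered two-edge paths whose timestamps increase within a window. Below is a brute-force reference counter (edge format invented) of the kind a benchmark could check LLM answers against.

```python
def temporal_two_paths(edges, delta):
    """Count paths u->v->w over (u, v, t) edges with t1 < t2 <= t1 + delta."""
    count = 0
    for u, v, t1 in edges:
        for v2, w, t2 in edges:
            if v2 == v and w != u and t1 < t2 <= t1 + delta:
                count += 1
    return count

edges = [("a", "b", 1), ("b", "c", 2), ("b", "d", 9), ("c", "a", 3)]
print(temporal_two_paths(edges, delta=3))  # a->b->c and b->c->a qualify: 2
```
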
AI · Bullish · arXiv – CS AI · Mar 6 · 7/10

Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models

Researchers introduce Design Behaviour Codes (DBCs), a taxonomy-driven, layered governance benchmark for large language models that reduces AI risk exposure by 36.8% through structured behavioral controls applied at inference time. The system achieves high EU AI Act compliance scores and represents a model-agnostic approach to AI safety that can be audited and mapped to different jurisdictions.
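
At their simplest, inference-time behavioral controls of the kind described here are an ordered set of rule layers applied to every model output before release. A toy, model-agnostic sketch follows; the layer names and rules are invented and are not the paper's taxonomy.

```python
# Each layer pairs an audit label with a pass/fail rule (invented examples).
RULES = [
    ("jurisdiction", lambda text: "social security number" not in text.lower()),
    ("organisation", lambda text: len(text) <= 2000),
]

def governed_reply(generate, prompt):
    """Run the model, then pass its output through each governance layer;
    the first failing layer blocks the reply and is named for the audit log."""
    reply = generate(prompt)
    for layer, rule in RULES:
        if not rule(reply):
            return f"[blocked by {layer} layer]"
    return reply

print(governed_reply(lambda p: "Here is a short answer.", "example prompt"))
```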

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

Researchers have released RoboCasa365, a large-scale simulation benchmark featuring 365 household tasks across 2,500 kitchen environments with over 600 hours of human demonstration data. The platform is designed to train and evaluate generalist robots for everyday tasks, providing insights into factors affecting robot performance and generalization capabilities.

Page 2 of 11