y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#benchmark News & Analysis

253 articles tagged with #benchmark. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

253 articles
AIBullisharXiv โ€“ CS AI ยท Mar 56/10
๐Ÿง 

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Researchers introduce ToolVQA, a large-scale multimodal dataset with 23K instances designed to improve AI models' ability to use external tools for visual question answering. The dataset features real-world contexts and multi-step reasoning tasks, with fine-tuned 7B models outperforming GPT-3.5-turbo on various benchmarks.

AIBullisharXiv โ€“ CS AI ยท Mar 57/10
๐Ÿง 

RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

Researchers have released RoboCasa365, a large-scale simulation benchmark featuring 365 household tasks across 2,500 kitchen environments with over 600 hours of human demonstration data. The platform is designed to train and evaluate generalist robots for everyday tasks, providing insights into factors affecting robot performance and generalization capabilities.

AINeutralarXiv โ€“ CS AI ยท Mar 57/10
๐Ÿง 

Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery

Researchers have developed DBench-Bio, a dynamic benchmark system that automatically evaluates AI's ability to discover new biological knowledge using a three-stage pipeline of data acquisition, question-answer extraction, and quality filtering. The benchmark addresses the critical problem of data contamination in static datasets and provides monthly updates across 12 biomedical domains, revealing current limitations in state-of-the-art AI models' knowledge discovery capabilities.

AINeutralarXiv โ€“ CS AI ยท Mar 56/10
๐Ÿง 

Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety

Researchers introduced WebRRSBench, a comprehensive benchmark evaluating multimodal large language models' reasoning, robustness, and safety capabilities for web understanding tasks. Testing 11 MLLMs on 3,799 QA pairs from 729 websites revealed significant gaps in compositional reasoning, UI robustness, and safety-critical action recognition.

AIBullisharXiv โ€“ CS AI ยท Mar 56/10
๐Ÿง 

Towards Self-Robust LLMs: Intrinsic Prompt Noise Resistance via CoIPO

Researchers propose CoIPO (Contrastive Learning-based Inverse Direct Preference Optimization), a new method to improve Large Language Model robustness against noisy or imperfect user prompts. The approach enhances LLMs' intrinsic ability to handle prompt variations without relying on external preprocessing tools, showing significant accuracy improvements on benchmark tests.

AIBearisharXiv โ€“ CS AI ยท Mar 56/10
๐Ÿง 

$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Researchers introduced ฯ„-Knowledge, a new benchmark for evaluating AI conversational agents in knowledge-intensive environments, specifically testing their ability to retrieve and apply unstructured domain knowledge. Even frontier AI models achieved only 25.5% success rates when navigating complex fintech customer support scenarios with 700 interconnected knowledge documents.

AIBullisharXiv โ€“ CS AI ยท Mar 56/10
๐Ÿง 

Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning

Researchers introduce MIKASA, a comprehensive benchmark suite designed to evaluate memory capabilities in reinforcement learning agents, particularly for robotic manipulation tasks. The framework includes MIKASA-Base for general memory RL evaluation and MIKASA-Robo with 32 specialized tasks for tabletop robotic manipulation scenarios.

AIBullisharXiv โ€“ CS AI ยท Mar 56/10
๐Ÿง 

From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

Researchers developed MA-RAG, a Multi-Round Agentic RAG framework that improves medical AI reasoning by iteratively refining responses through conflict detection and external evidence retrieval. The system achieved a substantial +6.8 point accuracy improvement over baseline models across 7 medical Q&A benchmarks by addressing hallucinations and outdated knowledge in healthcare AI applications.

AIBullisharXiv โ€“ CS AI ยท Mar 57/10
๐Ÿง 

HumanLM: Simulating Users with State Alignment Beats Response Imitation

Researchers introduce HumanLM, a novel AI training framework that creates user simulators by aligning psychological states rather than just imitating response patterns. The system achieved 16.3% improvement in alignment scores across six datasets with 26k users and 216k responses, demonstrating superior ability to simulate real human behavior.

AIBullisharXiv โ€“ CS AI ยท Mar 46/103
๐Ÿง 

MEBM-Speech: Multi-scale Enhanced BrainMagic for Robust MEG Speech Detection

Researchers propose MEBM-Speech, a neural decoder that detects speech activity from brain signals using magnetoencephalography (MEG). The system achieved 89.3% F1 score on benchmark tests and could advance brain-computer interfaces for cognitive neuroscience and clinical applications.

AIBearisharXiv โ€“ CS AI ยท Mar 47/104
๐Ÿง 

Quantifying Frontier LLM Capabilities for Container Sandbox Escape

Researchers introduced SANDBOXESCAPEBENCH, a new benchmark that measures large language models' ability to break out of Docker container sandboxes commonly used for AI safety. The study found that LLMs can successfully identify and exploit vulnerabilities in sandbox environments, highlighting significant security risks as AI agents become more autonomous.

AINeutralarXiv โ€“ CS AI ยท Mar 46/104
๐Ÿง 

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

Researchers introduce CUDABench, a comprehensive benchmark for evaluating Large Language Models' ability to generate CUDA code from text descriptions. The benchmark reveals significant challenges including high compilation success rates but low functional correctness, lack of domain-specific knowledge, and poor GPU hardware utilization.

AINeutralarXiv โ€“ CS AI ยท Mar 46/102
๐Ÿง 

How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Researchers introduce SteerEval, a new benchmark for evaluating how controllable Large Language Models are across language features, sentiment, and personality domains. The study reveals that current steering methods often fail at finer-grained control levels, highlighting significant risks when deploying LLMs in socially sensitive applications.

AINeutralarXiv โ€“ CS AI ยท Mar 47/104
๐Ÿง 

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

Researchers have introduced SorryDB, a dynamic benchmark for evaluating AI systems' ability to prove mathematical theorems using the Lean proof assistant. The benchmark draws from 78 real-world formalization projects and addresses limitations of static benchmarks by providing continuously updated tasks that better reflect community needs.

AIBullisharXiv โ€“ CS AI ยท Mar 46/104
๐Ÿง 

ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs

Researchers developed a new method to reduce content biases in large language models' reasoning tasks by transforming syllogisms into canonical logical representations with deterministic parsing. The approach achieved top-5 rankings on the multilingual SemEval-2026 Task 11 benchmark while offering a competitive alternative to complex fine-tuning methods.

AIBullisharXiv โ€“ CS AI ยท Mar 47/102
๐Ÿง 

Saarthi for AGI: Towards Domain-Specific General Intelligence for Formal Verification

Researchers have enhanced the Saarthi AI framework for formal verification, achieving 70% better accuracy in generating SystemVerilog assertions and 50% fewer iterations to reach coverage closure. The framework uses multi-agent collaboration and improved RAG techniques to move toward domain-specific AI intelligence for verification tasks.

AINeutralarXiv โ€“ CS AI ยท Mar 46/102
๐Ÿง 

LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges

Researchers have released LiveAgentBench, a comprehensive benchmark featuring 104 real-world scenarios to evaluate AI agent performance across practical applications. The benchmark uses a novel Social Perception-Driven Data Generation method to ensure tasks reflect actual user requirements and includes 374 total tasks for testing various AI models and frameworks.

AIBullisharXiv โ€“ CS AI ยท Mar 47/103
๐Ÿง 

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Researchers introduce OptMerge, a new benchmark and method for combining multiple expert Multimodal Large Language Models (MLLMs) into single, more capable models without requiring additional training data. The approach achieves 2.48% average performance gains while reducing storage and serving costs by merging models across different modalities like vision, audio, and video.

AINeutralarXiv โ€“ CS AI ยท Mar 46/102
๐Ÿง 

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Researchers introduce UniG2U-Bench, a comprehensive benchmark testing whether unified multimodal AI models that can generate content actually understand better than traditional vision-language models. The study of over 30 models reveals that unified models generally underperform their base counterparts, though they show improvements in spatial intelligence and visual reasoning tasks.

AIBearisharXiv โ€“ CS AI ยท Mar 47/102
๐Ÿง 

TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health

Researchers have developed TrustMH-Bench, a comprehensive framework to evaluate the trustworthiness of Large Language Models (LLMs) in mental health applications. Testing revealed that both general-purpose and specialized mental health LLMs, including advanced models like GPT-5.1, significantly underperform across critical trustworthiness dimensions in mental health scenarios.

AINeutralarXiv โ€“ CS AI ยท Mar 37/103
๐Ÿง 

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Researchers have introduced WorldSense, the first benchmark for evaluating multimodal AI systems that process visual, audio, and text inputs simultaneously. The benchmark contains 1,662 synchronized audio-visual videos across 67 subcategories and 3,172 QA pairs, revealing that current state-of-the-art models achieve only 65.1% accuracy on real-world understanding tasks.

AIBullisharXiv โ€“ CS AI ยท Mar 37/103
๐Ÿง 

MagicAgent: Towards Generalized Agent Planning

Researchers have developed MagicAgent, a series of foundation models designed for generalized AI agent planning that outperforms existing sub-100B models and even surpasses leading ultra-scale models like GPT-5.2. The models achieve superior performance through a novel synthetic data framework and two-stage training paradigm that addresses gradient interference in multi-task learning.

AIBullisharXiv โ€“ CS AI ยท Mar 37/104
๐Ÿง 

Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning

Researchers introduce Self-Harmony, a new test-time reinforcement learning framework that improves AI model accuracy by having models solve problems and rephrase questions simultaneously. The method uses harmonic mean aggregation instead of majority voting to select stable answers, achieving state-of-the-art results across 28 of 30 reasoning benchmarks without requiring human supervision.