y0news

#benchmark News & Analysis

253 articles tagged with #benchmark. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

Researchers developed PoliticsBench, a new framework for evaluating political bias in large language models through multi-turn roleplay scenarios. The study found that 7 of the 8 major LLM families tested (including Claude, DeepSeek, Gemini, GPT, Llama, and Qwen) showed a left-leaning political bias, while only Grok exhibited right-leaning tendencies.

🧠 Claude · 🧠 Gemini · 🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

Revealing Multi-View Hallucination in Large Vision-Language Models

Researchers identify 'multi-view hallucination' as a major problem in large vision-language models (LVLMs): these systems confuse visual information drawn from different viewpoints or instances. They created the MVH-Bench benchmark and developed Reference Shift Contrastive Decoding (RSCD), a technique that improved performance by up to 34.6 points without requiring model retraining.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

GeoSketch: A Neural-Symbolic Approach to Geometric Multimodal Reasoning with Auxiliary Line Construction and Affine Transformation

Researchers introduce GeoSketch, a neural-symbolic AI framework that solves geometric problems through dynamic visual manipulation, including drawing auxiliary lines and applying transformations. The system combines perception, symbolic reasoning, and interactive sketch actions, achieving superior performance on geometric problem-solving benchmarks compared to static image processing methods.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation

Researchers have developed EvolvR, a self-evolving framework that improves AI's ability to evaluate and generate stories through pairwise reasoning and multi-agent data filtering. The system achieves state-of-the-art performance on three evaluation benchmarks and significantly enhances story generation quality when used as a reward model.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Researchers introduce VTC-Bench, a comprehensive benchmark for evaluating multimodal AI models' ability to use visual tools for complex tasks. The benchmark reveals significant limitations in current models, with the leading model, Gemini-3.0-Pro, achieving only 51% accuracy on multi-tool visual reasoning tasks.

🧠 Gemini
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

Researchers introduced QuarkMedBench, a new benchmark that evaluates large language models on more than 20,000 real-world medical queries across clinical care scenarios. It addresses a limitation of current medical AI evaluations, which rely on multiple-choice questions, by using an automated scoring framework that achieves 91.8% concordance with clinical expert assessments.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Researchers introduce VLA-Thinker, a new AI framework that enhances Vision-Language-Action models by enabling dynamic visual reasoning during robotic tasks. The system achieved a 97.5% success rate on LIBERO benchmarks through a two-stage training pipeline combining supervised fine-tuning and reinforcement learning.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems

Researchers introduced InterveneBench, a new benchmark comprising 744 peer-reviewed studies to evaluate large language models' ability to reason about policy interventions and causal inference in social science contexts. Current state-of-the-art LLMs struggle with this type of reasoning, prompting the development of STRIDES, a multi-agent framework that significantly improves performance on these tasks.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

HEARTS: Benchmarking LLM Reasoning on Health Time Series

Researchers introduce HEARTS, a comprehensive benchmark for evaluating large language models' ability to reason over health time series data across 16 datasets and 12 health domains. The study reveals that current LLMs significantly underperform compared to specialized models and struggle with multi-step temporal reasoning in healthcare applications.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

AI Planning Framework for LLM-Based Web Agents

Researchers introduce a formal planning framework that maps LLM-based web agents to traditional search algorithms, enabling better diagnosis of failures in autonomous web tasks. The study compares different agent architectures using novel evaluation metrics and a dataset of 794 human-labeled trajectories from WebArena benchmark.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Feynman: Knowledge-Infused Diagramming Agent for Scalable Visual Designs

Researchers have developed Feynman, an AI agent that generates high-quality diagram-caption pairs at scale for training vision-language models. The system created a dataset of 100k+ well-aligned diagrams and introduced Diagramma, a benchmark for evaluating visual reasoning capabilities.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Visual-ERM: Reward Modeling for Visual Equivalence

Researchers introduce Visual-ERM, a multimodal reward model that improves vision-to-code tasks by evaluating visual equivalence in rendered outputs rather than relying on text-based rules. The system achieves significant gains on chart-to-code tasks (+8.4 points) and shows consistent improvements across table and SVG parsing applications.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

Researchers introduce SpreadsheetArena, a platform for evaluating large language models' ability to generate spreadsheet workbooks from natural language prompts. The study reveals that preferred spreadsheet features vary significantly across use cases, and even top-performing models struggle with domain-specific best practices in areas like finance.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

Social-R1: Towards Human-like Social Reasoning in LLMs

Researchers introduce Social-R1, a reinforcement learning framework that enhances social reasoning in large language models by training on adversarial examples. The approach enables a 4B parameter model to outperform larger models across eight benchmarks by supervising the entire reasoning process rather than just outcomes.

AI · Bearish · arXiv – CS AI · Mar 11 · 6/10

Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs

Researchers have identified a critical flaw in Large Language Models (LLMs) where they prioritize moral reasoning over commonsense understanding, struggling to detect logical contradictions within moral dilemmas. The study introduces the CoMoral benchmark and reveals a 'narrative focus bias' where LLMs better identify contradictions attributed to secondary characters rather than primary narrators.

AI · Neutral · arXiv – CS AI · Mar 11 · 6/10

SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

Researchers introduce SCENEBench, a new benchmark for evaluating Large Audio Language Models (LALMs) beyond speech recognition, focusing on real-world audio understanding including background sounds, noise localization, and vocal characteristics. Testing of five state-of-the-art models revealed significant performance gaps, with models scoring below random chance on some tasks while achieving high accuracy on others.

AI · Neutral · arXiv – CS AI · Mar 11 · 6/10

OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering

Researchers introduced OPENXRD, a comprehensive benchmarking framework for evaluating large language models and multimodal LLMs in crystallography question answering. The study tested 74 state-of-the-art models and found that mid-sized models (7B-70B parameters) benefit most from contextual materials, while very large models often show saturation or interference.

🧠 GPT-4 · 🧠 GPT-4.5 · 🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10

EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering

Researchers introduce EgoCross, a new benchmark to evaluate multimodal AI models on egocentric video understanding across diverse domains like surgery, extreme sports, and industrial settings. The study reveals that current AI models, including specialized egocentric models, struggle with cross-domain generalization beyond common daily activities.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

Towards Neural Graph Data Management

Researchers introduce NGDBench, a comprehensive benchmark for evaluating neural networks' ability to work with graph databases across five domains including finance and medicine. The benchmark supports full Cypher query language capabilities and reveals significant limitations in current AI models when handling structured graph data, noise, and complex analytical tasks.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent

Researchers introduce Tool-Genesis, a new benchmark for evaluating self-evolving AI agents' ability to create and use tools from abstract requirements. The study reveals that even advanced AI models struggle with creating precise tool interfaces and executable logic, with small initial errors causing significant downstream performance degradation.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

Researchers have developed ConStory-Bench, a new benchmark to evaluate consistency errors in long-form story generation by Large Language Models. The study reveals that LLMs frequently contradict their own established facts and character traits when generating lengthy narratives, with errors most commonly occurring in factual and temporal dimensions around the middle of stories.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Researchers have identified a critical failure mode in Vision-Language-Action (VLA) robotic models called 'linguistic blindness,' in which robots prioritize visual cues over language instructions when the two conflict. They developed the ICBench benchmark and proposed IGAR, a train-free solution that recalibrates attention to restore the influence of language instructions without model retraining.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

ContextBench: Modifying Contexts for Targeted Latent Activation

Researchers have developed ContextBench, a new benchmark for evaluating methods that generate targeted inputs to trigger specific behaviors in language models. The study introduces enhanced Evolutionary Prompt Optimization techniques that better balance effectiveness in activating AI model features while maintaining linguistic fluency.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

MERIT Feedback Elicits Better Bargaining in LLM Negotiators

Researchers introduce AgoraBench, a new framework for improving Large Language Models' bargaining and negotiation capabilities through utility-based feedback mechanisms. The study reveals that current LLMs struggle with strategic depth in negotiations and proposes human-aligned metrics and training methods to enhance their performance.

AI · Neutral · arXiv – CS AI · Mar 5 · 5/10

M-QUEST -- Meme Question-Understanding Evaluation on Semantics and Toxicity

Researchers developed M-QUEST, a new benchmark for evaluating AI models' ability to understand internet memes and detect toxicity in them. The framework identifies 10 key dimensions of meme interpretation and tests 8 open-source language models, finding that instruction-tuned models perform better but still struggle with pragmatic inference.