y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#agent-reasoning News & Analysis

5 articles tagged with #agent-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

5 articles
AINeutralarXiv – CS AI · 5d ago7/10
🧠

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Researchers introduce QUACK, an evaluation framework for auditing whether AI agents in social deduction games actually ground their language in perceived reality or hallucinate claims. Testing three frontier vision-language models reveals that even top performers hallucinate 15% of spatial claims and make accusations without evidence, exposing critical gaps in agent reasoning reliability.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

Researchers introduce RedundancyBench, a new benchmark for detecting redundant steps in LLM-based agent trajectories, revealing that current methods struggle significantly with this task—the best approach achieves only 24.88% accuracy. This work highlights a critical gap in agent evaluation: while task success is commonly measured, execution efficiency and resource optimization remain largely unmeasured, suggesting AI agents require substantial improvements in reasoning efficiency.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Researchers introduce CausaLab, a benchmarking environment that tests whether LLM agents can both solve causal discovery problems and accurately recover the underlying causal mechanisms. Experiments reveal a significant gap between prediction accuracy (92%) and structural causal model recovery (0.471 F1 score), exposing limitations in current AI systems' ability to perform rigorous scientific reasoning.

🧠 GPT-5
AINeutralarXiv – CS AI · 4d ago6/10
🧠

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

Researchers introduce VibeSearchBench, a new benchmark that exposes significant gaps between LLM agent performance on existing search tasks and real-world user satisfaction. The benchmark uses multi-turn dialogue and schema-free evaluation across 200 bilingual tasks, revealing that even frontier models achieve only 30.30% F1 scores, indicating fundamental deficiencies in long-context reasoning and intent elicitation.

AIBullisharXiv – CS AI · Mar 37/107
🧠

MetaMind: General and Cognitive World Models in Multi-Agent Systems by Meta-Theory of Mind

Meta researchers introduced MetaMind, a cognitive world model for multi-agent systems that enables agents to understand and predict other agents' behaviors without centralized supervision or communication. The system uses a meta-theory of mind framework allowing agents to reason about goals and beliefs of others through self-reflective learning and analogical reasoning.