#agent-reasoning News & Analysis

9 articles tagged with #agent-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

CRANE is a training-free parameter-editing method that merges paired Instruct and Thinking model checkpoints to build stronger code agents. By selectively combining the reasoning capabilities of Thinking models with the tool discipline of Instruct models, CRANE reaches a 66.2% pass rate on Roo-Eval (+19.5%) and resolves 14 additional instances on SWE-bench while maintaining computational efficiency.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

Researchers introduce ActiveMem, a distributed memory framework that decouples storage from reasoning in large language models, enabling agents to handle longer tasks without context overload. Inspired by the architecture of the human brain, the system separates executive planning from memory management and reports state-of-the-art performance on complex reasoning benchmarks while reducing computational overhead.

AI · Neutral · arXiv – CS AI · May 27 · 7/10

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Researchers introduce QUACK, an evaluation framework for auditing whether AI agents in social deduction games actually ground their language in perceived reality or hallucinate claims. Testing three frontier vision-language models reveals that even top performers hallucinate 15% of spatial claims and make accusations without evidence, exposing critical gaps in agent reasoning reliability.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Fact-Augmented Lookahead Planning for LLM Agents

Researchers introduce LWM-Planner, a fact-augmented lookahead planning framework that enhances LLM agent decision-making through in-context learning without parameter updates. The system extracts task-critical facts from agent trajectories, validates them through a predictive-consistency filter, and uses these facts to improve planning accuracy across interactive environments.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Agentic Transformers Provably Learn to Search via Reinforcement Learning

Researchers demonstrate that transformer-based AI agents can learn tree-search capabilities through reinforcement learning without explicit instruction, with attention heads specializing to track action history and detect failures. The findings reveal how agents develop depth-first search mechanisms during training and generalize to deeper problems than they trained on, advancing theoretical understanding of how language models acquire reasoning abilities.
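For readers unfamiliar with the search procedure the summary refers to, here is a minimal sketch of classical depth-first search with backtracking. It is not the paper's model or training setup; the function name `dfs_search`, the dictionary-based tree, and the node labels are hypothetical choices made for illustration. The `path` list plays the role of the action history, and the `else` branch corresponds to detecting a dead end and backing up.

```python
from typing import Dict, List, Optional


def dfs_search(tree: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Depth-first search over a tree, returning the path to `goal` or None.

    Assumes `tree` maps each node to its children and contains no cycles.
    """
    path = [start]         # action history: nodes on the current branch
    next_child = {start: 0}  # index of the next untried child per node

    while path:
        node = path[-1]
        if node == goal:
            return path
        children = tree.get(node, [])
        i = next_child[node]
        if i < len(children):
            # Descend into the next untried child.
            next_child[node] = i + 1
            child = children[i]
            next_child[child] = 0
            path.append(child)
        else:
            # Dead end: every child has been tried, so backtrack.
            path.pop()
    return None


# Hypothetical example tree: the search explores "a" first, fails, then tries "b".
example_tree = {"root": ["a", "b"], "a": ["c"], "b": ["goal"]}
print(dfs_search(example_tree, "root", "goal"))
```

The explicit stack makes the two ingredients named in the summary visible: the history of actions taken so far and the point at which a failed branch is abandoned.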

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

Researchers introduce RedundancyBench, a new benchmark for detecting redundant steps in LLM-based agent trajectories. Current methods struggle with the task: the best approach achieves only 24.88% accuracy. The work highlights a gap in agent evaluation: task success is commonly measured, but execution efficiency and resource use remain largely unmeasured, suggesting that AI agents need substantial improvements in reasoning efficiency.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Researchers introduce CausaLab, a benchmarking environment that tests whether LLM agents can both solve causal discovery problems and accurately recover the underlying causal mechanisms. Experiments reveal a significant gap between prediction accuracy (92%) and structural causal model recovery (0.471 F1 score), exposing limitations in current AI systems' ability to perform rigorous scientific reasoning.

🧠 GPT-5
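To make the CausaLab structural-recovery number concrete, the sketch below shows one common way an F1 score over causal graph edges can be computed. The summary does not say how the benchmark defines its F1, so treating the structural causal model as a set of directed edges is an assumption, and `edge_f1` and the variable names are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (cause, effect), direction matters


def edge_f1(true_edges: Set[Edge], pred_edges: Set[Edge]) -> float:
    """F1 score between the true and predicted directed edge sets."""
    true_positives = len(true_edges & pred_edges)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_edges)
    recall = true_positives / len(true_edges)
    return 2 * precision * recall / (precision + recall)


# Hypothetical graphs: one edge recovered, one reversed, one missed.
true_graph = {("A", "B"), ("B", "C"), ("A", "C")}
predicted_graph = {("A", "B"), ("C", "B")}
print(edge_f1(true_graph, predicted_graph))
```

Under this definition a reversed edge counts as both a false positive and a false negative, which is one reason an agent can predict outcomes well while still scoring low on structure recovery.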

AI · Neutral · arXiv – CS AI · May 28 · 6/10

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

Researchers introduce VibeSearchBench, a new benchmark that exposes significant gaps between LLM agent performance on existing search tasks and real-world user satisfaction. The benchmark uses multi-turn dialogue and schema-free evaluation across 200 bilingual tasks, revealing that even frontier models reach an F1 score of only 30.30%, which points to fundamental deficiencies in long-context reasoning and intent elicitation.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7

MetaMind: General and Cognitive World Models in Multi-Agent Systems by Meta-Theory of Mind

Meta researchers introduce MetaMind, a cognitive world model for multi-agent systems that enables agents to understand and predict other agents' behaviors without centralized supervision or communication. The system uses a meta-theory of mind framework that lets agents reason about the goals and beliefs of others through self-reflective learning and analogical reasoning.