y0news

#multi-step-reasoning News & Analysis

17 articles tagged with #multi-step-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 4d ago · 7/10

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Researchers introduce CHARM, a framework for detecting and mitigating cascading hallucinations in multi-step AI reasoning pipelines where errors compound across stages. The system achieves 89.4% detection accuracy with minimal false positives, addressing a critical vulnerability in agentic RAG systems that existing methods fail to catch.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Tools as Continuous Flow for Evolving Agentic Reasoning

Researchers propose FlowAgent, a novel approach that reconceptualizes how Large Language Models orchestrate tools by treating tool chaining as continuous trajectory generation rather than step-wise execution. The method uses conditional flow matching to provide global planning perspectives, demonstrating improved robustness and generalization to unseen tools across long-horizon reasoning tasks.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

Researchers propose Causal Minimal Tool Filtering (CMTF), a training-free method that improves LLM agent reliability by exposing only necessary tools at each step rather than entire tool menus. The approach reduces token usage by 90% and tool exposure from 100 to 1 per step while maintaining task success rates.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Researchers introduce State-Grounded Dynamic Retrieval (SGDR), a new method enabling language agents to dynamically reuse learned skills during web automation tasks. By matching skills to both task goals and current webpage states rather than fixed skill sets, SGDR achieves 10.6% relative performance gains over existing approaches on complex multi-step web tasks.

🧠 GPT-4
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

Researchers introduce COMPASS, a safety alignment framework for LLM-powered search agents that prevents harmful outcomes from seemingly innocent multi-step queries. The method combines cognitive tree exploration and step-wise alignment to achieve robust safety while maintaining utility, requiring less training data than existing approaches.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Graph-Enhanced Policy Optimization in LLM Agent Training

Researchers present Graph-Enhanced Policy Optimization (GEPO), a new training framework for multi-step LLM agents that improves credit assignment by analyzing state-transition graphs and task relevance. The method achieves 1.1-3.8% performance gains across multiple benchmarks by differentiating the importance of individual steps and trajectories based on their structural and semantic roles.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

HGMEM: Hypergraph-based Working Memory to Improve Multi-step RAG for Long-Context Complex Relational Modeling

Researchers introduce HGMem, a hypergraph-based working memory system that enhances multi-step retrieval-augmented generation (RAG) for large language models by modeling complex relational dependencies among facts. Unlike traditional RAG systems that treat memory as passive storage, HGMem dynamically structures information as interconnected high-order relationships, demonstrating improved performance on global sense-making benchmarks requiring complex reasoning across extended contexts.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Why Retrying Fails: Context Contamination in LLM Agent Pipelines

Researchers introduce the Context-Contaminated Restart Model (CCRM) to formally analyze why LLM agents fail at higher rates when retrying tasks after errors, showing that failed attempts pollute the context window and increase subsequent error rates 7.1x. The model provides closed-form formulas for success probability, optimal pipeline depth allocation, and quantifies the exact benefit of clearing context before retry attempts.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

Researchers introduce GTA-2, a hierarchical benchmark that evaluates AI agents on both atomic tool-use tasks and complex, open-ended workflows using real user queries and deployed tools. The study reveals a significant capability cliff where frontier AI models achieve below 50% success on atomic tasks and only 14.39% on realistic workflows, highlighting that execution framework design matters as much as underlying model capacity.

AI · Bullish · arXiv – CS AI · Apr 15 · 6/10

HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models

Researchers introduce HintMR, a hint-assisted reasoning framework that improves mathematical problem-solving in small language models by using a separate hint-generating model to provide contextual guidance through multi-step problems. This collaborative two-model system demonstrates significant accuracy improvements over standard prompting while maintaining computational efficiency.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Researchers propose T-STAR, a novel reinforcement learning framework that structures multi-step agent trajectories as trees rather than independent chains, enabling better credit assignment for LLM agents. The method uses tree-based reward propagation and surgical policy optimization to improve reasoning performance across embodied, interactive, and planning tasks.

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Researchers developed Causal Concept Graphs (CCG), a new method for understanding how concepts interact during multi-step reasoning in language models by creating directed graphs of causal dependencies between interpretable features. Testing on GPT-2 Medium across reasoning tasks showed CCG significantly outperformed existing methods with a Causal Fidelity Score of 5.654, demonstrating more effective intervention targeting than random approaches.

AI · Bullish · OpenAI News · Jan 8 · 6/10 · 2

Netomi’s lessons for scaling agentic systems into the enterprise

Netomi demonstrates how to scale enterprise AI agents using GPT-4.1 and GPT-5.2 by implementing concurrency, governance frameworks, and multi-step reasoning capabilities. The approach focuses on creating reliable production workflows that can handle enterprise-scale AI agent deployments.

AI · Neutral · Hugging Face Blog · Feb 4 · 5/10 · 6

DABStep: Data Agent Benchmark for Multi-step Reasoning

DABStep introduces a new benchmark for evaluating data agents' multi-step reasoning capabilities. The benchmark aims to assess how well AI agents can perform complex, sequential data analysis tasks that require multiple reasoning steps.