AI · Neutral · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce CHARM, a framework for detecting and mitigating cascading hallucinations in multi-step AI reasoning pipelines where errors compound across stages. The system achieves 89.4% detection accuracy with minimal false positives, addressing a critical vulnerability in agentic RAG systems that existing methods fail to catch.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers propose FlowAgent, a novel approach that reconceptualizes how Large Language Models orchestrate tools by treating tool chaining as continuous trajectory generation rather than step-wise execution. The method uses conditional flow matching to provide global planning perspectives, demonstrating improved robustness and generalization to unseen tools across long-horizon reasoning tasks.
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠Researchers propose Causal Minimal Tool Filtering (CMTF), a training-free method that improves LLM agent reliability by exposing only necessary tools at each step rather than entire tool menus. The approach reduces token usage by 90% and tool exposure from 100 to 1 per step while maintaining task success rates.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce State-Grounded Dynamic Retrieval (SGDR), a new method enabling language agents to dynamically reuse learned skills during web automation tasks. By matching skills to both task goals and the current webpage state rather than drawing on a fixed skill set, SGDR achieves a 10.6% relative performance gain over existing approaches on complex multi-step web tasks.
🧠 GPT-4
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers introduce COMPASS, a safety alignment framework for LLM-powered search agents that prevents harmful outcomes from seemingly innocent multi-step queries. The method combines cognitive tree exploration and step-wise alignment to achieve robust safety while maintaining utility, requiring less training data than existing approaches.
AI · Bullish · arXiv – CS AI · May 29 · 6/10
🧠Researchers present Graph-Enhanced Policy Optimization (GEPO), a new training framework for multi-step LLM agents that improves credit assignment by analyzing state-transition graphs and task relevance. The method achieves 1.1-3.8% performance gains across multiple benchmarks by differentiating the importance of individual steps and trajectories based on their structural and semantic roles.
AI · Bullish · arXiv – CS AI · May 28 · 6/10
🧠Researchers introduce HGMem, a hypergraph-based working memory system that enhances multi-step retrieval-augmented generation (RAG) for large language models by modeling complex relational dependencies among facts. Unlike traditional RAG systems that treat memory as passive storage, HGMem dynamically structures information as interconnected high-order relationships, demonstrating improved performance on global sense-making benchmarks requiring complex reasoning across extended contexts.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce the Context-Contaminated Restart Model (CCRM) to formally analyze why LLM agents fail at higher rates when retrying tasks after errors, showing that failed attempts pollute the context window and increase subsequent error rates by a factor of 7.1. The model provides closed-form formulas for success probability and optimal pipeline depth allocation, and quantifies the benefit of clearing context before retry attempts.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce GTA-2, a hierarchical benchmark that evaluates AI agents on both atomic tool-use tasks and complex, open-ended workflows using real user queries and deployed tools. The study reveals a significant capability cliff where frontier AI models achieve below 50% success on atomic tasks and only 14.39% on realistic workflows, highlighting that execution framework design matters as much as underlying model capacity.
AI · Bullish · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce HintMR, a hint-assisted reasoning framework that improves mathematical problem-solving in small language models by using a separate hint-generating model to provide contextual guidance through multi-step problems. This collaborative two-model system demonstrates significant accuracy improvements over standard prompting while maintaining computational efficiency.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose T-STAR, a novel reinforcement learning framework that structures multi-step agent trajectories as trees rather than independent chains, enabling better credit assignment for LLM agents. The method uses tree-based reward propagation and surgical policy optimization to improve reasoning performance across embodied, interactive, and planning tasks.
AI · Bearish · arXiv – CS AI · Mar 27 · 6/10
🧠Researchers introduce MolQuest, a new benchmark for evaluating AI models' ability to perform complex chemical structure elucidation through multi-step reasoning. Even state-of-the-art AI models achieve only 50% accuracy on this real-world scientific task, revealing significant limitations in current AI systems' strategic reasoning capabilities.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠NormCode Canvas v1.1.3 introduces a case-based reasoning system for LLM agentic workflows using a semi-formal planning language called NormCode. The deployed system demonstrates multi-step AI task automation across presentation generation, code assistance, and plan compilation with self-sustaining capabilities.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers have developed ToolTree, a new Monte Carlo tree search-based planning system for LLM agents that improves tool selection and usage through dual-feedback evaluation and bidirectional pruning. The system achieves approximately 10% performance gains over existing methods while maintaining high efficiency across multiple benchmarks.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers developed Causal Concept Graphs (CCG), a new method for understanding how concepts interact during multi-step reasoning in language models by creating directed graphs of causal dependencies between interpretable features. Testing on GPT-2 Medium across reasoning tasks showed CCG significantly outperformed existing methods with a Causal Fidelity Score of 5.654, demonstrating more effective intervention targeting than random approaches.
AI · Bullish · OpenAI News · Jan 8 · 6/10 · 2
🧠Netomi demonstrates how to scale enterprise AI agents using GPT-4.1 and GPT-5.2 by implementing concurrency, governance frameworks, and multi-step reasoning capabilities. The approach focuses on creating reliable production workflows that can handle enterprise-scale AI agent deployments.
AI · Neutral · Hugging Face Blog · Feb 4 · 5/10 · 6
🧠DABStep is a new benchmark for evaluating data agents' multi-step reasoning. It assesses how well AI agents perform complex, sequential data analysis tasks.