9 articles tagged with #multi-step-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
Researchers introduce GTA-2, a hierarchical benchmark that evaluates AI agents on both atomic tool-use tasks and complex, open-ended workflows using real user queries and deployed tools. The study reveals a significant capability cliff where frontier AI models achieve below 50% success on atomic tasks and only 14.39% on realistic workflows, highlighting that execution framework design matters as much as underlying model capacity.
AI · Bullish · arXiv – CS AI · Apr 15 · 6/10
Researchers introduce HintMR, a hint-assisted reasoning framework that improves mathematical problem-solving in small language models by using a separate hint-generating model to provide contextual guidance through multi-step problems. This collaborative two-model system demonstrates significant accuracy improvements over standard prompting while maintaining computational efficiency.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
Researchers propose T-STAR, a novel reinforcement learning framework that structures multi-step agent trajectories as trees rather than independent chains, enabling better credit assignment for LLM agents. The method uses tree-based reward propagation and surgical policy optimization to improve reasoning performance across embodied, interactive, and planning tasks.
AI · Bearish · arXiv – CS AI · Mar 27 · 6/10
Researchers introduce MolQuest, a new benchmark for evaluating AI models' ability to perform complex chemical structure elucidation through multi-step reasoning. Even state-of-the-art AI models achieve only 50% accuracy on this real-world scientific task, revealing significant limitations in current AI systems' strategic reasoning capabilities.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
NormCode Canvas v1.1.3 introduces a case-based reasoning system for LLM agentic workflows using a semi-formal planning language called NormCode. The deployed system demonstrates multi-step AI task automation across presentation generation, code assistance, and plan compilation with self-sustaining capabilities.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
Researchers have developed ToolTree, a new Monte Carlo tree search-based planning system for LLM agents that improves tool selection and usage through dual-feedback evaluation and bidirectional pruning. The system achieves approximately 10% performance gains over existing methods while maintaining high efficiency across multiple benchmarks.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
Researchers developed Causal Concept Graphs (CCG), a new method for understanding how concepts interact during multi-step reasoning in language models by creating directed graphs of causal dependencies between interpretable features. In tests on GPT-2 Medium across reasoning tasks, CCG significantly outperformed existing methods with a Causal Fidelity Score of 5.654, indicating more effective intervention targeting than random approaches.
AI · Bullish · OpenAI News · Jan 8 · 6/10
Netomi demonstrates how to scale enterprise AI agents using GPT-4.1 and GPT-5.2 by implementing concurrency, governance frameworks, and multi-step reasoning capabilities. The approach focuses on creating reliable production workflows that can handle enterprise-scale AI agent deployments.
AI · Neutral · Hugging Face Blog · Feb 4 · 5/10
DABStep is a new benchmark for evaluating data agents' multi-step reasoning capabilities. It aims to assess how well AI agents can perform complex, sequential data analysis tasks.