#multi-hop-reasoning News & Analysis

25 articles tagged with #multi-hop-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

Researchers introduce GTA, a scalable framework for automatically generating realistic web agent tasks paired with executable trajectories. The system addresses critical limitations in existing benchmarks by combining crawling, retrieval-based seeding, and automated quality control to create multi-hop, cross-page tasks across 50+ websites, revealing significant performance gaps between human and AI agents.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Plan Before Search: Search Agents Need Plan

Researchers demonstrate that large language models trained as retrieval-augmented agents benefit from explicit planning—decomposing questions into ordered sub-questions before searching—rather than reactive document-driven responses. They introduce a self-bootstrapping training paradigm that enables smaller seed models to generate filtered trajectories activating this planning behavior across different model sizes without requiring distillation from larger external models.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Reasoning Graphs: Self-Improving, Deterministic RAG through Evidence-Centric Feedback

Researchers introduce reasoning graphs, a persistent knowledge structure that improves language model reasoning accuracy by storing and reusing chains of thought tied to evidence items. The system achieves 47% error reduction on multi-hop questions and maintains deterministic outputs without model retraining, using only context engineering.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution

Researchers introduce APEX-Searcher, a new framework that enhances large language models' search capabilities through a two-stage approach combining reinforcement learning for strategic planning and supervised fine-tuning for execution. The system addresses limitations in multi-hop question answering by decoupling retrieval processes into planning and execution phases, showing significant improvements across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents

Researchers propose Budget-Aware Value Tree (BAVT), a training-free framework that improves LLM agent efficiency by intelligently managing computational resources during multi-hop reasoning tasks. The system outperforms traditional approaches while using one-quarter of the resources, demonstrating that smart budget management beats brute-force compute scaling.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning

Researchers developed a new AI training method that uses knowledge graphs as reward models to improve compositional reasoning in specialized domains. The approach enables smaller 14B-parameter models to outperform much larger frontier systems like GPT-5.2 and Gemini 3 Pro on complex multi-hop reasoning tasks in medicine.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 12 · 6/10

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

Researchers present a framework for evaluating procedural reasoning datasets in AI-supported learning systems by comparing three question-generation strategies based on Task-Method-Knowledge (TMK) models. The study demonstrates that strict TMK generation produces the most grounded and usable datasets (96.5% grounded), while transcript-based approaches sacrifice representational alignment for naturalness, highlighting the trade-off between learner-like phrasing and formal grounding in evaluation dataset construction.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Researchers introduced LakeQA, a new benchmark dataset for evaluating large language models on question-answering tasks over massive data lakes containing 9.5TB of heterogeneous data. The benchmark reveals significant challenges in current LLMs, with GPT-5.2 achieving only 18.37% accuracy, highlighting the gap between reading-comprehension performance and real-world search-and-reasoning requirements.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

Researchers propose SAFE, an LLM-as-verifier framework that improves multi-hop question answering by validating reasoning steps against evidence during generation rather than only checking final answers. The approach uses Knowledge Graph triples to decompose reasoning into verifiable units and achieves 8.8 percentage point accuracy improvements across three benchmarks.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Agent-Orchestrated Adaptive RAG: A Comparative Study on Structured and Multi-Hop Retrieval

Researchers present Agent-Orchestrated Adaptive RAG, a framework that enhances LLM retrieval through dynamic query decomposition and iterative refinement. Testing shows query decomposition benefits structured domains (+0.04 overall score on DevOps) but reduces accuracy on multi-hop reasoning tasks, suggesting adaptive application is more effective than uniform aggressive reasoning.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

A2RAG: Adaptive Agentic Graph Retrieval for Cost-Aware and Reliable Reasoning

Researchers introduce A2RAG, an adaptive framework that improves Graph-Retrieval-Augmented Generation (Graph-RAG) for multi-hop question answering by dynamically adjusting retrieval effort based on query difficulty. The system reduces token consumption and latency by ~50% while achieving significant accuracy gains, addressing practical deployment challenges in AI reasoning systems.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Researchers introduce Critic-R, a framework that improves agentic search systems by creating a feedback loop between reasoning agents and retrieval models. The approach uses a critic model to evaluate whether retrieved context supports reasoning steps and includes two mechanisms: Critic-R-Zero for query refinement at inference time, and Critic-Embed for training retrievers without manual annotations, demonstrating significant improvements on multi-hop question-answering benchmarks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context

Researchers introduce Soft-NBCE, an improved method for processing ultra-long text contexts in large language models by replacing discrete chunk selection with weighted chunk fusion. The approach demonstrates measurable improvements on multi-hop reasoning tasks while maintaining efficient memory usage, addressing a critical bottleneck in LLM inference.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Researchers introduced LocalSearchBench, a comprehensive benchmark for testing AI agents in local life services, revealing significant performance gaps even among state-of-the-art large reasoning models. The benchmark comprises 1.3M merchant entries and 900 multi-hop reasoning tasks, exposing critical weaknesses in completeness and faithfulness that underscore the need for domain-specific AI agent development.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG

Researchers introduce TGS-RAG, a framework that combines text and graph-based retrieval to improve how large language models answer complex questions. The system addresses limitations in existing approaches by enabling bidirectional communication between text and structured data, improving both accuracy and computational efficiency in multi-hop reasoning tasks.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

Researchers demonstrate that quantization—reducing AI model precision to improve efficiency—paradoxically increases energy consumption and degrades reasoning accuracy in multi-hop reasoning tasks, contradicting established neural scaling laws. The study identifies hardware dequantization overhead as a critical bottleneck and proposes a Critical Model Scale metric to predict when quantization becomes counterproductive across different model sizes and hardware configurations.

AI · Bullish · arXiv – CS AI · Apr 15 · 6/10

KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

Researchers introduce KG-Reasoner, an end-to-end framework that uses reinforcement learning to train large language models to perform multi-hop reasoning over knowledge graphs without decomposing tasks into isolated pipeline steps. The approach demonstrates competitive or superior performance across eight reasoning benchmarks by enabling LLMs to dynamically explore reasoning paths and backtrack when necessary.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Topology-Aware Reasoning over Incomplete Knowledge Graph with Graph-Based Soft Prompting

Researchers propose a graph-based soft prompting framework that enables LLMs to reason over incomplete knowledge graphs by processing subgraph structures rather than explicit node paths, achieving state-of-the-art results on multi-hop question-answering benchmarks while reducing computational costs through a two-stage inference approach.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Frugal Knowledge Graph Construction with Local LLMs: A Zero-Shot Pipeline, Self-Consistency and Wisdom of Artificial Crowds

Researchers demonstrate a zero-shot knowledge graph construction pipeline using local open-source LLMs on consumer hardware, achieving 0.70 F1 on document relations and 0.55 exact match on multi-hop reasoning through ensemble methods. The study reveals that strong model consensus often signals collective hallucination rather than accuracy, challenging traditional ensemble assumptions while maintaining low computational costs and carbon footprint.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

UniAI-GraphRAG: Synergizing Ontology-Guided Extraction, Multi-Dimensional Clustering, and Dual-Channel Fusion for Robust Multi-Hop Reasoning

Researchers have developed UniAI-GraphRAG, an enhanced framework that improves upon existing GraphRAG systems for complex reasoning and multi-hop queries. The framework introduces three key innovations (ontology-guided extraction, multi-dimensional clustering, and dual-channel fusion) and shows superior performance over mainstream solutions like LightRAG on benchmark tests.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

Researchers propose EvalAct, a new method that improves retrieval-augmented AI agents by converting retrieval quality assessment into explicit actions and using Process-Calibrated Advantage Rescaling (PCAR) for optimization. The approach shows superior performance on multi-step reasoning tasks across seven open-domain QA benchmarks by providing better process-level feedback signals.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation

Researchers propose TaSR-RAG, a new framework that improves Retrieval-Augmented Generation systems by using taxonomy-guided structured reasoning for better evidence selection. The system decomposes complex questions into triple sub-queries and performs step-wise evidence matching, achieving up to 14% performance improvements over existing RAG baselines on multi-hop question answering benchmarks.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

Researchers introduce ReMemR1, a new approach to improve large language models' ability to handle long-context question answering by integrating memory retrieval into the memory update process. The system enables non-linear reasoning through selective callback of historical memories and uses multi-level reward design to strengthen training.

AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 7

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Researchers introduce SPARTA, an automated framework for generating large-scale Table-Text question answering benchmarks that require complex multi-hop reasoning across structured and unstructured data. The benchmark exposes significant weaknesses in current AI models, with state-of-the-art systems dropping by more than 30 F1 points relative to their scores on existing, simpler datasets.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7

RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Researchers introduce RELOOP, a new retrieval-augmented generation framework that improves multi-step question answering across text, tables, and knowledge graphs. The system uses hierarchical sequences and structure-aware iteration to achieve better accuracy while reducing computational costs compared to existing RAG methods.