#multi-hop-qa News & Analysis

6 articles tagged with #multi-hop-qa. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG

Researchers introduce GDP-RAG, a retrieval-augmented generation framework that improves multi-hop question answering by spending computation only on information gaps instead of over-generating reasoning steps. The system reports 60.63% accuracy on benchmark datasets while cutting computational costs by 22-68% relative to existing approaches.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA

Researchers evaluated Google's Gemini Flash models on the MedHopQA biomedical reasoning challenge, demonstrating that advanced prompt engineering significantly improves LLM performance in complex multi-hop question answering. A sophisticated prompt combining role-playing and chain-of-thought examples achieved a 0.720 score versus 0.565 baseline, with Gemini 2.0 Flash matching newer 2.5 Flash performance.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

Researchers introduce RASER, a cost-efficient routing system for multi-hop question answering that reduces token consumption by 51-59% compared to always-escalating methods while maintaining competitive accuracy. The system uses six features from one-shot retrieval to decide whether additional retrieval rounds are necessary, avoiding wasteful LLM calls.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

Researchers propose a standardized measurement protocol for evaluating retrieval-augmented generation (RAG) systems using LLM judges, addressing inconsistencies in how semantic search quality is assessed. The standard fixes key variables like evidence budget and prompt while requiring cluster-aware statistical testing, revealing that previous comparisons may have overstated progress and that traditional BM25 retrieval outperforms pure semantic methods under controlled conditions.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Inference-Time Budget Control for LLM Search Agents

Researchers propose a two-stage inference-time budget control system for LLM search agents that optimizes how language models allocate computational resources between tool calls and token generation during multi-hop question answering. The method uses Value-of-Information scoring to decide when to retrieve information, decompose questions, or commit to final answers, demonstrating consistent performance gains across multiple benchmarks and model sizes.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 12

Democratizing GraphRAG: Linear, CPU-Only Graph Retrieval for Multi-Hop QA

Researchers present SPRIG, a CPU-only GraphRAG system that eliminates expensive LLM-based graph construction and GPU requirements for multi-hop question answering. The system uses lightweight NER-driven co-occurrence graphs with Personalized PageRank, achieving comparable performance while reducing computational costs by 28%.
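
As a rough illustration of the general technique named in this summary (an entity co-occurrence graph ranked with Personalized PageRank), here is a minimal pure-Python sketch. It is not SPRIG's implementation. The entity lists stand in for the output of an NER step, and the function names, the restart probability `alpha` and the iteration count are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(passages):
    """Build a weighted undirected graph from per-passage entity lists.

    Two entities are linked each time they appear in the same passage.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for entities in passages:
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def personalized_pagerank(graph, seeds, alpha=0.15, iterations=50):
    """Power iteration with restarts to the seed entities (e.g. from the query)."""
    nodes = list(graph)
    seeds = [s for s in seeds if s in graph]
    if not seeds:
        return {}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            total = sum(graph[n].values())
            for m, w in graph[n].items():
                # Spread the non-restart mass along edges, by edge weight.
                nxt[m] += (1 - alpha) * scores[n] * w / total
        scores = nxt
    return scores


# Hypothetical entity lists, one per passage (stand-in for NER output).
passages = [
    ["Marie Curie", "Pierre Curie"],
    ["Pierre Curie", "Sorbonne"],
    ["Sorbonne", "Paris"],
]
graph = build_cooccurrence_graph(passages)
scores = personalized_pagerank(graph, seeds=["Marie Curie"])
for entity, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{entity}: {score:.3f}")
```

Nothing here needs a GPU or an LLM call: graph construction is a single pass over the passages, and each ranking iteration is linear in the number of edges. Entities that score highly but are not seeds point to passages a multi-hop retriever might fetch next.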