#question-answering News & Analysis

48 articles tagged with #question-answering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG

Researchers introduce GDP-RAG, a novel retrieval-augmented generation framework that improves multi-hop question answering by focusing computation only on information gaps rather than over-generating reasoning steps. The system achieves 60.63% accuracy on benchmark datasets while reducing computational costs by 22-68% compared to existing approaches.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Synthetic Contrastive Reasoning for Multi-Table Q&A

Researchers have developed a synthetic dataset and training method that significantly improves multi-table question-answering systems. By generating contrastive reasoning traces and fine-tuning open-weight language models with Contrastive Preference Optimization, the approach achieves improvements of 9.7 to 21 percentage points over standard supervised fine-tuning.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 1 · 7/10

Understanding the Fundamental Design Decisions of Retrieval-Augmented Generation Systems

A comprehensive research study reveals that Retrieval-Augmented Generation (RAG) systems require context-aware deployment strategies rather than universal approaches. The analysis across multiple LLMs and datasets shows that RAG effectiveness depends heavily on task type, with optimal retrieval volumes and knowledge integration methods varying significantly between question answering and code generation applications.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

Researchers introduce Disco-RAG, a discourse-aware framework that enhances Retrieval-Augmented Generation (RAG) systems by explicitly modeling discourse structures and rhetorical relationships between retrieved passages. The method achieves state-of-the-art results on question answering and summarization tasks without fine-tuning, demonstrating that structural understanding of text significantly improves LLM performance on knowledge-intensive tasks.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

Researchers introduce GRIP, a unified framework that integrates retrieval decisions directly into language model generation through control tokens, eliminating the need for external retrieval controllers. The system enables models to autonomously decide when to retrieve information, reformulate queries, and terminate retrieval within a single autoregressive process, achieving performance competitive with GPT-4o while using substantially fewer parameters.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning

Researchers propose PassiveQA, a new AI framework that teaches language models to recognize when they don't have enough information to answer questions, choosing to ask for clarification or abstain rather than hallucinate responses. The three-action system (Answer, Ask, Abstain) uses supervised fine-tuning to align model behavior with information sufficiency, showing significant improvements in reducing hallucinations.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Agentic DAG-Orchestrated Planner Framework for Multi-Modal, Multi-Hop Question Answering in Hybrid Data Lakes

Researchers introduce A.DOT Planner, an AI framework that enables multi-hop question answering across hybrid data lakes containing both structured and unstructured data. The system uses directed acyclic graphs to orchestrate complex queries, achieving 14.8% better accuracy and 10.7% better completeness than existing solutions.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

Researchers demonstrate that coreference resolution significantly improves Retrieval-Augmented Generation (RAG) systems by reducing ambiguity in document retrieval and enhancing question-answering performance. The study finds that smaller language models benefit more from disambiguation, with mean-pooling strategies capturing context better after coreference resolution.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?

Researchers developed a new training method combining Chain-of-Thought supervision with reinforcement learning to teach large language models when to abstain from answering temporal questions they're uncertain about. Their approach enabled a smaller Qwen2.5-1.5B model to outperform GPT-4o on temporal question answering tasks while improving reliability by 20% on unanswerable questions.

🧠 GPT-4

AI · Bearish · arXiv – CS AI · Mar 5 · 6/10

ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

Researchers introduce ObfusQAte, a new framework to test Large Language Model robustness when faced with obfuscated or disguised factual questions. The study reveals that LLMs tend to fail or generate hallucinated responses when confronted with increasingly complex variations of questions across three dimensions of obfuscation.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework

Researchers introduce DeALOG, a decentralized multi-agent framework that uses specialized AI agents coordinating through a shared natural-language log to answer complex questions spanning text, tables, and images. The system demonstrates competitive performance on multiple benchmarks while improving robustness through collaborative verification without central control.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks

CalVerT is a new framework that enhances LLM agents by providing calibrated confidence scores and grounding verification, helping agents distinguish between reliable and uncertain knowledge during question-answering tasks. The approach reduces both inaccurate confident answers and wasteful over-retrieval, improving performance across multiple QA benchmarks without requiring additional training.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

ARCO introduces an adaptive rubric framework that enables large language model agents to receive step-level interpretable rewards during multi-step reasoning tasks. By jointly evolving the reward rubric and policy through co-training, the method achieves stronger performance on question-answering benchmarks while providing explainable feedback that clarifies why each step in a trajectory succeeds or fails.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

MSUE: Multi-Modal Soccer Understanding Expert

Researchers developed MSUE, a multi-expert question-answering system that achieved 0.95 accuracy in the 2026 SoccerNet VQA Challenge by combining vision-language models, large language models, and specialized experts. The solution uses an LLM router to dynamically dispatch questions to text, image, and video processing experts, demonstrating advances in multi-modal AI for domain-specific tasks.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Researchers introduced LakeQA, a benchmark for evaluating large language models on question answering over a massive data lake containing 9.5 TB of heterogeneous data. The benchmark exposes significant challenges for current LLMs, with GPT-5.2 achieving only 18.37% accuracy, highlighting the gap between reading-comprehension performance and real-world search-and-reasoning requirements.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

Researchers propose DAC (Divide and Cooperate), a multi-agent training framework that separates evidence retrieval and answer generation into two specialized agents with cross-agent learning signals. This approach addresses credit assignment problems in language models performing multi-step reasoning and achieves competitive performance using parameter-efficient LoRA modules, outperforming full fine-tuning baselines on QA benchmarks.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Retrieval Augmented Generation Framework for the Nepali Legal Domain Question Answering

Researchers have developed the first Retrieval Augmented Generation (RAG) system for legal question answering in Nepali, addressing a critical gap in AI applications for low-resource languages. The system achieved 91% precision using BM25 retrieval and demonstrated 84% human-evaluated truthfulness, establishing a viable foundation for AI-assisted legal services in non-English-speaking jurisdictions.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

Researchers introduce CondMedQA, a new benchmark for biomedical question answering that accounts for patient-specific conditions, and propose Condition-Gated Reasoning (CGR), a framework that builds condition-aware knowledge graphs to ensure medical reasoning adapts to individual patient contexts rather than assuming uniform knowledge application.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

ChemQuests: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv papers

ChemQuests is a new curated dataset containing 952 question-answer pairs extracted from chemistry research papers, designed to advance chemistry-focused natural language processing. The dataset bridges the gap between rapidly expanding chemistry literature and the need for domain-specific training data for AI models and retrieval systems.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 5 · 5/10

Improving Answer Extraction in Context-based Question Answering Systems Using LLMs

Researchers propose an improved question answering system built on large language models fine-tuned on the SQuAD dataset, achieving strong performance (ROUGE-L: 86.84%, BERTScore: 95.38%). The work addresses the limited ability of current LLM-based QA systems to extract accurate answers from given contexts, demonstrating that targeted fine-tuning substantially enhances reliability and precision.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Self-Augmenting Retrieval for Diffusion Language Models

Researchers introduce SARDI, a training-free retrieval-augmented generation framework for discrete diffusion language models that leverages low-confidence token predictions as lookahead signals to guide information retrieval during text generation. The approach achieves significant performance gains on multi-hop question-answering tasks while operating at substantially higher throughput than existing baselines.

AI · Neutral · arXiv – CS AI · Jun 3 · 6/10

Visual Graph Scaffolds for Structural Reasoning in Large Language Models

Researchers demonstrate that visual graph structures serve as more effective reasoning scaffolds for large language models than text-based representations, particularly when abstract guidance is provided without direct answer hints. The findings suggest graphs should be leveraged not merely as external knowledge sources but as internal organizational tools that meaningfully improve both reasoning efficiency and answer quality in multi-hop question-answering tasks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

Researchers present a method to improve error prediction in Large Language Models by distinguishing between genuine model uncertainty and input ambiguity. Using uncertainty quantification metrics on question-answering tasks, they demonstrate that ambiguity information significantly enhances error prediction accuracy, yielding improvements exceeding 10 percentage points across multiple datasets and model families.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

Researchers introduce RASER, a cost-efficient routing system for multi-hop question-answering that reduces token consumption by 51-59% compared to always-escalating methods while maintaining competitive accuracy. The system leverages six features from one-shot retrieval to intelligently decide whether additional retrieval rounds are necessary, eliminating wasteful LLM calls.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

Researchers propose a neuro-symbolic framework for constructing knowledge graphs that combines LLM-based extraction with post-hoc ontology constraint validation, reducing token costs while improving consistency for complex question-answering tasks. The method defers corrections to after extraction rather than during it, enabling SQL-like querying capabilities for multi-hop reasoning across documents.
