#question-answering News & Analysis

48 articles tagged with #question-answering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

48 articles

AINeutralarXiv – CS AI · May 296/10

🧠

GrepSeek: Training Search Agents for Direct Corpus Interaction

Researchers introduce GrepSeek, an AI search agent that interacts directly with text corpora using shell commands rather than traditional retrieval indexes. The system combines supervised learning with reinforcement optimization to achieve state-of-the-art results on question-answering benchmarks while operating at scale through parallel execution techniques.

AIBullisharXiv – CS AI · May 296/10

🧠

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

Researchers introduce CRITIC-R1, a structured framework that uses reinforcement learning to improve retrieval-augmented generation (RAG) systems by diagnosing and correcting errors in AI-generated answers. The approach outperforms existing RAG methods by providing fine-grained, multi-dimensional feedback rather than coarse corrections, addressing persistent hallucination and reasoning problems in knowledge-intensive question answering.

AINeutralarXiv – CS AI · May 276/10

🧠

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

Researchers introduce MiRD, a two-stage framework that improves reliable prediction for open-ended question answering by separately addressing sampling failures and selection errors. The approach maintains calibration-set integrity while controlling hallucinations in AI models, outperforming existing conformal prediction methods across multiple datasets and models.

AINeutralarXiv – CS AI · May 126/10

🧠

PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering

PathISE is a novel framework that enables knowledge graph question-answering systems to learn effective supervision signals from answer-level labels alone, eliminating the need for expensive intermediate annotations. By using a transformer-based estimator to identify informative relation paths and distilling them into LLM path generators, the approach achieves competitive state-of-the-art performance while reducing resource requirements for training.

AINeutralarXiv – CS AI · May 126/10

🧠

A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

Researchers introduce Sem-ECE, a new framework for evaluating how well large language models calibrate their confidence in open-ended question answering tasks. The method samples multiple answers from LLMs, groups them semantically, and uses answer frequency distributions as confidence measures, outperforming existing evaluation approaches across major commercial models.

AINeutralarXiv – CS AI · May 126/10

🧠

HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities

Researchers introduce HOME-KGQA, a new benchmark dataset for evaluating knowledge graph question answering systems on household activities using multimodal data. The dataset reveals significant performance gaps in current LLM-based KGQA methods, highlighting critical challenges for real-world deployment of AI systems that combine language models with structured knowledge.

AINeutralarXiv – CS AI · May 116/10

🧠

From Time Series Analysis to Question Answering: A Survey in the LLM Era

A new survey examines how Large Language Models are transforming time series analysis by shifting from traditional task-specific forecasting toward a unified question-answering framework. The research proposes three alignment paradigms to bridge the gap between LLM capabilities and temporal data analysis, offering practical guidance for selecting appropriate methodologies across domains.

AINeutralarXiv – CS AI · May 16/10

🧠

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Researchers introduce TopBench, a benchmark dataset of 779 samples designed to evaluate how well Large Language Models handle implicit prediction tasks over tabular data—queries requiring inference from historical patterns rather than simple data retrieval. Testing reveals current LLMs struggle with intent recognition and default to lookup-based approaches, indicating that accurate intent disambiguation is critical before predictive reasoning can succeed.

AINeutralarXiv – CS AI · Apr 206/10

🧠

Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints

Researchers present Deliberative Searcher, a framework that enhances large language model reliability by combining certainty calibration with retrieval-based search for question answering. The system uses reinforcement learning with soft reliability constraints to improve alignment between model confidence and actual correctness, producing more trustworthy outputs.

AINeutralarXiv – CS AI · Apr 146/10

🧠

CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

Researchers introduce CFMS, a two-stage framework combining multimodal large language models with symbolic reasoning to improve tabular data comprehension for question answering and fact verification tasks. The approach achieves competitive results on WikiTQ and TabFact benchmarks while demonstrating particular robustness with large tables and smaller model architectures.

AIBullisharXiv – CS AI · Apr 76/10

🧠

GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering

Researchers introduced GroundedKG-RAG, a new retrieval-augmented generation system that creates knowledge graphs directly grounded in source documents to improve long-document question answering. The system reduces resource consumption and hallucinations while maintaining accuracy comparable to state-of-the-art models at lower cost.

AIBullisharXiv – CS AI · Mar 266/10

🧠

Mixture of Demonstrations for Textual Graph Understanding and Question Answering

Researchers propose MixDemo, a new GraphRAG framework that uses a Mixture-of-Experts mechanism to select high-quality demonstrations for improving large language model performance in domain-specific question answering. The framework includes a query-specific graph encoder to reduce noise in retrieved subgraphs and significantly outperforms existing methods across multiple textual graph benchmarks.

AIBullisharXiv – CS AI · Mar 176/10

🧠

GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

GlobalRAG is a new reinforcement learning framework that significantly improves multi-hop question answering by decomposing questions into subgoals and coordinating retrieval with reasoning. The system achieves 14.2% average improvements in performance metrics while using only 42% of the training data required by baseline models.

AIBullisharXiv – CS AI · Mar 116/10

🧠

Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

Researchers propose EvalAct, a new method that improves retrieval-augmented AI agents by converting retrieval quality assessment into explicit actions and using Process-Calibrated Advantage Rescaling (PCAR) for optimization. The approach shows superior performance on multi-step reasoning tasks across seven open-domain QA benchmarks by providing better process-level feedback signals.

AIBullisharXiv – CS AI · Mar 116/10

🧠

TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation

Researchers propose TaSR-RAG, a new framework that improves Retrieval-Augmented Generation systems by using taxonomy-guided structured reasoning for better evidence selection. The system decomposes complex questions into triple sub-queries and performs step-wise evidence matching, achieving up to 14% performance improvements over existing RAG baselines on multi-hop question answering benchmarks.

AIBullisharXiv – CS AI · Mar 96/10

🧠

RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering

Researchers introduced RAMoEA-QA, a new AI system that uses hierarchical specialization to answer questions about respiratory audio recordings from mobile devices. The system employs a two-stage routing approach with Audio Mixture-of-Experts and Language Mixture-of-Adapters to handle diverse recording conditions and query types, achieving 0.72 test accuracy compared to 0.61-0.67 for existing baselines.

AIBullisharXiv – CS AI · Mar 36/1010

🧠

DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

Researchers have released DeepResearch-9K, a large-scale dataset with 9,000 questions across three difficulty levels designed to train and benchmark AI research agents. The accompanying open-source framework DeepResearch-R1 supports multi-turn web interactions and reinforcement learning approaches for developing more sophisticated AI research capabilities.

AIBullisharXiv – CS AI · Mar 36/103

🧠

Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

Researchers introduce ReMemR1, a new approach to improve large language models' ability to handle long-context question answering by integrating memory retrieval into the memory update process. The system enables non-linear reasoning through selective callback of historical memories and uses multi-level reward design to strengthen training.

AIBullisharXiv – CS AI · Mar 36/103

🧠

HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering

Researchers propose HIMM, a new memory framework for AI embodied agents that separates episodic and semantic memory to improve long-term performance. The system achieves significant gains on benchmarks, with 7.3% improvement in LLM-Match and 11.4% in LLM MatchXSPL, addressing key challenges in deploying multimodal language models as embodied agent brains.

AINeutralarXiv – CS AI · Feb 276/107

🧠

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Researchers introduce SPARTA, an automated framework for generating large-scale Table-Text question answering benchmarks that require complex multi-hop reasoning across structured and unstructured data. The benchmark exposes significant weaknesses in current AI models, with state-of-the-art systems experiencing over 30 F1 point performance drops compared to existing simpler datasets.

AIBullisharXiv – CS AI · Feb 276/107

🧠

RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Researchers introduce RELOOP, a new retrieval-augmented generation framework that improves multi-step question answering across text, tables, and knowledge graphs. The system uses hierarchical sequences and structure-aware iteration to achieve better accuracy while reducing computational costs compared to existing RAG methods.

AINeutralarXiv – CS AI · Mar 115/10

🧠

MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

Researchers introduce MA-EgoQA, a benchmark for evaluating AI models' ability to understand multiple egocentric video streams from embodied agents simultaneously. The benchmark includes 1.7k questions across five categories and reveals current approaches struggle with multi-agent system-level understanding.

AINeutralarXiv – CS AI · Mar 44/104

🧠

ConEQsA: Concurrent and Asynchronous Embodied Questions Scheduling and Answering

Researchers introduce ConEQsA, an AI framework that enables embodied agents to handle multiple questions simultaneously in 3D environments with urgency-aware scheduling. The system uses shared memory to reduce redundant exploration and includes a new benchmark with 200 questions across 40 indoor scenes.

← PrevPage 2 of 2