#qa-systems News & Analysis

8 articles tagged with #qa-systems. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 19 · 5/10

Optimal Scheduling in a Question-Answering Forum of Knowledge Workers

Researchers propose an optimal scheduling system for question-answering forums staffed by paid knowledge workers rather than volunteers. The study calculates system capacity, designs efficient schedulers, and explores how expert collaboration can improve request-handling throughput.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

Researchers propose SAFE, an LLM-as-verifier framework that improves multi-hop question answering by validating reasoning steps against evidence during generation rather than only checking final answers. The approach uses knowledge graph triples to decompose reasoning into verifiable units and improves accuracy by 8.8 percentage points across three benchmarks.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Answer Presence Drives RAG Rewriting Gains

A new research audit challenges the assumed benefits of LLM rewriters in retrieval-augmented QA systems, finding that performance gains stem primarily from the presence of gold answer strings in rewritten context rather than from genuine passage curation. The study introduces controlled intervention methods to test rewriter claims, revealing that conventional evaluation probes are sensitive to methodology choices and may report misleading results.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Researchers introduce Harness-1, a 20B parameter search agent that separates semantic decision-making from state management by externalizing working memory to a stateful harness environment. The system achieves 73% average curated recall across eight retrieval benchmarks, outperforming comparable open-source searchers by 11.4 points while generalizing well to held-out transfer tasks.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

Researchers introduce a benchmark for evaluating how AI systems handle conflicting information across multiple memory sources, addressing a critical gap in testing personal AI agents. The study compares various approaches including fusion methods and LLMs, revealing that trained fusion models outperform prompt-based LLMs by 10+ percentage points on accuracy, with selective abstention improving performance further.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

SearchSkill is a new framework that teaches language models to perform more effective web searches by explicitly planning queries through reusable skill cards rather than treating search as an undifferentiated action. The system maintains an evolving skill bank that improves from failure patterns, demonstrating better performance on knowledge-intensive QA tasks with fewer wasted queries and improved reasoning accuracy.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

Researchers introduce PiCA (Pivot-Based Credit Assignment), a novel reinforcement learning mechanism that improves how LLM-based search agents learn from long sequences of actions. By identifying key pivot steps and anchoring rewards to final task outcomes, PiCA addresses critical challenges in credit assignment, delivering 15.2% performance gains on knowledge-intensive QA tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications

A new study compares Retrieval-Augmented Generation (RAG) and fine-tuning approaches for adapting Large Language Models to enterprise question-answering tasks in the automotive industry. The research finds that RAG offers superior cost-efficiency while maintaining comparable answer quality, even enabling open-source models to match premium model performance.