#rag-systems News & Analysis

96 articles tagged with #rag-systems. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 25 · 7/10

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

Researchers demonstrate that language models with corrupted memory systems produce confident false answers, while models without memory abstain appropriately. A source-first compression strategy that preserves reasoning steps over conclusions restores correctability and prevents error propagation through chained interactions.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Less is More: Lightweight Prompt Compression for Question Answering Applications on Edge Devices

Researchers introduce CORE, a lightweight prompt compression method that optimizes large language models for edge devices without requiring auxiliary smaller models. Compared with existing methods, the approach improves accuracy by 30%, reduces memory usage by 50%, and cuts energy consumption on smartphones by 95%.

🏢 Nvidia

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

When Confidence Takes the Wrong Path: Diagnosing Retrieval-State Lock-In in RAG

Researchers identify 'retrieval-state lock-in,' a failure mode in retrieval-augmented generation (RAG) systems where multiple sampled answers agree despite being wrong because they condition on the same defective retrieval state. The study proposes decomposing confidence scores into three components—answer surface, evidence, and retrieval state—achieving 91.9% precision by requiring all three to agree, though this certifies only 7.7% of answers as low-risk.
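A minimal Python sketch of the gating rule described above, assuming each answer already carries three confidence scores in [0, 1]. The class `ConfidenceDecomposition`, the function `certify_low_risk`, and the 0.8 threshold are hypothetical and not taken from the paper; the sketch only shows why requiring all three components to agree refuses a lock-in case where sampled answers agree but the shared retrieval state is weak.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ConfidenceDecomposition:
    """Hypothetical per-answer confidence scores, each in [0, 1]."""
    answer_surface: float   # agreement among sampled answers
    evidence: float         # how well retrieved evidence supports the answer
    retrieval_state: float  # how trustworthy the shared retrieval state is


def certify_low_risk(c: ConfidenceDecomposition, threshold: float = 0.8) -> bool:
    """Certify only if all three components clear the threshold (min acts as a veto)."""
    return min(c.answer_surface, c.evidence, c.retrieval_state) >= threshold


def certification_stats(
    items: Iterable[Tuple[ConfidenceDecomposition, bool]], threshold: float = 0.8
) -> Tuple[float, float]:
    """Return (coverage, precision) over (scores, answer_is_correct) pairs."""
    items = list(items)
    certified = [correct for scores, correct in items if certify_low_risk(scores, threshold)]
    coverage = len(certified) / len(items) if items else 0.0
    precision = sum(certified) / len(certified) if certified else 0.0
    return coverage, precision


# Lock-in case: sampled answers agree, but the retrieval state they share is weak.
locked_in = ConfidenceDecomposition(answer_surface=0.95, evidence=0.9, retrieval_state=0.3)
print(certify_low_risk(locked_in))  # False
```

Because certification takes the minimum of the three scores, any single weak component vetoes the answer. That veto is what trades coverage for precision, in the same direction as the 91.9% precision and 7.7% coverage reported in the summary.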

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

TTFT-Aware Graph Chain-of-Thought: Distance-Indexed Neural A* for Low-Hallucination Multi-Hop Medical Reasoning

Researchers present GraphRAG, a production-grade system for medical LLMs that reduces hallucinations by constraining answers to verifiable paths within a 700K-node medical knowledge graph. Using Pruned Landmark Labeling and AStarNet heuristics, the system improves clinical reasoning accuracy while reducing latency and hallucination rates in fertility assistant applications.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Ghost Vectors: Soft-Deleted Embeddings Remain Reconstructible in HNSW Vector Databases

Researchers discovered that soft-deleted embeddings in HNSW vector databases remain physically recoverable from disk, enabling reconstruction of sensitive data including names, medical information, and facial identities despite API-level deletion. The study demonstrates a critical compliance gap under GDPR and HIPAA, recovering up to 99% of certain personal identifiers, and proposes Epoch Key Rotation as a cryptographic solution that eliminates recovery risk while maintaining audit trails.

AI · Neutral · arXiv – CS AI · Jun 11 · 7/10

When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

Researchers present the Minimum Viable Evaluation Suite (MVES), a framework for systematically testing LLM applications, revealing that generic prompt improvements often fail to deliver consistent gains and can cause significant performance regressions. Testing on local models showed that adding generic rules to prompts degraded RAG citation compliance by up to 70%, underscoring the need for rigorous, task-specific evaluation before deployment.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

NightFeats, a multi-agent retrieval-augmented generation system, won Best Dynamic Evaluation at NeurIPS 2025's MMU-RAGent competition by prioritizing architectural transparency and evidence grounding over benchmark optimization. The system outperformed proprietary models like Claude-SonnetV2 and Nova-Pro through a three-phase pipeline combining retrieval, curation, and composition with explicit intermediate representations.

🧠 Claude

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

Researchers propose a conflict-aware paradigm for large language models that dynamically balances external context against parametric knowledge, addressing failures in existing contrastive decoding methods. The work introduces Adaptive Regime Routing (ARR) to resolve fundamental asymmetries in how models handle contradictory information, improving resistance to erroneous context by 3-5x while maintaining performance on correct context.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Researchers introduce Latent Memory, a novel memory paradigm that compresses multimodal evidence (text and images) into single high-dimensional tokens for retrieval-augmented generation systems. The approach achieves competitive QA performance while reducing token consumption by 3-10x, addressing critical efficiency constraints in resource-limited deployments.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation

ConflictRAG introduces a novel framework for detecting and resolving contradictory information in Retrieval-Augmented Generation systems, achieving 88.7% conflict-detection accuracy while reducing API costs by 62%. The system combines cost-efficient embedding-based detection with selective LLM refinement and demonstrates 5.3-6.1% improvements in answer correctness across multiple benchmarks.
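One way to read "embedding-based detection with selective LLM refinement" is as a two-stage cascade, sketched below in Python. Everything here is an assumption for illustration, not ConflictRAG's published design: cosine similarity between passage embeddings acts as a cheap filter for same-topic pairs, and only those pairs reach `llm_judge`, a hypothetical callable standing in for an LLM contradiction check. The `topic_threshold` value and the toy data are likewise hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_conflicts(
    passages: List[str],
    embeddings: List[Sequence[float]],
    llm_judge: Callable[[str, str], bool],  # hypothetical: True if the two passages contradict
    topic_threshold: float = 0.8,           # hypothetical cut-off for "same topic"
) -> Tuple[List[Tuple[int, int]], int]:
    """Stage 1 (cheap): skip pairs whose embeddings are dissimilar.
    Stage 2 (expensive): ask the judge only about same-topic pairs.
    Returns the conflicting index pairs and the number of judge calls made."""
    conflicts: List[Tuple[int, int]] = []
    judge_calls = 0
    for i, j in combinations(range(len(passages)), 2):
        if cosine(embeddings[i], embeddings[j]) < topic_threshold:
            continue  # treated as unrelated: no judge call
        judge_calls += 1
        if llm_judge(passages[i], passages[j]):
            conflicts.append((i, j))
    return conflicts, judge_calls


# Toy run with hand-made 2-D "embeddings" and a stand-in judge (no real LLM).
passages = ["The bridge opened in 1932.", "The bridge opened in 1937.", "Bananas are yellow."]
embeddings = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]


def toy_judge(a: str, b: str) -> bool:
    return a != b  # stand-in only: treats any two different same-topic passages as conflicting


print(detect_conflicts(passages, embeddings, toy_judge))  # ([(0, 1)], 1)
```

In the toy run, only one of the three passage pairs is similar enough to escalate, so the judge is called once instead of three times; the cost saving in such a cascade comes from how many pairs the cheap filter can rule out.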

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain

A research paper identifies fundamental architectural flaws in Retrieval-Augmented Generation (RAG) systems for legal AI, showing that probabilistic similarity-based retrieval cannot adequately capture the hierarchical, temporal, and causal structure inherent in legal knowledge. The authors propose a deterministic-by-design framework addressing mereological blindness, diachronic blindness, and causal opacity to prevent persistent failures like fabricated citations and anachronistic legal content.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation

Researchers introduce HypRAG, a novel dense retrieval system for retrieval-augmented generation that operates in hyperbolic space rather than traditional Euclidean space. The approach achieves up to 29% performance gains over Euclidean baselines by better preserving the hierarchical structure of natural language, reducing hallucination risks in AI systems.
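For readers unfamiliar with hyperbolic retrieval, the Python sketch below compares Euclidean distance with geodesic distance in the Poincaré ball, one common model of hyperbolic space. The summary does not say which hyperbolic model or scoring function HypRAG uses, so treat this as background rather than the paper's method; the two-dimensional points are a hypothetical toy hierarchy with a general "root" near the origin and two specific leaves near the boundary.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance in the Poincare ball (curvature -1); points must satisfy ||x|| < 1."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return math.dist(u, v)


# Hypothetical toy hierarchy: a general "root" at the origin, two specific leaves near the boundary.
root, leaf_a, leaf_b = (0.0, 0.0), (0.9, 0.0), (0.0, 0.9)

euclid_ratio = euclidean_distance(leaf_a, leaf_b) / euclidean_distance(root, leaf_a)
hyper_ratio = poincare_distance(leaf_a, leaf_b) / poincare_distance(root, leaf_a)
print(round(euclid_ratio, 2), round(hyper_ratio, 2))  # 1.41 1.77
```

With these points, the Euclidean leaf-to-leaf distance is about 1.41 times the leaf-to-root distance, while the Poincaré ratio is about 1.77 and keeps growing toward 2 as the leaves move closer to the boundary. A ratio of 2 is what a tree gives when the path between two leaves runs through their common root, which is the usual intuition for why hyperbolic embeddings preserve hierarchy.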

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

SHERLOCK: Towards Dynamic Knowledge Adaptation in LLM-enhanced E-commerce Risk Management

Sherlock is an AI framework that combines Large Language Models with structured domain knowledge to automate e-commerce fraud investigation and risk management. Deployed at JD.com, it achieved an 82% expert acceptance rate and 386.7% throughput increase while continuously adapting to evolving fraud tactics through a self-improving data flywheel.

AI · Neutral · arXiv – CS AI · Jun 2 · 7/10

Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking

Researchers introduce PAVE, a diagnostic framework for evaluating how large language models arbitrate between their parametric knowledge and retrieved evidence in RAG-based fact-checking systems. Testing across seven LLMs reveals inconsistent and model-dependent behavior when prior knowledge conflicts with retrieved context, prompting the development of a lightweight test-time correction method to improve factual reliability.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval

DynaTree is a two-stage framework for efficient news retrieval that combines offline agentic reasoning with lightweight online subtree selection, achieving significant improvements in real-world deployment. The system demonstrated a 59-73% survival rate versus 32-53% for fixed approaches in production A/B testing, highlighting the practical value of persistent semantic expansion for time-sensitive information retrieval.

AI · Neutral · arXiv – CS AI · Jun 1 · 7/10

Understanding the Fundamental Design Decisions of Retrieval-Augmented Generation Systems

A comprehensive research study reveals that Retrieval-Augmented Generation (RAG) systems require context-aware deployment strategies rather than universal approaches. The analysis across multiple LLMs and datasets shows that RAG effectiveness depends heavily on task type, with optimal retrieval volumes and knowledge integration methods varying significantly between question answering and code generation applications.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

Fighting Numerical Hallucinations via Data-centric Compilation for Online Financial QA

Researchers propose DCRC, a data-centric framework addressing numerical hallucinations in LLM-based financial question-answering systems. The approach combines adversarial data construction, multi-stage training, and executable reasoning programs to improve reliability in high-stakes financial applications where accuracy is critical.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Less Is More: Elevating RAG via Performance-Driven Context Compression

Researchers introduce CORE-RAG, a novel framework that compresses context in Retrieval-Augmented Generation systems using performance-driven learning rather than predefined heuristics. The approach achieves a 97% compression ratio while improving exact-match (EM) accuracy by 3.3 points, addressing a critical bottleneck in LLM efficiency.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Researchers identify source-dependence as a critical failure mode in retrieval-augmented generation (RAG) systems, where multi-source medical AI systems provide different answers to identical questions based on which institutional source is retrieved. The study introduces TransplantQA, HERO-QA, and evaluation frameworks to audit this phenomenon, revealing that source disagreement is far more prevalent than previously measured.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies

Researchers have developed a method to improve how large language models verify factual claims by framing fact-checking as a true/false reading comprehension task with explicit test-taking strategies. The approach reduces token usage by over 80% while maintaining competitive performance, and enables smaller language models to perform similarly to larger ones through fine-tuning and self-revision mechanisms.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

Researchers introduce DistractionIF, a benchmark revealing that larger language models are paradoxically less robust to instruction-like noise in reference text, with performance degrading up to 30 points as scale increases. The study demonstrates that reinforcement learning via Group Relative Policy Optimization can restore robustness by 15.5% while maintaining instruction-following capability.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

GroundedCache proposes a safety-first framework for reusing cached answers in retrieval-augmented generation systems by validating four conditions before serving cached responses. The system achieves near-zero unsafe-served rates (0-1.5%) across benchmarks while maintaining minimal latency overhead, addressing critical vulnerabilities in current caching approaches that can serve incorrect answers.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

Researchers discovered that retrieval-augmented language models exhibit a critical safety gap: they can detect contradictory information in accumulated evidence but fail to incorporate this awareness into their final recommendations. Testing across model families showed that single-turn safety evaluations significantly overestimate real-world robustness in multi-turn scenarios where evidence accumulates.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

Researchers identify a critical vulnerability in retrieval-augmented generation systems where language models produce faithful-looking outputs from memory rather than retrieved context, making it impossible to verify source attribution through output analysis alone. They propose Computational Reality Monitoring (CRM), a technique that detects internal representational differences to identify when models rely on pretraining data versus external evidence.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

Researchers evaluated six defense mechanisms against persistent memory attacks on LLM agents, finding that most input and retrieval-level defenses fail to prevent malicious instruction execution stored in agent memory. Only Memory Sandbox, a memory-layer tool-gating approach, effectively blocked attacks across eight of nine models while maintaining zero utility cost, though it paradoxically increased attack success in one reasoning model by forcing reliance on alternative execution pathways.