AI · Neutral · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce DistractionIF, a benchmark revealing that larger language models are paradoxically less robust to instruction-like noise in reference text, with performance degrading up to 30 points as scale increases. The study demonstrates that reinforcement learning via Group Relative Policy Optimization can restore robustness by 15.5% while maintaining instruction-following capability.
🏢 Perplexity
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers have developed a method to improve how large language models verify factual claims by framing fact-checking as a true/false reading comprehension task with explicit test-taking strategies. The approach reduces token usage by over 80% while maintaining competitive performance, and enables smaller language models to perform similarly to larger ones through fine-tuning and self-revision mechanisms.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce CORE-RAG, a novel framework that compresses context in Retrieval-Augmented Generation systems using performance-driven learning rather than predefined heuristics. The approach achieves a 97% compression ratio while improving accuracy by 3.3 points on exact match scores, addressing a critical bottleneck in LLM efficiency.
AI · Neutral · arXiv – CS AI · 2d ago · 7/10
🧠Researchers identify source-dependence as a critical failure mode in retrieval-augmented generation (RAG) systems, where multi-source medical AI systems provide different answers to identical questions based on which institutional source is retrieved. The study introduces TransplantQA, HERO-QA, and evaluation frameworks to audit this phenomenon, revealing that source disagreement is far more prevalent than previously measured.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠GroundedCache proposes a safety-first framework for reusing cached answers in retrieval-augmented generation systems by validating four conditions before serving cached responses. The system achieves near-zero unsafe-served rates (0-1.5%) across benchmarks while maintaining minimal latency overhead, addressing critical vulnerabilities in current caching approaches that can serve incorrect answers.
AI · Bearish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers identify a critical vulnerability in retrieval-augmented generation systems where language models produce faithful-looking outputs from memory rather than retrieved context, making it impossible to verify source attribution through output analysis alone. They propose Computational Reality Monitoring (CRM), a technique that detects internal representational differences to identify when models rely on pretraining data versus external evidence.
AI · Bearish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers discovered that retrieval-augmented language models exhibit a critical safety gap: they can detect contradictory information in accumulated evidence but fail to incorporate this awareness into their final recommendations. Testing across model families showed single-turn safety evaluations significantly overestimate real-world robustness in multi-turn scenarios where evidence accumulates.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce MedMeta, a benchmark evaluating how well large language models can synthesize conclusions from medical meta-analyses using only study abstracts. The study reveals that retrieval-augmented generation (RAG) significantly outperforms parametric-only approaches, but all current models struggle with evidence synthesis and fail to properly reject contradictory findings, achieving only marginally above-average performance even under ideal conditions.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers evaluated six defense mechanisms against persistent memory attacks on LLM agents, finding that most input-level and retrieval-level defenses fail to prevent the execution of malicious instructions stored in agent memory. Only Memory Sandbox, a memory-layer tool-gating approach, blocked attacks on eight of nine models at zero utility cost, though it paradoxically increased attack success in one reasoning model by forcing reliance on alternative execution pathways.
AI · Bearish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce CloudWeb, an adversarial attack that manipulates remote sensing images with realistic cloud and haze patterns to hijack vision-language retrieval systems in multimodal RAG pipelines. On benchmark tests the attack raises weather-related evidence injection from 0.71% to 43.29%, demonstrating that input-space threats to retrieval stages remain largely undefended in production systems.
🏢 OpenAI
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠LARAG introduces a link-aware retrieval strategy that improves RAG systems by leveraging hyperlink structures already present in technical documentation, rather than treating documents as flat text collections. The approach achieves better answer quality with fewer computational resources, demonstrating that implicit graph-like retrieval through existing metadata can enhance AI system performance.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce FinAgent-RAG, an advanced AI framework designed to answer complex financial questions by combining iterative retrieval, reasoning, and self-verification. The system achieves 76-78% accuracy on financial benchmarks while reducing computational costs by 41%, demonstrating practical viability for institutional financial analysis.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers introduce NeocorRAG, a new framework that optimizes retrieval quality in Retrieval-Augmented Generation (RAG) systems by using Evidence Chains, achieving state-of-the-art performance while reducing token consumption by 80% compared to comparable methods. The framework addresses a critical gap where improvements in retrieval metrics don't consistently translate to better reasoning accuracy.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers have developed a multi-agent AI system that autonomously generates machine learning pipelines from datasets and natural-language instructions, achieving an 84.7% success rate across 150 diverse tasks. The architecture integrates self-healing mechanisms and adaptive learning to reduce manual development time and improve robustness.
AI · Neutral · arXiv – CS AI · May 1 · 7/10
🧠Researchers release NanoKnow, a benchmark dataset that reveals how large language models acquire and encode knowledge by leveraging nanochat's fully transparent pre-training data. The study demonstrates that LLM accuracy depends heavily on answer frequency in training data, and that parametric knowledge and external evidence serve complementary roles in model outputs.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce reasoning graphs, a persistent knowledge structure that improves language model reasoning accuracy by storing and reusing chains of thought tied to evidence items. The system achieves a 47% error reduction on multi-hop questions and maintains deterministic outputs using only context engineering, with no model retraining.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce DocSeeker, a multimodal AI system designed to improve long-document understanding through structured analysis, localization, and reasoning workflows. The system targets two limitations of existing large language models on lengthy documents, high noise levels and weak training signals, and achieves superior performance on both short and ultra-long documents.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce Disco-RAG, a discourse-aware framework that enhances Retrieval-Augmented Generation (RAG) systems by explicitly modeling discourse structures and rhetorical relationships between retrieved passages. The method achieves state-of-the-art results on question answering and summarization tasks without fine-tuning, demonstrating that structural understanding of text significantly improves LLM performance on knowledge-intensive tasks.
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers have developed ADAM, a novel privacy attack that exploits vulnerabilities in Large Language Model agents' memory systems through adaptive querying, achieving up to 100% success rates in extracting sensitive information. The attack highlights critical security gaps in modern LLM-based systems that rely on memory modules and retrieval-augmented generation, underscoring the urgent need for privacy-preserving safeguards.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce soul.py, an open-source architecture addressing catastrophic forgetting in AI agents by distributing identity across multiple memory systems rather than centralizing it. The framework implements persistent identity through separable components and a hybrid RAG+RLM retrieval system, drawing inspiration from how human memory survives neurological damage.
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.
AI · Bearish · arXiv – CS AI · Mar 27 · 7/10
🧠Researchers have developed PIDP-Attack, a new attack that combines prompt injection with database poisoning to manipulate AI responses in Retrieval-Augmented Generation (RAG) systems. The attack demonstrated 4-16% higher success rates than existing techniques across multiple benchmark datasets and eight different large language models.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed an open-source benchmark dataset to evaluate AI systems' compliance with the EU AI Act, specifically focusing on NLP and RAG systems. The dataset enables automated assessment of risk classification, article retrieval, and question-answering tasks, achieving F1-scores of 0.87 and 0.85 for prohibited and high-risk scenarios, respectively.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce CRITIC-R1, a structured framework that uses reinforcement learning to improve retrieval-augmented generation (RAG) systems by diagnosing and correcting errors in AI-generated answers. The approach outperforms existing RAG methods by providing fine-grained, multi-dimensional feedback rather than coarse corrections, addressing persistent hallucination and reasoning problems in knowledge-intensive question answering.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce RefWalk, a framework, and RegOps-Bench, a benchmark, for regulatory question answering with large language models. The system addresses critical gaps in citation traceability and attribution accuracy by traversing multi-document regulatory structures, enabling more reliable AI deployment in compliance-critical domains.