y0news

#ai-reliability News & Analysis

138 articles tagged with #ai-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad introduces a gradient-descent-inspired framework for automatically optimizing LLM agent skills, treating skill packages as parameters to be refined through task execution feedback and systematic diagnosis. The method outperforms existing training-based approaches by 6.7 percentage points on benchmark tasks, demonstrating measurable improvements in agent reliability and capability.

AI · Bearish · arXiv – CS AI · 3d ago · 6/10

Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

A comprehensive study reveals that multimodal large language models exhibit significant hallucination problems in agricultural imaging tasks, with image interpretation achieving only 63-75% zero-shot accuracy and text-to-image generation producing up to 91% biologically inconsistent scenes. These findings highlight critical reliability gaps that could undermine the trustworthiness of AI-driven agricultural platforms.

🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

Researchers propose Calibrated Entropy Score (CES), a novel method for detecting hallucinations in large language models using entropy distribution patterns from a single forward pass. The technique achieves performance comparable to computationally expensive multi-sample methods while requiring only black-box access to token logits, with formal mathematical guarantees for detection accuracy.

🏢 Perplexity
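
As a rough illustration of the signal this summary refers to, the sketch below computes per-token Shannon entropy from raw logits. It is not the paper's Calibrated Entropy Score: the function names and the toy logits are assumptions made here, and any calibration or decision threshold is left out.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution for one step."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def entropy_profile(step_logits: Sequence[Sequence[float]]) -> List[float]:
    """Entropy at every generation step of a single forward pass."""
    return [token_entropy(logits) for logits in step_logits]


# Hypothetical logits over a 4-token vocabulary for three generation steps.
TOY_STEPS = [
    [10.0, 0.0, 0.0, 0.0],  # confident step: low entropy
    [1.0, 0.0, 0.0, 0.0],   # less confident step
    [0.0, 0.0, 0.0, 0.0],   # uniform step: maximum entropy, log(4)
]
PROFILE = entropy_profile(TOY_STEPS)
```

A detector in this spirit would look at the shape of `PROFILE` across a response rather than at any single step; how that shape is scored is exactly what the sketch omits.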
AI · Neutral · arXiv – CS AI · 3d ago · 6/10

CiteCheck: Retrieval-Grounded Detection of LLM Citation Hallucinations in Scientific Text

Researchers introduce CiteCheck, a hybrid framework that detects when large language models fabricate or corrupt scientific citations by combining scholarly database retrieval with structured LLM verification. The system achieves 88.7% macro-F1 on a new 982-citation physics benchmark, outperforming GPT, Claude, and Gemini, addressing a critical reliability problem as LLMs become integrated into scientific research workflows.

🧠 Claude · 🧠 Gemini
AI · Bearish · TechCrunch – AI · 4d ago · 6/10

Why Google’s AI can’t spell Google (or anything else)

Google's AI systems have demonstrated a surprising inability to accurately spell basic words, including Google itself, exposing fundamental limitations in current large language models despite their apparent sophistication. This incident highlights ongoing challenges in AI reliability and raises questions about the robustness of AI systems being deployed at scale.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

Researchers introduce Anchor, a task-generation pipeline that addresses 'artifact drift' in AI agent benchmarking by automatically creating consistent instructions, environments, solutions, and verifiers from formal specifications. The team releases ERP-Bench, a 300-task benchmark for enterprise workflows, finding frontier AI models solve only 17.4% of tasks optimally despite meeting explicit constraints 26.1% of the time.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

A new study comparing three LLM approaches to mathematical reasoning found that pure chain-of-thought prompting outperforms code execution methods in robustness across problem variations. When math problems were modified with simple changes like different names or numbers, code-based approaches showed greater accuracy drops, challenging the assumption that code execution improves reasoning reliability.

🧠 Claude · 🧠 Haiku
AI · Neutral · arXiv – CS AI · 4d ago · 6/10

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

Researchers introduce MemFail, a diagnostic benchmark for testing failure modes in LLM memory systems by isolating three core operations: summarization, storage, and retrieval. The benchmark evaluates state-of-the-art memory systems across five adversarially-designed datasets to empirically understand architectural tradeoffs, moving beyond aggregate accuracy metrics.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Generating Robust Portfolios of Optimization Models using Large Language Models

Researchers propose an algorithm that uses large language models to generate portfolios of optimization models rather than single outputs, addressing the reliability gap in LLM-generated solutions. The method leverages LLMs in dual roles—as generative and evaluative components—with theoretical guarantees that high-quality candidates appear in the portfolio as long as either role aligns with human preferences.

$MKR
AI · Neutral · arXiv – CS AI · 4d ago · 6/10

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

Researchers introduce MiRD, a two-stage framework that improves reliable prediction for open-ended question answering by separately addressing sampling failures and selection errors. The approach maintains calibration-set integrity while controlling hallucinations in AI models, outperforming existing conformal prediction methods across multiple datasets and models.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

Researchers introduce Context-Driven Decomposition (CDD), a diagnostic tool that reveals how retrieval-augmented generation (RAG) systems blindly follow retrieved context even when it contradicts their underlying knowledge. Testing across multiple AI models shows CDD can improve accuracy to 64% on adversarial scenarios, though improvements don't consistently transfer across different model families, suggesting RAG systems resolve conflicts through fundamentally different mechanisms.

🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 12 · 6/10

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Researchers present a rigorous statistical framework for measuring AI agent reliability through U-statistics and kernel-based metrics, moving beyond traditional pass@1 evaluation methods. The study reveals that agents can possess requisite knowledge yet fail catastrophically under minor task variations, with trajectory-level consistency metrics providing significantly better diagnostic sensitivity for identifying failure modes in high-stakes deployments.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Interactive Critique-Revision Training for Reliable Structured LLM Generation

Researchers propose DPA-GRPO, a novel training method for large language models that improves structured decision-making by using a generator-verifier framework where one model produces outputs and another validates them through safety assurance cases. The method demonstrates improved accuracy on tax calculation benchmarks and addresses the challenge of ensuring LLM outputs are locally correct, globally consistent, and auditable.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Semantic Voting: Execution-Grounded Consensus for LLM Code Generation

Researchers demonstrate that execution-based voting methods for LLM code generation significantly outperform text-based majority voting by 18-52 percentage points. The study reveals that input quality—particularly sketch-based generation—matters far more than the aggregation algorithm itself, challenging assumptions about how to select optimal code outputs.
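
The difference between the two voting styles can be sketched in a few lines. This is an illustrative toy, not the paper's algorithm: the helper names, the candidate snippets, and the probe inputs are all assumptions, and real use would need sandboxed execution.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def text_majority_vote(candidates: Sequence[str]) -> str:
    """Pick the most frequent source string (ties go to the earliest one)."""
    counts: Dict[str, int] = defaultdict(int)
    for src in candidates:
        counts[src.strip()] += 1
    return max(counts, key=counts.get)


def behavior_signature(src: str, func_name: str, inputs: Sequence[tuple]) -> Tuple[str, ...]:
    """Run one candidate on the probe inputs and summarize what it returned."""
    namespace: dict = {}
    try:
        exec(src, namespace)  # illustration only: sandbox untrusted code in practice
        fn = namespace[func_name]
    except Exception:
        return ("<load error>",)
    outputs: List[str] = []
    for args in inputs:
        try:
            outputs.append(repr(fn(*args)))
        except Exception as exc:
            outputs.append(f"<{type(exc).__name__}>")
    return tuple(outputs)


def execution_vote(candidates: Sequence[str], func_name: str, inputs: Sequence[tuple]) -> str:
    """Group candidates by observed behavior and return one from the largest group."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for src in candidates:
        groups[behavior_signature(src, func_name, inputs)].append(src)
    return max(groups.values(), key=len)[0]


# Hypothetical samples: three behave correctly but are written differently,
# two share an identical bug and therefore win a purely textual vote.
CANDIDATES = [
    "def add(a, b):\n    return a + b",
    "def add(x, y):\n    return x + y",
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return sum([a, b])",
]
INPUTS = [(1, 2), (0, 0), (5, -3)]
```

With these samples, `text_majority_vote(CANDIDATES)` returns the duplicated buggy snippet, while `execution_vote(CANDIDATES, "add", INPUTS)` returns the first correct one, because the three correct variants agree on every probe input even though their text differs.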

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations

Researchers investigate why visual grounding models fail when image captions are semantically mismatched, hypothesizing that embedding anisotropy may be responsible. Testing two transformer-based models with different embedding geometries reveals no meaningful correlation between cosine similarity and approximation errors, suggesting the problem requires investigation of deeper geometric properties.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities

Researchers introduce HOME-KGQA, a new benchmark dataset for evaluating knowledge graph question answering systems on household activities using multimodal data. The dataset reveals significant performance gaps in current LLM-based KGQA methods, highlighting critical challenges for real-world deployment of AI systems that combine language models with structured knowledge.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

Researchers introduce MemoRepair, a system that addresses cascade failures in agentic memory by preventing stale or invalidated information from corrupting downstream AI agent decisions. Using a barrier-first approach and graph-based optimization, the system reduces invalid memory exposure from 69-94% to 0% while maintaining 91-94% of valid successor states with significantly lower repair costs.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

TraceFix is a verification-first framework that uses TLA+ model checking to automatically repair and validate multi-agent LLM coordination protocols, achieving 100% verification success on 48 test tasks with 62.5% passing on first attempt. The approach reduces deadlock/livelock failures from 31.1% to 14.1% and improves task completion rates to 89.4% compared to unverified baselines.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

Researchers evaluated metacognitive monitoring across 33 frontier LLMs using 47,151 MMLU benchmark items, finding significant domain-level variation masked by aggregate performance scores. Applied/Professional knowledge domains showed consistently strong self-monitoring (AUROC .742), while Formal Reasoning and Natural Science proved most challenging, with implications for targeted model deployment.

🏢 OpenAI · 🏢 Anthropic · 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 11 · 6/10

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

Researchers introduce a proxy-analyzer framework that detects hallucinations in large language models by analyzing internal activations of a small open-weight reader model rather than the generator itself. The system achieves competitive or superior performance compared to existing methods across multiple model architectures, with notably consistent results showing that model size has minimal impact on detection accuracy.

🧠 GPT-4
AI · Bearish · arXiv – CS AI · May 4 · 6/10

Impact of Task Phrasing on Presumptions in Large Language Models

Researchers studied how task phrasing influences the decision-making of large language models, using the iterated prisoner's dilemma as a test case. The findings show that LLMs make presumptions based on how tasks are worded, which can impair their adaptability and reasoning and raises a safety concern for real-world deployment. Neutral task phrasing significantly reduced these presumptions, suggesting that prompt design is critical for reliable LLM performance.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

Researchers introduce RSCB-MC, a risk-sensitive contextual bandit system that improves how LLM-based coding agents decide whether to use external memory for debugging tasks. Rather than treating memory retrieval as a simple similarity-matching problem, the system treats it as a safety-critical control problem, achieving 62.5% success rate with zero false positives in testing.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation

Researchers propose DALM, a Domain-Algebraic Language Model that constrains token generation through structured denoising across domain lattices rather than unconstrained decoding. The framework uses algebraic constraints across three phases—domain, relation, and concept resolution—to prevent cross-domain knowledge interference and improve factual accuracy in specialized domains.

AI · Bearish · arXiv – CS AI · Apr 20 · 6/10

The threat of analytic flexibility in using large language models to simulate human data

A new study reveals that using large language models to generate synthetic datasets ("silicon samples") produces highly variable results depending on configuration choices, with correlation outcomes ranging from r=.23 to r=.84 on the same task. This demonstrates that analytic flexibility in LLM-based data generation poses a significant threat to research validity and reproducibility in social science applications.

AI · Bullish · arXiv – CS AI · Apr 20 · 6/10

Mitigating hallucinations and omissions in LLMs for invertible problems: An application to hardware logic design automation

Researchers demonstrate that LLMs can be used as lossless encoders and decoders for invertible problems in hardware design, significantly reducing hallucinations and omissions. By generating HDL code from Logic Condition Tables and reconstructing the original tables to verify accuracy, the approach improves developer productivity and catches both AI-generated errors and design specification flaws.
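
The round-trip idea in this summary can be sketched as a generic check: encode the specification, decode it back, and report any rows that changed. In the sketch below the encoder and decoder are deterministic stand-ins for the LLM calls, and the table format, function names, and textual encoding are assumptions made for illustration rather than the paper's setup.

```python
from typing import Callable, Dict, Optional, Tuple

# A toy "logic condition table": input bits mapped to one output bit.
Table = Dict[Tuple[int, ...], int]


def encode_faithful(table: Table) -> str:
    """Stand-in for the generation step: emit one text line per table row."""
    return "\n".join(
        f"{' '.join(str(bit) for bit in bits)} -> {out}"
        for bits, out in sorted(table.items())
    )


def encode_lossy(table: Table) -> str:
    """Stand-in for a generation that silently omits the last row."""
    return encode_faithful(dict(sorted(table.items())[:-1]))


def decode(text: str) -> Table:
    """Stand-in for the reconstruction step: parse the lines back into a table."""
    table: Table = {}
    for line in text.splitlines():
        left, right = line.split("->")
        table[tuple(int(bit) for bit in left.split())] = int(right)
    return table


def round_trip_diff(
    table: Table,
    encode: Callable[[Table], str],
    decode_fn: Callable[[str], Table],
) -> Dict[Tuple[int, ...], Tuple[Optional[int], Optional[int]]]:
    """Return rows whose value differs between the original and the rebuilt table."""
    rebuilt = decode_fn(encode(table))
    return {
        key: (table.get(key), rebuilt.get(key))
        for key in set(table) | set(rebuilt)
        if table.get(key) != rebuilt.get(key)
    }


AND_TABLE: Table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```

An empty result from `round_trip_diff` means the reconstruction matched the specification; a non-empty one points at the specific rows that were omitted or altered, which is the kind of discrepancy the summary describes catching.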
