y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#llm-reliability News & Analysis

37 articles tagged with #llm-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

37 articles
AINeutralarXiv – CS AI · May 276/10
🧠

How Reliable are LLMs for Reasoning on the Re-ranking task?

Researchers investigate whether Large Language Models reliably perform re-ranking tasks by analyzing how different training methods affect semantic understanding and reasoning transparency. The study reveals that some training approaches produce better explainability than others, suggesting LLMs may optimize for evaluation metrics rather than genuine semantic comprehension, raising concerns about their actual reliability in ranking applications.

AINeutralarXiv – CS AI · May 126/10
🧠

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

NoisyCoconut is an inference-time method that improves LLM reliability by injecting controlled noise into internal representations to generate diverse reasoning paths, enabling models to abstain when uncertain without requiring retraining. The technique reduces error rates from 40-70% to below 15% on mathematical reasoning tasks through unanimous agreement among noise-perturbed paths, offering practical reliability improvements compatible with existing models.

AINeutralarXiv – CS AI · May 126/10
🧠

A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

Researchers present a communication-theoretic framework that unifies LLM reliability techniques (retry, majority voting, self-consistency) under classical information theory, introducing a cost-aware router that achieves 56% lower costs than fixed approaches while maintaining quality. The work demonstrates that no single reliability technique dominates across all tasks, supporting dynamic per-task allocation strategies.

AINeutralarXiv – CS AI · May 116/10
🧠

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

Researchers present a scale-conditioned evaluation protocol for AI agent memory systems that tests whether stored evidence remains usable as irrelevant data accumulates. Testing across multiple memory architectures and language models reveals that reliability degrades unpredictably with scale, with some models exceeding computational budgets while others maintain performance, suggesting memory scalability claims must be conditioned on specific agent-interface-scale combinations.

AINeutralarXiv – CS AI · May 116/10
🧠

Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

Researchers propose using multidimensional self-assessment based on cognitive appraisal theory to predict LLM failures more reliably than confidence alone. Testing across 12 models and 38 tasks, they find effort and ability dimensions consistently outperform confidence, with task type determining which dimension proves most predictive.

AIBullisharXiv – CS AI · May 96/10
🧠

Towards Dependable Retrieval-Augmented Generation Using Factual Confidence Prediction

Researchers propose a two-stage approach to improve reliability in retrieval-augmented generation (RAG) systems by using conformal prediction to filter retrieved content and an attention-based classifier to detect factual inconsistencies. The framework achieves up to 6% answer quality improvement and 77% inconsistency detection, advancing toward certified RAG systems for production AI applications.

AINeutralarXiv – CS AI · May 96/10
🧠

CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

Researchers propose CITE, an algorithm that enables reliable certification of Large Language Model outputs through multiple sampling while controlling error rates under data-dependent stopping conditions. The method addresses a critical challenge in LLM reliability by providing statistical guarantees without requiring advance knowledge of possible answer categories.

AIBearisharXiv – CS AI · May 16/10
🧠

Epistemic reflections on AI answering our questions: overwatch, erudite, logician, interlocutor

A research paper examines epistemological risks in relying on large language models for critical advice in finance, law, and healthcare. The article argues that uncritical acceptance of AI outputs violates established principles of logical reasoning and fair judgment, and proposes that trustworthy AI systems require integrated inference capabilities and awareness of how human biases shape interpretation.

🏢 Meta
AINeutralarXiv – CS AI · Apr 206/10
🧠

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

Researchers propose a conformal prediction framework for large language models that uses internal neural representations rather than surface-level outputs to assess reliability and uncertainty. The Layer-Wise Information scoring method improves prediction validity under distribution shift while maintaining competitive performance, addressing a critical challenge in deploying LLMs where traditional uncertainty signals become unreliable.

AINeutralarXiv – CS AI · Apr 206/10
🧠

Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints

Researchers present Deliberative Searcher, a framework that enhances large language model reliability by combining certainty calibration with retrieval-based search for question answering. The system uses reinforcement learning with soft reliability constraints to improve alignment between model confidence and actual correctness, producing more trustworthy outputs.

AINeutralarXiv – CS AI · Mar 36/103
🧠

FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning

Researchers introduce FaithCoT-Bench, the first comprehensive benchmark for detecting unfaithful Chain-of-Thought reasoning in large language models. The benchmark includes over 1,000 expert-annotated trajectories across four domains and evaluates eleven detection methods, revealing significant challenges in identifying unreliable AI reasoning processes.

← PrevPage 2 of 2