y0news

#ai-reliability News & Analysis

50 articles tagged with #ai-reliability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

🧠 AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation

Researchers developed I-CALM, a prompt-based framework that reduces AI hallucinations by encouraging language models to abstain from answering when uncertain, rather than providing confident but incorrect responses. The method uses verbal confidence assessment and reward schemes to improve reliability without model retraining.
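
The abstention idea can be sketched in a few lines. This is a toy illustration, not the paper's code: `answer_or_abstain` and `reward` are hypothetical names, and the threshold and reward values are assumptions; the point is that a reward scheme penalizing confident wrong answers more than abstention makes "I don't know" the rational choice under low confidence.

```python
# Toy sketch of confidence-aware abstention (not I-CALM's actual code).
# The model is assumed to return an answer plus a self-reported
# verbal confidence in [0, 1].

ABSTAIN = "I don't know."

def answer_or_abstain(answer: str, confidence: float, threshold: float = 0.7) -> str:
    """Return the model's answer only if its confidence clears the threshold."""
    if confidence < threshold:
        return ABSTAIN
    return answer

def reward(correct: bool, abstained: bool,
           r_correct: float = 1.0, r_wrong: float = -2.0,
           r_abstain: float = 0.0) -> float:
    """Reward scheme: confident wrong answers cost more than abstaining."""
    if abstained:
        return r_abstain
    return r_correct if correct else r_wrong
```

Under these (assumed) rewards, a model that is right less than two-thirds of the time on a question earns more in expectation by abstaining, which is the incentive structure the summary describes.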

🧠 GPT-5
🧠 AI · Bearish · TechCrunch – AI · Apr 5 · 6/10

Copilot is ‘for entertainment purposes only,’ according to Microsoft’s terms of use

Microsoft's terms of service classify Copilot as 'for entertainment purposes only,' showing that even AI companies themselves warn users against blindly trusting model outputs. This aligns with broader industry caution about AI reliability and the need for human oversight when using AI tools.

🏢 Microsoft
🧠 AI · Bearish · arXiv – CS AI · Mar 27 · 6/10

Back to Basics: Revisiting ASR in the Age of Voice Agents

Researchers introduced WildASR, a multilingual diagnostic benchmark revealing that current ASR systems suffer severe performance degradation in real-world conditions despite achieving near-human accuracy on curated tests. The study found that ASR models often hallucinate plausible but unspoken content under degraded inputs, creating safety risks for voice agents.

🧠 AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Researchers developed a Markovian framework to measure reliability and oversight costs for AI agents in organizational workflows before deployment. Testing on enterprise procurement data showed that workflows appearing reliable at the state level can have substantial decision-making blind spots when refined with contextual information.
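
A Markov-chain reliability audit of this kind can be illustrated with a toy workflow. The states, transition probabilities, and function names below are invented for illustration (the paper's actual model is not available): an agent step can loop, escalate to human review, succeed, or fail, and iterating the chain yields the probability of ending in the absorbing failure state before deployment.

```python
# Toy Markov-chain reliability estimate (not the paper's framework).
# States: 0 = agent step, 1 = human review,
#         2 = success (absorbing), 3 = failure (absorbing).
P = [
    [0.70, 0.15, 0.10, 0.05],  # agent step: loop, escalate, succeed, fail
    [0.50, 0.00, 0.45, 0.05],  # human review: send back, -, approve, reject
    [0.00, 0.00, 1.00, 0.00],  # success stays success
    [0.00, 0.00, 0.00, 1.00],  # failure stays failure
]

def absorption_probs(P, start=0, steps=2000):
    """Iterate the state distribution until it has (numerically) absorbed."""
    n = len(P)
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

p_fail = absorption_probs(P)[3]  # long-run probability the workflow fails
```

Solving the same chain analytically gives a failure probability of 0.0575/0.225 ≈ 0.256 from the agent-step state, so a per-step failure rate of only 5% compounds into roughly one failed workflow in four — the kind of gap between state-level and workflow-level reliability the study highlights.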

🧠 AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Researchers propose Latent Entropy-Aware Decoding (LEAD), a new method to reduce hallucinations in multimodal large reasoning models by switching between continuous and discrete token embeddings based on entropy states. The technique addresses issues where transition words correlate with high-entropy states that lead to unreliable outputs in visual question answering tasks.
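
The entropy signal driving such a switch is simple to compute. This sketch is illustrative only — LEAD's actual mechanism over continuous latent embeddings is more involved, and the threshold here is an assumed hyperparameter — but it shows how a decoder can route high-uncertainty steps to a different decoding mode.

```python
import math

# Illustrative entropy-based mode switch (not LEAD's actual implementation).
# `probs` is a next-token probability distribution from the model.

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decode_mode(probs, threshold=1.0):
    """High entropy -> continuous ('latent') decoding; low -> discrete."""
    return "latent" if entropy(probs) > threshold else "discrete"
```

A peaked distribution (one token near probability 1) has entropy close to 0 and decodes discretely, while a near-uniform distribution exceeds the threshold and would trigger the latent path.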

🧠 AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

Researchers have developed the System Hallucination Scale (SHS), a human-centered tool for evaluating hallucination behavior in large language models. The instrument showed strong statistical validity in testing with 210 participants and provides a practical method for assessing AI model reliability from a user perspective.

🧠 AI · Bearish · Decrypt · Mar 10 · 6/10

There's a Benchmark Test That Measures AI 'Bullshit'—Most Models Fail

BullshitBench, a new benchmark, tests whether AI models can flag nonsensical questions instead of confidently answering them with fabricated responses. Most models fail the test, highlighting a significant reliability gap in current AI systems.

🧠 AI · Bearish · arXiv – CS AI · Mar 9 · 6/10

The Fragility Of Moral Judgment In Large Language Models

Researchers tested the stability of moral judgments in large language models using nearly 3,000 ethical dilemmas, finding that narrative framing and evaluation methods significantly influence AI decisions. The study reveals that LLM moral reasoning is highly dependent on how questions are presented rather than underlying moral substance, with only 35.7% consistency across different evaluation protocols.

🧠 GPT-4 · 🧠 Claude
🧠 AI · Bearish · Decrypt · Mar 4 · 6/10 · 4

Colombian Court Rejects Appeal for AI Writing, Then Gets Flagged By Its Own AI Detector

Colombia's highest criminal court rejected a lawyer's appeal citing AI detector evidence, but when the attorney tested the court's own ruling with the same AI detection software, it flagged the court's decision as 93% AI-generated. This highlights the unreliability and potential hypocrisy of using AI detectors as evidence in legal proceedings.

🧠 AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8

AI Runtime Infrastructure

Researchers introduce AI Runtime Infrastructure, a new execution layer that sits between AI models and applications to optimize agent performance in real-time. This infrastructure actively monitors and intervenes in agent behavior during execution to improve task success, efficiency, and safety across long-running workflows.

🧠 AI · Bearish · arXiv – CS AI · Mar 3 · 7/10 · 8

Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement

Research reveals that Large Language Models (LLMs) systematically fail at code review tasks, frequently misclassifying correct code as defective when matching implementations to natural language requirements. The study found that more detailed prompts actually increase misjudgment rates, raising concerns about LLM reliability in automated development workflows.

🧠 AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8

IDER: IDempotent Experience Replay for Reliable Continual Learning

Researchers propose IDER (Idempotent Experience Replay), a new continual learning method that addresses catastrophic forgetting in neural networks while improving prediction reliability. The approach uses idempotent properties to help AI models retain previously learned knowledge when acquiring new tasks, with demonstrated improvements in accuracy and reduced computational overhead.
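
The property the method's name refers to is easy to state: a mapping f is idempotent if applying it twice gives the same result as applying it once, f(f(x)) = f(x). The check below is a generic sketch of that property (IDER's actual training objective is not given in the summary); `is_idempotent` is a hypothetical helper name.

```python
# Generic idempotence check (illustrative; not IDER's training code).
# A function f is idempotent when f(f(x)) == f(x) for all inputs x.

def is_idempotent(f, xs, tol=1e-9):
    """Verify f(f(x)) == f(x) on a sample of inputs, up to tolerance."""
    return all(abs(f(f(x)) - f(x)) <= tol for x in xs)
```

For example, ReLU (clamping at zero) is idempotent, while "add one" is not — replay targets with this property can be reapplied during later training without drifting, which is the intuition behind using it for stable knowledge retention.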

🧠 AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3

WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

Researchers introduced WebDevJudge, a benchmark for evaluating how well AI models can judge web development quality compared to human experts. The study reveals significant gaps between AI judges and human evaluation, highlighting fundamental limitations in AI's ability to assess complex, interactive web development tasks.

🧠 AI · Bearish · arXiv – CS AI · Mar 3 · 6/10 · 4

Are LLMs Ready to Replace Bangla Annotators?

A comprehensive study of 17 Large Language Models as automated annotators for Bangla hate speech detection reveals significant bias and instability issues. The research found that larger models don't necessarily perform better than smaller, task-specific ones, raising concerns about LLM reliability for sensitive annotation tasks in low-resource languages.

🤖 AI × Crypto · Bearish · CoinTelegraph – AI · Mar 3 · 7/10 · 7

OpenZeppelin finds data contamination in OpenAI’s EVMbench

OpenZeppelin discovered significant flaws in OpenAI's EVMbench dataset, including data contamination from training leaks and at least four incorrectly classified high-severity vulnerabilities. This finding raises concerns about the reliability of AI tools used for blockchain security auditing.

🧠 AI · Bearish · BeInCrypto · Mar 2 · 6/10 · 5

Anthropic’s Claude Suffers Widespread Outage, Exposing AI Reliance

Anthropic's Claude AI chatbot experienced a widespread service outage, leaving thousands of users unable to access the claude.ai platform. The incident, labeled 'Elevated errors on claude.ai' by Anthropic's status page, began at 11:49 and sparked significant reactions across developer and tech communities, highlighting growing dependence on AI services.

🧠 AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 14

Demystifying the Lifecycle of Failures in Platform-Orchestrated Agentic Workflows

Researchers present AgentFail, a dataset of 307 real-world failure cases from agentic workflow platforms, analyzing how multi-agent AI systems fail and can be repaired. The study reveals that failures in these low-code orchestrated AI workflows propagate differently than traditional software, making them harder to diagnose and fix.

🧠 AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 5

Evaluating Stochasticity in Deep Research Agents

Researchers identified stochasticity (variability) as a critical barrier to deploying Deep Research Agents in real-world applications like financial decision-making and medical analysis. The study proposes mitigation strategies that reduce output variance by 22% while maintaining research quality, addressing a key obstacle for enterprise AI agent adoption.
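
Stochasticity of this kind can be quantified and crudely mitigated with standard tools. The sketch below shows one common approach — measure run-to-run agreement, then collapse repeated runs by majority vote (self-consistency) — as an illustration only; the paper's actual metrics and mitigation strategies are not specified in the summary, and the function names are invented.

```python
from collections import Counter

# Illustrative stability metric and mitigation (not the paper's method).
# `outputs` is a list of final answers from repeated runs of the same agent.

def pairwise_agreement(outputs):
    """Fraction of run pairs that agree exactly — a crude stability score."""
    n = len(outputs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 1.0
    return sum(outputs[i] == outputs[j] for i, j in pairs) / len(pairs)

def majority_vote(outputs):
    """Collapse repeated runs to their most common answer."""
    return Counter(outputs).most_common(1)[0][0]
```

Three runs returning `["buy", "buy", "sell"]` agree on only one of three pairs (score 1/3), but majority voting still yields a single deterministic recommendation — the kind of variance reduction enterprise deployments need.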

🧠 AI · Bearish · IEEE Spectrum – AI · Jan 8 · 6/10 · 4

AI Coding Assistants Are Getting Worse

AI coding assistants like GPT-5 are experiencing a decline in quality, with newer models generating code that runs without syntax errors but produces incorrect results silently. This represents a shift from easily debuggable crashes to more dangerous silent failures that are harder to detect and fix.

🧠 AI · Neutral · arXiv – CS AI · Apr 7 · 5/10

Measuring LLM Trust Allocation Across Conflicting Software Artifacts

Researchers developed TRACE, a framework to evaluate how LLMs allocate trust between conflicting software artifacts like code, documentation, and tests. The study found that current LLMs are better at identifying natural-language specification issues than detecting subtle code-level problems, with models showing systematic blind spots when implementations drift while documentation remains plausible.

🧠 AI · Neutral · arXiv – CS AI · Mar 27 · 5/10

From Untestable to Testable: Metamorphic Testing in the Age of LLMs

A research paper introduces metamorphic testing as a solution for testing AI and LLM-integrated software systems. The approach addresses the challenge of unreliable LLM outputs and limited labeled ground truth by using relationships between multiple test executions as test oracles.
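
The core idea — asserting a *relation* between outputs of related inputs when no ground-truth oracle exists — can be shown with a toy example. Everything here is an assumption for illustration: `sentiment` is a keyword stub standing in for an LLM call, and "negating the input flips the label" is one example metamorphic relation, not the paper's.

```python
# Toy metamorphic test (illustrative; not the paper's code).
# With no labeled ground truth, we check a relation between two
# executions instead of checking a single output against an oracle.

def sentiment(text: str) -> str:
    """Keyword stub standing in for an LLM sentiment classifier."""
    return "negative" if "not" in text.split() else "positive"

def negate(text: str) -> str:
    """Input transformation for the metamorphic relation."""
    return "not " + text

def check_negation_relation(texts) -> bool:
    """Metamorphic relation: negating the input should flip the label."""
    flip = {"positive": "negative", "negative": "positive"}
    return all(sentiment(negate(t)) == flip[sentiment(t)] for t in texts)
```

No test here asserts what the "correct" sentiment of any input is; it only asserts consistency between paired executions, which is exactly what makes the approach usable for otherwise untestable LLM outputs.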

🧠 AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

Generative AI in Managerial Decision-Making: Redefining Boundaries through Ambiguity Resolution and Sycophancy Analysis

A research study examined how generative AI models perform in business decision-making contexts, particularly their ability to detect ambiguity and resist sycophantic behavior. The study found that while AI excels at identifying contradictions and contextual ambiguities, it struggles with linguistic nuances and requires human oversight to function as a reliable strategic partner.

🧠 AI · Neutral · Apple Machine Learning · Mar 3 · 5/10 · 3

Learning to Reason for Hallucination Span Detection

Researchers are developing new methods to detect hallucinations in large language models by identifying specific spans of unsupported content rather than making binary decisions. The study evaluates Chain-of-Thought reasoning approaches to improve the complex multi-step process of hallucination span detection in LLMs.
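
The difference between binary and span-level detection can be made concrete with a toy grounding check. This is not the paper's method — real span detectors reason over semantics, not word overlap, and `unsupported_spans` is an invented name — but it shows the output shape: contiguous unsupported spans rather than a single hallucinated/not-hallucinated verdict.

```python
# Toy span-level grounding check (illustrative; not the paper's approach).
# Flags maximal runs of answer words with no lexical support in the evidence.

def unsupported_spans(answer: str, evidence: str):
    """Return maximal runs of answer words absent from the evidence text."""
    support = set(evidence.lower().split())
    spans, current = [], []
    for word in answer.split():
        if word.lower() in support:
            if current:
                spans.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        spans.append(" ".join(current))
    return spans
```

Against evidence "Paris is the capital of France", the answer "Paris is the capital of Germany" yields the single span `["Germany"]` — pinpointing *which* content is unsupported, the granularity the span-detection work aims for.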

Page 2 of 2