#reliability-engineering News & Analysis

6 articles tagged with #reliability-engineering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

Researchers introduce Accelerated Prompt Stress Testing (APST), a new evaluation framework that reveals safety vulnerabilities in large language models through repeated prompt sampling rather than traditional broad benchmarks. The study finds that models appearing equally safe in conventional testing show significant reliability differences when repeatedly queried, indicating current safety benchmarks may mask operational risks in deployed systems.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

Researchers present layer-isolated evaluation, a deterministic testing framework that decomposes LLM agents into eight functional layers, each validated independently without requiring LLM execution. Testing across 238 cases shows that aggregate end-to-end metrics mask localized regressions: targeted layer failures cause drops of 25 to 91 percentage points in component-specific tests while barely affecting overall pass rates.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Researchers introduce Think Fast, Talk Smart, a hybrid system that combines deterministic computation with bounded LLM calls for generating health text from structured data. The approach achieves lower errors and costs than pure LLM-based alternatives by reserving neural computation for expression tasks while delegating analysis, comparison, and ranking to deterministic code.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Researchers introduce AgingBench, a longitudinal reliability benchmark that evaluates how AI agents degrade over time in production environments rather than just at deployment. The study reveals that agent reliability decays through four distinct mechanisms (compression, interference, revision, and maintenance aging) and that fixes must target specific failure stages rather than assuming stronger base models solve the problem.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

SREGym is a new open-source benchmark platform that enables realistic evaluation of AI agents designed to diagnose and fix failures in production systems. The framework simulates high-fidelity failure scenarios across cloud-native stacks and currently includes 90 SRE problems, revealing significant performance variations among frontier AI models.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

VeriTrans: Fine-Tuned LLM-Assisted NL-to-PL Translation via a Deterministic Neuro-Symbolic Pipeline

VeriTrans is a machine learning system that converts natural language requirements into formal logic suitable for automated solvers, using a validator-gated pipeline to ensure reliability. The system combines fine-tuned language models with round-trip verification and deterministic execution, achieving 94.46% correctness on 2,100 specifications and enabling auditable translation for critical applications.
