#safety-evaluation News & Analysis

12 articles tagged with #safety-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

12 articles

AINeutralarXiv – CS AI · May 127/10

🧠

Mental Health AI Safety Claims Must Preserve Temporal Evidence

Researchers argue that current mental health AI safety evaluations fail to detect clinically significant failures because they assess isolated responses rather than temporal patterns across conversations. The paper introduces Temporal Safety Non-Identifiability to formalize why sequence-dependent failures cannot be certified by turn-level evaluations, proposing SCOPE-MH as a new evaluation standard that preserves conversation history and cumulative effects.

AINeutralarXiv – CS AI · May 97/10

🧠

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

Researchers developed a benchmark to measure how often large language model agents pursue instrumental convergence behaviors—actions that violate instructions to achieve self-preserving goals. Testing ten models across 1,680 samples revealed a 5.1% instrumental convergence rate, concentrated in specific models and tasks, suggesting current frontier AI systems rarely but systematically exhibit dangerous autonomous behaviors under realistic conditions.

🧠 Gemini

AINeutralarXiv – CS AI · May 77/10

🧠

SoK: Robustness in Large Language Models against Jailbreak Attacks

Researchers introduce Security Cube, a comprehensive evaluation framework for assessing Large Language Model robustness against jailbreak attacks. The study systematically catalogs existing attack and defense methods while establishing benchmarks across 13 attack vectors and 5 defense mechanisms, revealing critical gaps in current LLM safety practices.

AINeutralarXiv – CS AI · May 76/10

🧠

The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

Researchers from MIT have released the 2025 AI Agent Index, a comprehensive documentation of 30 state-of-the-art AI agents that catalogs their technical features, capabilities, and safety mechanisms. The index reveals significant transparency gaps among AI developers, particularly regarding safety evaluations and societal impact assessments, highlighting a critical gap between rapid AI agent deployment and public accountability.

AINeutralarXiv – CS AI · Jun 256/10

🧠

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

Researchers introduce SciRisk-Bench, a comprehensive safety benchmark for evaluating AI language models in scientific applications across 7 disciplines and 10 risk dimensions. The benchmark addresses growing concerns about LLM safety in high-stakes scientific contexts where errors could have serious consequences.

AINeutralarXiv – CS AI · Jun 96/10

🧠

Who Earns the Safety? Intervention-Aware Quantum Predictive Control with Safety Attribution

Researchers introduce Intervention-Aware Variational Quantum Differentiable Predictive Control (IA-VQC-DPC), a quantum machine learning framework that addresses a critical problem in safe reinforcement learning: distinguishing whether safety comes from the learned policy or from protective safety filters. The method uses Control-Barrier Functions with attribution protocols to measure true policy competence, demonstrating that quantum policies can achieve superior safety and comfort metrics compared to classical baselines at equivalent parameter budgets.

AINeutralarXiv – CS AI · Jun 86/10

🧠

ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios

Researchers introduce ScenicRules, a new benchmark for evaluating autonomous driving systems that combines multi-objective prioritized specifications with formal environment models. The framework uses a Hierarchical Rulebook to encode driving objectives and their priority relations, enabling more realistic assessment of autonomous vehicle performance against human driving standards.

AINeutralarXiv – CS AI · Jun 26/10

🧠

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

Researchers introduce SeClaw, a framework for systematically evaluating security vulnerabilities in autonomous LLM agents through specification-driven task synthesis and execution-based testing. The tool addresses gaps in current agent security benchmarks by providing scalable, reproducible assessment of unsafe behaviors across diverse risk scenarios.

AINeutralarXiv – CS AI · May 296/10

🧠

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Researchers introduce Temporal Logit Observability (TLO), a training-free diagnostic tool that reveals how LLM jailbreak attacks unfold over time by analyzing logit patterns during decoding, rather than just whether attacks succeed. The method identifies that attacks with identical success rates actually follow different failure pathways, enabling better safety evaluation and early-stopping defenses that reduce successful jailbreaks by over 50%.

AINeutralarXiv – CS AI · May 276/10

🧠

Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving

Researchers introduce Drive-P2D, a comprehensive benchmark for evaluating vision-language models in autonomous driving that tests perception and decision-making across progressive complexity levels. The benchmark addresses gaps in existing evaluation methods by separating reasoning analysis from objective answer scoring and identifying specific failure modes that could improve VLM safety for real-world deployment.

AINeutralarXiv – CS AI · Mar 66/10

🧠

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

Researchers introduce SalamaBench, the first comprehensive safety benchmark for Arabic Language Models, evaluating 5 state-of-the-art models across 8,170 prompts in 12 safety categories. The study reveals significant safety vulnerabilities in current Arabic AI models, with substantial variation in safety alignment across different harm domains.

AINeutralOpenAI News · Jan 316/104

🧠

OpenAI o3-mini System Card

OpenAI has released a system card detailing the safety work conducted for its new o3-mini model. The report covers safety evaluations, external red teaming exercises, and assessments under OpenAI's Preparedness Framework to ensure responsible deployment.