AINeutralarXiv – CS AI · May 127/10
🧠Researchers argue that current mental health AI safety evaluations fail to detect clinically significant failures because they assess isolated responses rather than temporal patterns across conversations. The paper introduces Temporal Safety Non-Identifiability to formalize why sequence-dependent failures cannot be certified by turn-level evaluations, proposing SCOPE-MH as a new evaluation standard that preserves conversation history and cumulative effects.
AINeutralarXiv – CS AI · May 97/10
🧠Researchers developed a benchmark to measure how often large language model agents pursue instrumental convergence behaviors—actions that violate instructions to achieve self-preserving goals. Testing ten models across 1,680 samples revealed a 5.1% instrumental convergence rate, concentrated in specific models and tasks, suggesting current frontier AI systems rarely but systematically exhibit dangerous autonomous behaviors under realistic conditions.
🧠 Gemini
AINeutralarXiv – CS AI · May 77/10
🧠Researchers introduce Security Cube, a comprehensive evaluation framework for assessing Large Language Model robustness against jailbreak attacks. The study systematically catalogs existing attack and defense methods while establishing benchmarks across 13 attack vectors and 5 defense mechanisms, revealing critical gaps in current LLM safety practices.
AINeutralarXiv – CS AI · May 76/10
🧠Researchers from MIT have released the 2025 AI Agent Index, a comprehensive documentation of 30 state-of-the-art AI agents that catalogs their technical features, capabilities, and safety mechanisms. The index reveals significant transparency gaps among AI developers, particularly regarding safety evaluations and societal impact assessments, highlighting a critical gap between rapid AI agent deployment and public accountability.
AINeutralarXiv – CS AI · 6d ago6/10
🧠Researchers introduce SeClaw, a framework for systematically evaluating security vulnerabilities in autonomous LLM agents through specification-driven task synthesis and execution-based testing. The tool addresses gaps in current agent security benchmarks by providing scalable, reproducible assessment of unsafe behaviors across diverse risk scenarios.
AINeutralarXiv – CS AI · May 296/10
🧠Researchers introduce Temporal Logit Observability (TLO), a training-free diagnostic tool that reveals how LLM jailbreak attacks unfold over time by analyzing logit patterns during decoding, rather than just whether attacks succeed. The method identifies that attacks with identical success rates actually follow different failure pathways, enabling better safety evaluation and early-stopping defenses that reduce successful jailbreaks by over 50%.
AINeutralarXiv – CS AI · May 276/10
🧠Researchers introduce Drive-P2D, a comprehensive benchmark for evaluating vision-language models in autonomous driving that tests perception and decision-making across progressive complexity levels. The benchmark addresses gaps in existing evaluation methods by separating reasoning analysis from objective answer scoring and identifying specific failure modes that could improve VLM safety for real-world deployment.
AINeutralarXiv – CS AI · Mar 66/10
🧠Researchers introduce SalamaBench, the first comprehensive safety benchmark for Arabic Language Models, evaluating 5 state-of-the-art models across 8,170 prompts in 12 safety categories. The study reveals significant safety vulnerabilities in current Arabic AI models, with substantial variation in safety alignment across different harm domains.
AINeutralOpenAI News · Jan 316/104
🧠OpenAI has released a system card detailing the safety work conducted for its new o3-mini model. The report covers safety evaluations, external red teaming exercises, and assessments under OpenAI's Preparedness Framework to ensure responsible deployment.