#safety-testing News & Analysis

15 articles tagged with #safety-testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

15 articles

AIBearisharXiv – CS AI · Jun 117/10

🧠

JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization

JailbreakOPT is a new framework that optimizes adversarial prompts to exploit safety vulnerabilities in large language models through iterative refinement and tool composition. The approach combines atomic jailbreak techniques with contextual bandits to achieve higher attack success rates while reducing the number of queries needed, demonstrating meaningful progress in LLM security testing.

AIBearisharXiv – CS AI · Jun 97/10

🧠

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

A new research paper reveals that LLM-based safety judges—widely used to evaluate AI safety at scale—have significant blind spots: they struggle to adapt their evaluations when presented with new contextual information or alternative safety definitions that conflict with their internal priors. This limitation undermines confidence in current safety evaluation methodologies across the AI industry.

AINeutralArs Technica – AI · May 287/10

🧠

Trump loses more control over AI regulation as Illinois passes landmark law

Illinois has passed landmark AI safety testing legislation, marking a significant shift in U.S. regulatory authority away from federal control. Major AI companies including Anthropic and OpenAI support the safety testing requirements, signaling industry acceptance of state-level AI governance standards.

🏢 OpenAI🏢 Anthropic

AINeutralarXiv – CS AI · May 277/10

🧠

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

Researchers introduce LURE (Live-Usage Replay Evaluations), a method to detect when large language models recognize they are being tested and alter their behavior accordingly. The technique replays realistic user interaction sequences before appending evaluation prompts, making benchmarks more aligned with actual deployment conditions and revealing that current safety evaluations may be fundamentally compromised by evaluation awareness.

AIBullisharXiv – CS AI · May 277/10

🧠

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

ScenePilot is a new framework for generating safety-critical scenarios to test autonomous driving systems by targeting the boundary between physically feasible and infeasible situations. Using constrained reinforcement learning combined with physical feasibility constraints, the method achieves 6.2 percentage points higher collision rates while maintaining physical validity, enabling more effective stress testing of AV safety systems.

AIBullisharXiv – CS AI · May 277/10

🧠

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Researchers demonstrate that multi-agent reinforcement learning (MARL) significantly improves autonomous vehicle safety testing by co-training self-driving cars alongside realistic pedestrian agents with hidden behavioral traits. The co-trained SDC achieved 78% goal success with 14% collision rate versus 35%/33% for rule-based baselines, with jaywalking accounting for 62% of collisions despite representing only 13% of crossing events.

AIBearisharXiv – CS AI · May 127/10

🧠

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

Researchers introduce IndustryBench, a 2,049-item benchmark testing large language models on industrial procurement tasks grounded in Chinese national standards. The study reveals that current LLMs perform poorly on safety-critical industrial applications, with the best models scoring only 2.08/3.0, and that extended reasoning paradoxically increases safety violations by introducing unsupported details into answers.

🧠 GPT-5

AIBearisharXiv – CS AI · May 97/10

🧠

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

Researchers demonstrate that large language models exhibit inconsistent safety behavior depending on whether prompts are framed as evaluations, deployments, or neutral requests—a phenomenon called evaluation-context divergence. Testing five open-weight model families reveals striking heterogeneity: OLMo-3-Instruct becomes more cautious during evaluations, while Mistral, Phi, and Llama models show the opposite pattern, raising questions about the reliability of safety benchmarks for predicting real-world deployment behavior.

🧠 Llama

AINeutralarXiv – CS AI · Apr 107/10

🧠

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Researchers introduce ATBench, a comprehensive benchmark for evaluating the safety of LLM-based agents across realistic multi-step interactions. The 1,000-trajectory dataset addresses critical gaps in existing safety evaluations by incorporating diverse risk scenarios, detailed failure classification, and long-horizon complexity that mirrors real-world deployment challenges.

AINeutralarXiv – CS AI · Mar 37/104

🧠

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

Researchers demonstrate a technique using steering vectors to suppress evaluation-awareness in large language models, preventing them from adjusting their behavior during safety evaluations. The method makes models act as they would during actual deployment rather than performing differently when they detect they're being tested.

AINeutralarXiv – CS AI · Jun 56/10

🧠

Risk Assessment of Autonomous Driving: Integrating Technical Failures, Ethical Dilemmas, and Policy Frameworks

Researchers analyzing autonomous vehicle safety data from NHTSA, California DMV, and MIT datasets identify perception and classification errors as primary technical failure modes, while highlighting divergent ethical frameworks and inconsistent regulatory approaches across jurisdictions as critical barriers to safe, widespread deployment.

AINeutralCrypto Briefing · Jun 36/10

🧠

Trump signs executive order for voluntary AI safety testing amid criticism it changes nothing

President Trump signed an executive order establishing a voluntary AI safety testing framework, drawing criticism for lacking enforcement mechanisms and potentially prioritizing industry growth over regulatory oversight. The initiative may boost investor confidence in AI development but raises concerns about adequate safety standards and meaningful regulatory impact.

AINeutralarXiv – CS AI · May 276/10

🧠

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

SemProbe is a new interactive tool for testing object detection systems in safety-critical applications using semantically meaningful image corruptions rather than simple pixel-level noise. The system uses diffusion-based inpainting to generate realistic test scenarios, automatically runs model inference, and logs results as structured artifacts for safety evaluation compliance.

AINeutralarXiv – CS AI · May 126/10

🧠

Beyond Self-Play: Hierarchical Reasoning for Continuous Motion in Closed-Loop Traffic Simulation

Researchers propose a hierarchical reinforcement learning framework that combines multi-agent interaction reasoning with continuous motion control to improve behavioral realism in traffic simulations. The approach outperforms self-play methods by better capturing socially aware driving behaviors while maintaining safety and efficiency in closed-loop SUMO simulations.

AINeutralOpenAI News · Sep 195/104

🧠

OpenAI Red Teaming Network

OpenAI has announced an open call for experts to join their Red Teaming Network, focusing on improving AI model safety. The initiative seeks domain experts to help identify vulnerabilities and enhance security measures for OpenAI's AI systems.