AIBearisharXiv – CS AI · 2h ago7/10
🧠
When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models
Researchers discover that safety-aligned language models exhibit 'brittle safety'—rigidly adhering to rules even when context changes make those actions harmful. Testing 12 models reveals a 17.4 percentage-point gap between safety benchmark scores and actual safety performance, with baseline accuracy failing to predict brittleness; state-aware validation approaches outperform traditional action-level guardrails.