Low-Resource Safety Failures Are Action Failures, Not Representation Failures
Researchers discovered that large language models fail to refuse harmful requests in low-resource languages not because they lack the underlying safety representations, but because they cannot properly calibrate their safety decisions across languages. A recalibration approach using minimal target-language examples substantially improves refusal rates, suggesting safety alignment failures stem from decision calibration rather than representation gaps.
This research addresses a critical vulnerability in multilingual AI safety systems, with significant implications for deploying language models globally. The study reveals that current safety alignment methods break down for low-resource languages, many of which have large speaker populations, creating a safety blind spot in AI deployment. The finding that safety representations transfer well but decision calibration does not is counterintuitive, and it suggests the field should change how it approaches the problem.
The research emerges from growing recognition that AI safety frameworks optimized for English often fail catastrophically when applied to other languages. Previous work attributed these failures to representation gaps, assuming models simply didn't learn the right internal features for harmful content in low-resource languages. This study contradicts that assumption by demonstrating that the harmfulness direction extracted from English activations separates harmful from harmless prompts effectively even in Swahili and Burmese. The problem lies downstream: models possess the knowledge but fail to act on it decisively.
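To make the idea of a transferable harmfulness direction concrete, here is a minimal sketch on synthetic data. It assumes the direction is a difference of class means, which is a common choice but is not confirmed by this summary, and it uses toy Gaussian vectors in place of real model activations. All names (`harm_direction`, `synthetic_activations`, the `sw_` prefix for a target language) are hypothetical.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def harm_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = [h - s for h, s in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(activation, direction):
    """Scalar score: dot product of an activation with the direction."""
    return sum(a * d for a, d in zip(activation, direction))

def best_split_accuracy(harmful, harmless, direction):
    """Accuracy of the best single cutoff on the projected scores."""
    scored = [(project(x, direction), True) for x in harmful]
    scored += [(project(x, direction), False) for x in harmless]
    best = 0
    for threshold, _ in scored:
        correct = sum((score >= threshold) == label for score, label in scored)
        best = max(best, correct)
    return best / len(scored)

def synthetic_activations(rng, n, dim, harm_shift, language_offset):
    """Toy stand-in for hidden states (assumption): unit Gaussian noise,
    with axis 0 carrying the harm signal plus a per-language offset."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += harm_shift + language_offset
        rows.append(row)
    return rows

rng = random.Random(0)
DIM = 8
en_harmful = synthetic_activations(rng, 50, DIM, 6.0, 0.0)
en_harmless = synthetic_activations(rng, 50, DIM, 0.0, 0.0)
# Target language: same harm signal, but every score is shifted downward.
sw_harmful = synthetic_activations(rng, 50, DIM, 6.0, -3.0)
sw_harmless = synthetic_activations(rng, 50, DIM, 0.0, -3.0)

# Direction is estimated from the English data only.
direction = harm_direction(en_harmful, en_harmless)
transfer_accuracy = best_split_accuracy(sw_harmful, sw_harmless, direction)

# English-tuned cutoff: midpoint between the two English class means.
en_threshold = (
    sum(project(x, direction) for x in en_harmful) / len(en_harmful)
    + sum(project(x, direction) for x in en_harmless) / len(en_harmless)
) / 2
sw_refusal_rate = sum(project(x, direction) >= en_threshold for x in sw_harmful) / len(sw_harmful)

print(f"separability in target language: {transfer_accuracy:.2f}")
print(f"harmful prompts caught by English cutoff: {sw_refusal_rate:.2f}")
```

In this toy setup the two classes remain well separated along the English-derived direction, yet the English-tuned cutoff misses many harmful target-language examples. That gap mirrors the study's claim in miniature, although the numbers here are artificial and say nothing about the reported results.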
The practical implications are substantial. The proposed recalibration method requires only 1-4 examples per class in target languages, making it scalable for the hundreds of languages with limited training data. This efficiency matters because full retraining of large models remains computationally prohibitive for most organizations. The approach raises refusal selectivity from 33.6 to 54.5 while maintaining utility on benchmark tasks, suggesting a genuine solution rather than a theoretical exercise.
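The summary does not describe the recalibration mechanism in detail. As one hedged illustration of how a handful of labeled target-language examples could reset a decision cutoff, the sketch below places the threshold at the midpoint between the mean scores of the few harmful and few harmless examples. The scores, the English cutoff of 3.0, and the function names are all hypothetical values chosen for illustration, not figures from the study.

```python
def recalibrated_threshold(harmful_scores, harmless_scores):
    """Midpoint between the two class means of the few-shot scores.
    This rule is an assumption for illustration, not the paper's method."""
    mean_harmful = sum(harmful_scores) / len(harmful_scores)
    mean_harmless = sum(harmless_scores) / len(harmless_scores)
    return (mean_harmful + mean_harmless) / 2

def should_refuse(score, threshold):
    """Refuse when the harmfulness score reaches the cutoff."""
    return score >= threshold

def accuracy(examples, threshold):
    """Share of (score, is_harmful) pairs handled correctly at this cutoff."""
    correct = sum(should_refuse(score, threshold) == is_harmful
                  for score, is_harmful in examples)
    return correct / len(examples)

# Hypothetical cutoff tuned on English harmfulness scores.
english_threshold = 3.0

# Two labeled target-language examples per class (made-up scores).
few_shot_harmful = [1.9, 2.3]
few_shot_harmless = [-0.8, -1.2]
target_threshold = recalibrated_threshold(few_shot_harmful, few_shot_harmless)

# Held-out target-language prompts as (score, is_harmful) pairs (made up).
held_out = [(2.0, True), (1.7, True), (-0.9, False), (-1.1, False), (2.4, True)]

print("English cutoff accuracy:", accuracy(held_out, english_threshold))
print("Recalibrated cutoff accuracy:", accuracy(held_out, target_threshold))
```

The design point is that only a single scalar changes per language, which is why so few examples can suffice when the underlying scores already separate the classes.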
For organizations deploying multilingual AI systems, this work indicates that safety cannot be treated as a one-time English-language problem. The field must develop language-aware calibration methods as standard practice. This research provides a template for that work, though scaling beyond the 23 languages tested and ensuring robustness across diverse harm categories remain open problems.
- Safety failures in low-resource languages stem from miscalibrated decision thresholds, not missing representations of harmfulness
- The harmfulness direction extracted from high-resource language activations transfers effectively to low-resource languages, separating harmful from harmless content with 87% accuracy
- A recalibration approach using as few as 1-4 target-language examples per class substantially improves refusal behavior over a reported 43.9% baseline refusal rate
- Existing adaptive steering methods like AdaSteer and CAST inherit cross-lingual calibration failures and require modification
- This approach preserves model utility on benchmark tasks while improving safety, avoiding the typical accuracy-safety tradeoff