LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety
Researchers have identified a critical vulnerability in large language models: safety guardrails that hold in high-resource languages fail systematically in low-resource ones. The team proposes LASA (Language-Agnostic Semantic Alignment), a method that anchors safety alignment at the semantic bottleneck layer, reducing attack success rates from 24.7% to 2.8% on the tested models.
The research addresses a fundamental asymmetry in LLM safety: while models resist adversarial attacks robustly in English and other well-resourced languages, they become vulnerable when queried in languages with limited training data. This gap reveals that current safety alignment techniques are surface-level, optimizing for linguistic patterns rather than underlying semantic understanding. The discovery of the semantic bottleneck—an intermediate layer where representations are governed by shared meaning across languages rather than language-specific features—provides a mechanistic explanation for this vulnerability. By targeting safety alignment at this semantic layer rather than at the language surface, LASA achieves substantial improvements across multiple model families and scales.

The implications extend beyond academic interest: as LLMs are deployed globally, multilingual safety becomes a critical infrastructure concern. Organizations relying on these models for high-stakes applications face exposure to attacks delivered through low-resource language inputs. The approach suggests that safety engineering must move beyond pattern matching in training data to address the underlying semantic representations that drive model behavior.

This work contributes to the growing field of mechanistic interpretability applied to AI safety, demonstrating that understanding model internals can yield practical improvements in robustness.
- LASA reduces attack success rates from 24.7% to 2.8% by anchoring safety alignment at semantic bottleneck layers rather than language surfaces
- Current LLM safety mechanisms fail systematically in low-resource languages due to training data imbalance, creating exploitable vulnerabilities
- The semantic bottleneck represents the layer where language-agnostic meaning dominates over language-specific features in model representations
- Multilingual safety alignment requires mechanistic understanding of model internals, not just better training data or conventional fine-tuning
- Global LLM deployment faces significant security risks until safety protocols address language-agnostic semantic spaces consistently
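To make the semantic-bottleneck idea concrete, here is a minimal sketch of how such a layer could in principle be located: compute the mean cosine similarity between per-layer hidden states of parallel sentences in two languages, and pick the layer where cross-lingual alignment peaks. This is an illustration under assumptions, not the paper's actual procedure; the function names are hypothetical and the hidden states are synthetic (a shared "meaning" component that peaks mid-stack plus language-specific noise).

```python
import numpy as np

def layerwise_crosslingual_similarity(states_a, states_b):
    """Mean cosine similarity between parallel-sentence representations at
    each layer. Inputs have shape (n_layers, n_sentences, hidden_dim)."""
    a = states_a / np.linalg.norm(states_a, axis=-1, keepdims=True)
    b = states_b / np.linalg.norm(states_b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1).mean(axis=-1)  # shape: (n_layers,)

def find_semantic_bottleneck(states_a, states_b):
    """Index of the layer where cross-lingual alignment is highest."""
    return int(np.argmax(layerwise_crosslingual_similarity(states_a, states_b)))

# Synthetic demo: 12 layers, 16 parallel sentence pairs, hidden dim 64.
# A shared semantic component is strongest around layer 6; each language
# additionally carries its own random (language-specific) noise.
rng = np.random.default_rng(0)
n_layers, n_pairs, dim = 12, 16, 64
shared = rng.normal(size=(n_pairs, dim))
peak = np.exp(-0.5 * ((np.arange(n_layers) - 6) / 2.0) ** 2)
states_en = peak[:, None, None] * shared + 0.3 * rng.normal(size=(n_layers, n_pairs, dim))
states_lo = peak[:, None, None] * shared + 0.3 * rng.normal(size=(n_layers, n_pairs, dim))

print(find_semantic_bottleneck(states_en, states_lo))  # a mid-stack layer near 6
```

In a real setting the hidden states would come from a forward pass over a parallel multilingual corpus; the sketch only shows the layer-selection logic, which is the part the summary's mechanistic claim rests on.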