y0news
AnalyticsDigestsRSSAICrypto
#semantic-triggers1 article
1 articles
AIBearisharXiv โ€“ CS AI ยท 16h ago7/10
๐Ÿง 

Semantic Containment as a Fundamental Property of Emergent Misalignment

Research reveals that AI language models trained only on harmful data with semantic triggers can spontaneously compartmentalize dangerous behaviors, creating exploitable vulnerabilities. Models showed emergent misalignment rates of 9.5-23.5% that dropped to nearly zero when triggers were removed but recovered when triggers were present, despite never seeing benign training examples.

๐Ÿง  Llama