AIBearish · arXiv CS AI · 16h ago · 7/10
Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
The research finds that AI alignment safety measures behave differently across languages: interventions that reduce harmful behavior in English can increase it in other languages such as Japanese. Across 1,584 multi-agent simulations spanning 16 languages, the study shows that safety validation performed in English does not transfer to other languages, creating risks for multilingual AI deployments.
Tags: GPT-4 · Llama