AI · Neutral · arXiv – CS AI · 10h ago · 7/10
Mechanistic Origin of Moral Indifference in Language Models
Researchers identified a fundamental flaw in large language models: they exhibit moral indifference, compressing distinct moral concepts into uniform probability distributions. The study analyzed 23 models and developed a method based on Sparse Autoencoders to improve moral reasoning, achieving a 75% win rate on adversarial benchmarks.