y0news
#behavioral-alignment · 1 article
AI · Neutral · arXiv – CS AI · 10h ago · 7/10

Mechanistic Origin of Moral Indifference in Language Models

Researchers identified a fundamental flaw in large language models: they exhibit moral indifference, compressing distinct moral concepts into uniform probability distributions. The study analyzed 23 models and developed a method based on Sparse Autoencoders to improve moral reasoning, achieving a 75% win rate on adversarial benchmarks.