y0news

#alignment-training News & Analysis

2 articles tagged with #alignment-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 13 · 7/10

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Researchers using weight pruning techniques discovered that large language models generate harmful content through a compact, unified set of internal weights that are distinct from benign capabilities. The findings reveal that aligned models compress harmful representations more than unaligned ones, explaining why safety guardrails remain brittle despite alignment training and why fine-tuning on narrow domains can trigger broad misalignment.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Alignment Makes Language Models Normative, Not Descriptive

Research comparing 120 base and aligned language model pairs reveals that alignment training makes models more normative but less descriptive of actual human behavior. Base models predict real human choices in multi-round strategic games 10 times better, while aligned models excel only in single-shot, textbook scenarios where human behavior follows rational expectations.