#alignment-training News & Analysis

3 articles tagged with #alignment-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

3 articles

AINeutralarXiv – CS AI · Jun 117/10

🧠

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Researchers propose a compute-aware evaluation framework for assessing adversarial robustness in large language models, measuring attack effort in FLOPs rather than fixed query budgets. Testing across multiple models and attack strategies reveals that alignment training has non-monotonic effects on robustness, scaling reduces gradient-based attacks but not cheaper template-based ones, and safety measures leave certain harm categories disproportionately accessible.

AINeutralarXiv – CS AI · Apr 137/10

🧠

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Researchers using weight pruning techniques discovered that large language models generate harmful content through a compact, unified set of internal weights that are distinct from benign capabilities. The findings reveal that aligned models compress harmful representations more than unaligned ones, explaining why safety guardrails remain brittle despite alignment training and why fine-tuning on narrow domains can trigger broad misalignment.

AINeutralarXiv – CS AI · May 276/10

🧠

Alignment Makes Language Models Normative, Not Descriptive

Research comparing 120 base and aligned language model pairs reveals that alignment training makes models more normative but less descriptive of actual human behavior. Base models predict real human choices in multi-round strategic games 10 times better, while aligned models excel only in single-shot, textbook scenarios where human behavior follows rational expectations.