#alignment-tax News & Analysis

2 articles tagged with #alignment-tax. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer introduces a novel method for aligning large language models with safety requirements while minimizing degradation of general capabilities. By using localized on-policy distillation focused only on safety-critical tokens, the approach achieves strong safety performance with minimal data (100 harmful samples) and reduced computational costs compared to existing alignment methods.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

Researchers report that user-memory capabilities in large language models exhibit substrate asymmetry across three orthogonal dimensions: behavioral consistency, factual recall, and factual abstinence. Parametric methods (gamma-LoRA) excel at style preservation, while retrieval-augmented generation (RAG) excels at knowing when to abstain. The same neural circuits drive opposite-direction failures, and this tradeoff intensifies in heavily RLHF-tuned models, suggesting fundamental alignment costs to parametric personalization.

🧠 Llama