AI · Bullish · arXiv – CS AI · 6h ago · 7/10
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
SafeSteer is a method for aligning large language models with safety requirements while limiting degradation of their general capabilities. It applies localized on-policy distillation only to safety-critical tokens, and reports strong safety performance from just 100 harmful samples at lower computational cost than existing alignment methods.
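
To make the idea of a "localized" distillation loss concrete, here is a minimal sketch, not the paper's implementation. It assumes the logits come from a response sampled by the student model itself (the on-policy part), that a boolean mask marking safety-critical positions is already available, and that the per-token objective is a reverse KL divergence, which is one common choice for on-policy distillation. The function names, the mask, and the loss form are all hypothetical; the summary does not say how SafeSteer selects tokens or which divergence it uses.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: reverse KL is used here for illustration only.
    """
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def localized_distill_loss(student_logits, teacher_logits, critical_mask):
    """Average the per-token KL over masked (safety-critical) positions only.

    Positions where the mask is False contribute nothing, so the student
    is not pulled toward the teacher on ordinary tokens.
    """
    selected = [
        token_kl(s, t)
        for s, t, keep in zip(student_logits, teacher_logits, critical_mask)
        if keep
    ]
    return sum(selected) / len(selected) if selected else 0.0


# Toy example: two token positions over a three-word vocabulary.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]

loss_on_critical = localized_distill_loss(student, teacher, [True, False])
loss_on_ordinary = localized_distill_loss(student, teacher, [False, True])
```

In this toy case the student and teacher disagree only at the first position, so masking in that position gives a positive loss, while masking in only the second position gives zero. Restricting the loss this way is what would leave the model's behavior on ordinary tokens untouched.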