arXiv – CS AI · 6h ago
🧠
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
Researchers demonstrate that large reasoning models (LRMs) expose safety vulnerabilities in their intermediate reasoning traces that do not appear in the final answers, creating a blind spot in current safety evaluation practices. Using adaptive multi-principle steering, they achieve up to a 40.8% reduction in unsafe outputs while maintaining task accuracy. The results suggest that safety must be evaluated across the full reasoning–answer trajectory rather than only the final response.
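The summary names the technique but not its mechanics. As a rough intuition only, the sketch below shows what "adaptive multi-principle steering" could look like if realized as activation steering with several safety-principle directions whose intervention strength scales with how strongly each reasoning step activates them; the dimensions, principle names, and functions (`HIDDEN_DIM`, `principle_risk`, `adaptive_multi_principle_steer`, `strength`) are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical safety principles, each represented as a steering direction in the
# model's hidden-state space. In practice such directions might be derived from
# contrastive prompt pairs (safe vs. unsafe completions); here they are random
# placeholders so the example runs without a model.
HIDDEN_DIM = 16
rng = np.random.default_rng(0)
PRINCIPLES = {
    "no_harmful_instructions": rng.standard_normal(HIDDEN_DIM),
    "no_privacy_violations": rng.standard_normal(HIDDEN_DIM),
    "no_deceptive_reasoning": rng.standard_normal(HIDDEN_DIM),
}


def principle_risk(hidden_state: np.ndarray, direction: np.ndarray) -> float:
    """Cosine similarity between a hidden state and a principle's 'unsafe' direction."""
    denom = np.linalg.norm(hidden_state) * np.linalg.norm(direction) + 1e-8
    return float(hidden_state @ direction / denom)


def adaptive_multi_principle_steer(hidden_state: np.ndarray, strength: float = 2.0) -> np.ndarray:
    """Subtract each principle direction, weighted by how strongly the current
    step activates it; steps that look safe are left nearly untouched, which is
    one way task accuracy could be preserved."""
    steered = hidden_state.copy()
    for direction in PRINCIPLES.values():
        risk = max(0.0, principle_risk(steered, direction))  # intervene only on positive risk
        steered = steered - strength * risk * direction / np.linalg.norm(direction)
    return steered


# Apply steering at every intermediate reasoning step, not only at the final answer,
# mirroring the point that unsafe content can live in the trace alone.
reasoning_trace = [rng.standard_normal(HIDDEN_DIM) for _ in range(5)]
steered_trace = [adaptive_multi_principle_steer(h) for h in reasoning_trace]

for step, (before, after) in enumerate(zip(reasoning_trace, steered_trace)):
    worst_before = max(principle_risk(before, d) for d in PRINCIPLES.values())
    worst_after = max(principle_risk(after, d) for d in PRINCIPLES.values())
    print(f"step {step}: max principle risk {worst_before:+.2f} -> {worst_after:+.2f}")
```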