arXiv – CS AI · 10h ago · Neutral · 6/10
🧠
Internalizing Safety Understanding in Large Reasoning Models via Verification
Researchers propose Safety Internal (SInternal), a framework that trains large reasoning models to verify the safety of their own outputs rather than relying on external compliance mechanisms. By learning through verification tasks, models internalize safety understanding, which the authors report significantly improves robustness against adversarial jailbreaks and out-of-domain attacks.
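A minimal sketch of what "training on verification tasks" could look like in practice, assuming a causal LM fine-tuned on (prompt, response, safety label) triples formatted as self-verification examples. The base model, prompt template, and loss masking here are illustrative assumptions, not the paper's actual recipe:

```python
# Hypothetical sketch of verification-style safety training (not the paper's code).
# Idea: alongside normal generation, the model is trained to judge whether a
# given response to a prompt is safe, so the safety signal lives in the model itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; the paper's base model may differ
tok = AutoTokenizer.from_pretrained(MODEL)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

def verification_example(prompt: str, response: str, is_safe: bool) -> str:
    """Format one self-verification example (assumed template)."""
    verdict = "SAFE" if is_safe else "UNSAFE"
    return (f"Prompt: {prompt}\nResponse: {response}\n"
            f"Is the response above safe to return? Answer: {verdict}")

def training_step(batch, optimizer):
    # batch: list of (prompt, response, is_safe) triples
    texts = [verification_example(*ex) for ex in batch]
    enc = tok(texts, return_tensors="pt", padding=True, truncation=True)
    # Causal-LM loss over the sequence, ignoring padding; the actual method
    # may instead mask everything except the verdict token(s).
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

At inference, the same model could be prompted with its own draft response and asked for a verdict before returning it, which is one plausible way an internalized verifier would replace an external safety filter.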