AIBullish · arXiv CS AI · 17h ago · 6/10
Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check
Researchers introduce Answer-Then-Check, a safety alignment approach for large language models in which the model evaluates the safety of a drafted response before presenting it to the user. The method is trained on a new 80K-sample dataset, Reasoned Safety Alignment (ReSA), and demonstrates improved jailbreak defense while preserving general reasoning capabilities.
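The described flow can be illustrated with a minimal sketch: generate a candidate answer first, then run a safety check on it, and only release it if the check passes. All names (`draft_answer`, `is_safe`, `answer_then_check`) and the trivial keyword filter are illustrative assumptions, not the paper's actual implementation.

```python
def draft_answer(prompt: str) -> str:
    # Stand-in for the model generating a candidate response (assumption).
    return f"[candidate response to: {prompt}]"


def is_safe(prompt: str, answer: str) -> bool:
    # Stand-in for the reasoned safety check; here a trivial
    # keyword filter purely for illustration (assumption).
    banned = ("build a bomb", "synthesize a toxin")
    text = (prompt + " " + answer).lower()
    return not any(b in text for b in banned)


def answer_then_check(prompt: str) -> str:
    # Draft first, check second, and only output if judged safe.
    candidate = draft_answer(prompt)
    if is_safe(prompt, candidate):
        return candidate
    return "I can't help with that request."
```

The key design point is ordering: because the check sees the concrete drafted answer rather than only the prompt, it can catch jailbreaks whose harm is visible only in the output.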
Hugging Face