Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
Researchers propose a novel framework using zeroth-order optimization to enhance the robustness of safety alignment in large language models against perturbations like parameter noise and quantization. The hybrid approach combines standard first-order safety alignment with zeroth-order refinement steps, demonstrating that weak safety mechanisms can be significantly strengthened while maintaining model utility with minimal computational overhead.