y0news

#model-hardening News & Analysis

2 articles tagged with #model-hardening. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Researchers propose a novel framework using zeroth-order optimization to enhance the robustness of safety alignment in large language models against perturbations like parameter noise and quantization. The hybrid approach combines standard first-order safety alignment with zeroth-order refinement steps, demonstrating that weak safety mechanisms can be significantly strengthened while maintaining model utility with minimal computational overhead.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

A Systematic Investigation of The RL-Jailbreaker in LLMs

Researchers systematically decomposed reinforcement learning (RL)-based jailbreaking attacks on large language models and identified dense reward functions and extended episode lengths as the primary drivers of adversarial success. The study reports that all tested models and safeguards were compromised, and it offers insights for both attack efficiency and defensive hardening strategies.