AINeutralarXiv – CS AI · 8h ago7/10
🧠
Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
Researchers demonstrate that safety-aligned large language models remain vulnerable to token injections at any point during generation, not just early in the output sequence. By training models directly on generation trajectories with mid-sequence perturbations, they achieve improved robustness that generalizes across different attack vectors, revealing that robust AI safety requires alignment of the entire generation process rather than just output supervision.