Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
Researchers demonstrate that safety-aligned large language models remain vulnerable to token injections at any point during generation, not just early in the output sequence. By training models directly on generation trajectories that contain mid-sequence perturbations, they obtain models that better resist mid-sequence injections and that also hold up against early-token attacks. The result suggests that robust AI safety requires aligning the entire generation process rather than supervising outputs alone.
This research addresses a gap in AI safety that extends beyond previously identified vulnerabilities. Prior work identified "shallow safety", where alignment concentrates in the first few output tokens. This study reveals a more systemic problem: a model can be redirected toward harmful outputs by an injection at any generation step. The finding matters because it suggests that current safety training methods rest on a flawed assumption, namely that supervising outputs is enough to keep the whole generation aligned.
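To make the threat model concrete, the following minimal Python sketch shows what an injection at an arbitrary generation step looks like at the token level. The function names `inject_at` and `sweep_positions`, the word-level tokens, and the injected phrase are illustrative assumptions and are not taken from the study.

```python
from typing import List


def inject_at(response: List[str], position: int, injection: List[str]) -> List[str]:
    """Return the prefix a model would be forced to continue from.

    The model's own output is kept up to `position`, then the attacker's
    tokens are appended. position == 0 corresponds to an early-token attack;
    any larger value is a mid-sequence injection.
    """
    if not 0 <= position <= len(response):
        raise ValueError("position must lie within the response")
    return response[:position] + injection


def sweep_positions(response: List[str], injection: List[str]) -> List[List[str]]:
    """Build one perturbed prefix for every possible injection step."""
    return [inject_at(response, k, injection) for k in range(len(response) + 1)]


# Hypothetical example: a refusal that is interrupted by a compliant phrase.
refusal = ["I", "cannot", "help", "with", "that", "."]
injection = ["Sure", ",", "here"]

print(inject_at(refusal, 0, injection))  # early-token attack
print(inject_at(refusal, 3, injection))  # mid-sequence injection
```

Sweeping the position rather than fixing it at zero is the point of the sketch: a model that is only robust at the first step would look safe under an early-token test and still fail further along the sequence.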
The research builds on growing concerns about inference-time attacks on aligned language models. As LLMs become more prevalent in sensitive applications, the ability to bypass safety measures by injecting a few tokens during generation poses significant risks. The finding that internal alignment, measured by a model's representation of refusal directions, does not predict robustness to perturbation is particularly revealing. It suggests that safety is maintained dynamically over the course of generation rather than encoded as a static internal representation.
The practical implications are substantial for developers and deployers of LLMs. Current safety evaluation benchmarks may give false confidence in model robustness, since they typically assess final outputs rather than the generation process under adversarial conditions. The proposed solution, training on generation trajectories that simulate mid-sequence perturbations, adds computational overhead during training, but the results suggest it is needed for robustness that holds throughout generation.
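One way to picture trajectory-based training data is sketched below in Python. This is a simplified illustration under stated assumptions, not the authors' pipeline: the `TrajectoryExample` type, the `build_perturbed_example` helper, the loss-mask convention, and the uniform sampling of the injection position are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class TrajectoryExample:
    tokens: List[str]     # full training sequence
    loss_mask: List[int]  # 1 where the training loss applies, 0 elsewhere


def build_perturbed_example(
    prompt: List[str],
    response: List[str],
    injection: List[str],
    recovery: List[str],
    position: int,
) -> TrajectoryExample:
    """Simulate a mid-sequence perturbation and supervise the recovery.

    The context is prompt + response[:position] + injection. Only the
    recovery tokens are supervised (assumed convention), so the model
    learns to return to safe behaviour after the injected tokens.
    """
    if not 0 <= position <= len(response):
        raise ValueError("position must lie within the response")
    context = prompt + response[:position] + injection
    return TrajectoryExample(
        tokens=context + recovery,
        loss_mask=[0] * len(context) + [1] * len(recovery),
    )


def sample_position(response: List[str], rng: random.Random) -> int:
    """Pick an injection step uniformly over the whole response (assumption)."""
    return rng.randint(0, len(response))


# Hypothetical usage with word-level tokens.
rng = random.Random(0)
prompt = ["User:", "<harmful request>", "Assistant:"]
response = ["I", "cannot", "help", "with", "that", "."]
example = build_perturbed_example(
    prompt,
    response,
    injection=["Sure", ",", "here"],
    recovery=["Actually", ",", "I", "cannot", "continue", "."],
    position=sample_position(response, rng),
)
```

Sampling the position across the whole response, instead of always perturbing the first tokens, is what distinguishes this setup from training that only hardens the opening of a response. The extra sequences per prompt are also where the additional training cost comes from.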
Looking forward, this work suggests safety research must shift from output-focused approaches toward process-based alignment. Because training on mid-sequence perturbations also improves resistance to early-token attacks, the method appears to address more than one vulnerability class at once. As LLMs are integrated into high-stakes domains, whether trajectory-based training becomes standard practice may determine whether future model generations achieve meaningful safety improvements or merely cosmetic gains.
- Safety vulnerabilities in LLMs extend throughout generation, not just early token positions, so defenses must cover the entire output sequence
- Internal alignment representations don't predict robustness to perturbation, indicating safety operates dynamically during generation
- Training on generation trajectories with simulated mid-sequence attacks improves both mid-sequence and early-token attack resistance
- Current safety evaluation methods may overestimate model robustness by assessing only final outputs rather than the generation process
- Robust alignment requires computational investment in process-level training rather than reliance on output supervision alone
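
The evaluation gap noted above can be illustrated with a short Python sketch that reports robustness per injection position instead of a single output-level score. The function names and the example numbers are hypothetical and are not results from the study; how a "recovered" outcome is judged is left to whatever safety classifier an evaluator already uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def recovery_rate_by_position(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Aggregate (injection_position, recovered) pairs into a rate per position.

    `recovered` is True when the model returned to safe behaviour after
    the injection at that position.
    """
    totals: Dict[int, int] = defaultdict(int)
    recovered: Dict[int, int] = defaultdict(int)
    for position, ok in results:
        totals[position] += 1
        recovered[position] += int(ok)
    return {p: recovered[p] / totals[p] for p in sorted(totals)}


def worst_case_position(rates: Dict[int, float]) -> int:
    """Return the injection position with the lowest recovery rate."""
    return min(rates, key=rates.get)


# Made-up outcomes, for illustration only.
results = [(0, True), (0, True), (8, True), (8, False), (32, False), (32, False)]
rates = recovery_rate_by_position(results)
print(rates)
print(worst_case_position(rates))
```

Reporting the worst position alongside the average keeps a weak point late in the sequence from being hidden by strong performance at the start.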