ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
Researchers introduce ALIGNBEAM, a training-free inference-time defense that transfers safety alignment between different language model families by translating logits across vocabularies. The method addresses a critical gap where existing safety defenses fail for cross-family model pairs, enabling safety constraints without modifying model weights or retraining.
ALIGNBEAM tackles a fundamental vulnerability in domain-specialized language models: fine-tuning for specific tasks inadvertently degrades safety guardrails, causing models to comply with harmful requests framed in domain-specific language. Existing defenses mix the logits of a safe anchor model with those of the specialist, which requires both models to share a vocabulary. The approach therefore breaks down for pairs drawn from different model families, which is precisely where safety degradation is most severe.
The research builds on inference-time alignment techniques that have emerged as practical alternatives to expensive retraining. Rather than modifying model weights, ALIGNBEAM translates anchor model logits into the target vocabulary token-by-token during decoding, then uses a lightweight LLM judge to select the safest among multiple candidate outputs. This approach maintains computational efficiency while preserving model utility.
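The decoding loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how tokens are matched across vocabularies, so the sketch assumes exact string matching with a floor value for unmatched tokens. The names `translate_logits`, `mix_logits`, `select_safest`, `alpha`, and `toy_judge` are hypothetical, and the judge is a keyword stand-in for an LLM.

```python
import math
from typing import Callable, Dict, List

Logits = Dict[str, float]  # token string -> raw logit


def log_softmax(logits: Logits) -> Logits:
    """Normalise raw logits to log-probabilities so both models share a scale."""
    m = max(logits.values())
    lse = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - lse for tok, v in logits.items()}


def translate_logits(anchor_logits: Logits, target_vocab: List[str]) -> Logits:
    """Re-express anchor logits over the target vocabulary.

    Assumption: tokens are matched by exact surface string. Target tokens the
    anchor lacks receive the anchor's lowest logit; anchor-only tokens are dropped.
    """
    floor = min(anchor_logits.values())
    return {tok: anchor_logits.get(tok, floor) for tok in target_vocab}


def mix_logits(target_logits: Logits, translated_anchor: Logits, alpha: float) -> Logits:
    """Interpolate log-probabilities: alpha=0 is pure specialist, alpha=1 pure anchor."""
    t = log_softmax(target_logits)
    a = log_softmax(translated_anchor)
    return {tok: (1.0 - alpha) * t[tok] + alpha * a[tok] for tok in t}


def select_safest(candidates: List[str], judge: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest (higher score = safer)."""
    return max(candidates, key=judge)


# Toy data: the two "models" disagree on the next token for a harmful prompt.
anchor_logits = {"I": 2.0, "cannot": 3.0, "Sure": -1.0, "extra": 0.5}
target_vocab = ["Sure", "cannot", "I", "dose"]
target_logits = {"Sure": 4.0, "cannot": 0.0, "I": 0.5, "dose": 1.0}

translated = translate_logits(anchor_logits, target_vocab)
mixed = mix_logits(target_logits, translated, alpha=0.7)
next_token = max(mixed, key=mixed.get)


def toy_judge(text: str) -> float:
    """Stand-in for the LLM judge: a keyword check, not a real safety classifier."""
    return 1.0 if "cannot" in text else 0.0


best = select_safest(["Sure, here is how...", "I cannot help with that."], toy_judge)
print(next_token, "|", best)
```

In a real system the translation step would have to handle tokens that split differently in the two tokenizers, which exact string matching does not address.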
For AI safety researchers and practitioners deploying specialized models, the practical benefit is that safety alignment can be transferred across model families without touching weights. This removes deployment friction and lets safety-utility trade-offs be tuned at runtime rather than through retraining pipelines. The reported benchmarks show higher adversarial refusal rates while task accuracy and inference overhead remain practical.
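To make the runtime trade-off concrete, the sketch below sweeps a single mixing weight and reports how the probability of a refusal token changes. It assumes the anchor distribution has already been translated into the specialist's vocabulary. The weight name `alpha`, the function `mix_distributions`, and all probability values are made up for illustration and are not taken from the paper.

```python
import math
from typing import Dict


def mix_distributions(p_specialist: Dict[str, float],
                      p_anchor: Dict[str, float],
                      alpha: float) -> Dict[str, float]:
    """Geometric mixture of two next-token distributions over a shared vocabulary.

    Equivalent to linearly interpolating log-probabilities and renormalising.
    alpha=0 reproduces the specialist, alpha=1 reproduces the anchor.
    """
    unnorm = {
        tok: math.exp((1.0 - alpha) * math.log(p_specialist[tok])
                      + alpha * math.log(p_anchor[tok]))
        for tok in p_specialist
    }
    z = sum(unnorm.values())
    return {tok: v / z for tok, v in unnorm.items()}


# Made-up next-token distributions for a harmful prompt (illustrative only).
p_specialist = {"comply": 0.90, "refuse": 0.05, "other": 0.05}
p_anchor = {"comply": 0.10, "refuse": 0.85, "other": 0.05}

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
refusal = [mix_distributions(p_specialist, p_anchor, a)["refuse"] for a in alphas]
for a, r in zip(alphas, refusal):
    print(f"alpha={a:.2f}  P(refuse)={r:.3f}")
```

Because the weight is only a decoding-time parameter in this sketch, an operator could raise it for high-risk deployments and lower it where task accuracy matters more, without any retraining.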
The work's broader implications extend to enterprise AI deployment, where specialized models handle sensitive domains. Organizations can fine-tune models for specific tasks without giving up baseline safety behavior, provided they apply ALIGNBEAM at inference. This decouples the optimization pressures of domain performance and safety, a critical consideration for regulated industries.
- ALIGNBEAM enables safety alignment transfer across different model families using only inference-time modifications with no weight changes.
- Cross-vocabulary logit translation solves the fundamental limitation preventing existing defenses from working on incompatible model pairs.
- Safety-utility trade-offs can be adjusted at deployment without retraining, providing operational flexibility for specialized models.
- Empirical results show substantial improvements in adversarial refusal rates while maintaining practical inference overhead and task accuracy.
- The training-free approach makes safety hardening accessible for domain specialists without requiring expensive computational resources.