#logit-steering News & Analysis

2 articles tagged with #logit-steering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

Researchers report that safety mechanisms in large language models operate as linear features in the output layer rather than as deep semantic principles, which allows them to be manipulated or inverted through Contrastive Logit Steering. The finding exposes a fundamental vulnerability in current alignment techniques and also suggests a way to strengthen defenses without retraining.

🧠 Llama
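
The summary does not spell out how Contrastive Logit Steering is computed, so the following is only a minimal sketch of the general idea under stated assumptions: if refusal behaves like a linear direction in logit space, adding or subtracting the difference between two contrasting logit vectors should strengthen or invert it. The function names, the toy vocabulary, and every logit value are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_steer(base_logits, pos_logits, neg_logits, alpha):
    """Shift base logits along the (pos - neg) direction, scaled by alpha.

    Assumption: this linear shift is an illustration of logit-level
    steering in general, not the paper's confirmed procedure.
    """
    return [b + alpha * (p - n)
            for b, p, n in zip(base_logits, pos_logits, neg_logits)]

def top_token(logits, vocab):
    """Return the vocabulary item with the highest logit."""
    return vocab[max(range(len(logits)), key=lambda i: logits[i])]

# Toy three-token vocabulary and made-up logits (all hypothetical).
VOCAB = ["Sure", "Sorry", "the"]
base = [1.0, 2.0, 0.5]        # unsteered next-token logits
refusing = [0.0, 3.0, 0.5]    # logits under a refusal-inducing context
complying = [3.0, 0.0, 0.5]   # logits under a compliance-inducing context

# Positive alpha pushes toward refusal; negative alpha inverts it.
strengthened = contrastive_steer(base, refusing, complying, alpha=1.0)
inverted = contrastive_steer(base, refusing, complying, alpha=-1.0)

print(top_token(strengthened, VOCAB), softmax(strengthened))
print(top_token(inverted, VOCAB), softmax(inverted))
```

In this toy setup the sign of `alpha` alone decides whether the refusal token or the compliance token wins, which is the kind of linear fragility, and the kind of retraining-free defense, that the summary describes.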

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Steering Language Models Before They Speak: Logit-Level Interventions

Researchers introduce SWAI, a training-free method for controlling language model outputs by adjusting logit scores with corpus-derived statistics. The technique steers model behavior in real time (for example, readability, politeness, and toxicity) without modifying model weights or accessing internal layers, and it outperforms existing prompt-based and logit-level baselines.
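
The summary does not say which corpus statistics SWAI uses, so the sketch below swaps in one simple possibility as an assumption: a smoothed log-ratio of token frequencies between a target-style corpus and a reference corpus, added to the logits as a bias. The helper names `corpus_bias` and `steer`, the tiny corpora, and the logit values are all hypothetical.

```python
import math
from collections import Counter

def corpus_bias(target_tokens, reference_tokens, vocab, smoothing=1.0):
    """Per-token log-ratio of smoothed frequencies (target vs. reference).

    Assumption: a log-ratio stands in for whatever statistic the
    method actually derives from its corpora.
    """
    t_counts = Counter(target_tokens)
    r_counts = Counter(reference_tokens)
    t_total = len(target_tokens) + smoothing * len(vocab)
    r_total = len(reference_tokens) + smoothing * len(vocab)
    return [
        math.log((t_counts[w] + smoothing) / t_total)
        - math.log((r_counts[w] + smoothing) / r_total)
        for w in vocab
    ]

def steer(logits, bias, strength):
    """Add the scaled bias to the logits; model weights are never touched."""
    return [l + strength * b for l, b in zip(logits, bias)]

def top_token(logits, vocab):
    """Return the vocabulary item with the highest logit."""
    return vocab[max(range(len(logits)), key=lambda i: logits[i])]

# Toy vocabulary and corpora for a "politeness" attribute (hypothetical).
VOCAB = ["please", "now", "thanks", "do"]
polite_corpus = ["please", "thanks", "please", "do"]
blunt_corpus = ["now", "do", "now", "do"]

bias = corpus_bias(polite_corpus, blunt_corpus, VOCAB)
logits = [0.5, 1.0, 0.2, 0.8]          # made-up next-token logits
steered = steer(logits, bias, strength=2.0)

print(top_token(logits, VOCAB), "->", top_token(steered, VOCAB))
```

Because the bias is computed once from text and applied only at the output, a scheme like this needs no gradient updates and no access to hidden states, which matches the training-free, logit-level framing in the summary; `strength` plays the role of a steering dial.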