AI · Neutral · arXiv – CS AI · 14h ago · 6/10
Steering Language Models Before They Speak: Logit-Level Interventions
Researchers introduce SWAI, a training-free method for controlling language model outputs by adjusting the model's logit scores with corpus-derived statistics. The technique steers model behavior in real time, for example adjusting readability, politeness, and toxicity, without modifying model weights or accessing internal layers. The authors report that it outperforms existing prompt-based and logit-level baselines.
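To make the general idea of a logit-level intervention concrete, here is a minimal Python sketch. It is not SWAI's actual algorithm: the summary does not say which corpus statistics the method uses, so the smoothed log-odds bias, the toy corpora, and the names `log_odds_bias`, `steer_logits`, and `strength` are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_odds_bias(target_corpus, background_corpus, vocab, smoothing=1.0):
    """Per-token bias: smoothed log-odds of a token in the target corpus
    versus the background corpus. ASSUMPTION: an illustrative statistic,
    not the one SWAI uses."""
    target_counts = Counter(tok for doc in target_corpus for tok in doc.split())
    background_counts = Counter(tok for doc in background_corpus for tok in doc.split())
    # Add-one style smoothing so unseen tokens never produce log(0).
    target_total = sum(target_counts[t] for t in vocab) + smoothing * len(vocab)
    background_total = sum(background_counts[t] for t in vocab) + smoothing * len(vocab)
    bias = {}
    for tok in vocab:
        p_target = (target_counts[tok] + smoothing) / target_total
        p_background = (background_counts[tok] + smoothing) / background_total
        bias[tok] = math.log(p_target / p_background)
    return bias


def steer_logits(logits, bias, strength=1.0):
    """Shift each logit by strength * bias. Model weights are untouched."""
    return {tok: score + strength * bias.get(tok, 0.0) for tok, score in logits.items()}


def softmax(logits):
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Toy data (hypothetical): steer toward a "polite" style.
vocab = ["please", "now", "thanks", "move"]
polite_corpus = ["please move thanks", "thanks please"]
blunt_corpus = ["move now", "now move now"]

bias = log_odds_bias(polite_corpus, blunt_corpus, vocab)
raw_logits = {tok: 0.0 for tok in vocab}  # stand-in for a model's next-token logits
steered = softmax(steer_logits(raw_logits, bias, strength=1.0))
unsteered = softmax(steer_logits(raw_logits, bias, strength=0.0))
```

In this sketch `strength` acts as a steering dial: at 0.0 the original distribution is returned unchanged, and larger values shift more probability toward tokens that are over-represented in the target corpus. In a real decoding loop the same additive shift would be applied to the model's next-token logits at every generation step.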