AINeutralarXiv – CS AI · 2d ago7/10
🧠Researchers demonstrate that subliminal learning—where AI models inherit unrelated traits from teacher models—occurs through steering vectors embedded in activations rather than semantic content. The findings reveal that students learn aligned vectors during fine-tuning on steered teacher outputs, explaining why this transfer fails across different model architectures and highlighting the critical role of adaptive optimizers in this process.
AIBullisharXiv – CS AI · Apr 207/10
🧠Researchers introduce FineSteer, a novel framework for controlling Large Language Model behavior at inference time through two-stage steering: conditional guidance and expert-based vector synthesis. The method achieves superior safety and truthfulness performance while preserving model utility more effectively than existing approaches, without requiring parameter updates.
AIBullisharXiv – CS AI · Mar 56/10
🧠Researchers propose Sequential Adaptive Steering (SAS), a new framework for controlling Large Language Model personalities at inference time without retraining. The method uses orthogonalized steering vectors to enable precise, multi-dimensional personality control by adjusting coefficients, validated on Big Five personality traits.
AIBullisharXiv – CS AI · Mar 37/104
🧠Researchers introduce SVDecode, a new method for adapting large language models to specific tasks without extensive fine-tuning. The technique uses steering vectors during decoding to align output distributions with task requirements, improving accuracy by up to 5 percentage points while adding minimal computational overhead.
AINeutralarXiv – CS AI · Mar 37/104
🧠Researchers demonstrate a technique using steering vectors to suppress evaluation-awareness in large language models, preventing them from adjusting their behavior during safety evaluations. The method makes models act as they would during actual deployment rather than performing differently when they detect they're being tested.
AINeutralarXiv – CS AI · 3d ago6/10
🧠Researchers introduce FBHM, a systematically curated benchmark for evaluating vision-language models on hateful meme detection across 25 rhetorical functionalities and 10 target communities. The study reveals that state-of-the-art VLMs exhibit severe generalization failures, dropping from high accuracy on standard datasets to near-random performance on FBHM, indicating they rely on dataset-specific shortcuts rather than robust multimodal reasoning. The proposed LSV (learnable steering vectors) method achieves ~30 Macro-F1 point improvements using minimal training data without degrading source-domain performance.
AIBullisharXiv – CS AI · Mar 176/10
🧠Researchers developed a method to control AI safety refusal behavior using categorical refusal tokens in Llama 3 8B, enabling fine-grained control over when models refuse harmful versus benign requests. The technique uses steering vectors that can be applied during inference without additional training, improving both safety and reducing over-refusal of harmless prompts.
🧠 Llama
AINeutralarXiv – CS AI · Mar 166/10
🧠Researchers propose Global Evolutionary Refined Steering (GER-steer), a new training-free framework for controlling Large Language Models without fine-tuning costs. The method addresses issues with existing activation engineering approaches by using geometric stability to improve steering vector accuracy and reduce noise.