AI · Neutral · arXiv – CS AI · 5d ago
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Researchers demonstrate a technique using steering vectors to suppress evaluation-awareness in large language models, preventing them from adjusting their behavior during safety evaluations. The method makes models act as they would during actual deployment rather than performing differently when they detect they're being tested.
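The summary does not say how the steering vectors are built. As a rough illustration only, the sketch below uses one common recipe, a difference of mean activations between two sets of contexts, added to the hidden states at inference time. The function names, the toy 4-dimensional activations, and the `scale` parameter are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def steering_vector(deploy_acts, eval_acts):
    """Difference of mean activations, pointing from 'evaluation' toward 'deployment'.

    Both inputs are (n_examples, hidden_dim) arrays (hypothetical activations).
    """
    return deploy_acts.mean(axis=0) - eval_acts.mean(axis=0)

def apply_steering(hidden, vector, scale=1.0):
    """Add the scaled steering vector to the hidden state at every position."""
    return hidden + scale * vector

# Toy data standing in for model activations (assumed, not real measurements).
rng = np.random.default_rng(0)
eval_acts = rng.normal(loc=0.0, size=(32, 4))
deploy_acts = rng.normal(loc=1.0, size=(32, 4))

v = steering_vector(deploy_acts, eval_acts)
hidden = rng.normal(size=(5, 4))  # 5 token positions, 4 hidden dimensions
steered = apply_steering(hidden, v, scale=2.0)
```

In a real model the addition would be applied to a chosen layer's activations during the forward pass, and the choice of layer and scale would need to be tuned empirically.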