y0news
#deceptive-alignment (1 article)
AI · Neutral · arXiv – CS AI · 5d ago · 7/104

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

Researchers demonstrate a technique using steering vectors to suppress evaluation-awareness in large language models, preventing them from adjusting their behavior during safety evaluations. The method makes models act as they would during actual deployment rather than performing differently when they detect they're being tested.
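
The summary does not say how the steering vectors are built or where they are applied. As a rough illustration only, the sketch below shows the generic difference-of-means construction commonly used for activation steering: average the activations from two contrasting prompt sets, subtract, and add the scaled result to a hidden state. Everything here is an assumption made for illustration, including the function names `steering_vector` and `apply_steering`, the `scale` parameter, and the random arrays standing in for real model activations. It should not be read as the researchers' implementation.

```python
import numpy as np


def steering_vector(deploy_acts, eval_acts):
    """Difference of mean activations: 'deployment' prompts minus 'evaluation' prompts."""
    return deploy_acts.mean(axis=0) - eval_acts.mean(axis=0)


def apply_steering(hidden, vector, scale=1.0):
    """Add the scaled steering vector to the hidden state at every token position."""
    return hidden + scale * vector


rng = np.random.default_rng(0)
d_model = 8  # toy hidden size (assumption)

# Toy stand-ins for activations collected on two contrasting prompt sets.
deploy_acts = rng.normal(loc=1.0, size=(16, d_model))
eval_acts = rng.normal(loc=-1.0, size=(16, d_model))

vec = steering_vector(deploy_acts, eval_acts)

hidden = rng.normal(size=(4, d_model))  # hidden states for 4 token positions
steered = apply_steering(hidden, vec, scale=2.0)

print(vec.shape, steered.shape)
```

In a real model the addition would happen inside a forward pass at a chosen layer, and both the layer and the scale would need tuning; neither is specified in the summary above.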