AINeutralarXiv โ CS AI ยท 5h ago1
๐ง
Understanding and Mitigating Dataset Corruption in LLM Steering
Research reveals that contrastive steering, a method for adjusting LLM behavior during inference, is moderately robust to data corruption but vulnerable to malicious attacks when significant portions of training data are compromised. The study identifies geometric patterns in corruption types and proposes using robust mean estimators as a safeguard against unwanted effects.