AI · Neutral · arXiv – CS AI · 6h ago · 7/10
VFUSE: Virulent Feature Understanding with Sparse autoEncoders
Researchers introduce VFUSE, a mechanistic interpretability tool that uses sparse autoencoders to audit protein design models for hazardous features. The approach identifies virulent design patterns in popular open-weight models such as RoseTTAFold3 and RFDiffusion3, reaching a detection AUROC of up to 0.84 while preserving model performance.