AI · Neutral · arXiv CS AI · 10h ago · 6/10
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Researchers introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated dictionary of 15,000 multimodal concepts and Sparse Autoencoders to improve safety in multimodal large language models by steering their activations at inference time. Testing across multiple models shows DACO significantly enhances safety performance while preserving general-purpose capabilities without requiring model retraining.
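For readers unfamiliar with inference-time activation steering, here is a minimal generic sketch of the idea. It is not DACO's actual procedure, which the summary does not detail. It assumes a hypothetical concept direction (for example, one that a Sparse Autoencoder might associate with an unsafe concept) and removes that direction's component from a hidden activation. The names `steer`, `normalize`, `hidden`, and `unsafe_direction`, and the toy three-dimensional vectors, are illustrative assumptions.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("concept direction must be non-zero")
    return [x / norm for x in v]


def steer(activation, concept_direction, strength=1.0):
    """Subtract the component of `activation` along `concept_direction`.

    strength=1.0 removes the component entirely; smaller values only
    dampen it. This is a generic projection-based edit, assumed here
    for illustration, not the method from the paper.
    """
    if len(activation) != len(concept_direction):
        raise ValueError("activation and direction must have equal length")
    d = normalize(concept_direction)
    # Scalar projection of the activation onto the unit concept direction.
    coeff = sum(a * x for a, x in zip(activation, d))
    return [a - strength * coeff * x for a, x in zip(activation, d)]


# Toy example: a 3-dimensional "hidden state" and a hypothetical unsafe direction.
hidden = [2.0, 1.0, 0.0]
unsafe_direction = [1.0, 0.0, 0.0]
steered = steer(hidden, unsafe_direction)
```

In a real model the same edit would be applied to a layer's activations during the forward pass, which is why no retraining is involved: the weights stay fixed and only the intermediate representation changes.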