TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment
Researchers introduce TEVI, a framework using sparse autoencoders to improve vision-language alignment in models like CLIP by selectively filtering image embeddings based on text captions. The method addresses a fundamental information imbalance where images contain more data than captions describe, demonstrating improved retrieval performance across multiple benchmarks.
TEVI addresses an inefficiency in the vision-language models that power many AI applications. Its core insight is that image embeddings contain substantially more information than their corresponding text descriptions, which offers one explanation for why CLIP and similar models struggle with alignment. TEVI uses sparse autoencoders to decompose image embeddings into interpretable features, and its masking module learns to retain only caption-relevant features, filtering out information the caption does not mention. The approach echoes recent mechanistic interpretability research, which emphasizes understanding and steering neural network internals.
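The decompose, mask, and reconstruct flow described above can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the ReLU autoencoder, the sigmoid gate predicted from the text embedding, the dimensions, and all names (`sae_encode`, `text_mask`, `filter_image_embedding`) are hypothetical, and the random weights stand in for trained ones. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D_EMBED = 8    # embedding width (hypothetical; real CLIP embeddings are larger)
D_SPARSE = 32  # sparse dictionary size (hypothetical)

# Randomly initialised stand-ins for trained weights (assumption).
W_enc = rng.normal(scale=0.1, size=(D_EMBED, D_SPARSE))
b_enc = np.zeros(D_SPARSE)
W_dec = rng.normal(scale=0.1, size=(D_SPARSE, D_EMBED))
W_mask = rng.normal(scale=0.1, size=(D_EMBED, D_SPARSE))


def sae_encode(image_emb):
    """Map a dense embedding to non-negative feature activations (ReLU)."""
    return np.maximum(image_emb @ W_enc + b_enc, 0.0)


def sae_decode(features):
    """Map feature activations back to the embedding space."""
    return features @ W_dec


def text_mask(text_emb):
    """Per-feature gate in (0, 1), predicted from the text embedding."""
    return 1.0 / (1.0 + np.exp(-(text_emb @ W_mask)))


def filter_image_embedding(image_emb, text_emb):
    """Keep features the caption supports, then reconstruct the embedding."""
    features = sae_encode(image_emb)
    kept = features * text_mask(text_emb)  # down-weight caption-irrelevant features
    return sae_decode(kept)


image_emb = rng.normal(size=D_EMBED)
text_emb = rng.normal(size=D_EMBED)
filtered = filter_image_embedding(image_emb, text_emb)
```

In this sketch the gate is soft, so features are scaled down, never fully removed. A hard, thresholded mask would be an equally plausible reading of "retain only caption-relevant information".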
The research builds on growing recognition that embedding space quality fundamentally constrains downstream performance. Vision-language models underpin recommendation systems, semantic search, multimodal retrieval, and emerging applications in embodied AI. Poor alignment between modalities introduces systematic errors that compound across applications. TEVI's controlled experiments with synthetic captions demonstrate that the mechanism works, while experiments on natural images validate its practical utility.
The retrieval improvements span short-caption benchmarks (MS COCO, Flickr) and benchmarks with long, detailed descriptions (IIW, DOCCI), suggesting that the framework holds up as caption complexity grows. The gains are stronger on richer captions, indicating that TEVI captures finer-grained image-text relationships. Improved robustness on the adversarial RoCOCO benchmark suggests that the approach does not merely overfit to specific datasets.
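Retrieval benchmarks of this kind are commonly scored with Recall@K, although this summary does not say which metric the TEVI authors report. As a hedged illustration of how such a score can be computed, the sketch below ranks images for each caption by cosine similarity and checks whether the paired image appears in the top K. The function name `recall_at_k`, the row-index pairing convention, and the toy data are assumptions.

```python
import numpy as np


def recall_at_k(image_embs, text_embs, k):
    """Fraction of captions whose paired image (same row index) ranks in the top k."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img.T                        # cosine similarity, (n_text, n_image)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # best-matching image indices per caption
    targets = np.arange(len(txt))[:, None]
    return float(np.mean(np.any(top_k == targets, axis=1)))


# Toy data: caption i is paired with image i (assumption for this example).
images = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
texts = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, -1.0]])

r1 = recall_at_k(images, texts, k=1)  # third caption retrieves the wrong image first
r2 = recall_at_k(images, texts, k=2)  # its paired image is the second-best match
```

On this toy data two of the three captions rank their paired image first, so `r1` is 2/3, and all three find it within the top two, so `r2` is 1.0.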
For practitioners, this work suggests that vision-language models have untapped headroom that better embedding alignment can unlock. The sparse autoencoder approach is described as architecture-agnostic, so it may apply to newer models beyond CLIP. Future work will likely focus on scaling to larger models and testing whether the filtering mechanism transfers across vision-language architectures.
- TEVI uses sparse autoencoders to filter image embeddings, retaining only caption-relevant features and improving vision-language alignment
- Framework demonstrates consistent retrieval improvements across short-caption and long-caption benchmarks, with stronger gains on richer descriptions
- Method addresses a fundamental information imbalance in which images contain more data than their text descriptions capture
- Enhanced robustness on adversarial benchmarks suggests the approach provides practical benefits beyond controlled laboratory settings
- Technique is architecture-agnostic and potentially applicable to vision-language models beyond CLIP