AIBullish · arXiv – CS AI · 15h ago · 6/10
CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion
Researchers present CASA, an approach that fuses vision and language by layering cross-attention over self-attention, maintaining competitive performance while substantially reducing memory and compute costs. Because it avoids inserting vision tokens into the language model's token stream, the method is particularly well suited to real-time applications such as video captioning.
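To make the efficiency claim concrete, here is a minimal NumPy sketch contrasting the two fusion strategies the summary describes. The attention function, tensor shapes, and token counts are illustrative assumptions, not CASA's actual architecture: with token insertion, self-attention cost grows with the square of the combined sequence length, while cross-attention keeps the language stream short and only pays for text-to-vision score entries.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention (illustrative, not CASA's exact layer).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 64
text = rng.normal(size=(16, d))     # 16 text tokens in the LM stream (assumed)
vision = rng.normal(size=(256, d))  # 256 vision patch features (assumed)

# Token insertion: vision tokens join the LM stream, so every self-attention
# layer computes scores over the combined (16 + 256)-token sequence.
merged = np.concatenate([text, vision])
insertion_scores = (16 + 256) ** 2   # 73984 score entries per layer

# Cross-attention fusion: text queries attend to vision keys/values via a
# residual connection; the LM stream stays at 16 tokens.
fused = text + attention(text, vision, vision)
cross_scores = 16 * 256              # 4096 score entries per layer

print(insertion_scores, cross_scores)  # 73984 4096
```

The fused output keeps the text sequence length (16 tokens here), which is why per-layer cost stays linear in the number of vision tokens rather than quadratic in the combined sequence.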