y0news

#cross-modal-learning News & Analysis

4 articles tagged with #cross-modal-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 6h ago · 6/10

T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle Retrieval

Researchers introduce PFCVR, a new AI model for text-to-image vehicle retrieval that identifies vehicles from witness descriptions rather than photos alone. The team also releases T2I-VeRW, a large-scale dataset of 14,668 annotated vehicle images, on which PFCVR achieves significant performance improvements over existing methods.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

Researchers introduce VisPrompt, a framework that improves prompt learning for vision-language models by injecting visual semantic information to enhance robustness against label noise. The approach keeps pre-trained models frozen while adding minimal trainable parameters, demonstrating superior performance across seven benchmark datasets under both synthetic and real-world noisy conditions.
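The frozen-backbone idea in the summary above, where only a small set of prompt parameters is trained while the pre-trained encoder stays fixed, can be sketched in miniature. The following is a toy NumPy illustration under invented names and dimensions (it is not the VisPrompt implementation): a frozen projection stands in for the pre-trained text encoder, and the only "trainable" tensor is a handful of shared context vectors prepended to each class embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pre-trained text-encoder weights (stand-in for a CLIP-style model).
D_EMB, D_OUT, N_CLASSES, N_CTX = 512, 256, 7, 4
W_frozen = rng.normal(size=(D_EMB, D_OUT))          # never updated
class_tokens = rng.normal(size=(N_CLASSES, D_EMB))  # fixed class-name embeddings

# The only trainable parameters: a few shared context ("prompt") vectors.
ctx = rng.normal(scale=0.02, size=(N_CTX, D_EMB))

def class_text_features(ctx):
    # Prepend the learnable context to each class token, mean-pool,
    # then pass through the frozen projection.
    prompts = np.stack([np.vstack([ctx, c[None, :]]).mean(axis=0)
                        for c in class_tokens])      # (N_CLASSES, D_EMB)
    feats = prompts @ W_frozen                       # frozen encoder step
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

image_feat = rng.normal(size=(D_OUT,))
image_feat /= np.linalg.norm(image_feat)

logits = class_text_features(ctx) @ image_feat       # cosine similarities
print("trainable params:", ctx.size)                 # 4 * 512 = 2048
print("frozen params:   ", W_frozen.size)            # 512 * 256 = 131072
```

In a real training loop only `ctx` would receive gradients, which is what keeps the parameter overhead minimal relative to the frozen model.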

AI · Neutral · arXiv – CS AI · Apr 14 · 5/10

Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions

Researchers propose a novel reinforcement learning approach for fine-tuning multimodal conversational agents by learning a compact latent action space instead of operating directly on large text token spaces. The method combines paired image-text data with unpaired text-only data through a cross-modal projector trained with cycle consistency loss, demonstrating superior performance across multiple RL algorithms and conversation tasks.
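The cross-modal projector with a cycle consistency loss mentioned above can be illustrated with a toy sketch: project a text feature into a compact latent action space, map it back, and penalize the reconstruction error. This is a minimal NumPy example under invented dimensions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)

D_TEXT, D_LATENT = 128, 16   # invented sizes: text features -> compact latent actions

# Projector into the latent action space, and a decoder back to text space.
W_proj = rng.normal(scale=0.1, size=(D_TEXT, D_LATENT))
W_back = rng.normal(scale=0.1, size=(D_LATENT, D_TEXT))

def cycle_consistency_loss(text_feats):
    """Mean squared error between text features and their
    project-then-reconstruct cycle through the latent action space."""
    latent = text_feats @ W_proj           # (batch, D_LATENT) "latent actions"
    recon = latent @ W_back                # mapped back to text-feature space
    return float(np.mean((text_feats - recon) ** 2))

batch = rng.normal(size=(8, D_TEXT))       # unpaired text-only features
loss = cycle_consistency_loss(batch)
print(f"cycle loss: {loss:.3f}")           # driven down during projector training
```

Minimizing this loss on unpaired text is what lets the projector exploit text-only data alongside paired image-text examples.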