y0news

#cross-modal News & Analysis

20 articles tagged with #cross-modal. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Researchers have released DanQing, a large-scale Chinese vision-language dataset of 100 million high-quality image-text pairs curated from Common Crawl. The dataset addresses a key bottleneck in Chinese vision-language pre-training: models pre-trained on it outperform those trained on existing Chinese datasets across a range of downstream tasks.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs

Researchers introduce MMGraphRAG, a new AI framework that addresses hallucination issues in large language models by integrating visual scene graphs with text knowledge graphs through cross-modal fusion. The system uses SpecLink for entity linking and demonstrates superior performance in multimodal information processing across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability

Researchers introduced SPARC, a framework that creates unified latent spaces across different AI models and modalities, enabling direct comparison of how various architectures represent identical concepts. The method achieves 0.80 Jaccard similarity on Open Images, tripling alignment compared to previous methods, and enables practical applications like text-guided spatial localization in vision-only models.
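The Jaccard score SPARC reports measures overlap between the sets of concepts two models assign to aligned latent units. A minimal sketch of that metric (the concept sets below are hypothetical, not from the paper):

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two concept sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are conventionally identical
    return len(a & b) / len(a | b)

# Hypothetical concept labels fired by the "same" latent unit in two models.
vision_concepts = {"dog", "fur", "snout", "tail"}
text_concepts = {"dog", "fur", "leash", "tail", "park"}
score = jaccard(vision_concepts, text_concepts)  # 3 shared / 6 total = 0.5
```

A score of 0.80 therefore means the two models' concept sets for a unit overlap almost entirely, which is what makes direct cross-model comparison possible.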

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Rethinking Cross-Modal Fine-Tuning: Optimizing the Interaction between Feature Alignment and Target Fitting

Researchers developed a theoretical framework to optimize cross-modal fine-tuning of pre-trained AI models, addressing the challenge of aligning new feature modalities with existing representation spaces. The approach introduces a novel concept of feature-label distortion and demonstrates improved performance over state-of-the-art methods across benchmark datasets.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension

Researchers demonstrate that MMA2A, a multimodal routing protocol for agent-to-agent networks, achieves 52% task accuracy versus 32% for text-only baselines by preserving native modalities (voice, image, text) across agent boundaries. The 20-percentage-point improvement requires both protocol-level native routing and capable downstream reasoning agents, establishing routing as a critical design variable in multi-agent systems.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Researchers introduce R-C2, a reinforcement learning framework that improves multimodal AI reasoning by enforcing consistency between visual and textual representations. The system uses cycle-consistent training to resolve internal conflicts between modalities, achieving up to 7.6-point improvements in reasoning accuracy without requiring additional labeled data.
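The core idea of cycle consistency is that mapping a representation into the other modality and back should reproduce the original. A toy sketch of such a penalty, with made-up linear maps standing in for the paper's actual encoders:

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)

def cycle_consistency_loss(v_emb, to_text, to_vision):
    """Penalize disagreement after a vision -> text -> vision round trip."""
    reconstructed = to_vision(to_text(v_emb))
    return mse(v_emb, reconstructed)

# Toy maps: scale into a hypothetical 'text space' and back.
to_text = lambda e: [2.0 * x for x in e]
to_vision_good = lambda e: [0.5 * x for x in e]  # exact inverse: zero loss
to_vision_bad = lambda e: [0.6 * x for x in e]   # imperfect inverse: penalized

emb = [1.0, -2.0, 0.5]
assert cycle_consistency_loss(emb, to_text, to_vision_good) == 0.0
```

Training against this signal pushes the two modality pathways toward agreement without any extra labels, which matches the label-free claim above.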

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

Researchers propose X-OPD, a Cross-Modal On-Policy Distillation framework to improve Speech Large Language Models by aligning them with text-based counterparts. The method uses token-level feedback from teacher models to bridge performance gaps in end-to-end speech systems while preserving inherent capabilities.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Feature-level Interaction Explanations in Multimodal Transformers

Researchers introduce FL-I2MoE, a new Mixture-of-Experts layer for multimodal Transformers that explicitly identifies synergistic and redundant cross-modal feature interactions. The method provides more interpretable explanations for how different data modalities contribute to AI decision-making compared to existing approaches.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

ES-Merging: Biological MLLM Merging via Embedding Space Signals

Researchers propose ES-Merging, a new framework for combining specialized biological multimodal large language models (MLLMs) by using embedding space signals rather than traditional parameter-based methods. The approach estimates merging coefficients at both layer-wise and element-wise granularities, outperforming existing merging techniques and even task-specific fine-tuned models on cross-modal scientific problems.
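Layer-wise merging with per-layer coefficients can be sketched as a weighted average of parameters; the coefficients here are placeholders, whereas ES-Merging would estimate them from embedding-space signals:

```python
def merge_layerwise(model_a, model_b, coeffs):
    """Merge two models layer by layer:
    merged[layer] = alpha * A[layer] + (1 - alpha) * B[layer]."""
    merged = {}
    for layer, wa in model_a.items():
        alpha = coeffs[layer]
        wb = model_b[layer]
        merged[layer] = [alpha * a + (1 - alpha) * b for a, b in zip(wa, wb)]
    return merged

# Toy two-layer "models"; real coefficients would come from embedding signals.
a = {"layer1": [1.0, 1.0], "layer2": [0.0, 0.0]}
b = {"layer1": [0.0, 0.0], "layer2": [1.0, 1.0]}
coeffs = {"layer1": 0.75, "layer2": 0.25}
merged = merge_layerwise(a, b, coeffs)  # layer1 -> [0.75, 0.75], layer2 -> [0.75, 0.75]
```

Element-wise granularity would simply replace the scalar `alpha` per layer with a vector of coefficients, one per parameter.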

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

Latent Speech-Text Transformer

Facebook Research introduces the Latent Speech-Text Transformer (LST), which aggregates speech tokens into higher-level patches to improve computational efficiency and cross-modal alignment. The model achieves up to +6.5% absolute gain on speech HellaSwag benchmarks while maintaining text performance and reducing inference costs for ASR and TTS tasks.
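Aggregating token sequences into fixed-size patches is the efficiency lever described above: attention then operates over far fewer units. A minimal sketch (the patch size is illustrative, not the model's):

```python
def patchify(tokens, patch_size):
    """Group consecutive speech tokens into fixed-size patches; a model
    attending over patches sees a sequence shorter by ~patch_size times."""
    return [tokens[i:i + patch_size] for i in range(0, len(tokens), patch_size)]

speech_tokens = list(range(10))
patches = patchify(speech_tokens, 4)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Since self-attention cost grows quadratically with sequence length, shrinking 10 units to 3 patches cuts that cost substantially, which is where the inference savings for ASR and TTS come from.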

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

Researchers introduce CoE, a training-free multimodal summarization framework that uses a Chain-of-Events approach with Hierarchical Event Graph to better understand and summarize content across videos, transcripts, and images. The system achieves significant performance improvements over existing methods, showing average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore across eight datasets.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Researchers have developed an 'Omnivorous Vision Encoder' that creates consistent feature representations across different visual modalities (RGB, depth, segmentation) of the same scene. The framework addresses the poor cross-modal alignment in existing vision encoders like DINOv2 by training with dual objectives to maximize feature alignment while preserving discriminative semantics.
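A common way to express a feature-alignment objective like the one described is cosine distance between features of the same scene under different modalities; this is a generic sketch under that assumption, not the paper's exact loss:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(feat_rgb, feat_depth):
    """1 - cosine similarity: zero when the two modality features point
    the same way, regardless of their magnitudes."""
    return 1.0 - cosine(feat_rgb, feat_depth)

rgb = [0.6, 0.8]
depth_aligned = [1.2, 1.6]  # same direction, different scale -> loss ~ 0
assert abs(alignment_loss(rgb, depth_aligned)) < 1e-9
```

The dual-objective framing means this alignment term is balanced against a standard discriminative objective, so the encoder does not collapse all modalities to trivially identical features.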

AIBullisharXiv โ€“ CS AI ยท Feb 276/106
๐Ÿง 

ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Researchers introduced ViCLIP-OT, the first foundation vision-language model specifically designed for Vietnamese image-text retrieval. The model integrates CLIP-style contrastive learning with Similarity-Graph Regularized Optimal Transport (SIGROT) loss, achieving significant improvements over existing baselines with 67.34% average Recall@K on UIT-OpenViIC benchmark.
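The CLIP-style contrastive component mentioned above trains each image to pick out its matching caption from the batch via a softmax over similarities (InfoNCE). A sketch of that piece only — the SIGROT optimal-transport term is the paper's addition and is not reproduced here; the similarity values are made up:

```python
import math

def clip_contrastive_loss(sim_row, target_idx, temperature=0.07):
    """InfoNCE for one image: -log softmax of the matching caption's
    similarity among all captions in the batch."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_idx]

# Toy similarities of one image to 3 captions; index 0 is the true pair.
loss_easy = clip_contrastive_loss([0.9, 0.1, 0.0], 0)  # clear match: low loss
loss_hard = clip_contrastive_loss([0.2, 0.9, 0.0], 0)  # distractor wins: high loss
```

The low temperature sharpens the softmax, so small similarity gaps between the true caption and distractors translate into large loss differences.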

AI · Neutral · arXiv – CS AI · Feb 27 · 5/10

Same Words, Different Judgments: Modality Effects on Preference Alignment

Researchers conducted a cross-modal study comparing human preference annotations between text and audio formats for AI alignment. The study found that while audio preferences are as reliable as text, different modalities lead to different judgment patterns, with synthetic ratings showing promise as replacements for human annotations.

AI · Neutral · arXiv – CS AI · Feb 27 · 6/10

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Researchers introduce SPARTA, an automated framework for generating large-scale Table-Text question answering benchmarks that require complex multi-hop reasoning across structured and unstructured data. The benchmark exposes significant weaknesses in current AI models, with state-of-the-art systems dropping more than 30 F1 points relative to existing, simpler datasets.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Researchers developed an unbiased sliced Wasserstein RBF kernel with rotary positional embedding to improve audio captioning systems by addressing exposure bias and temporal relationship issues. The method shows significant improvements in caption quality and text-to-audio retrieval accuracy on AudioCaps and Clotho datasets, while also enhancing audio reasoning capabilities in large language models.
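The sliced Wasserstein idea underlying the kernel is simple: project both point sets onto random directions, where the 1-D Wasserstein distance has a closed form via sorting, and average. A plain sketch of that base distance (the paper's unbiased RBF-kernel construction and rotary positional embedding are not reproduced here):

```python
import math
import random

def wasserstein_1d(xs, ys):
    """W1 between equal-size 1-D samples: mean absolute gap after sorting."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def sliced_wasserstein(X, Y, n_proj=100, seed=0):
    """Average 1-D Wasserstein distance over random unit projections."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        theta = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta)) or 1.0
        theta = [t / norm for t in theta]
        px = [sum(t * x for t, x in zip(theta, row)) for row in X]
        py = [sum(t * y for t, y in zip(theta, row)) for row in Y]
        total += wasserstein_1d(px, py)
    return total / n_proj

X = [[0.0, 0.0], [1.0, 1.0]]
assert sliced_wasserstein(X, X) == 0.0  # identical sets: zero distance
```

Each projection costs only a sort, which is why sliced variants scale to the high-dimensional embedding sets that arise in audio-text matching.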

AI · Neutral · arXiv – CS AI · Mar 11 · 5/10

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Researchers introduce Daily-Omni, a new benchmark for evaluating multimodal AI models' ability to process audio and video simultaneously. The study of 24 foundation models reveals that current AI systems struggle with cross-modal temporal alignment, highlighting a key limitation in multimodal reasoning.

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10

How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

Researchers introduce Stepping Stone Plus (SSP), a novel framework that combines optical flow and textual prompts to improve audio-visual semantic segmentation. The method outperforms existing approaches by using motion dynamics for moving sound sources and textual descriptions for stationary objects, with a visual-textual alignment module for better cross-modal integration.