#multimodal-learning News & Analysis

32 articles tagged with #multimodal-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · 3d ago · 7/10

FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

Researchers introduce FLUID, a production-scale recommendation system for livestreaming platforms that replaces item IDs with multimodal semantic codes. Deployed across platforms with over one billion users, the system reports gains that include a 2.05% improvement in cold-start room views, addressing a fundamental challenge in recommending short-lived broadcast content.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Event Fields: Learning Latent Event Structure for Waveform Foundation Models

Researchers introduce a novel waveform foundation model that represents physiological signals as latent event processes rather than sequential tokens, using self-supervised learning to capture clinically meaningful structure. The approach demonstrates improved performance on medical benchmarks including arrhythmia classification and hemodynamic prediction, suggesting event-centric representations may be more suitable for healthcare AI than traditional sequence-based methods.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness

Researchers introduce Pan-FM, a foundation model trained on multimodal medical imaging from seven organs that addresses the critical problem of missing data in real-world biomedical datasets. The model uses Saliency-Guided Masking to prevent bias toward dominant organs and demonstrates superior performance on disease prediction tasks across the UK Biobank.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding

Researchers introduce AcuLa, a post-training framework that aligns audio encoders with medical language models to enhance clinical understanding of auscultation sounds. The method leverages LLMs to generate synthetic clinical reports from audio metadata and achieves significant performance improvements across 18 cardio-respiratory tasks, including boosting COVID-19 cough detection from 55% to 89% accuracy.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Schema-Adaptive Tabular Representation Learning with LLMs for Generalizable Multimodal Clinical Reasoning

Researchers propose Schema-Adaptive Tabular Representation Learning, which uses LLMs to convert structured clinical data into semantic embeddings that transfer across different electronic health record schemas without retraining. When combined with imaging data for dementia diagnosis, the method achieves state-of-the-art results and outperforms board-certified neurologists on retrospective diagnostic tasks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization

Researchers developed a new disentangled multi-modal framework that combines histopathology and transcriptome data for improved cancer diagnosis and prognosis. The framework addresses key challenges in medical AI including multi-modal data heterogeneity and dependency on paired datasets through innovative fusion techniques and knowledge distillation strategies.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

Researchers have developed a new deepfake detection framework called T-AVFD that addresses a critical gap in audio-visual forgery detection by handling singing scenarios, where traditional cross-modal inconsistency methods fail. The study introduces the SHDF dataset and demonstrates improved detection performance across both talking and singing deepfakes through text-guided pattern learning.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

Researchers propose a Conflict-aware Penalty and Statistical Loss framework to address gradient norm conflicts in multimodal sentiment analysis, where dominant text modalities suppress weaker acoustic and visual streams. The approach achieves state-of-the-art results on CMU-MOSI benchmarks by balancing modality contributions and stabilizing training dynamics.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Researchers propose a utility-aware multimodal contrastive learning framework that optimizes AI-generated product images for consumer demand rather than just semantic accuracy. The method, tested on Amazon and Airbnb data, outperforms existing generative AI models by shifting the learned image-text representation space toward demand-driven visual cues while maintaining image quality and text alignment.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey

A comprehensive survey examines how Mixture-of-Experts (MoE) architectures address multimodal learning challenges by enabling scalable modeling, enriching representation learning across modalities, and adapting to imperfect data scenarios. The research identifies critical gaps in interpretable routing, expert communication, and lifelong multimodal learning, positioning MoE as a foundational framework for building more efficient and flexible AI systems.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Checking Fact with Better Retrieval: Dynamic Contrastive Learning for Evidence Retrieval

Researchers propose DACLR, a dynamic contrastive learning method that improves evidence retrieval for multimodal fact-checking by converting diverse media types to text and extracting event-level features. The approach uses a two-stage recall-rerank system with adaptive loss functions to better match claims with relevant evidence rather than merely semantically similar content.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

FLORO is a multimodal geospatial foundation model that learns from diverse remote sensing data across multiple sensor types and resolutions with minimal pretraining data. Despite using significantly smaller datasets than competing models, FLORO demonstrates strong transfer learning performance on ecological and environmental applications, achieving competitive results on scene classification, segmentation, and regression tasks.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

Researchers introduce SAME, a new approach for training Multimodal Large Language Models that can continuously learn new tasks without forgetting previous capabilities. The method addresses fundamental problems in continual learning by stabilizing how AI systems route tasks to specialized expert networks and preventing knowledge degradation over time.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement

Researchers propose a case-aware medical image classification framework that leverages multimodal knowledge graphs to retrieve similar historical cases and integrate external clinical knowledge, improving diagnostic accuracy through interpretable evidence-based reasoning rather than relying solely on isolated visual analysis.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

Researchers introduce FAST-GOAL, a fine-tuning method that improves CLIP's ability to process lengthy text descriptions through global-local semantic alignment. The approach combines object detection with token-level similarity learning and introduces GLIT100k, a new dataset linking long captions to localized image-text pairs, demonstrating significant performance gains across multiple benchmarks.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Prospective evaluation of multimodal respiratory failure prediction: Do chest X-rays improve performance beyond EHR signals?

Researchers developed a gated multimodal AI framework that combines electronic health record data with chest X-ray analysis to predict respiratory failure in ICU patients within 24 hours. The model achieved significantly higher accuracy (AUROC 0.860) than EHR-only baselines and physician predictions, demonstrating that adaptive fusion of imaging and structured clinical data improves critical care decision-making.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence

Researchers introduce CmIVTP, a cross-modal AI framework that combines AIS (Automatic Identification System) and CCTV data to improve maritime vessel trajectory prediction. The system uses a transformer-based architecture with attention mechanisms to model vessel-environment interactions, addressing the limitations of single-source data in maritime navigation systems.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Researchers introduce TowerMind, a lightweight tower defense game environment designed to evaluate Large Language Models as autonomous agents. The benchmark tests LLMs' capabilities in strategic planning and real-time decision-making while revealing significant performance gaps compared to human experts and highlighting key limitations in model reasoning.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis

Researchers introduce PromptDx, a novel AI framework that combines differentiable prompt tuning with multimodal learning to diagnose Alzheimer's Disease using MRI and biomarker data. The method achieves competitive performance using only 1% of context samples compared to 30% in standard approaches, demonstrating significant data efficiency gains for medical imaging applications.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Compressed Video Aggregator: Content-driven Module for Efficient Micro-Video Recommendation

Researchers propose Compressed Video Aggregator (CVA), a lightweight module that improves micro-video recommendation systems by decoupling video processing from preference learning. The method reduces training time and GPU memory by orders of magnitude while maintaining or improving performance through intelligent frame selection based on video titles.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Debiased Multimodal Personality Understanding through Dual Causal Intervention

Researchers introduce a Dual Causal Adjustment Network (DCAN) to improve fairness in multimodal AI systems that assess personality traits from video data. The method addresses demographic and latent biases that cause unfair predictions across different population groups, achieving over 92% accuracy while significantly improving fairness metrics.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities

Researchers introduce Hard Negative Captions (HNC), an automatically generated dataset designed to improve vision-language models' ability to understand fine-grained mismatches between images and text. The work addresses a fundamental limitation in current image-text matching approaches, where weakly paired web data fails to teach models detailed cross-modal comprehension, demonstrating improved performance on diagnostic tasks and robustness under noisy conditions.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology

GIST is a multimodal AI system that converts mobile point cloud data into semantically annotated navigation maps for complex indoor environments. The technology combines vision-language models with spatial reasoning to enable embodied AI systems to navigate cluttered spaces like retail stores and hospitals, with applications in semantic search, localization, and natural language instruction generation.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning

Researchers propose FedTSP, a federated learning method that uses pre-trained language models to generate semantically enriched prototypes for improving model performance across heterogeneous data. The approach leverages textual descriptions of classes to preserve semantic relationships while mitigating data heterogeneity challenges in federated settings.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Researchers introduce MODIX, a training-free framework that dynamically optimizes how Vision-Language Models allocate attention across multimodal inputs by adjusting positional encoding based on information density rather than uniform token assignment. The approach improves reasoning performance without modifying model parameters, suggesting positional encoding should be treated as an adaptive resource in multimodal transformer architectures.
