y0news

#multimodal-learning News & Analysis

38 articles tagged with #multimodal-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

Researchers introduced FLUID, a production-scale recommendation system that eliminates reliance on item IDs for livestreaming platforms by using multimodal semantic codes instead. Deployed across platforms with over one billion users, the system achieves significant performance gains, including a 2.05% improvement in cold-start room views, addressing a fundamental challenge in recommending short-lived broadcast content.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Event Fields: Learning Latent Event Structure for Waveform Foundation Models

Researchers introduce a novel waveform foundation model that represents physiological signals as latent event processes rather than sequential tokens, using self-supervised learning to capture clinically meaningful structure. The approach demonstrates improved performance on medical benchmarks including arrhythmia classification and hemodynamic prediction, suggesting event-centric representations may be more suitable for healthcare AI than traditional sequence-based methods.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness

Researchers introduce Pan-FM, a foundation model trained on multimodal medical imaging from seven organs that addresses the critical problem of missing data in real-world biomedical datasets. The model uses Saliency-Guided Masking to prevent bias toward dominant organs and demonstrates superior performance on disease prediction tasks across the UK Biobank.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding

Researchers introduce AcuLa, a post-training framework that aligns audio encoders with medical language models to enhance clinical understanding of auscultation sounds. The method leverages LLMs to generate synthetic clinical reports from audio metadata and achieves significant performance improvements across 18 cardio-respiratory tasks, including boosting COVID-19 cough detection from 55% to 89% accuracy.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Schema-Adaptive Tabular Representation Learning with LLMs for Generalizable Multimodal Clinical Reasoning

Researchers propose Schema-Adaptive Tabular Representation Learning, which uses LLMs to convert structured clinical data into semantic embeddings that transfer across different electronic health record schemas without retraining. When combined with imaging data for dementia diagnosis, the method achieves state-of-the-art results and outperforms board-certified neurologists on retrospective diagnostic tasks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization

Researchers developed a new disentangled multi-modal framework that combines histopathology and transcriptome data for improved cancer diagnosis and prognosis. The framework addresses key challenges in medical AI including multi-modal data heterogeneity and dependency on paired datasets through innovative fusion techniques and knowledge distillation strategies.

AI · Neutral · arXiv – CS AI · 2d ago · 5/10

Balancing Multimodal Learning through Label Space Reshaping

Researchers propose Balanced Multimodal Label Reshaping (BMLR), a novel machine learning approach that addresses modality imbalance in multimodal systems by reshaping label spaces rather than adjusting optimization gradients. The method equalizes mapping difficulty across different data modalities, enabling more balanced learning and improved overall performance across various neural network architectures.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

Researchers introduce VLA-Trace, a diagnostic framework for analyzing Vision-Language-Action models that reveals how these AI systems transform multimodal inputs into physical control actions. The study finds that popular VLA models such as π₀.₅ and OpenVLA exhibit distinct adaptation patterns and rely on different routing strategies during decision-making, yet both struggle with fine-grained semantic understanding despite excelling at visual grounding.

AI · Bullish · arXiv – CS AI · 2d ago · 6/10

TRACER: Persistent Regularization for Robust Multimodal Finetuning

Researchers introduce TRACER, a novel finetuning method for multimodal AI models that addresses catastrophic forgetting and out-of-distribution robustness degradation. By replacing standard Exponential Moving Average teachers with Weighted Moving Average teachers and combining contrastive learning with multi-perspective distillation, the approach demonstrates consistent performance gains across CLIP backbone architectures without hyperparameter sensitivity.

AI · Bullish · arXiv – CS AI · 2d ago · 6/10

Genetically Aligned Patient Representations Improve Hematological Diagnosis

Researchers developed a framework that aligns single-cell white blood cell images with genetic data (karyotypes and mutations) to improve hematological cancer diagnosis. Using a two-stage training approach combining self-supervised vision learning and supervised contrastive alignment, the model outperforms existing histopathology foundation models and enables disease retrieval based on genetic alterations.

AI · Bullish · arXiv – CS AI · 2d ago · 6/10

Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions

Researchers develop a federated domain generalization framework to improve respiratory sound classification across different stethoscope devices, addressing inter-device variability that hinders multi-site AI deployment in pulmonary disease detection. The approach combines causality-inspired interventions with multimodal learning to outperform existing baselines without requiring access to unseen devices during training.

AI · Bullish · arXiv – CS AI · 2d ago · 6/10

A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes

Researchers developed a multimodal AI framework that combines cardiac MRI imaging, clinical metrics, and medical text records to improve heart failure prognosis prediction and treatment planning. The integrated approach demonstrates superior accuracy compared to single-data-source algorithms, addressing a critical gap in managing this leading cause of global mortality.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

Researchers propose a Conflict-aware Penalty and Statistical Loss framework to address gradient norm conflicts in multimodal sentiment analysis, where dominant text modalities suppress weaker acoustic and visual streams. The approach achieves state-of-the-art results on CMU-MOSI benchmarks by balancing modality contributions and stabilizing training dynamics.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Researchers propose a utility-aware multimodal contrastive learning framework that optimizes AI-generated product images for consumer demand rather than just semantic accuracy. The method, tested on Amazon and Airbnb data, outperforms existing generative AI models by shifting the learned image-text representation space toward demand-driven visual cues while maintaining image quality and text alignment.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey

A comprehensive survey examines how Mixture-of-Experts (MoE) architectures address multimodal learning challenges by enabling scalable modeling, enriching representation learning across modalities, and adapting to imperfect data scenarios. The research identifies critical gaps in interpretable routing, expert communication, and lifelong multimodal learning, positioning MoE as a foundational framework for building more efficient and flexible AI systems.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Checking Fact with Better Retrieval: Dynamic Contrastive Learning for Evidence Retrieval

Researchers propose DACLR, a dynamic contrastive learning method that improves evidence retrieval for multimodal fact-checking by converting diverse media types to text and extracting event-level features. The approach uses a two-stage recall-rerank system with adaptive loss functions to better match claims with relevant evidence rather than merely semantically similar content.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

FLORO is a multimodal geospatial foundation model that learns from diverse remote sensing data across multiple sensor types and resolutions with minimal pretraining data. Despite using significantly smaller datasets than competing models, FLORO demonstrates strong transfer learning performance on ecological and environmental applications, achieving competitive results on scene classification, segmentation, and regression tasks.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

Researchers introduce SAME, a new approach for training Multimodal Large Language Models that can continuously learn new tasks without forgetting previous capabilities. The method addresses fundamental problems in continual learning by stabilizing how AI systems route tasks to specialized expert networks and preventing knowledge degradation over time.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

Researchers have developed a new deepfake detection framework called T-AVFD that addresses a critical gap in audio-visual forgery detection by handling singing scenarios, where traditional cross-modal inconsistency methods fail. The study introduces the SHDF dataset and demonstrates improved detection performance across both talking and singing deepfakes through text-guided pattern learning.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

Researchers introduce FAST-GOAL, a fine-tuning method that improves CLIP's ability to process lengthy text descriptions through global-local semantic alignment. The approach combines object detection with token-level similarity learning and introduces GLIT100k, a new dataset linking long captions to localized image-text pairs, demonstrating significant performance gains across multiple benchmarks.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Prospective evaluation of multimodal respiratory failure prediction: Do chest X-rays improve performance beyond EHR signals?

Researchers developed a gated multimodal AI framework that combines electronic health record data with chest X-ray analysis to predict respiratory failure in ICU patients within 24 hours. The model achieved significantly higher accuracy (AUROC 0.860) than EHR-only baselines and physician predictions, demonstrating that adaptive fusion of imaging and structured clinical data improves critical care decision-making.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence

Researchers introduce CmIVTP, a cross-modal AI framework that combines AIS and CCTV data to improve maritime vessel trajectory prediction. The system uses transformer-based architecture with attention mechanisms to model vessel-environment interactions, addressing limitations of single-source data in maritime navigation systems.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Researchers introduce TowerMind, a lightweight tower defense game environment designed to evaluate Large Language Models as autonomous agents. The benchmark tests LLMs' capabilities in strategic planning and real-time decision-making while revealing significant performance gaps compared to human experts and highlighting key limitations in model reasoning.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis

Researchers introduce PromptDx, a novel AI framework that combines differentiable prompt tuning with multimodal learning to diagnose Alzheimer's Disease using MRI and biomarker data. The method achieves competitive performance using only 1% of context samples compared to 30% in standard approaches, demonstrating significant data efficiency gains for medical imaging applications.
