10 articles tagged with #multi-modal. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv — CS AI · Mar 17 · 7/10
🧠 Researchers introduce A.DOT Planner, an AI framework that enables multi-hop question answering across hybrid data lakes containing both structured and unstructured data. The system uses directed acyclic graphs to orchestrate complex queries, achieving 14.8% better accuracy and 10.7% better completeness than existing solutions.
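The DAG-orchestration idea can be sketched as follows. This is a toy illustration only, assuming the planner decomposes a multi-hop question into sub-queries and executes them in dependency order; the node names, operators, and example question are invented, not taken from the paper.

```python
from graphlib import TopologicalSorter

def run_query_dag(dag, operators, inputs):
    """Execute sub-queries in dependency order, feeding each node the
    outputs of its parents. `dag` maps node -> set of predecessor nodes."""
    results = dict(inputs)
    for node in TopologicalSorter(dag).static_order():
        if node in results:                      # leaf input already available
            continue
        parent_outputs = [results[p] for p in dag[node]]
        results[node] = operators[node](parent_outputs)
    return results

# Toy multi-hop question: "revenue of the company founded by X",
# mixing an unstructured text-search hop with structured table lookups.
dag = {
    "founder_lookup": set(),                     # unstructured: text search
    "company_lookup": {"founder_lookup"},        # structured: table join
    "revenue_lookup": {"company_lookup"},
}
operators = {
    "founder_lookup": lambda parents: "Acme Corp founder",
    "company_lookup": lambda parents: "Acme Corp",
    "revenue_lookup": lambda parents: f"revenue({parents[0]})",
}
out = run_query_dag(dag, operators, {})
print(out["revenue_lookup"])   # → revenue(Acme Corp)
```

Because execution follows a topological order, any hop can freely consume results of earlier hops regardless of whether they came from structured or unstructured sources.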
AI · Bullish · arXiv — CS AI · Mar 3 · 7/10 · 4
🧠 Researchers introduce Meta Engine, a unified semantic query system that integrates multiple specialized LLM-based query systems to handle multi-modal data analysis. The system addresses fragmentation in current semantic query tools by combining specialized systems through five key components, achieving 3-24x better performance than existing baselines.
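A minimal sketch of the unification idea: one front-end dispatching semantic queries to specialized back-end engines. The routing rule, engine names, and registration API here are invented for illustration; the paper's five components are not modeled.

```python
class UnifiedQueryEngine:
    """Toy front-end that routes each query to a modality-specific engine."""

    def __init__(self):
        self.engines = {}                 # modality -> specialized engine

    def register(self, modality, engine):
        self.engines[modality] = engine

    def query(self, text, modality):
        if modality not in self.engines:
            raise ValueError(f"no engine registered for {modality!r}")
        return self.engines[modality](text)

meta_engine = UnifiedQueryEngine()
meta_engine.register("table", lambda q: f"SQL({q})")      # structured data
meta_engine.register("image", lambda q: f"VLM({q})")      # visual data
print(meta_engine.query("average price", "table"))  # → SQL(average price)
```

The point of a unified layer like this is that callers issue one kind of query while the fragmentation between specialized systems stays hidden behind the dispatch step.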
AI · Bullish · arXiv — CS AI · Mar 27 · 6/10
🧠 Researchers propose X-OPD, a Cross-Modal On-Policy Distillation framework to improve Speech Large Language Models by aligning them with text-based counterparts. The method uses token-level feedback from teacher models to bridge performance gaps in end-to-end speech systems while preserving inherent capabilities.
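The token-level feedback can be sketched as a per-token divergence between student and teacher next-token distributions, computed over sequences the student itself generated (the "on-policy" part). Model internals are stubbed out here; only the loss shape is illustrated, and the specific divergence used by X-OPD is an assumption.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def onpolicy_kd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token KL(student || teacher) over a student-sampled sequence.
    Each element of the input lists is one position's vocabulary logits."""
    total = 0.0
    for s_logits, t_logits in zip(student_logits_seq, teacher_logits_seq):
        p = softmax(s_logits)             # student distribution
        q = softmax(t_logits)             # text-teacher distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_logits_seq)

# Identical distributions give zero loss; a mismatch gives a positive loss.
print(onpolicy_kd_loss([[1.0, 2.0]], [[1.0, 2.0]]))   # → 0.0
```

Because feedback is attached to individual tokens of the student's own outputs rather than to whole reference transcripts, the gradient signal targets exactly where the speech model diverges from its text counterpart.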
AI · Neutral · arXiv — CS AI · Mar 12 · 6/10
🧠 Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.
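The three expansion moves can be pictured as a search over candidate attack conversations. The mutation rules below are invented string stand-ins for what would be LLM-driven rewrites in the actual framework; only the search structure is the point.

```python
def horizontal(seed):
    """Same attack goal, alternative phrasings (breadth)."""
    return [f"{seed} (rephrased {i})" for i in range(2)]

def vertical(seed):
    """Push the same conversation one turn deeper (depth)."""
    return [f"{seed} -> follow-up"]

def meta(seed):
    """Switch to a different attack strategy entirely."""
    return [f"strategy-shift: {seed}"]

def expand(frontier):
    """One round of red-team search: apply every expansion to every candidate."""
    out = []
    for seed in frontier:
        for move in (horizontal, vertical, meta):
            out.extend(move(seed))
    return out

candidates = expand(["base attack prompt"])
print(len(candidates))   # → 4
```

In a real system each round's candidates would be scored against the target model and pruned, so the frontier grows along whichever expansion type is finding vulnerabilities.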
AI · Bullish · arXiv — CS AI · Mar 9 · 6/10
🧠 Researchers introduce StreamWise, a system for real-time multi-modal content generation that can produce 10-minute podcast videos with sub-second startup delays. The system dynamically manages quality and resources across LLMs, text-to-speech, and video generation, costing under $25 for basic generation or $45 for high-quality real-time streaming.
AI · Bullish · arXiv — CS AI · Mar 3 · 6/10 · 7
🧠 Researchers propose REMIND, a framework for medical multi-modal AI learning that addresses the challenge of missing data across multiple modalities. The solution uses a Mixture-of-Experts architecture to handle long-tail distributions of modality combinations and shows superior performance on real-world medical datasets.
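A minimal sketch of routing over modality combinations in the Mixture-of-Experts spirit described here: each observed combination of available modalities is dispatched to a dedicated expert, with a shared fallback for rare (long-tail) combinations. The experts, modality names, and fallback rule are all invented stand-ins.

```python
def route(sample, experts, fallback):
    """Dispatch a sample to the expert for its set of present modalities."""
    present = frozenset(k for k, v in sample.items() if v is not None)
    expert = experts.get(present, fallback)
    return expert(sample)

# Stub experts keyed by modality combination; real ones would be networks.
experts = {
    frozenset({"image", "text"}): lambda s: "img+txt expert",
    frozenset({"image"}): lambda s: "img-only expert",
}
shared = lambda s: "shared expert"          # fallback for unseen combinations

print(route({"image": 1, "text": 2}, experts, shared))     # → img+txt expert
print(route({"image": 1, "text": None}, experts, shared))  # → img-only expert
print(route({"text": 3, "lab": 4}, experts, shared))       # → shared expert
```

Routing by combination rather than by individual modality is what lets common patterns (say, image+text) keep a specialist while rare patterns fall back gracefully instead of degrading everyone.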
AI · Neutral · arXiv — CS AI · Mar 26 · 4/10
🧠 Researchers propose Text-guided Multi-view Knowledge Distillation (TMKD), a new method that uses dual-modality teachers (visual and text) to improve knowledge transfer from large AI models to smaller ones. The approach enhances visual teachers with multi-view inputs and incorporates CLIP text guidance, achieving up to 4.49% performance improvements across five benchmarks.
AI · Bullish · arXiv — CS AI · Mar 9 · 5/10
🧠 Researchers have developed GazeMoE, a new AI framework that uses Mixture-of-Experts architecture to accurately estimate where humans are looking by analyzing visual cues like eyes, head poses, and gestures. The system achieves state-of-the-art performance on benchmark datasets and addresses key challenges in gaze target detection through advanced multi-modal processing.
Hugging Face
AI · Neutral · arXiv — CS AI · Mar 2 · 5/10 · 6
🧠 Researchers developed M3TR, a new AI framework that uses temporal retrieval and multi-modal analysis to predict micro-video popularity with 19.3% better accuracy than existing methods. The system combines a Mamba-Hawkes Process module to model user feedback patterns with temporal-aware retrieval to identify historically relevant videos based on content and popularity trajectories.
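The Hawkes-process side of this can be made concrete with the standard univariate intensity: each past interaction (like, share, comment) raises the rate of future ones, decaying exponentially. This is the textbook self-exciting form, not the paper's exact module, and the parameter values are illustrative only.

```python
import math

def hawkes_intensity(t, event_times, mu=0.5, alpha=0.8, beta=1.0):
    """lambda(t) = mu + alpha * sum over t_i < t of exp(-beta * (t - t_i)).

    mu    : base rate of user feedback
    alpha : jump added by each past event
    beta  : exponential decay speed of that excitement
    """
    return mu + alpha * sum(
        math.exp(-beta * (t - ti)) for ti in event_times if ti < t
    )

# With no history the rate is just the base rate mu.
print(hawkes_intensity(1.0, []))            # → 0.5
# A recent interaction pushes the rate above the base rate (a burst).
print(hawkes_intensity(1.0, [0.9]) > 0.5)   # → True
```

Self-excitation is a natural fit for popularity prediction because engagement on micro-videos is bursty: early feedback begets more feedback before the effect decays.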
AI · Neutral · arXiv — CS AI · Mar 3 · 4/10 · 5
🧠 Researchers propose DASP (Decoupling Adaptation for Stability and Plasticity), a novel framework for adapting multi-modal AI models to changing test environments. The method addresses key challenges of negative transfer and catastrophic forgetting by using asymmetric adaptation strategies that treat biased and unbiased modalities differently.
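One way to picture "asymmetric adaptation" is as different step sizes per modality during test-time updates: a smaller step for a modality flagged as biased (stability, to avoid negative transfer) and a larger step for an unbiased one (plasticity). This scalar sketch is a guess at the flavor of the idea, with invented parameters and labels; DASP's actual mechanism is richer than a learning-rate split.

```python
def asymmetric_step(params, grads, biased, lr_stable=0.01, lr_plastic=0.1):
    """One gradient step where biased modalities move cautiously and
    unbiased modalities move freely. Params/grads are toy scalars."""
    updated = {}
    for modality, w in params.items():
        lr = lr_stable if modality in biased else lr_plastic
        updated[modality] = w - lr * grads[modality]
    return updated

params = {"audio": 1.0, "video": 1.0}
grads = {"audio": 1.0, "video": 1.0}
new = asymmetric_step(params, grads, biased={"audio"})
print(new)   # audio barely moves; video adapts an order of magnitude faster
```

Decoupling the two regimes is what lets a single adaptation loop resist catastrophic forgetting on the unreliable modality while still tracking distribution shift on the reliable one.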