10 articles tagged with #multi-modal. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv — CS AI · Mar 17 · 7/10
🧠 Researchers introduce A.DOT Planner, an AI framework that enables multi-hop question answering across hybrid data lakes containing both structured and unstructured data. The system uses directed acyclic graphs to orchestrate complex queries, achieving 14.8% better accuracy and 10.7% better completeness than existing solutions.
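The DAG-orchestration idea can be sketched as follows. This is a toy illustration only, assuming the planner decomposes a multi-hop question into sub-queries and executes them in dependency order; the node names, operators, and example question are invented, not taken from the paper.

```python
from graphlib import TopologicalSorter

def run_query_dag(dag, operators, inputs):
    """Execute sub-queries in dependency order, feeding each node the
    outputs of its parents. `dag` maps node -> set of predecessor nodes."""
    results = dict(inputs)
    for node in TopologicalSorter(dag).static_order():
        if node in results:                      # leaf input already available
            continue
        parent_outputs = [results[p] for p in dag[node]]
        results[node] = operators[node](parent_outputs)
    return results

# Toy multi-hop question: "revenue of the company founded by X",
# mixing an unstructured text-search hop with structured table lookups.
dag = {
    "founder_lookup": set(),                     # unstructured: text search
    "company_lookup": {"founder_lookup"},        # structured: table join
    "revenue_lookup": {"company_lookup"},
}
operators = {
    "founder_lookup": lambda parents: "Acme Corp founder",
    "company_lookup": lambda parents: "Acme Corp",
    "revenue_lookup": lambda parents: f"revenue({parents[0]})",
}
out = run_query_dag(dag, operators, {})
print(out["revenue_lookup"])   # → revenue(Acme Corp)
```

Because execution follows a topological order, any hop can freely consume results of earlier hops regardless of whether they came from structured or unstructured sources.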
AI · Bullish · arXiv — CS AI · Mar 3 · 7/10 · 4
🧠 Researchers introduce Meta Engine, a unified semantic query system that integrates multiple specialized LLM-based query systems to handle multi-modal data analysis. The system addresses fragmentation in current semantic query tools by combining specialized systems through five key components, achieving 3-24x better performance than existing baselines.
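A minimal sketch of the unification idea: one front-end dispatching semantic queries to specialized back-end engines. The routing rule, engine names, and registration API here are invented for illustration; the paper's five components are not modeled.

```python
class UnifiedQueryEngine:
    """Toy front-end that routes each query to a modality-specific engine."""

    def __init__(self):
        self.engines = {}                 # modality -> specialized engine

    def register(self, modality, engine):
        self.engines[modality] = engine

    def query(self, text, modality):
        if modality not in self.engines:
            raise ValueError(f"no engine registered for {modality!r}")
        return self.engines[modality](text)

meta_engine = UnifiedQueryEngine()
meta_engine.register("table", lambda q: f"SQL({q})")      # structured data
meta_engine.register("image", lambda q: f"VLM({q})")      # visual data
print(meta_engine.query("average price", "table"))  # → SQL(average price)
```

The point of a unified layer like this is that callers issue one kind of query while the fragmentation between specialized systems stays hidden behind the dispatch step.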
AI · Bullish · arXiv — CS AI · Mar 27 · 6/10
🧠 Researchers propose X-OPD, a Cross-Modal On-Policy Distillation framework to improve Speech Large Language Models by aligning them with text-based counterparts. The method uses token-level feedback from teacher models to bridge performance gaps in end-to-end speech systems while preserving inherent capabilities.
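The token-level feedback can be sketched as a per-token divergence between student and teacher next-token distributions, computed over sequences the student itself generated (the "on-policy" part). Model internals are stubbed out here; only the loss shape is illustrated, and the specific divergence used by X-OPD is an assumption.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def onpolicy_kd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token KL(student || teacher) over a student-sampled sequence.
    Each element of the input lists is one position's vocabulary logits."""
    total = 0.0
    for s_logits, t_logits in zip(student_logits_seq, teacher_logits_seq):
        p = softmax(s_logits)             # student distribution
        q = softmax(t_logits)             # text-teacher distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_logits_seq)

# Identical distributions give zero loss; a mismatch gives a positive loss.
print(onpolicy_kd_loss([[1.0, 2.0]], [[1.0, 2.0]]))   # → 0.0
```

Because feedback is attached to individual tokens of the student's own outputs rather than to whole reference transcripts, the gradient signal targets exactly where the speech model diverges from its text counterpart.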
AI · Neutral · arXiv — CS AI · Mar 12 · 6/10
🧠 Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.
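The three expansion moves can be pictured as a search over candidate attack conversations. The mutation rules below are invented string stand-ins for what would be LLM-driven rewrites in the actual framework; only the search structure is the point.

```python
def horizontal(seed):
    """Same attack goal, alternative phrasings (breadth)."""
    return [f"{seed} (rephrased {i})" for i in range(2)]

def vertical(seed):
    """Push the same conversation one turn deeper (depth)."""
    return [f"{seed} -> follow-up"]

def meta(seed):
    """Switch to a different attack strategy entirely."""
    return [f"strategy-shift: {seed}"]

def expand(frontier):
    """One round of red-team search: apply every expansion to every candidate."""
    out = []
    for seed in frontier:
        for move in (horizontal, vertical, meta):
            out.extend(move(seed))
    return out

candidates = expand(["base attack prompt"])
print(len(candidates))   # → 4
```

In a real system each round's candidates would be scored against the target model and pruned, so the frontier grows along whichever expansion type is finding vulnerabilities.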
AI · Bullish · arXiv — CS AI · Mar 9 · 6/10
🧠 Researchers introduce StreamWise, a system for real-time multi-modal content generation that can produce 10-minute podcast videos with sub-second startup delays. The system dynamically manages quality and resources across LLMs, text-to-speech, and video generation, costing under $25 for basic generation or $45 for high-quality real-time streaming.
AI · Bullish · arXiv — CS AI · Mar 3 · 6/10 · 7
🧠 Researchers propose REMIND, a framework for medical multi-modal AI learning that addresses the challenge of missing data across multiple modalities. The solution uses a Mixture-of-Experts architecture to handle long-tail distributions of modality combinations and shows superior performance on real-world medical datasets.
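A minimal sketch of routing over modality combinations in the Mixture-of-Experts spirit described here: each observed combination of available modalities is dispatched to a dedicated expert, with a shared fallback for rare (long-tail) combinations. The experts, modality names, and fallback rule are all invented stand-ins.

```python
def route(sample, experts, fallback):
    """Dispatch a sample to the expert for its set of present modalities."""
    present = frozenset(k for k, v in sample.items() if v is not None)
    expert = experts.get(present, fallback)
    return expert(sample)

# Stub experts keyed by modality combination; real ones would be networks.
experts = {
    frozenset({"image", "text"}): lambda s: "img+txt expert",
    frozenset({"image"}): lambda s: "img-only expert",
}
shared = lambda s: "shared expert"          # fallback for unseen combinations

print(route({"image": 1, "text": 2}, experts, shared))     # → img+txt expert
print(route({"image": 1, "text": None}, experts, shared))  # → img-only expert
print(route({"text": 3, "lab": 4}, experts, shared))       # → shared expert
```

Routing by combination rather than by individual modality is what lets common patterns (say, image+text) keep a specialist while rare patterns fall back gracefully instead of degrading everyone.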
AI · Neutral · arXiv — CS AI · Mar 26 · 4/10
🧠 Researchers propose Text-guided Multi-view Knowledge Distillation (TMKD), a new method that uses dual-modality teachers (visual and text) to improve knowledge transfer from large AI models to smaller ones. The approach enhances visual teachers with multi-view inputs and incorporates CLIP text guidance, achieving up to 4.49% performance improvements across five benchmarks.
AI · Bullish · arXiv — CS AI · Mar 9 · 5/10
🧠 Researchers have developed GazeMoE, a new AI framework that uses Mixture-of-Experts architecture to accurately estimate where humans are looking by analyzing visual cues like eyes, head poses, and gestures. The system achieves state-of-the-art performance on benchmark datasets and addresses key challenges in gaze target detection through advanced multi-modal processing.
Hugging Face
AI · Neutral · arXiv — CS AI · Mar 2 · 5/10 · 6
🧠 Researchers developed M3TR, a new AI framework that uses temporal retrieval and multi-modal analysis to predict micro-video popularity with 19.3% better accuracy than existing methods. The system combines a Mamba-Hawkes Process module to model user feedback patterns with temporal-aware retrieval to identify historically relevant videos based on content and popularity trajectories.
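The Hawkes-process side of this can be made concrete with the standard univariate intensity: each past interaction (like, share, comment) raises the rate of future ones, decaying exponentially. This is the textbook self-exciting form, not the paper's exact module, and the parameter values are illustrative only.

```python
import math

def hawkes_intensity(t, event_times, mu=0.5, alpha=0.8, beta=1.0):
    """lambda(t) = mu + alpha * sum over t_i < t of exp(-beta * (t - t_i)).

    mu    : base rate of user feedback
    alpha : jump added by each past event
    beta  : exponential decay speed of that excitement
    """
    return mu + alpha * sum(
        math.exp(-beta * (t - ti)) for ti in event_times if ti < t
    )

# With no history the rate is just the base rate mu.
print(hawkes_intensity(1.0, []))            # → 0.5
# A recent interaction pushes the rate above the base rate (a burst).
print(hawkes_intensity(1.0, [0.9]) > 0.5)   # → True
```

Self-excitation is a natural fit for popularity prediction because engagement on micro-videos is bursty: early feedback begets more feedback before the effect decays.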
AI · Neutral · arXiv — CS AI · Mar 3 · 4/10 · 5
🧠 Researchers propose DASP (Decoupling Adaptation for Stability and Plasticity), a novel framework for adapting multi-modal AI models to changing test environments. The method addresses key challenges of negative transfer and catastrophic forgetting by using asymmetric adaptation strategies that treat biased and unbiased modalities differently.
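One way to picture "asymmetric adaptation" is as different step sizes per modality during test-time updates: a smaller step for a modality flagged as biased (stability, to avoid negative transfer) and a larger step for an unbiased one (plasticity). This scalar sketch is a guess at the flavor of the idea, with invented parameters and labels; DASP's actual mechanism is richer than a learning-rate split.

```python
def asymmetric_step(params, grads, biased, lr_stable=0.01, lr_plastic=0.1):
    """One gradient step where biased modalities move cautiously and
    unbiased modalities move freely. Params/grads are toy scalars."""
    updated = {}
    for modality, w in params.items():
        lr = lr_stable if modality in biased else lr_plastic
        updated[modality] = w - lr * grads[modality]
    return updated

params = {"audio": 1.0, "video": 1.0}
grads = {"audio": 1.0, "video": 1.0}
new = asymmetric_step(params, grads, biased={"audio"})
print(new)   # audio barely moves; video adapts an order of magnitude faster
```

Decoupling the two regimes is what lets a single adaptation loop resist catastrophic forgetting on the unreliable modality while still tracking distribution shift on the reliable one.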