#mixture-of-experts News & Analysis

130 articles tagged with #mixture-of-experts. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

ConMoE presents a novel post-training compression method for Mixture-of-Experts language models that consolidates expert pools through prototype reassignment rather than pruning or weight merging. The train-free approach selectively retains pretrained experts as reusable prototypes and remaps original expert references to these prototypes, achieving competitive or superior performance on major MoE models while significantly reducing deployment memory requirements.
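The idea of remapping expert references onto a smaller prototype pool can be sketched as a lookup table. This is a minimal illustration, not ConMoE's actual procedure: the names `build_remap` and `prototype_ids` are hypothetical, and assigning each expert to its nearest prototype by squared Euclidean distance is an assumption made only for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two flat weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_remap(expert_weights, prototype_ids):
    """Map every original expert id to the id of a retained prototype.

    Assumption for illustration: an expert is reassigned to the prototype
    whose weights are closest to its own. Prototypes map to themselves.
    """
    remap = {}
    for expert_id, weights in enumerate(expert_weights):
        remap[expert_id] = min(
            prototype_ids,
            key=lambda pid: squared_distance(weights, expert_weights[pid]),
        )
    return remap


# Four toy experts; only experts 0 and 2 are kept in memory as prototypes.
expert_weights = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]]
remap = build_remap(expert_weights, prototype_ids=[0, 2])

# The router still emits original expert ids; the table redirects them.
routed_expert = 3
served_by = remap[routed_expert]
```

In this sketch the router is left untouched and only the table changes, so memory scales with the number of prototypes rather than the number of original experts.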

AI · Neutral · arXiv – CS AI · May 29 · 6/10

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

Researchers propose a mathematical model explaining how Mixture-of-Experts (MoE) neural networks can suddenly shift from balanced to imbalanced expert utilization. The model reveals a bifurcation mechanism where increased feedback strength triggers abrupt transitions between stable states, providing theoretical insight into a practical problem affecting large language models and distributed AI systems.
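A toy two-expert system shows how such an abrupt transition can arise. The sketch below is not the paper's model. It assumes a hypothetical mean-field update in which expert A's traffic share `p` feeds back into its own routing logit with strength `beta`, so that `p` is mapped to `sigmoid(beta * (2p - 1))`. Under that assumption the balanced state `p = 0.5` is stable for `beta < 2` and unstable above it, because the slope of the map at `p = 0.5` is `beta / 2`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def iterate_share(beta, p0=0.51, steps=200):
    """Iterate the assumed feedback map p -> sigmoid(beta * (2p - 1)).

    p is the fraction of tokens sent to expert A; p0 is a slightly
    imbalanced starting point. Returns the share after `steps` updates.
    """
    p = p0
    for _ in range(steps):
        p = sigmoid(beta * (2.0 * p - 1.0))
    return p


weak_feedback = iterate_share(beta=1.0)    # perturbation decays toward 0.5
strong_feedback = iterate_share(beta=4.0)  # perturbation grows to a lopsided state
```

With weak feedback the small initial imbalance dies out, while with strong feedback the same starting point ends with one expert receiving most of the traffic.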

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

Researchers have developed a mathematical framework that preserves closed-form variational inference when multiple probabilistic models are composed, a step that traditionally breaks analytical tractability. By identifying five core factor-graph primitives and proving their composability, the work enables Bayesian mixture-of-experts models with inferred gating functions, demonstrated through improved ensemble forecasting with calibrated uncertainty.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Laguna M.1/XS.2 Technical Report

Poolside has released Laguna M.1 and XS.2, two Mixture-of-Experts foundation models designed for agentic coding tasks, with the smaller XS.2 model open-sourced under Apache 2.0. Both models achieve competitive performance on software engineering benchmarks, and the report introduces a vertically integrated 'Model Factory' approach intended to streamline AI development.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · May 28 · 6/10

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

Researchers introduce PEAM, a parametric memory framework for AI agents in Minecraft that consolidates learned skills directly into model parameters rather than relying on retrieval-based memory. The system uses a mixture-of-experts architecture with contrastive learning to internalize both successful and failed experiences, achieving better long-horizon task performance while avoiding catastrophic forgetting.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Continual Model Routing in Evolving Model Hubs

Researchers introduce Continual Model Routing (CMR), a framework addressing the challenge of efficiently selecting from thousands of pre-trained models in expanding AI hubs. They present CMRBench, a large-scale benchmark with over 2,000 candidate models, and CARvE, a contrastive embedding method that outperforms existing routing strategies as model repositories grow.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey

A comprehensive survey examines how Mixture-of-Experts (MoE) architectures address multimodal learning challenges by enabling scalable modeling, enriching representation learning across modalities, and adapting to imperfect data scenarios. The research identifies critical gaps in interpretable routing, expert communication, and lifelong multimodal learning, positioning MoE as a foundational framework for building more efficient and flexible AI systems.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

Researchers introduce FPMoE, a sparse Mixture-of-Experts model optimized for functional programming languages like Haskell, OCaml, and Scala, addressing a significant gap in LLM-based code generation. With only 3B active parameters, the model matches the performance of much larger models while using a novel architecture combining language-specific experts with a shared expert for cross-language functional patterns.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

Researchers present a method for aggressively pruning expert modules from mixture-of-experts large language models to create specialized translation systems. The approach removes up to 90% of experts with minimal performance degradation, demonstrating that translation tasks require only a fraction of a full LLM's parameters, enabling substantial model compression.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

Researchers introduce SMILE-Next, a comprehensive dataset and specialized large language model framework for understanding laughter in real-world contexts. The work combines laughter detection, classification, and reasoning tasks with novel training techniques including laughter-specific self-instruction and a mixture-of-experts architecture to improve multimodal language model performance on this underexplored domain.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

VidPrism introduces a heterogeneous Mixture-of-Experts framework that enhances Vision-Language Models for video understanding by deploying specialized experts rather than identical generalists. The approach uses dynamic multi-rate sampling and bidirectional fusion to achieve state-of-the-art performance on video recognition benchmarks.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

Researchers propose RA-MoE, a fine-tuning framework that optimizes Mixture-of-Experts language models for multilingual tasks by aligning target-language routing patterns with English task performance in middle layers. The approach outperforms standard fine-tuning across multiple models and languages, addressing a critical gap in adapting efficient LLM architectures for non-English downstream applications.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

Researchers introduce SAME, a new approach for training Multimodal Large Language Models that can continuously learn new tasks without forgetting previous capabilities. The method addresses fundamental problems in continual learning by stabilizing how AI systems route tasks to specialized expert networks and preventing knowledge degradation over time.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

Researchers have developed BioFact-MoE, a machine learning framework that uses specialized expert networks to separately analyze liver and tumor factors in hepatocellular carcinoma prognosis. The model achieves superior survival prediction accuracy (75%+ AUC at 12-18 months) while providing interpretable biological insights into treatment heterogeneity.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

Researchers introduce Dense2MoE, a framework that converts dense language models into efficient Mixture of Experts (MoE) architectures through unified pruning and upcycling, enabling viable on-device LLM deployment with improved latency-accuracy tradeoffs.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

L2Rec introduces a novel framework that adapts large language models for personalized recommendations by unifying behavioral and semantic signals at the parameter level using a Dual-view Personalized Mixture-of-Experts mechanism. The approach demonstrates superior performance across multiple datasets and validates real-world applicability through industrial A/B testing.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

Researchers propose R2E-IG, a deep reinforcement learning model using mixture-of-experts architecture to improve vehicle routing problem solutions across different data distributions. The approach combines residual-refined expert modules with instance-level gating and dynamic weight adaptation training, achieving competitive performance on both standard and out-of-distribution test cases.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

Researchers present a new quantization method for large video diffusion models that achieves 59.3% memory reduction while maintaining near-baseline quality. The technique addresses challenges in compressing Wan2.2-I2V's mixture-of-experts architecture by using timestep-aware and expert-specific calibration strategies.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Hierarchical Mixture-of-Experts with Two-Stage Optimization

Researchers introduce Hi-MoE, a hierarchical Mixture-of-Experts framework that addresses a fundamental routing trade-off in sparse MoE models through two-stage optimization: inter-group load balancing and intra-group expert specialization. Tested on large-scale NLP and vision tasks, Hi-MoE achieves a 5.6% perplexity improvement and better expert balance than existing methods.

🏢 Meta · 🏢 Perplexity

AI · Neutral · arXiv – CS AI · May 12 · 6/10

SDG-MoE: Signed Debate Graph Mixture-of-Experts

Researchers introduce SDG-MoE, a novel mixture-of-experts architecture that enables deliberation among routed experts through signed graph communication before output aggregation. The model demonstrates 19.8% perplexity improvement over vanilla MoE and achieves state-of-the-art results on multiple language modeling benchmarks while maintaining computational efficiency.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 12 · 6/10

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

Researchers introduce SimReg, an embedding similarity regularization technique for large language model pretraining that improves training efficiency by encouraging similar token representations to cluster together while separating different tokens. The approach achieves over 30% faster training convergence and 1% improvement in zero-shot performance across standard benchmarks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

Researchers studying one-layer Transformers found that architectural choices in feedforward networks (FFNs), particularly sparse mixture-of-experts (MoE) routing, fundamentally reshape how attention mechanisms learn to compute, with sparsity rather than learned specialization driving this computational redistribution.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Mixture of Layers with Hybrid Attention

Researchers introduce Mixture of Layers (MoL), a novel architecture that extends Mixture-of-Experts concepts from individual experts to entire transformer blocks, using parallel thin blocks with learned routing. The approach incorporates hybrid attention combining global softmax with linear attention to address token coverage limitations in sparse routing systems.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

Researchers propose MoLF (Mixture of LoRA and Full Fine-Tuning), a hybrid framework that dynamically routes gradient updates between full fine-tuning and low-rank adaptation during LLM training. The approach addresses limitations of relying solely on either method, achieving competitive or superior performance across diverse tasks while maintaining training stability and memory efficiency.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Tracking Large-scale Shared Bikes with Inertial Motion Learning in GNSS Blocked Environments

Researchers propose an inertial motion learning framework for tracking shared bikes in GNSS-denied environments such as urban canyons, combining mechanical constraints with mixture-of-experts models to achieve a 12% accuracy improvement over baselines. The system uses pedaling behavior patterns to dynamically calibrate wheel speed estimates, and its practical viability is demonstrated with real-world deployment data from DiDi's bike-sharing platform.
