#mixture-of-experts News & Analysis

84 articles tagged with #mixture-of-experts. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Researchers present a framework for converting Mixture-of-Experts (MoE) language models into standard dense architectures through expert selection, grouping, and knowledge distillation. The method achieves superior performance compared to traditional dense-to-dense pruning while enabling deployment on memory-constrained systems.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

MobileMoE: Scaling On-Device Mixture of Experts

Researchers present MobileMoE, a family of sub-billion-parameter Mixture-of-Experts language models optimized for on-device deployment that achieve 2-4x efficiency gains over dense models while matching or exceeding their performance. The work establishes new on-device scaling laws and delivers the first practical MoE inference implementation on smartphones, running 1.8-3.8x faster than existing mobile baselines.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

Researchers introduce ReMoE, a router fine-tuning framework that optimizes Mixture-of-Experts language models for memory-constrained inference by increasing expert reuse and reducing storage I/O overhead. The approach improves expert reuse by 26% while maintaining performance, delivering up to 1.99× decode speedup on edge devices.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax introduces the M2 series, a Mixture-of-Experts language model with 229.9B total parameters but only 9.8B activated per token, achieving frontier-tier performance on agentic tasks through agent-driven data pipelines and a custom reinforcement learning system called Forge. The M2.7 checkpoint demonstrates early self-evolution capabilities, autonomously debugging and modifying its own training scaffold.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Researchers introduce a symmetry-compatible principle for neural network optimizer design that aligns gradient updates with the geometric properties of different parameter types. The approach yields specialized update rules for embeddings, language model heads, SwiGLU MLPs, and mixture-of-experts routers, demonstrating improved validation loss and training stability across multiple language model architectures compared to standard AdamW optimization.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

ZAYA1-VL-8B Technical Report

Zyphra has released ZAYA1-VL-8B, a compact mixture-of-experts vision-language model that delivers performance competitive with larger systems while using significantly fewer active parameters. The model introduces vision-specific LoRA adapters and bidirectional attention mechanisms to enhance visual understanding, representing meaningful progress in efficient AI model design.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution

Researchers demonstrate that Mixture of Experts (MoE) models contain substantial underutilized sparsity within individual experts that can be exploited without modifying model parameters. By implementing intra-expert activation sparsity in vLLM, they achieve up to 2.5x speedup in MoE layer execution, offering a practical optimization path for efficient large language model deployment.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

Researchers introduce MISA, an optimization technique that reduces computational costs in DeepSeek's sparse attention mechanism for large language models by treating indexer heads as a mixture-of-experts system. The method achieves 3.82x speedup on GPU inference while maintaining performance across benchmarks, addressing a key bottleneck in long-context LLM processing.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · May 9 · 7/10

ZAYA1-8B Technical Report

Zyphra has unveiled ZAYA1-8B, a compact reasoning-focused AI model with only 700M active parameters that matches larger competitors like DeepSeek-R1 on mathematics and coding tasks. The model introduces Markovian RSA, a novel test-time compute method that achieves 91.9% on AIME'25 benchmarks while maintaining computational efficiency, suggesting small models can compete with much larger reasoning systems through architectural innovation.

🧠 GPT-5 · 🧠 Gemini

AI · Bullish · arXiv – CS AI · May 7 · 7/10

LAWS: Learning from Actual Workloads Symbolically -- A Self-Certifying Parametrized Cache Architecture for Neural Inference, Robotics, and Edge Deployment

Researchers introduce LAWS, a self-certifying caching architecture for neural inference that builds a library of expert functions with formal error bounds, enabling efficient deployment across LLMs, robotics, and edge devices. The system generalizes both Mixture-of-Experts and KV prefix caching while providing mathematically verifiable performance guarantees without requiring ground truth validation.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise

Researchers demonstrate that expert specialization in Mixture-of-Experts (MoE) language models emerges from hidden-state geometry rather than from a specialized routing architecture, challenging assumptions about how these systems work. Expert routing patterns resist human interpretation across models and tasks, suggesting that understanding MoE specialization remains as difficult as the broader unsolved problem of interpreting LLM internal representations.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

Researchers propose a method to adapt 2D multimodal large language models for 3D medical imaging analysis, introducing a Text-Guided Hierarchical Mixture of Experts framework that enables task-specific feature extraction. The approach demonstrates improved performance on medical report generation and visual question answering tasks while reusing pre-trained parameters from 2D models.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

Researchers introduce SpecMoE, a new inference system that applies speculative decoding to Mixture-of-Experts language models to improve computational efficiency. The approach achieves up to 4.30x throughput improvements while reducing memory and bandwidth requirements without requiring model retraining.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

MoEITS: A Green AI approach for simplifying MoE-LLMs

Researchers present MoEITS, a novel algorithm for simplifying Mixture-of-Experts large language models while maintaining performance and reducing computational costs. The method outperforms existing pruning techniques across multiple benchmark models including Mixtral 8×7B and DeepSeek-V2-Lite, addressing the energy and resource efficiency challenges of deploying advanced LLMs.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method achieves higher accuracy than existing approaches while reducing inference memory overhead on large-scale models like Switch Transformer and Mixtral with minimal computational overhead.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Researchers introduce MoBiE, a novel binarization framework designed specifically for Mixture-of-Experts large language models that achieves significant efficiency gains through weight compression while maintaining model performance. The method addresses unique challenges in quantizing MoE architectures and demonstrates over 2× inference speedup with substantial perplexity reductions on benchmark models.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

JoyAI-LLM Flash is a new efficient Mixture-of-Experts language model with 48B parameters that activates only 2.7B per forward pass, trained on 20 trillion tokens. The model introduces FiberPO, a novel reinforcement learning algorithm, and achieves higher sparsity ratios than comparable industry models while being released open-source on Hugging Face.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus

Researchers propose Council Mode, a multi-agent consensus framework that reduces AI hallucinations by 35.9% by routing queries to multiple diverse LLMs and synthesizing their outputs through a dedicated consensus model. The system operates through intelligent triage classification, parallel expert generation, and structured consensus synthesis to address factual accuracy issues in large language models.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Ming-Flash-Omni is a new 100-billion-parameter multimodal Mixture-of-Experts model that activates only 6.1 billion parameters per token. The model demonstrates unified capabilities across vision, speech, and language tasks, achieving performance comparable to Gemini 2.5 Pro on vision-language benchmarks.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing

Researchers introduce LightMoE, a new framework that compresses Mixture-of-Experts language models by replacing redundant expert modules with parameter-efficient alternatives. The method achieves 30-50% compression rates while maintaining or improving performance, addressing the substantial memory demands that limit MoE model deployment.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Researchers have developed a new scaling law for Mixture-of-Experts (MoE) models that optimizes compute allocation between expert and attention layers. The study extends the Chinchilla scaling law by introducing an optimal ratio formula that follows a power-law relationship with total compute and model sparsity.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers

Researchers have developed Variational Mixture-of-Experts Routing (VMoER), a Bayesian framework that enables uncertainty quantification in large-scale AI models while adding less than 1% computational overhead. The method improves routing stability by 38%, reduces calibration error by 94%, and increases out-of-distribution detection by 12%.

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

Quantifying the Necessity of Chain of Thought through Opaque Serial Depth

Researchers introduce 'opaque serial depth' as a metric to measure how much reasoning large language models can perform without externalizing it through chain of thought processes. The study provides computational bounds for Gemma 3 models and releases open-source tools to calculate these bounds for any neural network architecture.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

RANGER: Sparsely-Gated Mixture-of-Experts with Adaptive Retrieval Re-ranking for Pathology Report Generation

Researchers introduce RANGER, a new AI framework using sparsely-gated Mixture-of-Experts architecture for generating pathology reports from medical images. The system achieves superior performance on standard benchmarks by enabling dynamic expert specialization and reducing noise through adaptive retrieval re-ranking.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Uni-NTFM: A Unified Foundation Model for EEG Signal Representation Learning

Researchers developed Uni-NTFM, a new foundation model for EEG signal analysis that incorporates biological neural mechanisms and scales to a record 1.9 billion parameters. The model was pre-trained on 28,000 hours of EEG data and outperformed existing models across nine downstream tasks by aligning its architecture with actual brain function.
