AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers present a framework for converting Mixture-of-Experts (MoE) language models into standard dense architectures through expert selection, grouping, and knowledge distillation. The method achieves superior performance compared to traditional dense-to-dense pruning while enabling deployment on memory-constrained systems.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers present MobileMoE, a family of sub-billion parameter Mixture-of-Experts language models optimized for on-device deployment that achieve 2-4x efficiency gains over dense models while matching or exceeding their performance. The work establishes new on-device scaling laws and delivers the first practical MoE inference implementation on smartphones, running 1.8-3.8x faster than existing mobile baselines.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce ReMoE, a router fine-tuning framework that optimizes Mixture-of-Experts language models for memory-constrained inference by increasing expert reuse and reducing storage I/O overhead. The approach improves expert reuse by 26% while maintaining performance, delivering up to 1.99× decode speedup on edge devices.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠MiniMax introduces the M2 series, a family of Mixture-of-Experts language models with 229.9B total parameters but only 9.8B activated per token, achieving frontier-tier performance on agentic tasks through agent-driven data pipelines and a custom reinforcement learning system called Forge. The M2.7 checkpoint demonstrates early self-evolution capabilities, autonomously debugging and modifying its own training scaffold.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce a symmetry-compatible principle for neural network optimizer design that aligns gradient updates with the geometric properties of different parameter types. The approach yields specialized update rules for embeddings, language model heads, SwiGLU MLPs, and mixture-of-experts routers, demonstrating improved validation loss and training stability across multiple language model architectures compared to standard AdamW optimization.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Zyphra has released ZAYA1-VL-8B, a compact mixture-of-experts vision-language model that delivers performance competitive with larger systems while using significantly fewer active parameters. The model introduces vision-specific LoRA adapters and bidirectional attention mechanisms to enhance visual understanding, representing meaningful progress in efficient AI model design.
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that Mixture of Experts (MoE) models contain substantial underutilized sparsity within individual experts that can be exploited without modifying model parameters. By implementing intra-expert activation sparsity in vLLM, they achieve up to 2.5x speedup in MoE layer execution, offering a practical optimization path for efficient large language model deployment.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce MISA, an optimization technique that reduces computational costs in DeepSeek's sparse attention mechanism for large language models by treating indexer heads as a mixture-of-experts system. The method achieves 3.82x speedup on GPU inference while maintaining performance across benchmarks, addressing a key bottleneck in long-context LLM processing.
🏢 Nvidia
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Zyphra has unveiled ZAYA1-8B, a compact reasoning-focused AI model with only 700M active parameters that matches larger competitors like DeepSeek-R1 on mathematics and coding tasks. The model introduces Markovian RSA, a novel test-time compute method that achieves 91.9% on the AIME'25 benchmark while maintaining computational efficiency, suggesting small models can compete with much larger reasoning systems through architectural innovation.
🧠 GPT-5 · 🧠 Gemini
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce LAWS, a self-certifying caching architecture for neural inference that builds a library of expert functions with formal error bounds, enabling efficient deployment across LLMs, robotics, and edge devices. The system generalizes both Mixture-of-Experts and KV prefix caching while providing mathematically verifiable performance guarantees without requiring ground truth validation.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that Mixture-of-Experts (MoE) specialization in large language models emerges from hidden state geometry rather than specialized routing architecture, challenging assumptions about how these systems work. Expert routing patterns resist human interpretation across models and tasks, suggesting that understanding MoE specialization remains as difficult as the broader unsolved problem of interpreting LLM internal representations.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers propose a method to adapt 2D multimodal large language models for 3D medical imaging analysis, introducing a Text-Guided Hierarchical Mixture of Experts framework that enables task-specific feature extraction. The approach demonstrates improved performance on medical report generation and visual question answering tasks while reusing pre-trained parameters from 2D models.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce SpecMoE, a new inference system that applies speculative decoding to Mixture-of-Experts language models to improve computational efficiency. The approach achieves up to 4.30x throughput improvements while reducing memory and bandwidth requirements without requiring model retraining.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers present MoEITS, a novel algorithm for simplifying Mixture-of-Experts large language models while maintaining performance and reducing computational costs. The method outperforms existing pruning techniques across multiple benchmark models including Mixtral 8×7B and DeepSeek-V2-Lite, addressing the energy and resource efficiency challenges of deploying advanced LLMs.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method achieves higher accuracy than existing approaches while reducing inference memory on large-scale models such as Switch Transformer and Mixtral, at minimal computational cost.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers introduce MoBiE, a novel binarization framework designed specifically for Mixture-of-Experts large language models that achieves significant efficiency gains through weight compression while maintaining model performance. The method addresses unique challenges in quantizing MoE architectures and demonstrates over 2× inference speedup with substantial perplexity reductions on benchmark models.
AI · Bullish · arXiv – CS AI · Apr 6 · 7/10
🧠JoyAI-LLM Flash is a new efficient Mixture-of-Experts language model with 48B parameters that activates only 2.7B per forward pass, trained on 20 trillion tokens. The model introduces FiberPO, a novel reinforcement learning algorithm, and achieves higher sparsity ratios than comparable industry models while being released open-source on Hugging Face.
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · Apr 6 · 7/10
🧠Researchers propose Council Mode, a multi-agent consensus framework that reduces AI hallucinations by 35.9% by routing queries to multiple diverse LLMs and synthesizing their outputs through a dedicated consensus model. The system operates through intelligent triage classification, parallel expert generation, and structured consensus synthesis to address factual accuracy issues in large language models.
AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠Ming-Flash-Omni is a new 100 billion parameter multimodal AI model with Mixture-of-Experts architecture that uses only 6.1 billion active parameters per token. The model demonstrates unified capabilities across vision, speech, and language tasks, achieving performance comparable to Gemini 2.5 Pro on vision-language benchmarks.
🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers introduce LightMoE, a new framework that compresses Mixture-of-Experts language models by replacing redundant expert modules with parameter-efficient alternatives. The method achieves 30-50% compression rates while maintaining or improving performance, addressing the substantial memory demands that limit MoE model deployment.
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers have developed a new scaling law for Mixture-of-Experts (MoE) models that optimizes compute allocation between expert and attention layers. The study extends the Chinchilla scaling law by introducing an optimal ratio formula that follows a power-law relationship with total compute and model sparsity.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed Variational Mixture-of-Experts Routing (VMoER), a Bayesian framework that enables uncertainty quantification in large-scale AI models while adding less than 1% computational overhead. The method improves routing stability by 38%, reduces calibration error by 94%, and improves out-of-distribution detection by 12%.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce 'opaque serial depth' as a metric for how much reasoning large language models can perform without externalizing it through chain-of-thought. The study provides computational bounds for Gemma 3 models and releases open-source tools to calculate these bounds for any neural network architecture.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers introduce RANGER, a new AI framework using sparsely-gated Mixture-of-Experts architecture for generating pathology reports from medical images. The system achieves superior performance on standard benchmarks by enabling dynamic expert specialization and reducing noise through adaptive retrieval re-ranking.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers developed Uni-NTFM, a new foundation model for EEG signal analysis that incorporates biological neural mechanisms and scales to a record 1.9 billion parameters. The model was pre-trained on 28,000 hours of EEG data and outperformed existing models across nine downstream tasks by aligning its architecture with actual brain functionality.