#compute-optimization News & Analysis

4 articles tagged with #compute-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

🧠

Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Researchers have developed a new scaling law for Mixture-of-Experts (MoE) models that optimizes how compute is allocated between expert and attention layers. The study extends the Chinchilla scaling law with a formula for the optimal expert-to-attention ratio, which follows a power law in total compute and model sparsity.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4

🧠

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Researchers analyzed Mixture-of-Experts (MoE) language models to determine optimal sparsity levels for different tasks. They found that reasoning tasks require balancing active compute (FLOPs) with optimal data-to-parameter ratios, while memorization tasks benefit from more parameters regardless of sparsity.
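The quantities named in this summary can be made concrete with a rough sketch. It is not taken from the paper: the `MoEConfig` class, its field names, and the sparsity definition (fraction of experts left inactive per token) are assumptions chosen for illustration, and the FLOPs estimate uses the common 6 × active parameters × tokens rule of thumb for dense-equivalent training cost.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical MoE description; field names are illustrative only."""
    shared_params: float      # parameters every token uses (attention, embeddings, ...)
    params_per_expert: float  # parameters in one expert, summed over all layers
    num_experts: int          # experts available to the router
    top_k: int                # experts activated per token

    @property
    def total_params(self) -> float:
        # All parameters stored in the model.
        return self.shared_params + self.num_experts * self.params_per_expert

    @property
    def active_params(self) -> float:
        # Parameters actually used to process a single token.
        return self.shared_params + self.top_k * self.params_per_expert

    @property
    def sparsity(self) -> float:
        # Assumed definition: fraction of experts NOT activated per token.
        return 1.0 - self.top_k / self.num_experts


def training_flops(cfg: MoEConfig, tokens: float) -> float:
    # Rule-of-thumb estimate: about 6 FLOPs per active parameter per token.
    return 6.0 * cfg.active_params * tokens


def tokens_per_param(cfg: MoEConfig, tokens: float) -> float:
    # Data-to-parameter ratio, here measured against total parameters
    # (an assumption; a ratio against active parameters is also used in practice).
    return tokens / cfg.total_params
```

With these definitions, adding experts at fixed `top_k` raises `total_params` and `sparsity` while leaving `training_flops` unchanged, which is the kind of trade-off between stored parameters and active compute that the summary describes.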

AI · Neutral · arXiv – CS AI · May 28 · 6/10

🧠

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

ADWIN is a new framework for on-policy distillation that improves training efficiency by adaptively adjusting rollout lengths instead of requiring full completions for every update. By identifying when shorter teacher-anchored sequences contain sufficient signal for learning, the method reduces training costs by up to 4.1x while maintaining or improving accuracy on math and code reasoning tasks.

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

🧠

Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi

Researchers have developed LilMoo, a 0.6-billion-parameter Hindi language model trained from scratch using a transparent, reproducible pipeline optimized for limited-compute environments. The model outperforms similarly sized multilingual baselines such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that language-specific pretraining can rival larger multilingual models.