y0news

#llm-scaling News & Analysis

10 articles tagged with #llm-scaling. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Researchers present Memory Sparse Attention (MSA), a framework that lets language models process up to 100 million tokens with linear complexity and less than 9% performance degradation. The method targets current limits on long-term memory processing and can run 100M-token inference on just 2 GPUs, which could benefit applications such as large-corpus analysis and long-history reasoning.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Researchers introduce Speculative Thinking, a training-free framework that leverages larger AI models to guide smaller ones during inference, improving reasoning accuracy while reducing output length. The method achieves a 6.2% accuracy boost on mathematical reasoning tasks for a 1.5B parameter model with 15.7% shorter outputs, demonstrating efficiency gains without costly retraining.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

DOT-MoE: Differentiable Optimal Transport for MoEfication

Researchers introduce DOT-MoE, a framework that converts dense language models into sparse Mixture-of-Experts architectures using differentiable optimal transport. The method achieves 90% performance retention while reducing active parameters by 50%, addressing a critical bottleneck in LLM inference efficiency without the instability of training MoEs from scratch.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation

Researchers have developed a framework for generating high-quality synthetic data that enables Large Language Models to achieve predictable scaling laws for recommendation systems—a previously unattainable milestone. Models trained on this principled synthetic data outperform those trained on real user interaction data by 130% on key metrics, establishing a foundational methodology for scaling LLM capabilities in recommendations.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

MobileMoE: Scaling On-Device Mixture of Experts

Researchers present MobileMoE, a family of sub-billion parameter Mixture-of-Experts language models optimized for on-device deployment that achieve 2-4x efficiency gains over dense models while matching or exceeding their performance. The work establishes new on-device scaling laws and delivers the first practical MoE inference implementation on smartphones, running 1.8-3.8x faster than existing mobile baselines.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Research shows that large language models' performance on short tasks may underestimate their capabilities, as small improvements in single-step accuracy lead to exponential gains in handling longer tasks. The study reveals that larger models excel at execution over many steps, though they suffer from 'self-conditioning' where previous errors increase the likelihood of future mistakes, which can be mitigated through 'thinking' mechanisms.
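The compounding effect described in this summary can be illustrated with a short sketch. It assumes every step succeeds independently with the same accuracy, a simplification that is not taken from the paper and that the self-conditioning effect noted above would violate. The function names are hypothetical.

```python
import math


def task_success(step_accuracy: float, num_steps: int) -> float:
    """Probability of finishing num_steps steps with no error.

    Assumes independent steps with constant accuracy (an illustrative
    simplification, not the paper's measured behavior).
    """
    return step_accuracy ** num_steps


def horizon_length(step_accuracy: float, target_success: float = 0.5) -> float:
    """Number of steps at which task success falls to target_success."""
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < target_success < 1.0:
        raise ValueError("target_success must be strictly between 0 and 1")
    # Solve step_accuracy ** H = target_success for H.
    return math.log(target_success) / math.log(step_accuracy)


if __name__ == "__main__":
    for p in (0.90, 0.99, 0.999):
        print(f"step accuracy {p}: horizon ~ {horizon_length(p):.1f} steps")
```

Under this assumption, raising single-step accuracy from 99% to 99.9% stretches the 50%-success horizon from roughly 69 steps to roughly 693, which is why a small per-step gain can look like a large jump on long tasks.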

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Understanding Generative Recommendation with Semantic IDs from a Model-scaling View

Researchers demonstrate that semantic ID-based generative recommendation systems hit significant scaling bottlenecks, while large language models used directly as recommenders show superior scaling properties and up to 20% performance improvements. This challenges current approaches in generative recommendation and suggests LLM-based systems represent a more promising path forward for recommendation foundation models.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

Researchers propose DAG-MoE, a new Mixture-of-Experts architecture that improves large language model scaling by optimizing how expert outputs are aggregated rather than just increasing expert count. The framework uses structural aggregation instead of weighted summation, enabling multi-step reasoning within a single layer while reducing routing overhead and improving both pretraining and fine-tuning performance.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Scaling Behavior of Single LLM-Driven Multi-Agent Systems

Researchers demonstrate that multi-agent LLM systems exhibit diminishing returns as agent count increases, challenging the assumption that more agents automatically improve performance. The study reveals that optimal scaling depends on base model capability, task type, and interaction design, with coordination overhead—not context limitations—driving performance degradation.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing

Researchers present HoReN, a novel method for editing large language models that preserves original knowledge while incorporating new information through a codebook-based external memory system. The approach uses Hopfield networks and angular similarity retrieval to handle up to 50,000 sequential edits, significantly outperforming existing model editing techniques that degrade at scale.