y0news

#ai-efficiency News & Analysis

72 articles tagged with #ai-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 1d ago · 7/10

SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

SpecBranch introduces a novel speculative decoding framework that leverages branch parallelism to accelerate large language model inference, achieving 1.8x to 4.5x speedups over standard auto-regressive decoding. The technique addresses serialization bottlenecks in existing speculative decoding methods by implementing parallel drafting branches with adaptive token lengths and rollback-aware orchestration.
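The verify-and-rollback loop underlying all speculative decoding can be sketched as follows. This is a minimal single-branch version of the generic technique, not SpecBranch's branch-parallel variant; `target_next` and `draft_next` are hypothetical greedy next-token functions standing in for real models.

```python
def speculative_decode(target_next, draft_next, prompt, max_tokens, k=4):
    """Generic speculative decoding: a cheap draft model proposes k tokens,
    the target model verifies them, and generation rolls back to the first
    mismatch (a sketch, not SpecBranch's parallel-branch orchestration)."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        # Draft k candidate tokens autoregressively with the cheap model.
        ctx, draft = list(out), []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Verify: accept the longest draft prefix the target agrees with.
        accepted = 0
        for t in draft:
            if target_next(out) == t:
                out.append(t)
                accepted += 1
            else:
                break
        # Rollback point: the target emits one corrective token, so the
        # loop makes progress even when the entire draft is rejected.
        if accepted < k:
            out.append(target_next(out))
    return out[len(prompt):len(prompt) + max_tokens]
```

Because every accepted token is verified against the target, the output is bit-identical to plain auto-regressive decoding; the speedup comes from verifying k tokens per target call when the draft is good.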

AI · Neutral · arXiv – CS AI · 2d ago · 7/10

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

Researchers challenge the assumption that longer reasoning chains always improve LLM performance, discovering that extended test-time compute leads to diminishing returns and 'overthinking' where models abandon correct answers. The study demonstrates that optimal compute allocation varies by problem difficulty, enabling significant efficiency gains without sacrificing accuracy.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM

Researchers introduce AtlasKV, a parametric knowledge integration method that enables large language models to leverage billion-scale knowledge graphs while consuming less than 20GB of VRAM. Unlike traditional retrieval-augmented generation (RAG) approaches, AtlasKV integrates knowledge directly into LLM parameters without requiring external retrievers or extended context windows, reducing inference latency and computational overhead.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Zero-Shot Quantization via Weight-Space Arithmetic

Researchers have developed a zero-shot quantization method that transfers robustness between AI models through weight-space arithmetic, improving post-training quantization performance by up to 60% without requiring additional training. This breakthrough enables low-cost deployment of extremely low-bit models by extracting 'quantization vectors' from donor models to patch receiver models.
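The weight-space arithmetic the summary describes can be sketched roughly like this. Everything here is an assumption for illustration: the uniform fake-quantizer stands in for whatever PTQ scheme is actually used, and the donor/receiver names follow the summary, not the paper's notation.

```python
import numpy as np

def fake_quant(w, bits=4):
    # Uniform symmetric fake-quantization, standing in for whatever
    # post-training quantizer the deployment actually uses.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def quantization_vector(donor_robust, donor_base):
    # The weight-space direction that made the donor model
    # quantization-robust: a simple parameter difference.
    return donor_robust - donor_base

def patch_receiver(receiver, qvec, alpha=1.0):
    # Zero-shot transfer: add the direction to the receiver's weights,
    # then quantize as usual -- no extra training involved.
    return fake_quant(receiver + alpha * qvec)
```

The appeal of the approach is that `quantization_vector` is computed once from a donor pair and reused across receiver models.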

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression

Researchers propose SoLA, a training-free compression method for large language models that combines soft activation sparsity with low-rank decomposition. The method compresses LLaMA-2-70B by 30% while improving quality, reducing perplexity from 6.95 to 4.44 and lifting downstream-task accuracy by 10%.
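The two ingredients can be sketched independently. The truncated SVD below is the textbook low-rank factorization, and the soft-thresholding function is one plausible reading of "soft activation sparsity"; SoLA's actual decomposition and sparsification may differ.

```python
import numpy as np

def low_rank_factor(W, rank):
    # Truncated SVD: W is approximated by A @ B, shrinking d_out*d_in
    # parameters to rank*(d_out + d_in).
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

def soft_sparsify(x, tau):
    # Soft activation sparsity: shrink small activations toward zero
    # rather than hard-masking them (our interpretation of "soft").
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)
```

At inference time the dense matmul `x @ W` becomes the cheaper `(x @ A) @ B` whenever `rank` is small relative to the layer dimensions.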

🏢 Perplexity
AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

Researchers discovered that in Large Reasoning Models like DeepSeek-R1, the first solution is often the best, with alternative solutions being detrimental due to error accumulation. They propose RED, a new framework that achieves up to 19% performance gains while reducing token consumption by 37.7-70.4%.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

D-MEM: Dopamine-Gated Agentic Memory via Reward Prediction Error Routing

Researchers introduce D-MEM, a biologically-inspired memory architecture for AI agents that uses dopamine-like reward prediction error routing to dramatically reduce computational costs. The system reduces token consumption by over 80% and eliminates quadratic scaling bottlenecks by selectively processing only high-importance information through cognitive restructuring.
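The gating idea as the summary describes it can be sketched in a few lines. The threshold, learning rate, and TD-style expectation update below are our choices for illustration, not D-MEM's actual routing rule.

```python
class GatedMemory:
    """Dopamine-style memory gate: write an event only when its reward
    prediction error (surprise) is large, so low-importance events never
    consume tokens downstream. A sketch, not D-MEM's architecture."""
    def __init__(self, threshold=0.5, lr=0.2):
        self.items, self.expected = [], 0.0
        self.threshold, self.lr = threshold, lr

    def observe(self, event, reward):
        rpe = reward - self.expected        # reward prediction error
        if abs(rpe) >= self.threshold:      # high surprise -> store
            self.items.append(event)
        self.expected += self.lr * rpe      # TD-style expectation update
        return abs(rpe) >= self.threshold
```

As the expectation adapts, repeated unsurprising events stop being stored, which is where the token savings would come from.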

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Training-Free Agentic AI: Probabilistic Control and Coordination in Multi-Agent LLM Systems

Researchers introduce REDEREF, a training-free controller that improves multi-agent LLM system efficiency, cutting token usage by 28% and agent calls by 17% through probabilistic routing and belief-guided delegation. The system uses Thompson sampling and reflection-driven re-routing to optimize agent coordination without requiring model fine-tuning.
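Thompson sampling for agent selection is standard and can be sketched directly; the belief-guided delegation and reflection-driven re-routing layers of REDEREF are not reproduced here, and the agent names are illustrative.

```python
import random

class ThompsonRouter:
    """Route each query to one of several agents, learning per-agent
    success rates as Beta posteriors (a generic Thompson-sampling
    sketch, not REDEREF's full controller)."""
    def __init__(self, agents, seed=0):
        self.ab = {a: [1, 1] for a in agents}   # Beta(1,1) priors
        self.rng = random.Random(seed)

    def pick(self):
        # Sample a plausible success rate per agent; route to the argmax.
        draws = {a: self.rng.betavariate(*ab) for a, ab in self.ab.items()}
        return max(draws, key=draws.get)

    def update(self, agent, success):
        # Posterior update: successes bump alpha, failures bump beta.
        self.ab[agent][0 if success else 1] += 1
```

Unreliable agents quickly accumulate failures and stop being selected, which is how routing alone can cut redundant agent calls without any fine-tuning.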

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Researchers have developed Pyramid MoA, a new framework that optimizes large language model inference costs by using a hierarchical router system that escalates queries to more expensive models only when necessary. The system achieves up to 62.7% cost savings while maintaining Oracle-level accuracy on various benchmarks including coding and mathematical reasoning tasks.
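The escalation pattern is simple enough to sketch. Tier names, confidence scores, and thresholds below are illustrative placeholders, not Pyramid MoA's probabilistic routing criterion.

```python
def cascade(query, tiers, thresholds):
    """Escalating inference: try cheap models first and call the next,
    pricier tier only when confidence falls below that tier's threshold.
    A sketch of the general pattern, not Pyramid MoA's router."""
    for model, tau in zip(tiers, thresholds):
        answer, confidence = model(query)
        if confidence >= tau:
            return answer, model.__name__
    return answer, model.__name__   # last tier answers unconditionally
```

Cost savings come from the fact that most queries exit at the cheap tiers; only the hard residue ever reaches the expensive model.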

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Training Language Models via Neural Cellular Automata

Researchers developed a method using neural cellular automata (NCA) to generate synthetic data for pre-training language models, achieving up to 6% improvement in downstream performance with only 164M synthetic tokens. This approach outperformed traditional pre-training on 1.6B natural language tokens while being more computationally efficient and transferring well to reasoning benchmarks.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Reviving ConvNeXt for Efficient Convolutional Diffusion Models

Researchers introduce FCDM, a fully convolutional diffusion model based on ConvNeXt architecture that achieves competitive performance with DiT-XL/2 using only 50% of the computational resources. The model demonstrates exceptional training efficiency, requiring 7x fewer training steps and can be trained on just 4 GPUs, reviving convolutional networks as an efficient alternative to Transformer-based diffusion models.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Robust Training of Neural Networks at Arbitrary Precision and Sparsity

Researchers have developed a new framework for training neural networks at ultra-low precision and high sparsity by modeling quantization as additive noise rather than using traditional Straight-Through Estimators. The method enables stable training of A1W1 and sub-1-bit networks, achieving state-of-the-art results for highly efficient neural networks including modern LLMs.
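The contrast with Straight-Through Estimators can be made concrete. In the sketch below (our illustration, not the paper's exact formulation), the train-time forward adds uniform noise of one quantization step instead of rounding, so the function stays smooth in the weights and ordinary backprop applies; only the deploy-time forward rounds.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_forward(w, x, bits=1):
    # Train-time surrogate: model quantization as additive noise of one
    # step's magnitude rather than round() + a straight-through gradient.
    qmax = max(2 ** (bits - 1) - 1, 1)
    step = np.abs(w).max() / qmax
    noise = rng.uniform(-0.5, 0.5, size=w.shape) * step
    return x @ (w + noise)

def quantized_forward(w, x, bits=1):
    # Deploy-time: the actual rounded, clipped weights.
    qmax = max(2 ** (bits - 1) - 1, 1)
    step = np.abs(w).max() / qmax
    wq = np.clip(np.round(w / step), -qmax, qmax) * step
    return x @ wq
```

Because the noisy forward never introduces a non-differentiable rounding step, training remains stable even at the 1-bit (A1W1) extremes the paper targets.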

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

AutoHarness: improving LLM agents by automatically synthesizing a code harness

Researchers developed AutoHarness, a technique where smaller LLMs like Gemini-2.5-Flash can automatically generate code harnesses to prevent illegal moves in games, outperforming larger models like Gemini-2.5-Pro and GPT-5.2-High. The method eliminates 78% of failures attributed to illegal moves in chess competitions and demonstrates superior performance across 145 different games.
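A code harness of the kind described amounts to a wrapper that filters the agent's output against the game's rule engine. The sketch below shows the wrapper shape only; AutoHarness's contribution is having a small LLM *synthesize* such harnesses automatically, and all names here are ours.

```python
def make_harness(legal_moves):
    """Wrap an LLM move generator so it can never emit an illegal move.
    `legal_moves` is any rule engine returning the legal moves for a
    state; the fallback policy (first legal move) is a placeholder."""
    def harnessed(agent, state):
        legal = sorted(legal_moves(state))
        move = agent(state, legal)
        # Reject hallucinated moves and fall back to a safe legal one.
        return move if move in legal else legal[0]
    return harnessed
```

Passing the legal-move list into the agent's prompt and filtering its answer are both cheap, which is why a small model plus a harness can out-play a larger unconstrained one.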

🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

AMiD: Knowledge Distillation for LLMs with $\alpha$-mixture Assistant Distribution

Researchers from KAIST propose AMiD, a new knowledge distillation framework that improves the efficiency of training smaller language models by transferring knowledge from larger models. The technique introduces α-mixture assistant distribution to address training instability and capacity gaps in existing approaches.
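The assistant-distribution idea can be illustrated with its simplest member. AMiD defines a generalized α-mixture family; the plain convex combination below is our illustrative stand-in, distilling the student toward the mixture rather than the raw teacher.

```python
import numpy as np

def alpha_mixture(p_teacher, p_student, alpha):
    # Assistant distribution between teacher and student: at alpha=1 it
    # is the teacher, at alpha=0 the student, easing the capacity gap.
    return alpha * p_teacher + (1 - alpha) * p_student

def kd_loss(p_teacher, p_student, alpha, eps=1e-12):
    # Distillation objective: KL(assistant || student). Near alpha=0 the
    # target stays close to the student, stabilizing early training.
    a = alpha_mixture(p_teacher, p_student, alpha)
    return float(np.sum(a * (np.log(a + eps) - np.log(p_student + eps))))
```

Annealing α from 0 toward 1 over training would recover ordinary teacher distillation as the student catches up, which is one way to read the stability claim.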

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10

Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs

Research comparing Knowledge Tracing (KT) models to Large Language Models (LLMs) for predicting student responses found that specialized KT models significantly outperform LLMs in accuracy, speed, and cost-effectiveness. The study demonstrates that domain-specific models are superior to general-purpose LLMs for educational prediction tasks, with LLMs being orders of magnitude slower and more expensive to deploy.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Researchers introduce OptMerge, a new benchmark and method for combining multiple expert Multimodal Large Language Models (MLLMs) into single, more capable models without requiring additional training data. The approach achieves 2.48% average performance gains while reducing storage and serving costs by merging models across different modalities like vision, audio, and video.
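For readers unfamiliar with model merging, the simplest baseline it builds on is weighted parameter averaging across checkpoints. The sketch below shows that baseline only, not OptMerge's actual method.

```python
def merge_checkpoints(models, weights=None):
    """Weighted parameter averaging across expert checkpoints -- the
    simplest merging baseline. `models` is a list of
    {param_name: value} dicts sharing the same keys."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return {name: sum(w * m[name] for w, m in zip(weights, models))
            for name in models[0]}
```

The storage and serving savings in the summary follow directly: one merged checkpoint replaces several expert ones.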

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs

Researchers introduce SwiReasoning, a training-free framework that improves large language model reasoning by dynamically switching between explicit chain-of-thought and latent reasoning modes. The method achieves 1.8%-3.1% accuracy improvements and 57%-79% better token efficiency across mathematics, STEM, coding, and general benchmarks.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Researchers analyzed compression effects on large reasoning models (LRMs) through quantization, distillation, and pruning methods. They found that dynamically quantized 2.51-bit models maintain near-original performance, while identifying critical weight components and showing that protecting just 2% of excessively compressed weights can improve accuracy by 6.57%.

AI × Crypto · Neutral · Bankless · Feb 27 · 7/10

Jack Dorsey's Block Slashes 40% of Workforce, Credits AI Efficiency

Jack Dorsey's Block has laid off 40% of its workforce, citing AI efficiency as the reason for the massive reduction. The cryptocurrency XYZ surged 20% overnight as investors anticipate improved productivity through the company's agentic AI implementation.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study

Researchers developed a system that trains large language models using renewable energy during curtailment periods when excess clean electricity would otherwise be wasted. The distributed training approach across multiple GPU clusters reduced operational emissions to 5-12% of traditional single-site training while maintaining model quality.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

Researchers propose 'Intelligence per Watt' (IPW) as a metric to measure AI efficiency, finding that local AI models can handle 71.3% of queries while being 1.4x more energy efficient than cloud alternatives. The study demonstrates that smaller local language models (≤20B parameters) can redistribute computational demand from centralized cloud infrastructure.
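As a ratio, the metric is straightforward; the accuracy-fraction-per-joule form and the energy figures below are our simplification and hypothetical numbers, not the paper's normalization or measurements.

```python
def intelligence_per_watt(correct, total, energy_joules):
    # IPW-style ratio: task capability delivered per unit of energy.
    return (correct / total) / energy_joules
```

With hypothetical energies chosen so the local model is about 1.4x more energy-efficient, a local model answering 71.3% of queries can still win on IPW despite lower raw accuracy.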

Page 1 of 3