y0news

#model-efficiency News & Analysis

60 articles tagged with #model-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 1d ago · 7/10

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Researchers introduce Lightning OPD, an offline on-policy distillation framework that eliminates the need for live teacher inference servers during large language model post-training. By enforcing 'teacher consistency' (using the same teacher model for both supervised fine-tuning and distillation), the method achieves performance comparable to standard OPD while delivering a 4x speedup and significantly reducing infrastructure costs.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Multi-Model Synthetic Training for Mission-Critical Small Language Models

Researchers demonstrate a cost-effective approach to training specialized small language models by using LLMs as one-time teachers to generate synthetic training data. By converting 3.2 billion maritime vessel tracking records into 21,543 QA pairs, they fine-tuned Qwen2.5-7B to achieve 75% accuracy on maritime tasks at a fraction of the cost of deploying larger models, establishing a reproducible framework for domain-specific AI applications.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · 2d ago · 7/10

SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models

SVD-Prune introduces a training-free token pruning method for Vision-Language Models using Singular Value Decomposition to reduce computational overhead. The approach maintains model performance while drastically reducing vision tokens to 16-32, addressing efficiency challenges in multimodal AI systems without requiring retraining.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

Researchers identify dimensional misalignment as a critical bottleneck in compressed large language models, where parameter reduction fails to improve GPU performance due to hardware-incompatible tensor dimensions. They propose GAC (GPU-Aligned Compression), a new optimization method that achieves up to 1.5× speedup while maintaining model quality by ensuring hardware-friendly dimensions.
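
To make "hardware-friendly dimensions" concrete, here is a minimal sketch of rounding a compressed layer width up to a GPU-friendly multiple. It is an illustration only, not the GAC method itself: the helper names `align_up` and `padding_overhead`, the default multiple of 64, and the example width of 3187 are all assumptions chosen for the example.

```python
def align_up(dim: int, multiple: int = 64) -> int:
    """Round `dim` up to the nearest multiple of `multiple`.

    The default of 64 is an assumed alignment, not a value from the paper.
    """
    if dim <= 0 or multiple <= 0:
        raise ValueError("dim and multiple must be positive")
    return ((dim + multiple - 1) // multiple) * multiple


def padding_overhead(dim: int, multiple: int = 64) -> float:
    """Fraction of extra width added by aligning `dim`."""
    return (align_up(dim, multiple) - dim) / dim
```

For a hypothetical pruned width of 3187, `align_up(3187)` returns 3200, so the layer grows by 13 columns. The point of the sketch is that a slightly larger but aligned dimension can be the cheaper one to execute, which is the mismatch the summary describes.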

🧠 Llama
AI · Neutral · arXiv – CS AI · 2d ago · 7/10

The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise

Researchers demonstrate that expert specialization in Mixture of Experts (MoE) language models emerges from hidden state geometry rather than specialized routing architecture, challenging assumptions about how these systems work. Expert routing patterns resist human interpretation across models and tasks, suggesting that understanding MoE specialization remains as difficult as the broader unsolved problem of interpreting LLM internal representations.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

Researchers introduce GRIP, a unified framework that integrates retrieval decisions directly into language model generation through control tokens, eliminating the need for external retrieval controllers. The system enables models to autonomously decide when to retrieve information, reformulate queries, and terminate retrieval within a single autoregressive process, achieving competitive performance with GPT-4o while using substantially fewer parameters.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

Researchers demonstrate that inference-time scaffolding can double the performance of small 8B language models on complex tool-use tasks without additional training, by deploying the same frozen model in three specialized roles: summarization, reasoning, and code correction. On a single 24GB GPU, this approach enables an 8B model to match or exceed much larger systems like DeepSeek-Coder 33B, suggesting efficient deployment paths for capable AI agents on modest hardware.
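
The idea of reusing one frozen model in three roles can be sketched as a small pipeline. Everything below is an assumption made for illustration: the summary does not give the prompts, the order of the roles, or how outputs are passed along, so `ROLE_PROMPTS`, `run_step`, and the chaining shown here are hypothetical. The `model` argument stands in for any text-in, text-out call to the same frozen model.

```python
from typing import Callable

# Hypothetical role prompts; the paper's actual prompts are not in the summary.
ROLE_PROMPTS = {
    "summarizer": "Summarize the interaction so far:\n",
    "reasoner": "Given this summary, decide the next tool call:\n",
    "corrector": "Check this code for errors and return a fixed version:\n",
}


def run_step(model: Callable[[str], str], history: str, task: str) -> str:
    """Call the same frozen model three times, once per role."""
    # Role 1: compress the history so it fits a small context window.
    summary = model(ROLE_PROMPTS["summarizer"] + history)
    # Role 2: reason about the task using the compressed history.
    draft = model(ROLE_PROMPTS["reasoner"] + summary + "\nTask: " + task)
    # Role 3: review the draft and repair it before it is executed.
    return model(ROLE_PROMPTS["corrector"] + draft)
```

No weights change between calls; only the prompt differs, which is why the approach needs no additional training.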

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Dynamic sparsity in tree-structured feed-forward layers at scale

Researchers demonstrate that tree-structured sparse feed-forward layers can replace dense MLPs in large transformer models while maintaining performance, activating less than 5% of parameters per token. The work reveals an emergent auto-pruning mechanism where hard routing progressively converts dynamic sparsity into static structure, offering a scalable approach to reducing computational costs in language models beyond 1 billion parameters.
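
A rough sketch of hard routing through a tree shows why so few parameters are active per token. This is a generic illustration under stated assumptions, not the paper's architecture: it assumes a complete binary tree, one routing vector per internal node, a sign test as the hard decision, and equally sized nodes. The names `route` and `active_fraction` are hypothetical.

```python
from typing import List, Sequence, Tuple


def route(x: Sequence[float], routers: List[Sequence[float]]) -> Tuple[int, List[int]]:
    """Walk a complete binary tree stored in heap order.

    routers[i] is the routing vector of internal node i; the children of
    node i are 2*i + 1 (left) and 2*i + 2 (right).
    Returns (leaf_index, visited_internal_nodes).
    """
    n_internal = len(routers)
    node, visited = 0, []
    while node < n_internal:
        visited.append(node)
        score = sum(w * xi for w, xi in zip(routers[node], x))
        node = 2 * node + 1 + (1 if score > 0 else 0)  # hard left/right choice
    return node - n_internal, visited


def active_fraction(depth: int) -> float:
    """Share of tree nodes touched per token, assuming equally sized nodes."""
    total_nodes = 2 ** (depth + 1) - 1  # internal nodes plus leaves
    return (depth + 1) / total_nodes    # one root-to-leaf path
```

Under these assumptions a depth-7 tree touches 8 of its 255 nodes per token, roughly 3%, which is in line with the "less than 5%" figure in the summary.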

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

Researchers propose the Master Key Hypothesis, suggesting that AI model capabilities can be transferred across different model scales without retraining through linear subspace alignment. The UNLOCK framework demonstrates training-free capability transfer, achieving significant accuracy improvements such as 12.1% gains on mathematical reasoning tasks when transferring from larger to smaller models.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Computer Environments Elicit General Agentic Intelligence in LLMs

Researchers introduce LLM-in-Sandbox, a minimal computer environment that significantly enhances large language models' capabilities across diverse tasks without additional training. For weaker models, specialized training lets them internalize agent-like behaviors, demonstrating that environmental interaction, not just model parameters, drives general intelligence in LLMs.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

Researchers developed QED-Nano, a 4B parameter AI model that achieves competitive performance on Olympiad-level mathematical proofs despite being much smaller than proprietary systems. The model uses a three-stage training approach including supervised fine-tuning, reinforcement learning, and reasoning cache expansion to match larger models at a fraction of the inference cost.

🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Directional Routing in Transformers

Researchers introduce directional routing, a lightweight mechanism for transformer models that adds only 3.9% parameter cost but significantly improves performance. The technique gives attention heads learned suppression directions controlled by a shared router, reducing perplexity by 31-56% and becoming the dominant computational pathway in the model.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Reinforcing Numerical Reasoning in LLMs for Tabular Prediction via Structural Priors

Researchers propose PRPO (Permutation Relative Policy Optimization), a reinforcement learning framework that enhances large language models' numerical reasoning capabilities for tabular data prediction. The method achieves performance comparable to supervised baselines while excelling in zero-shot scenarios, with an 8B parameter model outperforming much larger models by up to 53.17%.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Phi-4-reasoning-vision-15B Technical Report

Researchers released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that combines vision and language capabilities with strong performance in scientific and mathematical reasoning. The model demonstrates that careful architecture design and high-quality data curation can enable smaller models to achieve competitive performance with fewer computational resources.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

SEM-CTRL: Semantically Controlled Decoding

Researchers introduce SEM-CTRL, a new approach that ensures Large Language Models produce syntactically and semantically correct outputs without requiring fine-tuning. The system uses token-level Monte Carlo Tree Search guided by Answer Set Grammars to enforce context-sensitive constraints, allowing smaller pre-trained LLMs to outperform larger models on tasks like reasoning and planning.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression

Researchers propose Router Knowledge Distillation (Router KD) to improve retraining-free compression of Mixture-of-Experts (MoE) models by calibrating routers while keeping expert parameters unchanged. The method addresses router-expert mismatch issues that cause performance degradation in compressed MoE models, showing particularly strong results in fine-grained MoE architectures.

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 3

Retrievit: In-context Retrieval Capabilities of Transformers, State Space Models, and Hybrid Architectures

Research compares Transformers, State Space Models (SSMs), and hybrid architectures for in-context retrieval tasks, finding hybrid models excel at information-dense retrieval while Transformers remain superior for position-based tasks. SSM-based models develop unique locality-aware embeddings that create interpretable positional structures, explaining their specific strengths and limitations.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

RxnNano: Training Compact LLMs for Chemical Reaction and Retrosynthesis Prediction via Hierarchical Curriculum Learning

Researchers developed RxnNano, a compact 0.5B-parameter AI model for chemical reaction prediction that outperforms much larger 7B+ parameter models by 23.5% through novel training techniques focused on chemical understanding rather than scale. The framework uses hierarchical curriculum learning and chemical consistency objectives to improve drug discovery and synthesis planning applications.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization

Researchers propose Decoupled Reward Policy Optimization (DRPO), a new framework that reduces computational costs in large reasoning models by 77% while maintaining performance. The method addresses the 'overthinking' problem where AI models generate unnecessarily long reasoning for simple questions, achieving significant efficiency gains over existing approaches.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing both the quadratic cost of self-attention and the growth of the key-value cache. The method outperforms existing eviction baselines across multiple benchmarks and provides insights into LLM interpretability through learned retention scores.
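
Score-based eviction under a fixed memory budget can be sketched with a min-heap. This is a simplified illustration, not TRIM-KV itself: the class name `BoundedKVCache` and its methods are hypothetical, and the retention scores are passed in as plain floats, whereas the summary says the method learns them through retention gates. Whether scores change after insertion is not stated in the summary, so the sketch treats them as fixed.

```python
import heapq
from typing import Any, List, Optional, Tuple


class BoundedKVCache:
    """Keep at most `budget` tokens, evicting the lowest retention score."""

    def __init__(self, budget: int) -> None:
        if budget <= 0:
            raise ValueError("budget must be positive")
        self.budget = budget
        # Min-heap of (score, position, key, value); positions are unique,
        # so tuple comparison never reaches the key or value.
        self._heap: List[Tuple[float, int, Any, Any]] = []

    def add(self, position: int, key: Any, value: Any, score: float) -> Optional[int]:
        """Insert a token; return the evicted position, or None if none."""
        entry = (score, position, key, value)
        if len(self._heap) < self.budget:
            heapq.heappush(self._heap, entry)
            return None
        # Push the new entry, then pop the smallest score. If the new
        # token scores lowest, it is the one that gets dropped.
        evicted = heapq.heappushpop(self._heap, entry)
        return evicted[1]

    def positions(self) -> List[int]:
        """Positions currently retained, in sequence order."""
        return sorted(pos for _, pos, _, _ in self._heap)
```

With a budget of 2, adding tokens at positions 0, 1, and 2 with scores 0.9, 0.1, and 0.5 evicts position 1 and leaves positions 0 and 2 in the cache. Memory stays constant regardless of sequence length, which is the "memory-bounded" property the summary refers to.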

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling

Researchers provide a mathematical proof that implicit models can achieve greater expressive power through increased test-time computation, explaining how these memory-efficient architectures can match larger explicit networks. The study validates this scaling property across image reconstruction, scientific computing, operations research, and LLM reasoning domains.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

Researchers introduce Arbor, a framework that decomposes large language model decision-making into specialized node-level tasks for critical applications like healthcare triage. The system improves accuracy by 29.4 percentage points while reducing latency by 57.1% and cutting costs by a factor of 14.4 compared to single-prompt approaches.
