#model-efficiency News & Analysis

207 articles tagged with #model-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Computer Environments Elicit General Agentic Intelligence in LLMs

Researchers introduce LLM-in-Sandbox, a minimal computer environment that significantly enhances large language models' capabilities across diverse tasks without additional training. With specialized training, weaker models can also internalize these agent-like behaviors, suggesting that environmental interaction, not just model parameters, drives general intelligence in LLMs.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

Researchers propose the Master Key Hypothesis, suggesting that AI model capabilities can be transferred across different model scales without retraining through linear subspace alignment. The UNLOCK framework demonstrates training-free capability transfer, achieving significant accuracy improvements such as 12.1% gains on mathematical reasoning tasks when transferring from larger to smaller models.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

Researchers developed QED-Nano, a 4B parameter AI model that achieves competitive performance on Olympiad-level mathematical proofs despite being much smaller than proprietary systems. The model uses a three-stage training approach including supervised fine-tuning, reinforcement learning, and reasoning cache expansion to match larger models at a fraction of the inference cost.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Directional Routing in Transformers

Researchers introduce directional routing, a lightweight mechanism for transformer models that adds only 3.9% parameter cost but significantly improves performance. The technique gives attention heads learned suppression directions controlled by a shared router, reducing perplexity by 31-56% and becoming the dominant computational pathway in the model.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.

🧠 Llama

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Researchers have developed a new scaling law for Mixture-of-Experts (MoE) models that optimizes compute allocation between expert and attention layers. The study extends the Chinchilla scaling law by introducing an optimal ratio formula that follows a power-law relationship with total compute and model sparsity.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Reinforcing Numerical Reasoning in LLMs for Tabular Prediction via Structural Priors

Researchers propose PRPO (Permutation Relative Policy Optimization), a reinforcement learning framework that enhances large language models' numerical reasoning capabilities for tabular data prediction. The method achieves performance comparable to supervised baselines while excelling in zero-shot scenarios, with an 8B parameter model outperforming much larger models by up to 53.17%.

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Research analyzes FP4 quantization sensitivity across different layers in large language models using NVFP4 and MXFP4 formats on Qwen2.5 models. The study finds MLP projection layers are most sensitive to quantization, while attention layers show substantial robustness to FP4 precision reduction.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Phi-4-reasoning-vision-15B Technical Report

Researchers released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that combines vision and language capabilities with strong performance in scientific and mathematical reasoning. The model demonstrates that careful architecture design and high-quality data curation can enable smaller models to achieve competitive performance with fewer computational resources.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

SEM-CTRL: Semantically Controlled Decoding

Researchers introduce SEM-CTRL, a new approach that ensures Large Language Models produce syntactically and semantically correct outputs without requiring fine-tuning. The system uses token-level Monte Carlo Tree Search guided by Answer Set Grammars to enforce context-sensitive constraints, allowing smaller pre-trained LLMs to outperform larger models on tasks like reasoning and planning.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression

Researchers propose Router Knowledge Distillation (Router KD) to improve retraining-free compression of Mixture-of-Experts (MoE) models by calibrating routers while keeping expert parameters unchanged. The method addresses router-expert mismatch issues that cause performance degradation in compressed MoE models, showing particularly strong results in fine-grained MoE architectures.

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 3

Retrievit: In-context Retrieval Capabilities of Transformers, State Space Models, and Hybrid Architectures

Research compares Transformers, State Space Models (SSMs), and hybrid architectures for in-context retrieval tasks, finding hybrid models excel at information-dense retrieval while Transformers remain superior for position-based tasks. SSM-based models develop unique locality-aware embeddings that create interpretable positional structures, explaining their specific strengths and limitations.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

RxnNano: Training Compact LLMs for Chemical Reaction and Retrosynthesis Prediction via Hierarchical Curriculum Learning

Researchers developed RxnNano, a compact 0.5B-parameter AI model for chemical reaction prediction that outperforms much larger 7B+ parameter models by 23.5% through novel training techniques focused on chemical understanding rather than scale. The framework uses hierarchical curriculum learning and chemical consistency objectives to improve drug discovery and synthesis planning applications.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing the quadratic cost of self-attention and the ever-growing key-value cache. The method outperforms existing eviction baselines across multiple benchmarks and provides insights into LLM interpretability through learned retention scores.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization

Researchers propose Decoupled Reward Policy Optimization (DRPO), a new framework that reduces computational costs in large reasoning models by 77% while maintaining performance. The method addresses the 'overthinking' problem where AI models generate unnecessarily long reasoning for simple questions, achieving significant efficiency gains over existing approaches.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling

Researchers provide mathematical proof that implicit models can achieve greater expressive power through increased test-time computation, explaining how these memory-efficient architectures can match larger explicit networks. The study validates this scaling property across image reconstruction, scientific computing, operations research, and LLM reasoning domains.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

Researchers introduce Arbor, a framework that decomposes large language model decision-making into specialized node-level tasks for critical applications like healthcare triage. The system improves accuracy by 29.4 percentage points while reducing latency by 57.1% and costs by 14.4x compared to single-prompt approaches.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 9

Sparse Attention Post-Training for Mechanistic Interpretability

Researchers have developed a post-training method that makes transformer attention 99.6% sparser while maintaining performance, reducing attention connectivity to just 0.4% of edges in models up to 7B parameters. The result suggests that most attention connectivity in transformers is redundant and enables more interpretable AI models through simplified circuit structures.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 5

Ruyi2 Technical Report

Ruyi2 is an adaptive large language model that achieves 2-3x speedup over its predecessor while maintaining comparable performance to Qwen3 models. The model introduces a 'Familial Model' approach using 3D parallel training and establishes a 'Train Once, Deploy Many' paradigm for efficient AI deployment.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7

Structure and Redundancy in Large Language Models: A Spectral Study via Random Matrix Theory

Researchers have developed a unified framework using Spectral Geometry and Random Matrix Theory to address reliability and efficiency challenges in large language models. The study introduces EigenTrack for real-time hallucination detection and RMT-KD for model compression while maintaining accuracy.

AI · Bullish · arXiv – CS AI · Jun 25 · 6/10

Lightweight PCGAE-Net: Parallel CrossGate Attention and Bottleneck AutoEncoder for Efficient 5G Channel Prediction

Researchers introduce Lightweight PCGAE-Net, a new neural network architecture that reduces 5G channel prediction model size by 58% while improving accuracy by up to 6.0 dB. The model addresses architectural inefficiencies in existing transformers through parallel attention mechanisms and a bottleneck autoencoder, enabling deployment on base-station hardware with computational constraints.

AI · Bullish · arXiv – CS AI · Jun 25 · 6/10

FDN: Interpretable Spatiotemporal Forecasting with Future Decomposition Networks

Researchers propose Future Decomposition Networks (FDN), a spatiotemporal forecasting model that prioritizes interpretability while matching state-of-the-art accuracy with significantly lower computational costs. The method decomposes predictions into classifiable components and reveals latent patterns, demonstrating effectiveness across hydrologic, traffic, and energy systems.

AI · Bullish · arXiv – CS AI · Jun 25 · 6/10

EPTS: Elastic Post-Training Sparsity for Efficient Large Language Model Compression

Researchers introduce EPTS, a new framework for compressing large language models that enables a single optimized model to perform efficiently across multiple sparsity levels, eliminating the need for separate optimization for each deployment scenario. This approach combines Multi-Sparsity Hierarchy LoRA and a Feature Mixer mechanism to maintain performance while reducing computational requirements.

AI · Neutral · arXiv – CS AI · Jun 23 · 5/10

FiLM-Coordinated Dual-Branch Transformer for Global-Local Dependency Modeling in Language Modeling

Researchers propose a FiLM-coordinated dual-branch Transformer architecture that separates global and local dependency modeling in language models, using feature-wise linear modulation for dynamic cross-branch coordination. The approach demonstrates consistent improvements over single-branch baselines in small-scale language modeling benchmarks while maintaining parameter efficiency through intelligent channel-wise calibration rather than token-level interaction.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

MixedPEFT: Combining Multiple PEFT Methods with Mixed Objectives for Unsupervised Domain Adaptation

Researchers present MixedPEFT, a parameter-efficient fine-tuning method combining multiple adaptation techniques to improve pre-trained language models' performance on new domains without full retraining. The approach achieves state-of-the-art results on domain adaptation benchmarks while using only 7% of trainable parameters, demonstrating that strategic architectural combinations can outperform both existing efficient methods and computationally expensive full fine-tuning.
