#model-efficiency News & Analysis

207 articles tagged with #model-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

MINCE: Shrinking LLM Evaluation Datasets via Few-Model Monte Carlo Calibration

Researchers introduce MINCE, a method that cuts the computational cost of evaluating large language models by shrinking benchmark datasets. Using Monte Carlo simulation with a small number of calibration models, MINCE reduces dataset size by 54-89% while keeping accuracy within acceptable drift thresholds, enabling 2.7-8.1x faster GPU evaluations.

AI · Bullish · arXiv – CS AI · Jun 12 · 6/10

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Pythagoras-Prover introduces a family of efficient Lean theorem provers that achieve state-of-the-art performance with significantly fewer parameters than existing models, using training techniques that include curriculum learning and augmented data generation. The 4B-parameter model outperforms DeepSeek-Prover-V2-671B while using roughly 167x fewer parameters, and the 32B model sets new benchmarks on formal mathematics tasks.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Researchers propose Manifold Power Iteration (MPI), a novel router redesign method for Mixture-of-Experts models that aligns router rows with principal singular directions of associated experts. The approach uses a "Power-then-Retract" paradigm and demonstrates improved MoE model effectiveness across scales from 1B to 11B parameters.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Researchers propose Reroute, a training-free method that improves vision-language model efficiency through recoverable token routing instead of permanent token removal. The approach dynamically reroutes less important visual tokens through decoder layers rather than discarding them, improving performance on grounding tasks while maintaining computational efficiency.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Exploration Structure in LLM Agents for Multi-File Change Localization

Researchers compare linear versus non-linear exploration strategies for LLM agents tasked with localizing files requiring changes to resolve software issues. Domain-scoped parallel agent spawning with smaller models achieves competitive performance against larger models while reducing costs, revealing that repository exploration structure significantly impacts software engineering task efficiency.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

Researchers demonstrate that FSQ (Finite Scalar Quantization) tokenization optimally structures latent space for continuous diffusion models applied to categorical data, offering a non-autoregressive alternative to large language models. Text-to-speech experiments validate FSQ's superiority, achieving better performance than LLM-based approaches while requiring smaller model sizes and faster inference.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

Researchers have developed a causal analysis framework to understand how attention mechanisms work in SAM Audio, a flow-matching transformer for audio separation. The study reveals a dual-pathway conditioning system and proposes Layer-Selective Attention Caching (LSAC), a training-free optimization technique that reduces computational overhead by ~25% while maintaining audio quality.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Stop Early, Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs

Researchers propose lightweight token-level probes that monitor LLM safety directly within model hidden states during generation, eliminating the computational overhead of separate moderation models. This streaming approach enables real-time intervention before unsafe content completes generation, reducing inference costs by orders of magnitude while maintaining safety standards.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Weak-Driven Learning: How Weak Agents make Strong Agents Stronger

Researchers propose WMSS, a post-training optimization method that leverages weak model checkpoints to improve strong language models beyond conventional saturation points. The approach identifies and addresses learning gaps through entropy dynamics, achieving performance gains in mathematical reasoning and code generation without additional inference costs.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Researchers propose Dual-Path Vision Token Routing (DPVR), a framework that optimizes multimodal large language models by routing vision tokens away from deep transformer layers where they saturate early, instead fusing visual and textual information only in the final layer. The approach reduces computational overhead by 3% while maintaining competitive performance, challenging the assumption that vision tokens must traverse all deep language-model layers.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

LEAF (Low-rank Exploration with Adaptive Forking) introduces a novel tree-based reinforcement learning method for training speech-aware large language models that improves credit assignment by identifying shared response prefixes and assigning rewards at the span level rather than uniformly across tokens. The approach achieves superior performance compared to existing GRPO-style methods without requiring additional computational overhead, enabling smaller models to match or exceed larger baselines.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

Researchers introduce MM-Matryoshka, a training framework that enables visual document retrievers to dynamically adjust computational and storage costs without requiring multiple models. The approach allows Vision-Language Models to optimize along two dimensions—vector width and encoder depth—while maintaining retrieval quality, addressing a key efficiency challenge in multimodal AI systems.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

Researchers propose Semantic Cache Distillation (SCD), a framework that reduces communication overhead in large language model inference by replacing raw Key-Value cache transmission with compact semantic codes. The method achieves up to 2.65x speedup in time-to-first-token while keeping generation quality within 5% of baseline performance, addressing a critical bottleneck in disaggregated LLM serving architectures.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

Researchers introduce RLSR, a reinforcement learning framework that trains smaller language models to rewrite source text for improved machine translation without manual prompt tuning. The approach achieves competitive performance with larger models across six MT systems and 16 language pairs, demonstrating that RL-optimized 4B parameter models can match capabilities of 235B parameter prompt-based systems.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction

Researchers present an anatomy-aware benchmark demonstrating that in low-data medical imaging scenarios, effective representation of clinically meaningful cardiac structures outperforms model complexity for pathology prediction. The study uses cardiac MRI segmentation data to show that simpler classifiers with better anatomical feature engineering achieve superior results compared to more complex models with generic representations.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

Researchers have characterized how modern reasoning models achieve strong zero-shot performance on multi-label selection tasks by operating in two distinct phases: broad candidate shortlisting followed by fine-grained reasoning. This mechanistic understanding enables a more effective distillation strategy that outperforms standard knowledge transfer approaches.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT

Researchers studying lung CT imaging found that 2.5D CNNs provide the best balance of performance, stability, and computational efficiency for cancer screening compared to full 3D models or pure 2D approaches. The study challenges the assumption that 3D models are universally superior for volumetric medical imaging, revealing that 3D CNNs suffer from threshold instability while transformers produce unreliable degenerate predictions.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

Researchers introduce SETA, a machine learning framework that addresses catastrophic forgetting in large language models through sparse expert decomposition. The method separates task-specific and shared knowledge into distinct expert modules, enabling models to retain previous capabilities while learning new ones—a fundamental challenge in continual AI development.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

GuardNet, an ensemble-based detection system built from shallow neural networks, demonstrates competitive performance in identifying prompt injection and jailbreak attacks on large language models while operating at 50 ms latency, which is suitable for production deployment. Although larger LLMs outperform it on some benchmarks, GuardNet achieves strong results (0.747 AUROC) with significantly lower computational overhead, challenging the assumption that adversarial robustness requires massive model scale.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

Researchers introduce a critic-guided multi-agent framework that improves LLM reasoning reliability for mathematical problem-solving by combining heterogeneous AI agents with adaptive feedback loops. The approach achieves a 13% accuracy improvement on benchmarks while demonstrating that smaller models can match larger ones when equipped with critique mechanisms.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Differentiable Efficient Operator Search

Researchers propose Efficient Operator Search, a differentiable framework that automates the design of token-reduction operators for multimodal foundation models. The approach unifies previously distinct manual techniques like pruning and merging into a shared search space, discovering hybrid operators that achieve better accuracy-efficiency trade-offs than hand-designed baselines.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

UNIVID: Unified Vision-Language Model for Video Moderation

Researchers introduce UNIVID, a unified vision-language model designed for large-scale video moderation that generates interpretable policy-aware captions instead of opaque classification outputs. The system reduces violation detection errors by 42.7% and false positives by 37.0% while consolidating over 1,000 specialized models into a single backbone, demonstrating practical AI efficiency gains in content moderation infrastructure.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

Researchers introduce Alternating Token-Weighted Unlearning (ATWU), a new method for removing specific knowledge from language models while maintaining their general capabilities. The approach identifies which tokens are most relevant for forgetting by measuring conflict with model retention objectives, achieving state-of-the-art results without requiring external supervision or auxiliary models.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Unlocking Feature Learning in Gated Delta Networks at Scale

Researchers have developed scaling rules for Gated Delta Networks (GDNs) by extending the Maximal Update Parametrization (μP) framework, enabling stable hyperparameter transfer across model sizes. This advancement addresses a critical bottleneck in training efficient sub-quadratic language models, allowing learning rates to transfer zero-shot between different model widths without retuning.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

POLARIS: Guiding Small Models to Write Long Stories

Researchers present POLARIS, a training method that enables smaller language models (9B parameters) to generate long-form creative stories comparable to much larger models. The approach combines LLM-based reward signals with human reference injection, demonstrating that efficient fine-tuning can close the gap between small and frontier models on complex creative tasks.
