#model-acceleration News & Analysis

12 articles tagged with #model-acceleration. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

Researchers introduce JetViT, a hybrid Vision Transformer architecture that maintains the accuracy of state-of-the-art models while delivering up to 1.79x higher throughput and 44.81% lower latency on high-resolution images. The method uses post-training attention search to convert full-attention models into efficient hybrid variants by replacing redundant attention blocks.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

Researchers propose VLA-Pruner, a novel token pruning method that accelerates Vision-Language-Action models for embodied AI by addressing the mismatch between semantic and action-critical visual processing. The method achieves up to 1.99x speedup while maintaining manipulation performance by considering both semantic context and temporal action relevance, unlike existing VLM pruning approaches.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Parallel Prefix Verification for Speculative Generation

Researchers introduce PARSE, a speculative generation framework that accelerates large language model inference by verifying multiple prefix candidates in parallel rather than sequentially. The method achieves 1.25x to 4.3x throughput improvements over baseline models and up to 4.5x gains when combined with existing techniques like EAGLE-3, with minimal accuracy loss.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

SpecBranch introduces a novel speculative decoding framework that leverages branch parallelism to accelerate large language model inference, achieving 1.8x to 4.5x speedups over standard auto-regressive decoding. The technique addresses serialization bottlenecks in existing speculative decoding methods by implementing parallel drafting branches with adaptive token lengths and rollback-aware orchestration.
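For readers new to the baseline that frameworks like this build on, here is a minimal sketch of plain greedy speculative decoding with rollback. It is not SpecBranch's branch-parallel algorithm: the function names, the `draft_len` parameter, and the toy "models" (plain functions that map a token prefix to the next token) are all assumptions made for illustration. A real system would verify the whole draft in one batched pass of the target model rather than in a Python loop.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Standard auto-regressive decoding: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, draft_len: int = 4) -> List[int]:
    """Draft-then-verify decoding; output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. Draft: the cheap model proposes up to draft_len tokens.
        proposal: List[int] = []
        for _ in range(min(draft_len, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. Verify: compare each drafted token with the target's own choice.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != tok:
                # Rollback: keep the agreed prefix, drop the rest of the draft,
                # and substitute the target's token at the first mismatch.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every drafted token was accepted.
            tokens.extend(proposal)
    return tokens
```

Because each round keeps only tokens the target model agrees with, the result is identical to standard decoding; any speedup comes from how many drafted tokens are accepted per verification pass.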

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

When Drafts Evolve: Speculative Decoding Meets Online Learning

Researchers introduce OnlineSpec, a framework that uses online learning to continuously improve draft models in speculative decoding for large language model inference. The approach uses verification feedback to update draft models dynamically, achieving speedups of up to 24% across seven benchmarks and three foundation models.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization

Researchers introduce Dynamic Pruning Policy Optimization (DPPO), a new framework that accelerates AI language model training by 2.37x while maintaining accuracy. The method addresses computational bottlenecks in Group Relative Policy Optimization through unbiased gradient estimation and improved data efficiency.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

AdaMerge introduces a training-free method to accelerate Vision Transformers by improving token merging through salience-aware mechanisms and adaptive layer-wise compression. The approach outperforms existing token reduction methods across all computational efficiency benchmarks, maintaining superior accuracy-to-FLOPs ratios on ImageNet-1k evaluations.
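As a rough illustration of what token merging means in general, the sketch below repeatedly averages the two most similar token embeddings until a target count is reached. This is a generic similarity-based merge, not AdaMerge's salience-aware or layer-adaptive rule, which the summary does not spell out; the function names and the `keep` parameter are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens: List[Vector], keep: int) -> List[Vector]:
    """Greedily merge the most similar pair until `keep` tokens remain."""
    merged = [list(t) for t in tokens]
    sizes = [1] * len(merged)  # how many original tokens each entry represents
    while len(merged) > max(keep, 1):
        # Find the most similar pair (brute force; fine for a toy example).
        best_sim, best_i, best_j = -2.0, 0, 1
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                sim = cosine(merged[i], merged[j])
                if sim > best_sim:
                    best_sim, best_i, best_j = sim, i, j
        # Size-weighted average so earlier merges are not under-counted.
        total = sizes[best_i] + sizes[best_j]
        merged[best_i] = [
            (a * sizes[best_i] + b * sizes[best_j]) / total
            for a, b in zip(merged[best_i], merged[best_j])
        ]
        sizes[best_i] = total
        del merged[best_j]
        del sizes[best_j]
    return merged
```

Fewer tokens means less work in every later attention layer, which is where training-free methods of this kind get their FLOPs savings.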

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

Researchers propose NExt, a nonlinear extrapolation framework that accelerates reinforcement learning with verifiable rewards (RLVR) for large language models by modeling low-rank parameter trajectories. The method reduces computational overhead by approximately 37.5% while remaining compatible with various RLVR algorithms, addressing a key bottleneck in scaling LLM training.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

MeanCache introduces a training-free caching framework that accelerates Flow Matching inference by using average velocities instead of instantaneous ones. The framework achieves 3.59x to 4.56x acceleration on models such as FLUX.1, Qwen-Image, and HunyuanVideo while maintaining better generation quality than existing caching methods.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

Task-Centric Acceleration of Small-Language Models

Researchers propose TASC (Task-Adaptive Sequence Compression), a framework for accelerating small language models through two methods: TASC-ft for fine-tuning with expanded vocabularies and TASC-spec for training-free speculative decoding. The methods demonstrate improved inference efficiency while maintaining task performance across low output-variability generation tasks.

AI · Bullish · Hugging Face Blog · Sep 29 · 5/10

Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models

The article describes accelerating a Qwen3-8B agent on Intel Core Ultra processors using depth-pruned draft models, with the aim of improving inference speed and efficiency on consumer-grade Intel hardware.

AI · Bullish · Hugging Face Blog · Apr 3 · 4/10

Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon

The article covers speeding up SetFit inference using Hugging Face's Optimum Intel library on Intel Xeon processors, targeting more efficient model deployment on enterprise hardware.