AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce JetViT, a hybrid Vision Transformer architecture that matches the accuracy of state-of-the-art models while delivering up to 1.79x higher throughput and 44.81% lower latency on high-resolution images. The method uses a post-training attention search to convert full-attention models into efficient hybrid variants by replacing redundant attention blocks.
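For readers comparing the two kinds of figures in this summary, a percentage latency reduction can be restated as a speedup factor. The sketch below is plain arithmetic and is not taken from the paper. The helper name `speedup_from_latency_reduction` is hypothetical. The throughput and latency numbers may also come from different settings, so the result need not equal the 1.79x throughput figure.

```python
def speedup_from_latency_reduction(reduction_percent: float) -> float:
    """Convert 'latency is X% lower' into an equivalent speedup factor.

    If the new latency is (1 - X/100) times the old latency, the same work
    finishes 1 / (1 - X/100) times faster.
    """
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


# The 44.81% figure quoted in the summary, restated as a speedup factor.
jetvit_speedup = speedup_from_latency_reduction(44.81)
print(f"{jetvit_speedup:.2f}x")
```

Under this conversion, a 44.81% latency reduction corresponds to a speedup of roughly 1.81x.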
🏢 Nvidia
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers propose VLA-Pruner, a novel token pruning method that accelerates Vision-Language-Action models for embodied AI by addressing the mismatch between semantic and action-critical visual processing. The method achieves up to 1.99x speedup while maintaining manipulation performance by considering both semantic context and temporal action relevance, unlike existing VLM pruning approaches.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce PARSE, a speculative generation framework that accelerates large language model inference by verifying multiple prefix candidates in parallel rather than sequentially. The method achieves 1.25x to 4.3x throughput improvements over baseline models and up to 4.5x gains when combined with existing techniques like EAGLE-3, with minimal accuracy loss.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠SpecBranch introduces a novel speculative decoding framework that leverages branch parallelism to accelerate large language model inference, achieving 1.8x to 4.5x speedups over standard auto-regressive decoding. The technique addresses serialization bottlenecks in existing speculative decoding methods by implementing parallel drafting branches with adaptive token lengths and rollback-aware orchestration.
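As background for the speculative decoding items in this feed, here is a minimal sketch of the basic greedy draft-then-verify loop. It is not SpecBranch, PARSE, or any other method summarized here. The two "models" (`target_next`, `draft_next`) are hypothetical toy functions, and verification is simulated with per-position calls, whereas a real system would score all drafted positions in one batched forward pass of the target model.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'large' model: deterministic next token for a context."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> int:
    """Hypothetical cheap model: agrees with the target except when the
    context length is a multiple of 4."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % VOCAB


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(out) < goal:
        rounds += 1
        # 1) Draft k tokens autoregressively with the cheap model.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) Verify: keep the longest prefix the target agrees with.
        for i, tok in enumerate(draft):
            expected = target_next(out + draft[:i])
            if tok != expected:
                out.extend(draft[:i])
                out.append(expected)  # target's correction replaces the miss
                break
        else:
            out.extend(draft)
            out.append(target_next(out))  # all accepted: one bonus token
    return out[:goal], rounds


tokens, rounds = speculative_decode([1, 2, 3], n_new=20)
print(tokens == greedy_decode([1, 2, 3], 20), rounds)
```

Every token kept is one the target model would have produced greedily, so the output matches plain greedy decoding. The gain comes from needing fewer verification rounds than generated tokens when the draft model agrees often.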
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers introduce OnlineSpec, a framework that uses online learning to continuously improve draft models in speculative decoding for large language model inference acceleration. The approach leverages verification feedback to evolve draft models dynamically, achieving up to 24% speedup improvements across seven benchmarks and three foundation models.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers introduce Dynamic Pruning Policy Optimization (DPPO), a new framework that accelerates AI language model training by 2.37x while maintaining accuracy. The method addresses computational bottlenecks in Group Relative Policy Optimization through unbiased gradient estimation and improved data efficiency.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠AdaMerge introduces a training-free method to accelerate Vision Transformers by improving token merging through salience-aware mechanisms and adaptive layer-wise compression. The approach outperforms existing token reduction methods across all computational efficiency benchmarks, maintaining superior accuracy-to-FLOPs ratios on ImageNet-1k evaluations.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers propose NExt, a nonlinear extrapolation framework that accelerates reinforcement learning with verifiable rewards (RLVR) for large language models by modeling low-rank parameter trajectories. The method reduces computational overhead by approximately 37.5% while remaining compatible with various RLVR algorithms, addressing a key bottleneck in scaling LLM training.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠MeanCache introduces a training-free caching framework that accelerates Flow Matching inference by using average velocities instead of instantaneous ones. The framework achieves 3.59x to 4.56x acceleration on major AI models like FLUX.1, Qwen-Image, and HunyuanVideo while maintaining superior generation quality compared to existing caching methods.
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10
🧠Researchers propose TASC (Task-Adaptive Sequence Compression), a framework for accelerating small language models through two methods: TASC-ft for fine-tuning with expanded vocabularies and TASC-spec for training-free speculative decoding. The methods demonstrate improved inference efficiency while maintaining task performance across low output-variability generation tasks.
AI · Bullish · Hugging Face Blog · Sep 29 · 5/10
🧠The article discusses optimizing Qwen3-8B AI agent performance on Intel Core Ultra processors using depth-pruned draft models. This technical advancement focuses on improving AI model inference speed and efficiency on consumer-grade Intel hardware.
AI · Bullish · Hugging Face Blog · Apr 3 · 4/10
🧠The article appears to discuss optimizing SetFit inference performance using Hugging Face's Optimum Intel library on Intel Xeon processors. This represents a technical advancement in AI model optimization and deployment efficiency on enterprise hardware.