AIBullisharXiv – CS AI · May 277/10
🧠Researchers address a critical failure mode in quantized Vision-Language Models by proposing LRA-EE, a technique that uses early exit strategies to bypass noise-saturated layers in INT8 CLIP. The method improves zero-shot classification accuracy by 2.44 percentage points while reducing computational load by 13.4%, demonstrating that selective layer utilization can recover performance lost to quantization-induced representation collapse.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers apply game-theoretic free energy principles to analyze attention head interactions in large language models, discovering that heads exhibit higher-order redundancy. Their framework enables principled pruning of low-contribution heads, achieving 18% FLOP reduction and 22% throughput improvement in GPT2 with minimal performance degradation.
🏢 Perplexity🧠 Llama
AIBullisharXiv – CS AI · May 127/10
🧠Researchers introduce HY-Himmel, a hierarchical video-language framework that efficiently processes long videos by separating semantic and motion encoding tasks. The system uses sparse keyframes for visual grounding while a lightweight adapter extracts motion information from compressed video data, achieving better performance than dense-frame baselines while reducing token usage by 3.6x.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers propose Kaczmarz Linear Attention (KLA), an improved algorithm for long-context language modeling that replaces empirically-learned coefficients with mathematically-derived key-norm-normalized step sizes. KLA outperforms existing linear attention baselines like Gated DeltaNet while maintaining computational efficiency and enabling stable processing of up to 65K token contexts.
🏢 Perplexity
AIBullisharXiv – CS AI · Apr 147/10
🧠A new study demonstrates that quantization significantly outperforms rank reduction for compressing KV caches in transformer inference, achieving 4-364 PPL improvements across multiple models. The research shows that preserving all dimensions while reducing precision is structurally superior to discarding dimensions, with INT4 quantization matching FP16 accuracy while enabling 75% total KV reduction.
AIBullisharXiv – CS AI · Mar 97/10
🧠Researchers propose Stem, a new sparse attention mechanism for Large Language Models that reduces computational complexity while maintaining accuracy. The method uses position-dependent token selection and output-aware metrics to optimize information flow in causal attention, achieving faster pre-filling with better performance.
AIBullisharXiv – CS AI · 5d ago6/10
🧠Researchers introduce Parallax, a scalable Local Linear Attention mechanism that improves upon traditional softmax attention in large language models by learning query-like projectors to probe key-value covariance. Pretraining experiments at 0.6B and 1.7B parameters demonstrate consistent perplexity improvements and downstream benchmark gains, with performance matching or exceeding FlashAttention while revealing novel architecture-optimizer codesign benefits with the Muon optimizer.
🏢 Perplexity
AIBullisharXiv – CS AI · 6d ago6/10
🧠Researchers propose entropy-aware masking for masked language modeling, which selectively masks tokens based on prediction uncertainty rather than random selection. The approach achieves 5% improvement in GLUE scores and performs best when combined with knowledge distillation, offering a more efficient pretraining strategy for encoder-based language models.
AINeutralarXiv – CS AI · 6d ago6/10
🧠AdaMerge introduces a training-free method to accelerate Vision Transformers by improving token merging through salience-aware mechanisms and adaptive layer-wise compression. The approach outperforms existing token reduction methods across all computational efficiency benchmarks, maintaining superior accuracy-to-FLOPs ratios on ImageNet-1k evaluations.
AIBullisharXiv – CS AI · May 276/10
🧠Researchers present a new quantization method for large video diffusion models that achieves 59.3% memory reduction while maintaining near-baseline quality. The technique addresses challenges in compressing Wan2.2-I2V's mixture-of-experts architecture by using timestep-aware and expert-specific calibration strategies.
AINeutralarXiv – CS AI · Apr 156/10
🧠Researchers introduce MODIX, a training-free framework that dynamically optimizes how Vision-Language Models allocate attention across multimodal inputs by adjusting positional encoding based on information density rather than uniform token assignment. The approach improves reasoning performance without modifying model parameters, suggesting positional encoding should be treated as an adaptive resource in multimodal transformer architectures.
AIBullisharXiv – CS AI · Apr 136/10
🧠Researchers introduce WAND, a framework that reduces computational and memory costs of autoregressive text-to-speech models by replacing full self-attention with windowed attention combined with knowledge distillation. The approach achieves up to 66.2% KV cache memory reduction while maintaining speech quality, addressing a critical scalability bottleneck in modern AR-TTS systems.
AIBullisharXiv – CS AI · Mar 36/106
🧠Researchers developed a new token reduction method for hybrid vision-language models that process long videos, achieving 3.8-4.2x speedup while retaining only 25% of visual tokens. The approach uses progressive reduction and unified scoring for both attention and Mamba blocks, maintaining near-baseline accuracy on long-context video benchmarks.
$NEAR