#transformer-efficiency News & Analysis

11 articles tagged with #transformer-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

SPOTR: Spatio-temporal Pooling One-Token Reconstruction for Universal Physiological Signal Self-supervised Learning

SPOTR is a self-supervised learning framework for physiological signals that uses a single-token bottleneck to compress and reconstruct EEG, ECG, PPG, and iEEG signals. The model improves performance across 20 datasets while cutting latency by 78% and GPU memory use by 52% compared to existing foundation models.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

WhiFlash introduces a novel speculative decoding method that combines autoregressive and diffusion-based drafting models through token-level routing, achieving up to 69.6% throughput improvements over existing approaches. The system uses lightweight controllers to dynamically switch between drafting paradigms based on per-token conditions, addressing a key bottleneck in LLM inference efficiency.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

Researchers introduce STAR-KV, an adaptive compression framework that reduces KV cache memory requirements in large language models by up to 75% through low-rank projections, with soft thresholding used to select the rank adaptively. The technique achieves up to 20x compression when combined with quantization and delivers significant speedups in attention computation, addressing a critical bottleneck in LLM inference efficiency.
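
As a rough illustration of how soft thresholding can control rank, here is a minimal sketch in plain Python. It is a generic example, not STAR-KV's actual procedure: the function names, the example singular values, and the threshold `tau` are all assumptions made for illustration. Soft thresholding shrinks every singular value by `tau` and zeroes any that fall below it, so the number of survivors gives the retained rank.

```python
def soft_threshold(singular_values, tau):
    """Shrink each singular value by tau, clamping at zero."""
    return [max(s - tau, 0.0) for s in singular_values]


def retained_rank(singular_values, tau):
    """Count the singular values that survive soft thresholding."""
    return sum(1 for s in soft_threshold(singular_values, tau) if s > 0.0)


def cache_fraction(rank, head_dim):
    """Fraction of per-token cache entries kept when a rank-sized latent is
    stored instead of a full head_dim-sized vector (projection weights ignored)."""
    return rank / head_dim


# Hypothetical spectrum for one attention head with head_dim = 8.
spectrum = [5.0, 2.0, 1.5, 0.5, 0.25, 0.25, 0.125, 0.125]
rank = retained_rank(spectrum, tau=1.0)
print(rank, cache_fraction(rank, head_dim=8))
```

With this made-up spectrum, three singular values exceed the threshold, so the sketch keeps rank 3 and stores 37.5% of the original per-token entries. A larger `tau` would lower the rank further at the cost of approximation error.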

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Pretraining large language models with MXFP4

Researchers identify weight gradient (Wgrad) quantization as the primary cause of instability in FP4 training of large language models, while forward and activation gradient quantization prove relatively benign. Using deterministic Hadamard rotations on AMD MI355X GPUs, they demonstrate that structured micro-scaling errors—not insufficient randomness—drive training divergence, offering insights for efficient LLM pretraining.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

Researchers introduce SPEED, a novel inference optimization technique for long-context language models that reduces computational cost by materializing key-value cache states only in lower layers during the prefill phase while maintaining full-depth processing during decoding. Testing on Llama-3.1-8B demonstrates 33% improvement in time-to-first-token, 22% improvement in tokens-per-second, and 25% reduction in KV memory with minimal quality degradation, suggesting that prompt tokens don't require persistent full-depth caching.

🧠 Llama
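
For a sense of scale, the sketch below works through KV cache arithmetic for Llama-3.1-8B (32 layers, 8 key-value heads, head dimension 128, bf16 values). The 24-layer cutoff and the 32,768-token prompt are hypothetical values chosen for illustration, not SPEED's actual settings, and the calculation ignores tokens generated during decoding.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of keys and values (factor 2) stored for one token across the cached layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Llama-3.1-8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128; bf16 = 2 bytes.
full = kv_bytes_per_token(32, 8, 128, 2)

# Hypothetical: prompt tokens cached only in the lowest 24 layers (an assumed cutoff).
shallow = kv_bytes_per_token(24, 8, 128, 2)

prompt_len = 32_768  # hypothetical prompt length
print(full * prompt_len / 2**30, "GiB with full-depth caching")
print(shallow * prompt_len / 2**30, "GiB with shallow prompt caching")
```

Full-depth caching costs 128 KiB per token under these assumptions, so cache size grows quickly with prompt length, and skipping upper layers for prompt tokens reduces it in proportion to the layers skipped.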

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Q-Zoom is a framework that improves the efficiency of multimodal large language models by adapting how high-resolution visual inputs are processed to the query. Using this query-aware perception, the system achieves 2.5-4.4x faster inference on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

Researchers present SWARR, a two-stage method combining supervised fine-tuning and reinforcement learning to make sliding-window attention (SWA) competitive with standard self-attention for mathematical reasoning tasks. By using RL to adapt model trajectories to SWA's architectural constraints, the approach recovers much of the accuracy lost during conversion while maintaining linear-complexity efficiency benefits.
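
To make the architectural constraint concrete, here is a minimal sketch of a causal sliding-window attention mask in plain Python. It illustrates sliding-window attention in general, not SWARR's training method; the function names and the sizes used are assumptions. Each query position may attend only to itself and the `window - 1` positions before it, so the number of attended pairs grows linearly with sequence length rather than quadratically.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j:
    j is not in the future and lies within the last `window` positions."""
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask):
    """Total number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Setting `window` equal to `seq_len` recovers the ordinary causal mask, which makes it easy to compare the two patterns on small inputs.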

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

Researchers identify a 'structural attention tax' where knowledge graph formats capture 2-3x more model attention than semantically equivalent natural language, degrading in-context learning performance by up to 42% regardless of content relevance. The study formalizes a decomposition of attention into semantic and structural components, revealing that retrieval format can distort LLM outputs independently of knowledge quality.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Q-Delta: Beyond Key-Value Associative State Evolution

Q-Delta is an approach to linear attention for sequence modeling that integrates query-conditioned state evolution, moving beyond the traditional key-value associative paradigm. The method combines linear-time inference with improved performance on language modeling and long-context retrieval tasks, supported by a hardware-optimized implementation.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

Researchers introduce STaR-KV, a training-free compression framework that reduces key-value cache memory consumption in vision-language GUI agents by up to 40% while maintaining accuracy. The method addresses a critical bottleneck where models like UI-TARS-1.5-7B consume prohibitive GPU memory during multi-step interactions, enabling more practical deployment on standard accelerators.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

Researchers introduce Layerwise Learning Rate (LLR), an adaptive training technique that assigns different learning rates to individual Transformer layers based on Heavy-Tailed Self-Regularization theory. Testing across multiple LLM architectures and scales demonstrates up to 1.5x training speedup and improved generalization, with zero-shot accuracy improvements of 2-3% on billion-parameter models.