#transformer-optimization News & Analysis

26 articles tagged with #transformer-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory

Researchers introduce ATMA, a hybrid attention architecture that targets the long-context problem in language models by combining polar attention with gated-delta compression memory. The system maintains 90%+ retrieval accuracy at 64K tokens (32x its training length) while perplexity improves monotonically, addressing a fundamental limitation of softmax attention, which degrades on longer sequences.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

HyperQuant is a new post-training quantization pipeline that compresses large language and diffusion models to 3-5 bits per weight while maintaining near-lossless quality, outperforming existing methods like HIGGS and TurboQuant. The technique combines Hadamard transforms, optimal lattice quantization, and entropy coding to achieve 3.9x compression on model weights and 3.79x on KV cache, enabling more efficient deployment of large AI models.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek released V4, a new series of efficient mixture-of-experts language models supporting one-million-token context windows. The models achieve significant computational improvements over predecessors while maintaining state-of-the-art performance, with V4-Pro requiring only 27% of the inference compute of DeepSeek-V3.2.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Researchers have mapped how Audio-Visual Large Language Models (AVLLMs) process and integrate audio and visual information internally, revealing distinct information flow patterns depending on input configuration. The study demonstrates that multimodal tokens can be pruned after information transfer with minimal performance impact, enabling more efficient inference across different model scales.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

Researchers present a Mathematics of Arrays framework that optimizes transformer attention kernels to approach the theoretical minimum memory requirement, reducing data movement from O(n²) to O(n). The work provides formal proofs of memory optimality and projects speedups of 2-100x, addressing a critical computational bottleneck in AI systems.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Researchers introduce FlashMemory-DeepSeek-V4, a novel inference system using Lookahead Sparse Attention to reduce GPU memory requirements for long-context LLM serving by 86.5% while maintaining accuracy. The approach uses a neural memory indexer to selectively preserve only critical KV cache chunks, enabling efficient processing of ultra-long contexts up to 500K tokens.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Researchers systematically evaluate whether transformer models require three separate QKV projections, discovering that shared projection variants perform comparably while reducing computational overhead. The Q-K=V configuration achieves 50% KV cache reduction with minimal performance loss and combines effectively with existing optimization techniques like MQA to enable practical on-device deployment.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 27 · 7/10

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

Researchers address a critical failure mode in quantized Vision-Language Models by proposing LRA-EE, a technique that uses early exit strategies to bypass noise-saturated layers in INT8 CLIP. The method improves zero-shot classification accuracy by 2.44 percentage points while reducing computational load by 13.4%, demonstrating that selective layer utilization can recover performance lost to quantization-induced representation collapse.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models

Researchers apply game-theoretic free energy principles to analyze attention head interactions in large language models, discovering that heads exhibit higher-order redundancy. Their framework enables principled pruning of low-contribution heads, achieving 18% FLOP reduction and 22% throughput improvement in GPT2 with minimal performance degradation.

🏢 Perplexity · 🧠 Llama

AI · Bullish · arXiv – CS AI · May 12 · 7/10

HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

Researchers introduce HY-Himmel, a hierarchical video-language framework that efficiently processes long videos by separating semantic and motion encoding tasks. The system uses sparse keyframes for visual grounding while a lightweight adapter extracts motion information from compressed video data, achieving better performance than dense-frame baselines while reducing token usage by 3.6x.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Kaczmarz Linear Attention

Researchers propose Kaczmarz Linear Attention (KLA), an improved algorithm for long-context language modeling that replaces empirically-learned coefficients with mathematically-derived key-norm-normalized step sizes. KLA outperforms existing linear attention baselines like Gated DeltaNet while maintaining computational efficiency and enabling stable processing of up to 65K token contexts.

🏢 Perplexity
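
To make the "key-norm-normalized step size" in the Kaczmarz Linear Attention summary concrete, here is a minimal sketch of the textbook Kaczmarz projection step applied to a linear-attention memory matrix. It is an illustration under stated assumptions, not the paper's exact update rule, which may include gating or decay terms. The names `kaczmarz_update`, `read`, and the toy dimensions are hypothetical.

```python
def kaczmarz_update(state, key, value):
    """One textbook Kaczmarz projection step on a memory matrix.

    state: list of rows, shape (d_v, d_k); key: length d_k; value: length d_v.
    Returns a new state S' with S' @ key == value (up to rounding).
    Assumption: this is the classical step, not necessarily the paper's rule.
    """
    key_norm_sq = sum(k * k for k in key)
    if key_norm_sq == 0.0:
        return [row[:] for row in state]  # a zero key gives nothing to project onto
    new_state = []
    for row, target in zip(state, value):
        predicted = sum(s * k for s, k in zip(row, key))
        # Step size is normalized by the squared key norm rather than learned.
        step = (target - predicted) / key_norm_sq
        new_state.append([s + step * k for s, k in zip(row, key)])
    return new_state


def read(state, query):
    """Linear-attention readout: state @ query."""
    return [sum(s * q for s, q in zip(row, query)) for row in state]


state = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
key, value = [1.0, 2.0, 2.0], [3.0, -1.0]
state = kaczmarz_update(state, key, value)
recalled = read(state, key)  # reading back with the same key recovers the value
```

Because the step divides by the squared key norm, a single update stores the pair exactly regardless of the key's scale, which is one plausible reason a derived step size can be more stable than a learned coefficient.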

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Quantization Dominates Rank Reduction for KV-Cache Compression

A new study demonstrates that quantization significantly outperforms rank reduction for compressing KV caches in transformer inference, achieving 4-364 PPL improvements across multiple models. The research shows that preserving all dimensions while reducing precision is structurally superior to discarding dimensions, with INT4 quantization matching FP16 accuracy while enabling 75% total KV reduction.
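
The contrast between reducing precision and discarding dimensions can be sketched in a few lines. The example below assumes simple symmetric per-row INT4 quantization and assumes the 75% figure reflects bit width alone (4 bits versus 16), ignoring the small overhead of storing scales. The function names and sample values are hypothetical and do not come from the study.

```python
def quantize_int4(row):
    """Symmetric per-row quantization to signed 4-bit codes in [-7, 7]."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(x / scale))) for x in row]
    return codes, scale


def dequantize_int4(codes, scale):
    """Map 4-bit codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def kv_bytes(num_values, bits_per_value):
    """Storage for the cached values alone, ignoring per-row scale overhead."""
    return num_values * bits_per_value / 8


row = [0.5, -1.0, 0.25, 2.0]
codes, scale = quantize_int4(row)
restored = dequantize_int4(codes, scale)

# Every dimension survives; only the precision of each entry is reduced.
max_error = max(abs(a - b) for a, b in zip(row, restored))

fp16 = kv_bytes(1_000_000, 16)
int4 = kv_bytes(1_000_000, 4)
reduction = 1 - int4 / fp16  # fraction of KV memory saved at 4 bits vs 16 bits
```

Rank reduction would instead drop entries of `row` entirely, so the error on a discarded dimension is the full magnitude of that entry rather than at most half a quantization step.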

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

Stem: Rethinking Causal Information Flow in Sparse Attention

Researchers propose Stem, a new sparse attention mechanism for Large Language Models that reduces computational complexity while maintaining accuracy. The method uses position-dependent token selection and output-aware metrics to optimize information flow in causal attention, achieving faster pre-filling with better performance.

AI · Bullish · arXiv – CS AI · Jun 25 · 6/10

Lightweight PCGAE-Net: Parallel CrossGate Attention and Bottleneck AutoEncoder for Efficient 5G Channel Prediction

Researchers introduce Lightweight PCGAE-Net, a new neural network architecture that reduces 5G channel prediction model size by 58% while improving accuracy by up to 6.0dB. The model addresses architectural inefficiencies in existing transformers through parallel attention mechanisms and a bottleneck autoencoder, enabling deployment on base-station hardware with computational constraints.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

OmniMem is a new memory compression framework for audio-visual large language models that enables efficient long-form video understanding by using modality-aware memory allocation and perturbation-aware token selection. The approach achieves 2-4% accuracy improvements over existing compression methods while reducing memory requirements, with potential applications in real-time video AI systems.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Researchers propose Dual-Path Vision Token Routing (DPVR), a framework that optimizes multimodal large language models by routing vision tokens away from deep transformer layers where they saturate early, instead fusing visual and textual information only in the final layer. The approach reduces computational overhead by 3% while maintaining competitive performance, challenging the assumption that vision tokens must traverse all deep language-model layers.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

ATT-CR: Adaptive Triangular Transformer for Cloud Removal

Researchers introduce ATT-CR, a Transformer-based model that improves cloud removal in remote sensing images by reducing computational complexity and filtering out interference from cloudy pixels. The model pairs Triangular Attention, which has O(N) computational cost, with a Feature Selected Gating Module that distinguishes valid from invalid features, addressing scalability limitations in existing Transformer approaches.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

Researchers introduce pause-and-think-T, a reasoning-focused training dataset that enables compact Vision-Language Models to perform grounded video understanding and action suggestion tasks. A 4-billion parameter model fine-tuned on this dataset matches or exceeds much larger models (including GPT-4o and Qwen3-VL-235B) on benchmark tasks while demonstrating strong generalization to unseen datasets.

🧠 GPT-4 · 🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

DynMuon: A Dynamic Spectral Shaping View of Muon

Researchers propose DynMuon, an enhancement to the Muon optimizer used in large language model training that dynamically adjusts spectral shaping parameters throughout training. The method achieves lower validation loss and requires 10.6-26.5% fewer training steps than standard Muon by shifting from positive to mildly negative spectral exponents.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Parallax: Parameterized Local Linear Attention for Language Modeling

Researchers introduce Parallax, a scalable Local Linear Attention mechanism that improves upon traditional softmax attention in large language models by learning query-like projectors to probe key-value covariance. Pretraining experiments at 0.6B and 1.7B parameters demonstrate consistent perplexity improvements and downstream benchmark gains, with performance matching or exceeding FlashAttention while revealing novel architecture-optimizer codesign benefits with the Muon optimizer.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Entropy-aware Masking for Masked Language Modeling

Researchers propose entropy-aware masking for masked language modeling, which selectively masks tokens based on prediction uncertainty rather than random selection. The approach achieves 5% improvement in GLUE scores and performs best when combined with knowledge distillation, offering a more efficient pretraining strategy for encoder-based language models.
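
A rough sketch of uncertainty-driven mask selection follows. The summary says only that masking is based on prediction uncertainty, so the choice to mask the highest-entropy tokens is an assumption here, as are the names `token_entropy` and `choose_mask_positions` and the default `mask_ratio` (borrowed from the conventional 15% used in masked language modeling).

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_mask_positions(per_token_probs, mask_ratio=0.15):
    """Return sorted indices of the highest-entropy tokens.

    Assumption: uncertain tokens are the ones masked; the paper may
    use a different selection or sampling rule.
    """
    num_to_mask = max(1, round(len(per_token_probs) * mask_ratio))
    ranked = sorted(
        range(len(per_token_probs)),
        key=lambda i: token_entropy(per_token_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:num_to_mask])


predictions = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
positions = choose_mask_positions(predictions, mask_ratio=0.5)
```

Random masking would pick positions uniformly; ranking by entropy instead concentrates the training signal on tokens the model currently finds hard to predict.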

AI · Neutral · arXiv – CS AI · May 28 · 6/10

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

AdaMerge introduces a training-free method to accelerate Vision Transformers by improving token merging through salience-aware mechanisms and adaptive layer-wise compression. The approach outperforms existing token reduction methods across all computational efficiency benchmarks, maintaining superior accuracy-to-FLOPs ratios on ImageNet-1k evaluations.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

Researchers present a new quantization method for large video diffusion models that achieves 59.3% memory reduction while maintaining near-baseline quality. The technique addresses challenges in compressing Wan2.2-I2V's mixture-of-experts architecture by using timestep-aware and expert-specific calibration strategies.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Researchers introduce MODIX, a training-free framework that dynamically optimizes how Vision-Language Models allocate attention across multimodal inputs by adjusting positional encoding based on information density rather than uniform token assignment. The approach improves reasoning performance without modifying model parameters, suggesting positional encoding should be treated as an adaptive resource in multimodal transformer architectures.

AI · Bullish · arXiv – CS AI · Apr 13 · 6/10

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

Researchers introduce WAND, a framework that reduces computational and memory costs of autoregressive text-to-speech models by replacing full self-attention with windowed attention combined with knowledge distillation. The approach achieves up to 66.2% KV cache memory reduction while maintaining speech quality, addressing a critical scalability bottleneck in modern AR-TTS systems.
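
Why windowed attention shrinks the KV cache can be shown with a small sketch: if each position attends only to the most recent tokens, older key/value pairs never need to be kept. The class `WindowedKVCache`, the helper `causal_window_mask`, and the window sizes below are hypothetical illustrations, not WAND's actual design, and the knowledge-distillation step is not shown.

```python
from collections import deque


class WindowedKVCache:
    """Keeps only the most recent `window` key/value pairs."""

    def __init__(self, window):
        self.window = window
        # A bounded deque evicts the oldest entry automatically when full.
        self.entries = deque(maxlen=window)

    def append(self, key, value):
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)


def causal_window_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


cache = WindowedKVCache(window=4)
for step in range(10):
    cache.append(f"k{step}", f"v{step}")  # placeholder strings stand in for tensors

mask = causal_window_mask(seq_len=6, window=3)
```

With full self-attention the cache grows with every generated token, whereas here it stays at `window` entries; the reported 66.2% saving is the paper's measured figure and is not derived from this sketch.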
