#token-compression News & Analysis

14 articles tagged with #token-compression. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation

FastSLM introduces a Hierarchical Temporal Abstractor (HTA) that compresses long-form speech to just 1.67 tokens per second (a 97% reduction) while maintaining competitive performance on speech understanding benchmarks. The architecture addresses a scaling bottleneck for multimodal AI models by preserving acoustic detail despite extreme compression, enabling efficient deployment of speech-capable language models.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Accelerating Constrained Decoding with Token Space Compression

Researchers introduce CFGzip, a token space compression technique that dramatically accelerates constrained decoding for large language models using context-free grammars. The method achieves up to 100x latency reduction and 7.5x total speedup, making complex grammar-constrained generation feasible at scale.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Researchers introduce PARCEL, a new vision-language model architecture that reduces computational overhead during inference by dynamically balancing spatial pooling and query-based token compression. The approach outperforms existing methods across 27 benchmarks while maintaining flexibility to deploy at multiple computational budgets without retraining.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

Researchers introduce Self-Signals Driven Multi-LLM Debate (SID), a method that leverages internal model signals like token logits and attention mechanisms to improve multi-agent LLM reasoning while reducing computational overhead. The approach enables high-confidence models to exit early and compresses redundant debate content, achieving better accuracy with lower token consumption than existing multi-LLM debate techniques.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

TSCG is a deterministic compiler that converts JSON tool schemas into structured text optimized for language model interpretation, solving a critical failure point in agentic AI systems. The technology restores accuracy in smaller models (4B-14B) from near-zero to 84%+ on production-scale tool catalogs while reducing token consumption by 52-57%, shipping as a lightweight TypeScript package.

🏢 OpenAI · 🏢 Anthropic · 🧠 GPT-5

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

LightThinker++: From Reasoning Compression to Memory Management

Researchers developed LightThinker++, a new framework that enables large language models to compress intermediate reasoning thoughts and manage memory more efficiently. The system reduces peak token usage by up to 70% while improving accuracy by 2.42% and maintaining performance over extended reasoning tasks.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

TTF: Temporal Token Fusion for Efficient Video-Language Model

Researchers introduce Temporal Token Fusion (TTF), a training-free compression technique that reduces visual tokens in video-language models by 67% while maintaining 99.5% accuracy. The method addresses the critical bottleneck of LLM prefill costs in video understanding by identifying and fusing redundant tokens across video frames using local similarity matching.

AI · Bullish · arXiv – CS AI · Apr 6 · 6/10

Token-Efficient Multimodal Reasoning via Image Prompt Packaging

Researchers introduce Image Prompt Packaging (IPPg), a technique that embeds text directly into images to reduce multimodal AI inference costs by 35.8-91.0% while maintaining competitive accuracy. The method shows significant promise for cost optimization in large multimodal language models, though effectiveness varies by model and task type.

🧠 GPT-4 · 🧠 Claude

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

Photon is a new framework that efficiently processes 3D medical imaging for AI visual question answering by using variable-length token sequences and adaptive compression. The system reduces computational costs while maintaining accuracy through instruction-conditioned token scheduling and custom gradient propagation techniques.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Researchers developed a structured distillation method that compresses AI agent conversation history by 11x (from 371 to 38 tokens per exchange) while maintaining 96% of retrieval quality. The technique enables thousands of exchanges to fit within a single prompt at 1/11th the context cost, addressing the expensive verbatim storage problem for long AI conversations.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Researchers introduce Cheers, a unified multimodal AI model that combines visual comprehension and generation by decoupling patch details from semantic representations. The model achieves 4x token compression and outperforms existing models like Tar-1.5B while using only 20% of the training cost.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7

TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning

Researchers propose TC-SSA, a token compression framework that enables large vision-language models to process gigapixel pathology images by reducing visual tokens to 1.7% of their original count while maintaining diagnostic accuracy. The method achieves 78.34% overall accuracy on SlideBench and demonstrates strong performance across multiple cancer classification tasks.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

Researchers developed CaCoVID, a reinforcement learning-based algorithm that compresses video tokens for large language models by selecting tokens based on their actual contribution to correct predictions rather than attention scores. The method uses combinatorial policy optimization to reduce computational overhead while maintaining video understanding performance.

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 4

EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection

Researchers introduce EfficientPosterGen, an AI framework that automatically converts research papers into academic posters using semantic-aware retrieval and token compression techniques. The system addresses key limitations of existing multimodal language models by reducing token consumption while maintaining high-quality poster generation through innovative visual-based context compression and deterministic layout violation detection.