#token-reduction News & Analysis

9 articles tagged with #token-reduction. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · May 12 · 7/10

Reasoning Compression with Mixed-Policy Distillation

Researchers introduce Mixed-Policy Distillation (MPD), a technique that compresses reasoning in smaller language models by having larger teacher models rewrite student-generated reasoning traces into more concise versions. The method reduces token usage by up to 27.1% while maintaining or improving performance, addressing critical deployment constraints around memory, latency, and serving costs.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

ReaComp introduces a method to compile reasoning traces from large language models into reusable symbolic program synthesizers that eliminate runtime LLM calls. The approach achieves 91.3% accuracy on benchmark tasks while reducing token usage by 78%, demonstrating that neuro-symbolic hybrid systems can outperform pure LLM inference on complex program synthesis problems.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

Researchers found that in Large Reasoning Models such as DeepSeek-R1, the first solution is often the best, and alternative solutions are detrimental because errors accumulate. They propose RED, a framework that achieves up to 19% performance gains while reducing token consumption by 37.7-70.4%.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

D-MEM: Dopamine-Gated Agentic Memory via Reward Prediction Error Routing

Researchers introduce D-MEM, a biologically inspired memory architecture for AI agents that routes information using a dopamine-like reward prediction error signal to cut computational costs. The system reduces token consumption by over 80% and eliminates quadratic scaling bottlenecks by sending only high-importance information through cognitive restructuring.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

Researchers developed SToRM, a new framework that reduces computational costs for autonomous driving systems using multi-modal large language models by up to 30x while maintaining performance. The system uses supervised token reduction techniques to enable real-time end-to-end driving on standard GPUs without sacrificing safety or accuracy.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

LightMem: Lightweight and Efficient Memory-Augmented Generation

Researchers introduce LightMem, a new memory system for Large Language Models that mimics human memory structure with three stages: sensory, short-term, and long-term memory. The system achieves up to 7.7% better QA accuracy while reducing token usage by up to 106x and API calls by up to 159x compared to existing methods.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents

Researchers introduce Contextual Memory Virtualisation (CMV), a system that preserves LLM understanding across extended sessions by treating context as version-controlled state using DAG-based management. The system includes a trimming algorithm that reduces token counts by 20-86% while preserving all user interactions, demonstrating particular efficiency in tool-use sessions.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Stop Before You Fail: Operational Capability Boundaries for Mitigating Unproductive Reasoning in Large Reasoning Models

Researchers developed monitoring strategies that detect early failure signals when Large Reasoning Models are reasoning unproductively. The techniques reduce token usage by 62.7-93.6% while maintaining accuracy.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Stateful Token Reduction for Long-Video Hybrid VLMs

Researchers developed a new token reduction method for hybrid vision-language models that process long videos, achieving 3.8-4.2x speedup while retaining only 25% of visual tokens. The approach uses progressive reduction and unified scoring for both attention and Mamba blocks, maintaining near-baseline accuracy on long-context video benchmarks.
