y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#token-reduction News & Analysis

16 articles tagged with #token-reduction. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

16 articles
AIBullisharXiv – CS AI · Jun 27/10
🧠

Latent Collaboration in Multi-Agent Systems

Researchers introduce LatentMAS, a framework enabling LLM agents to collaborate directly in latent space rather than through text, achieving up to 14.6% higher accuracy while reducing token usage by 70.8%-83.7% and improving inference speed 4× faster than text-based multi-agent systems.

AIBullisharXiv – CS AI · May 287/10
🧠

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Researchers introduce CIVIC, a framework that optimizes Vision-Language Models by maintaining compact visual token sequences throughout the entire inference pipeline, reducing KV-cache memory to one-third while achieving measurable hardware acceleration without accuracy loss.

AIBullisharXiv – CS AI · May 127/10
🧠

Reasoning Compression with Mixed-Policy Distillation

Researchers introduce Mixed-Policy Distillation (MPD), a technique that compresses reasoning in smaller language models by having larger teacher models rewrite student-generated reasoning traces into more concise versions. The method reduces token usage by up to 27.1% while maintaining or improving performance, addressing critical deployment constraints around memory, latency, and serving costs.

AIBullisharXiv – CS AI · May 97/10
🧠

ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

ReaComp introduces a method to compile reasoning traces from large language models into reusable symbolic program synthesizers that eliminate runtime LLM calls. The approach achieves 91.3% accuracy on benchmark tasks while reducing token usage by 78%, demonstrating that neuro-symbolic hybrid systems can outperform pure LLM inference on complex program synthesis problems.

AIBullisharXiv – CS AI · Apr 67/10
🧠

FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

Researchers discovered that in Large Reasoning Models like DeepSeek-R1, the first solution is often the best, with alternative solutions being detrimental due to error accumulation. They propose RED, a new framework that achieves up to 19% performance gains while reducing token consumption by 37.7-70.4%.

AIBullisharXiv – CS AI · Mar 177/10
🧠

D-MEM: Dopamine-Gated Agentic Memory via Reward Prediction Error Routing

Researchers introduce D-MEM, a biologically-inspired memory architecture for AI agents that uses dopamine-like reward prediction error routing to dramatically reduce computational costs. The system reduces token consumption by over 80% and eliminates quadratic scaling bottlenecks by selectively processing only high-importance information through cognitive restructuring.

AIBullisharXiv – CS AI · Mar 37/104
🧠

LightMem: Lightweight and Efficient Memory-Augmented Generation

Researchers introduce LightMem, a new memory system for Large Language Models that mimics human memory structure with three stages: sensory, short-term, and long-term memory. The system achieves up to 7.7% better QA accuracy while reducing token usage by up to 106x and API calls by up to 159x compared to existing methods.

AIBullisharXiv – CS AI · Feb 277/107
🧠

Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents

Researchers introduce Contextual Memory Virtualisation (CMV), a system that preserves LLM understanding across extended sessions by treating context as version-controlled state using DAG-based management. The system includes a trimming algorithm that reduces token counts by 20-86% while preserving all user interactions, demonstrating particular efficiency in tool-use sessions.

AINeutralarXiv – CS AI · Jun 56/10
🧠

Differentiable Efficient Operator Search

Researchers propose Efficient Operator Search, a differentiable framework that automates the design of token-reduction operators for multimodal foundation models. The approach unifies previously distinct manual techniques like pruning and merging into a shared search space, discovering hybrid operators that achieve better accuracy-efficiency trade-offs than hand-designed baselines.

AIBullisharXiv – CS AI · Jun 46/10
🧠

Can Reasoning Path still be Effective as Input? Bridging Post-Reasoning to Chain-of-Thought Compression

Researchers propose Upfront CoT (UCoT), a framework that compresses Chain-of-Thought reasoning in large language models by using a lightweight compressor to generate soft token representations of reasoning paths. The method maintains reasoning performance while reducing token usage by 50% on benchmarks, addressing the efficiency-performance tradeoff in advanced LLM inference.

AINeutralarXiv – CS AI · Jun 26/10
🧠

SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval

SkillPager is a novel retrieval framework that optimizes how large language model agents access long procedural documents by selecting minimal, execution-sufficient context from skill documents. The system achieves 78.89% sufficiency while reducing prompt tokens by 47.04% compared to full-document prompting, demonstrating that typed semantic granularity significantly improves efficiency in skill-based LLM agent systems.

AINeutralarXiv – CS AI · Jun 26/10
🧠

On the Limits of Token Reduction for Efficient Unified Vision Language Training

Researchers discover fundamental limits in using token reduction techniques to accelerate unified vision-language model training, finding that visual understanding and generation have conflicting computational requirements. While task-specific optimization achieves efficiency gains individually, joint training creates synergy loss, suggesting that efficient unified VLM development requires new approaches that preserve cross-task parameter sharing.

AIBullisharXiv – CS AI · Jun 26/10
🧠

Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus

Researchers propose DySCo, a dynamic sparse communication mechanism for LLM-based multi-agent systems that reduces computational overhead by selectively routing messages between agents rather than using full broadcast. The approach maintains consensus quality while cutting token costs and latency that scale quadratically with agent count, addressing a key efficiency bottleneck in collaborative AI reasoning systems.

AIBullisharXiv – CS AI · Mar 36/106
🧠

Stateful Token Reduction for Long-Video Hybrid VLMs

Researchers developed a new token reduction method for hybrid vision-language models that process long videos, achieving 3.8-4.2x speedup while retaining only 25% of visual tokens. The approach uses progressive reduction and unified scoring for both attention and Mamba blocks, maintaining near-baseline accuracy on long-context video benchmarks.

$NEAR