#llm-optimization News & Analysis

239 articles tagged with #llm-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

239 articles

AI × CryptoBullishCrypto Briefing · May 287/10

🤖

AutoTTS reduces token usage by 69.5% in LLM reasoning strategies

AutoTTS has achieved a 69.5% reduction in token usage for large language model reasoning tasks, potentially lowering operational costs for AI systems. This efficiency gain has significant implications for crypto infrastructure and AI-driven sectors that rely on LLM inference, making computational resources more economical.

AIBullisharXiv – CS AI · May 287/10

🧠

GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

GoQuant introduces Orthogonal Residual Projection (ORP), a quantization framework that enables efficient deployment of large language models on edge devices by replacing multiplication operations with bit-shifts. The approach achieves competitive performance at 3-bit precision while reducing calibration time to 15 minutes, addressing fundamental geometric limitations in power-of-two quantization.

🏢 Perplexity

AIBullisharXiv – CS AI · May 287/10

🧠

FD-RAG: Federated Dual-System Retrieval-Augmented Generation

FD-RAG introduces a federated framework for retrieval-augmented generation that enables decentralized LLM deployment across edge devices without centralizing sensitive data. The system achieves 7.8% accuracy improvements and 8.4x latency reductions by splitting lightweight memory access from expensive LLM reasoning, while aggregating anonymized knowledge across fragmented device networks.

AIBullisharXiv – CS AI · May 287/10

🧠

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

Researchers introduce VULPO, an on-policy LLM optimization framework for vulnerability detection that achieves 203% improvement over baseline models by incorporating context-aware reasoning and multidimensional reward signals. The approach combines a new ContextVul dataset with specialized fine-tuning to create more effective security analysis tools that reason through complex code interactions.

AIBullisharXiv – CS AI · May 287/10

🧠

You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

Researchers propose HiSME, a hierarchical skill meta-evolving framework that enables AI agents to continuously improve both their skills and the strategies used to evolve those skills at test-time, without expensive model parameter updates. The approach learns meta-skills from task execution traces and demonstrates higher-quality skill libraries compared to static skill evolving approaches.

AIBullisharXiv – CS AI · May 287/10

🧠

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

Researchers introduce ZipRL, an adaptive context compression framework that uses reinforcement learning to efficiently reduce token usage in multi-turn LLM agent tasks while preserving task-critical information. The method incorporates Hindsight Response Replay to address sparse reward problems and demonstrates 27-35% performance improvements over existing approaches on benchmark tasks.

AIBullisharXiv – CS AI · May 277/10

🧠

Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

Researchers develop a systematic approach to quantization-aware training for large language models using 8-bit floating-point formats, identifying and solving two critical failure modes—amax saturation and catastrophic forgetting—that don't surface in standard training metrics. Their solution achieves near-lossless performance with only 0.43% degradation on benchmark tasks, advancing practical LLM deployment efficiency.

AIBullisharXiv – CS AI · May 277/10

🧠

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

PANDO introduces an efficient multimodal AI agent framework that improves performance while reducing computational costs through online skill distillation, achieving 58.3% success on VisualWebArena tasks with 58-61% fewer tokens than competing approaches. The system addresses inefficiencies in web agent design by maintaining a skill library and employing hierarchical routing, visual compression, and cache-aware prompting without requiring expensive pre-evaluation.