y0news

#llm-optimization News & Analysis

139 articles tagged with #llm-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · Crypto Briefing · 2d ago · 7/10

MIT’s MeMo boosts LLM performance by 26% without retraining

MIT researchers have developed MeMo, a technique that improves large language model performance by 26% without requiring model retraining. This approach reduces computational costs and enables efficient adaptation across multiple domains, addressing a major pain point in AI deployment.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Pushing the Limits of Block Rotations in Post-Training Quantization

Researchers present PeRQ, a post-training quantization method that uses permutations to optimize block rotations for neural network compression. The approach recovers up to 90% of full-vector rotation performance when quantizing large language models to INT4, significantly outperforming existing block rotation methods.

🏢 Perplexity · 🧠 Llama
AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Less Is More: Elevating RAG via Performance-Driven Context Compression

Researchers introduce CORE-RAG, a novel framework that compresses context in Retrieval-Augmented Generation systems using performance-driven learning rather than predefined heuristics. The approach achieves a 97% compression ratio while improving accuracy by 3.3 points on exact match scores, addressing a critical bottleneck in LLM efficiency.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

Researchers propose Feature Activation Coverage (FAC), a new metric for measuring data diversity in large language models using sparse autoencoders instead of traditional text-based metrics. The FAC Synthesis framework generates synthetic training data to fill feature gaps, demonstrating consistent improvements across multiple tasks and revealing transferable feature spaces across different model families.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

Researchers introduce BitTP, a quantization technique that compresses LLM-based trajectory prediction models to 1.58-bit weights while maintaining full-precision activations, enabling deployment on resource-constrained edge devices. The approach not only reduces memory and latency but actually improves prediction accuracy by 14-21% compared to full-precision baselines, demonstrating that strategic quantization can serve as an effective regularizer.
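
For context on the 1.58-bit figure: a weight restricted to the three values -1, 0 and +1 carries log2(3) ≈ 1.58 bits. The sketch below shows one common way to produce such ternary weights (a mean-absolute-value scale followed by round-and-clip, the scheme popularized by BitNet b1.58). It is an illustration only: the summary does not say that BitTP uses this exact procedure, and the function names are hypothetical.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map a list of float weights to {-1, 0, +1} plus one shared scale.

    Assumption: per-tensor "absmean" scaling; not confirmed as BitTP's method.
    """
    # Scale is the mean absolute weight (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from ternary codes."""
    return [v * scale for v in q]


# Information content of one ternary weight, in bits.
bits_per_weight = math.log2(3)

q, scale = ternary_quantize([0.9, -0.05, -1.2, 0.3])
approx = dequantize(q, scale)
```

Activations stay in full precision in the setup the summary describes, so only the weight tensors would pass through a step like `ternary_quantize`.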

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Accelerating Constrained Decoding with Token Space Compression

Researchers introduce CFGzip, a token space compression technique that dramatically accelerates constrained decoding for large language models using context-free grammars. The method achieves up to 100x latency reduction and 7.5x total speedup, making complex grammar-constrained generation feasible at scale.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Researchers introduce Moment-KV, a momentum-based compression technique that optimizes Key-Value cache usage during LLM decoding phases. The method improves long-generation task performance by 2.3-3.2% while maintaining latency by dynamically tracking token importance through temporal attention patterns rather than static heuristics.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

GRPO is Secretly a Process Reward Model

Researchers demonstrate that Group Relative Policy Optimization (GRPO), a popular reinforcement learning algorithm using outcome rewards, mathematically functions as an implicit process reward model. The discovery enables algorithmic improvements (λ-GRPO) that enhance large language model performance on reasoning tasks without explicit process reward implementation or significant computational overhead.
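
As background, the outcome-reward signal in standard GRPO is usually described as a group-relative advantage: several completions are sampled for the same prompt, and each completion's reward is normalized against the group's mean and standard deviation. The sketch below illustrates only that baseline computation, not the λ-GRPO modification, whose details the summary does not give. The function name and the choice of population standard deviation are assumptions (implementations differ on this point).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of sampled completions.

    Assumption: population standard deviation; eps guards against a group
    in which every completion received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two correct (reward 1), two incorrect (0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this formulation every token of a completion shares that completion's advantage, which is why the algorithm is normally classed as outcome-rewarded rather than process-rewarded.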

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

PassNet introduces the first large-scale ecosystem for using large language models to generate compiler passes, the structured graph transformations that optimize tensor compiler performance. The framework includes 18K computational graphs and 200 curated benchmark tasks, revealing that while LLMs lag frontier models by 37% on average, they achieve up to 3x speedups on individual workloads, indicating that consistency, rather than capability, is the limiting factor.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs

Researchers introduce Logit-aware Final-block Quantization (LFQ), a technique that improves low-bit quantization of large language models by optimizing the final transformer block to preserve token probability distributions. This advancement addresses quality degradation in generative tasks while maintaining efficiency gains critical for deploying scaled LLMs.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Researchers used large language models and evolutionary search to create the first domain-independent heuristics for symbolic AI planning that surpass hand-engineered baselines. These evolved heuristics, written in C++, solve more planning tasks than existing state-of-the-art approaches and maintain the soundness guarantees of traditional planners.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

ParaTool: Shifting Tool Representations from Context to Parameters

ParaTool is a new framework that shifts tool representations from context to parameters in large language models, enabling efficient tool calling without relying on lengthy in-context documentation. The approach uses parametric tool pre-training, soft tool selection, and fine-tuning to reduce inference overhead and hallucination risks while maintaining superior performance on benchmark tests.

AI × Crypto · Bullish · Crypto Briefing · 3d ago · 7/10

AutoTTS reduces token usage by 69.5% in LLM reasoning strategies

AutoTTS has achieved a 69.5% reduction in token usage for large language model reasoning tasks, potentially lowering operational costs for AI systems. This efficiency gain has significant implications for crypto infrastructure and AI-driven sectors that rely on LLM inference, making computational resources more economical.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

Researchers introduce VULPO, an on-policy LLM optimization framework for vulnerability detection that achieves a 203% improvement over baseline models by incorporating context-aware reasoning and multidimensional reward signals. The approach combines a new ContextVul dataset with specialized fine-tuning to create more effective security analysis tools that reason through complex code interactions.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

GoQuant introduces Orthogonal Residual Projection (ORP), a quantization framework that enables efficient deployment of large language models on edge devices by replacing multiplication operations with bit-shifts. The approach achieves competitive performance at 3-bit precision while reducing calibration time to 15 minutes, addressing fundamental geometric limitations in power-of-two quantization.

🏢 Perplexity
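
To make the bit-shift idea in the GoQuant summary concrete: when a weight is constrained to a signed power of two, multiplying an integer activation by it reduces to a shift. The sketch below shows that generic power-of-two trick under stated assumptions (nearest-exponent rounding in the log domain, an exponent range of -8 to 0, hypothetical function names). It does not implement Orthogonal Residual Projection, which the summary does not describe in detail.

```python
import math


def pot_quantize(w, min_exp=-8, max_exp=0):
    """Approximate w by sign * 2**exp; returns (sign, exp).

    Assumption: the exponent is chosen by rounding log2(|w|) and clamping.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return sign, exp


def shift_multiply(x_int, sign, exp):
    """Compute x_int * sign * 2**exp using shifts only (no multiplier)."""
    if sign == 0:
        return 0
    # A negative exponent is a right shift, which floors the result.
    shifted = x_int << exp if exp >= 0 else x_int >> -exp
    return shifted if sign > 0 else -shifted


sign, exp = pot_quantize(0.23)          # nearest power of two is 2**-2
product = shift_multiply(64, sign, exp)  # 64 * 0.25
```

The rounding error of a single power-of-two term (0.23 becomes 0.25 here) is the kind of limitation that residual or projection-based schemes aim to reduce.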
AI · Bullish · arXiv – CS AI · 3d ago · 7/10

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

Researchers introduce ZipRL, an adaptive context compression framework that uses reinforcement learning to efficiently reduce token usage in multi-turn LLM agent tasks while preserving task-critical information. The method incorporates Hindsight Response Replay to address sparse reward problems and demonstrates 27-35% performance improvements over existing approaches on benchmark tasks.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

FD-RAG: Federated Dual-System Retrieval-Augmented Generation

FD-RAG introduces a federated framework for retrieval-augmented generation that enables decentralized LLM deployment across edge devices without centralizing sensitive data. The system achieves 7.8% accuracy improvements and 8.4x latency reductions by splitting lightweight memory access from expensive LLM reasoning, while aggregating anonymized knowledge across fragmented device networks.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

Researchers propose HiSME, a hierarchical skill meta-evolving framework that enables AI agents to continuously improve both their skills and the strategies used to evolve those skills at test time, without expensive model parameter updates. The approach learns meta-skills from task execution traces and produces higher-quality skill libraries than static skill-evolution approaches.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

Researchers develop a systematic approach to quantization-aware training for large language models using 8-bit floating-point formats, identifying and solving two critical failure modes—amax saturation and catastrophic forgetting—that don't surface in standard training metrics. Their solution achieves near-lossless performance with only 0.43% degradation on benchmark tasks, advancing practical LLM deployment efficiency.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

PANDO introduces an efficient multimodal AI agent framework that improves performance while reducing computational costs through online skill distillation, achieving 58.3% success on VisualWebArena tasks with 58-61% fewer tokens than competing approaches. The system addresses inefficiencies in web agent design by maintaining a skill library and employing hierarchical routing, visual compression, and cache-aware prompting without requiring expensive pre-evaluation.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

Researchers propose RDKV, a novel compression technique that jointly optimizes eviction and quantization of the Key-Value cache in large language models to reduce memory bottlenecks during inference. The method achieves 4.5x decode speedup and 1.9x peak memory reduction on 128K context lengths while maintaining 97.81% accuracy, addressing a critical performance constraint in LLM deployment.
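
To see why the KV cache becomes a memory bottleneck at 128K tokens, a back-of-the-envelope size estimate helps. The formula below is the standard one (keys plus values, per layer, per KV head). The model dimensions are assumptions chosen to resemble a typical 8B-class model with grouped-query attention; they are not taken from the RDKV work, and the function name is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem, batch=1):
    """Size of the KV cache: 2 tensors (K and V) per layer per token."""
    return (2 * batch * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_elem)


GIB = 2 ** 30

# Assumed configuration: 32 layers, 8 KV heads, head dimension 128.
fp16_size = kv_cache_bytes(seq_len=128 * 1024, n_layers=32, n_kv_heads=8,
                           head_dim=128, bytes_per_elem=2)

# Same cache at 4 bits per element, ignoring scale/zero-point overhead.
int4_size = kv_cache_bytes(seq_len=128 * 1024, n_layers=32, n_kv_heads=8,
                           head_dim=128, bytes_per_elem=0.5)

print(f"FP16: {fp16_size / GIB:.0f} GiB, 4-bit: {int4_size / GIB:.0f} GiB")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of cache in FP16, which is why eviction (fewer tokens) and quantization (fewer bits per element) are both attractive levers.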

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection

Echo-LoRA introduces a parameter-efficient fine-tuning method that injects cross-layer representations from deeper neural network layers into shallow LoRA modules during training, achieving 3-5.7% performance improvements on reasoning tasks without adding inference costs. The technique discards its auxiliary training path post-deployment, maintaining the efficiency benefits of standard LoRA while delivering measurable capability gains.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

Researchers propose LEAD, a new method that makes large reasoning models more efficient by dynamically balancing accuracy and output length during training. Unlike existing approaches that use static constraints, LEAD adapts per-problem length targets and reward calibration in real time, achieving better accuracy and shorter outputs across mathematical reasoning benchmarks.

🏢 OpenAI · 🧠 o1
AI · Bullish · arXiv – CS AI · May 12 · 7/10

RuPLaR: Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step

Researchers introduce RuPLaR, a novel compression framework that enables Large Language Models to generate latent reasoning tokens in a single training stage, eliminating inefficiencies of traditional multi-step Chain-of-Thought approaches. The method achieves an 11.1% accuracy improvement over existing latent CoT systems while using minimal tokens, demonstrating significant progress in efficient LLM reasoning.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

Researchers introduce BubbleSpec, a framework that optimizes Reinforcement Learning training for Large Language Models by exploiting idle GPU time during synchronous rollouts. The method uses speculative decoding to pre-generate draft outputs during wait periods, achieving a 50% reduction in decoding steps and up to a 1.8x throughput improvement while maintaining mathematical exactness.
