#inference-optimization News & Analysis

319 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation

Researchers propose CKA-QAD, a new method for quantizing large language models to NVFP4 precision that preserves internal representational geometry rather than just matching output distributions. The approach addresses a critical limitation in existing quantization-aware distillation techniques, showing significant improvements in reasoning and coding task performance across multiple model architectures.
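
For readers unfamiliar with the "representational geometry" mentioned above, centered kernel alignment (CKA) is a standard similarity score between two sets of activations. The summary does not say how CKA-QAD computes or applies it, so the following is only a minimal sketch of plain linear CKA in pure Python; the function names and the toy activation matrices are illustrative assumptions, not taken from the paper.

```python
import math

def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two (samples x features) matrices."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows."""
    xc, yc = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(yc, xc)
    den = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return num / den

# Hypothetical activations: 4 inputs, 2 hidden features each.
teacher = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rescaled = [[2.0 * v for v in row] for row in teacher]   # same geometry, different scale
distorted = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

cka_same = linear_cka(teacher, rescaled)
cka_diff = linear_cka(teacher, distorted)
print(cka_same, cka_diff)
```

Linear CKA ignores uniform rescaling of the features, so the rescaled copy scores close to 1 while the distorted activations score lower. That invariance is one reason a geometry-based signal can differ from a loss that only compares output distributions.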

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

Researchers introduce Dynamic Thinking-Token Selection (DynTS), a method that optimizes Large Reasoning Models by identifying and retaining only decision-critical tokens during inference while discarding redundant reasoning trace data. This approach significantly reduces memory footprint and computational overhead, addressing a major efficiency bottleneck in LRMs that generate extended reasoning sequences.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Researchers introduce Active Video Perception (AVP), an AI framework that enables agents to actively seek relevant evidence in long videos rather than passively processing entire content. The system uses an iterative plan-observe-reflect process to achieve superior accuracy on five benchmarks while reducing inference time by 82% and token usage by 88% compared to existing agentic methods.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

AdaMEM: Test-Time Adaptive Memory for Language Agents

Researchers introduce AdaMEM, a test-time adaptive memory framework that enables language agents to dynamically adjust behavior during inference without updating model parameters. The system combines persistent offline trajectory memory with dynamically generated on-the-fly strategy memory, demonstrating 11-13% performance improvements on complex reasoning and web interaction tasks.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

Researchers introduce ReTreVal, a training-free framework that enables large language models to learn from failures across multiple problems without fine-tuning. By implementing adaptive tree exploration, typed-failure backtracking, and cross-problem memory, ReTreVal achieves significant performance improvements on mathematical and knowledge reasoning tasks, allowing a 32B model to match much larger systems.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

Researchers present Recover-LoRA, a technique that recovers accuracy in large language models aggressively quantized to 2-bit precision by applying low-rank adapters trained on synthetic data. The method achieves 7.5-23.3% throughput improvements while recovering 80-95% of lost accuracy on most benchmarks, enabling practical deployment of compressed models on edge devices.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

MIRAGE is a new AI framework that enables mobile agents to reason internally using compressed latent representations instead of generating verbose reasoning chains. By aligning hidden states with future interface screenshots, the system achieves comparable performance to explicit chain-of-thought approaches while reducing token generation by 3-5x, offering significant efficiency gains for AI-powered mobile automation.

AI · Bearish · arXiv – CS AI · Jun 4 · 7/10

Widening the Gap: Exploiting LLM Quantization via Outlier Injection

Researchers demonstrate the first practical quantization-conditioned attack that reliably compromises large language models across advanced quantization methods including AWQ, GPTQ, and GGUF. The attack exploits how outlier weights cause rounding errors in modern quantization schemes, allowing adversaries to inject hidden malicious behaviors that activate only after quantization, posing significant security risks to the deployment pipeline.
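
To see why a single outlier weight matters for rounding error, consider the simplest possible scheme. This is only a sketch: it uses symmetric absmax round-to-nearest quantization with one shared scale per group, which is an assumption for illustration and is simpler than AWQ, GPTQ, or the GGUF formats. It shows the benign numerical effect only, not the attack described in the paper. The weight values are made up.

```python
def quantize_absmax(weights, bits=4):
    """Round-to-nearest quantization with one scale shared by the whole group."""
    levels = 2 ** (bits - 1) - 1                 # 7 positive levels for 4-bit
    scale = max(abs(w) for w in weights) / levels
    if scale == 0:
        return list(weights)                     # all-zero group: nothing to round
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(original, dequantized):
    """Average absolute rounding error over the compared weights."""
    return sum(abs(o - d) for o, d in zip(original, dequantized)) / len(original)

# Hypothetical group of small weights, with and without one large outlier.
group = [0.11, -0.23, 0.05, 0.31, -0.17, 0.08]
with_outlier = group + [8.0]

err_clean = mean_abs_error(group, quantize_absmax(group))
# zip() stops at the shorter list, so only the six original weights are compared.
err_outlier = mean_abs_error(group, quantize_absmax(with_outlier))
print(err_clean, err_outlier)
```

The outlier stretches the shared scale, so every other weight in the group is rounded on a much coarser grid. Under this toy scheme the small weights all collapse to zero once the outlier is present, which is the kind of gap between full-precision and quantized behavior the summary refers to.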

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Researchers introduce Speculative Thinking, a training-free framework that leverages larger AI models to guide smaller ones during inference, improving reasoning accuracy while reducing output length. The method achieves a 6.2% accuracy boost on mathematical reasoning tasks for a 1.5B parameter model with 15.7% shorter outputs, demonstrating efficiency gains without costly retraining.