#llm-efficiency News & Analysis

56 articles tagged with #llm-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · May 9 · 7/10

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

Researchers introduce Post-Reasoning, a technique that improves LLM performance by having models justify answers after generating final responses, without increasing latency or token costs. The method shows a 17.37% mean performance improvement across 117 model-benchmark settings and establishes a new efficiency frontier for direct-answer AI capabilities.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

ReaComp introduces a method to compile reasoning traces from large language models into reusable symbolic program synthesizers that eliminate runtime LLM calls. The approach achieves 91.3% accuracy on benchmark tasks while reducing token usage by 78%, demonstrating that neuro-symbolic hybrid systems can outperform pure LLM inference on complex program synthesis problems.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Reasoning Graphs: Self-Improving, Deterministic RAG through Evidence-Centric Feedback

Researchers introduce reasoning graphs, a persistent knowledge structure that improves language model reasoning accuracy by storing and reusing chains of thought tied to evidence items. The system achieves 47% error reduction on multi-hop questions and maintains deterministic outputs without model retraining, using only context engineering.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

MEMENTO: Teaching LLMs to Manage Their Own Context

Researchers introduce MEMENTO, a method enabling large language models to compress their reasoning into dense summaries (mementos) organized into blocks, reducing KV cache usage by 2.5x and improving throughput by 1.75x while maintaining accuracy. The technique is validated across multiple model families using OpenMementos, a new dataset of 228K annotated reasoning traces.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Introspective Diffusion Language Models

Researchers introduce Introspective Diffusion Language Models (I-DLM), a new approach that combines the parallel generation speed of diffusion models with the quality of autoregressive models by ensuring models verify their own outputs. I-DLM achieves performance matching conventional large language models while delivering 3x higher throughput, potentially reshaping how AI systems are deployed at scale.

AI · Bearish · arXiv – CS AI · Jun 25 · 6/10

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

A research study challenges the widespread practice of using context files (like AGENTS.md) to enhance coding agent performance, finding that these files provide no measurable improvement in task completion rates while increasing inference costs by over 20%. The findings suggest that while context files help agents follow instructions, repository overviews—commonly recommended by model providers—offer minimal practical value.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment

Researchers have developed a benchmark for evaluating efficient multimodal language models on pulmonary embolism diagnosis and risk assessment using a dataset of 23,248 CTPA studies. The study demonstrates that compact models like Gemma4 perform significantly better when combining imaging and electronic health record data, with diagnostic tasks outperforming prognostic predictions.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

APEX introduces a data-efficient framework for automatic prompt optimization in large language models by dynamically categorizing training data into Easy, Hard, and Mixed tiers. The system prioritizes Mixed-tier data to identify high-leverage subsets that improve prompt quality, achieving 11.2% performance gains on Gemini 2.5 Flash with 40% fewer evaluations than static approaches.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

Researchers propose ADAS, a training-free reranking algorithm that improves parallel token decoding in masked diffusion language models by using attention weights as soft penalties, so that the sampler avoids committing to correlated predictions in the same step. The method improves results on benchmarks such as GSM8K and HumanEval by 9-10 percentage points with minimal computational overhead.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Larch: Learned Query Optimization for Semantic Predicates

Larch is a new optimization framework that improves the efficiency of semantic SQL queries by reducing token usage and computational costs when processing unstructured data with Large Language Models. The framework uses two approaches—reinforcement learning and supervised learning—to optimize the order of filter evaluation, achieving 3x-19x token cost reductions compared to existing solutions.
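To make the filter-ordering idea concrete, here is a minimal sketch of the classical static rule for ordering independent filters, used purely as a stand-in. It is not Larch's learned optimizer. The `SemanticFilter` type, its field names, and the assumption that per-row token cost and pass rate are known in advance are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SemanticFilter:
    # Hypothetical description of one LLM-evaluated predicate.
    name: str
    cost: float         # assumed tokens spent per row evaluated
    selectivity: float  # assumed fraction of rows that pass (0..1)


def expected_cost(order: List[SemanticFilter]) -> float:
    """Expected tokens per input row when filters run in the given order.

    Assumes filters are independent, so the fraction of rows reaching a
    filter is the product of the pass rates of the filters before it.
    """
    total = 0.0
    reaching = 1.0
    for f in order:
        total += reaching * f.cost
        reaching *= f.selectivity
    return total


def order_filters(filters: List[SemanticFilter]) -> List[SemanticFilter]:
    """Classical rank rule: ascending cost / (1 - selectivity)."""
    def rank(f: SemanticFilter) -> float:
        drop = 1.0 - f.selectivity
        # A filter that drops nothing should run last.
        return float("inf") if drop <= 0.0 else f.cost / drop
    return sorted(filters, key=rank)


filters = [
    SemanticFilter("mentions_refund", cost=10.0, selectivity=0.9),
    SemanticFilter("is_complaint", cost=5.0, selectivity=0.1),
]
ordered = order_filters(filters)
print([f.name for f in ordered], expected_cost(ordered), expected_cost(filters))
```

With these made-up numbers the cheap, highly selective filter runs first, and the expected cost per row drops from 14.5 to 6.0 tokens. A learned optimizer is useful precisely when costs and pass rates are not known up front, which this sketch assumes away.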

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

Researchers present an adaptive two-phase semantic filtering method that improves LLM-based document classification efficiency by 1.6-2.0x compared to existing approaches. The method combines model-free clustering with online proxy training using soft labels and adaptive calibration, achieving 90% accuracy targets while reducing expensive LLM oracle calls.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning

Researchers introduce CoT-Space, a theoretical framework that explains how Large Language Models improve reasoning through multi-step Chain-of-Thought processes via reinforcement learning. The framework models reasoning as an optimization problem in continuous semantic space, demonstrating that optimal reasoning length emerges naturally from the underfitting-overfitting trade-off, providing a principled foundation for understanding test-time scaling in modern LLMs.

AI · Neutral · Hugging Face Blog · Jun 4 · 6/10

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

NVIDIA researchers introduced a task-seeded synthetic Q&A generation method to improve pretraining of the Nemotron language model, demonstrating enhanced performance on downstream tasks through strategically generated training data. This approach addresses a key challenge in LLM development by optimizing synthetic data quality and relevance during the pretraining phase.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

FLARE: Diffusion for Hybrid Language Model

Researchers introduce FLARE, a conversion framework that enables large language models with hybrid attention mechanisms to function as both autoregressive and diffusion models, addressing a key limitation in parallel decoding while maintaining model capability. The approach demonstrates competitive performance with existing diffusion language models while delivering throughput gains in concurrent serving scenarios.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

SimSD: Simple Speculative Decoding in Diffusion Language Models

Researchers propose SimSD, a speculative decoding algorithm that gives diffusion language models up to 7.46x faster inference while maintaining generation quality. Token-level speculative verification is proven effective for autoregressive models but is fundamentally incompatible with the bidirectional attention of diffusion models; SimSD addresses this with a plug-and-play masking strategy.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers

Researchers propose a Bayesian stopping strategy that reduces LLM inference costs by up to 50% while maintaining answer accuracy. The method samples multiple LLM responses and stops once sufficient consistency is detected, using an efficient L-aggregated policy that tracks only the top 3 answer frequencies and achieves theoretical optimality.
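The sample-then-stop loop described above can be illustrated with a much simpler rule. The sketch below stops when the leading answer is ahead of the runner-up by a fixed margin; this lead-margin rule is a stand-in chosen for brevity, not the paper's Bayesian policy. The names `sample_answer`, `margin`, and `max_samples` are hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple


def sample_until_consistent(
    sample_answer: Callable[[], str],
    margin: int = 3,
    max_samples: int = 20,
) -> Tuple[str, int]:
    """Draw answers until one leads the runner-up by `margin` votes.

    Returns the chosen answer and the number of samples actually drawn.
    `sample_answer` is a placeholder for one LLM call that returns a
    normalized final answer.
    """
    counts: Counter = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        top = counts.most_common(2)
        runner_up = top[1][1] if len(top) > 1 else 0
        if top[0][1] - runner_up >= margin:
            return top[0][0], n  # early stop: answers are consistent enough
    # Budget exhausted: fall back to a plain majority vote.
    return counts.most_common(1)[0][0], max_samples


# Usage with a canned answer stream in place of a real model.
stream = iter(["42", "41", "42", "42", "42", "42"])
print(sample_until_consistent(lambda: next(stream)))
```

On this canned stream the loop stops after five samples rather than spending the full budget, which is the source of the cost saving; a Bayesian policy would replace the fixed margin with a posterior-based stopping test.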

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

Researchers propose a neuro-symbolic framework for constructing knowledge graphs that combines LLM-based extraction with post-hoc ontology constraint validation, reducing token costs while improving consistency for complex question-answering tasks. The method defers corrections until after extraction rather than applying them during it, enabling SQL-like querying for multi-hop reasoning across documents.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Researchers benchmark token-optimized data formats (TRON and TOON) against JSON in agentic AI systems, finding TRON reduces token consumption by up to 27% with acceptable accuracy trade-offs. The study reveals that while these alternatives show promise in isolated tasks, their real-world performance in multi-turn agent loops exposes limitations, particularly with TOON's parsing cascades and parallel tool-call handling.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Parallax: Parameterized Local Linear Attention for Language Modeling

Researchers introduce Parallax, a scalable Local Linear Attention mechanism that improves upon traditional softmax attention in large language models by learning query-like projectors to probe key-value covariance. Pretraining experiments at 0.6B and 1.7B parameters demonstrate consistent perplexity improvements and downstream benchmark gains, with performance matching or exceeding FlashAttention while revealing novel architecture-optimizer codesign benefits with the Muon optimizer.


AI · Neutral · arXiv – CS AI · May 29 · 6/10

Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?

Researchers propose replacing LLM-based triggers in proactive agent systems with a lightweight temporal graph learning (TGL) model that processes structured event streams directly. The approach achieves a 16.7% mean F1 improvement while running 4-7x faster on GPUs and 12-83x faster on consumer hardware, with a 220 MiB footprint suitable for on-device deployment.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Researchers introduced HRBench, a unified evaluation framework for testing hybrid-reasoning LLMs that allow dynamic switching between fast and slow reasoning modes. The framework systematically compares 12+ prior methods across three switching strategy families and four training approaches, revealing that prompt-based methods offer better token-accuracy trade-offs while routing methods provide more stable cost reduction.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

Researchers introduce FPMoE, a sparse Mixture-of-Experts model optimized for functional programming languages like Haskell, OCaml, and Scala, addressing a significant gap in LLM-based code generation. With only 3B active parameters, the model matches the performance of much larger models while using a novel architecture combining language-specific experts with a shared expert for cross-language functional patterns.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Learning the Error Patterns of Language Models

Researchers propose Palla, an algorithm that learns symbolic constraint functions called prefix filters to capture and correct systematic error patterns in large language models. By analyzing domain-specific failures (e.g., using Python syntax in TypeScript code), Palla enables constrained sampling to significantly improve compilation rates and output validity without retraining models.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Tracing Computation Density in LLMs

Researchers introduce the s-Trace method to analyze how transformer-based LLMs use their computational capacity, revealing that model computation organizes into two distinct phases: a sparse early-layer core that produces rough predictions, followed by denser later-layer computations that refine them. The findings suggest LLMs operate with modular efficiency rather than fully exploiting their parameter capacity across all inputs.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

LEVI: Stronger Search Architectures Can Substitute for Larger LLMs in Evolutionary Search

Researchers introduce LEVI, an open-source evolutionary search framework that achieves superior results on AI research benchmarks while reducing computational costs by 3.3x to 35x compared to existing methods. By optimizing search architecture rather than relying on larger language models, LEVI demonstrates that algorithmic efficiency can significantly reduce the expense of LLM-guided evolutionary discovery.
