#test-time-compute News & Analysis

15 articles tagged with #test-time-compute. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

15 articles

AIBullisharXiv – CS AI · Jun 87/10

🧠

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

ThinkBooster is a unified framework that standardizes test-time compute scaling for large language models, providing a modular library, benchmarking suite, and production-ready API for improving LLM reasoning efficiency during inference. The framework enables developers to evaluate and deploy adaptive reasoning strategies with transparent performance-compute trade-offs across mathematical and coding tasks.

🏢 OpenAI

AIBullisharXiv – CS AI · Jun 57/10

🧠

Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

Researchers propose Agentic Monte Carlo (AMC), a novel method for optimizing black-box LLM agents without API access by using Sequential Monte Carlo sampling to steer agents toward optimal behavior. The technique bridges the gap between reinforcement learning and Bayesian inference, demonstrating competitive performance against RL baselines while maintaining the black-box model architecture.

AIBearisharXiv – CS AI · Jun 37/10

🧠

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

Researchers demonstrate that Large Reasoning Models (LRMs) frequently 'overthink' problems after reaching correct answers, with continued reasoning degrading accuracy by up to 21%. The study introduces a protocol to measure reasoning sufficiency and reveals that harmful overthinking—where additional reasoning destabilizes correct solutions—represents a broader reliability risk affecting both multimodal and language-only models.

AIBullisharXiv – CS AI · May 297/10

🧠

Self-Trained Verification for Training- and Test-Time Self-Improvement

Researchers propose Self-Trained Verification (STV), a novel approach that improves AI reasoning models by training verifiers to catch self-generated errors using reference solutions as supervision. The method doubles accuracy on hard math problems and achieves 14x improvement on scientific reasoning tasks, while also enabling more effective self-training through verifier-in-the-loop training that further boosts performance by 33%.

AIBullisharXiv – CS AI · May 97/10

🧠

ZAYA1-8B Technical Report

Zyphra has unveiled ZAYA1-8B, a compact reasoning-focused AI model with only 700M active parameters that matches larger competitors like DeepSeek-R1 on mathematics and coding tasks. The model introduces Markovian RSA, a novel test-time compute method that achieves 91.9% on AIME'25 benchmarks while maintaining computational efficiency, suggesting small models can compete with much larger reasoning systems through architectural innovation.

🧠 GPT-5🧠 Gemini

AINeutralarXiv – CS AI · Apr 147/10

🧠

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

Researchers challenge the assumption that longer reasoning chains always improve LLM performance, discovering that extended test-time compute leads to diminishing returns and 'overthinking' where models abandon correct answers. The study demonstrates that optimal compute allocation varies by problem difficulty, enabling significant efficiency gains without sacrificing accuracy.

AIBullisharXiv – CS AI · Apr 147/10

🧠

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

Researchers demonstrate that inference-time scaffolding can double the performance of small 8B language models on complex tool-use tasks without additional training, by deploying the same frozen model in three specialized roles: summarization, reasoning, and code correction. On a single 24GB GPU, this approach enables an 8B model to match or exceed much larger systems like DeepSeek-Coder 33B, suggesting efficient deployment paths for capable AI agents on modest hardware.

AIBullisharXiv – CS AI · Mar 47/104

🧠

Best-of-$\infty$ -- Asymptotic Performance of Test-Time Compute

Researchers propose 'best-of-∞' approach for large language models that uses majority voting with infinite samples, achieving superior performance but requiring infinite computation. They develop an adaptive generation scheme that dynamically selects the optimal number of samples based on answer agreement and extend the framework to weighted ensembles of multiple LLMs.

AIBullisharXiv – CS AI · Jun 116/10

🧠

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

Researchers introduce DIRECT, a routing framework that intelligently allocates computational resources at test-time for Vision-Language Models used in embodied AI planning. The system selectively chooses when to deploy expensive scaling strategies (deeper reasoning chains, larger models, expanded memory), achieving up to 65% lower latency than baseline approaches while maintaining or exceeding performance on robotic manipulation tasks.

AINeutralarXiv – CS AI · Jun 46/10

🧠

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Researchers introduce MesaNet, an improved recurrent neural network architecture that optimizes sequence modeling through test-time training, achieving better language modeling performance than previous RNNs while requiring additional inference-time compute. The work advances the trend toward linearized transformers that maintain constant memory costs during inference, positioning computational efficiency against performance gains.

🏢 Perplexity

AINeutralarXiv – CS AI · May 126/10

🧠

Sketch-and-Verify: Structured Inference-Time Scaling via Program Sketching

Sketch-and-Verify is an inference-time scaling technique that improves small language model performance by having the LLM generate multiple algorithmic strategies as program sketches, then filling and verifying them. On HumanEval+, this approach delivers superior cost-performance within a model tier compared to flat sampling, though upgrading to a stronger model tier remains more effective than scaling test-time compute on smaller models.

🧠 Gemini

AINeutralarXiv – CS AI · May 116/10

🧠

Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents

Researchers demonstrate that adaptive compute gates for LLM agents produce unstable and reversible signals across different environments and models, where the same confidence metric predicts both beneficial and harmful outcomes. They propose DIAL, a learned gating mechanism trained through counterfactual exploration, which outperforms fixed-direction baselines by accounting for task-specific utility directions.

AINeutralarXiv – CS AI · May 116/10

🧠

Test-Time Compute Games

Researchers identify a market inefficiency in LLM-as-a-service pricing where providers are financially incentivized to increase test-time compute usage beyond what meaningfully improves output quality, inflating costs for users. They propose a reverse second-price auction mechanism where providers compete on both price and quality, with users paying only for marginal value created relative to alternatives.

🧠 Llama

AIBullisharXiv – CS AI · Mar 27/1016

🧠

ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference

Researchers propose ODAR-Expert, an adaptive routing framework for large language models that optimizes accuracy-efficiency trade-offs by dynamically routing queries between fast and slow processing agents. The system achieved 98.2% accuracy on MATH benchmarks while reducing computational costs by 82%, suggesting that optimal AI scaling requires adaptive resource allocation rather than simply increasing test-time compute.

AIBullishLil'Log (Lilian Weng) · May 16/10

🧠

Why We Think

This article introduces a review of recent developments in test-time compute and Chain-of-thought (CoT) techniques for AI models. The post examines how providing models with 'thinking time' during inference leads to significant performance improvements while raising new research questions.