AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠ThinkBooster is a unified framework that standardizes test-time compute scaling for large language models, providing a modular library, benchmarking suite, and production-ready API for improving LLM reasoning efficiency during inference. The framework enables developers to evaluate and deploy adaptive reasoning strategies with transparent performance-compute trade-offs across mathematical and coding tasks.
🏢 OpenAI
AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠Researchers propose Agentic Monte Carlo (AMC), a method for optimizing black-box LLM agents without API access by using Sequential Monte Carlo sampling to steer agents toward optimal behavior. The technique bridges reinforcement learning and Bayesian inference, and performs competitively against RL baselines while still treating the underlying model as a black box.
AI · Bearish · arXiv – CS AI · Jun 3 · 7/10
🧠Researchers demonstrate that Large Reasoning Models (LRMs) frequently 'overthink' problems after reaching correct answers, with continued reasoning degrading accuracy by up to 21%. The study introduces a protocol to measure reasoning sufficiency and shows that harmful overthinking, where additional reasoning destabilizes correct solutions, is a broader reliability risk affecting both multimodal and language-only models.
AI · Bullish · arXiv – CS AI · May 29 · 7/10
🧠Researchers propose Self-Trained Verification (STV), an approach that improves AI reasoning models by training verifiers to catch self-generated errors, using reference solutions as supervision. The method doubles accuracy on hard math problems and achieves a 14x improvement on scientific reasoning tasks. It also enables more effective self-training: verifier-in-the-loop training boosts performance by a further 33%.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Zyphra has unveiled ZAYA1-8B, a compact reasoning-focused AI model with only 700M active parameters that matches larger competitors like DeepSeek-R1 on mathematics and coding tasks. The model introduces Markovian RSA, a test-time compute method that achieves 91.9% on the AIME'25 benchmark while remaining computationally efficient, suggesting that small models can compete with much larger reasoning systems through architectural innovation.
🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers challenge the assumption that longer reasoning chains always improve LLM performance, discovering that extended test-time compute leads to diminishing returns and 'overthinking' where models abandon correct answers. The study demonstrates that optimal compute allocation varies by problem difficulty, enabling significant efficiency gains without sacrificing accuracy.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that inference-time scaffolding can double the performance of small 8B language models on complex tool-use tasks without additional training, by deploying the same frozen model in three specialized roles: summarization, reasoning, and code correction. On a single 24GB GPU, this approach enables an 8B model to match or exceed much larger systems like DeepSeek-Coder 33B, suggesting efficient deployment paths for capable AI agents on modest hardware.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 4
🧠Researchers propose a 'best-of-∞' approach for large language models: the limit of majority voting as the number of samples grows without bound, which delivers superior performance but would require infinite computation. To make it practical, they develop an adaptive generation scheme that dynamically chooses the number of samples based on answer agreement, and they extend the framework to weighted ensembles of multiple LLMs.
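The idea of choosing the sample count from answer agreement can be illustrated with a minimal sketch. This is not the paper's stopping rule, which the summary does not specify; it is a simple heuristic that stops once the leading answer holds a fixed share of the votes. The function `adaptive_majority_vote`, its parameters (`min_samples`, `max_samples`, `agreement`), and the `sample_answer` callable standing in for one LLM call are all hypothetical names chosen for illustration.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Tuple


def adaptive_majority_vote(
    sample_answer: Callable[[], Hashable],
    min_samples: int = 3,
    max_samples: int = 32,
    agreement: float = 0.8,
) -> Tuple[Hashable, int]:
    """Draw answers one at a time and stop early once they agree.

    Assumed heuristic (not taken from the paper): stop when the most
    common answer accounts for at least `agreement` of all samples
    drawn so far, after at least `min_samples` draws. Otherwise fall
    back to a plain majority vote over `max_samples` draws.
    Returns (chosen answer, number of samples used).
    """
    counts: Counter = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        top_answer, top_count = counts.most_common(1)[0]
        if n >= min_samples and top_count / n >= agreement:
            return top_answer, n
    # Budget exhausted: return the plurality answer.
    return counts.most_common(1)[0][0], max_samples


if __name__ == "__main__":
    # Stand-in for an LLM: answers "42" most of the time, else noise.
    rng = random.Random(0)

    def noisy_model() -> str:
        return "42" if rng.random() < 0.7 else str(rng.randint(0, 9))

    answer, used = adaptive_majority_vote(noisy_model)
    print(f"answer={answer} after {used} samples")
```

Under this heuristic, easy questions where samples agree immediately stop after `min_samples` draws, while contested questions consume more of the budget, which is the compute-saving behavior the summary describes.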
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers introduce MesaNet, an improved recurrent neural network architecture that optimizes sequence modeling through test-time training, achieving better language modeling performance than previous RNNs at the cost of additional inference-time compute. The work advances the trend toward linearized transformers that keep memory costs constant during inference, trading computational efficiency against performance gains.
🏢 Perplexity
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Sketch-and-Verify is an inference-time scaling technique that improves small language model performance by having the LLM generate multiple algorithmic strategies as program sketches, then filling and verifying them. On HumanEval+, this approach delivers superior cost-performance within a model tier compared to flat sampling, though upgrading to a stronger model tier remains more effective than scaling test-time compute on smaller models.
🧠 Gemini
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers demonstrate that adaptive compute gates for LLM agents produce unstable and reversible signals across different environments and models, where the same confidence metric predicts both beneficial and harmful outcomes. They propose DIAL, a learned gating mechanism trained through counterfactual exploration, which outperforms fixed-direction baselines by accounting for task-specific utility directions.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers identify a market inefficiency in LLM-as-a-service pricing where providers are financially incentivized to increase test-time compute usage beyond what meaningfully improves output quality, inflating costs for users. They propose a reverse second-price auction mechanism where providers compete on both price and quality, with users paying only for marginal value created relative to alternatives.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 16
🧠Researchers propose ODAR-Expert, an adaptive routing framework for large language models that optimizes accuracy-efficiency trade-offs by dynamically routing queries between fast and slow processing agents. The system achieved 98.2% accuracy on MATH benchmarks while reducing computational costs by 82%, suggesting that optimal AI scaling requires adaptive resource allocation rather than simply increasing test-time compute.
AI · Bullish · Lil'Log (Lilian Weng) · May 1 · 6/10
🧠This article reviews recent developments in test-time compute and chain-of-thought (CoT) techniques for AI models. The post examines how giving models 'thinking time' during inference leads to significant performance improvements while raising new research questions.