#test-time-scaling News & Analysis

34 articles tagged with #test-time-scaling. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Researchers introduce SPARC, a modular framework that decouples visual perception from reasoning in vision-language models to improve test-time scaling efficiency. By separating tasks into explicit visual search and conditional reasoning stages, SPARC achieves significant performance gains on visual reasoning benchmarks while reducing computational token requirements by up to 200×.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Active Inference as the Test-Time Scaling Law for Physical AI Agents

Researchers introduce a novel test-time scaling law for physical AI agents based on active inference principles, enabling agents to generalize to unforeseen scenarios by dynamically updating policies through reasoning about prediction errors. The approach outperforms existing reinforcement learning methods by 36% in inference efficiency on autonomous driving tasks and scales with real-world experience rather than just training data or model size.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

A History-Aware Visually Grounded Critic for Computer Use Agents

Researchers introduce HiViG, a test-time framework that enhances Computer Use Agents through history-aware and visually grounded critic models. The system improves GUI task performance by 5.8-9.0% across web, mobile, and desktop platforms by maintaining action history and verifying execution coordinates against visual interfaces.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Decentralized Multi-Agent Systems with Shared Context

Researchers propose Decentralized Language Models (DeLM), a new multi-agent system framework that eliminates centralized coordination bottlenecks by enabling parallel agents to share a verified context and asynchronously claim tasks. The approach achieves significant performance improvements on software engineering and long-context reasoning benchmarks while reducing computational costs by approximately 50%.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

SDR: Set-Distance Rewards for Radiology Report Generation

Researchers introduce Set-Distance Rewards (SDR), a novel reinforcement learning approach for chest X-ray report generation that treats medical reports as unordered sets rather than causal chains. The method achieves 4-8% improvements over supervised fine-tuning across multiple vision-language models and enables efficient test-time scaling by pruning low-quality candidates mid-generation.

🧠 GPT-4 · 🧠 Gemini

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

Researchers introduce stochastic backtracking, a novel test-time scaling method for language models that revisits previously generated solution paths rather than committing irreversibly to frontier candidates. The approach uses subpool selection and power backtrack sequential Monte Carlo to improve reasoning accuracy while reducing token generation, outperforming existing PRM-guided methods across mathematical benchmarks.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models

Researchers present LiDAR, a test-time scaling method for diffusion models that improves sample quality alignment with human intent using efficient reward guidance. The approach achieves comparable performance to existing gradient guidance methods while delivering 9.5x faster sampling speeds by computing expected future rewards from marginal samples without neural backpropagation.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Researchers present AVIC, an adaptive framework that optimizes when and how much multimodal language models should use world models for visual imagination during spatial reasoning tasks. The system learns to selectively invoke visual imagination only when necessary, reducing computational costs while matching or exceeding performance of fixed imagination strategies and proprietary baselines like GPT-4o.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · May 28 · 7/10

MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation

Researchers introduce MCTS-Judge, a test-time scaling framework that enhances LLM-based code evaluation by applying Monte Carlo Tree Search to improve reasoning accuracy. The system achieves 80% accuracy on code correctness tasks—surpassing OpenAI's o1 models while using 3x fewer tokens—addressing a critical limitation in using LLMs as reliable judges for complex technical problems.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

Researchers introduce EAGer, a training-free method that optimizes inference-time computation for reasoning language models by dynamically allocating compute budgets based on token-level entropy. The approach reduces computational waste while improving performance, achieving up to 37% gains in Pass@k metrics with 59% fewer tokens in supervised settings.
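
To make the idea of entropy-based compute allocation concrete, here is a minimal Python sketch. It is not EAGer's implementation: the function names, the threshold value, and the example distributions are all illustrative assumptions. It only shows how token-level entropy can separate confident decoding steps from uncertain ones, so that extra samples are spent where the model is unsure.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_branch(probs, threshold=1.0):
    """Hypothetical rule: explore alternatives only at high-entropy tokens.

    The threshold of 1.0 nats is an arbitrary value chosen for illustration.
    """
    return token_entropy(probs) > threshold

# Made-up distributions over a four-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain
uncertain = [0.25, 0.25, 0.25, 0.25]   # model is maximally unsure

print(should_branch(confident))  # low entropy: keep a single path
print(should_branch(uncertain))  # high entropy: worth sampling alternatives
```

A uniform distribution over four tokens has entropy ln 4, which is the maximum for that vocabulary size, so it crosses the illustrative threshold while the peaked distribution does not.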

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

Researchers propose STARS, a training framework that stabilizes Looped Language Models (LoopLMs) to enable reliable test-time scaling through latent reasoning. The method uses Jacobian Spectral Radius Regularization to constrain neural states toward stable fixed points, addressing a critical problem where model performance peaks then collapses with increased recurrence depth.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Researchers introduce AgentV-RL, an agentic verifier framework that enhances reward modeling for large language models by combining bidirectional reasoning agents with tool-use capabilities. The system addresses critical limitations in LLM verification by enabling forward and backward tracing of solutions, achieving 25.2% performance gains over existing methods and positioning agentic reward modeling as a promising new paradigm.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Researchers introduce RL^V, a reinforcement learning method that unifies LLM reasoners with generative verifiers to improve test-time compute scaling. The approach achieves over 20% accuracy gains on MATH benchmarks and enables 8-32x more efficient test-time scaling compared to existing RL methods by preserving and leveraging learned value functions.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

Researchers discovered that in Large Reasoning Models like DeepSeek-R1, the first solution is often the best, with alternative solutions being detrimental due to error accumulation. They propose RED, a new framework that achieves up to 19% performance gains while reducing token consumption by 37.7-70.4%.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Researchers demonstrate that large language models can perform reinforcement learning during inference through a new 'in-context RL' prompting framework. The method shows LLMs can optimize scalar reward signals to improve response quality across multiple rounds, achieving significant improvements on complex tasks like mathematical competitions and creative writing.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling

Researchers provide a mathematical proof that implicit models can achieve greater expressive power through increased test-time computation, explaining how these memory-efficient architectures can match larger explicit networks. The study validates this scaling property across image reconstruction, scientific computing, operations research, and LLM reasoning domains.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Researchers introduce AVIS, a lightweight adaptive policy that optimizes inference efficiency in Vision-Language Models by jointly scaling visual context and reasoning computation. The method uses token pruning and difficulty prediction to reduce computational costs while maintaining or improving accuracy across image and video reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Researchers introduce IMUG-Bench, a comprehensive benchmark designed to evaluate unified multimodal models (UMMs) on their ability to handle multi-turn interleaved image-text dialogues. The benchmark reveals that current models struggle with exposure bias in generation tasks and that test-time scaling strategies like Chain-of-Thought can improve performance.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning

Researchers introduce CoT-Space, a theoretical framework that explains how Large Language Models improve reasoning through multi-step Chain-of-Thought processes via reinforcement learning. The framework models reasoning as an optimization problem in continuous semantic space, demonstrating that optimal reasoning length emerges naturally from the underfitting-overfitting trade-off, providing a principled foundation for understanding test-time scaling in modern LLMs.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs

Researchers propose Budget-Guided MCTS, a tree-search algorithm that optimizes large language model inference by dynamically adjusting exploration and refinement strategies based on remaining token budgets. The method addresses a practical deployment challenge where fixed computational budgets vary across use cases, outperforming budget-agnostic approaches on mathematical and physics reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

Researchers propose a consequence-aware compute allocation system for reasoning models that prioritizes high-impact tasks based on real-world failure costs rather than just predicted difficulty. Testing on software engineering benchmarks shows the method reduces cost-weighted loss by 22-33% compared to difficulty-based routing, with a practical predictor-driven variant retaining over 90% of theoretical gains.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

Researchers propose using statistical features from failed reasoning traces in language models to diagnose which failures can be fixed through intervention versus those requiring resampling. Their method achieves 84.3% accuracy in categorizing failure types and enables training-free routing that improves rescue rates by 12.2% on difficult problems, converting previously discarded data into actionable diagnostic signals.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Researchers demonstrate that multi-agent debate (MAD) for large language models significantly improves when agents have diverse initial viewpoints and explicitly communicate calibrated confidence levels. The study shows that vanilla MAD often underperforms simple majority voting despite higher computational costs, but two lightweight interventions—diversity-aware initialization and confidence-modulated debate protocols—consistently outperform both baseline approaches across multiple reasoning benchmarks.
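
The difference between the majority-voting baseline and a confidence-aware aggregation can be illustrated with a short Python sketch. This is not the paper's debate protocol: it is a single-round weighted vote, and the function names, answers, and confidence values are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Baseline: return the answer given by the most agents."""
    return Counter(answers).most_common(1)[0][0]

def confidence_weighted_vote(answers, confidences):
    """Illustrative variant: each agent's vote counts in proportion to its stated confidence."""
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)

# Hypothetical final answers and self-reported confidences from three agents.
answers = ["12", "12", "15"]
confidences = [0.3, 0.3, 0.9]

print(majority_vote(answers))                          # prints 12
print(confidence_weighted_vote(answers, confidences))  # prints 15
```

The weighted result is only as good as the confidences fed into it, which is why the summary stresses that the confidence levels must be calibrated.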

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency

Researchers introduce PETS, a framework for optimizing how many reasoning trajectories to sample from AI models during inference to maintain accuracy while reducing computational costs. By modeling trajectory allocation as a crowdsourcing problem, the approach achieves up to 75% budget savings on benchmarks while maintaining perfect consistency, addressing a key efficiency challenge in test-time scaling.
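
As a rough illustration of how trajectory sampling can stop early without changing the final vote, the Python sketch below uses a simple rule: stop once the leading answer can no longer be overtaken within the remaining budget. This is not PETS's crowdsourcing-based allocation; the function name, the stopping rule, and the sample stream are assumptions for illustration only.

```python
from collections import Counter

def adaptive_self_consistency(sample_answer, max_samples):
    """Sample answers one at a time and stop when the majority is already decided.

    sample_answer: hypothetical zero-argument callable returning one final answer.
    Returns (winning_answer, number_of_samples_used).
    """
    counts = Counter()
    for used in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        remaining = max_samples - used
        # If every remaining sample went to the runner-up and it still could
        # not catch the leader, further sampling cannot change the result.
        if lead - runner_up > remaining:
            return ranked[0][0], used
    return counts.most_common(1)[0][0], max_samples

# Made-up stream of final answers standing in for sampled reasoning trajectories.
stream = iter(["7", "7", "7", "7", "3", "3"])
result = adaptive_self_consistency(lambda: next(stream), max_samples=6)
print(result)  # prints ('7', 4)
```

In this example the vote is settled after four of six samples, so the last two trajectories are never generated.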

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

UniScale introduces a unified framework that combines model routing and test-time scaling to optimize large language model inference, balancing quality and computational cost. The system uses online learning via contextual multi-armed bandits to adapt inference policies dynamically, achieving fine-grained performance improvements over existing decoupled approaches.
