#multi-turn-reasoning News & Analysis

8 articles tagged with #multi-turn-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 19 · 7/10

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

Amazon researchers introduced StaminaBench, a benchmark that evaluates coding agents' ability to handle extended multi-turn interactions (up to 100 consecutive change requests), revealing that current LLMs fail within 5-6 turns and that test feedback can improve performance up to 12x.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs

Researchers introduce MedAction, a new framework and dataset designed to improve how large language models perform clinical diagnosis by simulating real-world multi-turn diagnostic processes. The approach addresses fundamental limitations in current medical LLMs through a tree-structured distillation pipeline that generates high-quality diagnostic trajectories, achieving state-of-the-art performance among open-source models.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision, surpassing GPT, Gemini, and Claude on specific benchmarks.

🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning

Researchers introduced MANTA, a 1,088-conversation benchmark evaluating how well large language models maintain animal welfare values under adversarial pressure across five-turn exchanges. The study finds that model performance rankings shift significantly under sustained questioning compared with single-turn queries, with some models, such as Gemini Flash Lite, dropping dramatically in value stability despite initial moral sensitivity.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Researchers introduced TimeSage-MT, a multi-turn benchmark with 240 tasks designed to evaluate how well LLM agents handle time series analysis across extended conversations. The benchmark reveals significant performance gaps in current AI systems, particularly in decision-making, memory retention, and uncertainty handling across real-world domains.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Researchers propose Canonical-Context On-Policy Distillation (CCOPD), a training method that improves large language models' ability to solve problems when information is revealed incrementally across multiple conversation turns rather than all at once. By using a frozen teacher model with complete context to guide a student model receiving fragmented information, CCOPD achieves a 32% relative performance improvement on multi-turn tasks while maintaining single-prompt performance.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

Researchers conducted a mechanistic analysis of how large language models allocate computational depth when operating as autonomous agents performing multi-turn planning and tool use. The study reveals that agents progressively recruit deeper layers as task complexity increases, contrasting with prior findings that LLMs underutilize depth in single-turn tasks, suggesting adaptive depth allocation emerges in sequential reasoning scenarios.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Researchers introduce MemSearcher, an AI agent framework that optimizes how large language models handle multi-turn interactions by maintaining compact memory instead of concatenating full conversation history. The approach uses a novel multi-context GRPO training method and demonstrates superior performance while maintaining stable token counts, reducing computational overhead.