#benchmark-improvement News & Analysis

25 articles tagged with #benchmark-improvement. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

Researchers introduce 3SPO (State-Score-Supervised Policy Optimization), a reinforcement learning algorithm that optimizes LLM agent policies at each step rather than after complete episodes, addressing credit assignment challenges in sparse-reward environments. Experiments demonstrate 22.6% improvement over existing methods on ALFWorld benchmarks with 2.4x more state exploration and 1.8x faster convergence.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory

Researchers introduce MemToolAgent, a framework that enhances LLM agents' ability to use tools effectively by implementing memory management systems that store and retrieve past experiences. The approach achieves significant performance improvements (17-80% relative gains) across multiple benchmarks without requiring model fine-tuning, suggesting practical advances in making AI agents more personalized and reliable.
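
Summaries in this listing report gains both as relative percentages and as percentage points, and the two are not interchangeable. A minimal sketch of the arithmetic follows; the scores are hypothetical and are not taken from any article listed here.

```python
def percentage_point_gain(baseline: float, improved: float) -> float:
    """Absolute difference between two scores that are both given in percent."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores, chosen only to illustrate the difference.
baseline_score = 40.0  # baseline success rate, in percent
improved_score = 50.0  # success rate with a new method, in percent

pp = percentage_point_gain(baseline_score, improved_score)
rel = relative_gain(baseline_score, improved_score)
print(f"{pp:.1f} percentage points, {rel:.1f}% relative")
# prints "10.0 percentage points, 25.0% relative"
```

The same change looks larger as a relative gain whenever the baseline is below 100%, so a reported figure is only comparable across articles once its type is known.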

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

INFUSER is a novel self-evolution framework that enables language models to improve their reasoning capabilities through an iterative co-training process between a Generator and Solver, using an influence-aware scoring mechanism rather than difficulty heuristics. The method achieves 20% relative improvement on mathematical and coding benchmarks, demonstrating that adaptive curriculum learning can outperform larger frozen models.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

SUPERNOVA introduces a framework for extending reinforcement learning with verifiable rewards (RLVR) beyond STEM fields by systematically curating data from natural instruction datasets. Smaller models trained on the resulting 25K-instance dataset achieve 64.4 percentage point gains on complex reasoning benchmarks, with improvements generalizing across model scales and families.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

SkillsInjector: Dynamic Skill Context Construction for LLM Agents

SkillsInjector introduces a dynamic method for optimizing how large language model agents access and utilize skill libraries. Rather than treating skill selection as static, the approach adaptively determines which skills to include, how many to present, and how to describe them based on task requirements, achieving measurable performance improvements across multiple benchmarks.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Identifiable Token Correspondence for World Models

Researchers introduce Identifiable Token Correspondence (ITC), a decoding technique that improves token-based transformer world models for visual reinforcement learning by treating next-frame prediction as a structured assignment problem. The method addresses temporal inconsistency issues like object duplication and disappearance, achieving state-of-the-art results on multiple benchmarks including a significant performance jump on Craftax-classic.
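
To make the phrase "structured assignment problem" concrete, here is a generic sketch of one-to-one assignment. It is not the ITC decoding procedure itself; the cost matrix, the function name `best_assignment`, and the brute-force search are all assumptions made for illustration. Matching each row independently can reuse one target and leave another unused, which is loosely analogous to the duplication and disappearance issues mentioned above, while a one-to-one assignment rules that out.

```python
from itertools import permutations


def best_assignment(cost):
    """Return the one-to-one row-to-column assignment with minimum total cost.

    Brute force over all permutations, so only suitable for tiny examples.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Hypothetical matching costs: rows are current tokens, columns are
# candidate next-frame tokens (lower cost means a better match).
cost = [
    [1, 9, 8],
    [2, 7, 5],
    [9, 2, 6],
]

# Independent per-row choice: rows 0 and 1 both pick column 0,
# and column 2 is never used.
greedy = [min(range(len(row)), key=row.__getitem__) for row in cost]

# One-to-one assignment: every column is used exactly once.
assignment, total = best_assignment(cost)

print(greedy)             # prints [0, 0, 1]
print(assignment, total)  # prints [0, 2, 1] 8
```

For larger problems a polynomial-time solver such as the Hungarian algorithm would replace the brute-force loop.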

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

Researchers introduce MAGIC-Video, a training-free framework that enables multimodal AI systems to process and reason about ultra-long videos spanning days or weeks by combining a structured memory graph with narrative chains. The system outperforms existing baselines on multiple benchmarks, addressing a critical limitation where current LLMs can only handle tens of minutes of video despite having million-token context windows.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

Researchers introduce TACT, a technique using activation steering to detect and correct 'agent drift' in language model coding agents, where models either repeatedly reason over known information or issue tool calls without proper reasoning. The method improves task resolution rates by 4.8-5.8 percentage points across multiple benchmarks while reducing steps needed to complete tasks by up to 26%.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory

Researchers propose E-mem, a new framework for LLM agent memory that reconstructs episodic context instead of compressing it, enabling more rigorous reasoning over extended tasks. The approach uses multiple assistant agents managing uncompressed memory while a master agent coordinates planning, achieving 54% F1 on benchmarks with 70% lower token costs than existing methods.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools

Researchers introduce Q+, a structured reasoning toolkit that enhances AI research agents by making web search more deliberate and organized. Integrated into Eigent's browser agent, Q+ demonstrates consistent benchmark improvements of 0.6 to 3.8 percentage points across multiple deep-research tasks, suggesting meaningful progress in autonomous AI agent reliability.

🏢 Anthropic · 🧠 GPT-4 · 🧠 GPT-5

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Towards Autonomous Memory Agents

Researchers introduce U-Mem, an autonomous memory agent system that actively acquires and validates knowledge for large language models. The system uses cost-aware knowledge extraction and semantic Thompson sampling to improve performance, showing significant gains on benchmarks like HotpotQA and AIME25.
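
For readers unfamiliar with Thompson sampling, the following is a minimal sketch of the standard Beta-Bernoulli version on a toy bandit. It is not U-Mem's semantic variant, whose details are not given in the summary; the success rates, function names, and step count are hypothetical.

```python
import random


def thompson_select(successes, failures, rng):
    """Sample each arm's Beta posterior and pick the arm with the highest draw."""
    samples = [
        rng.betavariate(s + 1, f + 1)  # Beta(1, 1), i.e. uniform, prior
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_rates, steps, seed=0):
    """Simulate a Bernoulli bandit and return per-arm success and failure counts."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(steps):
        arm = thompson_select(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


# Hypothetical success rates for three options (for example, three
# candidate knowledge sources); the third is the best.
successes, failures = run_bandit([0.2, 0.5, 0.8], steps=2000)
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)  # most pulls should concentrate on the last arm
```

Sampling from the posterior, rather than always choosing the arm with the best observed rate, keeps some exploration of uncertain options while shifting most trials toward the option that appears best.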

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Researchers propose AgentDropoutV2, a test-time framework that optimizes multi-agent systems by dynamically correcting or removing erroneous outputs without requiring retraining. The system acts as an active firewall with retrieval-augmented rectification, achieving 6.3 percentage point accuracy gains on math benchmarks while preventing error propagation between AI agents.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning

Researchers present ToolGraph, a framework that improves multi-turn tool-using AI agents through self-evolution via preference learning. By combining schema-derived topology with divergence-point preference optimization, the system achieves 16.8% improvement over baseline performance on benchmark tasks, with gains concentrated in airline and retail domains.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Mind the Perspective: Let's Reason Recursively for Theory of Mind

Researchers introduce RecToM, a framework that improves Large Language Models' Theory of Mind reasoning by modeling nested beliefs through recursive perspective construction. The approach achieves state-of-the-art results on multiple benchmarks, including 100% accuracy on Hi-ToM, demonstrating significant advances in how AI systems infer agent beliefs and intentions.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

PAI: Preserving Amplitude Information in Representation-Based Time-Series Anomaly Detection

Researchers propose PAI, a novel anomaly scoring scheme that addresses a critical limitation in representation-based time-series anomaly detection by explicitly preserving amplitude information in learned embeddings. The method achieves significant performance improvements, with average gains of 98.4% on TSB-AD-U-Eva and 36.8% on TAB UV datasets, suggesting that amplitude retention is crucial for robust anomaly detection.

AI · Neutral · arXiv – CS AI · Jun 9 · 5/10

Proposal Refinement for Few-Shot Object Detection

Researchers propose a proposal refinement approach for few-shot object detection that addresses the unbalanced distribution of region proposals between novel and base classes. The method introduces a refinement loss during base training and a refinement branch for RPN during fine-tuning, achieving 1-6% performance improvements on benchmarks without additional inference costs.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

Researchers introduce a critic-guided multi-agent framework that improves LLM reasoning reliability for mathematical problem-solving by combining heterogeneous AI agents with adaptive feedback loops. The approach achieves 13% accuracy improvements on benchmarks while demonstrating that smaller models can match larger ones when equipped with critique mechanisms.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

Researchers propose MRAgent, a framework that reimagines how large language model agents access memory by using a dynamic graph-based reconstruction approach instead of static retrieval methods. The system demonstrates up to 23% performance improvements on benchmarks while reducing computational costs, addressing a fundamental limitation in LLM agents' ability to reason over extended interaction histories.

AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

Researchers demonstrate that robust identity tracking in thermal video pedestrian detection can be achieved through lightweight post-processing with scene-level spatial-temporal consistency rather than complex re-identification models. By adding modular identity-repair components to YOLOv8 and SORT baselines, they improved IDF1 scores from 82.25 to 84.93 on thermal MOT benchmarks, suggesting that conservative trajectory relinking outperforms increasing tracker complexity.
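
The IDF1 figures above follow the standard identity F1 definition, IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN), where IDTP, IDFP, and IDFN are identity-level true positives, false positives, and false negatives. A short sketch of that formula is below; the counts are hypothetical and are not taken from the paper.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: harmonic mean of identity precision and identity recall."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        return 0.0  # no detections and no ground truth to match
    return 2 * idtp / denom


# Hypothetical counts for illustration only.
score = idf1(idtp=800, idfp=150, idfn=250)
print(f"{score * 100:.2f}")  # prints "80.00"
```

Because identity switches convert correctly identified detections into identity false positives and false negatives, relinking broken trajectories raises IDF1 even when the underlying detections do not change.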

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

Frontier large language models from Anthropic and OpenAI have demonstrated performance competitive with human experts at annotating natural phenotypes to ontology terms, a previously labor-intensive bottleneck in biological research. When evaluated against the same Gold Standard benchmark used in a 2018 study, these AI agents performed within the range of trained human curators and substantially outperformed prior NLP tools, suggesting significant potential to scale phenotype annotation workflows.

🏢 OpenAI · 🏢 Anthropic

AI · Bullish · arXiv – CS AI · May 28 · 6/10

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad introduces a gradient-descent-inspired framework for automatically optimizing LLM agent skills, treating skill packages as parameters to be refined through task execution feedback and systematic diagnosis. The method outperforms existing training-based approaches by 6.7 percentage points on benchmark tasks, demonstrating measurable improvements in agent reliability and capability.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

Researchers developed a novel framework for synthesizing training data that enables reasoning models to generate high-quality mathematical and reasoning problems by explicitly planning problem directions and adapting difficulty to solver capabilities. The approach achieved a 3.4% cumulative improvement across 10 benchmarks, demonstrating scalable alternatives to manual dataset curation.

AI · Bullish · arXiv – CS AI · May 9 · 6/10

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

PRISM is a new AI framework that improves embodied agents by coupling Vision-Language Models with Large Language Models through dynamic question-answer interactions, addressing the perception-reasoning gap in multimodal AI systems. The framework demonstrates significant performance improvements on benchmark tasks like ALFWorld and R2R, showing that interactive, goal-oriented perception yields superior understanding compared to standalone visual analysis.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Researchers introduce Commander-GPT, a modular framework that orchestrates multiple specialized AI agents for multimodal sarcasm detection rather than relying on a single LLM. The system achieves 4.4-11.7% F1 score improvements over existing baselines on standard benchmarks, demonstrating that task decomposition and intelligent routing can overcome LLM limitations in understanding sarcasm.

🧠 GPT-4 · 🧠 Gemini

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks

Researchers introduce CRAFT-GUI, a curriculum learning framework that uses reinforcement learning to improve AI agents' performance in graphical user interface tasks. The method addresses difficulty variation across GUI tasks and provides more nuanced feedback, achieving 5.6% improvement on Android Control benchmarks and 10.3% on internal benchmarks.