AI · Bullish · arXiv – CS AI · Jun 10 · 7/10
🧠Researchers introduce 3SPO (State-Score-Supervised Policy Optimization), a reinforcement learning algorithm that optimizes LLM agent policies at each step rather than after complete episodes, addressing credit assignment challenges in sparse-reward environments. Experiments demonstrate a 22.6% improvement over existing methods on ALFWorld benchmarks, with 2.4x more state exploration and 1.8x faster convergence.
AI · Bullish · arXiv – CS AI · Jun 9 · 7/10
🧠Researchers introduce MemToolAgent, a framework that enhances LLM agents' ability to use tools effectively by implementing memory management systems that store and retrieve past experiences. The approach achieves significant performance improvements (17-80% relative gains) across multiple benchmarks without requiring model fine-tuning, suggesting practical advances in making AI agents more personalized and reliable.
AI · Bullish · arXiv – CS AI · Jun 9 · 7/10
🧠INFUSER is a novel self-evolution framework that enables language models to improve their reasoning capabilities through an iterative co-training process between a Generator and Solver, using an influence-aware scoring mechanism rather than difficulty heuristics. The method achieves 20% relative improvement on mathematical and coding benchmarks, demonstrating that adaptive curriculum learning can outperform larger frozen models.
AI · Bullish · arXiv – CS AI · Jun 5 · 7/10
🧠SUPERNOVA introduces a framework for extending reinforcement learning with verifiable rewards (RLVR) beyond STEM fields by systematically curating data from natural instruction datasets. Smaller models trained on a 25K-instance dataset achieve gains of 64.4 percentage points on complex reasoning benchmarks, with improvements generalizing across model scales and families.
AI · Bullish · arXiv – CS AI · May 29 · 7/10
🧠SkillsInjector introduces a dynamic method for optimizing how large language model agents access and utilize skill libraries. Rather than treating skill selection as static, the approach adaptively determines which skills to include, how many to present, and how to describe them based on task requirements, achieving measurable performance improvements across multiple benchmarks.
AI · Bullish · arXiv – CS AI · May 27 · 7/10
🧠Researchers introduce Identifiable Token Correspondence (ITC), a decoding technique that improves token-based transformer world models for visual reinforcement learning by treating next-frame prediction as a structured assignment problem. The method addresses temporal inconsistency issues like object duplication and disappearance, achieving state-of-the-art results on multiple benchmarks including a significant performance jump on Craftax-classic.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce MAGIC-Video, a training-free framework that enables multimodal AI systems to process and reason about ultra-long videos spanning days or weeks by combining a structured memory graph with narrative chains. The system outperforms existing baselines on multiple benchmarks, addressing a critical limitation where current LLMs can only handle tens of minutes of video despite having million-token context windows.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce TACT, a technique using activation steering to detect and correct 'agent drift' in language model coding agents, where models either repeatedly reason over known information or issue tool calls without proper reasoning. The method improves task resolution rates by 4.8-5.8 percentage points across multiple benchmarks while reducing steps needed to complete tasks by up to 26%.
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠Researchers propose E-mem, a new framework for LLM agent memory that reconstructs episodic context instead of compressing it, enabling more rigorous reasoning over extended tasks. The approach uses multiple assistant agents managing uncompressed memory while a master agent coordinates planning, achieving 54% F1 on benchmarks with 70% lower token costs than existing methods.
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers introduce Q+, a structured reasoning toolkit that enhances AI research agents by making web search more deliberate and organized. Integrated into Eigent's browser agent, Q+ demonstrates consistent benchmark improvements of 0.6 to 3.8 percentage points across multiple deep-research tasks, suggesting meaningful progress in autonomous AI agent reliability.
🏢 Anthropic · 🧠 GPT-4 · 🧠 GPT-5
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 5
🧠Researchers introduce U-Mem, an autonomous memory agent system that actively acquires and validates knowledge for large language models. The system uses cost-aware knowledge extraction and semantic Thompson sampling to improve performance, showing significant gains on benchmarks like HotpotQA and AIME25.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 8
🧠Researchers propose AgentDropoutV2, a test-time framework that optimizes multi-agent systems by dynamically correcting or removing erroneous outputs without requiring retraining. The system acts as an active firewall with retrieval-augmented rectification, achieving 6.3 percentage point accuracy gains on math benchmarks while preventing error propagation between AI agents.
AI · Bullish · arXiv – CS AI · Jun 11 · 6/10
🧠Researchers introduce RecToM, a framework that improves Large Language Models' Theory of Mind reasoning by modeling nested beliefs through recursive perspective construction. The approach achieves state-of-the-art results on multiple benchmarks, including 100% accuracy on Hi-ToM, demonstrating significant advances in how AI systems infer agent beliefs and intentions.
🧠 GPT-5
AI · Bullish · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers propose PAI, a novel anomaly scoring scheme that addresses a critical limitation in representation-based time-series anomaly detection by explicitly preserving amplitude information in learned embeddings. The method achieves significant performance improvements, with average gains of 98.4% on TSB-AD-U-Eva and 36.8% on TAB UV datasets, suggesting that amplitude retention is crucial for robust anomaly detection.
AI · Neutral · arXiv – CS AI · Jun 9 · 5/10
🧠Researchers introduce a proposal refinement approach for few-shot object detection that addresses the unbalanced distribution of region proposals between novel and base classes. The method adds a refinement loss during base training and a refinement branch for the RPN during fine-tuning, achieving 1-6% performance improvements on benchmarks without additional inference costs.
AI · Bullish · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers introduce a critic-guided multi-agent framework that improves LLM reasoning reliability for mathematical problem-solving by combining heterogeneous AI agents with adaptive feedback loops. The approach achieves 13% accuracy improvements on benchmarks while demonstrating that smaller models can match larger ones when equipped with critique mechanisms.
AI · Bullish · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers propose MRAgent, a framework that reimagines how large language model agents access memory by using a dynamic graph-based reconstruction approach instead of static retrieval methods. The system demonstrates up to 23% performance improvements on benchmarks while reducing computational costs, addressing a fundamental limitation in LLM agents' ability to reason over extended interaction histories.
AI · Neutral · arXiv – CS AI · Jun 2 · 5/10
🧠Researchers demonstrate that robust identity tracking in thermal video pedestrian detection can be achieved through lightweight post-processing with scene-level spatial-temporal consistency rather than complex re-identification models. By adding modular identity-repair components to YOLOv8 and SORT baselines, they improved IDF1 scores from 82.25 to 84.93 on thermal MOT benchmarks, suggesting that conservative trajectory relinking outperforms increasing tracker complexity.
AI · Bullish · arXiv – CS AI · May 29 · 6/10
🧠Frontier large language models from Anthropic and OpenAI have demonstrated competitive performance with human experts at annotating natural phenotypes to ontology terms, a previously labor-intensive bottleneck in biological research. When evaluated against the same Gold Standard benchmark used in a 2018 study, these AI agents performed within the range of trained human curators and substantially outperformed prior NLP tools, suggesting significant potential to scale phenotype annotation workflows.
🏢 OpenAI · 🏢 Anthropic
AI · Bullish · arXiv – CS AI · May 28 · 6/10
🧠SkillGrad introduces a gradient-descent-inspired framework for automatically optimizing LLM agent skills, treating skill packages as parameters to be refined through task execution feedback and systematic diagnosis. The method outperforms existing training-based approaches by 6.7 percentage points on benchmark tasks, demonstrating measurable improvements in agent reliability and capability.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers developed a novel framework for synthesizing training data that enables reasoning models to generate high-quality mathematical and reasoning problems by explicitly planning problem directions and adapting difficulty to solver capabilities. The approach achieved a 3.4% cumulative improvement across 10 benchmarks, demonstrating scalable alternatives to manual dataset curation.
AI · Bullish · arXiv – CS AI · May 9 · 6/10
🧠PRISM is a new AI framework that improves embodied agents by coupling Vision-Language Models with Large Language Models through dynamic question-answer interactions, addressing the perception-reasoning gap in multimodal AI systems. The framework demonstrates significant performance improvements on benchmark tasks like ALFWorld and R2R, showing that interactive, goal-oriented perception yields superior understanding compared to standalone visual analysis.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce Commander-GPT, a modular framework that orchestrates multiple specialized AI agents for multimodal sarcasm detection rather than relying on a single LLM. The system achieves 4.4-11.7% F1 score improvements over existing baselines on standard benchmarks, demonstrating that task decomposition and intelligent routing can overcome LLM limitations in understanding sarcasm.
🧠 GPT-4 · 🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers introduce CRAFT-GUI, a curriculum learning framework that uses reinforcement learning to improve AI agents' performance in graphical user interface tasks. The method addresses difficulty variation across GUI tasks and provides more nuanced feedback, achieving a 5.6% improvement on Android Control benchmarks and a 10.3% improvement on internal benchmarks.