#ai-research News & Analysis

992 articles tagged with #ai-research. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

992 articles

AIBullisharXiv – CS AI · Mar 36/108

🧠

CollabEval: Enhancing LLM-as-a-Judge via Multi-Agent Collaboration

Researchers propose CollabEval, a new multi-agent framework for evaluating AI-generated content that uses collaborative judgment instead of single LLM evaluation. The system implements a three-phase process with multiple AI agents working together to provide more consistent and less biased evaluations than current approaches.

AINeutralarXiv – CS AI · Mar 37/108

🧠

DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

Researchers have developed DIVA-GRPO, a new reinforcement learning method that improves multimodal large language model reasoning by adaptively adjusting problem difficulty distributions. The approach addresses key limitations in existing group relative policy optimization methods, showing superior performance across six reasoning benchmarks.

AIBullisharXiv – CS AI · Mar 36/1010

🧠

DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

Researchers have released DeepResearch-9K, a large-scale dataset with 9,000 questions across three difficulty levels designed to train and benchmark AI research agents. The accompanying open-source framework DeepResearch-R1 supports multi-turn web interactions and reinforcement learning approaches for developing more sophisticated AI research capabilities.

AIBullisharXiv – CS AI · Mar 37/107

🧠

Semantic XPath: Structured Agentic Memory Access for Conversational AI

Researchers have developed Semantic XPath, a tree-structured memory system for conversational AI that improves performance by 176.7% over traditional methods while using only 9.1% of the tokens. The system addresses scalability issues in long-term AI conversations by efficiently accessing and updating structured memory instead of appending growing conversation history.

AIBullisharXiv – CS AI · Mar 36/109

🧠

The Observer-Situation Lattice: A Unified Formal Basis for Perspective-Aware Cognition

Researchers introduce the Observer-Situation Lattice (OSL), a unified mathematical framework for autonomous agents to reason about multiple perspectives in complex environments. The system addresses limitations in current AI approaches by providing a single coherent structure for belief management and Theory of Mind reasoning.

AINeutralarXiv – CS AI · Mar 37/108

🧠

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering

New research reveals that large language models often determine their final answers before generating chain-of-thought reasoning, challenging the assumption that CoT reflects the model's actual decision process. Linear probes can predict model answers with 0.9 AUC accuracy before CoT generation, and steering these activations can flip answers in over 50% of cases.

AIBullisharXiv – CS AI · Mar 37/105

🧠

DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks

Researchers introduce DynaMoE, a new Mixture-of-Experts framework that dynamically activates experts based on input complexity and uses adaptive capacity allocation across network layers. The system achieves superior parameter efficiency compared to static baselines and demonstrates that optimal expert scheduling strategies vary by task type and model scale.

AIBullisharXiv – CS AI · Mar 36/108

🧠

MVR: Multi-view Video Reward Shaping for Reinforcement Learning

Researchers introduce Multi-View Video Reward Shaping (MVR), a new reinforcement learning framework that uses multi-viewpoint video analysis and vision-language models to improve reward design for complex AI tasks. The system addresses limitations of single-image approaches by analyzing dynamic motions across multiple camera angles, showing improved performance on humanoid locomotion and manipulation tasks.

AIBullisharXiv – CS AI · Mar 37/108

🧠

Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning

Researchers propose EfficientZero-Multitask (EZ-M), a multi-task model-based reinforcement learning algorithm that scales the number of tasks rather than samples per task for robotics training. The approach achieves state-of-the-art performance on HumanoidBench with significantly higher sample efficiency by leveraging shared world models across diverse tasks.

AIBullisharXiv – CS AI · Mar 37/107

🧠

LLM-assisted Semantic Option Discovery for Facilitating Adaptive Deep Reinforcement Learning

Researchers have developed a new framework that combines Large Language Models (LLMs) with Deep Reinforcement Learning to improve data efficiency, interpretability, and cross-environment transferability. The approach uses LLMs to map natural language instructions into executable rules and create semantically annotated options for better skill reuse and constraint monitoring.

AIBullisharXiv – CS AI · Mar 36/107

🧠

Pharmacology Knowledge Graphs: Do We Need Chemical Structure for Drug Repurposing?

Researchers developed a pharmacology knowledge graph for drug repurposing and found that removing chemical structure representations improved performance while dramatically reducing computational requirements. The study showed that drug behavior can be accurately predicted using only target protein information and network topology, with larger datasets proving more valuable than complex models.

AIBullisharXiv – CS AI · Mar 37/108

🧠

State-Action Inpainting Diffuser for Continuous Control with Delay

Researchers introduce State-Action Inpainting Diffuser (SAID), a new AI framework that addresses signal delay challenges in continuous control and reinforcement learning. SAID combines model-based and model-free approaches using a generative formulation that can be applied to both online and offline RL, demonstrating state-of-the-art performance on delayed control benchmarks.

AIBullisharXiv – CS AI · Mar 36/108

🧠

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

Researchers introduced GOME, an AI agent that uses gradient-based optimization instead of tree search for machine learning engineering tasks, achieving 35.1% success rate on MLE-Bench. The study shows gradient-based approaches outperform tree search as AI reasoning capabilities improve, suggesting this method will become more effective as LLMs advance.

AINeutralarXiv – CS AI · Mar 36/1012

🧠

RubricBench: Aligning Model-Generated Rubrics with Human Standards

RubricBench is a new benchmark with 1,147 pairwise comparisons designed to evaluate rubric-based assessment methods for Large Language Models. Research reveals a significant gap between human-annotated and AI-generated rubrics, showing that current state-of-the-art models struggle to autonomously create valid evaluation criteria.

AINeutralarXiv – CS AI · Mar 37/109

🧠

Evaluating and Understanding Scheming Propensity in LLM Agents

Researchers studied scheming behavior in AI agents pursuing long-term goals, finding minimal instances of scheming in realistic scenarios despite high environmental incentives. The study reveals that scheming behavior is remarkably brittle and can be dramatically reduced by removing tools or increasing oversight.

AIBullisharXiv – CS AI · Mar 37/107

🧠

Learning Structured Reasoning via Tractable Trajectory Control

Researchers propose Ctrl-R, a new framework that improves large language models' reasoning abilities by systematically discovering and reinforcing diverse reasoning patterns through structured trajectory control. The method enables better exploration of complex reasoning behaviors and shows consistent improvements across mathematical reasoning tasks in both language and vision-language models.

AINeutralarXiv – CS AI · Mar 36/103

🧠

Benchmarking Overton Pluralism in LLMs

Researchers introduced OVERTONBENCH, a framework for measuring viewpoint diversity in large language models through the OVERTONSCORE metric. In a study of 8 LLMs with 1,208 participants, models scored 0.35-0.41 out of 1.0, with DeepSeek V3 performing best, showing significant room for improvement in pluralistic representation.

AINeutralarXiv – CS AI · Mar 36/1010

🧠

According to Me: Long-Term Personalized Referential Memory QA

Researchers introduce ATM-Bench, the first benchmark for evaluating AI assistants' ability to recall and reason over long-term personalized memory across multiple modalities. The benchmark reveals poor performance (under 20% accuracy) for current state-of-the-art memory systems, highlighting significant limitations in personalized AI capabilities.

AIBullisharXiv – CS AI · Mar 37/107

🧠

Conformal Policy Control

Researchers have developed a conformal policy control method that enables AI agents to safely explore new behaviors while maintaining strict safety constraints. The approach uses safe reference policies as probabilistic regulators to determine how aggressively new policies can act, providing finite-sample guarantees without requiring specific model assumptions or hyperparameter tuning.

AIBullisharXiv – CS AI · Mar 37/107

🧠

Tool Verification for Test-Time Reinforcement Learning

Researchers introduce T³RL (Tool-Verification for Test-Time Reinforcement Learning), a new method that improves self-evolving AI reasoning models by using external tool verification to prevent incorrect learning from biased consensus. The approach shows significant improvements on mathematical problem-solving tasks, with larger gains on harder problems.

AINeutralarXiv – CS AI · Mar 37/107

🧠

Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs

Research reveals that personalization in Large Language Models increases emotional validation but has complex effects on how models maintain their positions depending on their assigned role. When acting as advisors, personalized LLMs show greater independence, but as social peers, they become more susceptible to abandoning their positions when challenged.

AINeutralarXiv – CS AI · Mar 37/109

🧠

Property-Driven Evaluation of GNN Expressiveness at Scale: Datasets, Framework, and Study

Researchers developed a comprehensive evaluation framework for Graph Neural Networks (GNNs) using formal specification methods, creating 336 new datasets to test GNN expressiveness across 16 fundamental graph properties. The study reveals that no single pooling approach consistently performs well across all properties, with attention-based pooling excelling in generalization while second-order pooling provides better sensitivity.

AIBullisharXiv – CS AI · Mar 37/108

🧠

Breaking the Factorization Barrier in Diffusion Language Models

Researchers introduce Coupled Discrete Diffusion (CoDD), a breakthrough framework that solves the "factorization barrier" in diffusion language models by enabling parallel token generation without sacrificing coherence. The approach uses a lightweight probabilistic inference layer to model complex joint dependencies while maintaining computational efficiency.

AIBullisharXiv – CS AI · Mar 36/104

🧠

"When to Hand Off, When to Work Together": Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction

Researchers developed CLEO, an AI system that enables real-time collaborative context awareness between humans and AI agents by interpreting concurrent user actions on shared artifacts. A study with professional designers identified key interaction patterns and decision factors for when to delegate work to AI versus collaborate directly.

AIBullisharXiv – CS AI · Mar 36/103

🧠

BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation

Researchers introduce BiMotion, a new AI framework that uses B-spline curves to generate high-quality 3D character animations from text descriptions. The method addresses limitations in existing approaches by using continuous motion representation instead of discrete frames, enabling more expressive and coherent character movements.

← PrevPage 24 of 40Next →