13,245 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠Researchers introduce DINCO (Distractor-Normalized Coherence), a method to improve confidence calibration in large language models by using self-generated alternative claims to reduce overconfidence bias. The approach addresses LLM suggestibility issues that cause models to express high confidence on low-accuracy outputs, potentially improving AI safety and trustworthiness.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 10
🧠Researchers have released DeepResearch-9K, a large-scale dataset with 9,000 questions across three difficulty levels designed to train and benchmark AI research agents. The accompanying open-source framework DeepResearch-R1 supports multi-turn web interactions and reinforcement learning approaches for developing more sophisticated AI research capabilities.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠AutoSkill is a new framework that enables AI language models to learn and reuse personalized skills from user interactions without retraining the underlying model. The system abstracts user preferences into reusable capabilities that can be shared across different agents and tasks, addressing the current limitation where LLMs fail to retain personalized learning between sessions.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠Researchers have developed EasySteer, a unified framework for controlling large language model behavior at inference time that achieves 10.8-22.3x speedup over existing frameworks. The system offers modular architecture with pre-computed steering vectors for eight application domains and transforms steering from a research technique into production-ready capability.
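The steering-vector idea can be sketched generically: a precomputed vector is added to a layer's activations at inference time to shift model behavior without retraining. This is a minimal illustration of activation steering in general, not EasySteer's implementation; the contrastive mean-difference construction of the vector below is an assumption.

```python
import numpy as np

def apply_steering(hidden, steer_vec, alpha=1.0):
    # Generic activation steering: add a precomputed steering vector
    # to the residual-stream activations at one layer at inference time.
    # hidden: (seq_len, d_model) activations; alpha scales the effect.
    return hidden + alpha * steer_vec

# One common (assumed) way to build a steering vector: the difference
# of mean activations between two contrastive prompt sets.
d = 8
pos_acts = np.random.randn(16, d) + 0.5   # e.g. activations on desired-style prompts
neg_acts = np.random.randn(16, d) - 0.5   # e.g. activations on undesired-style prompts
steer_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

hidden = np.zeros((4, d))
steered = apply_steering(hidden, steer_vec, alpha=2.0)
print(steered.shape)  # (4, 8)
```

Precomputing `steer_vec` once per behavior is what makes this cheap at inference, which is consistent with the reported speedups over frameworks that recompute control signals per request.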
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7
🧠Researchers have developed Semantic XPath, a tree-structured memory system for conversational AI that improves performance by 176.7% over traditional methods while using only 9.1% of the tokens. The system addresses scalability issues in long-term AI conversations by efficiently accessing and updating structured memory instead of appending growing conversation history.
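The token savings come from reading and writing only the relevant subtree of a structured memory rather than replaying the whole conversation. A minimal sketch of a path-addressed memory tree follows; the path scheme and API here are invented for illustration and are not the paper's interface.

```python
class MemoryTree:
    """Toy tree-structured memory addressed by XPath-like paths."""

    def __init__(self):
        self.root = {}

    def update(self, path, value):
        # Write a leaf, creating intermediate nodes as needed,
        # e.g. update("user/preferences/diet", "vegetarian").
        node = self.root
        parts = path.strip("/").split("/")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value

    def query(self, path):
        # Read only the addressed subtree instead of scanning
        # the full conversation history.
        node = self.root
        for part in path.strip("/").split("/"):
            node = node[part]
        return node

mem = MemoryTree()
mem.update("user/preferences/diet", "vegetarian")
mem.update("user/trips/2024/destination", "Kyoto")
print(mem.query("user/preferences/diet"))  # vegetarian
```

Each query touches a handful of nodes regardless of conversation length, which is the scalability property the summary describes.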
AI · Bullish · arXiv – CS AI · Mar 3 · 5/10 · 2
🧠Researchers introduce Purrception, a new variational flow matching approach for AI image generation that combines continuous transport dynamics with discrete supervision. The method demonstrates faster training convergence than existing baselines while achieving competitive quality scores on ImageNet-1k 256x256 generation tasks.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠Researchers propose Concrete Score Distillation (CSD), a new knowledge distillation method that improves efficiency of large language models by better preserving logit information compared to traditional softmax-based approaches. CSD demonstrates consistent performance improvements across multiple models including GPT-2, OpenLLaMA, and GEMMA while maintaining training stability.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠Researchers have developed FCN-LLM, a framework that enables Large Language Models to understand brain functional connectivity networks from fMRI scans through multi-task instruction tuning. The system uses a multi-scale encoder to capture brain features and demonstrates strong zero-shot generalization across unseen datasets, outperforming conventional supervised models.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠Researchers released ASTRA-bench, a new benchmark for evaluating AI agents' ability to handle complex, multi-step reasoning with personal context and tool usage. Testing revealed that current state-of-the-art models like Claude-4.5-Opus and DeepSeek-V3.2 show significant performance degradation in high-complexity scenarios.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 9
🧠Researchers developed a method to generate 'alien' research directions by decomposing academic papers into 'idea atoms' and using AI models to identify coherent but non-obvious research paths. The system analyzes ~7,500 machine learning papers to find viable research directions that current researchers are unlikely to naturally propose.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠Researchers have developed DIVA-GRPO, a new reinforcement learning method that improves multimodal large language model reasoning by adaptively adjusting problem difficulty distributions. The approach addresses key limitations in existing group relative policy optimization methods, showing superior performance across six reasoning benchmarks.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 2
🧠Researchers developed COMRES-VLM, a new framework using Vision Language Models to coordinate multiple robots for exploration and object search in indoor environments. The system achieved 10.2% faster exploration and 55.7% higher search efficiency compared to existing methods, while enabling natural language-based human guidance.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 7
🧠A research study analyzing 43 AI agent benchmarks and 72,342 tasks reveals significant misalignment between current agent development efforts and real-world human work patterns across 1,016 U.S. occupations. The study finds that agent development is overly programming-centric compared to where human labor and economic value are actually concentrated in the economy.
AI · Neutral · arXiv – CS AI · Mar 3 · 5/10 · 4
🧠Researchers introduced SimuHome, a high-fidelity smart home simulator and benchmark with 600 episodes for testing LLM-based smart home agents. The system uses the Matter protocol standard and enables time-accelerated simulation to evaluate how AI agents handle device control, environmental monitoring, and workflow scheduling in smart homes.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 2
🧠Researchers present a systematic study of linear models for time series forecasting, focusing on characteristic roots in temporal dynamics and introducing two regularization strategies (Reduced-Rank Regression and Root Purge) to address noise-induced spurious roots. The work achieves state-of-the-art results by combining classical linear systems theory with modern machine learning techniques.
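The characteristic roots in question are those of the fitted linear system: fit an AR(p) model and take the eigenvalues of its companion matrix. The sketch below shows that generic computation (roots near the unit circle indicate persistent dynamics; noise can introduce spurious roots, which is what regularizers like reduced-rank fits target). It is a textbook construction, not the paper's Root Purge procedure.

```python
import numpy as np

def ar_characteristic_roots(series, p):
    # Fit an AR(p) model x[t] = sum_k coef[k] * x[t-1-k] by least
    # squares, then return the eigenvalues of the companion matrix,
    # i.e. the characteristic roots of the fitted dynamics.
    x = np.asarray(series, dtype=float)
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    companion = np.zeros((p, p))
    companion[0, :] = coef           # top row holds the AR coefficients
    companion[1:, :-1] = np.eye(p - 1)  # sub-diagonal shifts the state
    return np.linalg.eigvals(companion)

# A noiseless AR(1) with coefficient 0.9 recovers a root at 0.9.
roots = ar_characteristic_roots(0.9 ** np.arange(30), p=1)
print(roots)  # [0.9]
```

With noisy data and larger p, extra eigenvalues appear that reflect the noise rather than the dynamics; suppressing those is the stated motivation for the two regularization strategies.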
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠Researchers demonstrate that Group Relative Policy Optimization (GRPO), traditionally viewed as an on-policy reinforcement learning algorithm, can be reinterpreted as an off-policy algorithm through first-principles analysis. This theoretical breakthrough provides new insights for optimizing reinforcement learning applications in large language models and offers principled approaches for off-policy RL algorithm design.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠Researchers introduce MetaTuner, a new framework that combines prompt optimization with fine-tuning for Large Language Models, using shared neural networks to discover optimal combinations of prompts and parameters. The approach addresses the discrete-continuous optimization challenge through supervised regularization and demonstrates consistent performance improvements across benchmarks.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠Researchers propose a new safety framework for AI agents using Scala 3 with capture checking to prevent information leakage and malicious behaviors. The system creates a 'safety harness' that tracks capabilities through static type checking, allowing fine-grained control over agent actions while maintaining task performance.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠Researchers propose Quantile Advantage Estimation (QAE) to stabilize Reinforcement Learning with Verifiable Rewards (RLVR) for large language model reasoning. The method replaces mean baselines with group-wise K-quantile baselines to prevent entropy collapse and explosion, showing sustained improvements on mathematical reasoning tasks.
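The baseline swap is easy to state concretely: in group-relative methods the advantage of each rollout is its reward minus a per-group baseline, and QAE replaces the group mean with a group quantile. A minimal sketch of that substitution follows, using a single quantile as a stand-in for the paper's K-quantile scheme.

```python
import numpy as np

def grpo_advantages(rewards):
    # Standard group-relative baseline: reward minus the group mean.
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def qae_advantages(rewards, q=0.5):
    # Quantile baseline (sketch of the QAE idea): reward minus a
    # group-wise quantile. q=0.5 uses the median; the paper's
    # K-quantile baselines are more elaborate.
    r = np.asarray(rewards, dtype=float)
    return r - np.quantile(r, q)

# With sparse verifiable rewards (one correct rollout in four), the
# mean baseline penalizes every failure; the median baseline leaves
# failures at zero advantage and credits only the success.
group = [1.0, 0.0, 0.0, 0.0]
print(grpo_advantages(group))  # [ 0.75 -0.25 -0.25 -0.25]
print(qae_advantages(group))   # [1. 0. 0. 0.]
```

Zeroing out uninformative rollouts is one intuition for why a quantile baseline can damp the entropy collapse/explosion dynamics the summary mentions.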
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠Researchers introduce ReMemR1, a new approach to improve large language models' ability to handle long-context question answering by integrating memory retrieval into the memory update process. The system enables non-linear reasoning through selective callback of historical memories and uses multi-level reward design to strengthen training.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠Researchers developed BioProAgent, a neuro-symbolic AI framework that combines large language models with deterministic constraints to enable reliable scientific planning in wet-lab environments. The system achieves 95.6% physical compliance compared to 21.0% for existing methods by using finite state machines to prevent costly experimental failures.
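The neuro-symbolic split can be sketched in a few lines: the LLM proposes a plan, but only transitions permitted by a deterministic protocol FSM are executed. The states and actions below are invented for illustration and are not BioProAgent's actual protocol model.

```python
# Deterministic constraint layer: a finite state machine over lab
# states. Keys are (state, action) pairs; values are next states.
# Any step not in the table is physically non-compliant and rejected.
ALLOWED = {
    ("empty", "load_sample"): "loaded",
    ("loaded", "add_reagent"): "mixed",
    ("mixed", "incubate"): "incubated",
    ("incubated", "measure"): "done",
}

def validate_plan(plan, state="empty"):
    # Check an LLM-proposed action sequence against the FSM.
    # Returns (ok, state_reached); a False result would be fed back
    # to the planner instead of being run on real equipment.
    for action in plan:
        if (state, action) not in ALLOWED:
            return False, state
        state = ALLOWED[(state, action)]
    return True, state

print(validate_plan(["load_sample", "add_reagent", "incubate", "measure"]))
# (True, 'done')
print(validate_plan(["load_sample", "measure"]))  # measuring too early
# (False, 'loaded')
```

Because the FSM is checked before execution, impossible plans fail cheaply in software rather than as costly wet-lab experiments, which is the compliance gap the summary's numbers describe.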
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠Researchers propose CollabEval, a new multi-agent framework for evaluating AI-generated content that uses collaborative judgment instead of single LLM evaluation. The system implements a three-phase process with multiple AI agents working together to provide more consistent and less biased evaluations than current approaches.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 9
🧠Researchers introduce HiMAC, a hierarchical reinforcement learning framework that improves LLM agent performance on long-horizon tasks by separating macro-level planning from micro-level execution. The approach demonstrates state-of-the-art results across multiple environments, showing that structured hierarchy is more effective than simply scaling model size for complex agent tasks.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 9
🧠Researchers introduced ARC (Adaptive Rewarding by self-Confidence), a new framework for improving text-to-image generation models through self-confidence signals rather than external rewards. The method uses internal self-denoising probes to evaluate model accuracy and converts this into scalar rewards for unsupervised optimization, showing improvements in compositional generation and text-image alignment.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠A research paper analyzes test-time scaling in large language models, revealing that longer reasoning chains (CoTs) can reduce training data requirements but may harm performance if the relevant skills aren't present in the training data. The study provides a theoretical framework showing that diverse, relevant, and challenging training tasks optimize test-time scaling performance.