🧠

AI

21,463 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.

21463 articles

AINeutralarXiv – CS AI · Mar 37/107

🧠

Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs

Research reveals that personalization in Large Language Models increases emotional validation but has complex effects on how models maintain their positions depending on their assigned role. When acting as advisors, personalized LLMs show greater independence, but as social peers, they become more susceptible to abandoning their positions when challenged.

AINeutralarXiv – CS AI · Mar 37/106

🧠

StaTS: Spectral Trajectory Schedule Learning for Adaptive Time Series Forecasting with Frequency Guided Denoiser

Researchers introduce StaTS, a new diffusion model for time series forecasting that learns adaptive noise schedules and uses frequency-guided denoising. The model addresses limitations of fixed noise schedules in existing diffusion models by incorporating spectral regularization and data-adaptive scheduling for improved structural preservation.

$NEAR

AIBullisharXiv – CS AI · Mar 37/107

🧠

Tool Verification for Test-Time Reinforcement Learning

Researchers introduce T³RL (Tool-Verification for Test-Time Reinforcement Learning), a new method that improves self-evolving AI reasoning models by using external tool verification to prevent incorrect learning from biased consensus. The approach shows significant improvements on mathematical problem-solving tasks, with larger gains on harder problems.

AIBullisharXiv – CS AI · Mar 37/107

🧠

CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

Researchers introduce CARE, a new framework for improving LLM evaluation by addressing correlated errors in AI judge ensembles. The method separates true quality signals from confounding factors like verbosity and style preferences, achieving up to 26.8% error reduction across 12 benchmarks.

AIBullisharXiv – CS AI · Mar 37/107

🧠

Conformal Policy Control

Researchers have developed a conformal policy control method that enables AI agents to safely explore new behaviors while maintaining strict safety constraints. The approach uses safe reference policies as probabilistic regulators to determine how aggressively new policies can act, providing finite-sample guarantees without requiring specific model assumptions or hyperparameter tuning.

AIBullisharXiv – CS AI · Mar 37/108

🧠

Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

Researchers have developed Nano-EmoX, a compact 2.2B parameter multimodal language model that unifies emotional intelligence tasks across perception, understanding, and interaction levels. The model achieves state-of-the-art performance on six core affective tasks using a novel curriculum-based training framework called P2E (Perception-to-Empathy).

AINeutralarXiv – CS AI · Mar 36/107

🧠

Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

Researchers introduced Pencil Puzzle Bench, a new framework for evaluating large language model reasoning capabilities using constraint-satisfaction problems. The benchmark tested 51 models across 300 puzzles, revealing significant performance improvements through increased reasoning effort and iterative verification processes.

AINeutralarXiv – CS AI · Mar 36/109

🧠

Rich Insights from Cheap Signals: Efficient Evaluations via Tensor Factorization

Researchers propose a tensor factorization method that combines cheap automated evaluation data with limited human labels to enable fine-grained evaluation of AI generative models. The approach addresses the data bottleneck in model evaluation by using autorater scores to pretrain representations that are then aligned to human preferences with minimal calibration data.

AINeutralarXiv – CS AI · Mar 36/1010

🧠

According to Me: Long-Term Personalized Referential Memory QA

Researchers introduce ATM-Bench, the first benchmark for evaluating AI assistants' ability to recall and reason over long-term personalized memory across multiple modalities. The benchmark reveals poor performance (under 20% accuracy) for current state-of-the-art memory systems, highlighting significant limitations in personalized AI capabilities.

AINeutralarXiv – CS AI · Mar 36/105

🧠

LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations

Researchers introduce LiveCultureBench, a new benchmark that evaluates large language models as autonomous agents in simulated social environments, testing both task completion and adherence to cultural norms. The benchmark uses a multi-cultural town simulation to assess cross-cultural robustness and the balance between effectiveness and cultural sensitivity in LLM agents.

AIBullisharXiv – CS AI · Mar 37/107

🧠

Attn-QAT: 4-Bit Attention With Quantization-Aware Training

Researchers introduce Attn-QAT, the first systematic approach to 4-bit quantization-aware training for attention mechanisms in AI models. The method enables stable FP4 computation on emerging GPUs and delivers up to 1.5x speedup on RTX 5090 while maintaining model quality across diffusion and language models.

AIBullisharXiv – CS AI · Mar 36/106

🧠

OpenRad: a Curated Repository of Open-access AI models for Radiology

Researchers created OpenRad, a curated repository containing approximately 1,700 open-access AI models for radiology. The platform aggregates scattered radiology AI research into a standardized, searchable database that includes model weights, interactive applications, and spans all imaging modalities and radiology subspecialties.

AIBullisharXiv – CS AI · Mar 36/107

🧠

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Researchers introduce CoVe, a framework for training interactive tool-use AI agents that uses constraint-guided verification to generate high-quality training data. The compact CoVe-4B model achieves competitive performance with models 17 times larger on benchmark tests, with the team open-sourcing code, models, and 12K training trajectories.

AIBullisharXiv – CS AI · Mar 37/108

🧠

FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Researchers introduce FT-Dojo, an interactive environment for studying autonomous LLM fine-tuning, along with FT-Agent, an AI system that can automatically fine-tune language models without human intervention. The system achieved best performance on 10 out of 13 tasks across five domains, demonstrating the potential for fully automated machine learning workflows while revealing current limitations in AI reasoning capabilities.

AIBullisharXiv – CS AI · Mar 37/106

🧠

CeProAgents: A Hierarchical Agents System for Automated Chemical Process Development

Researchers propose CeProAgents, a hierarchical multi-agent system that automates chemical process development using AI agents specialized in knowledge, concept, and parameter tasks. The system introduces CeProBench, a comprehensive benchmark for evaluating AI capabilities in chemical engineering applications.

AINeutralarXiv – CS AI · Mar 36/108

🧠

GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules

Researchers introduce GMP, a new benchmark highlighting critical challenges in AI content moderation systems when dealing with co-occurring policy violations and dynamic platform rules. The study reveals that current large language models struggle with consistent moderation when policies are unstable or context-dependent, leading to either over-censorship or allowing harmful content.

AIBullisharXiv – CS AI · Mar 37/107

🧠

Learning Structured Reasoning via Tractable Trajectory Control

Researchers propose Ctrl-R, a new framework that improves large language models' reasoning abilities by systematically discovering and reinforcing diverse reasoning patterns through structured trajectory control. The method enables better exploration of complex reasoning behaviors and shows consistent improvements across mathematical reasoning tasks in both language and vision-language models.

AIBullisharXiv – CS AI · Mar 36/109

🧠

GAM-RAG: Gain-Adaptive Memory for Evolving Retrieval in Retrieval-Augmented Generation

Researchers introduce GAM-RAG, a training-free framework that improves Retrieval-Augmented Generation by building adaptive memory from past queries instead of relying on static indices. The system uses uncertainty-aware updates inspired by cognitive neuroscience to balance stability and adaptability, achieving 3.95% better performance while reducing inference costs by 61%.

AIBullisharXiv – CS AI · Mar 37/108

🧠

SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing

Researchers propose SEED-SET, a new Bayesian experimental design framework for ethical testing of autonomous systems like drones in high-stakes environments. The system uses hierarchical Gaussian Processes to model both objective evaluations and subjective stakeholder judgments, generating up to 2x more optimal test candidates than baseline methods.

AIBullisharXiv – CS AI · Mar 37/107

🧠

ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents

Researchers developed ToolRLA, a three-stage reinforcement learning pipeline that significantly improves AI agents' ability to use external tools and APIs for domain-specific tasks. The system achieved 47% higher task completion rates and 93% lower regulatory violations when deployed in a real-world financial advisory copilot serving 80+ advisors with 1,200+ daily queries.

AIBullisharXiv – CS AI · Mar 36/108

🧠

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Researchers introduce Mix-GRM, a new framework for Generative Reward Models that improves AI evaluation by combining breadth and depth reasoning mechanisms. The system achieves 8.2% better performance than leading open-source models by using structured Chain-of-Thought reasoning tailored to specific task types.

AINeutralarXiv – CS AI · Mar 36/1012

🧠

RubricBench: Aligning Model-Generated Rubrics with Human Standards

RubricBench is a new benchmark with 1,147 pairwise comparisons designed to evaluate rubric-based assessment methods for Large Language Models. Research reveals a significant gap between human-annotated and AI-generated rubrics, showing that current state-of-the-art models struggle to autonomously create valid evaluation criteria.

AIBullisharXiv – CS AI · Mar 37/108

🧠

Maximizing the Spectral Energy Gain in Sub-1-Bit LLMs via Latent Geometry Alignment

Researchers introduce LittleBit-2, a new framework for extreme compression of large language models that achieves sub-1-bit quantization while maintaining performance comparable to 1-bit baselines. The method uses Internal Latent Rotation and Joint Iterative Quantization to solve geometric alignment issues in binary quantization, establishing new state-of-the-art results on Llama-2 and Llama-3 models.

AIBullisharXiv – CS AI · Mar 37/107

🧠

What Papers Don't Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction

Researchers propose a new framework called 'method' that addresses the challenge of automated paper reproduction by recovering tacit knowledge that academic papers leave implicit. The graph-based agent framework achieves 10.04% performance gap against official implementations, improving over baselines by 24.68% across 40 recent papers.

$LINK

AINeutralarXiv – CS AI · Mar 37/109

🧠

Evaluating and Understanding Scheming Propensity in LLM Agents

Researchers studied scheming behavior in AI agents pursuing long-term goals, finding minimal instances of scheming in realistic scenarios despite high environmental incentives. The study reveals that scheming behavior is remarkably brittle and can be dramatically reduced by removing tools or increasing oversight.

← PrevPage 562 of 859Next →