9,148 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.
AI · Bullish · arXiv – CS AI · Feb 27 · 5/10 · 7
🧠Researchers have developed RepSPD, a novel geometric deep learning model that enhances EEG brain activity decoding using symmetric positive definite (SPD) manifolds and dynamic graphs. The framework introduces cross-attention mechanisms on Riemannian manifolds and bidirectional alignment strategies to improve brain signal representation and analysis.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 2
🧠Researchers developed a Retrieval-Augmented Generation (RAG) assistant for anatomical pathology laboratories to replace outdated static documentation with dynamic, searchable protocol guidance. The system achieved strong performance using biomedical-specific embeddings and could transform healthcare laboratory workflows by providing technicians with accurate, context-grounded answers to protocol queries.
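The retrieval step behind a RAG assistant like this can be sketched as embedding-based nearest-neighbor lookup. The toy below substitutes a bag-of-words vector for the paper's biomedical embeddings; the vocabulary, protocol texts, and function names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Tiny fixed vocabulary -- a stand-in for a biomedical embedding space.
VOCAB = ["antibody", "incubate", "primary", "formalin", "staining", "tissue"]

def embed(text: str) -> np.ndarray:
    # Bag-of-words count vector over VOCAB, L2-normalised.
    words = text.lower().replace("?", " ").replace(":", " ").replace(",", " ").split()
    vec = np.array([float(words.count(w)) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Rank protocol chunks by cosine similarity to the query embedding.
    q = embed(query)
    return sorted(chunks, key=lambda c: -float(embed(c) @ q))[:k]

protocols = [
    "H&E staining: fix tissue in formalin before sectioning.",
    "Immunohistochemistry: apply primary antibody, incubate 60 min.",
]
top = retrieve("How long to incubate the primary antibody?", protocols)
```

In the real system, `embed` would be a biomedical embedding model and the retrieved chunk would be passed to an LLM so its answer stays grounded in the lab's protocols.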
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers developed a framework for analyzing AI diagnostic systems in clinical settings by preserving original AI inferences and comparing them with physician corrections. The study of 21 dermatological cases showed 71.4% exact agreement between AI and physicians, with 100% comprehensive concordance when using structured analysis methods.
AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers have developed SPM-Bench, a PhD-level benchmark for testing large language models on scanning probe microscopy tasks. The benchmark uses automated data synthesis from scientific papers and introduces new evaluation metrics to assess AI reasoning capabilities in specialized scientific domains.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 8
🧠Researchers have developed FactGuard, an AI framework that uses multimodal large language models and reinforcement learning to detect video misinformation. The system addresses limitations of existing models by implementing iterative reasoning processes and external tool integration to verify information across video content.
AI · Bearish · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers developed ClinDet-Bench, a new benchmark that reveals large language models fail to properly identify when they have sufficient information to make clinical decisions. The study shows LLMs make both premature judgments and excessive abstentions in medical scenarios, highlighting safety concerns for AI deployment in healthcare settings.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers introduce AMA-Bench, a new benchmark for evaluating long-horizon memory in AI agents deployed in real-world applications. The study reveals existing memory systems underperform due to a lack of causality and objective information, while their proposed AMA-Agent system achieves 57.22% accuracy, surpassing baselines by 11.16%.
AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 8
🧠Researchers propose a new conceptual model for agentic AI systems that addresses when and how AI should intervene by integrating Scene, Context, and Human Behavior Factors. The model derives five design principles to guide AI intervention timing, depth, and restraint for more contextually sensitive autonomous systems.
AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 5
🧠Research analyzing physician disagreement in HealthBench medical AI evaluation dataset finds that 81.8% of disagreement variance is unexplained by observable features, with rubric identity accounting for only 15.8% of variance. The study reveals physicians agree on clearly good or bad AI outputs but disagree on borderline cases, suggesting structural limits to medical AI evaluation consistency.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6
🧠Researchers introduce RLHFless, a serverless computing framework for Reinforcement Learning from Human Feedback (RLHF) that addresses resource inefficiencies in training large language models. The system achieves up to 1.35x speedup and 44.8% cost reduction compared to existing solutions by dynamically adapting to resource demands and optimizing workload distribution.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5
🧠Researchers propose TAESAR, a new data-centric framework for improving recommendation models by transforming mixed-domain data into unified target-domain sequences. The approach uses contrastive decoding to address domain gaps and data sparsity issues, outperforming traditional model-centric solutions while generalizing across various sequential models.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6
🧠Researchers introduce SideQuest, a novel KV cache management system that uses Large Reasoning Models to compress memory usage during long-horizon AI tasks. The system reduces peak token usage by up to 65% while maintaining accuracy by having the model itself determine which tokens are useful to keep in memory.
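Score-based cache eviction of this kind can be sketched as follows. The usefulness scores here are a hand-set toy dict, whereas SideQuest has the reasoning model itself judge which tokens to keep; the function and variable names are hypothetical:

```python
def compress_cache(cache: dict[int, str], scores: dict[int, float],
                   budget: int) -> dict[int, str]:
    # Keep only the `budget` highest-scoring token positions; evict the rest.
    keep = sorted(scores, key=scores.get, reverse=True)[:budget]
    return {pos: cache[pos] for pos in sorted(keep)}

# Toy KV cache: position -> token, with hand-set usefulness scores.
cache = {0: "The", 1: "report", 2: "shows", 3: "revenue", 4: "rose"}
scores = {0: 0.10, 1: 0.90, 2: 0.20, 3: 0.95, 4: 0.70}
compressed = compress_cache(cache, scores, budget=3)
```

Shrinking `budget` is what caps peak token usage; the paper's contribution is producing good scores, not the eviction mechanics themselves.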
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers introduce AHCE (Active Human-Augmented Challenge Engagement), a framework that enables AI agents to collaborate with human experts more effectively through learned policies. The system achieved 32% improvement on normal difficulty tasks and 70% on difficult tasks in Minecraft experiments by treating humans as interactive reasoning tools rather than simple help sources.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 4
🧠Researchers propose an agentic AI framework using multiple LLM-based agents to optimize cell-free Open RAN networks through intent-driven automation. The system reduces active radio units by 42% in energy-saving mode while cutting memory usage by 92% through parameter-efficient fine-tuning.
AI · Neutral · arXiv – CS AI · Feb 27 · 5/10 · 2
🧠Researchers propose using cognitive models and AI algorithms as templates for designing modular language agents that combine multiple large language models. The position paper formalizes agent templates that specify roles for individual LLMs and how their functionalities should be composed to solve complex problems beyond single model capabilities.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers identified why AI mathematical reasoning guidance is inconsistent and developed Selective Strategy Retrieval (SSR), a framework that improves AI math performance by combining human and model strategies. The method showed significant improvements of up to 13 points on mathematical benchmarks by addressing the gap between strategy usage and executability.
AI · Neutral · arXiv – CS AI · Feb 27 · 5/10 · 5
🧠Researchers propose Contrastive World Models (CWM), a new approach for training AI agents to better distinguish between physically feasible and infeasible actions in embodied environments. The method uses contrastive learning with hard negative examples to outperform traditional supervised fine-tuning, achieving a 6.76-percentage-point improvement in precision and better safety margins under stress conditions.
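The contrastive objective described here is in the InfoNCE family. A minimal sketch, assuming scalar similarities between a state embedding and candidate action embeddings (the paper's exact loss and hyperparameters may differ):

```python
import math

def infonce_loss(sim_pos: float, sims_neg: list[float],
                 temperature: float = 0.1) -> float:
    # InfoNCE: -log( exp(s+/T) / (exp(s+/T) + sum_j exp(s-_j/T)) ).
    # Hard negatives are infeasible actions that look similar to feasible ones.
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    log_denom = math.log(sum(math.exp(z) for z in logits))
    return log_denom - sim_pos / temperature
```

Raising the feasible action's similarity (or lowering a hard negative's) reduces the loss, which is what pushes the world model to separate feasible from infeasible actions.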
AI · Bullish · arXiv – CS AI · Feb 27 · 5/10 · 7
🧠DeepPresenter is a new AI framework for autonomous presentation generation that can plan, render, and revise slides through environment-grounded reflection rather than fixed templates. The system uses perceptual feedback from rendered slides to identify and correct presentation-specific issues, achieving state-of-the-art performance with a competitive 9B parameter model.
AI · Bearish · arXiv – CS AI · Feb 27 · 6/10 · 6
🧠Researchers introduced ConstraintBench, a new benchmark testing whether large language models can directly solve constrained optimization problems without external solvers. The study found that even the best frontier models only achieve 65% constraint satisfaction, with feasibility being a bigger challenge than optimality.
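A feasibility metric of the kind this benchmark reports can be sketched as the fraction of constraints a model-proposed solution satisfies; the constraints and solution below are toy assumptions, not benchmark items:

```python
def constraint_satisfaction(solution: dict[str, float],
                            constraints: list) -> float:
    # Fraction of constraints satisfied by a model-proposed solution.
    satisfied = sum(1 for check in constraints if check(solution))
    return satisfied / len(constraints)

# Toy problem: the proposed point meets two of three constraints.
proposed = {"x": 2.0, "y": 3.0}
checks = [
    lambda s: s["x"] + s["y"] <= 6.0,
    lambda s: s["x"] >= 0.0,
    lambda s: s["y"] >= 4.0,
]
rate = constraint_satisfaction(proposed, checks)  # 2/3
```

The benchmark's finding is that frontier models plateau around 65% on exactly this kind of feasibility check, even before optimality is considered.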
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6
🧠Researchers propose an Evaluation Agent framework to assess AI agent decision-making in AutoML pipelines, moving beyond outcome-focused metrics to evaluate intermediate decisions. The system can detect faulty decisions with 91.9% F1 score and reveals impacts ranging from -4.9% to +8.3% in final performance metrics.
AI · Bullish · arXiv – CS AI · Feb 27 · 5/10 · 4
🧠Researchers conducted a comprehensive review of artificial intelligence applications in life cycle assessment (LCA) using large language models to analyze trends and patterns. The study found dramatic growth in AI adoption for environmental assessments, with a notable shift toward LLM-driven approaches and strong correlations between AI methods and LCA stages.
AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 4
🧠Researchers propose using psychometric modeling to correct systematic biases in human evaluations of AI systems, demonstrating how Item Response Theory can separate true AI output quality from rater behavior inconsistencies. The approach was tested on OpenAI's summarization dataset and showed improved reliability in measuring AI model performance.
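The core idea of Item Response Theory here can be sketched with a Rasch-style model, where the probability a rater approves an output is logistic in latent output quality minus rater severity (a minimal illustration, not the paper's full model):

```python
import math

def rasch_prob(quality: float, severity: float) -> float:
    # Probability a rater approves an output: logistic in
    # (latent output quality) - (rater severity).
    return 1.0 / (1.0 + math.exp(-(quality - severity)))

# Same output, two raters: a lenient rater approves far more often than
# a harsh one, so raw approval rates conflate quality with rater behavior.
lenient = rasch_prob(0.5, severity=-1.0)
harsh = rasch_prob(0.5, severity=1.0)
```

Fitting both parameters jointly from many ratings is what lets the method separate true output quality from rater inconsistency.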
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 8
🧠Researchers introduce G-reasoner, a unified framework combining graph and language foundation models to enable better reasoning over structured knowledge. The system uses a 34M-parameter graph foundation model with QuadGraph abstraction to outperform existing retrieval-augmented generation methods across six benchmarks.
AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 6
🧠Researchers published a case study demonstrating successful human-AI collaboration in mathematical research, extending Hermite quadrature rule results beyond manual capabilities. The study reveals AI's strengths in algebraic manipulation and proof exploration, while highlighting the critical need for human verification and domain expertise in every step of the research process.
AI · Neutral · OpenAI News · Feb 27 · 6/10 · 5
🧠OpenAI provides updates on its mental health safety initiatives, including new parental controls, trusted contact features, and enhanced distress detection capabilities. The company also addresses recent litigation developments related to its mental health work.