AI

21,049 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science

Researchers propose a new AI learning architecture inspired by human and animal cognition that integrates observational learning and active behavior learning. The framework includes a meta-control system that switches between learning modes, addressing current limitations in autonomous AI learning.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

PMAx: An Agentic Framework for AI-Driven Process Mining

Researchers have developed PMAx, an autonomous AI framework that democratizes process mining by allowing business users to analyze organizational workflows through natural language queries. The system uses a multi-agent architecture with local execution to ensure data privacy and mathematical accuracy while eliminating the need for specialized technical expertise.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Advancing Multimodal Agent Reasoning with Long-Term Neuro-Symbolic Memory

Researchers introduced NS-Mem, a neuro-symbolic memory framework that combines neural representations with symbolic structures to improve multimodal AI agent reasoning. The system achieved a 4.35% average improvement in reasoning accuracy over pure neural systems, with gains of up to 12.5% on constrained reasoning tasks.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

Researchers developed an information-theoretic framework to explain 'Aha moments' in large language models during reasoning tasks. The study reveals that strong reasoning performance stems from uncertainty externalization rather than specific tokens, decomposing LLM reasoning into procedural information and epistemic verbalization.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Prompt Readiness Levels (PRL): a maturity scale and scoring framework for production grade prompt assets

Researchers have introduced Prompt Readiness Levels (PRL), a nine-level maturity framework for evaluating and governing AI prompt assets in production environments. The system includes a multidimensional scoring method (PRS) designed to ensure prompt engineering meets operational, safety, and compliance standards across organizations.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Researchers introduce VTC-Bench, a comprehensive benchmark for evaluating multimodal AI models' ability to use visual tools for complex tasks. The benchmark reveals significant limitations in current models: the leading model, Gemini-3.0-Pro, achieves only 51% accuracy on multi-tool visual reasoning tasks.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

Researchers introduce Gradient Atoms, an unsupervised method that decomposes AI model training gradients to discover interpretable behaviors without requiring predefined queries. The technique can identify model behaviors like refusal patterns and arithmetic capabilities, while also serving as effective steering vectors to control model outputs.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models

Researchers introduced BrainBench, a new benchmark revealing significant gaps in commonsense reasoning among leading LLMs. Even the best model (Claude Opus 4.6) achieved only 80.3% accuracy on 100 brainteaser questions, while GPT-4o scored just 39.7%, exposing fundamental reasoning deficits across frontier AI models.

🧠 GPT-4 · 🧠 Claude · 🧠 Opus

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Dynamic Theory of Mind as a Temporal Memory Problem: Evidence from Large Language Models

Research reveals that Large Language Models struggle with dynamic Theory of Mind tasks, particularly tracking how others' beliefs change over time. While LLMs can infer current beliefs effectively, they fail to maintain and retrieve prior belief states after updates occur, showing patterns consistent with human cognitive biases.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

Researchers introduce OpenHospital, a new interactive arena designed to develop and benchmark Large Language Model-based Collective Intelligence through physician-patient agent interactions. The platform uses a data-in-agent-self paradigm to rapidly enhance AI agent capabilities while providing evaluation metrics for medical proficiency and system efficiency.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Argumentation for Explainable and Globally Contestable Decision Support with LLMs

Researchers introduce ArgEval, a new framework that enhances Large Language Model decision-making through structured argumentation and global contestability. Unlike previous approaches limited to binary choices and local corrections, ArgEval maps entire decision spaces and builds reusable argumentation frameworks that can be globally modified to prevent repeated mistakes.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph

Researchers propose a priority graph model to understand conflicts in LLM alignment, revealing that unified stable alignment is challenging due to context-dependent inconsistencies. The study identifies 'priority hacking' as a vulnerability where adversaries can manipulate safety alignments, and suggests runtime verification mechanisms as a potential solution.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Contests with Spillovers: Incentivizing Content Creation with GenAI

Researchers propose the Content Creation with Spillovers (CCS) model to capture how GenAI and LLMs create positive spillovers: one creator's content can be reused by others, which may undermine individual incentives to create. They introduce Provisional Allocation mechanisms that guarantee equilibrium existence and develop approximation algorithms to maximize social welfare in content creation ecosystems.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Researchers introduce AgentProcessBench, the first benchmark for evaluating step-level effectiveness in AI tool-using agents, comprising 1,000 trajectories and 8,509 human-labeled annotations. The benchmark reveals that current AI models struggle with distinguishing neutral and erroneous actions in tool execution, and that process-level signals can significantly enhance test-time performance.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

Researchers propose a hierarchical planning framework to analyze why LLM-based web agents fail at complex navigation tasks. The study reveals that while structured PDDL plans outperform natural language plans, low-level execution and perceptual grounding remain the primary bottlenecks rather than high-level reasoning.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Computational Concept of the Psyche

Researchers propose a new computational concept for modeling the human psyche as an operating system for artificial general intelligence. The approach treats the psyche as a decision-making system that operates in a state space including needs, sensations, and actions to optimize goal achievement while minimizing risks.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Relationship-Aware Safety Unlearning for Multimodal LLMs

Researchers propose a new framework for improving safety in multimodal AI models by targeting unsafe relationships between objects rather than removing entire concepts. The approach uses parameter-efficient edits to suppress dangerous combinations while preserving benign uses of the same objects and relations.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

Researchers propose combining GRPO (Group Relative Policy Optimization) with reflection reward mechanisms to enhance mathematical reasoning in large language models. The four-stage framework encourages self-reflective capabilities during training and reports state-of-the-art performance, outperforming existing methods such as supervised fine-tuning and LoRA.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

A comprehensive study examines the relationship between Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) as methods for improving Large Language Models after pre-training. It identifies an emerging trend toward hybrid post-training approaches that combine both methods, and analyzes applications from 2023 to 2025 to establish when each method is most effective.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

EviAgent: Evidence-Driven Agent for Radiology Report Generation

Researchers introduce EviAgent, a new AI system for automated radiology report generation that provides transparent, evidence-driven analysis. The system addresses key limitations of current medical AI models by offering traceable decision-making and integrating external domain knowledge, outperforming existing specialized medical models in testing.

AI · Bearish · The Register – AI · Mar 17 · 6/10

AI still doesn't work very well, businesses are faking it, and a reckoning is coming

The article argues that AI technology still has significant limitations and that businesses are overstating its capabilities. It suggests that a market correction or reassessment of AI's actual effectiveness is approaching.

AI · Bearish · Decrypt – AI · Mar 16 · 6/10

Did ChatGPT Really Cure a Dog's Cancer? It's Complicated

A viral story claiming ChatGPT helped cure a dog's cancer by designing a custom vaccine has been disputed by the actual scientists involved. The researchers say the AI's role was minimal and the credit for the breakthrough belongs to traditional scientific methods and expertise.

🧠 ChatGPT

AI · Bullish · TechCrunch – AI · Mar 16 · 6/10

Nvidia’s version of OpenClaw could solve its biggest problem: security

Nvidia announced NemoClaw, an open enterprise AI agent platform built on the viral OpenClaw framework. This platform appears to address security concerns, which Nvidia identifies as one of its biggest challenges in the AI space.

🏢 Nvidia

AI · Bearish · Ars Technica – AI · Mar 16 · 6/10

OpenAI’s own mental health experts unanimously opposed “naughty” ChatGPT launch

OpenAI's internal mental health experts unanimously opposed the launch of a more permissive version of ChatGPT that allows adult content creation. The disagreement highlights concerns about the psychological impact of AI-generated adult content, even as OpenAI attempts to distinguish between different types of explicit material.

🏢 OpenAI · 🧠 ChatGPT

AI · Bullish · Blockonomi · Mar 16 · 6/10

Arista Networks (ANET) Stock Poised for 27% Gain According to Analyst Consensus

Arista Networks (ANET) stock has a consensus price target of $177.50, representing a potential 27.6% upside. The optimistic outlook is driven by strong AI networking growth, high profit margins of 42.8%, and 93% of analysts rating it as a buy.
