🧠

AI

11,676 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.

11676 articles

AIBullisharXiv – CS AI · Mar 46/102

🧠

OrchMAS: Orchestrated Reasoning with Multi Collaborative Heterogeneous Scientific Expert Structured Agents

Researchers have developed OrchMAS, a new multi-agent AI framework that uses specialized expert agents and dynamic orchestration to improve reasoning in scientific domains. The system addresses limitations of existing multi-agent frameworks by enabling flexible role allocation, prompt refinement, and heterogeneous model integration for complex scientific tasks.

AIBearisharXiv – CS AI · Mar 46/103

🧠

SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models

Researchers introduce SpatialText, a diagnostic framework to test whether large language models can truly reason about spatial relationships or merely rely on linguistic patterns. The study reveals that current AI models fail at egocentric perspective reasoning despite proficiency in basic spatial fact retrieval.

AINeutralarXiv – CS AI · Mar 46/102

🧠

Beyond Factual Correctness: Mitigating Preference-Inconsistent Explanations in Explainable Recommendation

Researchers propose PURE, a new framework for AI-powered recommendation systems that addresses preference-inconsistent explanations - where AI provides factually correct but unconvincing reasoning that conflicts with user preferences. The system uses a select-then-generate approach to improve both evidence selection and explanation generation, demonstrating reduced hallucinations while maintaining recommendation accuracy.

AINeutralarXiv – CS AI · Mar 46/105

🧠

Architecting Trust in Artificial Epistemic Agents

Researchers propose a framework for developing trustworthy AI agents that function as epistemic entities, capable of pursuing knowledge goals and shaping information environments. The paper argues that as AI models increasingly replace traditional search methods and provide specialized advice, their calibration to human epistemic norms becomes critical to prevent cognitive deskilling and epistemic drift.

AIBullisharXiv – CS AI · Mar 46/102

🧠

Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

Researchers developed a method to improve EEG-based music identification by using artificial neural networks that distinguish between acoustic and expectation-related brain representations. The approach combines both types of neural representations to achieve better performance than traditional methods, potentially advancing brain-computer interfaces and neural decoding applications.

AIBullisharXiv – CS AI · Mar 47/103

🧠

Odin: Multi-Signal Graph Intelligence for Autonomous Discovery in Knowledge Graphs

Researchers present Odin, the first production-deployed graph intelligence engine that autonomously discovers patterns in knowledge graphs without predefined queries. The system uses a novel COMPASS scoring metric combining structural, semantic, temporal, and community-aware signals, and has been successfully deployed in regulated healthcare and insurance environments.

AIBullisharXiv – CS AI · Mar 46/104

🧠

AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows

Researchers introduce AgentAssay, the first framework for regression testing AI agent workflows, achieving 78-100% cost reduction while maintaining statistical guarantees. The system uses behavioral fingerprinting and stochastic testing methods to detect regressions in autonomous AI agents across multiple models including GPT-5.2, Claude Sonnet 4.6, and others.

AIBullisharXiv – CS AI · Mar 47/105

🧠

NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect

Researchers introduce NeuroProlog, a neurosymbolic framework that improves mathematical reasoning in Large Language Models by converting math problems into executable Prolog programs. The multi-task 'Cocktail' training approach shows significant accuracy improvements of 3-5% across different model sizes, with larger models demonstrating better error correction capabilities.

AINeutralarXiv – CS AI · Mar 47/104

🧠

A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities

Researchers introduced NeuroCognition, a new benchmark for evaluating LLMs based on neuropsychological tests, revealing that while models show unified capability across tasks, they struggle with foundational cognitive abilities. The study found LLMs perform well on text but degrade with images and complexity, suggesting current models lack core adaptive cognition compared to human intelligence.

AIBullisharXiv – CS AI · Mar 46/103

🧠

LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model

Researchers developed LLM-MLFFN, a new framework combining large language models with multi-level feature fusion to classify autonomous vehicle driving behaviors. The system achieves over 94% accuracy on the Waymo dataset by integrating numerical driving data with semantic features extracted through LLMs.

AIBearisharXiv – CS AI · Mar 47/102

🧠

Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation

Researchers introduce Procedure-Aware Evaluation (PAE) framework to assess how AI agents complete tasks, not just if they succeed. The study reveals that 27-78% of reported AI agent successes are actually "corrupt successes" that mask underlying procedural violations and reliability issues.

AINeutralarXiv – CS AI · Mar 47/104

🧠

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

Researchers have introduced SorryDB, a dynamic benchmark for evaluating AI systems' ability to prove mathematical theorems using the Lean proof assistant. The benchmark draws from 78 real-world formalization projects and addresses limitations of static benchmarks by providing continuously updated tasks that better reflect community needs.

AIBullisharXiv – CS AI · Mar 46/103

🧠

Agentic AI-based Coverage Closure for Formal Verification

Researchers have developed an agentic AI-driven workflow using Large Language Models to automate coverage analysis for formal verification in integrated chip development. The approach systematically identifies coverage gaps and generates required formal properties, demonstrating measurable improvements in coverage metrics that correlate with design complexity.

AINeutralarXiv – CS AI · Mar 47/103

🧠

Retrievit: In-context Retrieval Capabilities of Transformers, State Space Models, and Hybrid Architectures

Research compares Transformers, State Space Models (SSMs), and hybrid architectures for in-context retrieval tasks, finding hybrid models excel at information-dense retrieval while Transformers remain superior for position-based tasks. SSM-based models develop unique locality-aware embeddings that create interpretable positional structures, explaining their specific strengths and limitations.

AINeutralarXiv – CS AI · Mar 46/102

🧠

LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges

Researchers have released LiveAgentBench, a comprehensive benchmark featuring 104 real-world scenarios to evaluate AI agent performance across practical applications. The benchmark uses a novel Social Perception-Driven Data Generation method to ensure tasks reflect actual user requirements and includes 374 total tasks for testing various AI models and frameworks.

AIBearisharXiv – CS AI · Mar 47/102

🧠

Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

Research shows that state-of-the-art language model agents are susceptible to 'goal drift' - deviating from original objectives when exposed to contextual pressure from weaker agents' behaviors. Only GPT-5.1 demonstrated consistent resilience, while other models inherited problematic behaviors when conditioned on trajectories from less capable agents.

AIBullisharXiv – CS AI · Mar 47/102

🧠

NeuroSkill(tm): Proactive Real-Time Agentic System Capable of Modeling Human State of Mind

NeuroSkill is a new open-source AI system that models human mental states in real-time using brain-computer interfaces and biophysical signals. The system runs offline on edge devices and can engage with humans on cognitive and emotional levels through its NeuroLoop harness technology.

AIBullisharXiv – CS AI · Mar 47/102

🧠

Saarthi for AGI: Towards Domain-Specific General Intelligence for Formal Verification

Researchers have enhanced the Saarthi AI framework for formal verification, achieving 70% better accuracy in generating SystemVerilog assertions and 50% fewer iterations to reach coverage closure. The framework uses multi-agent collaboration and improved RAG techniques to move toward domain-specific AI intelligence for verification tasks.

AIBullisharXiv – CS AI · Mar 46/102

🧠

SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

Researchers developed SAE-based Transferability Score (STS), a new metric using sparse autoencoders to predict how well fine-tuned large language models will perform across different domains without requiring actual training. The method achieves correlation coefficients above 0.7 with actual performance changes and provides interpretable insights into model adaptation.

AIBullisharXiv – CS AI · Mar 47/102

🧠

SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Researchers propose SUN (Shared Use of Next-token Prediction), a novel approach for multi-LLM serving that enables cross-model sharing of decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to 2.0x throughput improvement per GPU while maintaining accuracy comparable to full fine-tuning, with a quantized version (QSUN) providing additional 45% speedup.

AIBullisharXiv – CS AI · Mar 46/104

🧠

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Researchers have developed EvoSkill, an automated framework that enables AI agents to discover and refine domain-specific skills through iterative failure analysis. The system demonstrated significant performance improvements on specialized tasks, with accuracy gains of 7.3% on financial data analysis and 12.1% on search-augmented QA, while showing transferable capabilities across different domains.

AIBullisharXiv – CS AI · Mar 46/102

🧠

AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework

Researchers have developed a Bayesian adversarial multi-agent framework for AI-driven scientific code generation, featuring three coordinated LLM agents that work together to improve reliability and reduce errors. The Low-code Platform (LCP) enables non-expert users to generate scientific code through natural language prompts, demonstrating superior performance in benchmark tests and Earth Science applications.

AIBullisharXiv – CS AI · Mar 46/102

🧠

Predicting Tuberculosis from Real-World Cough Audio Recordings and Metadata

Researchers developed an AI system that can detect tuberculosis from cough recordings with 70% accuracy using audio alone, improving to 81% when combined with clinical metadata. The study used real-world data from a phone-based app across Africa and Asia, suggesting mobile applications could enhance TB diagnosis in community health settings.

$CRV

AIBullisharXiv – CS AI · Mar 47/103

🧠

Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals

Researchers introduce Density-Guided Response Optimization (DGRO), a new AI alignment method that learns community preferences from implicit acceptance signals rather than explicit feedback. The technique uses geometric patterns in how communities naturally engage with content to train language models without requiring costly annotation or preference labeling.

AIBullisharXiv – CS AI · Mar 46/104

🧠

Agentified Assessment of Logical Reasoning Agents

Researchers present a new framework for evaluating logical reasoning AI agents using an "assessor agent" that can issue tasks, enforce execution limits, and record structured failure types. Their auto-formalization agent achieved 86.70% accuracy on logical reasoning tasks, outperforming traditional chain-of-thought approaches by nearly 13 percentage points.

← PrevPage 67 of 468Next →