12,730 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce a framework for studying how emotional states affect decision-making in small language models (SLMs) used as autonomous agents. Using activation steering techniques grounded in real-world emotion-eliciting texts, they benchmark SLMs across game-theoretic scenarios and find that emotional perturbations systematically influence strategic choices, though behaviors often remain unstable and misaligned with human patterns.
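Activation steering of this kind is typically implemented by adding a scaled direction vector to a hidden-layer activation, where the direction is the mean activation difference between emotion-eliciting and neutral prompts. The sketch below is a minimal toy illustration of that idea; the activations, layer choice, and `alpha` value are assumptions, not the paper's configuration.

```python
import numpy as np

def steer_activation(hidden, emotion_vec, alpha=0.5):
    """Add a scaled steering vector to a hidden-state activation.

    hidden: (d,) activation at some transformer layer
    emotion_vec: (d,) direction derived from emotion-eliciting texts
    alpha: steering strength (hypothetical value)
    """
    return hidden + alpha * emotion_vec

# Toy example: the steering vector is commonly taken as the mean
# difference between activations on emotional vs. neutral prompts.
neutral_acts = np.array([[0.1, 0.2], [0.3, 0.0]])
emotional_acts = np.array([[0.5, 0.9], [0.7, 0.7]])
emotion_vec = emotional_acts.mean(axis=0) - neutral_acts.mean(axis=0)

steered = steer_activation(np.array([1.0, 1.0]), emotion_vec, alpha=0.5)
```

At inference time the same vector would be added at a chosen layer on every forward pass, perturbing the model toward the elicited emotional state.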
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce Step-Saliency, a diagnostic tool that reveals how large reasoning models fail during multi-step reasoning tasks by identifying two critical information-flow breakdowns: shallow layers that ignore context and deep layers that lose focus on reasoning. They propose StepFlow, a test-time intervention that repairs these flows and improves model accuracy without retraining.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠AgentGate introduces a lightweight routing engine that optimizes how AI agents communicate and dispatch tasks across distributed systems by treating routing as a constrained decision problem rather than open-ended text generation. The system uses a two-stage approach (action decision followed by structural grounding) and demonstrates that compact 3B-7B parameter models can achieve competitive performance while operating under resource, latency, and privacy constraints.
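Treating routing as a constrained decision problem means the model scores a closed set of candidate actions instead of free-generating text, and a grounding step validates the chosen dispatch against the target's schema. A minimal sketch under assumed names (`route`, `ground`, and a toy scorer standing in for a compact model):

```python
def route(task, actions, score_fn):
    """Stage 1: action decision — pick the best action from a closed
    set rather than generating free-form text."""
    return max(actions, key=lambda a: score_fn(task, a))

def ground(action, schema):
    """Stage 2: structural grounding — accept a dispatch only if it
    carries every field the target agent's schema expects."""
    return all(field in action for field in schema)

# Toy lexical scorer standing in for a small model's scoring head.
def toy_score(task, action):
    return len(set(task.split()) & set(action["tags"]))

actions = [
    {"name": "search", "tags": {"find", "lookup"}, "agent": "retriever"},
    {"name": "summarize", "tags": {"condense", "brief"}, "agent": "writer"},
]
chosen = route("lookup the latest report", actions, toy_score)
ok = ground(chosen, schema=("name", "agent"))
```

Because the output space is enumerable, a small model only has to rank a handful of options, which is what makes 3B-7B models competitive here.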
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers have developed a method to control how detectable hallucinations are in multimodal language models, distinguishing obvious hallucinations (easily detected by humans) from elusive ones (harder to spot). Using a dataset of 4,470 human responses, they created targeted interventions that tune which types of hallucinations occur, enabling flexible control suited to different security and usability requirements.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose using Inductive Learning of Answer Set Programs (ILASP) to create interpretable approximations of neural networks trained on preference learning tasks. The approach combines dimensionality reduction through Principal Component Analysis with logic-based explanations, addressing the challenge of explaining black-box AI models while maintaining computational efficiency.
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce EmoMAS, a Bayesian multi-agent framework that enables small language models to perform sophisticated negotiation by treating emotional intelligence as a strategic variable. The system coordinates game-theoretic, reinforcement learning, and psychological agents to optimize negotiation outcomes while maintaining privacy through edge deployment, demonstrating performance comparable to larger models across high-stakes domains.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce CAFP, a post-processing framework that mitigates algorithmic bias by averaging predictions across factual and counterfactual versions of inputs where sensitive attributes are flipped. The model-agnostic approach eliminates the need for retraining or architectural modifications, making fairness interventions practical for deployed systems in high-stakes domains like credit scoring and criminal justice.
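The counterfactual-averaging idea can be sketched in a few lines: score the input once per possible value of the sensitive attribute and average the results, wrapping any trained model without retraining it. The scorer and attribute names below are hypothetical.

```python
def cafp_predict(model, x, sensitive_key, values):
    """Counterfactual-averaged post-processing: average the model's
    score over copies of x with the sensitive attribute set to each
    possible value. Model-agnostic; no retraining required."""
    preds = [model(dict(x, **{sensitive_key: v})) for v in values]
    return sum(preds) / len(preds)

# Toy biased scorer: penalizes group "B" directly.
def biased_model(x):
    base = 0.7 if x["income"] > 50 else 0.4
    return base - (0.2 if x["group"] == "B" else 0.0)

applicant_b = {"income": 60, "group": "B"}
applicant_a = {"income": 60, "group": "A"}
score_b = cafp_predict(biased_model, applicant_b, "group", ["A", "B"])
score_a = cafp_predict(biased_model, applicant_a, "group", ["A", "B"])
```

By construction the averaged score is identical for otherwise-equal applicants who differ only in the sensitive attribute, which is the fairness property being traded for a small shift in raw scores.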
🏢 Meta
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce A-MBER, a benchmark dataset designed to evaluate AI assistants' ability to recognize emotions based on long-term interaction history rather than immediate context. The benchmark tests whether models can retrieve relevant past interactions, infer current emotional states, and provide grounded explanations—revealing that memory's value lies in selective, context-aware interpretation rather than simple historical volume.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose T-STAR, a novel reinforcement learning framework that structures multi-step agent trajectories as trees rather than independent chains, enabling better credit assignment for LLM agents. The method uses tree-based reward propagation and surgical policy optimization to improve reasoning performance across embodied, interactive, and planning tasks.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce a declarative runtime protocol that externalizes agent state to measure how much of an LLM-based agent's competence actually derives from the language model versus explicit structural components. Testing on Collaborative Battleship, they find that explicit world-model planning drives most performance gains, while sparse LLM-based revision at 4.3% of turns yields minimal and sometimes negative returns.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce AI-Sinkhole, an AI-agent augmented DNS-blocking framework that dynamically detects and temporarily blocks LLM chatbot services during proctored exams to prevent academic integrity violations. The system uses quantized LLMs for semantic classification and Pi-Hole for network-wide DNS blocking, achieving robust cross-lingual detection with F1-scores exceeding 0.83.
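The time-boxed blocking half of such a system (separate from the semantic classifier) can be sketched as a deny list that is populated only while an exam window is active, in the style of a Pi-hole network-wide DNS block. The domain names and window representation below are illustrative, not from the paper.

```python
# Hypothetical chatbot domains; a real deployment would maintain a
# curated, regularly updated list.
CHATBOT_DOMAINS = {"chat.example-llm.com", "assistant.example-ai.net"}

def in_exam_window(now, windows):
    """Return True if the current time falls inside any proctored-exam
    window (each window is a (start, end) pair on the same clock)."""
    return any(start <= now <= end for start, end in windows)

def active_blocklist(now, windows):
    """Block chatbot domains only while an exam is running, so the
    services are restored automatically once the window closes."""
    return set(CHATBOT_DOMAINS) if in_exam_window(now, windows) else set()

blocked = active_blocklist(now=10.5, windows=[(10.0, 12.0)])
after_exam = active_blocklist(now=13.0, windows=[(10.0, 12.0)])
```

In the paper's framing, the quantized LLM classifier decides which queries and services count as chatbot traffic; the list it produces would feed a gate like this.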
AI · Bearish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers identified a critical robustness vulnerability in Qwen3-embedding models for conversational retrieval, where structured dialogue noise becomes disproportionately retrievable and contaminates search results. The problem remains invisible under standard benchmarks but is significantly more pronounced in Qwen3 than competing models, though lightweight query prompting effectively mitigates it.
AI · Neutral · arXiv – CS AI · Apr 10 · 5/10
🧠Researchers have developed an interactive visualization system that displays the complete 181,440-state space of the 8-puzzle problem using GPU-based rendering, enabling students to explore search algorithm behavior in real-time. The system demonstrates that full state-space visualization is technically feasible and educationally valuable for AI education, bridging abstract algorithmic concepts with concrete puzzle manipulation.
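The 181,440 figure follows because only tile arrangements of one parity are reachable from the goal, i.e. 9!/2 of the 362,880 permutations of the 3×3 board. A short breadth-first search from the goal state confirms the count:

```python
from collections import deque
from math import factorial

def neighbors(state):
    """Yield states reachable by sliding one tile into the blank (0)."""
    i = state.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            s = list(state)
            s[i], s[j] = s[j], s[i]
            yield tuple(s)

# BFS from the goal enumerates every reachable configuration.
goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
seen, queue = {goal}, deque([goal])
while queue:
    for nxt in neighbors(queue.popleft()):
        if nxt not in seen:
            seen.add(nxt)
            queue.append(nxt)

# Exactly half of the 9! permutations are reachable: 181,440 states.
reachable = len(seen)
```

This is the same enumeration the visualization system renders, just without the GPU layout.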
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose a composite architecture combining instruction-based refusal with a structural abstention gate to reduce hallucinations in large language models. The system uses a support deficit score derived from self-consistency, paraphrase stability, and citation coverage to block unreliable outputs, achieving better accuracy than either mechanism alone across multiple models.
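One plausible reading of the support deficit score is a weighted complement of the three support signals, gated by a threshold. The sketch below follows that reading; the equal weights and the 0.5 threshold are illustrative assumptions, not values from the paper.

```python
def support_deficit(self_consistency, paraphrase_stability,
                    citation_coverage, weights=(1/3, 1/3, 1/3)):
    """Combine three support signals (each in [0, 1]) into a deficit
    score; a higher deficit means weaker evidence for the answer."""
    signals = (self_consistency, paraphrase_stability, citation_coverage)
    support = sum(w * s for w, s in zip(weights, signals))
    return 1.0 - support

def abstention_gate(answer, deficit, threshold=0.5):
    """Structural gate: block outputs whose support deficit exceeds
    the threshold, returning None instead of an unreliable answer."""
    return answer if deficit <= threshold else None

deficit = support_deficit(0.9, 0.8, 0.7)  # strong support, low deficit
answer = abstention_gate("Paris", deficit)
```

The composite architecture in the paper pairs this structural gate with instruction-based refusal, so an answer must pass both before being emitted.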
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers present CGD-PD, a test-time decoding method that improves large language models' performance on three-way logical question answering (True/False/Unknown) by enforcing negation consistency and resolving epistemic uncertainty through targeted entailment probes. The approach achieves up to 16% relative accuracy improvements on the FOLIO benchmark while reducing spurious Unknown predictions.
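Negation consistency requires the label distribution on a statement to mirror the distribution on its negation (P(True|q) ≈ P(False|¬q), and vice versa); an entailment probe on the negation can then resolve a hedged Unknown. A sketch with hypothetical probability tables and assumed tolerances:

```python
def negation_consistent(p_q, p_not_q, tol=0.15):
    """Check that distributions over a statement q and its negation
    are mirror images within an assumed tolerance."""
    return (abs(p_q["True"] - p_not_q["False"]) <= tol
            and abs(p_q["False"] - p_not_q["True"]) <= tol)

def resolve_unknown(p_q, p_not_q, margin=0.2):
    """If the model hedges with Unknown but a probe on the negation
    is decisive, flip to the decisive label."""
    top = max(p_q, key=p_q.get)
    if top != "Unknown":
        return top
    if p_not_q["False"] - p_not_q["True"] > margin:
        return "True"   # negation confidently false => statement true
    if p_not_q["True"] - p_not_q["False"] > margin:
        return "False"
    return "Unknown"

p_q = {"True": 0.3, "False": 0.2, "Unknown": 0.5}       # hedged
p_not_q = {"True": 0.1, "False": 0.8, "Unknown": 0.1}   # decisive
label = resolve_unknown(p_q, p_not_q)
```

The example pair above is exactly the spurious-Unknown case the paper targets: the model hedges on q even though its own view of ¬q is decisive.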
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce Text2DistBench, a new benchmark for evaluating how well large language models understand distributional information—like trends and preferences across text collections—rather than just factual details. Built from YouTube comments about movies and music, the benchmark reveals that while LLMs outperform random baselines, their performance varies significantly across different distribution types, highlighting both capabilities and gaps in current AI systems.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose an ethical framework for sensor-fused health AI agents that combine biometric data with large language models. The paper identifies critical risks at the user-facing layer where sensor data is translated into health guidance, arguing that the perceived objectivity of biometrics can mask AI errors and turn them into harmful medical directives.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce SensorPersona, an LLM-based system that continuously extracts user personas from mobile sensor data rather than chat histories, achieving 31.4% higher recall in persona extraction and 85.7% win rate in personalized agent responses. The system processes multimodal sensor streams to infer physical patterns, psychosocial traits, and life experiences across longitudinal data collected from 20 participants over three months.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠A research study analyzes six leading large language models to identify shared cultural patterns revealed in their training data, finding consensus around themes like narrative meaning-making, status competition, and moral rationalization. The findings suggest LLMs function as 'cultural condensates' that compress how humans describe and contest their social lives across massive text datasets.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers conducted a comparative analysis of demonstration selection strategies for using large language models to predict users' next point-of-interest (POI) based on historical location data. The study found that simple heuristic methods like geographical proximity and temporal ordering outperform complex embedding-based approaches in both computational efficiency and prediction accuracy, with LLMs using these heuristics sometimes matching fine-tuned model performance without additional training.
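The geographic-proximity heuristic amounts to ranking a user's past check-ins by distance to the current location and taking the nearest k as in-context demonstrations, with no embedding model in the loop. A minimal sketch (the POI records and locations are invented):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))

def select_demonstrations(history, current_loc, k=2):
    """Geographic-proximity heuristic: use the k past check-ins
    closest to the current location as in-context examples."""
    return sorted(history,
                  key=lambda v: haversine_km(v["loc"], current_loc))[:k]

history = [
    {"poi": "cafe",    "loc": (48.8566, 2.3522)},  # central Paris
    {"poi": "museum",  "loc": (48.8606, 2.3376)},  # Louvre area
    {"poi": "airport", "loc": (49.0097, 2.5479)},  # CDG
]
demos = select_demonstrations(history, current_loc=(48.8584, 2.2945))
```

The selected demonstrations are then placed in the LLM prompt ahead of the prediction query; the study's point is that this cheap ranking can match embedding-based retrieval.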
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce DOVE, a distributional evaluation framework that measures how well large language models align with cultural values through open-ended text generation rather than multiple-choice tests. The framework uses rate-distortion optimization to create a value codebook and unbalanced optimal transport to assess alignment, demonstrating 31.56% correlation with downstream tasks across 12 LLMs while requiring only 500 samples per culture.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce chain-of-illocution (CoI) prompting to improve source faithfulness in retrieval-augmented language models, achieving up to 63% gains in source adherence for programming education tasks. The study reveals that standard RAG systems exhibit low fidelity to source materials, with non-RAG models performing worse, while a user study confirms improved faithfulness does not compromise user satisfaction.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠A research paper proposes adaptive risk management frameworks for governing frontier AI in public sectors through 2030, arguing that static compliance models are insufficient given rapid capability advancement and incomplete knowledge of AI harms. The work emphasizes that effective governance requires organizational redesign, stronger policy capacity, and scenario-aware regulation rather than purely technical solutions.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers studying 469 Canadian youth aged 16-24 developed a negotiation-based framework to understand privacy decision-making with smart voice assistants, introducing two tension indices (RBTI and CATI) that measure competing risk-benefit and control-acceptance pressures. The study reveals that frequent SVA users exhibit benefit-dominant profiles and accept convenience trade-offs, suggesting the privacy paradox reflects negotiation rather than inconsistency.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce DISSECT, a 12,000-question diagnostic benchmark that reveals a critical "perception-integration gap" in Vision-Language Models—where VLMs successfully extract visual information but fail to reason about it during downstream tasks. Testing 18 VLMs across Chemistry and Biology shows open-source models systematically struggle with integrating visual input into reasoning, while closed-source models demonstrate superior integration capabilities.