AI Pulse News

Models, papers, tools. 18,994 articles with AI-powered sentiment analysis and key takeaways.

18994 articles

AINeutralarXiv – CS AI · Apr 146/10

🧠

Belief-Aware VLM Model for Human-like Reasoning

Researchers propose a belief-aware Vision Language Model framework that enhances human-like reasoning by integrating retrieval-based memory and reinforcement learning. The approach addresses limitations in current VLMs and VLAs by approximating belief states through vector-based memory, demonstrating improved performance on vision-question-answering tasks compared to zero-shot baselines.

AINeutralarXiv – CS AI · Apr 146/10

🧠

COMPOSITE-Stem

Researchers introduced COMPOSITE-STEM, a new benchmark containing 70 expert-written scientific tasks across physics, biology, chemistry, and mathematics to evaluate AI agents. The top-performing model achieved only 21% accuracy, indicating the benchmark effectively measures capabilities beyond current AI reach and addresses the saturation of existing evaluation frameworks.

AINeutralarXiv – CS AI · Apr 146/10

🧠

GLEaN: A Text-to-image Bias Detection Approach for Public Comprehension

Researchers introduce GLEaN, a visual explainability method that transforms complex AI bias detection into understandable portrait composites, enabling non-technical audiences to grasp how text-to-image models like Stable Diffusion XL associate occupations and identities with specific demographic characteristics.

🧠 Stable Diffusion

AINeutralarXiv – CS AI · Apr 146/10

🧠

HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

Researchers introduced HealthAdminBench, a new evaluation framework with 135 tasks across realistic healthcare administration workflows, revealing that current AI agents achieve only 36.3% end-to-end success despite strong individual subtask performance. The benchmark demonstrates a critical gap between AI capabilities and the reliability requirements for automating healthcare administrative processes worth over $1 trillion annually.

🧠 GPT-5🧠 Claude🧠 Opus

AIBullisharXiv – CS AI · Apr 146/10

🧠

New Hybrid Fine-Tuning Paradigm for LLMs: Algorithm Design and Convergence Analysis Framework

Researchers propose a novel hybrid fine-tuning method for Large Language Models that combines full parameter updates with Parameter-Efficient Fine-Tuning (PEFT) modules using zeroth-order and first-order optimization. The approach addresses computational constraints of full fine-tuning while overcoming PEFT's limitations in knowledge acquisition, backed by theoretical convergence analysis and empirical validation across multiple tasks.

AINeutralarXiv – CS AI · Apr 146/10

🧠

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

Researchers introduced FinTrace, a benchmark dataset with 800 expert-annotated trajectories for evaluating how large language models perform financial tool-calling tasks. The study reveals that while frontier LLMs excel at selecting appropriate tools, they struggle significantly with information utilization and generating accurate final outputs, pointing to a critical reasoning gap that persists even after fine-tuning with preference optimization techniques.

AIBullisharXiv – CS AI · Apr 146/10

🧠

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

Researchers propose SVSR, a self-verification and self-rectification framework that enhances multimodal AI reasoning through a three-stage training approach combining preference datasets, supervised fine-tuning, and semi-online direct preference optimization. The method demonstrates improved accuracy and generalization across visual understanding tasks while maintaining performance even without explicit reasoning traces.

AINeutralarXiv – CS AI · Apr 146/10

🧠

STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems

Researchers introduce STARS, a framework for continuously auditing AI agent skill invocations in real-time by combining static capability analysis with request-conditioned risk modeling. The approach demonstrates improved detection of prompt injection attacks compared to static baselines, though remains most valuable as a triage layer rather than a complete replacement for pre-deployment screening.

AINeutralarXiv – CS AI · Apr 146/10

🧠

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

Researchers introduce TimeSeriesExamAgent, a scalable framework for automatically generating time series reasoning benchmarks using LLM agents and templates. The study reveals that while large language models show promise in time series tasks, they significantly underperform in abstract reasoning and domain-specific applications across healthcare, finance, and weather domains.

AINeutralarXiv – CS AI · Apr 146/10

🧠

Gypscie: A Cross-Platform AI Artifact Management System

Gypscie is a new cross-platform AI artifact management system that unifies the complexity of managing machine learning models across diverse infrastructure through a knowledge graph and rule-based query language. The system streamlines the entire AI model lifecycle—from data preparation through deployment and monitoring—while enabling explainability through provenance tracking.

AINeutralarXiv – CS AI · Apr 146/10

🧠

VeriTrans: Fine-Tuned LLM-Assisted NL-to-PL Translation via a Deterministic Neuro-Symbolic Pipeline

VeriTrans is a machine learning system that converts natural language requirements into formal logic suitable for automated solvers, using a validator-gated pipeline to ensure reliability. Achieving 94.46% correctness on 2,100 specifications, the system combines fine-tuned language models with round-trip verification and deterministic execution, enabling auditable translation for critical applications.

$PL$NL$CNF

AINeutralarXiv – CS AI · Apr 146/10

🧠

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

ClawVM is a virtual memory management system designed for stateful LLM agents that addresses critical failures in current context window management. The system implements typed pages, multi-resolution representations, and validated writeback protocols to ensure deterministic state residency and durability, adding minimal computational overhead.

AINeutralarXiv – CS AI · Apr 146/10

🧠

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Researchers introduce a multi-agent framework to map data lineage in large language models, revealing how post-training datasets evolve and interconnect. The analysis uncovers structural redundancy, benchmark contamination propagation, and proposes lineage-aware dataset construction to improve LLM training diversity and quality.

AIBullisharXiv – CS AI · Apr 146/10

🧠

CARO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation

Researchers introduce CARO, a two-stage training framework that enhances large language models' ability to perform robust content moderation through analogical reasoning. By combining retrieval-augmented generation with direct preference optimization, CARO achieves 24.9% F1 score improvement over state-of-the-art models including DeepSeek R1 and LLaMA Guard on ambiguous moderation cases.

AINeutralarXiv – CS AI · Apr 146/10

🧠

Cooperation in Human and Machine Agents: Promise Theory Considerations

A theoretical research paper examines Promise Theory as a framework for understanding cooperation between human and machine agents in autonomous systems. The work revisits established principles of agent cooperation to address how diverse components—humans, hardware, software, and AI—maintain alignment with intended purposes through signaling, trust, and feedback mechanisms.

AINeutralarXiv – CS AI · Apr 146/10

🧠

Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

Researchers introduce Agent Mentor, an open-source analytics pipeline that monitors and automatically improves AI agent behavior by analyzing execution logs and iteratively refining system prompts with corrective instructions. The framework addresses variability in large language model-based agent performance caused by ambiguous prompt formulations, demonstrating consistent accuracy improvements across multiple configurations.

AINeutralarXiv – CS AI · Apr 146/10

🧠

Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Researchers introduce Agent^2 RL-Bench, a benchmark testing whether LLM agents can autonomously design and execute reinforcement learning pipelines to improve foundation models. Testing across multiple agent systems reveals significant performance variation, with online RL succeeding primarily on ALFWorld while supervised learning pipelines dominate under fixed computational budgets.

AINeutralarXiv – CS AI · Apr 146/10

🧠

Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment

A new arXiv paper argues that AI alignment cannot rely solely on stated principles because their real-world application requires contextual judgment and interpretation. The research shows that a significant portion of preference-labeling data involves principle conflicts or indifference, meaning principles alone cannot determine decisions—and these interpretive choices often emerge only during model deployment rather than in training data.

AINeutralarXiv – CS AI · Apr 146/10

🧠

FedRio: Personalized Federated Social Bot Detection via Cooperative Reinforced Contrastive Adversarial Distillation

Researchers propose FedRio, a federated learning framework that enables social media platforms to collaboratively detect bot accounts without sharing raw user data. The system uses graph neural networks, adversarial learning, and reinforcement learning to improve bot detection accuracy while maintaining privacy across heterogeneous platform architectures.

AINeutralarXiv – CS AI · Apr 146/10

🧠

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Researchers introduce SciPredict, a benchmark testing whether large language models can predict scientific experiment outcomes across physics, biology, and chemistry. The study reveals that while some frontier models marginally exceed human experts (~20% accuracy), they fundamentally fail to assess prediction reliability, suggesting superhuman performance in experimental science requires not just better predictions but better calibration awareness.

AIBullisharXiv – CS AI · Apr 146/10

🧠

Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation

Researchers propose a method for training open-source language models to simulate how programming students learn and debug code, using authentic student data serialized into conversational formats. This approach addresses privacy and cost concerns with proprietary models while demonstrating improved performance in replicating student problem-solving behavior compared to existing baselines.

AIBullisharXiv – CS AI · Apr 146/10

🧠

Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making

Researchers propose Clinical Narrative-informed Preference Rewards (CN-PR), a machine learning framework that extracts reward signals from patient discharge summaries to train reinforcement learning models for treatment decisions. The approach achieves strong alignment with clinical outcomes, including improved organ support-free days and faster shock resolution, offering a scalable alternative to traditional reward design in healthcare AI.

AINeutralarXiv – CS AI · Apr 146/10

🧠

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

TorchUMM is an open-source unified codebase designed to standardize evaluation, analysis, and post-training of multimodal AI models across diverse architectures. The framework addresses fragmentation in the field by providing a single interface for benchmarking models on vision-language understanding, generation, and editing tasks, enabling reproducible comparisons and accelerating development of more capable multimodal systems.

🏢 Meta

AINeutralarXiv – CS AI · Apr 146/10

🧠

Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering

Researchers introduce TagCC, a novel deep clustering framework that combines Large Language Models with contrastive learning to enhance tabular data analysis by incorporating semantic knowledge from feature names and values. The approach bridges the gap between statistical co-occurrence patterns and intrinsic semantic understanding, demonstrating significant performance improvements over existing methods in finance and healthcare applications.

AINeutralarXiv – CS AI · Apr 146/10

🧠

CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

Researchers introduce CFMS, a two-stage framework combining multimodal large language models with symbolic reasoning to improve tabular data comprehension for question answering and fact verification tasks. The approach achieves competitive results on WikiTQ and TabFact benchmarks while demonstrating particular robustness with large tables and smaller model architectures.

← PrevPage 297 of 760Next →