🧠

AI

22,418 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.

22418 articles

AINeutralarXiv – CS AI · Jun 237/10

🧠

Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements

Researchers demonstrate that closed-loop automated machine learning systems can discover generalizable improvements in molecular property prediction by having language-model agents modify features, models, and acquire external evidence. Testing across 36 molecular endpoints reveals that while some improvements validate strongly, they don't consistently transfer to held-out test sets, highlighting critical challenges in ensuring reproducibility of AI-driven research discoveries.

AINeutralarXiv – CS AI · Jun 237/10

🧠

GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation

GroundEval introduces a deterministic framework for evaluating AI agents by auditing their evidence retrieval and reasoning paths rather than relying on LLM judges. The tool detected a critical failure case where frontier LLM judges scored an agent response above 0.85, but the actual trace revealed the agent never retrieved the artifact it cited, yielding a GroundEval score of 0.000.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Training Open Models for Agentic Phone Use

Researchers introduce PhoneBuddy, a training framework combining real device environments with mock-app simulations to improve AI agent performance on smartphone tasks. The approach achieves 45.33% success on real phones and 83.2% on test benchmarks, demonstrating that hybrid training surpasses either method alone.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Safety in Self-Evolving LLM Agent Systems: Threats, Amplification, and Case Studies

A new security analysis reveals that self-evolving LLM agent systems face critical vulnerabilities across 17 of 25 potential attack vectors, with adversarial compromises becoming permanently encoded and self-amplifying across system generations. Testing of open-source frameworks demonstrates 100% attack persistence rates, suggesting that autonomous AI systems capable of self-modification require fundamentally new security paradigms beyond traditional static defenses.

AIBullisharXiv – CS AI · Jun 237/10

🧠

ScalingAttention: Discovering Intrinsic Sparse Attention Topology for Video Diffusion Transformers

Researchers introduce ScalingAttention, a training-free framework that optimizes video diffusion transformers by discovering stable, sparse attention patterns encoded in model weights rather than computing them dynamically. The method achieves up to 1.90X speedup while maintaining superior video generation fidelity, addressing a critical computational bottleneck in AI-generated video production.

AIBullisharXiv – CS AI · Jun 237/10

🧠

NOEM$^{3}$A: a Neuro-symbolic Ontology-Enhanced Method for Multi-intent understanding in Mobile Agents

NOEM³A is a lightweight neuro-symbolic framework that enhances compact language models with intent ontologies to improve natural language understanding for mobile agents. By injecting structured symbolic knowledge into both input prompts and output decoding, the method achieves better performance on dialogue understanding tasks while maintaining privacy and low-latency requirements suitable for on-device deployment.

🧠 Llama

AIBullisharXiv – CS AI · Jun 237/10

🧠

Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently

Researchers prove theoretically that reinforcement learning with verifiable rewards (RLVR) enables language models to learn efficient backtracking strategies superior to supervised fine-tuning (SFT), achieving exponential computational advantages during inference. The study models chain-of-thought reasoning as graph pathfinding and demonstrates that RLVR trains models to identify difficult decision points, allowing better allocation of compute resources.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Attacking the Trusted Imagination: Oracle-Level Integrity Attacks on Imagine-then-Act World Models

Researchers demonstrate a novel attack vector against vision-language-action (VLA) policies that exploit the 'trusted imagination' component of world-action models rather than targeting reactive policies directly. By perturbing observations to corrupt latent trajectory predictions, attackers can fool downstream systems like safety gates and MPC planners while leaving the base policy unaffected, revealing a critical asymmetry in AI system robustness.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Peeking Inside LLMs: Leveraging Internal Artifacts of LLMs for Enhancing Reliability in Legal Classification

Researchers demonstrate that internal computational artifacts within Large Language Models can reliably detect when the model produces incorrect outputs in legal classification tasks. By analyzing these internal signals, downstream classifiers can identify hallucinated or erroneous predictions, potentially improving the reliability of LLM-based legal systems for high-stakes applications like bail decisions and statute violation predictions.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Latent Personal Memory: Represent personal memory as dynamic soft prompts

Researchers introduce Latent Personal Memory (LPM), a framework that personalizes large language models by encoding user-specific behavioral patterns as compact, interpretable latent slots converted into dynamic soft prompts. The approach achieves significant efficiency gains—outperforming LoRA and Prompt Tuning by up to 54.4% on benchmarks while reducing memory usage by 64x—making personalized LLMs more practical for deployment.

AIBullisharXiv – CS AI · Jun 237/10

🧠

ConnectomeBench2: A Unified Benchmark for Automated Connectomic Proofreading

Researchers released ConnectomeBench2, a unified benchmark dataset containing over 716,000 expert-labeled proofreading decisions for automated 3D brain reconstruction across four species. A Vision Transformer model trained on this dataset achieved human-level accuracy in identifying segmentation errors, advancing the automation of connectomic proofreading—a critical bottleneck in neuroscience research.

🏢 Hugging Face

AIBullisharXiv – CS AI · Jun 237/10

🧠

The Metanym Game: A Self-Contained, Self-Consistent LLM Peer-Community Benchmark for Structural Intelligence

Researchers introduce the Metanym Game, a novel LLM benchmark that measures structural intelligence through competitive word games where AI models generate and evaluate content without pre-existing test sets. Using spectral analysis on evaluator ratings, the benchmark achieves contamination-resistance and reveals that generation and judging skills dissociate significantly across models, with a self-governing council structure enabling dynamic competitive scaling.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Vesta: A Generalist Embodied Reasoning Model

Researchers introduce Vesta, a unified foundation model for robotics that consolidates localization, spatial reasoning, navigation, and planning into a single generalist system rather than relying on multiple specialist models. The approach outperforms individual state-of-the-art baselines by over 20% and improves real-world robotic task success by 35%, demonstrating that generalist models can match or exceed specialized alternatives while reducing computational overhead and error cascades.

AIBearisharXiv – CS AI · Jun 237/10

🧠

How Much Coordination Gain Is Real? A Paired Noise-Floor Protocol for Multi-Agent LLM Benchmarks

A technical study challenges the validity of reported improvements in multi-agent LLM coordination architectures by establishing a noise-floor baseline using Claude Haiku. The research reveals that paired configuration-equivalent trials produce statistical gaps of ±5pp at best, suggesting that seven of ten recent coordination papers report headline effects within or below this noise floor, raising questions about reproducibility and the actual gains from proposed architectures.

🧠 Claude🧠 Haiku

AIBearisharXiv – CS AI · Jun 237/10

🧠

The Unseen Hand: Manipulating Model Fairness and SHAP with Targeted Identity Re-Association Attacks

Researchers have discovered a new class of attacks called Targeted Identity Re-Association (TIRA) that can manipulate machine learning fairness audits and SHAP explainability tools without leaving detectable traces. The attacks use probabilistic output manipulation techniques to mask the influence of protected features, demonstrating that critical AI accountability mechanisms are vulnerable to sophisticated gaming.

AINeutralarXiv – CS AI · Jun 237/10

🧠

Beyond Simpson's Paradox: A Cascade of Confounders in AI Agent Pull-Request Co-Authorship

A rigorous analysis of AI coding agents reveals that apparent benefits of human co-authorship in pull requests disappear under proper statistical controls, demonstrating how Simpson's Paradox and confounding variables can mask true causal relationships in AI agent research.

🏢 Microsoft🧠 Claude

AIBullisharXiv – CS AI · Jun 237/10

🧠

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

SpotAttention is a lightweight machine learning technique that reduces computational costs for large language models processing long text sequences. By learning to identify only the most relevant tokens to attend to, it achieves 3.9x faster decoding speeds while maintaining accuracy at context lengths eight times longer than training, addressing a critical efficiency bottleneck in modern LLMs.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Measuring Behavior Portability in Large Language Models

A new research framework reveals that large language models exhibit inconsistent behavior across structurally equivalent decision environments, demonstrating significant portability losses when behavioral patterns learned in one setting are applied to another. The findings suggest that LLM evaluations based on single environments may be unreliable for predicting real-world autonomous decision-making performance.

AIBearisharXiv – CS AI · Jun 237/10

🧠

MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents

Researchers have identified a sophisticated vulnerability in multimodal AI web agents through MIRAGE, a visual prompt injection attack that exploits trusted web platforms by embedding hidden adversarial instructions within legitimate ad slots or widgets. The attack demonstrates how constrained attackers can manipulate MLLM-based automation tools like SeeAct and OpenClaw without detection, raising critical security concerns for AI-powered browser automation systems.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Human and AI collaboration for pulmonary nodule segmentation

Hi-Seg, a human-in-the-loop segmentation framework built on the Segment Anything Model, achieved 85% accuracy in pulmonary nodule detection across 1,179 patients, outperforming five state-of-the-art AI models by 10-22%. The research demonstrates that non-experts with brief training can match junior medical professionals' performance, suggesting foundation models can be safely integrated into clinical workflows while reducing annotator burden.

AIBullisharXiv – CS AI · Jun 237/10

🧠

XmoPipe: A Pipeline for Large-Scale In-the-Wild Human Motion Dataset Construction

XmoPipe is a scalable pipeline that constructs large-scale human motion datasets by extracting 3D body and facial motion from unconstrained online videos, combined with automated textual descriptions. The system demonstrates that motion models trained on this in-the-wild data achieve performance comparable to traditional marker-based motion capture datasets while offering superior scalability and diversity.

AIBullisharXiv – CS AI · Jun 237/10

🧠

RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

Researchers introduce RigorBench, the first benchmark measuring process discipline in AI coding agents beyond mere outcome correctness. The study demonstrates that structured engineering practices improve both process quality by 41% and code correctness by 17%, establishing that how AI agents approach coding tasks matters as significantly as their final results.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Benchmarking Robot Memory Under Interference

Researchers introduce RoboMME-Interference, a benchmark testing how robot memory systems perform across multiple sessions with irrelevant distractions. Testing current memory-augmented AI models reveals significant performance degradation as unrelated sessions accumulate, highlighting a critical gap in long-context robustness for real-world robot deployment.

AINeutralarXiv – CS AI · Jun 237/10

🧠

First-Token Broadcasters: Mechanistic Origins of Language Identity and Distributed Robustness in Transformers

Researchers identify specific attention heads in multilingual language models responsible for language switching errors, revealing that instruction tuning reorganizes these circuits to concentrate language identity signals in early layers. The study demonstrates that language selection operates through a distributed but hierarchical mechanism, with compensation patterns following predictable feedforward cascades rather than global diffusion.

AIBullisharXiv – CS AI · Jun 237/10

🧠

FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation

Researchers introduce FOCA, a new framework for improving Vision-Language-Action (VLA) models in robotic control with limited training data. The method achieves significant performance gains in few-shot learning scenarios, reaching 95.7% success on benchmark tasks with just 20 demonstrations and up to 26% improvements on real robots.

← PrevPage 14 of 897Next →