11,686 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers have developed a new AI architecture that learns high-level symbolic skills from minimal low-level demonstrations, enabling robots to manipulate objects and execute complex tasks in unseen environments. The system combines neural networks for symbol discovery with visual language models for high-level planning and gradient-based methods for low-level execution.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers have developed MagicAgent, a series of foundation models designed for generalized AI agent planning that outperforms existing sub-100B models and even surpasses leading ultra-scale models like GPT-5.2. The models achieve superior performance through a novel synthetic data framework and two-stage training paradigm that addresses gradient interference in multi-task learning.
AI Neutral · arXiv – CS AI · Mar 3 · 7/10 · 5
🧠Researchers introduce DAG-Math, a new framework for evaluating mathematical reasoning in Large Language Models that models Chain-of-Thought as rule-based processes over directed acyclic graphs. The framework includes a 'logical closeness' metric that reveals significant differences in reasoning quality between LLM families, even when final answer accuracy appears comparable.
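The DAG framing can be sketched in a few lines. The function names and the closeness definition below are illustrative stand-ins, not the paper's actual metric: each step lists its premises, and a derivation is well-formed only if every premise is an earlier step.

```python
# Toy sketch of chain-of-thought as a DAG: each step lists the
# earlier steps it depends on. The "closeness" score below (fraction
# of steps whose premises all precede them) is an illustrative
# stand-in for the paper's logical-closeness metric.

def is_valid_dag(premises):
    """premises: ordered dict mapping step -> list of premise steps."""
    order = {s: i for i, s in enumerate(premises)}
    return all(
        d in order and order[d] < order[s]
        for s, deps in premises.items()
        for d in deps
    )

def closeness(premises):
    order = {s: i for i, s in enumerate(premises)}
    good = sum(
        all(d in order and order[d] < order[s] for d in deps)
        for s, deps in premises.items()
    )
    return good / len(premises)

cot = {"s1": [], "s2": ["s1"], "s3": ["s1", "s2"]}  # sound derivation
bad = {"s1": ["s2"], "s2": []}  # premise used before it is derived
print(is_valid_dag(cot), closeness(cot))  # True 1.0
print(is_valid_dag(bad), closeness(bad))  # False 0.5
```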
AI Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers introduce 'Control Tax' - a framework to quantify the operational and financial costs of implementing AI safety oversight mechanisms. The study provides theoretical models and empirical cost estimates to help organizations balance AI safety measures with economic feasibility in real-world deployments.
AI Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers introduced CityLens, a comprehensive benchmark for evaluating Large Vision-Language Models' ability to predict socioeconomic indicators from urban imagery. The study tested 17 state-of-the-art LVLMs across 11 prediction tasks using data from 17 global cities, revealing promising capabilities but significant limitations in urban socioeconomic analysis.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers developed mCLM, a 3-billion parameter modular Chemical Language Model that generates functional molecules compatible with automated synthesis by tokenizing at the building block level rather than individual atoms. The AI system outperformed larger models including GPT-5 in creating synthesizable drug candidates and can iteratively improve failed clinical trial compounds.
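The building-block idea can be illustrated with a toy tokenizer. The two-fragment vocabulary and SMILES string below are invented for illustration; mCLM's real blocks come from synthesis-compatible chemistry.

```python
# Toy contrast: atom/character-level vs building-block tokenization
# of a SMILES string. The fragment vocabulary is invented for
# illustration only.

def atom_tokenize(smiles):
    return list(smiles)

def block_tokenize(smiles, blocks):
    # Greedy longest-match over the fragment vocabulary, falling
    # back to single characters when no fragment matches.
    tokens, i = [], 0
    frags = sorted(blocks, key=len, reverse=True)
    while i < len(smiles):
        for frag in frags:
            if smiles.startswith(frag, i):
                tokens.append(frag)
                i += len(frag)
                break
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens

blocks = {"c1ccccc1", "C(=O)N"}      # benzene ring, amide linker
smi = "c1ccccc1C(=O)N"
print(len(atom_tokenize(smi)))       # 14 character tokens
print(block_tokenize(smi, blocks))   # ['c1ccccc1', 'C(=O)N']
```

Two building-block tokens instead of fourteen characters is what keeps generated molecules aligned with fragments an automated synthesis line can actually assemble.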
AI Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers identify a 'safety mirage' problem in vision language models where supervised fine-tuning creates spurious correlations that make models vulnerable to simple attacks and overly cautious with benign queries. They propose machine unlearning as an alternative that reduces attack success rates by up to 60.27% and unnecessary rejections by over 84.20%.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers introduce Kiwi-Edit, a new video editing architecture that combines instruction-based and reference-guided editing for more precise visual control. The team created RefVIE, a large-scale dataset for training, and achieved state-of-the-art results in controllable video editing through a unified approach that addresses limitations of natural language descriptions.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers have developed SageBwd, a trainable INT8 attention mechanism that can match full-precision attention performance during pre-training while quantizing six of seven attention matrix multiplications. The study identifies key factors for stable training including QK-norm requirements and the impact of tokens per step on quantization errors.
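A minimal sketch of what INT8 quantization of one matmul looks like, using symmetric per-tensor scales. SageBwd's actual design (per-block scaling, the choice of which of the seven matmuls stays in higher precision, QK-norm) is more involved.

```python
import numpy as np

# Symmetric per-tensor INT8 quantization of a single matmul:
# round to int8 with a shared scale, accumulate in int32, rescale.

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    return (qa.astype(np.int32) @ qb.astype(np.int32)) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8))
b = rng.standard_normal((8, 4))
err = np.abs(int8_matmul(a, b) - a @ b).max()
print(f"max abs error vs full precision: {err:.4f}")
```

The int32 accumulator matters: summing int8 products directly would overflow, which is one reason quantized training is sensitive to how many tokens (and thus how many accumulated terms) each step sees.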
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers introduce DRAGON, a new framework that combines Large Language Models with metaheuristic optimization to solve large-scale combinatorial optimization problems. The system decomposes complex problems into manageable subproblems and achieves near-optimal results on datasets with over 3 million variables, overcoming the scalability limitations of existing LLM-based solvers.
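The decomposition idea can be shown on a toy separable problem, with brute force standing in for the subsolver; DRAGON's LLM-guided decomposition of coupled, million-variable instances is far more sophisticated.

```python
from itertools import product

# Toy decompose-and-solve: a selection problem that splits into
# independent blocks, each with its own capacity, so solving the
# blocks separately recovers the global optimum. Brute force
# stands in for the metaheuristic subsolver.

def solve_block(values, weights, cap):
    best = 0
    for x in product((0, 1), repeat=len(values)):
        weight = sum(w * xi for w, xi in zip(weights, x))
        if weight <= cap:
            best = max(best, sum(v * xi for v, xi in zip(values, x)))
    return best

blocks = [
    ([3, 4], [2, 3], 3),   # (values, weights, capacity)
    ([5, 1], [4, 1], 4),
    ([2, 2], [1, 1], 2),
]
total = sum(solve_block(v, w, c) for v, w, c in blocks)
print(total)  # 4 + 5 + 4 = 13
```

The hard part in practice is that real instances are not separable; deciding where to cut so the subproblems stay nearly independent is exactly what the LLM component is used for.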
AI Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers discovered that large reasoning models (LRMs) suffer from inconsistent answers due to competing mechanisms between Chain-of-Thought reasoning and memory retrieval. They developed FARL, a new fine-tuning framework that suppresses retrieval shortcuts to promote genuine reasoning capabilities in AI models.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers introduce Robometer, a new framework for training robot reward models that combines progress tracking with trajectory comparisons to better learn from failed attempts. The system is trained on RBM-1M, a dataset of over one million robot trajectories including failures, and shows improved performance across diverse robotics applications.
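A generic pairwise-comparison objective shows why failed trajectories remain useful training signal. This Bradley-Terry-style loss is a stand-in; the summary does not give Robometer's exact objective.

```python
import math

# Bradley-Terry-style pairwise loss for a reward model: a failed
# trajectory still carries signal as the losing side of a pair.

def pair_loss(r_preferred, r_rejected):
    # -log sigmoid(r_preferred - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# Scores a hypothetical reward model assigns to three attempts:
success, near_miss, failure = 2.0, 0.5, -1.0
print(round(pair_loss(success, failure), 3))   # small: pair ranked correctly
print(round(pair_loss(failure, success), 3))   # large: pair ranked wrongly
```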
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers propose GenDB, a database system that uses Large Language Models to synthesize query-execution code instead of relying on a traditionally engineered query processor. Early prototype testing shows GenDB outperforming established systems such as DuckDB, Umbra, and PostgreSQL on OLAP workloads.
AI Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers developed a new graph concept bottleneck layer (GCBM) that can be integrated into Graph Neural Networks to make their decision-making process more interpretable. The method treats graph concepts as 'words' and uses language models to improve understanding of how GNNs make predictions, achieving state-of-the-art performance in both classification accuracy and interpretability.
AI Bearish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers have identified critical security vulnerabilities in Computer-Use Agents (CUAs) through Visual Prompt Injection attacks, where malicious instructions are embedded in user interfaces. Their VPI-Bench study shows CUAs can be deceived at rates up to 51% and Browser-Use Agents up to 100% on certain platforms, with current defenses proving inadequate.
AI Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers prove that gradient descent in neural networks converges to optimal robustness margins at an extremely slow rate of Θ(1/ln(t)), even in simplified two-neuron settings. This establishes the first explicit lower bound on convergence rates for robustness margins in non-linear models, revealing fundamental limitations in neural network training efficiency.
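To see how slow a Θ(1/ln t) rate is: driving a 1/ln(t) gap below ε requires t > e^(1/ε) steps. This is a back-of-the-envelope illustration, not the paper's exact constants.

```python
import math

# A 1/ln(t) gap drops below eps only once t > exp(1/eps), so
# halving the target precision squares the required step count:
# exp(2/eps) = exp(1/eps) ** 2.

for eps in (0.5, 0.1, 0.05):
    t_needed = math.exp(1.0 / eps)
    print(f"gap < {eps:>4}: need t > {t_needed:.3g} steps")
```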
AI Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers introduce Interaction2Code, the first benchmark for evaluating Multimodal Large Language Models' ability to generate interactive webpage code from prototypes. The study identifies four critical limitations in current MLLMs and proposes enhancement strategies to improve their performance on dynamic web interactions.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers developed OS-Det3D, a two-stage framework for camera-based 3D object detection in autonomous vehicles that can identify unknown objects beyond predefined categories. The system uses LiDAR geometric cues and a joint selection module to discover novel objects while improving detection of known objects, addressing safety risks in real-world driving scenarios.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Surge AI introduces CoreCraft, the first environment in EnterpriseBench for training AI agents on realistic enterprise workflows. Training GLM 4.6 on this high-fidelity customer support simulation improved task performance from 25% to 37% and showed positive transfer to other benchmarks, demonstrating that quality training environments enable generalizable AI capabilities.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 2
🧠Researchers propose Intervened Preference Optimization (IPO) to address safety issues in Large Reasoning Models, where chain-of-thought reasoning contains harmful content even when final responses appear safe. The method achieves over 30% reduction in harmfulness while maintaining reasoning performance.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers demonstrated that large language models can improve multi-hop reasoning performance by training on rule-generated synthetic data instead of expensive human annotations or frontier LLM outputs. The study found that LLMs trained on synthetic fictional data performed better on real-world question-answering benchmarks by learning fundamental knowledge composition skills.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers developed SpiroLLM, the first multimodal large language model capable of understanding spirogram time series data for COPD diagnosis. Using data from 234,028 UK Biobank individuals, the model achieved 0.8977 diagnostic AUROC and maintained 100% valid response rate even with missing data, far outperforming text-only models.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers have identified that the 'reversal curse' in language models - their inability to infer 'B is A' from 'A is B' - can be overcome through bilinear representation structures. Training models on synthetic relational knowledge graphs creates internal geometries that enable consistent model editing and logical inference of reverse facts.
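Why a bilinear structure dissolves the reversal asymmetry can be seen in a toy score (an illustrative setup, not the paper's construction): if score(a, r, b) = eₐᵀ W_r e_b, the reverse relation is just W_rᵀ, so any fact learned forward is automatically consistent backward.

```python
import numpy as np

# Toy bilinear relation score: score(a, r, b) = e_a @ W_r @ e_b.
# The reverse relation is scored by W_r.T, and x @ W @ y equals
# y @ W.T @ x, so "A rel B" and "B rel^-1 A" always agree.

rng = np.random.default_rng(1)
d = 16
e = {name: rng.standard_normal(d) for name in ("A", "B")}
W_rel = rng.standard_normal((d, d))   # e.g. "is the parent of"

def score(W, a, b):
    return float(e[a] @ W @ e[b])

forward = score(W_rel, "A", "B")      # "A rel B"
backward = score(W_rel.T, "B", "A")   # reverse fact via transpose
print(abs(forward - backward))        # zero up to float rounding
```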
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5
🧠Researchers have developed DeepMedix-R1, a foundation model for chest X-ray interpretation that provides transparent, step-by-step reasoning alongside accurate diagnoses to address the black-box problem in medical AI. The model uses reinforcement learning to align diagnostic outputs with clinical plausibility and significantly outperforms existing models in report generation and visual question answering tasks.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers have developed EigenBench, a new black-box method for measuring how well AI language models align with human values. The system uses an ensemble of models to judge each other's outputs against a given constitution, producing alignment scores that closely match human evaluator judgments.
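One plausible way such mutual judgments could be aggregated, suggested by the name but not confirmed by the summary, is a principal-eigenvector score over the matrix of pairwise ratings:

```python
import numpy as np

# Hypothetical aggregation of mutual judgments (suggested by the
# name "EigenBench"; the summary does not specify the method):
# score models by the principal eigenvector of the rating matrix,
# so praise from well-regarded judges counts for more.

J = np.array([
    [0.0, 0.9, 0.4],   # row i holds model i's ratings of the others
    [0.8, 0.0, 0.5],
    [0.7, 0.6, 0.0],
])
vals, vecs = np.linalg.eig(J.T)   # J.T because scores weight the judges
v = np.abs(vecs[:, np.argmax(vals.real)].real)
scores = v / v.sum()
print(scores)   # model 2, rated lowest by its peers, scores lowest
```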