y0news
🧠 AI
12,731 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.


On the Step Length Confounding in LLM Reasoning Data Selection

Researchers identify a critical flaw in naturalness-based data selection methods for large language model reasoning datasets, where algorithms systematically favor longer reasoning steps rather than higher-quality reasoning. The study proposes two corrective methods (ASLEC-DROP and ASLEC-CASL) that successfully mitigate this 'step length confounding' bias across multiple LLM benchmarks.

AI · Bearish · arXiv – CS AI · Apr 10 · 6/10

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

Researchers introduce MedDialBench, a comprehensive benchmark testing how large language models maintain diagnostic accuracy when patients exhibit adversarial behaviors across five dimensions. The study reveals that fabricating symptoms causes 1.7-3.4x larger accuracy drops than withholding information, with worst-case performance degradation ranging from 38.8 to 54.1 percentage points across tested models.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

SentinelSphere: Integrating AI-Powered Real-Time Threat Detection with Cybersecurity Awareness Training

SentinelSphere is an AI-powered cybersecurity platform combining machine learning-based threat detection with LLM-driven security training to address both technical vulnerabilities and human-factor weaknesses in enterprise security. The system uses an Enhanced DNN model trained on benchmark datasets for real-time threat identification and deploys a quantized Phi-4 model for accessible security education, validated by industry professionals as intuitive and effective.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Researchers introduce Sol-RL, a two-stage reinforcement learning framework that combines FP4 quantization for efficient rollout generation with BF16 precision for policy optimization in diffusion models. The approach achieves up to 4.64x training acceleration while maintaining alignment quality, addressing the computational bottleneck of scaling RL-based post-training on large foundational models like FLUX.1.
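The low-precision rollout idea can be illustrated with a toy quantizer. The sketch below uses uniform symmetric 4-bit quantization as a simplification (real FP4 uses a floating-point level grid, and Sol-RL's actual quantization scheme is not detailed in the summary); all names here are illustrative.

```python
def quantize_4bit(xs, n_levels=16):
    # Uniform symmetric 4-bit quantization: map each value to one of 16
    # integer codes (a simplification of FP4's floating-point grid).
    max_abs = max(abs(x) for x in xs) or 1.0
    scale = max_abs / (n_levels // 2 - 1)
    lo, hi = -(n_levels // 2), n_levels // 2 - 1
    codes = [max(lo, min(hi, round(x / scale))) for x in xs]
    return codes, scale

def dequantize_4bit(codes, scale):
    # Recover approximate float values from the 4-bit codes.
    return [c * scale for c in codes]

weights = [0.9, -0.31, 0.05, -0.74]
codes, scale = quantize_4bit(weights)
approx = dequantize_4bit(codes, scale)
```

The quantized values stay within one scale step of the originals, which is why cheap 4-bit rollouts can remain useful while the precise BF16 copy handles the gradient updates.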

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Multi-modal user interface control detection using cross-attention

Researchers have developed an enhanced version of YOLOv5 that combines visual and textual data through cross-attention mechanisms to improve UI control detection in software screenshots. Tested on over 16,000 annotated images across 23 control classes, the multi-modal approach significantly outperforms pixel-only detection, with convolutional fusion showing the strongest results for semantically complex elements.
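A minimal sketch of the cross-attention fusion step, assuming standard scaled dot-product attention where each visual region feature queries the text features (the paper's actual YOLOv5 integration is not specified in the summary, and all names here are illustrative):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(visual_feats, text_feats):
    # Each visual feature acts as a query over the text features
    # (keys/values), producing a text-informed vector that can then be
    # fused back into the visual detection stream.
    d = len(text_feats[0])
    fused = []
    for q in visual_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in text_feats]
        weights = softmax(scores)
        fused.append([sum(w * v[i] for w, v in zip(weights, text_feats))
                      for i in range(d)])
    return fused

# One visual query attending over two text feature vectors.
fused = cross_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because the query aligns with the first text vector, the output leans toward it, which is the mechanism that lets textual cues (e.g. a control's label) disambiguate visually similar UI elements.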

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

Researchers developed the Strategic Courtroom Framework, a multi-agent simulation where LLM-based prosecution and defense teams engage in iterative legal argumentation with trait-conditioned personalities. Testing across 7,000+ simulated trials revealed that diverse teams with complementary traits outperform homogeneous ones, and a reinforcement learning system can dynamically optimize team composition, demonstrating language as a strategic action space in adversarial domains.

🧠 Gemini
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

KITE is a training-free system that converts long robot execution videos into compact, interpretable tokens for vision-language models to analyze robot failures. The approach combines keyframe extraction, open-vocabulary detection, and bird's-eye-view spatial representations to enable failure detection, identification, localization, and correction without requiring model fine-tuning.
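The keyframe-extraction step can be approximated with simple frame differencing, a common baseline; KITE's actual extraction method is not described in the summary, so this is an illustrative sketch only.

```python
def extract_keyframes(frames, threshold=0.5):
    # Keep the first frame, then any frame whose mean absolute pixel
    # difference from the last kept frame exceeds the threshold.
    keep = [0]
    for i in range(1, len(frames)):
        ref = frames[keep[-1]]
        diff = sum(abs(a - b) for a, b in zip(frames[i], ref)) / len(ref)
        if diff > threshold:
            keep.append(i)
    return keep

# Four tiny "frames" (flattened pixels): a static scene, then a scene change.
frames = [[0.0] * 4, [0.0] * 4, [1.0] * 4, [1.0] * 4]
keyframes = extract_keyframes(frames)
```

Only frames 0 and 2 survive, compressing the static stretches, which is the property that keeps long execution videos within a VLM's context budget.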

AI · Bearish · arXiv – CS AI · Apr 10 · 6/10

The Impact of Steering Large Language Models with Persona Vectors in Educational Applications

Researchers studied how persona vectors—AI steering techniques that inject personality traits into large language models—affect educational applications like essay generation and automated grading. The study found that persona steering significantly degrades answer quality, with substantially larger negative impacts on open-ended humanities tasks compared to factual science questions, and reveals that AI scorers exhibit predictable bias patterns based on assigned personality traits.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Mixed-Initiative Context: Structuring and Managing Context for Human-AI Collaboration

Researchers propose Mixed-Initiative Context, a framework that reconceptualizes how multi-turn AI interactions are managed by treating context as an explicit, structured, and dynamically adjustable object rather than a fixed chronological sequence. The approach enables both humans and AI to actively participate in context construction, addressing current limitations where irrelevant exchanges clutter context windows and users lack direct control mechanisms.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Designing Safe and Accountable GenAI as a Learning Companion with Women Banned from Formal Education

Researchers conducted a participatory design study with 20 Afghan women excluded from formal education to understand how generative AI can safely support their learning and career development. The study reveals that women view GenAI as a compensatory peer and mentor rather than an information source, while identifying critical needs around privacy protection, cultural safety, and pedagogically sound guidance.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

Researchers evaluated how well large language models can perform formal grammar-based translation tasks using in-context learning, finding that LLM translation accuracy degrades significantly with grammar complexity and sentence length. The study identifies specific failure modes including vocabulary hallucination and untranslated source words, revealing fundamental limitations in LLMs' ability to apply formal grammatical rules to translation tasks.

AI · Bearish · arXiv – CS AI · Apr 10 · 6/10

Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Researchers found that large language models experience accuracy drops of 0.3% to 5.9% when math problems are presented in unfamiliar cultural contexts, even when the underlying mathematical logic remains identical. Testing 14 models across culturally adapted variants of the GSM8K benchmark reveals that LLM mathematical reasoning is not culturally neutral, with errors stemming from both reasoning failures and calculation mistakes.

🏢 OpenAI · 🏢 Anthropic · 🧠 Claude
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Researchers introduce Commander-GPT, a modular framework that orchestrates multiple specialized AI agents for multimodal sarcasm detection rather than relying on a single LLM. The system achieves 4.4-11.7% F1 score improvements over existing baselines on standard benchmarks, demonstrating that task decomposition and intelligent routing can overcome LLM limitations in understanding sarcasm.
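The divide-and-route idea can be sketched with a toy dispatcher. Commander-GPT's real routing uses an LLM commander rather than keyword matching, so the specialists and keyword sets below are purely hypothetical stand-ins for the decomposition pattern.

```python
# Hypothetical specialists keyed by the input cues they handle.
SPECIALISTS = {
    "text_agent":   {"caption", "tweet", "reply"},
    "image_agent":  {"photo", "meme", "screenshot"},
    "fusion_agent": {"photo", "caption"},
}

def route(tokens, specialists=SPECIALISTS):
    # Dispatch to the specialist whose cue set overlaps the input most;
    # inputs spanning both modalities land on the fusion specialist.
    scores = {name: len(cues & tokens) for name, cues in specialists.items()}
    return max(scores, key=scores.get)
```

A text-only input goes to the text specialist, while a paired image-and-caption input routes to the fusion specialist, mirroring how task decomposition keeps each agent on the sub-problem it handles best.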

🧠 GPT-4 · 🧠 Gemini
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity

Researchers developed a multimodal generative AI pipeline that creates synthetic residential building datasets from publicly available county records and images, addressing critical data scarcity challenges in building energy modeling. The system achieves over 65% overlap with national reference data, enabling scalable energy research and urban simulations without relying on expensive or privacy-restricted datasets.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Researchers introduce OneLife, a framework for learning symbolic world models from minimal unguided exploration in complex, stochastic environments. The approach uses conditionally-activated programmatic laws within a probabilistic framework and demonstrates superior performance on 16 of 23 test scenarios, advancing autonomous construction of world models for unknown environments.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment

Researchers demonstrate that large language models exhibit critical control failures in causal reasoning, where they produce sound logical arguments but abandon them under social pressure or authority hints. The study introduces CAUSALT3, a benchmark revealing three reproducible pathologies, and proposes Regulated Causal Anchoring (RCA), an inference-time mitigation technique that validates reasoning consistency without retraining.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

AdaProb: Efficient Machine Unlearning via Adaptive Probability

Researchers propose AdaProb, a machine unlearning method that enables trained AI models to efficiently forget specific data while preserving privacy and complying with regulations like GDPR. The approach uses adaptive probability distributions and demonstrates 20% improvement in forgetting effectiveness with 50% less computational overhead compared to existing methods.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges

Researchers have developed a comprehensive evaluation framework for Large Language Models applied to outpatient referral systems in healthcare, revealing that LLMs offer limited advantages over simpler BERT-like models in static referral tasks but demonstrate potential in interactive dialogue scenarios. The study addresses the absence of standardized evaluation criteria for assessing LLM effectiveness in dynamic healthcare settings.

AI · Bearish · arXiv – CS AI · Apr 10 · 6/10

A Study of LLMs' Preferences for Libraries and Programming Languages

A new empirical study reveals that eight major LLMs exhibit systematic biases in code generation, overusing popular libraries like NumPy in 45% of cases and defaulting to Python even when unsuitable, prioritizing familiarity over task-specific optimality. The findings highlight gaps in current LLM evaluation methodologies and underscore the need for targeted improvements in training data diversity and benchmarking standards.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Toward Memory-Aided World Models: Benchmarking via Spatial Consistency

Researchers introduced a new benchmark dataset for evaluating world models' ability to maintain spatial consistency across long sequences, addressing a critical gap in AI evaluation. The dataset, collected from Minecraft environments with 20 million frames across 150 locations, enables development of memory-augmented models that can reliably simulate physical spaces for downstream tasks like planning and simulation.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

In-Context Decision Making for Optimizing Complex AutoML Pipelines

Researchers propose PS-PFN, an advanced AutoML method that extends traditional algorithm selection and hyperparameter optimization to handle modern ML pipelines with fine-tuning and ensembling. Using posterior sampling and prior-data fitted networks for in-context learning, the approach outperforms existing bandit and AutoML strategies on benchmark tasks.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge

Researchers demonstrate that Large Language Models used as judges suffer from score range bias, where evaluation outputs are highly sensitive to predefined scoring scales. Using contrastive decoding techniques, they achieve up to 11.7% improvement in alignment with human judgments across different score ranges.
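A minimal sketch of the contrastive idea, assuming the standard expert-vs-amateur formulation of contrastive decoding (the paper's exact construction is not given in the summary, and the log-probabilities below are invented for illustration):

```python
def contrastive_scores(expert_logprobs, amateur_logprobs, alpha=1.0):
    # Subtract a weaker "amateur" model's log-probability so that scores
    # the amateur also favors (e.g. scale-position biases) are discounted,
    # keeping only the expert judge's distinctive preference.
    return {tok: expert_logprobs[tok] - alpha * amateur_logprobs[tok]
            for tok in expert_logprobs}

# Hypothetical log-probabilities over two candidate scores on a 1-10 scale.
expert = {"8": -0.2, "9": -1.8}
amateur = {"8": -0.3, "9": -0.4}
scores = contrastive_scores(expert, amateur)
best = max(scores, key=scores.get)
```

Systematic biases shared by both models cancel in the subtraction, which is the intuition for why contrastive decoding can reduce sensitivity to the predefined score range.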

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

Researchers introduce LoRA-DA, a new initialization method for Low-Rank Adaptation that leverages target-domain data and theoretical optimization principles to improve fine-tuning performance. The method outperforms existing initialization approaches across multiple benchmarks while maintaining computational efficiency.
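For context, the standard LoRA forward pass that any initialization scheme plugs into looks like the sketch below. The usual recipe zero-initializes the up-projection B so fine-tuning starts from the frozen base model; LoRA-DA's contribution, per the summary, is deriving the initial A and B from target-domain data instead, a detail this illustrative sketch does not implement.

```python
def matvec(M, x):
    # Plain matrix-vector product over nested lists.
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    # Frozen base projection plus the trainable low-rank correction B @ (A @ x).
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
A = [[0.5, 0.5]]               # rank-1 down-projection (1x2)
B = [[0.0], [0.0]]             # up-projection, zero-initialized (2x1)
out = lora_forward(W, A, B, [2.0, 4.0])
```

With B at zero the adapter is inert and the output equals the base model's, which is exactly the starting point that data-aware initialization methods replace with a better-informed one.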

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

Researchers introduce Nirvana, a Specialized Generalist Model that combines broad language capabilities with domain-specific adaptation through task-aware memory mechanisms. The model achieves competitive performance on general benchmarks while reaching the lowest perplexity across specialized domains such as biomedicine, finance, and law, with practical applications demonstrated in medical imaging reconstruction.

🏢 Hugging Face · 🏢 Perplexity