#empirical-study News & Analysis

14 articles tagged with #empirical-study. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Researchers conducted 400 autonomous penetration testing runs across four LLMs against a fixed vulnerable target to measure attack consistency. Results show wide variation in exploitation success rates (25-85%) and distinctive failure modes per model, with Claude and Gemini 2.5 Flash-Lite substantially outperforming GPT-4o-mini and Qwen, raising questions about LLM reliability in security-critical autonomous operations.

🏢 Anthropic · 🧠 GPT-4 · 🧠 Claude

AI · Neutral · arXiv – CS AI · Apr 20 · 7/10

Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

Researchers conducted a comprehensive empirical study on scaling laws for large language models during reinforcement learning post-training, using Qwen2.5 models ranging from 0.5B to 72B parameters. The study reveals that larger models demonstrate superior learning efficiency, performance can be predicted via power-law models, and data reuse proves highly effective in constrained environments, providing practical guidelines for optimizing LLM reasoning capabilities.

AI · Bearish · arXiv – CS AI · Apr 10 · 7/10

Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects

Researchers evaluated Cursor, an AI-powered IDE, on its ability to generate large-scale software projects and found it achieves 91% functional correctness but produces significant design issues including code duplication, complexity violations, and framework best-practice breaches that threaten long-term maintainability.

AI · Neutral · arXiv – CS AI · Jun 23 · 5/10

An Exploratory Case Study of LLM-Assisted Refactoring and Gameplay Feature Generation in an Endless Runner Game

Researchers conducted a case study evaluating GPT-4o's effectiveness in game development tasks within an existing Python/Pygame endless runner project. The study found that while the model successfully completed all three refactoring tasks, only one of three gameplay feature generation tasks integrated correctly, suggesting LLMs perform better with localized code transformations than complex cross-system integrations.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Rule Taxonomy and Evolution in AI IDEs: A Mining and Survey Study

A comprehensive empirical study examined how developers use rules in AI-powered IDEs to constrain LLM behavior, extracting 7,310 rules from 83 open-source projects. The research revealed a significant gap between what developers prioritize (architectural constraints) and what they actually implement (low-level formatting rules), while showing that rule updates improve artifact compliance by an average of 23 percentage points.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

An empirical study examines how data scale, model complexity, and input modalities affect visual generalization in deep neural networks using the CIFAR-10 and CIFAR-100 datasets. The findings show that increasing training data consistently improves generalization, that changes in model complexity yield inconsistent results, and that removing color information significantly degrades performance.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers

Researchers empirically compared eight input encoder architectures for Transformer models processing multi-channel signal data, finding that the standard per-channel linear projection matches all alternatives in performance while being the simplest to implement. Two encoders underperformed significantly: shared-scalar baselines and channel-independent architectures. Practical differences among the top performers remained small.

AI · Neutral · arXiv – CS AI · Jun 3 · 6/10

Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

Researchers investigate whether real-world datasets contain natural experiments—events that create implicit interventions affecting some groups but not others—and propose using causal discovery methods to detect and leverage them for improved model performance. Their empirical study across synthetic and real-world datasets suggests that natural experiments do exist in practice and can enhance downstream machine learning outcomes when treated as interventional rather than observational data.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification

Researchers deployed thirteen AI agents on Moltbook, a Reddit-like social network for AI systems, to study how configuration specifications affect emergent social behavior. Results show that personality specification is the dominant factor influencing agent responses, while the underlying LLM and operational rules have more moderate effects on communication style and topic engagement.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

Researchers conducted a controlled empirical study evaluating three LLMs (Claude Haiku, DeepSeek-Chat, Gemini 2.5 Flash) for qualitative coding of psychological safety in software engineering communities. Multi-shot prompting improved Claude Haiku's performance but not the others, while all models exhibited systematic biases in coding predictions, providing evidence-based guidelines for LLM-assisted qualitative research.

🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape

A large-scale survey of 457 software engineering researchers reveals that generative AI adoption is widespread in academic research, concentrated primarily in writing and early-stage tasks. While researchers perceive significant productivity gains, persistent concerns about accuracy, bias, and lack of governance frameworks highlight the need for clearer guidelines on responsible AI integration in academic practice.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study

Researchers present the first empirical study of machine unlearning in hybrid quantum-classical neural networks, adapting classical unlearning methods to quantum settings and introducing quantum-specific strategies. The study reveals that quantum models can effectively support unlearning, with performance varying based on circuit depth and entanglement structure, establishing baseline insights for privacy-preserving quantum machine learning systems.

AI · Neutral · arXiv – CS AI · Mar 5 · 5/10

Beyond the Prompt: An Empirical Study of Cursor Rules

Researchers conducted a large-scale empirical study analyzing 401 open-source repositories to understand how developers use cursor rules, which are persistent, machine-readable directives that provide context to AI coding assistants. The study identified five key themes of project context that developers consider essential: Conventions, Guidelines, Project Information, LLM Directives, and Examples.

AI · Neutral · arXiv – CS AI · Mar 17 · 5/10

An Empirical Investigation of Pre-Trained Deep Learning Model Reuse in the Scientific Process

Researchers conducted the first empirical study analyzing how natural scientists reuse pre-trained deep learning models across 17,511 peer-reviewed papers from 2000-2025. The study found that biochemistry and molecular biology lead in model reuse, with adaptation being the most common reuse pattern, primarily impacting the testing phase of scientific research.