
Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

arXiv – CS AI | Moaath Alshaikh, Tasneem Alshaher, Ricardo Vieira, Beatriz Santana, Clelio Xavier, Jose Amancio, Glauco Carneiro, Julio Leite, Savio Freire, Manoel Mendonca
AI Summary

Researchers conducted a controlled empirical study evaluating three LLMs (Claude Haiku, DeepSeek-Chat, Gemini 2.5 Flash) for qualitative coding of psychological safety in software engineering communities. Multi-shot prompting improved Claude Haiku's agreement with human coders but not that of the other two models, and all three models exhibited systematic biases in their coding predictions. From these findings, the authors derive evidence-based guidelines for LLM-assisted qualitative research.
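The multi-shot strategy evaluated in the study amounts to prepending labeled examples to the coding instruction. A minimal sketch of how such a prompt could be assembled, with hypothetical category names and example comments (the paper's actual codebook is not reproduced here):

```python
def build_multishot_prompt(examples, target, categories):
    """Assemble a multi-shot coding prompt: labeled example comments
    precede the target comment, so the model sees the expected format
    and category usage before coding. Labels here are illustrative."""
    lines = [f"Code each comment into one of: {', '.join(categories)}."]
    for text, label in examples:
        lines.append(f'Comment: "{text}"\nCode: {label}')
    # The target comment ends with a bare "Code:" cue for the model to fill.
    lines.append(f'Comment: "{target}"\nCode:')
    return "\n\n".join(lines)

prompt = build_multishot_prompt(
    examples=[
        ("Happy to review your patch whenever you push it.", "support"),
        ("I worry this change could break the release.", "concern"),
    ],
    target="This test suite is a mess.",
    categories=["support", "concern", "neg_feedback"],
)
print(prompt)
```

The zero-shot baseline is the same prompt with an empty example list, which makes the two conditions directly comparable.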

Analysis

This empirical study addresses a critical gap in LLM reliability for academic research applications, specifically qualitative coding tasks that traditionally require human expertise and subjective judgment. The research demonstrates that prompt engineering strategies have differential effects across LLM architectures, challenging the assumption that all large language models respond uniformly to methodological interventions. The multi-shot improvement for Claude Haiku (Cohen's kappa increase of 0.034) contrasts sharply with negligible gains for DeepSeek-Chat and Gemini 2.5 Flash, suggesting that model architecture and training methodology influence how effectively examples guide reasoning.

The substantial variance in stability across models — particularly Gemini 2.5 Flash's higher variance (SD = 0.038) — raises important questions about production reliability when deploying LLMs for research automation at scale. The systematic biases identified across all models, including 5.25x over-prediction of negative feedback and consistent under-prediction of concern expression, indicate that LLMs internalize skewed representations that require explicit correction mechanisms.

These findings directly affect software engineering researchers considering LLM-assisted analysis pipelines, as naive adoption could inadvertently introduce systematic measurement bias into published findings. The work establishes a methodological foundation for validating LLM outputs through controlled experimentation rather than assumption. Organizations developing research automation tools must account for model-specific performance characteristics and implement validation protocols that detect category-specific biases before analysis deployment.
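The agreement metric behind these comparisons, Cohen's kappa, corrects raw percent agreement for the agreement two coders would reach by chance given their label distributions. A self-contained sketch, using hypothetical labels rather than the study's data:

```python
from collections import Counter

def cohens_kappa(human, model):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    assert len(human) == len(model) and human
    n = len(human)
    # Observed agreement: fraction of items both coders labeled identically.
    p_o = sum(h == m for h, m in zip(human, model)) / n
    # Expected agreement under independence, from each coder's label marginals.
    h_counts, m_counts = Counter(human), Counter(model)
    p_e = sum(h_counts[c] * m_counts.get(c, 0) for c in h_counts) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for six comments (not the paper's dataset).
human = ["support", "concern", "neg_feedback", "support", "concern", "support"]
model = ["support", "support", "neg_feedback", "support", "concern", "support"]
print(round(cohens_kappa(human, model), 3))  # → 0.714
```

Because kappa discounts chance agreement, a reported gain of +0.034 is a change in genuine (not coincidental) coder-model alignment, which is why the study treats even small kappa shifts as meaningful.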

Key Takeaways
  • Multi-shot prompting significantly improved Claude Haiku's agreement (kappa +0.034) but showed negligible effects for DeepSeek-Chat and Gemini 2.5 Flash
  • Gemini 2.5 Flash demonstrated the lowest intra-model stability (SD = 0.038), while Claude Haiku and DeepSeek-Chat were more consistent
  • All three models exhibited systematic bias toward over-predicting negative feedback (up to 5.25x) and under-predicting concern expression
  • Prompt engineering effectiveness varies significantly by model architecture, requiring model-specific validation before research deployment
  • The study provides empirical evidence that LLM-assisted qualitative coding requires explicit bias detection and correction mechanisms
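The category-level biases above (e.g. 5.25x over-prediction of negative feedback) can be surfaced by comparing each category's predicted count against its human-coded count. A minimal detection sketch, with hypothetical label distributions:

```python
from collections import Counter

def prediction_bias(human, model):
    """Per-category ratio of model prediction count to human label count.
    Ratios well above 1 flag over-prediction (like the 5.25x negative-feedback
    bias reported in the study); ratios below 1 flag under-prediction."""
    h_counts, m_counts = Counter(human), Counter(model)
    return {c: m_counts.get(c, 0) / h_counts[c] for c in h_counts}

# Hypothetical ground-truth vs. model codes for ten comments.
human = ["neg_feedback"] + ["concern"] * 4 + ["support"] * 5
model = ["neg_feedback"] * 3 + ["concern"] * 2 + ["support"] * 5
bias = prediction_bias(human, model)
print(bias)  # neg_feedback over-predicted 3x, concern under-predicted 0.5x
```

Running this check on a human-coded validation set before deploying an LLM coding pipeline is one concrete form of the bias-detection mechanism the takeaways call for.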
Models mentioned: Claude (Anthropic), Gemini (Google)