🧠 AI · 🔴 Bearish · Importance 6/10

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

arXiv – CS AI | Gilberto Sussumu Hida, Danilo Monteiro Ribeiro, Erika Yahata

🤖 AI Summary

A comprehensive study comparing 12 large language models against 4 classical classifiers for automating evidence screening in software engineering systematic literature reviews reveals that LLMs exhibit significant performance variability and lack consistent superiority over traditional methods. The research emphasizes that abstract availability is critical for LLM performance, while title and keywords provide minimal additional value, suggesting LLM adoption should be driven by operational constraints rather than performance guarantees.

Analysis

This research addresses a critical gap in LLM deployment practices by rigorously testing language models on a real-world task where accuracy directly impacts research validity. Study screening in systematic literature reviews represents a high-stakes classification problem where false negatives—rejected relevant papers—can undermine entire research conclusions. The authors' decision to test 12 different LLMs across multiple providers reveals an uncomfortable truth: the AI industry has oversold LLM capabilities in specialized domains without sufficient empirical validation.
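The false-negative stakes described above can be made concrete with screening metrics: in SLR screening, recall over the relevant class is what guards against silently discarding studies. A minimal sketch with hypothetical labels (1 = relevant, 0 = irrelevant), not data from the paper:

```python
# Toy illustration: why recall matters in SLR screening.
# Labels are hypothetical: 1 = relevant study, 0 = irrelevant.
def screening_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # one relevant paper rejected
recall, precision = screening_metrics(y_true, y_pred)
print(recall, precision)  # → 0.75 0.75
```

A recall of 0.75 here means one in four relevant studies was lost before synthesis ever began, which is exactly the failure mode a screening pipeline must minimize.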

The finding that LLMs demonstrate substantial heterogeneity and non-determinism even at temperature zero contradicts marketing narratives about reproducibility and reliability. This variability matters because researchers and organizations making million-dollar decisions about automation pipelines need guarantees, not probability distributions. The metadata analysis—showing abstract dominance while title and keywords add minimal value—provides actionable guidance but also suggests LLMs aren't extracting sophisticated signal from structured information as promised.
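The variability reporting the authors call for can be approximated with a simple stability measure: rerun the same screening prompt N times and count the items on which every run agrees. A sketch with hardcoded include/exclude decisions standing in for real model outputs:

```python
# Sketch: quantifying run-to-run variability of repeated screening calls.
# `runs` holds include/exclude decisions (1/0) from three repeated runs on
# the same six papers at temperature 0 -- simulated values, not real output.
def per_item_agreement(runs):
    """Fraction of items on which every run gives the same decision."""
    n_items = len(runs[0])
    stable = sum(1 for i in range(n_items)
                 if len({run[i] for run in runs}) == 1)
    return stable / n_items

runs = [
    [1, 1, 0, 0, 1, 0],  # run 1
    [1, 1, 0, 1, 1, 0],  # run 2: flips item 4
    [1, 0, 0, 0, 1, 0],  # run 3: flips item 2
]
print(per_item_agreement(runs))  # 4 of 6 items are stable across runs
```

Anything below 1.0 on identical inputs is residual non-determinism, and reporting this number alongside accuracy is the kind of variance disclosure the study recommends.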

When compared directly to classical models under identical experimental conditions, LLMs fail to demonstrate consistent performance advantages. This challenges the broader narrative that deep learning automatically supersedes statistical approaches. For software engineering teams and research organizations, this means LLM adoption decisions shouldn't be based on hype but on tangible operational benefits like cost reduction or metadata handling efficiency.
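The classical baselines LLMs failed to consistently beat are typically simple bag-of-words classifiers. A minimal, stdlib-only sketch of one such baseline, a multinomial Naive Bayes over abstract text (toy documents; a real SLR corpus would need proper tokenization, stop-word handling, and tuning):

```python
import math
from collections import Counter

# Minimal sketch of a classical baseline for abstract screening:
# bag-of-words Naive Bayes with Laplace smoothing (toy data).
def train_nb(docs, labels):
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(doc.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, priors, vocab

def predict_nb(model, doc):
    counts, priors, vocab = model
    total = sum(priors.values())
    scores = {}
    for y in (0, 1):
        n_y = sum(counts[y].values())
        score = math.log(priors[y] / total)
        for w in doc.lower().split():
            # Laplace smoothing over the shared vocabulary
            score += math.log((counts[y][w] + 1) / (n_y + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

docs = [
    "empirical study of test automation in software projects",
    "systematic review of defect prediction models",
    "recipe blog about sourdough baking",
    "travel notes from a hiking trip",
]
labels = [1, 1, 0, 0]  # 1 = include, 0 = exclude
model = train_nb(docs, labels)
print(predict_nb(model, "defect prediction study"))  # → 1
```

A model this cheap trains in milliseconds, is fully deterministic, and needs no API calls, which is precisely why its competitiveness makes blanket LLM adoption hard to justify on performance alone.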

The recommendation that LLM deployment should be justified by governance constraints rather than raw performance represents a maturation of AI evaluation practices. Organizations must conduct pilot validations before committing to LLM-based screening systems and explicitly report variability metrics. This research sets a precedent for rigorous comparative analysis that other domains should emulate.

Key Takeaways
  • LLMs showed no consistent performance superiority over classical classifiers on systematic literature review screening tasks across two real datasets.
  • Abstract text was decisive for LLM performance while title and keywords provided negligible improvements, limiting the practical value of enhanced metadata.
  • LLMs demonstrated substantial heterogeneity and residual non-determinism even when temperature was set to zero, raising reproducibility concerns.
  • LLM adoption decisions should prioritize operational and governance factors over performance guarantees, supported by pilot testing and variance reporting.
  • Classical machine learning models remain competitive for specialized document classification tasks in research workflows.
Mentioned
  • Companies: OpenAI, Anthropic
  • Models: Gemini (Google), Llama (Meta)
Read Original → via arXiv – CS AI