AI · Bearish · arXiv - CS AI · 8h ago · 6/10
Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs
A comprehensive study comparing 12 large language models against 4 classical classifiers for automating evidence screening in software engineering systematic literature reviews reveals that LLMs exhibit significant performance variability and lack consistent superiority over traditional methods. The research emphasizes that abstract availability is critical for LLM performance, while title and keywords provide minimal additional value, suggesting LLM adoption should be driven by operational constraints rather than performance guarantees.
OpenAI · Anthropic · Gemini