Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
Microsoft researchers have released Delulu, a benchmark dataset of 1,951 code generation samples across 7 programming languages, designed to test how well large language models detect hallucinations in Fill-in-the-Middle (FIM) tasks. Testing 11 open-weight models revealed fundamental limitations: even the strongest achieved only 84.5% accuracy, indicating that code hallucination remains a persistent challenge across all model families.
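For readers unfamiliar with the setting, a FIM task gives a model the code before and after a gap and asks it to generate the missing span; a hallucination is a completion that looks plausible but is wrong. The sample below is a hypothetical illustration (the snippet and both completions are invented for clarity, not drawn from the benchmark):

```python
# Hypothetical FIM sample: the model sees the prefix and suffix
# and must generate the missing middle line.
prefix = """import json

def load_config(path):
    with open(path) as f:
"""
suffix = """    return config["settings"]
"""

# A correct middle completion:
#     config = json.load(f)
# A hallucinated one that passes a quick glance but fails at
# runtime, because the json module has no parse_file function:
#     config = json.parse_file(f)
```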
Delulu addresses a critical vulnerability in code-generating AI systems: the tendency to produce plausible but incorrect completions that bypass surface-level review yet fail at runtime. The benchmark's rigorous construction through adversarial generation, multi-judge evaluation, and containerized verification sets a new standard for AI safety benchmarking in software development contexts. This matters because developers increasingly rely on LLMs for code completion, and hallucinated APIs, invalid parameters, or fabricated imports can introduce subtle bugs into production systems.
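The paper's exact verification harness is not reproduced here, but the idea behind containerized verification can be sketched: execute each candidate completion inside an isolated container and treat anything other than a clean exit as a runtime failure. Everything in the sketch below (image name, file layout, timeout) is an assumption for illustration, not Delulu's actual pipeline:

```python
import subprocess
import tempfile
from pathlib import Path

def verify_in_container(code: str, image: str = "python:3.11-slim",
                        timeout: int = 30) -> bool:
    """Run a candidate completion inside a throwaway Docker container.

    Returns True only if the program exits cleanly. The image, mount
    layout, and timeout are illustrative assumptions.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                ["docker", "run", "--rm", "--network", "none",
                 "-v", f"{tmp}:/work:ro", image,
                 "python", "/work/candidate.py"],
                capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # hung completions count as failures
        return result.returncode == 0
```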
The research reveals that code hallucination detection remains unsolved across the entire spectrum of model families and scales tested. Even strong code models such as Qwen2.5-Coder fall short, with no family exceeding 77% Edit Similarity. This finding contradicts the assumption that scaling alone solves the problem and suggests hallucinations are intrinsic to current model architectures rather than training artifacts; the consistency of failures across families indicates the issue transcends any specific implementation.
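Edit Similarity in code-completion evaluations is typically a normalized character-level edit distance between the generated span and the reference; whether Delulu uses exactly this formulation is an assumption here. A minimal sketch:

```python
def edit_similarity(pred: str, ref: str) -> float:
    """Normalized Levenshtein similarity in [0, 1].

    1.0 means an exact match; usually reported as a percentage
    in FIM evaluations.
    """
    if not pred and not ref:
        return 1.0
    # Classic dynamic-programming edit distance over one rolling row.
    m, n = len(pred), len(ref)
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = dist[j]
            dist[j] = min(dist[j] + 1,      # deletion
                          dist[j - 1] + 1,  # insertion
                          prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return 1.0 - dist[n] / max(m, n)

# e.g. edit_similarity("json.load(f)", "json.loads(f)") ≈ 0.92
```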
For developers and organizations deploying code LLMs, Delulu provides empirical evidence that automated code review and runtime verification remain non-negotiable practices. The benchmark's public release lets the research community evaluate future models against a verified standard, potentially driving innovation in hallucination detection techniques. Organizations building AI-assisted development tools must design guardrails around the fact that no single model family solves this problem reliably.
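As a concrete example of such a guardrail, a completion can be rejected before it ever reaches a developer if it fails cheap static checks. The two-layer policy below is an illustrative sketch, not a technique from the paper: the code must parse, and its absolute imports must resolve, which catches fabricated imports (hallucinated APIs with valid syntax still require runtime verification):

```python
import ast
import importlib.util

def passes_static_gate(snippet: str) -> bool:
    """Reject completions that fail cheap static checks.

    Layer 1: the snippet must parse. Layer 2: absolute imports must
    name modules that actually resolve on this machine. Both layers
    are illustrative assumptions, not Delulu's methodology.
    """
    try:
        tree = ast.parse(snippet)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # find_spec returns None when the top-level package is missing.
            if importlib.util.find_spec(name.split(".")[0]) is None:
                return False
    return True

# passes_static_gate("import json")      -> True
# passes_static_gate("import jsonutilz") -> False (fabricated module)
```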
- All 11 tested models demonstrated significant hallucination rates, with the strongest reaching only 84.5% accuracy on verified FIM samples
- Delulu's adversarial construction methodology, using four judge models and Docker verification (see the sketch after this list), establishes a reproducible benchmark standard for AI safety testing
- Code hallucinations persist across model families and scales from 0.5B to 32B parameters, suggesting the problem is architectural rather than training-related
- The benchmark spans 7 languages and 4 hallucination types, providing comprehensive coverage for evaluating real-world code generation risks
- Microsoft's open-source release enables broader community evaluation and potential development of improved hallucination detection techniques
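The judge stage mentioned in the second bullet can be pictured as a simple agreement rule: several independent models label each candidate, and it enters the benchmark only when enough of them concur. The function below and its unanimity threshold are assumptions sketched for illustration, not Delulu's published procedure:

```python
from typing import Callable, List

def judge_consensus(sample: str,
                    judges: List[Callable[[str], bool]],
                    required_votes: int = 4) -> bool:
    """Accept a candidate only if enough judge models agree on its label.

    `judges` stands in for four independent LLM judges, each returning
    True if it labels the sample a genuine hallucination. Requiring all
    four votes (unanimity) is an illustrative assumption.
    """
    votes = sum(1 for judge in judges if judge(sample))
    return votes >= required_votes
```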