Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
Microsoft researchers have released Delulu, a benchmark dataset of 1,951 code generation samples across 7 programming languages, designed to test how well large language models detect hallucinations in Fill-in-the-Middle (FIM) tasks. Testing 11 open-weight models revealed fundamental limitations: even the strongest achieved only 84.5% accuracy, indicating that code hallucination remains a persistent challenge across all model families.
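For readers unfamiliar with the setting, a FIM task gives a model the code before and after a gap and asks it to generate the missing span; a hallucination is a completion that looks plausible but is wrong. The sample below is a hypothetical illustration (the snippet and both completions are invented for clarity, not drawn from the benchmark):

```python
# Hypothetical FIM sample: the model sees the prefix and suffix
# and must generate the missing middle line.
prefix = """import json

def load_config(path):
    with open(path) as f:
"""
suffix = """    return config["settings"]
"""

# A correct middle completion:
#     config = json.load(f)
# A hallucinated one that passes a quick glance but fails at
# runtime, because the json module has no parse_file function:
#     config = json.parse_file(f)
```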
Delulu addresses a critical vulnerability in code-generating AI systems: the tendency to produce plausible but incorrect completions that bypass surface-level review yet fail at runtime. The benchmark's rigorous construction through adversarial generation, multi-judge evaluation, and containerized verification sets a new standard for AI safety benchmarking in software development contexts. This matters because developers increasingly rely on LLMs for code completion, and hallucinated APIs, invalid parameters, or fabricated imports can introduce subtle bugs into production systems.
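The paper's exact verification harness is not reproduced here, but the idea behind containerized verification can be sketched: execute each candidate completion inside an isolated container and treat anything other than a clean exit as a runtime failure. Everything in the sketch below (image name, file layout, timeout) is an assumption for illustration, not Delulu's actual pipeline:

```python
import subprocess
import tempfile
from pathlib import Path

def verify_in_container(code: str, image: str = "python:3.11-slim",
                        timeout: int = 30) -> bool:
    """Run a candidate completion inside a throwaway Docker container.

    Returns True only if the program exits cleanly. The image, mount
    layout, and timeout are illustrative assumptions.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                ["docker", "run", "--rm", "--network", "none",
                 "-v", f"{tmp}:/work:ro", image,
                 "python", "/work/candidate.py"],
                capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # hung completions count as failures
        return result.returncode == 0
```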
The research reveals that code hallucination detection remains unsolved across the entire spectrum of model families and scales tested. Even strong code models such as Qwen2.5-Coder fall short, with no family exceeding 77% Edit Similarity. This finding contradicts the assumption that scaling alone solves the problem and suggests hallucinations are intrinsic to current model architectures rather than training artifacts; the consistency of failures across families indicates the issue transcends any specific implementation.
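Edit Similarity in code-completion evaluations is typically a normalized character-level edit distance between the generated span and the reference; whether Delulu uses exactly this formulation is an assumption here. A minimal sketch:

```python
def edit_similarity(pred: str, ref: str) -> float:
    """Normalized Levenshtein similarity in [0, 1].

    1.0 means an exact match; usually reported as a percentage
    in FIM evaluations.
    """
    if not pred and not ref:
        return 1.0
    # Classic dynamic-programming edit distance over one rolling row.
    m, n = len(pred), len(ref)
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = dist[j]
            dist[j] = min(dist[j] + 1,      # deletion
                          dist[j - 1] + 1,  # insertion
                          prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return 1.0 - dist[n] / max(m, n)

# e.g. edit_similarity("json.load(f)", "json.loads(f)") ≈ 0.92
```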
For developers and organizations deploying code LLMs, Delulu provides empirical evidence that automated code review and runtime verification remain non-negotiable practices. The benchmark's public release lets the research community evaluate future models against a verified standard, potentially driving innovation in hallucination detection techniques. Organizations building AI-assisted development tools must design guardrails around the fact that no single model family solves this problem reliably.
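As a concrete example of such a guardrail, a completion can be rejected before it ever reaches a developer if it fails cheap static checks. The two-layer policy below is an illustrative sketch, not a technique from the paper: the code must parse, and its absolute imports must resolve, which catches fabricated imports (hallucinated APIs with valid syntax still require runtime verification):

```python
import ast
import importlib.util

def passes_static_gate(snippet: str) -> bool:
    """Reject completions that fail cheap static checks.

    Layer 1: the snippet must parse. Layer 2: absolute imports must
    name modules that actually resolve on this machine. Both layers
    are illustrative assumptions, not Delulu's methodology.
    """
    try:
        tree = ast.parse(snippet)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # find_spec returns None when the top-level package is missing.
            if importlib.util.find_spec(name.split(".")[0]) is None:
                return False
    return True

# passes_static_gate("import json")      -> True
# passes_static_gate("import jsonutilz") -> False (fabricated module)
```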
- All 11 tested models demonstrated significant hallucination rates, with the strongest reaching only 84.5% accuracy on verified FIM samples
- Delulu's adversarial construction methodology, using four judge models and Docker verification (see the sketch after this list), establishes a reproducible benchmark standard for AI safety testing
- Code hallucinations persist across model families and scales from 0.5B to 32B parameters, suggesting the problem is architectural rather than training-related
- The benchmark spans 7 languages and 4 hallucination types, providing comprehensive coverage for evaluating real-world code generation risks
- Microsoft's open-source release enables broader community evaluation and potential development of improved hallucination detection techniques
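The judge stage mentioned in the second bullet can be pictured as a simple agreement rule: several independent models label each candidate, and it enters the benchmark only when enough of them concur. The function below and its unanimity threshold are assumptions sketched for illustration, not Delulu's published procedure:

```python
from typing import Callable, List

def judge_consensus(sample: str,
                    judges: List[Callable[[str], bool]],
                    required_votes: int = 4) -> bool:
    """Accept a candidate only if enough judge models agree on its label.

    `judges` stands in for four independent LLM judges, each returning
    True if it labels the sample a genuine hallucination. Requiring all
    four votes (unanimity) is an illustrative assumption.
    """
    votes = sum(1 for judge in judges if judge(sample))
    return votes >= required_votes
```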