BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali
Researchers introduce BenHalluEval, the first hallucination evaluation framework for Bengali-language LLMs, covering four task categories with 12,000 test cases across seven models. The framework reveals significant performance gaps and demonstrates that standard evaluation metrics fail to capture hallucination risks in low-resource languages.
BenHalluEval addresses a critical gap in AI safety research for underrepresented languages. Bengali ranks as the sixth most spoken language globally, yet LLMs operating in this language have never been systematically evaluated for hallucinations, that is, instances where a model generates plausible-sounding but false information. The gap has real consequences: millions of Bengali speakers rely on LLMs with no measured indication of how reliable their output is.
The framework's central methodological choice is a dual-track protocol. It measures both the false-positive rate on correct information and the hallucination detection rate on fabricated content, so a model cannot score well simply by giving the same answer to every item. The resulting BenHalluScore ranges from 7.72% to 55.42%, indicating dramatic variation in how different models handle Bengali-language hallucinations. This variation suggests no single model consistently balances accuracy with truthfulness.
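The dual-track idea can be illustrated with a minimal sketch. This summary does not give the BenHalluScore formula, so the combined score below (detection rate minus false-positive rate) is an assumption chosen only for illustration, and the function names and toy labels are hypothetical.

```python
from typing import Sequence


def rate_flagged(flags: Sequence[bool]) -> float:
    """Fraction of items the model flagged as hallucinated."""
    if not flags:
        raise ValueError("flags must not be empty")
    return sum(flags) / len(flags)


def dual_track_scores(
    flags_on_correct: Sequence[bool],
    flags_on_fabricated: Sequence[bool],
) -> dict[str, float]:
    """Score the two tracks separately, then combine them.

    Track 1: correct items, where any flag is a false positive.
    Track 2: fabricated items, where a flag is a successful detection.
    The combination (detection minus false positives) is an assumed,
    illustrative formula, not the published BenHalluScore.
    """
    false_positive_rate = rate_flagged(flags_on_correct)
    detection_rate = rate_flagged(flags_on_fabricated)
    return {
        "false_positive_rate": false_positive_rate,
        "detection_rate": detection_rate,
        "illustrative_score": detection_rate - false_positive_rate,
    }


# Toy data (hypothetical): True means "model says this is a hallucination".
always_flag = dual_track_scores([True] * 4, [True] * 4)
never_flag = dual_track_scores([False] * 4, [False] * 4)
selective = dual_track_scores(
    [False, False, False, True],  # one false positive on correct items
    [True, True, True, False],    # three of four fabrications caught
)

for name, result in [
    ("always_flag", always_flag),
    ("never_flag", never_flag),
    ("selective", selective),
]:
    print(name, result)
```

On a single track, the always-flag model would look perfect on fabricated items and the never-flag model would look perfect on correct items. Scoring both tracks together gives each of them no credit, while the selective model is rewarded.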
The findings have broader implications for AI development in multilingual contexts. Chain-of-thought prompting, a popular technique for improving reasoning, failed to reliably reduce hallucinations. It merely shifted the distribution of responses without improving the models' ability to discriminate fabricated content from correct content. This challenges the assumption that reasoning strategies universally enhance reliability across languages and tasks.
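The difference between shifting responses and improving discrimination can be made concrete with a standard signal-detection model. This is a swapped-in illustration, not the analysis used in the research: the equal-variance Gaussian model, the function names, and all numbers below are assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sigma 1


def rates_from_model(d_prime: float, criterion: float) -> tuple[float, float]:
    """Return (detection_rate, false_positive_rate) for an
    equal-variance Gaussian signal-detection model."""
    detection = _STD_NORMAL.cdf(d_prime / 2 - criterion)
    false_positive = _STD_NORMAL.cdf(-d_prime / 2 - criterion)
    return detection, false_positive


def d_prime_from_rates(detection: float, false_positive: float) -> float:
    """Discrimination: distance between the two z-transformed rates."""
    return _STD_NORMAL.inv_cdf(detection) - _STD_NORMAL.inv_cdf(false_positive)


def criterion_from_rates(detection: float, false_positive: float) -> float:
    """Response bias: positive means reluctant to flag, negative means eager."""
    z_sum = _STD_NORMAL.inv_cdf(detection) + _STD_NORMAL.inv_cdf(false_positive)
    return -z_sum / 2


# Hypothetical model: same discrimination, different willingness to flag.
base_detection, base_fp = rates_from_model(d_prime=1.0, criterion=0.5)
cot_detection, cot_fp = rates_from_model(d_prime=1.0, criterion=-0.5)

print(f"baseline: detection={base_detection:.3f}, false positives={base_fp:.3f}")
print(f"shifted:  detection={cot_detection:.3f}, false positives={cot_fp:.3f}")
print(f"baseline d'={d_prime_from_rates(base_detection, base_fp):.3f}")
print(f"shifted  d'={d_prime_from_rates(cot_detection, cot_fp):.3f}")
```

In this toy case the shifted configuration flags more fabricated content, but it also flags more correct content, and the discrimination measure is unchanged. A detection rate reported on its own would show an improvement that is not there.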
For the AI industry, BenHalluEval establishes a benchmark that developers must contend with when deploying models for Bengali speakers. The research highlights how single-track evaluation approaches and prompt-engineering-only solutions prove insufficient for ensuring safety in low-resource language settings. As LLMs expand into underserved linguistic markets, similar benchmarking efforts will become essential for maintaining user trust and regulatory compliance.
- BenHalluEval is the first systematic hallucination evaluation framework for Bengali, covering 12,000 test cases across four task categories and seven LLMs.
- The dual-track BenHalluScore metric reveals substantial performance variation (7.72% to 55.42%), exposing inadequacies in single-metric evaluation approaches.
- Chain-of-thought prompting shifts model responses without consistently improving hallucination detection, challenging common assumptions about reasoning improvements.
- None of the seven evaluated LLMs demonstrates reliable hallucination discrimination across the Bengali tasks, indicating safety gaps in production deployments.
- The research demonstrates that low-resource language evaluation requires specialized benchmarks and multi-faceted assessment protocols beyond prompting strategies.