
HALAS: A Human-Annotated Dataset of Hallucinations of Modern ASR Systems

arXiv – CS AI | Mateusz Barański, Jan Jasiński, Julitta Bartolewska, Marcin Witkowski, Konrad Kowalczyk | June 23, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce HALAS, the first human-annotated dataset of naturally occurring hallucinations, collected from seven state-of-the-art ASR systems transcribing real earnings call recordings. The benchmark shows that hallucinations persist even in nearly correct transcriptions and establishes rigorous evaluation methods. Current detection techniques achieve an F1 score of only 53.1%, while character-level metrics reach 81% ROC-AUC.

Analysis

HALAS addresses a critical gap in AI safety research: evaluating speech recognition failures on production-style audio rather than under synthetic conditions. Modern ASR systems, despite impressive performance metrics, generate confidently incorrect outputs, a phenomenon known as hallucination. The dataset is the first systematic attempt to characterize these failures on real-world earnings call data, and it provides more than 7,000 hallucination instances annotated at the span level. The research finds that hallucinations cluster around specific vocabulary and recur across multiple state-of-the-art models, which suggests common architectural vulnerabilities rather than isolated model failures.

The dataset's significance extends beyond academic interest. Earnings calls carry substantial commercial value: transcription errors directly affect financial analysis, investor communication, and regulatory compliance. The finding that hallucinations occur even when the overall Word Error Rate (WER) is low indicates that aggregate performance metrics can mask reliability problems in individual transcripts. This gap between aggregate accuracy and instance-level errors matters for any industry that relies on ASR in critical applications.
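To make the point about aggregate metrics concrete, here is a minimal sketch of Word Error Rate computed as word-level edit distance divided by reference length. The sentences are invented for illustration and are not taken from HALAS, and `word_error_rate` is a hypothetical helper rather than the paper's evaluation code. It shows that a transcript with a fabricated trailing phrase and a transcript with two small slips can receive the same score, so the metric alone does not reveal which one contains invented content.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented 20-word reference (assumption: not from the HALAS dataset).
reference = (
    "revenue for the third quarter grew four percent year over year "
    "driven by strong demand in europe and north america"
)

# Hypothesis A: every reference word is correct, but two words were never spoken.
hallucinated = reference + " thank you"

# Hypothesis B: two ordinary recognition slips, nothing invented.
benign = reference.replace("grew", "grow").replace("driven", "drive")

wer_hallucinated = word_error_rate(reference, hallucinated)
wer_benign = word_error_rate(reference, benign)
print(wer_hallucinated, wer_benign)
```

Both hypotheses differ from the reference by two word edits, so they score identically. Span-level annotation of the kind HALAS provides is what separates the two cases.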

The benchmark findings are sobering: existing hallucination detection methods (53.1% F1) lag well behind character-level metrics (81% ROC-AUC), which suggests that detection remains an unsolved problem. This limitation constrains mitigation strategies and forces practitioners to choose between accepting hallucinations and maintaining expensive human verification workflows. The cross-model overlap in hallucinated vocabulary suggests that fundamental architectural properties, possibly related to how models assign confidence scores, drive these failures rather than model-specific quirks.
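The two figures quoted above come from different metrics: F1 scores hard decisions at one threshold, while ROC-AUC scores how well a continuous score ranks positives above negatives. The sketch below computes both for one toy detector. The labels, scores, and the 0.5 threshold are invented assumptions and do not reproduce any HALAS result.

```python
def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for binary 0/1 predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in the
    number of examples, which is fine for a small illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (assumption): 1 marks a hallucinated segment, 0 a faithful one.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

threshold = 0.5  # arbitrary operating point chosen for this sketch
predictions = [1 if s >= threshold else 0 for s in scores]

f1 = f1_score(labels, predictions)
auc = roc_auc(labels, scores)
print(f"F1 at threshold {threshold}: {f1:.3f}")
print(f"ROC-AUC: {auc:.3f}")
```

Here the same detector ranks most positives above most negatives yet misses one hallucination and raises one false alarm at the chosen threshold, so its F1 is much lower than its ROC-AUC. On imbalanced data the two numbers often diverge like this, which is worth keeping in mind when comparing them.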

Future work should focus on closing the gap between dedicated detection methods and simple character-level metrics, and on explaining why hallucinations persist despite low error rates. Industry adoption of the HALAS benchmark could accelerate detector development and encourage architectures with better-calibrated confidence scores.

Key Takeaways

→HALAS provides the first human-annotated dataset of naturally occurring ASR hallucinations from real earnings call recordings across seven state-of-the-art models.
→Hallucinations occur even in nearly correct transcriptions with low Word Error Rates, revealing a hidden reliability problem in current ASR systems.
→Current hallucination detection methods achieve only 53.1% F1, while character-level metrics reach 81% ROC-AUC, indicating that detection remains unsolved.
→Strong cross-model vocabulary overlap in hallucinations suggests common architectural vulnerabilities shared across different ASR implementations.
→The benchmark establishes rigorous evaluation on naturally occurring errors, shifting hallucination research from synthetically corrupted audio to production-grade speech data.