
Evaluating Large Language Models for Hausa and Fongbe Machine Translation: Benchmarks, Failures, and Metric Reliability

arXiv – CS AI | Mahounan Pericles Adjovi, Roald Eiselen, Prasenjit Mitra | June 23, 2026 at 04:00 AM

🤖AI Summary

Researchers evaluated four LLMs (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, Qwen2.5-7B) on English-to-Hausa and English-to-Fongbe translation. They found that translation quality differs sharply between the two languages, that model rankings change from one language to the other, and that agreement between automatic metrics and human judgment depends on the language, with neural metrics unable to separate good translations from bad ones.

Analysis

This research addresses a gap in LLM evaluation: the reliability of machine translation for low-resource African languages. The study benchmarks four models on two typologically distinct West African languages and finds that performance patterns do not generalize from one to the other. Hausa reached acceptable translation quality (human scores of 4.0-4.5 out of 5) while Fongbe performed poorly (1.0-2.2 out of 5), with a consistent 3x BLEU gap between the two languages across all systems. Model rankings also shifted by language: Gemini led for Fongbe while GPT-4o Mini led for Hausa, so a model's results on one low-resource language do not predict its results on another.

The research also exposes weaknesses in current evaluation methodology. Standard automatic metrics (BLEU, chrF++) correlated perfectly with human judgment for Fongbe but only weakly (rho=0.5) for Hausa, where every metric ranked Claude first even though human evaluators preferred GPT-4o Mini. Neural metrics such as BERTScore exhibited embedding collapse (within-language similarity exceeding 0.99), which left them unable to differentiate translation quality. These findings indicate that evaluation frameworks designed for high-resource language pairs can fail for low-resource African languages.
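The metric-versus-human comparison above is a rank correlation. As a minimal sketch of how such a rho is computed, the Python below implements Spearman's coefficient from scratch. The system labels and scores are hypothetical placeholders, not figures from the paper, and the paper's exact correlation procedure is not stated in this summary.

```python
def average_ranks(values):
    """Rank values so the highest gets rank 1; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is entirely tied
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four systems (illustration only, not paper data).
systems = ["system_a", "system_b", "system_c", "system_d"]
human_scores = [4.5, 4.3, 4.1, 4.0]      # mean human rating per system
metric_scores = [20.0, 22.0, 18.0, 17.0]  # an automatic metric per system

rho = spearman_rho(human_scores, metric_scores)
print(f"Spearman rho between metric and human rankings: {rho:.2f}")
```

With only four systems, swapping a single adjacent pair already moves rho well away from 1, so a correlation computed over so few systems is a coarse signal.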

The practical implications extend beyond academic interest. As organizations deploy LLMs globally, this research shows that a claim of multilingual capability needs rigorous, language-specific validation. The recommended minimum of 2,500 sentences for stable rankings suggests that many published benchmarks built on smaller test sets may report findings that are artifacts of sample size. Organizations developing African-language NLP applications cannot rely on metric-based claims alone; they need native-speaker validation. The embedding collapse result suggests that foundation models may encode African languages inadequately, which may call for architectural changes rather than simple fine-tuning.
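One simple way to probe for the kind of embedding collapse described above is to measure the mean pairwise cosine similarity among embeddings of unrelated sentences in the same language. The sketch below assumes the sentence embeddings have already been extracted; the vectors shown are toy values, the function names are hypothetical, and only the 0.99 threshold comes from the summary.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def looks_collapsed(vectors, threshold=0.99):
    """Flag a set of embeddings whose mean pairwise similarity exceeds threshold."""
    return mean_pairwise_cosine(vectors) > threshold


# Toy embeddings (illustration only): one nearly collapsed set, one spread out.
collapsed = [[1.0, 0.01], [1.0, 0.02], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print("collapsed set flagged:", looks_collapsed(collapsed))
print("spread set flagged:", looks_collapsed(spread))
```

If unrelated sentences all score near 1.0 against each other, a similarity-based metric has almost no range left in which to distinguish a good translation from a bad one.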

Key Takeaways

→ LLM translation quality and model rankings vary substantially across low-resource African languages: Hausa reaches acceptable quality while Fongbe performs poorly
→ Automatic metrics correlate with human judgment very differently from one language to the other, and neural metrics exhibit embedding collapse that limits their utility
→ Model rankings differ by language: Gemini leads for Fongbe while GPT-4o Mini leads for Hausa, so performance on one African language does not predict performance on another
→ At least 2,500 sentences are needed for stable system rankings; smaller datasets produce spurious findings that reverse at scale
→ Multi-metric evaluation with native-speaker validation is essential for low-resource African languages, because standard benchmarking approaches prove unreliable
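The sample-size takeaway above can be probed with paired bootstrap resampling, a standard technique for checking whether a ranking between two systems is stable; this summary does not say it is the method the paper used. The sketch below assumes per-sentence scores (for example sentence-level chrF or human ratings) are available for both systems. The data is synthetic and the names are hypothetical.

```python
import random


def win_rate(scores_a, scores_b, sample_size, n_resamples=200, seed=0):
    """Fraction of bootstrap resamples in which system A's mean beats system B's.

    Both systems are scored on the same sentences, so each resample draws
    sentence indices once and applies them to both score lists (paired).
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(sample_size)]
        mean_a = sum(scores_a[i] for i in idx) / sample_size
        mean_b = sum(scores_b[i] for i in idx) / sample_size
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples


# Synthetic per-sentence scores (illustration only): system A is slightly
# better on average, but individual sentences are noisy.
data_rng = random.Random(42)
scores_a = [data_rng.gauss(0.52, 0.2) for _ in range(3000)]
scores_b = [data_rng.gauss(0.50, 0.2) for _ in range(3000)]

for size in (100, 500, 2500):
    rate = win_rate(scores_a, scores_b, sample_size=size)
    print(f"sample size {size}: A ranked above B in {rate:.0%} of resamples")
```

A win rate near 50% means the ranking is essentially a coin flip at that sample size, while a rate near 0% or 100% indicates a ranking that survives resampling.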


