
Evaluating Large Language Models for Hausa and Fongbe Machine Translation: Benchmarks, Failures, and Metric Reliability

arXiv – CS AI | Mahounan Pericles Adjovi, Roald Eiselen, Prasenjit Mitra
AI Summary

Researchers evaluated four major LLMs (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, Qwen2.5-7B) on English-to-Hausa and English-to-Fongbe translation, finding that translation quality varies dramatically by language, model rankings differ across languages, and automatic evaluation metrics show weak correlation with human judgment for low-resource African languages.

Analysis

This research addresses a critical gap in LLM evaluation: the reliability of machine translation for low-resource African languages. The study benchmarks four leading models on two typologically distinct West African languages and finds that performance patterns do not generalize between them. Hausa reached acceptable translation quality (human scores of 4.0-4.5 out of 5), while Fongbe performed poorly (1.0-2.2 out of 5), with a consistent gap of roughly 3x in BLEU between the two languages across all systems. Crucially, model rankings shifted by language: Gemini led for Fongbe while GPT-4o Mini led for Hausa, indicating that an LLM's performance on one low-resource language does not predict its performance on another.

The research exposes fundamental flaws in current evaluation methodology. Standard automatic metrics (BLEU, chrF++) correlated perfectly with human judgment for Fongbe but only weakly for Hausa (rho = 0.5), where all metrics ranked Claude first even though human evaluators preferred GPT-4o Mini. Neural metrics such as BERTScore exhibited embedding collapse (within-language similarity exceeding 0.99), leaving them unable to differentiate translation quality. These findings suggest that evaluation frameworks designed for high-resource language pairs can fail for low-resource African languages.
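To make the metric-versus-human comparison concrete, here is a minimal sketch of a system-level rank correlation, assuming that "rho" refers to Spearman's rank correlation (the summary does not say so explicitly). The system labels and all scores below are hypothetical placeholders, not figures from the paper.

```python
def ranks(values):
    """Return 1-based ranks (highest value gets rank 1), averaging ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four unnamed systems (illustrative only).
metric_scores = [30.0, 28.0, 25.0, 20.0]  # e.g. an automatic metric per system
human_scores = [4.2, 4.5, 4.0, 4.1]       # e.g. mean human rating per system

rho = spearman(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

With only four systems, a single swapped pair moves rho a long way, so a system-level correlation on this few points is a coarse signal at best.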

The practical implications extend beyond academic interest. As organizations deploy LLMs globally, this research shows that claims of multilingual capability require rigorous, language-specific validation. The recommended minimum of 2,500 sentences for stable system rankings suggests that many published benchmarks built on smaller test sets may report rankings that are sampling artifacts. Organizations developing African-language NLP applications cannot rely on metric-based claims alone; they need native-speaker validation. The embedding collapse issue also suggests that foundation models may encode these languages inadequately, which could call for architectural changes rather than fine-tuning alone.
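One way to see why small test sets can yield unstable rankings is a paired bootstrap over sentences. The sketch below is an illustration under stated assumptions, not the procedure used in the paper: the function name `ranking_agreement`, the synthetic score distributions, and the sample sizes other than 2,500 are all hypothetical.

```python
import random


def ranking_agreement(scores_a, scores_b, sample_size, n_resamples=1000, seed=0):
    """Fraction of paired resamples in which system A's mean beats system B's.

    scores_a and scores_b hold one sentence-level score per test sentence,
    aligned by index. Values near 0.5 mean the ranking is close to a coin flip
    at this sample size; values near 0 or 1 mean it is stable.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one entry per sentence")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw the same sentence indices for both systems (paired resampling).
        idx = [rng.randrange(n) for _ in range(sample_size)]
        mean_a = sum(scores_a[i] for i in idx) / sample_size
        mean_b = sum(scores_b[i] for i in idx) / sample_size
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples


# Synthetic sentence-level scores for two hypothetical systems whose true
# quality differs only slightly (not data from the paper).
data_rng = random.Random(42)
system_a = [data_rng.gauss(0.52, 0.2) for _ in range(3000)]
system_b = [data_rng.gauss(0.50, 0.2) for _ in range(3000)]

for size in (100, 500, 2500):
    agreement = ranking_agreement(system_a, system_b, size, n_resamples=200)
    print(f"n={size}: A ranked above B in {agreement:.0%} of resamples")
```

If the agreement hovers near 50% at a given sample size, a ranking reported at that size should be treated as provisional.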

Key Takeaways
  • LLM translation quality and model rankings vary substantially across low-resource African languages, with Hausa achieving acceptable quality while Fongbe performs poorly
  • Automatic evaluation metrics show dramatically different correlation with human judgment across languages, with neural metrics exhibiting embedding collapse that limits their utility
  • Model rankings differ by language: Gemini leads for Fongbe while GPT-4o Mini leads for Hausa, indicating that performance on one African language does not predict performance on another
  • A minimum of 2,500 sentences is required for stable system rankings; smaller datasets produce spurious findings that reverse at scale
  • Multi-metric evaluation with native-speaker validation is essential for low-resource African languages, as standard benchmarking approaches prove unreliable
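The embedding-collapse takeaway above can be checked with a simple diagnostic. The sketch below assumes that "within-language similarity" means the mean pairwise cosine similarity between sentence embeddings, which the summary does not specify; the helper names and the toy vectors are hypothetical, and only the 0.99 threshold comes from the text.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def looks_collapsed(vectors, threshold=0.99):
    """Flag a set of embeddings whose mean pairwise similarity exceeds threshold."""
    return mean_pairwise_cosine(vectors) > threshold


# Toy 2-D "embeddings" (illustrative only): one nearly identical set,
# one set pointing in clearly different directions.
collapsed = [[1.0, 0.01], [1.0, 0.02], [1.0, 0.03]]
spread = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(looks_collapsed(collapsed))  # nearly parallel vectors
print(looks_collapsed(spread))     # well-separated vectors
```

When unrelated sentences in a language all score this close to one another, a similarity-based metric has little room left to separate good translations from bad ones.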