y0news

#low-resource-languages News & Analysis

20 articles tagged with #low-resource-languages. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Researchers developed multimodal large language models for Basque, a low-resource language, finding that as little as 20% Basque training data yields solid performance. The study shows that a specialized Basque language backbone isn't required, potentially enabling MLLM development for other underrepresented languages.

🧠 Llama
AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

Researchers introduce a Cross-Lingual Mapping Task during LLM pre-training to improve multilingual performance across languages with varying data availability. The method achieves significant improvements in machine translation, cross-lingual question answering, and multilingual understanding without requiring extensive parallel data.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

Researchers identify that reasoning language models exhibit worse performance in low-resource languages due to failures in language understanding rather than reasoning capability itself. The study proposes Selective Translation, which strategically adds English translations only when understanding failures are detected, achieving near full-translation performance while translating just 20% of inputs.
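The routing logic behind Selective Translation can be sketched in a few lines. This is a minimal illustration of the idea only, assuming a confidence probe for understanding failures; the probe, translator, and solver below are hypothetical stand-ins, not the paper's actual components.

```python
# Sketch of the Selective Translation routing idea: translate an input to
# English only when an understanding-failure check fires, so most inputs
# stay in the original language.

def understanding_confidence(text: str) -> float:
    """Stand-in probe scoring how well the model 'understands' the input.
    A real system would use model-internal signals, e.g. a comprehension
    check or round-trip consistency."""
    # Toy heuristic: pretend short inputs are well understood.
    return 1.0 if len(text.split()) <= 5 else 0.4

def selective_translate(inputs, translate, solve, threshold=0.5):
    """Route each input: solve directly if understood, else translate first.

    Returns the outputs and the fraction of inputs that were translated."""
    results, translated = [], 0
    for text in inputs:
        if understanding_confidence(text) < threshold:
            text = translate(text)  # fall back to English only here
            translated += 1
        results.append(solve(text))
    return results, translated / len(inputs)
```

With the threshold tuned so that only a small share of inputs trip the check, such a router recovers most of the full-translation gain while translating a fraction of the inputs, which matches the ~20% figure reported above.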

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

Researchers have developed RandSymKL, a debiasing technique for Bangla language models that mitigates gender bias in classification tasks like sentiment analysis and hate speech detection. The study introduces four manually annotated benchmark datasets with gender-perturbation testing and demonstrates that the approach effectively reduces bias while maintaining competitive accuracy compared to existing methods.
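The summary does not spell out the RandSymKL formulation, but the general shape of a symmetric-KL debiasing penalty can be sketched: the classifier's predicted distributions on a sentence and on its gender-perturbed counterpart are pushed together. The formula below is an illustrative consistency loss under that assumption, not the paper's implementation.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def sym_kl(p, q):
    """Symmetric KL: invariant to swapping the two distributions."""
    return 0.5 * (kl(p, q) + kl(q, p))

# Predictions on an original Bangla sentence and its gender-swapped version;
# a debiased model should make these nearly identical, so the penalty is
# added to the task loss during training.
p_orig = [0.7, 0.3]
p_swap = [0.4, 0.6]
penalty = sym_kl(p_orig, p_swap)
```

The symmetric form matters because neither the original nor the perturbed sentence is privileged: the penalty is the same whichever direction the perturbation runs.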

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

Researchers evaluated how well large language models can perform formal grammar-based translation tasks using in-context learning, finding that LLM translation accuracy degrades significantly with grammar complexity and sentence length. The study identifies specific failure modes including vocabulary hallucination and untranslated source words, revealing fundamental limitations in LLMs' ability to apply formal grammatical rules to translation tasks.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset

Researchers successfully fine-tuned LLaMA 3.1-8B for medical transcription in Finnish, a low-resource language, achieving strong semantic similarity despite low n-gram overlap. The study used simulated clinical conversations from students and demonstrates the feasibility of privacy-oriented domain-specific language models for clinical documentation in underrepresented languages.

AI · Bearish · arXiv – CS AI · Mar 3 · 6/10 · 4

Are LLMs Ready to Replace Bangla Annotators?

A comprehensive study of 17 Large Language Models as automated annotators for Bangla hate speech detection reveals significant bias and instability issues. The research found that larger models don't necessarily perform better than smaller, task-specific ones, raising concerns about LLM reliability for sensitive annotation tasks in low-resource languages.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7

Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing

Researchers developed a new AI framework using RNN-T architecture to improve speech recognition for Taiwanese Hakka, an endangered low-resource language with high dialectal variability. The system achieved 57% and 40% relative error rate reductions for two different writing systems, marking the first systematic investigation into Hakka dialect variations in ASR.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Researchers introduced ViCLIP-OT, the first foundation vision-language model specifically designed for Vietnamese image-text retrieval. The model integrates CLIP-style contrastive learning with Similarity-Graph Regularized Optimal Transport (SIGROT) loss, achieving significant improvements over existing baselines with 67.34% average Recall@K on UIT-OpenViIC benchmark.

AI · Bullish · arXiv – CS AI · Feb 27 · 5/10 · 3

Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Researchers developed Lipi-Ghor-882, an 882-hour Bengali speech dataset, and demonstrated that targeted fine-tuning with synthetic acoustic degradation significantly improves automatic speech recognition for long-form Bengali audio. Their dual pipeline achieved a 0.019 Real-Time Factor, establishing new benchmarks for low-resource speech processing.

AI · Bullish · Microsoft Research Blog · Feb 5 · 6/10 · 3

Paza: Introducing automatic speech recognition benchmarks and models for low resource languages

Microsoft Research launched Paza, a human-centered speech recognition pipeline, and PazaBench, the first benchmark leaderboard specifically designed for low-resource languages. The initiative covers 39 African languages with 52 models and has been tested with real communities to improve AI accessibility for underrepresented languages.

AI · Neutral · arXiv – CS AI · Mar 26 · 4/10

Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language

Researchers developed Konkani LLM, a specialized language model for the low-resource Indian language Konkani, using a synthetic 100k instruction dataset. The model addresses training data scarcity across multiple scripts (Devanagari, Romi, Kannada) and demonstrates competitive performance against proprietary models in machine translation tasks.

🧠 Gemini · 🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi

Researchers have developed LilMoo, a 0.6-billion parameter Hindi language model trained from scratch using a transparent, reproducible pipeline optimized for limited compute environments. The model outperforms similarly sized multilingual baselines like Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that language-specific pretraining can rival larger multilingual models.

AI · Bullish · arXiv – CS AI · Mar 4 · 4/10 · 2

An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization

Researchers developed a multistage AI approach for Bengali speech transcription and speaker diarization, achieving significant improvements in processing long-form audio recordings. The system used fine-tuned Whisper models and custom segmentation techniques to address the low-resource nature of Bengali in speech technology applications.

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 4

Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration

Researchers developed an optimized speech-to-text translation pipeline for Nepali-to-English that addresses punctuation loss issues in low-resource language processing. By implementing a Punctuation Restoration Module, they achieved a 4.90 BLEU point improvement over baseline systems, demonstrating significant quality gains for cascaded translation architectures.
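The cascade described above can be sketched as three stages, with punctuation restoration sitting between ASR and MT so the translation model sees sentence boundaries instead of one unpunctuated stream. All three stages below are hypothetical stubs standing in for real Nepali ASR, punctuation, and MT models.

```python
def asr(audio_path: str) -> str:
    # Stand-in for a Nepali ASR model: returns an unpunctuated hypothesis.
    return "म घर जान्छु म भोलि आउँछु"

def restore_punctuation(text: str) -> str:
    # A real Punctuation Restoration Module would be a trained sequence
    # labeller; this stub only illustrates where it sits in the cascade,
    # inserting the Devanagari danda (।) as a sentence boundary.
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]) + " । " + " ".join(words[mid:]) + " ।"

def translate(text: str) -> str:
    # Stand-in for a Nepali-to-English MT model.
    return "I go home. I will come tomorrow."

def s2tt_pipeline(audio_path: str) -> str:
    """Cascade: ASR -> punctuation restoration -> MT."""
    return translate(restore_punctuation(asr(audio_path)))
```

The reported 4.90 BLEU gain comes precisely from this middle stage: without it, the MT model must guess sentence boundaries from an unsegmented transcript.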

AI · Neutral · arXiv – CS AI · Mar 2 · 5/10 · 4

Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek

A study evaluated large language models (Claude, Gemini, ChatGPT) translating Ancient Greek texts, finding high performance on previously translated works (95.2/100) but declining quality on untranslated technical texts (79.9/100). Terminology rarity was identified as a strong predictor of translation failure, with rare terms causing catastrophic performance drops.
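A rarity predictor of the kind the study describes can be approximated with corpus frequencies: terms that rarely or never appear in a reference corpus get a high rarity score, flagging passages at risk of catastrophic failure. The scoring formula, threshold, and toy corpus below are illustrative assumptions, not the study's method.

```python
from collections import Counter

def rarity_scores(terms, corpus_counts, total):
    """Rarity = 1 - relative corpus frequency (1.0 = never seen)."""
    return {t: 1.0 - corpus_counts.get(t, 0) / total for t in terms}

def flag_risky(terms, corpus_counts, total, threshold=0.999):
    """Flag terms whose rarity exceeds the threshold as translation risks."""
    scores = rarity_scores(terms, corpus_counts, total)
    return [t for t, s in scores.items() if s >= threshold]

# Toy Ancient Greek corpus counts (illustrative values).
corpus = Counter({"λόγος": 500, "καί": 2000, "ἀστρολάβος": 1})
total = sum(corpus.values())
risky = flag_risky(["καί", "ἀστρολάβος", "σφυγμομέτρης"], corpus, total)
```

Under this sketch, common function words pass while hapax-like technical terms are flagged, mirroring the study's finding that rare terminology, not text length or genre alone, predicts where translation quality collapses.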

AI · Bullish · arXiv – CS AI · Feb 27 · 4/10 · 6

ULTRA: Urdu Language Transformer-based Recommendation Architecture

Researchers developed ULTRA, a new AI architecture specifically designed for semantic content recommendation in Urdu, a low-resource language. The system uses a dual-embedding approach with query-length aware routing to improve news retrieval, achieving over 90% precision gains compared to existing methods.
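Query-length-aware routing of the sort the summary describes can be sketched simply: short keyword-style queries go to one encoder, longer natural-language queries to another. The two encoders and the 4-word cutoff are illustrative assumptions, not ULTRA's actual components.

```python
def embed_short(query: str):
    # Stand-in for a keyword- or lexically-oriented encoder.
    return ("short", query.lower())

def embed_long(query: str):
    # Stand-in for a sentence-level semantic encoder.
    return ("long", query.lower())

def route_query(query: str, cutoff: int = 4):
    """Pick an encoder based on the query's length in words."""
    encoder = embed_short if len(query.split()) <= cutoff else embed_long
    return encoder(query)
```

The rationale for such a split is that very short Urdu news queries behave more like keyword lookups, while longer queries carry enough context for a semantic encoder to pay off.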

AI · Bullish · Hugging Face Blog · Nov 15 · 4/10 · 6

Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers

The article is a tutorial on fine-tuning XLSR-Wav2Vec2, a cross-lingual speech representation model, for automatic speech recognition (ASR) in low-resource languages using Hugging Face Transformers. It targets practitioners building speech processing capabilities for underserved languages.

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10 · 6

Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages

Researchers propose Task-Lens, a cross-task survey analyzing 50 Indian speech datasets across 26 languages for nine downstream speech tasks. The study reveals untapped metadata in existing datasets that could support multiple AI speech applications and identifies critical gaps in resources for underserved Indian languages.