#low-resource-languages News & Analysis

29 articles tagged with #low-resource-languages. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

Researchers introduce BioELX, a two-stage cross-lingual biomedical entity linking system that maps medical mentions across languages to knowledge base identifiers without requiring task-specific training data. The framework combines multilingual alias-enriched retrieval with LLM-based ranking, achieving state-of-the-art results across five benchmarks with substantial improvements for low-resource languages.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

Researchers address a critical limitation in Spoken Language Models (SLMs) for low-resource languages by identifying a fundamental trade-off called the Stability-Expressivity Gap, where synthetic data improves phonetic accuracy but suppresses prosodic variability. The proposed self-alignment frameworks, DGSA and TDSC, recover expressivity while maintaining stability, achieving performance comparable to commercial systems and enabling zero-shot voice cloning for Lao.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought

Researchers introduce AdaMCoT, a framework that improves multilingual reasoning in large language models by dynamically routing intermediate thoughts through optimal 'thinking languages' before generating target-language responses. The approach achieves significant performance gains in low-resource languages without requiring additional pretraining, addressing a key limitation in current multilingual AI systems.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Researchers developed multimodal large language models (MLLMs) for Basque, a low-resource language, finding that a training mix with only 20% Basque data is enough for solid performance. The study also shows that a specialized Basque language backbone is not required, potentially enabling MLLM development for other underrepresented languages.

🧠 Llama

AI · Neutral · arXiv – CS AI · 2d ago · 5/10

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

Researchers evaluated nine automatic speech recognition (ASR) models on Dutch child speech datasets, finding that fine-tuned Whisper-medium achieved 5.54% word error rate on clean data but 70.37% on noisy data. Using an utterance-level selection method, they identified 42% of clean recordings as reliable without manual verification, achieving 98.3% precision and significantly reducing annotation overhead for child speech research.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Researchers develop strategies for extending large language models as evaluation tools to multilingual settings, addressing challenges in low-resource languages. The study reveals that fine-tuned smaller models match proprietary performance when in-domain data exists, while larger zero-shot models excel in out-of-domain scenarios, providing practical guidance for building multilingual evaluation systems.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Researchers have released ParsVoice, a 2,200-hour Persian speech dataset with 1.36 million aligned segments from 1,815 speakers, making it 25 times larger than previous Persian TTS resources. The dataset was constructed using an automated pipeline combining ASR, fine-tuned language models, and quality assessment, and validation shows the corpus enables multi-speaker text-to-speech systems competitive with existing solutions.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

Researchers have developed Bangla-WhisperDiar, a fine-tuned speech recognition and speaker diarization system that achieves a 24.41% word error rate for ASR and a 23.92% diarization error rate. The work addresses critical gaps in Bangla language processing by combining OpenAI's Whisper model with PyAnnote's diarization framework, trained on custom datasets with extensive data augmentation.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Multilingual Safety Alignment via Self-Distillation

Researchers propose Multilingual Self-Distillation (MSD), a framework that transfers safety safeguards from high-resource languages like English to vulnerable low-resource languages in large language models. The method eliminates the need for expensive multilingual response data by leveraging an LLM's existing safety capabilities, demonstrating effective cross-lingual protection across diverse jailbreak benchmarks.

AI · Bullish · arXiv – CS AI · May 9 · 6/10

ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model

Researchers introduced ANGOFA, four pre-trained language models tailored for Angolan languages using Multilingual Adaptive Fine-tuning (MAFT) with OFA embedding initialization and synthetic data. The approach achieved 12.3- and 3.8-point improvements over previous state-of-the-art models, addressing a critical gap in NLP support for very low-resource African languages.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

Researchers introduce a Cross-Lingual Mapping Task during LLM pre-training to improve multilingual performance across languages with varying data availability. The method achieves significant improvements in machine translation, cross-lingual question answering, and multilingual understanding without requiring extensive parallel data.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

Researchers identify that reasoning language models exhibit worse performance in low-resource languages due to failures in language understanding rather than reasoning capability itself. The study proposes Selective Translation, which strategically adds English translations only when understanding failures are detected, achieving near full-translation performance while translating just 20% of inputs.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

Researchers have developed RandSymKL, a debiasing technique for Bangla language models that mitigates gender bias in classification tasks like sentiment analysis and hate speech detection. The study introduces four manually annotated benchmark datasets with gender-perturbation testing and demonstrates that the approach effectively reduces bias while maintaining competitive accuracy compared to existing methods.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

Researchers evaluated how well large language models can perform formal grammar-based translation tasks using in-context learning, finding that LLM translation accuracy degrades significantly with grammar complexity and sentence length. The study identifies specific failure modes including vocabulary hallucination and untranslated source words, revealing fundamental limitations in LLMs' ability to apply formal grammatical rules to translation tasks.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation

Researchers discovered that multilingual Mixture-of-Experts (MoE) models exhibit 'Language Routing Isolation,' where high- and low-resource languages activate different expert sets. They developed RISE, a framework that exploits this isolation to improve low-resource language performance by up to 10.85% F1 while preserving performance in other languages.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset

Researchers successfully fine-tuned LLaMA 3.1-8B for medical transcription in Finnish, a low-resource language, achieving strong semantic similarity despite low n-gram overlap. The study used simulated clinical conversations from students and demonstrates the feasibility of privacy-oriented domain-specific language models for clinical documentation in underrepresented languages.

AI · Bearish · arXiv – CS AI · Mar 3 · 6/10 · 4

Are LLMs Ready to Replace Bangla Annotators?

A comprehensive study of 17 Large Language Models as automated annotators for Bangla hate speech detection reveals significant bias and instability issues. The research found that larger models don't necessarily perform better than smaller, task-specific ones, raising concerns about LLM reliability for sensitive annotation tasks in low-resource languages.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7

Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing

Researchers developed a framework based on the RNN-T architecture to improve speech recognition for Taiwanese Hakka, an endangered low-resource language with high dialectal variability. The system achieved 57% and 40% relative error rate reductions for two different writing systems, marking the first systematic investigation into Hakka dialect variations in ASR.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Researchers introduced ViCLIP-OT, the first foundation vision-language model specifically designed for Vietnamese image-text retrieval. The model integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss, achieving significant improvements over existing baselines with 67.34% average Recall@K on the UIT-OpenViIC benchmark.

AI · Bullish · arXiv – CS AI · Feb 27 · 5/10 · 3

Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Researchers developed Lipi-Ghor-882, an 882-hour Bengali speech dataset, and demonstrated that targeted fine-tuning with synthetic acoustic degradation significantly improves automatic speech recognition for long-form Bengali audio. Their dual pipeline achieved a 0.019 Real-Time Factor, establishing new benchmarks for low-resource speech processing.

AI · Bullish · Microsoft Research Blog · Feb 5 · 6/10 · 3

Paza: Introducing automatic speech recognition benchmarks and models for low resource languages

Microsoft Research launched Paza, a human-centered speech recognition pipeline, and PazaBench, the first benchmark leaderboard specifically designed for low-resource languages. The initiative covers 39 African languages with 52 models and has been tested with real communities to improve AI accessibility for underrepresented languages.

AI · Neutral · arXiv – CS AI · Mar 26 · 4/10

Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language

Researchers developed Konkani LLM, a specialized language model for the low-resource Indian language Konkani, using a synthetic 100k instruction dataset. The model addresses training data scarcity across multiple scripts (Devanagari, Romi, Kannada) and demonstrates competitive performance against proprietary models in machine translation tasks.

🧠 Gemini · 🧠 Llama

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi

Researchers have developed LilMoo, a 0.6-billion-parameter Hindi language model trained from scratch using a transparent, reproducible pipeline optimized for limited-compute environments. The model outperforms similarly sized multilingual baselines such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that language-specific pretraining can rival larger multilingual models.

AI · Bullish · arXiv – CS AI · Mar 4 · 4/10 · 2

An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization

Researchers developed a multistage AI approach for Bengali speech transcription and speaker diarization, achieving significant improvements in processing long-form audio recordings. The system used fine-tuned Whisper models and custom segmentation techniques to address the low-resource nature of Bengali in speech technology applications.

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 4

Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration

Researchers developed an optimized speech-to-text translation pipeline for Nepali-to-English that addresses punctuation loss issues in low-resource language processing. By implementing a Punctuation Restoration Module, they achieved a 4.90 BLEU point improvement over baseline systems, demonstrating significant quality gains for cascaded translation architectures.
