#multilingual-nlp News & Analysis

14 articles tagged with #multilingual-nlp. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 19 · 7/10

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

Researchers reveal significant limitations in using English-centric persona-based methods to generate multilingual mental health datasets, finding that simply adding nationality and language parameters introduces clinical inconsistencies and causes LLM evaluators to perform poorly on non-English depression severity assessments. The study underscores the urgent need for culturally responsive data generation approaches to build equitable AI mental health systems globally.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Linguistics-Aware Non-Distortionary LLM Watermarking

Researchers introduce LUNA, a linguistically-aware watermarking technique for large language models that maintains output quality across multiple languages while enabling reliable detection without model provider access. The method achieves 99.59% detection accuracy with minimal perplexity degradation (0.045 mean shift), outperforming eight baseline approaches across six typologically diverse languages.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts

Researchers introduce Sentra-Guard, a real-time defense system that detects and mitigates jailbreak and prompt injection attacks on large language models with 99.96% accuracy. The multilingual framework combines FAISS-indexed semantic embeddings with fine-tuned transformers and human-in-the-loop feedback, significantly outperforming existing defenses like LlamaGuard-2 and OpenAI Moderation.

🏢 OpenAI

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Researchers introduce SARA, a framework that improves multilingual performance in Mixture-of-Experts language models by aligning routing patterns between low-resource and high-resource languages. The method uses semantic anchoring and Jensen-Shannon divergence constraints to enable better expert sharing across languages, demonstrating measurable improvements on benchmark tests.
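
For readers unfamiliar with the divergence mentioned above, here is a minimal sketch of Jensen-Shannon divergence between two expert-routing distributions. It is not the paper's implementation: the function names and the example routing vectors are hypothetical, and how SARA turns the divergence into a training constraint is not described in this summary.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln(2)."""
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical routing probabilities over four experts for the same
# sentence in a high-resource and a low-resource language.
high_resource = [0.70, 0.20, 0.05, 0.05]
low_resource = [0.10, 0.20, 0.30, 0.40]

divergence = js_divergence(high_resource, low_resource)
```

A value near zero means the two languages send the sentence to the same experts; a value near ln(2) means the routing barely overlaps.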

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

From RAG to Agentic RAG for Faithful Islamic Question Answering

Researchers introduce IslamicFaithQA, a 3,810-item bilingual benchmark, together with an agentic RAG framework designed to improve the accuracy and reliability of Islamic question-answering systems. The work addresses gaps in LLM evaluation by measuring hallucination rates and abstention capabilities, and reports state-of-the-art performance through iterative evidence-seeking grounded in Qur'anic text.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

STORM: Stepwise Token Optimization with Reward-Guided Beam Search

Researchers introduce STORM, a self-supervised framework that optimizes lexical query expansion for information retrieval by using BM25 reward signals during generation. The approach enables smaller language models (0.6B-8B parameters) to match larger proprietary rewriters while maintaining BM25's speed efficiency, and demonstrates zero-shot transfer across 18 languages.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers

AgriGov introduces a curated trilingual dataset (English-Hindi-Marathi) containing 8,000 parallel sentence pairs focused on Indian agricultural government schemes and farmer welfare programs. The dataset combines automated data collection, machine translation, and human post-editing to create domain-specific resources for machine translation, question-answering, and information retrieval systems aimed at farmer-facing applications.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

XCR-Bench: Benchmarking Cross-Cultural Reasoning in LLMs via Culture-Specific Items and Hall's Triad

Researchers introduce XCR-Bench, a benchmark dataset for evaluating cross-cultural reasoning in large language models, containing 4,100 parallel sentences and 1,098 culture-specific items across three reasoning tasks. The study reveals that state-of-the-art multilingual LLMs consistently fail to properly identify and adapt culturally sensitive content, exposing systematic biases and gaps in cultural competency.

AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

Researchers have developed a novel neural architecture combining Kolmogorov-Arnold Networks (KAN) with BiGRU models for classifying and summarizing legal documents in multilingual, low-resource settings. Tested on Bengali, English, and transliterated Bengali legal documents from Bangladesh, the hybrid model achieved 67.96% classification accuracy while demonstrating that KAN integration improved performance by over 10 percentage points.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

EuroBERT: Scaling Multilingual Encoders for European Languages

Researchers introduce EuroBERT, a family of multilingual encoder models that apply recent advances from generative AI to improve vector representations across European and global languages. The models outperform existing alternatives on retrieval, classification, and coding tasks while supporting sequences up to 8,192 tokens, with code and checkpoints publicly released.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

Researchers demonstrate that fine-tuning Spanish biomedical embeddings with synthetic data generated by large language models significantly improves clinical code retrieval across multiple European languages. The two-stage retrieval system outperforms existing baselines such as BioBERT-ST, particularly for non-English languages, addressing a critical gap in multilingual medical AI applications.

🧠 Gemini

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification

Researchers have created a multilingual text simplification corpus by collecting and aligning sentence-level data from comparable corpora across five languages (Catalan, English, French, Italian, and Spanish). The dataset addresses a critical gap in NLP resources for non-English languages and is publicly available for training and evaluating text simplification models.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction

Researchers have developed GLiNER2-PII, a compact 0.3B-parameter multilingual model for detecting personally identifiable information across 42 entity types at character-level precision. Trained on a synthetic corpus of 4,910 annotated texts to overcome privacy constraints in real data collection, the model outperforms existing systems including OpenAI's Privacy Filter on benchmark evaluations and is now publicly available on Hugging Face.

🏢 OpenAI · 🏢 Hugging Face

AI · Neutral · arXiv – CS AI · May 4 · 6/10

ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts

Researchers have introduced ViLegalNLI, the first large-scale Vietnamese Natural Language Inference dataset for legal texts, containing 42,012 premise-hypothesis pairs from statutory documents. The dataset enables AI systems to understand legal reasoning patterns and supports development of reliable AI tools for Vietnamese legal analysis and decision-making.