Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
Researchers introduce an interpretable deep learning framework to study how grammatical gender evolved from Latin's three-gender system to Occitan's two-gender structure. The work demonstrates that conventional tokenization fails in low-resource historical linguistics and proposes improvements while analyzing how gender information distributes between word roots and sentence context.
This research addresses a fundamental question in historical linguistics: how language systems simplify and restructure over time. The shift from Latin's tripartite gender system (masculine, feminine, neuter) to the Romance languages' bipartite system (masculine, feminine) represents a significant morphological reorganization that occurred over centuries. Understanding this transition illuminates broader principles of language change and the mechanisms underlying grammatical evolution.
The study's core contribution lies in applying interpretable deep learning to historical language data, a methodologically challenging domain characterized by sparse textual evidence and limited training examples. By demonstrating that standard tokenization approaches fail in this context, the researchers expose assumptions baked into modern NLP tools that work poorly for historical texts. Their customized tokenizer addresses these limitations, establishing that preprocessing strategy significantly impacts model performance on low-resource problems.
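To make the tokenization point concrete, here is a minimal sketch of a suffix-aware splitter that keeps an inflectional ending as its own token, rather than leaving segmentation to frequency statistics learned from a small corpus. This summary does not describe the researchers' tokenizer, so everything here is an assumption for illustration: the ending inventory `LATIN_ENDINGS`, the `##` marker, and both function names are hypothetical.

```python
# Hypothetical, deliberately tiny ending inventory for illustration only;
# it is NOT the tokenizer described in the paper.
LATIN_ENDINGS = ("orum", "arum", "ibus", "um", "us", "am", "ae",
                 "is", "os", "as", "a", "o", "i", "e")


def suffix_aware_tokenize(word, endings=LATIN_ENDINGS, min_stem=2):
    """Split a word into [stem, '##ending'], trying the longest endings first.

    Words with no matching ending, or whose stem would be shorter than
    `min_stem` characters, are returned as a single token.
    """
    lower = word.lower()
    for ending in sorted(endings, key=len, reverse=True):
        if lower.endswith(ending) and len(lower) - len(ending) >= min_stem:
            return [lower[:-len(ending)], "##" + ending]
    return [lower]


def tokenize_sentence(sentence):
    """Tokenize a whitespace-separated sentence word by word."""
    tokens = []
    for word in sentence.split():
        tokens.extend(suffix_aware_tokenize(word))
    return tokens


if __name__ == "__main__":
    print(tokenize_sentence("Rosa in templo"))
```

Keeping the ending as a separate token means a downstream model sees the gender-bearing morphology as one consistent unit. That is one plausible reason a tailored tokenizer could help when training data is scarce.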
The lexical-versus-contextual analysis examines how gender information was redistributed during the Latin-to-Occitan transition. By quantifying how much morphological features contribute to gender prediction at the lemma level, and how much part-of-speech categories contribute at the sentence level, the work indicates where gender information sits within the linguistic system. This decomposition provides interpretability often absent in black-box deep learning approaches.
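One simple way to picture a feature-contribution measurement is ablation: remove a feature, refit, and record how much accuracy drops. The sketch below does this with a majority-vote lookup on a handful of invented rows. It is a stand-in technique, not the attribution method used in the paper, and the data, the feature names `ending` and `det`, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Invented toy rows: a lexical cue (word ending) and a contextual cue
# (a neighbouring determiner). These are NOT data from the study.
ROWS = [
    {"ending": "a", "det": "la", "gender": "f"},
    {"ending": "a", "det": "la", "gender": "f"},
    {"ending": "e", "det": "la", "gender": "f"},
    {"ending": "e", "det": "lo", "gender": "m"},
    {"ending": "",  "det": "lo", "gender": "m"},
    {"ending": "",  "det": "lo", "gender": "m"},
]


def fit_lookup(rows, features):
    """Map each combination of feature values to its majority gender."""
    counts = defaultdict(Counter)
    for row in rows:
        key = tuple(row[f] for f in features)
        counts[key][row["gender"]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def accuracy(rows, features):
    """Training-set accuracy of the lookup model (toy setting, no held-out split)."""
    model = fit_lookup(rows, features)
    hits = sum(model[tuple(r[f] for f in features)] == r["gender"] for r in rows)
    return hits / len(rows)


def ablation_contributions(rows, features):
    """Accuracy lost when each feature is removed in turn."""
    full = accuracy(rows, features)
    return {
        f: full - accuracy(rows, [g for g in features if g != f])
        for f in features
    }


if __name__ == "__main__":
    print(ablation_contributions(ROWS, ["ending", "det"]))
```

In this invented data the ambiguous ending `e` can only be resolved by the determiner, so removing `det` costs accuracy while removing `ending` does not. A real analysis would use held-out data and a proper attribution method.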
For computational linguists and digital humanities scholars, this framework offers a replicable methodology for investigating other diachronic phenomena in low-resource settings. The public release of code, datasets, and results enables reproducibility and extension to other language pairs and historical transitions. The work demonstrates that contemporary machine learning techniques, properly adapted, can illuminate centuries-old linguistic questions previously accessible only through manual analysis.
- Custom tokenization strategies substantially improve deep learning performance on historical low-resource linguistic data compared to conventional NLP preprocessing
- Gender information migrated from morphological markers on word roots to distributional patterns across sentence context during the Latin-to-Occitan transition
- Interpretable deep learning can decompose complex grammatical changes into measurable contributions from specific linguistic features
- The framework's public codebase enables replication and application to other historical language evolution questions
- This research bridges computational methods and classical linguistics, demonstrating machine learning's utility for understanding centuries-old morphological restructuring