Abstract
This paper presents a framework for building lemmatised Occitan corpora, focusing on early modern texts. Due to strong dialectal and diachronic variation, lemmatisation is essential for enabling cross-text and cross-period comparison. We adopt a semi-automatic approach based on the Pie neural model, combining tokenisation, super-lemma selection, and POS tagging aligned with Universal Dependencies. Initial experiments on 17th–18th century texts show promising results, particularly for frequent and grammatical words, while highlighting challenges with unknown lemmas. Despite its exploratory scope, the study demonstrates the feasibility of cost-effective corpus construction and lays the groundwork for a larger, more representative language model of Occitan.