Framework to build and lemmatise an Occitan historical corpus

Abstract

This paper presents a framework for building lemmatised Occitan corpora, focusing on early modern texts. Because Occitan shows strong dialectal and diachronic variation, lemmatisation is essential for comparing texts across sources and periods. We adopt a semi-automatic approach based on a neural model, combining tokenisation, super-lemma selection, and POS tagging aligned with Universal Dependencies. Initial experiments on 17th–18th century texts show promising results, particularly for frequent and grammatical words, while highlighting challenges with unknown lemmas. Despite its exploratory scope, the study demonstrates the feasibility of cost-effective corpus construction and lays the groundwork for a larger, more representative language model of Occitan.
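
The three steps named in the abstract (tokenisation, super-lemma selection, POS tagging) can be pictured with a minimal sketch. It is not the authors' implementation: a plain dictionary lookup stands in for the neural lemmatiser, and the names `SUPER_LEMMAS`, `tokenise`, `annotate`, and `needs_review`, as well as the lexicon entries themselves, are hypothetical and chosen only for illustration. Forms missing from the lookup are left unannotated so that they can be checked by hand, mirroring the semi-automatic workflow and the difficulty with unknown lemmas.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical lexicon: each attested spelling maps to one "super-lemma"
# and a Universal Dependencies UPOS tag. Entries are illustrative only
# and are not taken from the paper's data.
SUPER_LEMMAS = {
    "aiga": ("aiga", "NOUN"),
    "aigo": ("aiga", "NOUN"),
    "aygo": ("aiga", "NOUN"),
    "lo": ("lo", "DET"),
    "lou": ("lo", "DET"),
}


@dataclass
class Token:
    form: str
    lemma: Optional[str]  # None means "unknown, needs manual review"
    upos: Optional[str]


def tokenise(text: str) -> List[str]:
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate(text: str) -> List[Token]:
    """Attach a super-lemma and UPOS tag to each token when known."""
    tokens = []
    for form in tokenise(text):
        if not any(ch.isalnum() for ch in form):
            # Punctuation is its own lemma and gets the UD tag PUNCT.
            tokens.append(Token(form, form, "PUNCT"))
            continue
        lemma, upos = SUPER_LEMMAS.get(form.lower(), (None, None))
        tokens.append(Token(form, lemma, upos))
    return tokens


def needs_review(tokens: List[Token]) -> List[str]:
    """List the forms that the lookup could not lemmatise."""
    return [t.form for t in tokens if t.lemma is None]
```

In this sketch, the spellings `aigo` and `aygo` both resolve to the single lemma `aiga`, which is what makes counts comparable across texts; anything returned by `needs_review` would go to a human annotator.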

/content/journals/10.1075/rro.25009.cou
2025-12-15
2026-01-13
Article Type: Research Article
Keywords: POS tagging; corpus linguistics; Occitan; historical linguistics; lemmatisation