Volume 30, Issue 3
  • ISSN 1384-6655
  • E-ISSN: 1569-9811

Abstract

A wealth of linguistic data has been annotated by corpus linguists, and this extant annotated data can be used to automatically replicate and apply the linguist’s annotation scheme by means of machine learning models. This paper accompanies the release of documented code notebooks, which allow corpus linguists to use manually categorized examples or ‘training data’ as input for a predictive language model. By means of a case study of Early Modern English -ing forms, we describe how the predictive language model MacBERTh can be used to accurately replicate the manual data classification scheme employed in previous corpus linguistic studies. Additionally, we discuss how manual error analysis and post-correction may help improve the model’s output. By openly releasing the data and code used in this paper, we hope to stimulate the use of machine learning models such as MacBERTh in corpus linguistics.

Available under the CC BY 4.0 license.
2025-09-19
2026-06-08

References

  1. Brandsen, A., Verberne, S., Lambers, K., & Wansleeben, M.
    (2022) Can BERT dig it? Named entity recognition for information retrieval in the Archaeology domain. Journal on Computing and Cultural Heritage, 15(3), Article 51.
    https://doi.org/10.1145/3497842
  2. Davies, M.
    (2010) The Corpus of Historical American English (COHA). Available online at https://www.english-corpora.org/coha/
  3. De Smet, H., & Vancayzeele, E.
    (2015) Like a rolling stone: The changing use of English premodifying present participles. English Language and Linguistics, 19(1), 131–156.
    https://doi.org/10.1017/S136067431400029X
  4. De Smet, H., Flach, S., Tyrkkö, J., & Diller, H.-J.
    (2015) The Corpus of Late Modern English (CLMET) (version 3.1: Improved tokenization and linguistic annotation). KU Leuven, FU Berlin, U Tampere, RU Bochum. https://perswww.kuleuven.be/~u0044428/clmet3_0.htm
  5. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K.
    (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Association for Computational Linguistics. https://aclanthology.org/N19-1423.pdf
  6. Fanego, T.
    (2004) On reanalysis and actualization in syntactic change: The rise and development of English verbal gerunds. Diachronica, 21(1), 5–55.
    https://doi.org/10.1075/dia.21.1.03fan
  7. Fonteyn, L.
    (2019) Categoriality in language change: The case of the English gerund. Oxford University Press.
    https://doi.org/10.1093/oso/9780190917579.001.0001
  8. Fonteyn, L., & Hartmann, S.
    (2016) Usage-based perspectives on diachronic morphology: A mixed-methods approach towards English ing-nominals. Linguistics Vanguard, 2(1), 20160057.
    https://doi.org/10.1515/lingvan-2016-0057
  9. Fonteyn, L., & Petré, P.
    (2022) On the probability and direction of morphosyntactic lifespan change. Language Variation and Change, 34(1), 79–105.
    https://doi.org/10.1017/S0954394522000011
  10. Fonteyn, L., & Van de Pol, N.
    (2016) Divide and conquer: The formation and functional dynamics of the Modern English ing-clause network. English Language and Linguistics, 20(2), 185–219.
    https://doi.org/10.1017/S1360674315000258
  11. Hosseini, K., Beelen, K., Colavizza, G., & Coll Ardanuy, M.
    (2021) Neural language models for nineteenth-century English. Journal of Open Humanities Data, 7, 22.
    https://doi.org/10.5334/johd.48
  12. Hundt, M., Röthlisberger, M., Schneider, G., & Zehentner, E.
    (2019) (Semi-)automatic retrieval of data from historical corpora: Chances and challenges [Conference presentation]. 52nd Annual Meeting of the Societas Linguistica Europaea (SLE), Leipzig, Germany. https://www.prepcomp.uzh.ch/dam/jcr:0b6447fb-d466-4015-85f2-4311e372abfc/SLE_Workshop_Introduction.pdf
  13. James, G. M., Witten, D., Hastie, T., & Tibshirani, R.
    (2013) An introduction to statistical learning. Springer.
    https://doi.org/10.1007/978-1-4614-7138-7
  14. Jurafsky, D., & Martin, J. H.
    (2025) Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition with language models. Third edition. Online manuscript released January 12, 2025. https://web.stanford.edu/~jurafsky/slp3/
  15. Killie, K., & Swan, T.
    (2009) The grammaticalization and subjectification of adverbial -ing clauses (converb clauses) in English. English Language and Linguistics, 13(3), 337–363.
    https://doi.org/10.1017/S1360674309990141
  16. Kortmann, B.
    (1991) Free adjuncts and absolutes in English: Problems of control and interpretation. Routledge.
    https://doi.org/10.4324/9781315002880
  17. Kroch, A., Santorini, B., & Delfs, L.
    (2004) The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) (First edition, release 3). Department of Linguistics, University of Pennsylvania. www.ling.upenn.edu/ppche/ppche-release-2016/PPCEME-RELEASE-3
  18. Kroch, A., Santorini, B., & Diertani, A.
    (2016) The Penn Parsed Corpus of Modern British English (PPCMBE2) (Second edition, release 1). Department of Linguistics, University of Pennsylvania. www.ling.upenn.edu/ppche/ppche-release-2016/PPCMBE2-RELEASE-1
  19. Lass, R.
    (1992) Phonology and morphology. In N. Blake (Ed.), The Cambridge history of the English language, vol. II: 1066–1476 (pp. 23–155). Cambridge University Press.
    https://doi.org/10.1017/CHOL9780521264754.003
  20. Leech, G., Hundt, M., Mair, C., & Smith, N.
    (2009) Change in contemporary English: A grammatical study. Cambridge University Press.
    https://doi.org/10.1017/CBO9780511642210
  21. Manjavacas, E., & Fonteyn, L.
    (2021) MacBERTh: Development and evaluation of a historically pre-trained language model for English (1450–1950). Proceedings of the Workshop on Natural Language Processing for Digital Humanities (NLP4DH) (pp. 23–36). Association for Computational Linguistics. icon2021.nits.ac.in/resources/nlp4dh.pdf#page=35
  22. Manjavacas, E., & Fonteyn, L.
    (2022) Adapting vs. pre-training language models for historical languages. Journal of Data Mining & Digital Humanities, 9152.
    https://doi.org/10.46298/jdmdh.9152
  23. Manning, C. D., Raghavan, P., & Schütze, H.
    (2008) Introduction to information retrieval. Cambridge University Press.
    https://doi.org/10.1017/CBO9780511809071
  24. Manning, C. D.
    (2011) Part-of-Speech tagging from 97% to 100%: Is it time for some linguistics? In A. F. Gelbukh (Ed.), Computational linguistics and intelligent text processing. CICLing 2011. Lecture notes in computer science, vol. 6608 (pp. 171–189). Springer. nlp.stanford.edu/~manning/papers/CICLing2011-manning-tagging.pdf
    https://doi.org/10.1007/978-3-642-19400-9_14
  25. Petré, P., Anthonissen, L., Budts, S., Manjavacas, E., Silva, E.-L., Standing, W., & Strik, A. O.
    (2019) Early Modern Multiloquent Authors (EMMA): Designing a large-scale corpus of individuals’ languages. ICAME Journal, 43, 83–122.
    https://doi.org/10.2478/icame-2019-0004
  26. Rastas, I., Ryan, Y., Tiihonen, I., Qaraei, M., Repo, L., Babbar, R., Mäkelä, E., Tolonen, M., & Ginter, F.
    (2022) Explainable publication year prediction of eighteenth century texts with the BERT model. Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change (pp. 68–77). Association for Computational Linguistics. https://aclanthology.org/2022.lchange-1.7.pdf
    https://doi.org/10.18653/v1/2022.lchange-1.7
  27. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I.
    (2017) Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. Neural Information Processing Systems Foundation, Inc. https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  28. Zhang, J., Ryan, Y. C., Rastas, I., Ginter, F., Tolonen, M., & Babbar, R.
    (2022) Detecting sequential genre change in eighteenth-century texts. In F. Karsdorp, A. Lassche, & K. Nielbo (Eds.), Proceedings of the Computational Humanities Research Conference 2022. CEUR Workshop Proceedings 3290 (pp. 243–255). Computational Humanities Research Conference, Antwerp, Belgium. hdl.handle.net/10138/351519
  • Article Type: Research Article
Keyword(s): gerund; historical corpora; machine learning; morphosyntax; participle