Towards better language representation in Natural Language Processing

Abstract

This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC differs from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in resources used to train models for Natural Language Processing tasks that, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Beyond its multilingualism, the novelty of the MultiGEC dataset is that it consists of full texts (typically learner essays) rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after the task's competitive phase, serving as a resource for training new error correction systems and performing cross-lingual GEC studies.

Available under the CC BY 4.0 license.

/content/journals/10.1075/ijlcr.24033.mas
2025-04-01
2025-04-25
