Volume 6, Issue 2
  • ISSN: 2215-1478
  • E-ISSN: 2215-1486



This report outlines the development of a new corpus, created by refining and modifying the EFCAMDAT, the largest open-access L2 English learner database. The extensive data-curation process, which can inform the development and use of other corpora, included converting the database from XML to a tabular format and removing problematic markup tags and non-English texts. The final dataset contains two corresponding samples, written by similar learners in response to different prompts, which offers a unique opportunity for analyzing task effects and conducting replication studies. Overall, the resulting corpus contains ~406,000 texts in the first sample and ~317,000 texts in the second, written by learners representing diverse L1s and a wide range of L2 proficiency levels.
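
As a rough illustration of two of the curation steps named above (converting XML records to a tabular format and removing leftover markup tags), a minimal Python sketch follows. It is not the procedure used to build the corpus. The element and attribute names (`writing`, `learner`, `text`, `level`, `nationality`), the sample data, and the helper functions are all hypothetical and do not reflect the actual EFCAMDAT schema. Filtering of non-English texts is not covered here.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical input: element and attribute names are illustrative only
# and are NOT the actual EFCAMDAT schema.
SAMPLE_XML = """<writings>
  <writing id="1" level="3">
    <learner nationality="br"/>
    <text>I like &lt;br/&gt; my city.</text>
  </writing>
  <writing id="2" level="7">
    <learner nationality="cn"/>
    <text>Hello &lt;b&gt;world&lt;/b&gt;</text>
  </writing>
</writings>"""

# Matches anything that looks like a markup tag, e.g. <br/> or </b>.
TAG_PATTERN = re.compile(r"<[^>]+>")

FIELDS = ["writing_id", "level", "nationality", "text"]


def strip_markup(text):
    """Replace tag-like substrings with a space, then collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


def xml_to_rows(xml_string):
    """Flatten each <writing> element into one dictionary (one table row)."""
    root = ET.fromstring(xml_string)
    rows = []
    for writing in root.iter("writing"):
        learner = writing.find("learner")
        text_el = writing.find("text")
        raw_text = text_el.text if text_el is not None and text_el.text else ""
        rows.append({
            "writing_id": writing.get("id"),
            "level": writing.get("level"),
            "nationality": learner.get("nationality") if learner is not None else None,
            "text": strip_markup(raw_text),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(xml_to_rows(SAMPLE_XML)))
```

Keeping one row per text, with learner metadata in separate columns, is what makes later filtering and sampling by L1 or proficiency level straightforward in a tabular format.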






  • Article Type: Research Article
Keyword(s): corpus cleaning; data curation; EFCAMDAT; English as a second language