Volume 8, Issue 1
  • ISSN: 2215-1478
  • E-ISSN: 2215-1486



This report introduces the University of Pittsburgh English Language Institute Corpus (PELIC; Juffs et al., 2020), a publicly available 4.2-million-word learner corpus of written texts. Collected over seven years in the University of Pittsburgh’s Intensive English Program, these texts were produced by more than 1,100 students with diverse linguistic backgrounds and proficiency levels. Unlike most learner corpora, which are cross-sectional, PELIC is longitudinal, offering greater opportunities for tracking development in a natural classroom setting. This potential is illustrated in an overview of the research conducted to date with these data. The report also describes PELIC’s creation and contents, including how the texts have been prepared to facilitate natural language processing. Overall, the corpus contributes to learner corpus research by adding to the pool of freely and publicly available learner corpora, and it is supplemented by a set of Python tools and tutorials for accessing these data.




  1. Alexopoulou, T., Geertzen, J., Korhonen, A., & Meurers, D.
    (2015) Exploring big educational learner corpora for SLA research: Perspectives on relative clauses. International Journal of Learner Corpus Research, 1(1), 96–129. https://doi.org/10.1075/ijlcr.1.1.04ale
  2. Atkinson, K.
    (2019) Spell Checking Oriented Word Lists (SCOWL) (Version 2019). wordlist.aspell.net/
  3. Biber, D., Reppen, R., Staples, S., & Egbert, J.
    (2020) Exploring the longitudinal development of grammatical complexity in the disciplinary writing of L2-English university students. International Journal of Learner Corpus Research, 6(1), 38–71. https://doi.org/10.1075/ijlcr.18007.bib
  4. Bird, S., Loper, E., & Klein, E.
    (2009) Natural language processing with Python. O’Reilly Media.
  5. Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., & Chodorow, M.
    (2014) ETS Corpus of Non-Native Written English LDC2014T06. Linguistic Data Consortium.
  6. Callies, M.
    (2015) Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 35–56). Cambridge University Press. https://doi.org/10.1017/CBO9781139649414.003
  7. Centre for English Corpus Linguistics
    (2021a) Longitudinal Database of Learner English (LONGDALE). Université catholique de Louvain. https://uclouvain.be/en/research-institutes/ilc/cecl/longdale.html
  8. Centre for English Corpus Linguistics
    (2021b) Learner corpora around the world. Université catholique de Louvain. https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html
  9. Davies, M.
    (2008–) The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. https://www.english-corpora.org/coca/
  10. Dunlap, S.
    (2012) Orthographic quality in English as a second language (Unpublished doctoral dissertation). University of Pittsburgh.
  11. Etaiwi, W., & Naymat, G.
    (2017) The impact of applying different preprocessing steps on review spam detection. Procedia Computer Science, 113, 273–279. https://doi.org/10.1016/j.procs.2017.08.368
  12. Gablasova, D., Brezina, V., & McEnery, T.
    (2017) Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning, 67(1), 130–154. https://doi.org/10.1111/lang.12226
  13. Garbe, W.
    (2020) SymSpell (Version 6.7). https://github.com/wolfgarbe/symspell
  14. Gilquin, G.
    (2015) From design to collection of learner corpora. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 9–34). Cambridge University Press. https://doi.org/10.1017/CBO9781139649414.002
  15. Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M.
    (2020) The International Corpus of Learner English. Version 3. Presses universitaires de Louvain. https://dial.uclouvain.be/pr/boreal/object/boreal:229877
  16. Honnibal, M.
    (2013) A good part-of-speech tagger in about 200 lines of Python. Explosion. https://explosion.ai/blog/part-of-speech-pos-tagger-in-python
  17. Juffs, A.
    (2020) Aspects of language development in an intensive English program. Routledge. https://doi.org/10.4324/9781315170190
  18. Juffs, A., & Han, N.-R.
    (2019, March 12) Combining formal and usage-based theories with data science techniques in measuring the development of syntactic complexity in written production. Paper presented at the International Conference of the American Association of Applied Linguistics, Atlanta, GA.
  19. Juffs, A., Han, N.-R., & Naismith, B.
    (2020) The University of Pittsburgh English Language Corpus (PELIC) [Data set]. https://doi.org/10.5281/zenodo.3991977
  20. Leńko-Szymańska, A.
    (2019) Defining and assessing lexical proficiency. Routledge. https://doi.org/10.4324/9780429321993
  21. Marcus, M. P., Santorini, B., Marcinkiewicz, M. A., & Taylor, A.
    (1999) Treebank-3 LDC99T42 [Web Download]. Linguistic Data Consortium. https://catalog.ldc.upenn.edu/LDC99T42
  22. Meunier, F.
    (2016) Introduction to the LONGDALE Project. In E. Castello, K. Ackerley, & F. Coccetta (Eds.), Studies in learner corpus linguistics. Research and applications for foreign language teaching and assessment (pp. 123–126). Peter Lang.
  23. Naismith, B., Han, N.-R., Juffs, A., Hill, B. L., & Zheng, D.
    (2018) Accurate measurement of lexical sophistication with reference to ESL learner data. In K. E. Boyer & M. Yudelson (Eds.), Proceedings of the 11th International Conference on Educational Data Mining (pp. 259–265).
  24. Naismith, B., & Juffs, A.
    (2021) Finding the sweet spot: Learners’ productive knowledge of mid-frequency lexical items. Language Teaching Research. https://doi.org/10.1177/13621688211020412
  25. Nation, I. S. P.
    (2013) Learning vocabulary in another language (2nd ed.). Cambridge University Press. https://doi.org/10.1017/CBO9781139858656
  26. Picoral, A., Staples, S., & Reppen, R.
    (2021) Automated annotation of learner English. International Journal of Learner Corpus Research, 7(1), 17–52. https://doi.org/10.1075/ijlcr.20003.pic
  27. Rankin, T., & Schiftner, B.
    (2011) Marginal prepositions in learner English: Applying local corpus data. International Journal of Corpus Linguistics, 16(3), 412–34. https://doi.org/10.1075/ijcl.16.3.07ran
  28. Someya, Y.
  29. Tidball, F., & Treffers-Daller, J.
    (2008) Analysing lexical richness in French learner language: what frequency lists and teacher judgements can tell us about basic and advanced words. Journal of French Language Studies, 18(3), 299–313. https://doi.org/10.1017/S0959269508003463
  30. van Rooy, B., & Schäfer, L.
    (2009) The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20(4), 325–335. https://doi.org/10.2989/16073610209486319
  31. Vercellotti, M. L.
    (2017) The development of complexity, accuracy and fluency in second language performance. Applied Linguistics, 38, 90–111. https://doi.org/10.1093/applin/amv002
  32. Vercellotti, M. L., Juffs, A., & Naismith, B.
    (2021) Multiword sequences in L2 English language learners’ speech: The relationship between trigrams and lexical variety across development. System, 98. https://doi.org/10.1016/j.system.2021.102494


  • Article Type: Research Article
Keyword(s): ESL; IEP; longitudinal development; multi-L1 corpus; PELIC