Volume 6, Issue 2
ISSN: 2215-1478
E-ISSN: 2215-1486



In Learner Corpus Research (LCR), manual coding and annotation of linguistic features is a common source of error. Coefficients of inter-rater reliability are used to estimate the amount of error present in a coded dataset. However, despite the importance of reliability and internal consistency for validity and, by extension, for study quality, interpretability and generalizability, studies in the field of LCR rarely report such coefficients. In this Methods Report, we use a recent collaborative research project to illustrate why inter-rater reliability deserves consideration. In doing so, we hope to initiate methodological discussion on instrument design, piloting and evaluation. We also suggest some ways forward to encourage greater transparency in reporting practices.
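As a rough illustration of what such a coefficient measures, the sketch below computes Cohen's kappa for two coders who assign nominal categories to the same items. It is not taken from the report: the function name `cohen_kappa`, the category labels and the ten codes are all invented for this example, and the report itself may rely on other coefficients or software.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items with nominal labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items that received the same label from both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal proportions, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("Kappa is undefined when both raters use one identical category.")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for ten items (labels invented for illustration only).
coder_1 = ["A", "B", "B", "C", "B", "A", "C", "B", "C", "B"]
coder_2 = ["A", "B", "C", "C", "B", "B", "C", "B", "C", "B"]
print(round(cohen_kappa(coder_1, coder_2), 3))
```

In this made-up sample the coders agree on 8 of 10 items, but kappa is lower than 0.8 because part of that agreement would be expected by chance given how often each coder uses each label.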

Available under the CC BY-NC 4.0 license.
