Volume 25, Issue 2
  • ISSN 1384-6655
  • E-ISSN: 1569-9811



Throughout the social sciences, there has been growing pressure to present effect sizes when publishing empirical data (see American Psychological Association, 2001; Parsons & Nelson, 2004). While it seems indisputable that, for the majority of quantitative research foci, effect size is an essential element of statistical analysis, this paper argues that, specifically for key word analysis in corpus linguistics, the means of reporting effect size must depend on the level of the unit of study of each investigation (single text, collection or large corpus). After exploring some of the main criticisms of the log-likelihood measure, this paper unpacks the parameters of different measures for keyness and how they might address underlying concerns. It maintains that, for the exploration of foregrounded/deviant/salient/marked features in text, the use of log-likelihood scores to rank the results is still fit for purpose and, coupled with Bayes Factors, is a solid approach for key word analyses.
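
As a rough illustration of the two quantities named in the abstract, the sketch below computes a log-likelihood keyness score and a BIC-style Bayes Factor approximation for one word. It assumes the two-term log-likelihood formulation commonly attributed to Rayson and Garside (2000) and the approximation BIC ≈ LL − ln(N) with one degree of freedom, as usually associated with Wilson (2013); whether the paper computes them in exactly this way is an assumption. The function names and the word counts are hypothetical.

```python
import math


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood (G2) keyness score for one word.

    a: frequency in the study corpus, b: frequency in the reference corpus,
    c: size of the study corpus, d: size of the reference corpus (in tokens).
    Assumed two-term formulation; zero counts contribute nothing.
    """
    if c <= 0 or d <= 0:
        raise ValueError("corpus sizes must be positive")
    total = a + b
    expected_study = c * total / (c + d)      # expected frequency, study corpus
    expected_reference = d * total / (c + d)  # expected frequency, reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_study)
    if b > 0:
        ll += b * math.log(b / expected_reference)
    return 2.0 * ll


def bayes_factor_bic(a: int, b: int, c: int, d: int) -> float:
    """BIC-style Bayes Factor approximation: LL - ln(N), assuming 1 degree of freedom."""
    return log_likelihood(a, b, c, d) - math.log(c + d)


if __name__ == "__main__":
    # Hypothetical counts: word -> (frequency in study text, frequency in reference corpus)
    counts = {"whale": (120, 15), "ship": (80, 60), "the": (5000, 51000)}
    study_size, reference_size = 100_000, 1_000_000

    # Rank candidate key words by log-likelihood, highest first.
    ranked = sorted(
        counts.items(),
        key=lambda item: log_likelihood(item[1][0], item[1][1], study_size, reference_size),
        reverse=True,
    )
    for word, (a, b) in ranked:
        ll = log_likelihood(a, b, study_size, reference_size)
        bic = bayes_factor_bic(a, b, study_size, reference_size)
        print(f"{word}\tLL={ll:.2f}\tBIC={bic:.2f}")
```

In this sketch the log-likelihood score supplies the ranking, while the BIC value is read separately as a measure of the strength of evidence, which mirrors the pairing the abstract recommends.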

Available under the CC BY-NC 4.0 license.




  1. Anthony, L.
    (2019) AntConc (Version 3.5.8) [Computer software]. Waseda University. https://www.laurenceanthony.net/software
  2. American Psychological Association
    (2001) Publication Manual of the American Psychological Association (5th ed.). American Psychological Association.
  3. Baker, P.
    (2004) Querying keywords: Questions of difference, frequency, and sense in keywords analysis. Journal of English Linguistics, 32(4), 346–359.
    https://doi.org/10.1177/0075424204269894
  4. Baker, P., Gabrielatos, C., Khosravinik, M., Krzyżanowski, M., McEnery, T., & Wodak, R.
    (2008) A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273–306.
    https://doi.org/10.1177/0957926508088962
  5. Bradley, J. V.
    (1960) Distribution-free Statistical Tests. Air Research and Development Command.
    https://doi.org/10.21236/AD0249268
  6. Brezina, V., McEnery, T., & Wattam, S.
    (2015) Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139–173.
    https://doi.org/10.1075/ijcl.20.2.01bre
  7. Cobb, T.
    (2000) The Compleat Lexical Tutor (Version 8.3) [Computer software]. Retrieved November, 2019, from www.lextutor.ca
  8. Croft, W. B., Metzler, D., & Strohman, T.
    (2010) Search Engines: Information Retrieval in Practice. Addison-Wesley.
  9. Dunning, T.
    (1993) Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
  10. Egbert, J., & Biber, D.
    (2019) Incorporating text dispersion into keyword analyses. Corpora, 14(1), 77–104.
    https://doi.org/10.3366/cor.2019.0162
  11. Gabrielatos, C.
    (2018) Keyness analysis: Nature, metrics and techniques. In C. Taylor & A. Marchi (Eds.), Corpus Approaches to Discourse: A Critical Review. Routledge.
    https://doi.org/10.4324/9781315179346-11
  12. Gabrielatos, C., & Marchi, A.
    (2012) Keyness: Appropriate metrics and practical issues [Paper presentation]. CADS International Conference 2012, University of Bologna, Italy. https://www.researchgate.net/publication/261708842_Keyness_Appropriate_metrics_and_practical_issues
  13. Gabrielatos, C., Torgersen, E. N., Hoffmann, S., & Fox, S.
    (2010) A corpus-based sociolinguistic study of indefinite article forms in London English. Journal of English Linguistics, 38(4), 297–334.
    https://doi.org/10.1177/0075424209352729
  14. Grissom, R. J., & Kim, J. J.
    (2012) Effect Sizes for Research: Univariate and Multivariate Applications. Routledge.
    https://doi.org/10.4324/9780203803233
  15. Hardie, A.
    (2012) CQPweb: Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409.
    https://doi.org/10.1075/ijcl.17.3.04har
  16. Hardie, A.
    (2014a) Log Ratio – an informal introduction. ESRC Centre for Corpus Approaches to Social Science (CASS). cass.lancs.ac.uk/?p=1133
  17. Hardie, A.
    (2014b) Statistical identification of keywords, lockwords and collocations as a two-step procedure [Paper presentation]. ICAME 35 Conference, University of Nottingham, Nottingham, UK.
  18. Hoey, M.
    (2005) Lexical Priming: A New Theory of Words and Language. Routledge.
  19. Jeaco, S.
    (2017) Concordancing lexical primings: The rationale and design of a user-friendly corpus tool for English language teaching and self-tutoring based on the Lexical Priming theory of language. In M. Pace-Sigge & K. J. Patterson (Eds.), Lexical Priming: Applications and Advances. John Benjamins.
    https://doi.org/10.1075/scl.79.11jea
  20. Johnston, J. E., Berry, K. J., & Mielke Jr, P. W.
    (2006) Measures of effect size for chi-squared and likelihood-ratio goodness-of-fit tests. Perceptual and Motor Skills, 103(2), 412–414.
    https://doi.org/10.2466/pms.103.2.412-414
  21. Kass, R. E., & Raftery, A. E.
    (1995) Bayes Factors. Journal of the American Statistical Association, 90(430), 773.
    https://doi.org/10.1080/01621459.1995.10476572
  22. Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D.
    (2004) The Sketch Engine [Paper presentation]. The 2003 International Conference on Natural Language Processing and Knowledge Engineering, Beijing, China.
  23. Lee, D. Y. W.
    (2001) Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology, 5(3), 37–72.
  24. Leech, G. N., Hundt, M., Mair, C., & Smith, N.
    (2009) Change in Contemporary English: A Grammatical Study. Cambridge University Press.
    https://doi.org/10.1017/CBO9780511642210
  25. Leech, G. N., & Short, M. H.
    (2007) Style in Fiction: A Linguistic Introduction to English Fictional Prose (2nd ed.). Pearson Longman. (Original work published 1981)
  26. Lexical Computing Ltd
    (2014) Statistics used in the Sketch Engine. https://www.sketchengine.eu/wp-content/uploads/ske-statistics.pdf
  27. Mahlberg, M.
    (2013) Corpus Stylistics and Dickens’s Fiction. Routledge.
    https://doi.org/10.4324/9780203076088
  28. Mahlberg, M., Stockwell, P., de Joode, J., Smith, C., & O’Donnell, M. B.
    (2016) CLiC Dickens: Novel uses of concordances for the integration of corpus stylistics and cognitive poetics. Corpora, 11(3), 433–463.
    https://doi.org/10.3366/cor.2016.0102
  29. Oakes, M. P.
    (1998) Statistics for Corpus Linguistics. Edinburgh University Press.
  30. Parsons, T. D., & Nelson, N. W.
    (2004) Paradigm shift in social science research: A significance testing and effect size estimation rapprochement? PsycCRITIQUES, 49(Suppl 3).
  31. Partington, A.
    (2010) Modern Diachronic Corpus-Assisted Discourse Studies (MD-CADS) on UK newspapers: An overview of the project. Corpora, 5(2), 83–108.
    https://doi.org/10.3366/cor.2010.0101
  32. Plonsky, L., & Oswald, F. L.
    (2014) How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912.
    https://doi.org/10.1111/lang.12079
  33. Raftery, A. E.
    (1986) A note on Bayes Factors for Log-Linear contingency table models with vague prior information. Journal of the Royal Statistical Society. Series B (Methodological), 48(2), 249–250.
    https://doi.org/10.1111/j.2517-6161.1986.tb01408.x
  34. Rayson, P.
    (n.d.). UCREL Log-likelihood and effect size calculator. Retrieved November, 2019, from ucrel.lancs.ac.uk/llwizard.html
  35. Rayson, P.
    (2008) From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4), 519–549.
    https://doi.org/10.1075/ijcl.13.4.06ray
  36. Rayson, P., Berridge, D., & Francis, B.
    (2004) Extending the Cochran rule for the comparison of word frequencies between corpora [Paper presentation]. The 7th International Conference on Statistical Analysis of Textual Data, Louvain-la-Neuve, Belgium. https://eprints.lancs.ac.uk/id/eprint/12424/1/rbf04_jadt.pdf
  37. Rayson, P., & Garside, R.
    (2000) Comparing corpora using frequency profiling [Paper presentation]. The Workshop on Comparing Corpora, Hong Kong University of Science and Technology, Hong Kong. https://eprints.lancs.ac.uk/id/eprint/11882/1/rg_acl2000.pdf
  38. Rayson, P., Leech, G., & Hodges, M.
    (1997) Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics, 2(1), 133–152.
    https://doi.org/10.1075/ijcl.2.1.07ray
  39. Read, T. R. C., & Cressie, N. A. C.
    (1988) Goodness-of-fit Statistics for Discrete Multivariate Data. Springer.
    https://doi.org/10.1007/978-1-4612-4578-0
  40. Scott, M.
    (1997) PC analysis of key words – and key key words. System, 25(2), 233–245.
    https://doi.org/10.1016/S0346-251X(97)00011-0
  41. Scott, M.
    (2016) WordSmith Tools (Version 7.0) [Computer software]. Stroud: Lexical Analysis Software.
  42. Scott, M.
    (2019a) WordSmith Tools online manual “KeyWords: Calculation”. Retrieved November, 2019, from https://lexically.net/downloads/version7/HTML/keywords_calculate_info.html
  43. Scott, M.
    (2019b) WordSmith Tools online manual “KeyWords”. Retrieved November, 2019, from https://lexically.net/downloads/version7/HTML/keywords2.html
  44. Scott, M.
    (2019c) WordSmith Tools online manual “KeyWords: Thinking about keyness”. Retrieved November, 2019, from https://lexically.net/downloads/version7/HTML/thinking_about_keyness.html
  45. Scott, M.
    (2019d) WordSmith Tools online manual “KeyWords: Keyness definition”. Retrieved November, 2019, from https://lexically.net/downloads/version7/HTML/keyness_definition.html
  46. Scott, M., & Tribble, C.
    (2006) Textual Patterns: Key Words and Corpus Analysis in Language Education. John Benjamins.
    https://doi.org/10.1075/scl.22
  47. Wilson, A.
    (2013) Embracing Bayes Factors for key item analysis in corpus linguistics. In M. Bieswanger & A. Koll-Stobbe (Eds.), New Approaches to the Study of Linguistic Variability (pp. 3–12). Peter Lang.
  48. Zipf, G. K.
    (1935) The Psycho-Biology of Language: An Introduction to Dynamic Philology. Houghton Mifflin.


  • Article Type: Research Article
Keyword(s): effect size; key word analysis; keyness; log-likelihood; ranking