Volume 28, Issue 2
  • ISSN 1384-6655
  • E-ISSN: 1569-9811



This paper explores variation in lexico-grammatical register features across text lengths in a large-scale sample of Reddit comments. Very short texts are known to be problematic for many statistical methods, so understanding their nature is important for the corpus-linguistic study of social media, where most contributions are short. I show that the frequencies of linguistic features change with comment length, even between longer comments, although longer texts are often considered similar in statistical terms. Moreover, I classify the variation found between short comments of different lengths into two main patterns, although other patterns can also be found, and there is variation even within these patterns. Furthermore, I interpret the observed differences in terms of register variation. For example, shorter comments appear to be more casual and less edited in terms of their feature makeup, whereas narrative and informational registers seem to favor longer comments.






  • Article Type: Research Article
Keyword(s): functional variation; Reddit; register analysis; social media; text length