Volume 24, Issue 1
  • ISSN 1384-6655
  • E-ISSN: 1569-9811



Previous research has demonstrated that language use can vary depending on the context of situation. The present paper extends this finding by comparing word predictability differences between 14 speech registers ranging from highly informal conversations to read-aloud books. We trained 14 statistical language models to compute register-specific word predictability and trained a register classifier on the perplexity score vector of the language models. The classifier distinguishes perfectly between samples from all speech registers and this result generalizes to unseen materials. We show that differences in vocabulary and sentence length cannot explain the speech register classifier’s performance. The combined results show that speech registers differ in word predictability.
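The pipeline outlined in the abstract (one language model per register, a vector of perplexity scores per sample, and a classifier over those vectors) might look roughly like the minimal sketch below. The abstract does not specify the model order, the smoothing method, or the classifier, so everything concrete here is an assumption made for illustration: add-one smoothed bigram models, a nearest-neighbour classifier, two toy registers instead of 14, and all function names and toy English sentences.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count bigram contexts and bigrams, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams

def perplexity(model, sentences, vocab_size):
    """Perplexity of tokenized sentences under an add-one smoothed bigram model."""
    contexts, bigrams = model
    log_prob, n_predictions = 0.0, 0
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens[:-1], tokens[1:]):
            # Add-one smoothing is an assumption, chosen only for brevity.
            p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
            log_prob += math.log(p)
            n_predictions += 1
    return math.exp(-log_prob / n_predictions)

def perplexity_vector(models, sample, vocab_size):
    """One perplexity score per register model, in sorted register order."""
    return [perplexity(models[name], sample, vocab_size) for name in sorted(models)]

def classify(vector, labelled_vectors):
    """Return the label of the nearest labelled perplexity vector (1-NN)."""
    label, _ = min(labelled_vectors, key=lambda item: math.dist(vector, item[1]))
    return label

# Hypothetical toy corpora standing in for two speech registers.
corpora = {
    "conversation": [["yeah", "i", "know"],
                     ["yeah", "well", "i", "know"],
                     ["oh", "yeah", "i", "mean"]],
    "read_aloud": [["the", "old", "man", "walked", "home"],
                   ["the", "man", "walked", "to", "the", "old", "house"]],
}
vocabulary = {w for sents in corpora.values() for s in sents for w in s}
vocab_size = len(vocabulary) + 2  # plus "</s>" and one slot for unseen words
models = {name: train_bigram_model(sents) for name, sents in corpora.items()}

# Labelled samples used to "train" the nearest-neighbour classifier.
classifier_samples = [("conversation", [["oh", "i", "know"]]),
                      ("read_aloud", [["the", "old", "house"]])]
labelled_vectors = [(label, perplexity_vector(models, sample, vocab_size))
                    for label, sample in classifier_samples]

# Classify an unseen sample from its perplexity vector alone.
unseen_vector = perplexity_vector(models, [["yeah", "i", "mean"]], vocab_size)
predicted = classify(unseen_vector, labelled_vectors)
print(predicted)
```

The classifier never sees the words themselves, only how surprising the sample is to each register's model, which is the sense in which such a setup tests whether registers differ in word predictability.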




