Volume 26, Issue 1
  • ISSN 1384-6655
  • E-ISSN: 1569-9811



This paper discusses the creation and use of the TV Corpus (subtitles from 75,000 episodes, 325 million words, 6 English-speaking countries, 1950s–2010s) and the Movies Corpus (subtitles from 25,000 movies, 200 million words, 6 English-speaking countries, 1930s–2010s), both of which are available at English-Corpora.org. The corpora compare well to the BNC-Conversation data in terms of informality, lexis, phraseology, and syntax. At 525 million words combined, however, they are more than 30 times as large as BNC-Conversation (BNC1994 and BNC2014 together), which means that they can be used to examine a wide range of linguistic phenomena. The TV and Movies corpora also allow useful comparisons of very informal language across time (with texts from the 1930s onwards for movies and from the 1950s onwards for TV shows) and between dialects of English (such as British and American English).






  • Article Type: Research Article
Keyword(s): diachronic; dialects; movies; speech; TV