Volume 22, Issue 3
  • ISSN 1384-6655
  • E-ISSN: 1569-9811



This paper introduces the Spoken British National Corpus 2014, an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK, recorded in the years 2012–2016. After showing that a survey of the recent history of corpora of spoken British English justifies the compilation of this new corpus, we describe the main stages of the Spoken BNC2014’s creation: design, data and metadata collection, transcription, XML encoding, and annotation. In doing so we aim to (i) encourage users of the corpus to approach the data with sensitivity to the many methodological issues we identified and attempted to overcome while compiling the Spoken BNC2014, and (ii) inform (future) compilers of spoken corpora of the innovations we implemented in an attempt to make the construction of corpora representing spontaneous speech in informal contexts more tractable, both logistically and practically, than in the past.

This work is licensed under a Creative Commons Attribution 4.0 license.






  • Article Type: Research Article
Keyword(s): corpus construction; Spoken BNC2014; spoken corpora; transcription