Compilation, transcription, markup and annotation of spoken corpora
  • ISSN 1384-6655
  • E-ISSN: 1569-9811


This paper discusses key issues in the compilation of spoken language corpora in a computer-mediated communication (CMC) environment, using data from the Corpus of Academic Spoken English (CASE), a corpus of conversations currently being compiled at Saarland University, Germany, in cooperation with European and US partners. Based on first findings, Skype is presented as a suitable tool for collecting informal spoken data. In addition, new recommendations concerning data compilation and transcription are put forward to supplement existing best practice as presented in Wynne (2005). We recommend preserving multimodal features during anonymisation and adding annotation elements as early as the transcription stage, particularly CMC-related discourse features and English as a Lingua Franca (ELF) features (e.g. non-standard language and code-switching), together with prosodic, paralinguistic, and non-verbal annotation. Additionally, we propose a layered corpus design that allows researchers to focus on specific annotation features.



  1. Adolphs, S., & Carter, R.
    (2013) Spoken Corpus Linguistics. From Monomodal to Multimodal. London: Routledge.
  2. Biber, D.
    (1988) Variation across Speech and Writing. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511621024
  3. Brunner, M.-L.
    (2015) Negotiating Conversation Starts in the Corpus of Academic Spoken English (Unpublished MA thesis). Universität des Saarlandes, Saarbrücken, Germany.
  4. ECAMM – Call Recorder for Mac
    (2013) [Computer software]. Retrieved from www.ecamm.com/mac/callrecorder/ (last accessed March 2016).
  5. CASE – Corpus of Academic Spoken English
    (Forthcoming) S. Diemer, M.-L. Brunner, C. Collet & S. Schmidt. Saarbrücken: Saarland University (Coordination) / Sofia: St Kliment Ohridski University / Forlì: University of Bologna-Forlì / Santiago: University of Santiago de Compostela / Helsinki: Helsinki University & Hanken School of Economics / Birmingham: Birmingham City University / Växjö: Linnaeus University / Louvain-la-Neuve: Université catholique de Louvain / Lyon: Université Lumière Lyon 2 / Boise: Boise State University. Retrieved from www.uni-saarland.de/campus/fakultaeten/fachrichtungen/philosophische-fakultaet-ii/fachrichtungen/fr43/staff/adjunct-faculty/engling2/case.html (last accessed March 2016).
  6. Chafe, W.
    (2007) The Importance of not Being Earnest: The Feeling behind Laughter and Humor. Amsterdam: John Benjamins. doi: 10.1075/ceb.3
  7. CLAWS Part-of-Speech Tagger for English
    (1994-2016) [Computer software]. Retrieved from www.comp.lancs.ac.uk/computing/research/ucrel/claws/ (last accessed March 2016).
  8. Conrad, S., & Mauranen, A.
    (2003) The corpus of English as lingua franca in academic settings. TESOL Quarterly, 37(3), 513–527. doi: 10.1002/j.1545-7249.2003.tb02095.x
  9. Dressler, R.A., & Kreuz, R.J.
    (2000) Transcribing oral discourse: A survey and a model system. Discourse Processes, 29(1), 25–36. doi: 10.1207/S15326950dp2901_2
  10. Edwards, J.A.
    (1993) Principles and contrasting systems of discourse transcription. In J.A. Edwards & M.D. Lampert (Eds.), Talking Data: Transcription and Coding in Discourse Research (pp. 3–32). Hillsdale: Lawrence Erlbaum Associates.
  11. ELFA – The Corpus of English as a Lingua Franca in Academic Settings
    (2008) A. Mauranen (Director). Retrieved from www.helsinki.fi/elfa/elfacorpus (last accessed February 2015).
  12. Firth, A.
    (1996) The discursive accomplishment of normality: On ‘lingua franca’ English and conversation analysis. Journal of Pragmatics, 26(2), 237–259. doi: 10.1016/0378-2166(96)00014-8
  13. Gee, M.
    (2014) CASE XML Conversion Tool [Computer software]. Retrieved from rdues.bcu.ac.uk/case (last accessed November 2015).
  14. Geluykens, R.
    (1993) Topic introduction in English conversation. Transactions of the Philological Society, 91(2), 181–214. doi: 10.1111/j.1467-968X.1993.tb01068.x
  15. Gibbon, D., Moore, R., & Winski, R.
    (1998) Handbook of Standards and Resources for Spoken Language Systems 1: Spoken Language Systems and Corpus Design. Berlin, Germany: Mouton de Gruyter. doi: 10.1515/9783110809817
  16. Glenn, P.
    (2003) Laughter in Interaction. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511519888
  17. Howarth, P.A.
    (1996) Phraseology in English Academic Writing: Some Implications for Language Learning and Dictionary Making. Tübingen: Niemeyer. doi: 10.1515/9783110937923
  18. ICE Corpus annotation guidelines
    (2009) Retrieved from ice-corpora.net/ice/annotate.htm (last accessed March 2016).
  19. IFA Dialog Video Corpus
    (2008) Retrieved from www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/ (last accessed March 2016).
  20. Jefferson, G., Sacks, H., & Schegloff, E.A.
    (1987) Notes on laughter in the pursuit of intimacy. In G. Button & J.R.E. Lee (Eds.), Talk and Social Organisation (pp. 152–205). Clevedon: Multilingual Matters.
  21. Jenkins, J., Modiano, M., & Seidlhofer, B.
    (2001) Euro-English. English Today, 17(4), 13–19. doi: 10.1017/S0266078401004023
  22. Leech, G., Myers, G., & Thomas, J.
    (Eds.) (1995) Spoken English on Computer. Harlow: Longman.
  23. Leech, G.
    (2005) Adding linguistic annotation. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 17–29). Oxford: Oxbow Books.
  24. Mair, C.
    (Ed.) (2003) The Politics of English as a World Language. Amsterdam: Rodopi.
  25. Meierkord, C.
    (1996) Englisch als Medium der interkulturellen Kommunikation. Untersuchungen zum non-native-/non-native Speaker-Diskurs. Frankfurt am Main: Peter Lang.
  26. Nelson, G.
    (2002) ICE mark-up manual for spoken texts. Retrieved from ice-corpora.net/ice/spoken.doc (last accessed 31 March 2016).
  27. Sauer, S., & Lüdeling, A.
    (2016) Flexible Multi-Layer Spoken Dialogue Corpora. International Journal of Corpus Linguistics (this volume).
  28. Schegloff, E.A.
    (1968) Sequencing in conversational openings. American Anthropologist, 70(6), 1075–1095. doi: 10.1525/aa.1968.70.6.02a00030
  29. Schmidt, S.
    (2015) Laughter in computer-mediated communication: A means of creating rapport in first-contact situations (Unpublished MA dissertation). Universität des Saarlandes, Saarbrücken, Germany.
  30. Schmidt, S., Brunner, M.-L., & Diemer, S.
    (2014) CASE: Corpus of Academic Spoken English: Transcription Conventions. Retrieved from www.uni-saarland.de/index.php?id=48506 (last accessed March 2016).
  31. Sinclair, J.
    (1995) From theory to practice. In G. Leech, G. Myers & J. Thomas (Eds.), Spoken English on Computer (pp. 99–112). Harlow: Longman.
  32. Spencer-Oatey, H.
    (2002) Managing rapport in talk: Using rapport sensitive incidents to explore the motivational concerns underlying the management of relations. Journal of Pragmatics, 34(5), 529–545. doi: 10.1016/S0378-2166(01)00039-X
  33. Supertintin – Skype Video Call Recorder
    (2013) [Computer software]. Retrieved from www.supertintin.com/index.html (last accessed March 2016).
  34. Tannen, D.
    (1989) Talking Voices: Repetition, Dialog, and Imagery in Conversational Discourse. Cambridge: Cambridge University Press.
  35. Thompson, P.
    (2005) Spoken language corpora. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 59–70). Oxford: Oxbow Books.
  36. VOICE – The Vienna-Oxford International Corpus of English
    (Version 2.0 XML) (2013) B. Seidlhofer (Director). Vienna: University of Vienna. Retrieved from https://www.univie.ac.at/voice/ (last accessed March 2016).
  37. Wynne, M.
    (Ed.) (2005) Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books. Retrieved from users.ox.ac.uk/~martinw/dlc/index.htm (last accessed March 2016).