Compilation, transcription, markup and annotation of spoken corpora
  • ISSN 1384-6655
  • E-ISSN: 1569-9811


An aspect of corpus compilation that poses a particular challenge is the question of how to orthographically transcribe units that are not part of any standardised vocabulary. Among the problematic categories we find voiced pauses, minimal response signals, interjections, certain discourse markers, phonologically reduced forms, colloquialisms and dialect forms. Such semi-lexical features are usually represented by regular phonemic-graphemic correspondences but are nevertheless often inconsistently handled. This paper reviews a number of existing transcription guidelines and assesses whether the recommendations they provide are sufficient and detailed enough to secure a consistent transcription of the categories mentioned. Further, the paper assesses to what extent transcription of semi-lexical features is consistent within and across two spoken corpora. On the basis of a cross-corpus comparison of the Bergen Corpus of London Teenage Language (COLT) and the London English Corpus (LEC), the paper provides specific recommendations for corpus transcription.



