Compilation, transcription, markup and annotation of spoken corpora
  • ISSN 1384-6655
  • E-ISSN: 1569-9811


This paper reports on issues encountered when using various ‘external points of reference’ in the development of POS-tagging guidelines for the Vienna-Oxford International Corpus of English (VOICE). VOICE is a corpus of spoken English as a Lingua Franca (ELF) containing naturally occurring, plurilingual data. As in all kinds of natural language use, speakers recorded in VOICE exploit the linguistic resources available to them, which often results in non-codified language use and in language that is difficult to classify unambiguously. However, detailed tagging solutions for such phenomena are rarely reported. We discuss the usefulness and limitations of external points of reference for POS-tagging VOICE and address both methodological and practical issues, especially the handling of non-codified language use and of different types of ambiguity. We suggest that the solutions found, and the theoretical approach adopted, could be relevant for the tagging of other spoken corpora.



