Compilation, transcription, markup and annotation of spoken corpora
  • ISSN 1384-6655
  • E-ISSN: 1569-9811


This paper describes the construction of deeply annotated spoken dialogue corpora. To maximize flexibility (in the degree of normalization, the types and formats of annotations, the possibilities for modifying and extending the corpus, and its use for research questions not originally anticipated), we propose a multi-layer standoff architecture. We also examine the interoperability of tools and formats compatible with such an architecture. Free access to the corpus data through corpus queries, visualizations, and downloads, including documentation, metadata, and the original recordings, enables transparency, verifiability, and reproducibility of every step of interpretation throughout corpus construction and of any research findings obtained from this data.
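
As a rough illustration of what a multi-layer standoff design can look like, the following minimal Python sketch keeps every annotation layer separate from the recording and links each annotation to it only through time offsets. It is an assumption-laden sketch, not the architecture or file format used in the paper: the class names, layer names, file name, and label values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """One annotation: a value anchored to a stretch of the recording."""
    start: float  # offset in seconds (hypothetical unit choice)
    end: float
    value: str


@dataclass
class StandoffCorpus:
    """Hypothetical container: the recording is never modified;
    each layer is an independent list of spans pointing into it."""
    recording: str
    layers: Dict[str, List[Span]] = field(default_factory=dict)

    def add(self, layer: str, start: float, end: float, value: str) -> None:
        # New layers can be introduced at any time without touching old ones.
        self.layers.setdefault(layer, []).append(Span(start, end, value))

    def overlapping(self, layer: str, start: float, end: float) -> List[Span]:
        # Return spans of one layer that share time with [start, end).
        return [s for s in self.layers.get(layer, [])
                if s.start < end and s.end > start]


corpus = StandoffCorpus("dialogue01.wav")  # hypothetical file name
corpus.add("transcription", 0.0, 0.4, "ähm")
corpus.add("transcription", 0.4, 0.9, "links")
corpus.add("normalization", 0.4, 0.9, "links")
corpus.add("pos", 0.4, 0.9, "ADV")
corpus.add("disfluency", 0.0, 0.4, "filled_pause")

# Cross-layer lookup: which transcribed material does the disfluency cover?
for d in corpus.layers["disfluency"]:
    print(d.value, [s.value for s in corpus.overlapping("transcription", d.start, d.end)])
```

Because layers refer to the recording only through offsets, a layer added later (for example, for a research question not anticipated at the outset) leaves the existing layers untouched, which is the kind of extensibility the abstract describes.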




  • Article Type: Research Article
Keyword(s): annotation; annotation tools; multi-layer architecture; spoken corpora; standoff