Compilation, transcription, markup and annotation of spoken corpora
  • ISSN 1384-6655
  • E-ISSN: 1569-9811

Abstract

This paper describes the construction of deeply annotated spoken dialogue corpora. To maximize flexibility (in the degree of normalization, the types and formats of annotations, the possibilities for modifying and extending the corpus, and the use for research questions not originally anticipated), we propose a multi-layer standoff architecture. We also examine the interoperability of tools and formats compatible with such an architecture. Free access to the corpus data through corpus queries, visualizations, and downloads, including documentation, metadata, and the original recordings, enables transparency, verifiability, and reproducibility of every step of interpretation throughout corpus construction and of any research findings obtained from this data.
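
As a rough illustration of what a multi-layer standoff architecture can look like in practice, the following minimal Python sketch keeps each annotation layer separate from the recording and links annotations to it only through time offsets. It is not the authors' implementation: the class names, the layer names ("transcription", "normalization", "disfluency", "pos"), the tag values, and the file name are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """One standoff annotation: a value attached to a stretch of the timeline."""
    start: float  # offset into the recording, in seconds
    end: float
    value: str


@dataclass
class Corpus:
    """Hypothetical container: the recording is only referenced, never modified."""
    recording: str
    layers: Dict[str, List[Span]] = field(default_factory=dict)

    def add(self, layer: str, start: float, end: float, value: str) -> None:
        # Each layer is stored independently, so a new layer never touches old ones.
        self.layers.setdefault(layer, []).append(Span(start, end, value))

    def overlapping(self, layer: str, start: float, end: float) -> List[Span]:
        # Return the spans of one layer that overlap the interval [start, end).
        return [s for s in self.layers.get(layer, [])
                if s.start < end and s.end > start]


corpus = Corpus(recording="dialogue01.wav")  # assumed file name
corpus.add("transcription", 0.0, 0.4, "hm")
corpus.add("transcription", 0.4, 0.9, "links")
corpus.add("normalization", 0.4, 0.9, "links")
corpus.add("disfluency", 0.0, 0.4, "filled_pause")

# A layer added later for a question not originally anticipated.
corpus.add("pos", 0.4, 0.9, "ADV")

# Query all layers for annotations overlapping the first 0.4 seconds.
for name in corpus.layers:
    print(name, [s.value for s in corpus.overlapping(name, 0.0, 0.4)])
```

Because the layers share only the timeline, a layer can be added, corrected, or dropped without rewriting the transcription, which is the kind of extensibility the abstract describes.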

/content/journals/10.1075/ijcl.21.3.06sau
2016-09-19
2024-12-04
  • Article Type: Research Article
Keyword(s): annotation; annotation tools; multi-layer architecture; spoken corpora; standoff