Flexible multi-layer spoken dialogue corpora

Simon Sauer; Anke Lüdeling

doi:10.1075/ijcl.21.3.06sau

Compilation, transcription, markup and annotation of spoken corpora

ISSN 1384-6655
E-ISSN: 1569-9811

Author(s): Simon Sauer ¹ and Anke Lüdeling

Affiliations:
¹ Humboldt-Universität zu Berlin
Source: International Journal of Corpus Linguistics, Volume 21, Issue 3, Jan 2016, pp. 419–438
DOI: https://doi.org/10.1075/ijcl.21.3.06sau
Version of Record published: 19 Sept 2016

Abstract

This paper describes the construction of deeply annotated spoken dialogue corpora. To ensure a maximum of flexibility — in the degree of normalization, the types and formats of annotations, the possibilities for modifying and extending the corpus, or the use for research questions not originally anticipated — we propose a flexible multi-layer standoff architecture. We also take a closer look at the interoperability of tools and formats compatible with such an architecture. Free access to the corpus data through corpus queries, visualizations, and downloads — including documentation, metadata, and the original recordings — enables transparency, verifiability, and reproducibility of every step of interpretation throughout corpus construction and of any research findings obtained from this data.
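The standoff idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation or any tool's file format: the class names, layer names, example utterance, and timings below are all hypothetical. It only shows the general principle that each annotation layer points at the base transcription by index instead of being embedded in it, so layers can be added or replaced independently.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Token:
    """One unit of the base transcription, anchored to the recording."""
    text: str
    start: float  # seconds into the recording (made-up values below)
    end: float


@dataclass(frozen=True)
class Span:
    """A standoff annotation: refers to tokens by index, stores no text."""
    first: int  # index of the first covered token
    last: int   # index of the last covered token (inclusive)
    value: str


@dataclass
class Corpus:
    tokens: list[Token]
    layers: dict[str, list[Span]] = field(default_factory=dict)

    def add_layer(self, name: str, spans: list[Span]) -> None:
        """Attach a layer without touching the tokens or other layers."""
        for s in spans:
            if not (0 <= s.first <= s.last < len(self.tokens)):
                raise ValueError(f"span {s} is outside the token range")
        self.layers[name] = spans

    def text_of(self, span: Span) -> str:
        """Resolve a span back to the transcribed text it covers."""
        return " ".join(t.text for t in self.tokens[span.first:span.last + 1])

    def query(self, layer: str, value: str) -> list[str]:
        """Return the text of every span in `layer` annotated with `value`."""
        return [self.text_of(s) for s in self.layers.get(layer, []) if s.value == value]


# Hypothetical utterance; the layer names are illustrative assumptions.
corpus = Corpus(tokens=[
    Token("äh", 0.00, 0.31),
    Token("geh", 0.31, 0.55),
    Token("nach", 0.55, 0.80),
    Token("links", 0.80, 1.20),
])
corpus.add_layer("norm", [Span(1, 1, "gehe")])
corpus.add_layer("disfluency", [Span(0, 0, "filled_pause")])
corpus.add_layer("dialogue_act", [Span(1, 3, "instruct")])
```

In this sketch, a later research question needs only another `add_layer` call; the transcription and the existing layers stay unchanged, which is the kind of flexibility the abstract describes.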

Article Type: Research Article

Keyword(s): annotation; annotation tools; multi-layer architecture; spoken corpora; standoff
