Compilation, transcription, markup and annotation of spoken corpora
  • ISSN 1384-6655
  • E-ISSN: 1569-9811

Abstract

This paper presents practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German, a large collection of spontaneous verbal interaction from diverse discourse domains. After introducing the aims and organisational circumstances of the construction of FOLK, it discusses the general idea that good practices cannot be developed without considering methodological, technological and organisational aspects on an equal footing. Starting from this idea, the paper examines more closely some actual practices in FOLK, namely the handling of legal issues (especially privacy protection), the decisions taken for the transcription and annotation workflow, and the question of how best to disseminate a corpus like FOLK. The final section sketches some possible future improvements for practices in FOLK.

/content/journals/10.1075/ijcl.21.3.05sch
2016-09-19
2018-09-22