Volume 25, Issue 4
  • ISSN 1384-6655
  • E-ISSN: 1569-9811

Abstract

This paper outlines the construction of the corpus Alpenwort, a large, genre-based corpus of German texts on alpinism. We report on issues related to building the corpus from the Zeitschrift des Deutschen und Österreichischen Alpenvereins (1869–2010). First, we give a general description of our data and the project phases from digitization and annotation to publication. We then focus on the most interesting challenges that the diverse layouts and the extensive use of Fraktur typeface posed for optical layout recognition and optical character recognition (OCR) as well as post-correction. The corrected data was lemmatized and annotated with part-of-speech information, including named entities, as well as TEI-conformant metadata. The resulting 19.9-million-word corpus is designed to be queried using Hyperbase and CQPweb and can be accessed freely online. Lastly, we give a short roadmap of current and future expansions and improvements, as the corpus data has been and is being enhanced in follow-up projects.

/content/journals/10.1075/ijcl.19094.pos
2020-10-27
2020-11-27
  • Article Type: Research Article
Keyword(s): alpinism, document structure recognition, German Fraktur typeface, OCR, specialized corpora