Volume 22, Issue 1
  • ISSN 1384-6655
  • E-ISSN: 1569-9811

Abstract

Syntactically annotated corpora are essential for large-scale diachronic and diatopic language research. Such corpora have recently been developed, or are still being developed, for a variety of historical languages. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which aims to facilitate research into the under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalised spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them.

/content/journals/10.1075/ijcl.22.1.05kol
2017-07-21
2019-10-16