Volume 36, Issue 1
  • ISSN 0929-7332
  • E-ISSN: 1569-9919



Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve “free” translations. In this paper we explore four possible filters: the Damerau-Levenshtein distance between POS-tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable across language pairs, while the combination filter achieves the best results.
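To make two of these filters concrete, the following minimal sketch computes a sentence-length ratio and a Damerau-Levenshtein distance over POS-tag sequences. It is an illustration under stated assumptions, not the authors' implementation: the function names, the restricted (optimal string alignment) variant of the distance, the normalization by the longer sequence, and the thresholds `min_ratio` and `max_distance` are all hypothetical choices. The simple threshold check at the end stands in for, and is not, the logistic regression combination described in the abstract.

```python
from typing import Sequence


def length_ratio(src_tokens: Sequence[str], tgt_tokens: Sequence[str]) -> float:
    """Length of the shorter sentence divided by the longer one (0.0 if either is empty)."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    shorter, longer = sorted((len(src_tokens), len(tgt_tokens)))
    return shorter / longer


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent items; here the items are whole POS tags, not characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent tags.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_pos_distance(src_pos: Sequence[str], tgt_pos: Sequence[str]) -> float:
    """Distance divided by the longer sequence length (an assumed normalization)."""
    longest = max(len(src_pos), len(tgt_pos))
    return osa_distance(src_pos, tgt_pos) / longest if longest else 0.0


def is_comparable(
    src_pos: Sequence[str],
    tgt_pos: Sequence[str],
    min_ratio: float = 0.7,     # hypothetical threshold
    max_distance: float = 0.5,  # hypothetical threshold
) -> bool:
    """Keep a sentence pair only if both simple filters accept it."""
    return (
        length_ratio(src_pos, tgt_pos) >= min_ratio
        and normalized_pos_distance(src_pos, tgt_pos) <= max_distance
    )


# Two tag sequences that differ by one swap of adjacent tags.
source_tags = ["PRON", "VERB", "DET", "NOUN"]
target_tags = ["PRON", "DET", "VERB", "NOUN"]
print(osa_distance(source_tags, target_tags), is_comparable(source_tags, target_tags))
```

In practice the thresholds would have to be tuned per language pair on annotated data. The graph-edit distance between dependency parses is left out of this sketch because it needs a graph library and parsed input.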







  • Article Type: Research Article
  • Keywords: dependency parses; filter; parallel corpus; syntactic comparability