Volume 25, Issue 1
  • ISSN 0929-9971
  • E-ISSN: 1569-9994

Abstract

This paper describes TermEnsembler, a bilingual term extraction and alignment system that uses a novel ensemble learning approach to bilingual term alignment. Processing starts with monolingual term extraction from a language-industry standard file type containing aligned English and Slovenian texts. The two resulting term lists are then automatically aligned using an ensemble of seven bilingual alignment methods, which are first executed separately and then merged using weights learned with an evolutionary algorithm. In the experiments, the weights were learned on one domain and tested on two other domains. When evaluated on the top 400 aligned term pairs, the precision of term alignment is over 96%, while the proportion of correctly aligned multi-word unit terms exceeds 30%.
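The merging step described above can be illustrated with a minimal sketch. It is not the TermEnsembler implementation: the function names, the toy scores, and the two hand-picked weights are assumptions made for illustration only (the paper combines seven methods and learns the weights with an evolutionary algorithm). The sketch sums each method's score for a candidate term pair, scaled by that method's weight, ranks the pairs, and computes precision over the top k pairs against a hypothetical gold standard.

```python
from collections import defaultdict


def merge_alignments(method_scores, weights):
    """Combine per-method scores for candidate term pairs into one weighted score.

    method_scores: list of dicts mapping (source_term, target_term) -> score
    weights: one weight per method (hand-picked here; hypothetical values)
    """
    combined = defaultdict(float)
    for scores, weight in zip(method_scores, weights):
        for pair, score in scores.items():
            combined[pair] += weight * score
    # Rank candidate pairs from highest to lowest combined score.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


def precision_at_k(ranked_pairs, gold_pairs, k):
    """Share of the top-k ranked pairs that appear in the gold standard."""
    top = [pair for pair, _ in ranked_pairs[:k]]
    if not top:
        return 0.0
    return sum(pair in gold_pairs for pair in top) / len(top)


# Toy English-Slovenian candidates scored by two hypothetical methods.
method_a = {("term", "izraz"): 0.9, ("corpus", "izraz"): 0.2}
method_b = {("term", "izraz"): 0.6, ("corpus", "korpus"): 0.8}
weights = [0.7, 0.3]  # assumed values, not learned weights from the paper

ranked = merge_alignments([method_a, method_b], weights)
gold = {("term", "izraz"), ("corpus", "korpus")}

for pair, score in ranked:
    print(pair, round(score, 2))
print("precision@2:", precision_at_k(ranked, gold, 2))
```

In a setup like the one the abstract describes, the weights would be tuned on one domain (for example by maximising precision over the top k pairs) and then applied unchanged to other domains.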

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 license.
/content/journals/10.1075/term.00029.rep
2019-07-24
2024-10-12
