Volume 27, Issue 2
  • ISSN 0929-9971
  • E-ISSN: 1569-9994

Abstract

Automatic term extraction (ATE) is an important task within natural language processing, both on its own and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach, in which candidate terms are extracted based on part-of-speech patterns and then filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The Annotated Corpora for Term Extraction Research (ACTER) contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system, HAMLET (Hybrid Adaptable Machine Learning approach to Extract Terminology), provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult – such as the extraction of rare terms and multiword terms – this study shows how supervised machine learning is a promising methodology for ATE.
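
The traditional hybrid approach mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the HAMLET system or any published method: the toy tagged sentence, the part-of-speech pattern (adjectives and nouns ending in a noun), the function names, and the use of raw frequency as a stand-in for real termhood and unithood measures are all assumptions made for illustration.

```python
from collections import Counter

# Toy pre-tagged domain text as (token, POS) pairs.
# The data and the tag names are illustrative assumptions.
TAGGED = [
    ("the", "DET"), ("random", "ADJ"), ("forest", "NOUN"), ("classifier", "NOUN"),
    ("labels", "VERB"), ("each", "DET"), ("candidate", "NOUN"), ("term", "NOUN"),
    ("and", "CCONJ"), ("the", "DET"), ("random", "ADJ"), ("forest", "NOUN"),
    ("classifier", "NOUN"), ("ranks", "VERB"), ("the", "DET"), ("term", "NOUN"),
]


def extract_candidates(tagged, max_len=3):
    """Linguistic step: collect every n-gram (n <= max_len) that consists
    only of ADJ/NOUN tokens and ends in a NOUN."""
    candidates = []
    for start in range(len(tagged)):
        last_end = min(start + max_len, len(tagged))
        for end in range(start + 1, last_end + 1):
            span = tagged[start:end]
            only_adj_noun = all(pos in ("ADJ", "NOUN") for _, pos in span)
            if only_adj_noun and span[-1][1] == "NOUN":
                candidates.append(" ".join(token for token, _ in span))
    return candidates


def rank_by_frequency(candidates, min_freq=2):
    """Statistical step: drop candidates below min_freq and sort the rest by
    descending frequency (ties broken alphabetically). Raw frequency is only
    a placeholder for a real termhood or unithood measure."""
    counts = Counter(candidates)
    kept = [(cand, freq) for cand, freq in counts.items() if freq >= min_freq]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for candidate, freq in rank_by_frequency(extract_candidates(TAGGED)):
        print(freq, candidate)
```

A supervised approach such as the one the abstract describes would, roughly speaking, replace the fixed frequency cut-off with a trained classifier that decides for each candidate whether it is a term, using many features rather than a single score.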

Available under the CC BY-NC 4.0 license.

/content/journals/10.1075/term.20017.rig
2021-08-20
2021-12-03

  • Article Type: Research Article
  • Keyword(s): automatic term extraction; comparable corpora; named entities; terminology