Volume 28, Issue 1
  • ISSN 0929-9971
  • E-ISSN: 1569-9994

Abstract

As with many tasks in natural language processing, automatic term extraction (ATE) is increasingly approached as a machine learning problem. So far, most machine learning approaches to ATE broadly follow the traditional hybrid methodology: they first extract a list of unique candidate terms and then classify these candidates based on the predicted probability that they are valid terms. However, with the rise of neural networks and word embeddings, the next development in ATE might be towards sequential approaches, i.e., classifying each occurrence of each token within its original context. To test the validity of such approaches for ATE, two sequential methodologies were developed, evaluated, and compared: one feature-based conditional random fields classifier and one embedding-based recurrent neural network. An additional comparison was added with a machine learning interpretation of the traditional approach. All systems were trained and evaluated on identical data in multiple languages and domains to identify their respective strengths and weaknesses. The sequential methodologies proved to be valid approaches to ATE, and the neural network even outperformed the more traditional approach. Interestingly, a combination of multiple approaches can outperform all of them separately, showing new ways to push the state of the art in ATE.
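
The difference between the two framings can be illustrated with a minimal Python sketch. It only shows how the same text might be represented under each view; it is not the classifiers evaluated in the article. The example sentence, the term list, the IOB label scheme, and the helper names `iob_labels` and `unique_candidates` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Term = Tuple[str, ...]


def iob_labels(tokens: List[str], terms: Set[Term]) -> List[str]:
    """Sequential view: give every token occurrence a label in context.

    B = first token of a term, I = inside a term, O = outside any term.
    The IOB scheme is an assumption of this sketch.
    """
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(t) for t in terms), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Prefer the longest known term that starts at position i.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in terms:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


def unique_candidates(tokens: List[str], labels: List[str]) -> Set[Term]:
    """Candidate-list view: collapse labelled occurrences into unique terms."""
    found: Set[Term] = set()
    current: List[str] = []
    for tok, lab in zip(tokens, labels):
        if lab == "B":
            if current:
                found.add(tuple(current))
            current = [tok.lower()]
        elif lab == "I" and current:
            current.append(tok.lower())
        else:
            if current:
                found.add(tuple(current))
            current = []
    if current:
        found.add(tuple(current))
    return found


# Hypothetical example data, not taken from the article's corpora.
tokens = "Heart failure is treated with drugs ; heart failure is chronic".split()
terms = {("heart", "failure"), ("drugs",)}
labels = iob_labels(tokens, terms)
```

In the sequential view, both occurrences of "heart failure" receive their own labels and could in principle be classified differently depending on context, whereas the candidate-list view keeps a single entry per unique term.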

/content/journals/10.1075/term.21010.rig
2022-01-10
2022-05-23
  • Article Type: Research Article
Keyword(s): automatic term extraction; sequential labelling; terminology