Volume 28, Issue 2
  • ISSN 0929-9971
  • E-ISSN: 1569-9994



In this paper, we propose the first method for automatic Vietnamese medical term discovery and extraction from clinical texts. The method combines linguistic filtering, based on open patterns that we define, with nested term extraction and statistical ranking using C-value. It requires no annotated corpora, external data resources, or parameter settings, and it places no restriction on term length. Besides its specialization in Vietnamese medical terms, a further novelty is that it uses Pointwise Mutual Information to split nested terms and a disjunctive acceptance condition to extract them. Evaluated on real Vietnamese electronic medical records, it achieves a precision of about 74% and a recall of about 92%, and it remains stably effective on small datasets. It outperforms previous work in the same category, that is, methods that use neither annotated corpora nor external data resources. Our method and empirical evaluation analysis can lay a foundation for further research and development in Vietnamese medical term discovery and extraction.
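To make the two statistical ingredients named in the abstract concrete, the following is a minimal Python sketch of C-value ranking and Pointwise Mutual Information. It is an illustration under stated assumptions, not the paper's implementation: it uses the standard C-value formulation with a log2 length weight, and the `weakest_link` helper is a hypothetical reading of "splitting nested terms with PMI" in which the split falls between the two adjacent words with the lowest association. The function names, the toy Vietnamese candidates, and all counts are invented for the example.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freqs):
    """Score candidates (tuple of words -> corpus frequency) with C-value.

    Standard formulation (assumed here): log2(length) times the frequency,
    where the frequency of a nested candidate is reduced by the mean
    frequency of the longer candidates that contain it.
    """
    scores = {}
    for term, freq in freqs.items():
        # Frequencies of the longer candidates in which `term` is nested.
        nesting = [f for other, f in freqs.items()
                   if len(other) > len(term) and contains(other, term)]
        if nesting:
            freq -= sum(nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * freq
    return scores


def pmi(p_xy, p_x, p_y):
    """Pointwise Mutual Information in bits; -inf if the pair never co-occurs."""
    if p_xy == 0:
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))


def weakest_link(words, unigrams, bigrams):
    """Hypothetical helper: index i of the weakest adjacent pair.

    Returns i such that (words[i], words[i + 1]) has the lowest PMI,
    i.e. the most plausible place to split a candidate into nested terms.
    """
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def link(i):
        x, y = words[i], words[i + 1]
        return pmi(bigrams.get((x, y), 0) / n_bi,
                   unigrams[x] / n_uni,
                   unigrams[y] / n_uni)

    return min(range(len(words) - 1), key=link)


# Toy data (invented counts): a two-word candidate nested in a three-word one.
toy_freqs = {("viêm", "phổi"): 10, ("viêm", "phổi", "cấp"): 4}
toy_scores = c_value(toy_freqs)
```

With these toy counts, the nested candidate keeps only the occurrences that are not explained by the longer candidate (10 minus 4), while the longer candidate benefits from the length weight. Note that the unmodified log2 weight gives single-word candidates a score of zero; a method without term length restriction would need a variant of the weight, which this sketch does not attempt to reproduce.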




