Volume 24, Issue 1
  • ISSN 0929-9971
  • E-ISSN: 1569-9994


Narrative clinical records and biomedical articles constitute rich sources of information about phenotypes, i.e., markers distinguishing individuals with specific medical conditions from the general population. Phenotypes help clinicians to provide personalised treatments. However, locating information about them within huge document repositories is difficult, since each phenotypic concept can be mentioned in many ways. Normalisation methods automatically map divergent phrases to unique concepts in domain-specific terminologies, to allow location and linking of all mentions of a concept of interest. We have developed a hybrid normalisation method (HYPHEN) to handle concept mentions with wide-ranging characteristics, across different text types. HYPHEN integrates various normalisation techniques that handle surface-level variations (e.g., differences in word order, word forms or acronyms/abbreviations) and semantic-level variations (where terms have similar meanings, but potentially unrelated forms). HYPHEN achieves robust performance for both biomedical academic text and narrative clinical records, and can significantly outperform related methods.
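
To make the idea of normalisation concrete, the following is a minimal sketch of mapping variant phrasings (differing in word order, word form or abbreviation) to a single concept identifier. It is not HYPHEN's implementation: the terminology entries, the concept IDs, the abbreviation table, the stopword list, the suffix-stripping rule, the use of Jaccard token overlap and the 0.6 threshold are all assumptions chosen purely for illustration.

```python
import re

# Hypothetical mini-terminology: concept ID -> preferred term (IDs are invented).
TERMINOLOGY = {
    "C001": "chronic obstructive pulmonary disease",
    "C002": "shortness of breath",
}

# Hypothetical abbreviation table (an assumption, not taken from the article).
ABBREVIATIONS = {
    "copd": "chronic obstructive pulmonary disease",
    "sob": "shortness of breath",
}

STOPWORDS = {"of", "the", "in", "with"}


def normalise_tokens(phrase):
    """Lowercase, expand a known abbreviation, drop stopwords, strip plural 's'."""
    text = phrase.lower().strip()
    text = ABBREVIATIONS.get(text, text)
    stems = set()
    for token in re.findall(r"[a-z]+", text):
        if token in STOPWORDS:
            continue
        # Crude plural stripping; a real system would use a proper lemmatiser.
        if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
            token = token[:-1]
        stems.add(token)
    # A set ignores word order, so reordered phrases compare as equal.
    return frozenset(stems)


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def map_mention(mention, threshold=0.6):
    """Return the best-matching concept ID, or None if nothing is close enough."""
    mention_tokens = normalise_tokens(mention)
    best_id, best_score = None, 0.0
    for concept_id, term in TERMINOLOGY.items():
        score = jaccard(mention_tokens, normalise_tokens(term))
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id if best_score >= threshold else None


if __name__ == "__main__":
    for mention in ["COPD", "pulmonary diseases, chronic obstructive",
                    "breath shortness", "dyspnoea"]:
        print(mention, "->", map_mention(mention))
```

In this sketch the first three mentions resolve to a concept, while "dyspnoea" does not map to "shortness of breath" because the two share no tokens. Terms with similar meanings but unrelated forms therefore need a different kind of technique from string matching alone.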






  • Article Type: Research Article
Keyword(s): normalisation; phenotypic information; term variation; terminological resources