Volume 24, Issue 1
  • ISSN 0929-9971
  • E-ISSN: 1569-9994


We address the problem of recognizing irrelevant phrases in terminology lists produced by an automatic term extraction tool. We focus on identifying multi-word phrases that are general terms or discourse expressions. We defined several methods based on comparing domain corpora, and one method based on the contexts of phrases identified in a large corpus of general language. The methods were tested on Polish data, using six domain corpora and one general corpus. Two test sets were prepared to evaluate the methods. The first consisted largely of presumably irrelevant phrases, since we selected phrases that occurred in at least three domain corpora. The second consisted mainly of domain terms, since it was composed of the top-ranked phrases automatically extracted from the analyzed domain corpora.
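
The selection criterion for the first test set (phrases occurring in at least three domain corpora) can be illustrated with a minimal sketch. The function name, the dictionary layout, and the English toy phrases are assumptions made for illustration; they are not the authors' implementation or data.

```python
from collections import Counter


def phrases_in_min_corpora(corpus_phrases, min_corpora=3):
    """Return phrases found in at least `min_corpora` of the given corpora.

    `corpus_phrases` maps a (hypothetical) corpus name to the set of
    phrases extracted from that corpus.
    """
    counts = Counter()
    for phrases in corpus_phrases.values():
        # Count each phrase once per corpus, whatever its frequency there.
        counts.update(set(phrases))
    return {phrase for phrase, n in counts.items() if n >= min_corpora}


# Toy data, invented for illustration only.
corpora = {
    "medicine": {"blood pressure", "point of view", "in this case"},
    "law": {"legal act", "point of view", "in this case"},
    "economics": {"interest rate", "point of view", "in this case"},
    "history": {"world war", "point of view"},
}

candidates = phrases_in_min_corpora(corpora, min_corpora=3)
```

Phrases shared by several unrelated domains are more likely to be general terms or discourse expressions than domain terms, which is why such a list is presumed to contain many irrelevant phrases.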

The results show that the task is quite hard, as the inter-annotator agreement is low. Several of the tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method, with a precision of about 0.75 on half of the tested list, was the context-based method using a modified contextual diversity coefficient.
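
One plausible reading of "precision on half of the tested list" is the share of truly irrelevant phrases among the top half of a method's ranking. The sketch below assumes that reading, along with a ranking that puts the most likely irrelevant phrases first and a gold set of phrases marked irrelevant by annotators; the function name and toy data are hypothetical.

```python
def precision_at_top_fraction(ranked_phrases, irrelevant, fraction=0.5):
    """Precision over the top `fraction` of a ranked phrase list.

    `ranked_phrases` is assumed to be ordered from most to least likely
    irrelevant; `irrelevant` is the (assumed) gold set of irrelevant phrases.
    """
    cutoff = int(len(ranked_phrases) * fraction)
    if cutoff == 0:
        return 0.0
    top = ranked_phrases[:cutoff]
    hits = sum(1 for phrase in top if phrase in irrelevant)
    return hits / cutoff


# Toy ranking and gold labels, invented for illustration only.
ranking = ["point of view", "blood pressure", "in this case", "legal act"]
gold_irrelevant = {"point of view", "in this case"}

score = precision_at_top_fraction(ranking, gold_irrelevant, fraction=0.5)
```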

Although the methods were tested on Polish, they seem to be language independent.






  • Article Type: Research Article
Keyword(s): automatic term recognition; domain corpora; irrelevant phrases; similar phrases