Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication - Volume 24, Issue 1, 2018
-
NWJC2Vec
Author(s): Masayuki Asahara
pp. 7–22
In this paper, we present NWJC2Vec, a word embedding dataset constructed from the ‘NINJAL Web Japanese Corpus (NWJC)’. NWJC is a Web-crawled text corpus containing 25.8 billion tokens. We construct two versions of the word embedding dataset: one based on surface forms, the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the ‘Word List by Semantic Principles (Bunrui Goihyo)’.
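The contrast between the two versions can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the morphological triples below are invented (in the actual work they come from MeCab with UniDic), and a tiny co-occurrence count stands in for word2vec-style embedding training.

```python
from collections import defaultdict

# Toy morphologically analysed sentences: (surface, lemma, POS) triples.
# Invented for illustration; the paper obtains these from MeCab + UniDic.
analysed = [
    [("走っ", "走る", "動詞"), ("た", "た", "助動詞"), ("犬", "犬", "名詞")],
    [("犬", "犬", "名詞"), ("が", "が", "助詞"), ("走る", "走る", "動詞")],
]

def surface_tokens(sentences):
    """Token stream based on surface forms only."""
    return [[surf for surf, _, _ in s] for s in sentences]

def morpheme_tokens(sentences):
    """Token stream using morpheme information (lemma + POS)."""
    return [[f"{lemma}/{pos}" for _, lemma, pos in s] for s in sentences]

def cooc_vectors(sentences):
    """Tiny stand-in for embedding training: bag-of-context counts."""
    vecs = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, tok in enumerate(sent):
            for j, ctx in enumerate(sent):
                if i != j:
                    vecs[tok][ctx] += 1
    return {t: dict(c) for t, c in vecs.items()}

# Surface forms keep 走っ and 走る as two types; lemma+POS merges them,
# so the morpheme-based vocabulary is smaller and less sparse.
surf_vocab = {t for s in surface_tokens(analysed) for t in s}
morph_vocab = {t for s in morpheme_tokens(analysed) for t in s}
cooc = cooc_vectors(morpheme_tokens(analysed))
```

The point of the second token stream is that inflected variants share one vector, which matters for a highly inflecting language like Japanese.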
-
Distributed specificity for automatic terminology extraction
Author(s): Ehsan Amjadian, Diana Inkpen, T. Sima Paribakht and Farahnaz Faez
pp. 23–40
The present article explores two novel methods that integrate distributed representations with terminology extraction. Both methods assess the specificity of a word (unigram) to the target corpus by leveraging its distributed representation in the target domain as well as in the general domain. The first approach uses this distributed specificity as a filter; the second applies it to the corpus directly. The filter can be mounted on any other Automatic Terminology Extraction (ATE) method, allows merging any number of other ATE methods, and achieves remarkable results with minimal training. The direct approach does not perform as well as the filtering approach, but it confirms that, with distributed specificity as the word representation, very little data is required to train an ATE classifier. This encourages the development of minimally supervised ATE algorithms in the future.
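One plausible realisation of the filtering idea is sketched below. It is an assumption-laden toy, not the authors' formulation: it assumes the domain and general embedding spaces are aligned, defines specificity as one minus the cosine between a word's two vectors, and uses invented two-dimensional vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def distributed_specificity(word, domain_vecs, general_vecs):
    """Score how specific a word is to the target domain by comparing
    its embedding in the domain space with its embedding in the general
    space (assumed aligned). A word whose usage shifts between the two
    domains gets a high score."""
    if word not in domain_vecs or word not in general_vecs:
        return 1.0  # unseen in one space: maximally specific by convention
    return 1.0 - cosine(domain_vecs[word], general_vecs[word])

def specificity_filter(candidates, domain_vecs, general_vecs, threshold=0.5):
    """Use distributed specificity as a filter on top of any ATE ranking."""
    return [w for w in candidates
            if distributed_specificity(w, domain_vecs, general_vecs) >= threshold]

# Invented toy vectors: "stent" is used differently in the two corpora,
# "the" is used identically everywhere.
domain = {"stent": {"d0": 0.1, "d1": 0.9}, "the": {"d0": 1.0, "d1": 0.1}}
general = {"stent": {"d0": 0.9, "d1": 0.1}, "the": {"d0": 1.0, "d1": 0.1}}
```

Because the filter only consumes a candidate list and two vector tables, it can be stacked on any upstream ATE method, which is the property the abstract highlights.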
-
Clinical sublanguages
Author(s): Leonie Grön and Ann Bertels
pp. 41–65
Due to its specific linguistic properties, the language found in clinical records has been characterized as a distinct sublanguage. Even within the clinical domain, though, there are major differences in language use, which has led to more fine-grained distinctions based on medical fields and document types. However, previous work has mostly neglected the influence of term variation. By contrast, we propose to integrate the potential for term variation in the characterization of clinical sublanguages. By analyzing a corpus of clinical records, we show that the different sections of these records vary systematically with regard to their lexical, terminological and semantic composition, as well as their potential for term variation. These properties have implications for automatic term recognition, as they influence the performance of frequency-based term weighting.
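The frequency-based term weighting the abstract refers to can be illustrated with plain tf-idf, treating each record section as a document. This is a generic sketch, not the authors' code; the mini-record is invented.

```python
import math
from collections import Counter

def tfidf(sections):
    """Plain tf-idf over record sections, each section treated as one
    document. `sections` maps a section name to its token list."""
    df = Counter()
    for toks in sections.values():
        df.update(set(toks))          # document frequency per token
    n = len(sections)
    weights = {}
    for name, toks in sections.items():
        tf = Counter(toks)
        weights[name] = {t: (tf[t] / len(toks)) * math.log(n / df[t])
                         for t in tf}
    return weights

# Invented mini-record: a token confined to one section gets a positive
# weight, while a token shared by every section weighs exactly zero --
# so the lexical make-up of each section directly shapes the scores.
record = {
    "anamnesis":  ["patient", "reports", "chest", "pain"],
    "medication": ["patient", "receives", "aspirin"],
}
w = tfidf(record)
```

This dependence of the weights on how vocabulary is distributed across sections is exactly why systematic differences between sections matter for automatic term recognition.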
-
Recognition of irrelevant phrases in automatically extracted lists of domain terms
Author(s): Agnieszka Mykowiecka, Małgorzata Marciniak and Piotr Rychlik
pp. 66–90
In our paper, we address the problem of recognition of irrelevant phrases in terminology lists obtained with an automatic term extraction tool. We focus on identification of multi-word phrases that are general terms or discourse expressions. We defined several methods based on comparison of domain corpora and a method based on contexts of phrases identified in a large corpus of general language. The methods were tested on Polish data. We used six domain corpora and one general corpus. Two test sets were prepared to evaluate the methods. The first one consisted of many presumably irrelevant phrases, as we selected phrases which occurred in at least three domain corpora. The second set mainly consisted of domain terms, as it was composed of the top-ranked phrases automatically extracted from the analyzed domain corpora.
The results show that the task is quite hard as the inter-annotator agreement is low. Several tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method, with a precision of about 0.75 on half of the tested list, was the context based method using a modified contextual diversity coefficient.
Although the methods were tested on Polish, they seem to be language-independent.
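The abstract's best-performing method relies on a modified contextual diversity coefficient; the modification is not spelled out here, so the sketch below implements only the basic idea: count how many distinct contexts a phrase occurs in, per occurrence, in a general-language corpus. The toy corpus is invented.

```python
def contextual_diversity(phrase, corpus_sentences, window=1):
    """Basic contextual diversity: distinct (left, right) context word
    tuples per occurrence of the phrase. Discourse expressions occur in
    many different contexts; domain terms tend to recur in few."""
    phrase_toks = phrase.split()
    n = len(phrase_toks)
    contexts, occurrences = set(), 0
    for sent in corpus_sentences:
        for i in range(len(sent) - n + 1):
            if sent[i:i + n] == phrase_toks:
                occurrences += 1
                left = tuple(sent[max(0, i - window):i])
                right = tuple(sent[i + n:i + n + window])
                contexts.add((left, right))
    return len(contexts) / occurrences if occurrences else 0.0

# Invented general-language corpus, pre-tokenised.
corpus = [
    "in general terms vary".split(),
    "results hold in general here".split(),
    "blood pressure is high".split(),
    "blood pressure is low".split(),
]
```

A phrase with diversity near 1.0 is a candidate for removal from the terminology list, while a low score suggests a genuine domain term.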
-
HYPHEN
Author(s): Paul Thompson and Sophia Ananiadou
pp. 91–121
Narrative clinical records and biomedical articles constitute rich sources of information about phenotypes, i.e., markers distinguishing individuals with specific medical conditions from the general population. Phenotypes help clinicians to provide personalised treatments. However, locating information about them within huge document repositories is difficult, since each phenotypic concept can be mentioned in many ways. Normalisation methods automatically map divergent phrases to unique concepts in domain-specific terminologies, to allow location and linking of all mentions of a concept of interest. We have developed a hybrid normalisation method (HYPHEN) to handle concept mentions with wide-ranging characteristics, across different text types. HYPHEN integrates various normalisation techniques that handle surface-level variations (e.g., differences in word order, word forms or acronyms/abbreviations) and lexical-level variations (where terms have similar meanings, but potentially unrelated forms). HYPHEN achieves robust performance for both biomedical academic text and narrative clinical records, and has the ability to significantly outperform related methods.
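The surface-level side of normalisation can be sketched as below. This is not HYPHEN itself, only an illustration of the variation types it targets: case folding, abbreviation expansion and word-order differences. The abbreviation table and mentions are invented; lexical-level variation (similar meaning, unrelated form) needs semantic resources and is not attempted here.

```python
import re

# Invented abbreviation table; a real system would draw these from a
# domain terminology resource.
ABBREVIATIONS = {"mi": "myocardial infarction", "bp": "blood pressure"}

def normalise_mention(mention):
    """Surface-level normalisation only: lowercase, strip punctuation,
    expand abbreviations, and sort words so that mentions differing
    only in word order map to the same canonical key."""
    text = mention.lower()
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    words = []
    for w in text.split():
        words.extend(ABBREVIATIONS.get(w, w).split())
    return " ".join(sorted(words))  # order-insensitive canonical key

# Divergent mentions collapse to one key, which can then be linked to
# a single concept identifier in a terminology.
key = normalise_mention("acute MI")
```

Mapping every mention to such a key is what lets all occurrences of one phenotypic concept be located and linked across a repository.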
-
Improving term candidates selection using terminological tokens
Author(s): Mercè Vàzquez and Antoni Oliver
pp. 122–147
The identification of reliable terms from domain-specific corpora using computational methods is a task that has to be validated manually by specialists, which is a highly time-consuming activity. To reduce this effort and improve term candidate selection, we implemented the Token Slot Recognition method, a filtering method based on terminological tokens which is used to rank extracted term candidates from domain-specific corpora. This paper presents the implementation of our term candidate filtering method in linguistic and statistical approaches to automatic term extraction, applied to several domain-specific corpora in different languages. We observed that the filtering method outperforms raw frequency in term candidate selection, ranking more terms at the top of the candidate list; for statistical term extraction the improvement is between 15% and 25% in both precision and recall. Our analyses further revealed a reduction in the number of term candidates to be validated manually by specialists. In conclusion, the number of term candidates extracted automatically from domain-specific corpora has been reduced significantly using the Token Slot Recognition filtering method, so term candidates can be easily and quickly validated by specialists.
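The general shape of a token-based filter like this can be sketched as follows. The abstract does not detail the Token Slot Recognition scoring, so this is an assumed simplification: rank candidates by the share of their token slots filled by tokens already seen in validated terms. The candidate list and token set are invented.

```python
def tsr_rank(candidates, term_tokens):
    """Hedged sketch of a Token-Slot-Recognition-style filter: rank
    multi-word term candidates by the fraction of their tokens that are
    known terminological tokens. The actual TSR method may differ."""
    def score(cand):
        toks = cand.split()
        return sum(t in term_tokens for t in toks) / len(toks)
    return sorted(candidates, key=score, reverse=True)

# Tokens harvested from an (invented) list of already-validated terms.
known = {"neural", "network", "gradient"}
ranked = tsr_rank(["neural network", "the results", "gradient step"], known)
```

Pushing candidates built from known terminological tokens to the top, and candidates made of general-language tokens to the bottom, is what shrinks the list a specialist has to validate by hand.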