TExSIS: Bilingual terminology extraction from parallel corpora using chunk-based alignment
- Source: Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, Volume 19, Issue 1, Jan 2013, p. 1 - 30
Abstract
We report on TExSIS, a flexible bilingual terminology extraction system that uses a sophisticated chunk-based alignment method for the generation of candidate terms, after which the specificity of the candidate terms is determined by combining several statistical filters. Although the set-up of the architecture is largely language-independent, we present terminology extraction results for four different languages and three language pairs. Gold standard data sets were created for French-Italian, French-English and French-Dutch, which allowed us to evaluate not only precision, as is common practice, but also recall. We compared the TExSIS approach, which takes a multilingual perspective from the start, with the more commonly used approach of first identifying term candidates monolingually and then aligning the source and target terms. A comparison of our system with the LUIZ approach described by Vintar (2010) reveals that TExSIS outperforms LUIZ for both monolingual and bilingual terminology extraction. Our results also clearly show that the precision of the alignment is crucial for the success of the terminology extraction. Furthermore, based on the observation that the precision scores for bilingual terminology extraction outperform those of the monolingual systems, we conclude that multilingual evidence helps to determine unithood in less related languages.
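To make the evaluation idea concrete, the following is a minimal sketch of how precision and recall could be computed for extracted bilingual term pairs against a gold standard. It is not taken from TExSIS: the function name `precision_recall` and the French-English term pairs are hypothetical and serve only as an illustration, and exact matching of pairs is an assumption about how hits are counted.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted term pairs against a gold set.

    Assumption: a pair counts as correct only if it matches a gold pair exactly.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted pairs that appear in the gold standard.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard pairs that the system found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical French-English term pairs, invented for illustration only.
gold_pairs = {
    ("boîte de vitesses", "gearbox"),
    ("frein à main", "handbrake"),
    ("pare-brise", "windscreen"),
    ("essuie-glace", "windscreen wiper"),
}
extracted_pairs = {
    ("boîte de vitesses", "gearbox"),
    ("frein à main", "handbrake"),
    ("pare-brise", "window"),  # misaligned target term, counts as an error
}

p, r = precision_recall(extracted_pairs, gold_pairs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Without a gold standard only precision can be estimated, by manually judging the extracted pairs. Recall requires knowing every term pair that should have been found, which is why the gold standard data sets matter here.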