Collocations and statistical analysis of n-grams

Multiword expressions (MWEs) are sequences of words that co-occur so often that they are perceived as a single linguistic unit. Since MWEs pervade natural language, their identification is pertinent to a range of tasks within lexicography, terminology and language technology. We apply various statistical association measures (AMs) to word sequences from the Norwegian Newspaper Corpus (NNC) in order to rank two- and three-word sequences (bigrams and trigrams) in terms of their tendency to co-occur. The results show that some statistical measures favour relatively frequent MWEs (e.g. <i>i motsetning til</i> ‘as opposed to’), whereas other measures favour relatively low-frequency units, which typically comprise loan words (<i>de facto</i>), technical terms (<i>notarius publicus</i>) and phrasal anglicisms (<i>practical jokes</i>; cf. G. Andersen, this volume). On this basis we evaluate the relevance of each of these measures for lexicography, terminology and language technology purposes.
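
The contrast between frequency-favouring and rarity-favouring measures can be illustrated with a minimal sketch. The abstract does not name the AMs that were applied, so the choice of pointwise mutual information (PMI) and the t-score here is an assumption made purely for illustration. The function name bigram_scores and the toy token list are likewise hypothetical and are not drawn from the NNC.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (frequency, pmi, t_score)} for adjacent word pairs.

    Assumption: PMI and the t-score stand in for the (unnamed) association
    measures; the corpus size is approximated by the number of tokens.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Frequency expected if the two words were statistically independent.
        expected = unigrams[w1] * unigrams[w2] / n
        pmi = math.log2(observed / expected)
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (observed, pmi, t_score)
    return scores


# Hypothetical toy corpus (not NNC data): one recurring sequence, one rare pair.
tokens = ["i", "motsetning", "til", "noe"] * 5 + ["de", "facto"]

scores = bigram_scores(tokens)
top_by_pmi = max(scores, key=lambda bigram: scores[bigram][1])
top_by_t = max(scores, key=lambda bigram: scores[bigram][2])

print("Highest PMI:", top_by_pmi, "frequency", scores[top_by_pmi][0])
print("Highest t-score:", top_by_t, "frequency", scores[top_by_t][0])
```

On this toy input, PMI ranks the single-occurrence pair de facto highest, while the t-score ranks one of the pairs that occur five times highest. This mirrors, on a very small scale, the split between measures that reward rare, exclusive pairings and measures that reward frequent ones.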

