
Abstract
The article investigates the two main corpus indicators of word commonness, frequency and dispersion, through a cross-validation analysis of frequency and four dispersion measures (‘Range’, ‘Chi-squared’, ‘Deviation of Proportions’ and ‘Juilland’s D’). The approach estimates how well each of these measures predicts the distribution of corpus items in an extracted language sample. Based on a dataset of 273 Norwegian compounds, the results show that Deviation of Proportions in particular is a robust measure of dispersion that can be used in conjunction with frequency to substantiate assertions of word commonness based on corpus data. In addition, dispersion measures reflect not only the kind of distribution from which the frequency statistic is generated, but also how reliably the frequency estimate in the corpus sample represents frequency in the language variety from which the corpus is sampled.
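For readers unfamiliar with the four dispersion measures named above, the following is a minimal sketch of how they are commonly computed from per-part frequency counts. It assumes textbook formulations rather than the article's own definitions, which the abstract does not state: Range as the number of corpus parts containing the item, Chi-squared as the Pearson statistic against size-proportional expected counts, Deviation of Proportions as half the summed absolute difference between observed and expected proportions, and Juilland's D as one minus the coefficient of variation divided by the square root of the number of parts minus one. The function name `dispersion_measures` and its parameters are hypothetical.

```python
import math


def dispersion_measures(counts, part_sizes):
    """Compute four dispersion measures for one corpus item.

    counts[i]     -- occurrences of the item in corpus part i
    part_sizes[i] -- size of corpus part i in tokens (must be positive)

    Assumption: standard formulations, not necessarily the article's.
    """
    n = len(counts)
    if n != len(part_sizes) or n < 2:
        raise ValueError("need at least two parts, with one size per count")
    total = sum(counts)
    if total == 0:
        raise ValueError("item does not occur in the corpus")

    corpus_size = sum(part_sizes)
    # Expected share of the item's occurrences in each part, by part size.
    shares = [size / corpus_size for size in part_sizes]

    # Range: number of parts in which the item occurs at least once.
    value_range = sum(1 for c in counts if c > 0)

    # Chi-squared: observed counts against size-proportional expected counts.
    chi_squared = sum(
        (c - total * s) ** 2 / (total * s) for c, s in zip(counts, shares)
    )

    # Deviation of Proportions: 0 means perfectly even, values near 1 uneven.
    dp = 0.5 * sum(abs(c / total - s) for c, s in zip(counts, shares))

    # Juilland's D on per-part relative frequencies, using the population
    # standard deviation: 1 means perfectly even, 0 means one part only.
    rel = [c / size for c, size in zip(counts, part_sizes)]
    mean_rel = sum(rel) / n
    sd = math.sqrt(sum((r - mean_rel) ** 2 for r in rel) / n)
    juilland_d = 1 - (sd / mean_rel) / math.sqrt(n - 1)

    return {
        "range": value_range,
        "chi_squared": chi_squared,
        "dp": dp,
        "juilland_d": juilland_d,
    }
```

Under these assumptions, an item that occurs 10 times in only one of four equally sized parts has a Range of 1, a Deviation of Proportions of 0.75 and a Juilland's D of 0, whereas an item spread evenly across all four parts has a Range of 4, a Deviation of Proportions of 0 and a Juilland's D of 1.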