Describing a translational corpus
A single corpus can be described in a number of different ways. We consider how the frequencies of linguistic features may be quantified: in terms of their “average” occurrence, their dispersion among text segments, and whether they follow the familiar “bell curve” characteristic of a normal distribution. We describe how to determine the corpus size needed to measure these quantities with the required degree of confidence. We consider “aboutness”: the extent to which individual linguistic features characterise the corpus as a whole. We describe vocabulary richness, the extent to which the author of a text continually brings in new vocabulary, and collocations: groups of words which are found together more often than one would expect by chance.
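As a minimal illustration of the kind of quantities involved, the following Python sketch computes the average occurrence of a feature per text segment, its dispersion among segments, and a simple vocabulary richness score. It rests on several assumptions that are not taken from the text: tokens are obtained by lowercased whitespace splitting, segments are consecutive blocks of a fixed number of tokens, dispersion is measured as the population standard deviation of the per-segment counts, and richness is measured as the type-token ratio, which is only one of several possible measures. The function names and the toy sentence are hypothetical.

```python
from statistics import mean, pstdev


def tokenize(text):
    # Assumption: lowercased whitespace tokenisation, no punctuation handling.
    return text.lower().split()


def segment_counts(tokens, feature, segment_size):
    # Count occurrences of `feature` in consecutive, equal-sized segments.
    # A trailing partial segment is dropped so that all segments are comparable.
    n_segments = len(tokens) // segment_size
    return [
        tokens[i * segment_size:(i + 1) * segment_size].count(feature)
        for i in range(n_segments)
    ]


def type_token_ratio(tokens):
    # Distinct word forms (types) divided by running words (tokens).
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Toy data, invented for illustration only.
text = "the cat sat on the mat the dog sat on the log"
tokens = tokenize(text)

counts = segment_counts(tokens, "the", segment_size=4)
average = mean(counts)       # "average" occurrence per segment
dispersion = pstdev(counts)  # spread of the counts among segments
richness = type_token_ratio(tokens)

print(counts, average, dispersion, richness)
```

The type-token ratio depends on text length, so in this form it is only meaningful when comparing texts or segments of equal size.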