A data-driven approach to anglicism identification in Norwegian


Anglicisms are words of English origin that have entered Norwegian, either denoting conceptual innovations such as <i>interface</i> or denoting existing concepts in parallel with their Norwegian counterparts (<i>boots</i>). In this chapter we investigate whether machine-learning methods could improve the anglicism component of the classification tool that is currently used to categorize new words appearing in the Norwegian Newspaper Corpus. We derive classification features by extracting three-character sequences (trigrams) from long lists of uniquely English and uniquely Norwegian words. Next, we test two frequency-based approaches and one statistics-based approach to selecting features from this initial pool of trigrams. Finally, using the TiMBL memory-based learning system, we train a classifier with our selections of trigrams, identifying the sets of trigrams that are most predictive of anglicisms. The results show that the data-driven frequency approach, although not sufficient as a stand-alone method for automatic anglicism identification, provides a valuable supplement to the existing knowledge-based classification tool.
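
The trigram extraction and a frequency-based selection step might be sketched as follows. This is only an illustration under stated assumptions, not the chapter's actual procedure: the function names, the scoring rule (difference in relative frequency between the two word lists), and the binary feature encoding are all hypothetical, and the resulting vectors would still need to be written out in whatever format the learner (here TiMBL) expects.

```python
from collections import Counter


def trigrams(word):
    """Return all three-character sequences in a word (empty if shorter than 3)."""
    w = word.lower()
    return [w[i:i + 3] for i in range(len(w) - 2)]


def select_by_frequency(english_words, norwegian_words, top_n):
    """Pick the top_n trigrams whose relative frequency differs most
    between the two word lists (an assumed scoring rule)."""
    en = Counter(t for w in english_words for t in trigrams(w))
    no = Counter(t for w in norwegian_words for t in trigrams(w))
    en_total = sum(en.values()) or 1
    no_total = sum(no.values()) or 1
    scores = {t: en[t] / en_total - no[t] / no_total for t in set(en) | set(no)}
    # Sort by absolute score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-abs(scores[t]), t))
    return ranked[:top_n]


def featurize(word, selected):
    """Encode a word as a binary vector: 1 if the selected trigram occurs in it."""
    present = set(trigrams(word))
    return [1 if t in present else 0 for t in selected]
```

A new word would then be passed through `featurize` with the selected trigrams, and the resulting vector handed to the classifier together with its English or Norwegian label during training.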

