Text Corpora and Multilingual Lexicography
  • ISSN 1384-6655
  • E-ISSN: 1569-9811

Abstract

In this paper, a language model based on the probabilities of text n-grams is used as a measure of distance between Slovenian and 15 other European languages. To construct the model, a Huffman tree is generated from all n-grams (n = 1 to 32, frequency 2 or more) in a training corpus of Slovenian literary texts (2.7 million words), and a Huffman code is computed for every leaf of the tree. To apply the model to a new text sample, the sample is cut into n-grams (n = 1 to 32) in such a way that the sum of the model's Huffman code lengths over all the resulting n-grams is minimal.
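The procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the character-level n-grams, and the fixed `unknown_cost` charged for a character that never occurs in the training data are all assumptions made here for illustration, since the abstract does not say how unseen symbols are handled. The sketch builds Huffman code lengths from n-gram frequencies and then uses dynamic programming to find the segmentation with the smallest total code length.

```python
import heapq
from collections import Counter
from itertools import count


def ngram_counts(text, max_n=32, min_freq=2):
    """Count all character n-grams (1..max_n) and keep those seen min_freq+ times."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def huffman_code_lengths(freqs):
    """Return {symbol: Huffman code length in bits} for the given frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tie = count()  # tie-breaker so the heap never compares the symbol lists
    heap = [(f, next(tie), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper in the tree.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), syms1 + syms2))
    return lengths


def min_code_length(text, lengths, max_n=32, unknown_cost=32):
    """Smallest total code length (bits) over all segmentations of text.

    unknown_cost is an assumed penalty for a single character that is
    absent from the model; it is not taken from the paper.
    """
    best = [0] + [float("inf")] * len(text)
    for end in range(1, len(text) + 1):
        for n in range(1, min(max_n, end) + 1):
            cost = lengths.get(text[end - n:end])
            if cost is None:
                if n > 1:
                    continue  # unseen longer n-grams are simply not usable
                cost = unknown_cost
            best[end] = min(best[end], best[end - n] + cost)
    return best[len(text)]


def bits_per_char(text, lengths, max_n=32, unknown_cost=32):
    """Average coding length in bits per character."""
    return min_code_length(text, lengths, max_n, unknown_cost) / len(text)


if __name__ == "__main__":
    model = huffman_code_lengths(ngram_counts("abababab", max_n=2))
    print(bits_per_char("abab", model, max_n=2))
```

Under these assumptions, a text that resembles the training corpus can be covered by long, frequent n-grams with short codes and so yields fewer bits per character, while a text in a more distant language falls back on short or unknown n-grams and costs more.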

Applied to all 16 translations of Plato’s Republic from the TELRI CD-ROM, the model produced the following language order (average coding length in bits per character): Slovenian (2.37), Serbocroatian (3.77), Croatian (3.84), Bulgarian (3.96), Czech (4.10), Polish (4.32), Russian (4.46), Slovak (4.46), Latvian (4.74), Lithuanian (4.94), English (5.40), French (5.67), German (5.69), Romanian (5.76), Finnish (6.11), and Hungarian (6.47).

/content/journals/10.1075/ijcl.6.si.05jak
2001-12-17
2025-04-23