Volume 21, Issue 2
  • ISSN 1387-6759
  • E-ISSN: 1569-9897



N‑gram analysis (popularized e.g. by Biber et al., 1999) has become a widely used method for identifying recurrent language patterns. Although extracting n‑grams from a corpus may seem straightforward, it proves very challenging when applied cross-linguistically (cf. e.g. Ebeling and Ebeling, 2013; Granger and Lefer, 2013; Čermáková and Chlumská, 2017). The major issue is that the quantities of n‑grams of a given length do not correspond across typologically different languages. Consequently, n‑grams of the same length may function differently in each language, which makes a direct comparison inadequate. Our paper introduces a function that models the relation between the quantities of n‑grams in typologically distant languages, using Czech and English as the main example (along with some other language pairs). Based on this model, we suggest which n‑gram lengths should be contrasted to better reflect the size of the n‑gram inventory in each language. The correspondence may not be intuitive (e.g. a Czech 2-gram may best correspond to an English 2.5-gram), but it still gives researchers a general guide as to what might be useful to include in their analysis (in this case, 2-grams in Czech and both 2- and 3-grams in English).
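
The idea of a fractional n‑gram length can be illustrated with a minimal Python sketch. It is not the function proposed in the paper, which the abstract does not specify. It assumes that an n‑gram inventory is simply the set of distinct n‑grams in a tokenized text, and that inventory sizes between two integer lengths can be linearly interpolated. The helper names `ngram_types` and `matching_length` and the toy counts are hypothetical.

```python
def ngram_types(tokens, n):
    """Return the number of distinct n-grams (types) of length n in a token list."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})


def matching_length(target, counts):
    """Estimate the fractional n whose inventory size equals `target`.

    `counts` maps integer n-gram lengths to inventory sizes in the other
    language. Linear interpolation between neighbouring lengths is an
    assumption made for illustration only, not the paper's model.
    Returns None if `target` lies outside the observed range.
    """
    lengths = sorted(counts)
    for lo, hi in zip(lengths, lengths[1:]):
        if counts[lo] <= target <= counts[hi] and counts[lo] != counts[hi]:
            return lo + (target - counts[lo]) / (counts[hi] - counts[lo])
    return None


# Invented toy numbers: 150 Czech 2-gram types versus
# 100 English 2-gram types and 200 English 3-gram types.
english_counts = {2: 100, 3: 200}
print(matching_length(150, english_counts))  # 2.5
```

With real data, the counts would come from calling `ngram_types` on comparable (ideally parallel) corpora for each language. A result between two integers suggests including both neighbouring lengths in the comparison.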




  1. Baker, M.
    2004 A Corpus-Based View of Similarity and Difference in Translation. International Journal of Corpus Linguistics 9(2): 167–193.
    https://doi.org/10.1075/ijcl.9.2.02bak
  2. Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E.
    1999 Longman Grammar of Spoken and Written English. Harlow: Longman.
  3. Biber, D., Kim, Y. and Tracy-Ventura, N.
    2010 A Corpus-Driven Approach to Comparative Phraseology: Lexical Bundles in English, Spanish, and Korean. In Japanese/Korean Linguistics, Volume 17, S. Iwasaki, H. Hoji, P. M. Clancy and S.-O. Sohn (eds), 75–94. Stanford: Center for the Study of Language and Information (CSLI).
  4. Cheng, W., Greaves, C. and Warren, M.
    2006 From N‑gram to Skipgram to Concgram. International Journal of Corpus Linguistics 11(4): 411–433.
    https://doi.org/10.1075/ijcl.11.4.04che
  5. Cortes, V.
    2008 A Comparative Analysis of Lexical Bundles in Academic History Writing in English and Spanish. Corpora 3(1): 43–57.
    https://doi.org/10.3366/E1749503208000063
  6. Cvrček, V.
    2019 Calc: Corpus Calculator. Prague: Czech National Corpus. Available at www.korpus.cz/calc
  7. Čermák, F. and Rosen, A.
    2012 The Case of InterCorp, a Multilingual Parallel Corpus. International Journal of Corpus Linguistics 17(3): 411–427.
    https://doi.org/10.1075/ijcl.17.3.05cer
  8. Čermáková, A. and Chlumská, L.
    2017 Expressing Place in Children’s Literature: Testing the Limits of the N‑gram Method in Contrastive Linguistics. In Cross-Linguistic Correspondences: From Lexis to Genre, T. Egan and H. Dirdal (eds), 75–95. Amsterdam: John Benjamins.
    https://doi.org/10.1075/slcs.191.03cer
  9. Ebeling, J. and Ebeling, S. Oksefjell
    2013 Patterns in Contrast. Studies in Corpus Linguistics 58. Amsterdam: John Benjamins.
    https://doi.org/10.1075/scl.58
  10. Ebeling, J. and Ebeling, S. Oksefjell
    2017 A Cross-Linguistic Comparison of Recurrent Word Combinations in a Comparable Corpus of English and Norwegian Fiction. In Contrasting English and other Languages through Corpora, M. Janebová, E. Lapshinova-Koltunski and M. Martínková (eds), 2–31. Newcastle upon Tyne: Cambridge Scholars Publishing.
  11. Forchini, P. and Murphy, A. C.
    2008 N‑grams in Comparable Specialized Corpora: Perspectives on Phraseology, Translation, and Pedagogy. International Journal of Corpus Linguistics 13(3): 351–367.
    https://doi.org/10.1075/ijcl.13.3.06for
  12. Granger, S.
    2014 A Lexical Bundle Approach to Comparing Languages: Stems in English and French. Languages in Contrast 14(1): 58–72.
    https://doi.org/10.1075/lic.14.1.04gra
  13. Granger, S. and Lefer, M.-A.
    2013 Enriching the Phraseological Coverage of High-Frequency Adverbs in English–French Bilingual Dictionaries. In Advances in Corpus-Based Contrastive Linguistics: Studies in Honour of Stig Johansson, K. Aijmer and B. Altenberg (eds), 157–176. Amsterdam: John Benjamins.
    https://doi.org/10.1075/scl.54.10gra
  14. Hasselgård, H.
    2017 Temporal Expression in English and Norwegian. In Contrasting English and other Languages through Corpora, M. Janebová, E. Lapshinova-Koltunski and M. Martínková (eds), 75–101. Newcastle upon Tyne: Cambridge Scholars Publishing.
  15. Kim, Y.
    2009 Korean Lexical Bundles in Conversations and Academic Texts. Corpora 4(2): 135–165.
    https://doi.org/10.3366/E1749503209000288
  16. Mahlberg, M.
    2012 Corpus Stylistics and Dickens’s Fiction. London: Routledge.
  17. Milička, J.
    2013 Rank-Frequency Relation & Type-Token Relation: Two Sides of the Same Coin. In Methods and Applications of Quantitative Linguistics, M. Obradovič, E. Kelih and R. Köhler (eds), 163–172. Belgrade: University of Belgrade and Academic Mind.
  18. Nebeský, L. and Novák, P.
    1996 Větné faktory a jejich podíl na analýze věty. Slovo a Slovesnost 57(4): 282–295.
  19. Rapoport, A.
    1982 Zipf’s Law Re-Visited. Quantitative Linguistics 16(1): 1–28.
  20. Rosen, A., Vavřín, M. and Zasina, A. J.
    2018 The InterCorp Corpus, Version 11 of 11 October 2018. Praha: Institute of the Czech National Corpus, FF UK. Available at www.korpus.cz
  21. Sinclair, J.
    2004 The Search for Units of Meaning. In Trust the Text: Language, Corpus and Discourse, R. Carter (ed.), 24–48. London: Routledge.
    https://doi.org/10.4324/9780203594070-6
  22. Tracy-Ventura, N., Cortes, V. and Biber, D.
    2007 Lexical Bundles in Spanish Speech and Writing. In Working with Spanish Corpora, G. Parodi (ed.), 217–231. London: Continuum.
  23. Zipf, G. K.
    1949 Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Cambridge: Addison-Wesley Press.


  • Article Type: Research Article
Keyword(s): correspondence; Czech/English/Spanish; n‑grams; parallel corpus