Modelling crosslinguistic n‑gram correspondence in typologically different languages

Jiří Milička; Václav Cvrček; Lucie Lukešová

doi:10.1075/lic.19018.mil

ISSN 1387-6759
E-ISSN: 1569-9897

Author(s): Jiří Milička¹, Václav Cvrček¹, Lucie Lukešová¹

Affiliations: ¹ Charles University, Czech Republic
Source: Languages in Contrast, Volume 21, Issue 2, Aug 2021, pp. 217–249
DOI: https://doi.org/10.1075/lic.19018.mil
- Received: 19 Jul 2019
- Accepted: 30 Nov 2020
- Version of Record published: 12 Jan 2021

Abstract

N‑gram analysis (popularized e.g. by Biber et al., 1999) has become a popular method for the identification of recurrent language patterns. Although the extraction of n‑grams from a corpus may seem straightforward, it proves to be very challenging when applied cross-linguistically (cf. e.g. Ebeling and Ebeling, 2013; Granger and Lefer, 2013; Čermáková and Chlumská, 2017). The major issue is that the quantities of n‑grams of a certain length in typologically different languages do not correspond. Consequently, n‑grams of a given length may function differently across languages, rendering a direct comparison inadequate. Our paper introduces a function capable of modelling the relation between the quantities of n‑grams in typologically distant languages, using the example of Czech and English (and some other language pairs). Based on our model, we can suggest what n‑gram lengths should be contrasted to better reflect the size of n‑gram inventories in each language. The correspondence may not be intuitive (e.g. a Czech 2-gram may best correspond to an English 2.5-gram), but it still provides researchers with a general guide as to what might be useful to include in their analysis (e.g. in this case 2-grams in Czech and 2- and 3-grams in English).
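
As a rough illustration of the counting problem described in the abstract, the sketch below counts the distinct n-gram types of each length in two tokenized texts and maps a fractional target length (such as 2.5) to the integer lengths on either side of it. It is not the authors' model: the toy sentences, the whitespace tokenization, and the function names `ngram_type_counts` and `neighbouring_lengths` are assumptions made only for this example.

```python
import math


def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_type_counts(tokens, max_n):
    """Map each length n in 1..max_n to the number of distinct n-grams."""
    return {n: len(set(ngrams(tokens, n))) for n in range(1, max_n + 1)}


def neighbouring_lengths(fractional_n):
    """Integer n-gram lengths that bracket a fractional length.

    A value such as 2.5 maps to both 2 and 3; a whole number maps to itself.
    """
    lower, upper = math.floor(fractional_n), math.ceil(fractional_n)
    return [lower] if lower == upper else [lower, upper]


# Toy data only (hypothetical sentences, whitespace tokenization).
english = "the cat sat on the mat and the cat slept".split()
czech = "kočka seděla na rohožce a kočka spala".split()

print(ngram_type_counts(english, 3))
print(ngram_type_counts(czech, 3))
print(neighbouring_lengths(2.5))
```

On real corpora the inventories would be counted over far larger samples. The point here is only that the number of types at a given length differs between languages, and that a non-integer correspondence can be approximated by including both neighbouring integer lengths.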

References

Baker, M.
2004 A Corpus-Based View of Similarity and Difference in Translation. International Journal of Corpus Linguistics 9(2): 167–193. https://doi.org/10.1075/ijcl.9.2.02bak
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E.
1999 Longman Grammar of Spoken and Written English. Harlow: Longman.
Biber, D., Kim, Y. and Tracy-Ventura, N.
2010 A Corpus-Driven Approach to Comparative Phraseology: Lexical Bundles in English, Spanish, and Korean. In Japanese/Korean Linguistics, Volume 17, S. Iwasaki, H. Hoji, P. M. Clancy and S.-O. Sohn (eds), 75–94. Stanford: Center for the Study of Language and Information (CSLI).
Cheng, W., Greaves, C. and Warren, M.
2006 From N‑gram to Skipgram to Concgram. International Journal of Corpus Linguistics 11(4): 411–433. https://doi.org/10.1075/ijcl.11.4.04che
Cortes, V.
2008 A Comparative Analysis of Lexical Bundles in Academic History Writing in English and Spanish. Corpora 3(1): 43–57. https://doi.org/10.3366/E1749503208000063
Cvrček, V.
2019 Calc: Corpus Calculator. Prague: Czech National Corpus. Available at www.korpus.cz/calc
Čermák, F. and Rosen, A.
2012 The Case of InterCorp, a Multilingual Parallel Corpus. International Journal of Corpus Linguistics 17(3): 411–427. https://doi.org/10.1075/ijcl.17.3.05cer
Čermáková, A. and Chlumská, L.
2017 Expressing Place in Children’s Literature: Testing the Limits of the N‑gram Method in Contrastive Linguistics. In Cross-Linguistic Correspondences: From Lexis to Genre, T. Egan and H. Dirdal (eds), 75–95. Amsterdam: John Benjamins. https://doi.org/10.1075/slcs.191.03cer
Ebeling, J. and Ebeling, S. Oksefjell
2013 Patterns in Contrast. Studies in Corpus Linguistics 58. Amsterdam: John Benjamins. https://doi.org/10.1075/scl.58
Ebeling, J. and Ebeling, S. Oksefjell
2017 A Cross-Linguistic Comparison of Recurrent Word Combinations in a Comparable Corpus of English and Norwegian Fiction. In Contrasting English and other Languages through Corpora, M. Janebová, E. Lapshinova-Koltunski and M. Martínková (eds), 2–31. Newcastle upon Tyne: Cambridge Scholars Publishing.
Forchini, P. and Murphy, A. C.
2008 N‑grams in Comparable Specialized Corpora: Perspectives on Phraseology, Translation, and Pedagogy. International Journal of Corpus Linguistics 13(3): 351–367. https://doi.org/10.1075/ijcl.13.3.06for
Granger, S.
2014 A Lexical Bundle Approach to Comparing Languages: Stems in English and French. Languages in Contrast 14(1): 58–72. https://doi.org/10.1075/lic.14.1.04gra
Granger, S. and Lefer, M.-A.
2013 Enriching the Phraseological Coverage of High-Frequency Adverbs in English–French Bilingual Dictionaries. In Advances in Corpus-Based Contrastive Linguistics: Studies in Honour of Stig Johansson, K. Aijmer and B. Altenberg (eds), 157–176. Amsterdam: John Benjamins. https://doi.org/10.1075/scl.54.10gra
Hasselgård, H.
2017 Temporal Expression in English and Norwegian. In Contrasting English and other Languages through Corpora, M. Janebová, E. Lapshinova-Koltunski and M. Martínková (eds), 75–101. Newcastle upon Tyne: Cambridge Scholars Publishing.
Kim, Y.
2009 Korean Lexical Bundles in Conversations and Academic Texts. Corpora 4(2): 135–165. https://doi.org/10.3366/E1749503209000288
Mahlberg, M.
2012 Corpus Stylistics and Dickens’s Fiction. London: Routledge.
Milička, J.
2013 Rank-Frequency Relation & Type-Token Relation: Two Sides of the Same Coin. In Methods and Applications of Quantitative Linguistics, M. Obradovič, E. Kelih, R. Köhler (eds), 163–172. Belgrade: University of Belgrade and Academic Mind.
Nebeský, L. and Novák, P.
1996 Větné faktory a jejich podíl na analýze věty. Slovo a Slovesnost 57(4): 282–295.
Rapoport, A.
1982 Zipf’s Law Re-Visited. Quantitative Linguistics 16(1): 1–28.
Rosen, A., Vavřín, M. and Zasina, A. J.
2018 The InterCorp Corpus, Version 11 of 11 October 2018. Praha: Institute of the Czech National Corpus. FF UK. Available at www.korpus.cz
Sinclair, J.
2004 The Search for Units of Meaning. In Trust the Text: Language, Corpus and Discourse, R. Carter (ed.), 24–48. London: Routledge. https://doi.org/10.4324/9780203594070-6
Tracy-Ventura, N., Cortes, V. and Biber, D.
2007 Lexical Bundles in Spanish Speech and Writing. In Working with Spanish Corpora, G. Parodi (ed.), 217–231. London: Continuum.
Zipf, G. K.
1949 Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Cambridge: Addison-Wesley Press.

Article Type: Research Article

Keyword(s): correspondence; Czech/English/Spanish; n‑grams; parallel corpus
