Volume 23, Issue 2
  • ISSN 1384-6655
  • E-ISSN: 1569-9811



This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation. Multi-unit measures face an additional segmentation problem: once the implicit length constraint of pairwise measures is abandoned, association measures must also identify the borders of meaningful sequences. This paper takes a vector-based approach to the segmentation problem by using 18 unique measures to describe different aspects of multi-unit association. An examination of these measures across eight languages shows that they are stable across languages and that each provides a unique rank of associated sequences. Taken together, these measures expand corpus-based approaches to association by generalizing across varying lengths and types of representation.






  • Article Type: Research Article
Keyword(s): association strength; collocations; multi-unit association; sequences; ΔP