Volume 29, Issue 1
  • ISSN 1384-6655
  • E-ISSN: 1569-9811



In this study, we propose a new evaluation scheme to assess the strengths and limitations of collocation extraction measures and explore type-sensitive methods for extracting collocations. We introduce the pooling strategy widely used in Information Retrieval and automate the evaluation process using online dictionaries. Sixteen well-known metrics are evaluated for effectiveness and then compared in terms of their distributional and linguistic properties. The results show that Group A methods (e.g. z-score, Dice, PMI) are more effective at extracting low-frequency collocations at relatively small extraction scales. In contrast, Group B methods (e.g. t-test, LMI, LLR) perform better at finding high-frequency collocations, and most of them outperform Group A methods as the extraction scale increases. Moreover, Group A prefers NN collocations, while Group B identifies collocations with a wide range of syntactic structures. This study offers suggestions for future studies seeking to identify hybrid extraction methods, as well as for language educators and dictionary compilers.
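The frequency contrast between the two groups can be illustrated with a minimal sketch. It uses the standard textbook formulas for PMI, Dice and the t-score, which may differ in detail from the variants evaluated in the article. The corpus size and the two candidate pairs are invented toy counts, not data from the study, and the names `pmi`, `dice`, `t_score` and `candidates` are illustrative only.

```python
import math

def pmi(o11, f1, f2, n):
    """Pointwise mutual information: log2 of observed over expected co-occurrence."""
    expected = f1 * f2 / n
    return math.log2(o11 / expected)

def dice(o11, f1, f2):
    """Dice coefficient: 2 * joint frequency over the sum of the word frequencies."""
    return 2 * o11 / (f1 + f2)

def t_score(o11, f1, f2, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = f1 * f2 / n
    return (o11 - expected) / math.sqrt(o11)

# Hypothetical corpus size and counts (assumptions for illustration only).
# Each tuple is (joint frequency, frequency of word 1, frequency of word 2).
N = 1_000_000
candidates = {
    "rare pair": (10, 12, 15),
    "frequent pair": (2000, 20000, 30000),
}

scores = {
    name: {
        "PMI": pmi(o11, f1, f2, N),
        "Dice": dice(o11, f1, f2),
        "t-score": t_score(o11, f1, f2, N),
    }
    for name, (o11, f1, f2) in candidates.items()
}

for name, row in scores.items():
    print(name, {measure: round(value, 3) for measure, value in row.items()})
```

With these toy counts, PMI and Dice rank the rare pair above the frequent one, while the t-score reverses the order. This matches the direction of the contrast the abstract reports between Group A and Group B measures, although two invented pairs cannot stand in for the article's evaluation.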






  • Article Type: Research Article
Keyword(s): association measure; collocation extraction; evaluation; pooling; statistical metrics