Volume 22, Issue 4
  • ISSN 1384-6655
  • E-ISSN: 1569-9811


With expanding evidence on the formulaic nature of human communication, there is a growing need to extend discourse marker research to functionally analogous multi-word expressions (MWEs). In contrast to the common qualitative approaches to discourse marker identification in corpora, this paper presents a corpus-driven, semi-automatic approach to the identification of multi-word discourse markers (MWDMs) in the reference corpus of spoken Slovene. Using eight statistical measures, we identified 173 structurally fixed discourse-marking MWEs, distinguished by a high number of tokens, a large proportion of grammatical words, and semantic heterogeneity. This list is considerably longer than one that manual inspection of smaller corpus samples would have yielded. Although frequency-based methods produced satisfactory results, the best precision in MWDM identification was achieved using the t-score association measure, while the overall poor performance of mutual information suggests its inadequacy for the extraction of MWDMs and other MWEs with similar lexical and distributional features.
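The contrast between the t-score and mutual information can be illustrated with a minimal sketch. It uses the standard textbook definitions of both measures and assumes an independence model for the expected frequency of an n-gram; the abstract does not state the exact formulas used in the paper, so the function names, the n-gram generalisation, and all counts below are hypothetical.

```python
import math


def expected_frequency(word_freqs, corpus_size):
    """Expected n-gram frequency if its words occurred independently (assumed model)."""
    expected = corpus_size
    for freq in word_freqs:
        expected *= freq / corpus_size
    return expected


def t_score(observed, word_freqs, corpus_size):
    """t-score: (O - E) / sqrt(O). Rewards sequences with high absolute frequency."""
    expected = expected_frequency(word_freqs, corpus_size)
    return (observed - expected) / math.sqrt(observed)


def mutual_information(observed, word_freqs, corpus_size):
    """Mutual information: log2(O / E). Rewards exclusive, often rare, combinations."""
    expected = expected_frequency(word_freqs, corpus_size)
    return math.log2(observed / expected)


# Hypothetical counts in a 1,000,000-token corpus (not taken from the paper).
N = 1_000_000
# A frequent sequence of two grammatical words, each very common on its own.
frequent = {"observed": 400, "word_freqs": [20_000, 10_000]}
# A rare sequence of two content words that almost always occur together.
rare = {"observed": 4, "word_freqs": [5, 4]}

for label, counts in (("frequent", frequent), ("rare", rare)):
    t = t_score(counts["observed"], counts["word_freqs"], N)
    mi = mutual_information(counts["observed"], counts["word_freqs"], N)
    print(f"{label}: t-score={t:.2f}, MI={mi:.2f}")
```

With these invented counts, the frequent grammatical sequence ranks higher by t-score, whereas the rare content-word pair ranks higher by mutual information. This is consistent with the abstract's observation that mutual information is poorly suited to expressions made up largely of frequent grammatical words.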




