Volume 22, Issue 2
  • ISSN 1384-6655
  • E-ISSN: 1569-9811


Forensic authorship attribution is concerned with identifying the writers of anonymous criminal documents. Over the last twenty years, computer scientists have developed a wide range of statistical procedures using a number of different linguistic features to measure similarity between texts. However, much of this work is not of practical use to forensic linguists who need to explain in reports or in court why a particular method of identifying potential authors works. This paper sets out to address this problem using a corpus linguistic approach and the 176-author 2.5 million-word Enron Email Corpus. Drawing on literature positing the idiolectal nature of collocations, phrases and word sequences, this paper tests the accuracy of word n-grams in identifying the authors of anonymised email samples. Moving beyond the statistical analysis, the usage-based concept of entrenchment is offered as a means by which to account for the recurring and distinctive production of idiolectal word n-grams.
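To make the idea of comparing texts by their word n-grams concrete, the following is a minimal sketch rather than the paper's actual procedure. The abstract does not state which similarity measure is used, so the Jaccard coefficient over sets of word n-grams is an assumption chosen for illustration. The function names (`word_ngrams`, `jaccard`, `attribute`), the whitespace tokenisation, and the default n-gram length are likewise hypothetical.

```python
from typing import Dict, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word n-grams in a text (lower-cased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard coefficient: shared items divided by all distinct items."""
    if not a and not b:
        return 0.0  # two empty sets are treated as having no similarity
    return len(a & b) / len(a | b)


def attribute(disputed: str, candidates: Dict[str, str], n: int = 3) -> str:
    """Return the candidate whose known writing shares most n-grams with the disputed text."""
    disputed_grams = word_ngrams(disputed, n)
    return max(
        candidates,
        key=lambda name: jaccard(disputed_grams, word_ngrams(candidates[name], n)),
    )
```

Under these assumptions, an anonymised email sample would be passed as `disputed` and each candidate author's remaining emails as one entry of `candidates`. The n-grams shared between the sample and the top-ranked author can then be listed and inspected directly, which is the kind of explainable evidence the abstract argues forensic linguists need.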






  • Article Type: Research Article
Keyword(s): authorship attribution; Enron; entrenchment; forensic linguistics; idiolect