Volume 26, Issue 2
  • ISSN 1384-6655
  • E-ISSN: 1569-9811

As large-scale learner corpora become increasingly available, it is vital to develop natural language processing (NLP) technology that provides the rich linguistic annotations necessary for second language (L2) research. We present a system for automatically analyzing subcategorization frames (SCFs) in learner English. SCFs link lexis with morphosyntax, shedding light on the interplay between lexical and structural information in learner language. SCFs are also crucial to the study of a wide range of phenomena, including individual verbs, verb classes and varying syntactic structures. To illustrate the usefulness of our system for learner corpus research and second language acquisition (SLA), we investigate how L2 learners diversify their use of SCFs in text and how this diversity changes with L2 proficiency.



