Refining and modifying the EFCAMDAT
Lessons from creating a new corpus from an existing large-scale English learner language database
- Source: International Journal of Learner Corpus Research, Volume 6, Issue 2, Dec 2020, p. 220 - 236
- 10 Dec 2020
Abstract
This report outlines the development of a new corpus created by refining and modifying the largest open-access L2 English learner database, the EFCAMDAT. The extensive data-curation process, which can inform the development and use of other corpora, included converting the database from XML to a tabular format and removing problematic markup tags and non-English texts. The final dataset contains two corresponding samples, written by similar learners in response to different prompts, a pairing that offers a unique opportunity for analyzing task effects and conducting replication studies. Overall, the resulting corpus contains ~406,000 texts in the first sample and ~317,000 texts in the second, written by learners representing diverse L1s and a wide range of L2 proficiency levels.
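
To make two of the curation steps named above more concrete, the following is a minimal Python sketch of converting XML records to a tabular format and stripping leftover markup tags from the learner texts. It is not the authors' pipeline: the element and attribute names (`writing`, `learner`, `topic`, `text`, `level`, `nationality`), the sample records, and the helper functions are all hypothetical and chosen only for illustration. Removing non-English texts would require a separate language-identification step, which is not shown here.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real database schema may differ.
SAMPLE_XML = """<writings>
  <writing id="1" level="3">
    <learner id="42" nationality="br"/>
    <topic id="7">Introducing yourself</topic>
    <text>Hi! My name is Ana.&lt;br/&gt;I am from Brazil.</text>
  </writing>
  <writing id="2" level="5">
    <learner id="43" nationality="de"/>
    <topic id="9">Describing food</topic>
    <text>I like &lt;b&gt;pizza&lt;/b&gt; very much.</text>
  </writing>
</writings>"""

# Column order for the tabular output (assumed, not from the source).
FIELDNAMES = ["writing_id", "level", "learner_id", "nationality", "topic", "text"]

# Matches anything that looks like a markup tag, e.g. <br/> or </b>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Replace tag-like spans with a space, then collapse whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return " ".join(without_tags.split())


def xml_to_rows(xml_string):
    """Flatten each <writing> element into one dictionary (one table row)."""
    root = ET.fromstring(xml_string)
    rows = []
    for writing in root.iter("writing"):
        learner = writing.find("learner")
        rows.append({
            "writing_id": writing.get("id"),
            "level": writing.get("level"),
            "learner_id": learner.get("id") if learner is not None else None,
            "nationality": learner.get("nationality") if learner is not None else None,
            "topic": writing.findtext("topic", default=""),
            "text": strip_markup(writing.findtext("text", default="")),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(xml_to_rows(SAMPLE_XML)))
```

Keeping one row per text, with learner metadata repeated on each row, is one plausible layout for a tabular version; it makes it straightforward to filter by proficiency level or L1 with standard spreadsheet or data-frame tools.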