The effects of frequency, duration, and intensity on L2 learning through Duolingo: A ‘natural’ experiment

Instructed second language (L2) research has frequently addressed the effects of ‘massed’ vs. ‘spaced’ practice (e.g., Nakata & Suzuki, 2019). The present study addresses Rogers and Cheung’s (2021) concerns about the ecological validity of such work via a ‘natural experiment’ (Craig et al., 2017). Learners’ self-determined exposure and in-app behavior were examined in relation to language gains over time. Duolingo learners of Spanish or French (N = 287) completed a background questionnaire, scales measuring L2 motivation and grit, and two tests of L2 proficiency before and after a six-month period of user-controlled app usage. Total minutes of app exposure exhibited a correlation with written but not oral proficiency gains. More dependable correlates of gains were frequency- and curriculum-oriented measures. Additionally, L2 grit and motivation were weakly to moderately correlated with several in-app behaviors. We conclude with implications for how apps can best be leveraged to produce L2 gains.


Introduction
Second-language (L2) development is greatly influenced by the type and amount of instruction that learners receive (e.g., Norris & Ortega, 2000; Saito & Plonsky, 2019). Less clear, however, are the effects of frequency, duration, and intensity of target language exposure and practice. That is, are intensive periods of instruction (i.e., "massed learning") most effective, or is instruction more effective through more frequent (i.e., "spaced") exposure (see Carpenter, 2020; Rogers & Cheung, 2021)?
Informed by a quickly growing body of research in second-language acquisition (SLA) that has sought to address this question (e.g., Kasprowicz et al., 2019; Li & DeKeyser, 2019), the present study builds on these and other recent studies of distributed learning in SLA (e.g., Serrano & Huang, 2018; Suzuki & DeKeyser, 2017) as well as several decades of theoretical and empirical attention in educational psychology (see Carpenter, 2020; Rohrer, 2015) to shed light on the effects of frequency and duration of Duolingo app usage on L2 learning. In doing so, the study seeks to make several unique and worthwhile contributions.
First, being the first to examine distributional effects for app-based language learning, this study allows us to gain a better understanding of the effects of frequency, duration, and intensity specifically for the Duolingo app and its users. Second, previous studies on distributed learning have targeted predominantly vocabulary and grammar knowledge in either traditional classroom or laboratory settings (e.g., Kasprowicz et al., 2019; Nakata & Suzuki, 2019). The present study, by contrast, considers overall proficiency in both oral and written modes in a new and flourishing context of instructional technology. Third, Rogers and Cheung (2021) raised concerns over the ecological validity of previous work on distributed learning. In particular, the authors questioned whether existing results, which are almost exclusively obtained in labs, would hold in less controlled settings. The present study addresses this concern by conducting a 'natural experiment' (Craig et al., 2017), whereby learners' self-determined exposure and behavior (i.e., frequency, duration, and intensity of app usage) are examined in relation to language proficiency gains made over a six-month period. Fourth, in addition to exposure effects, two individual differences, L2 grit (Teimouri et al., 2022) and motivation (Dörnyei, 2009), are modeled in relation to learner behavior and gains in proficiency to isolate and better understand the effects of frequency, duration, and intensity of instruction. The literature review that follows offers an overview of theories and studies on distributed learning in SLA and neighboring disciplines, introduces the concept of natural experiment, surveys recent studies on the role of app-based technology in instructed SLA, and discusses the concepts of L2 grit and motivation that are relevant to the present study. The last section of the literature review provides an overview of the Duolingo course structure at the time of data collection to better situate the study.

Literature Review
Distributed Learning
Rogers and Cheung (2021) provided definitions of several concepts pertinent to the present investigation. First, "distribution of practice, also referred to as input spacing, refers to whether and how learning is spaced over multiple learning episodes" (pp. 1138-1139; see also Rogers, 2017 for an overview of theoretical and methodological differences in research on the distribution of practice in SLA and cognitive psychology). Second, "massed practice refers to experimental conditions in which learning is concentrated into a single, uninterrupted training session, whereas distributed or spaced practice refers to learning that is spread over two or more training episodes" (p. 1139). According to Carpenter (2020), there is a consensus in educational research that distributed practice, or spacing effect, is more conducive to learning when repeated instruction is dispersed over time rather than happens in rapid succession (for a review of spacing and massing, see Rohrer, 2015). This stems in part from deficient processing theory, which postulates that spaced instruction provides learners with more opportunities for noticing and ultimately leads to better retention of the information presented. Another theory that explains the spacing effect is that of encoding variability; it emphasizes the role of contextual cues, which become more pronounced in spaced learning experiences. Next, a study-phase retrieval theory suggests that retrieval of learners' previous experience, which typically happens during distributed instruction, aids in future retention of the material. Finally, consolidation in the form of neural activation processes is more likely to occur during repeated spaced rather than repeated massed practice. Nonetheless, spaced instructional activities may be challenging to implement in real-life classrooms as this approach requires considerable planning on the part of instructors who need to create multiple lesson plans in advance and adhere to spacing schedules consistently for longer periods of time.
Furthermore, Rogers and Cheung (2021) contended that experimental research in the domain of distributed learning tends to overemphasize internal validity of a study to the detriment of its external and ecological validity, which limits the generalizability of tightly controlled experiments to authentic instructional settings. In fact, enhanced ecological validity of Rogers and Cheung's (2021) conceptual replication was arguably one of the reasons why their study did not lend support to the advantages of a more distributed L2 vocabulary learning practice (i.e., 8 days between training sessions) compared to a less distributed one (i.e., 1 day between training sessions) among English-as-a-foreign language (EFL) child participants in Hong Kong. This finding largely contradicted previous lab-based research with adult participants studying L2 vocabulary (e.g., Nakata & Suzuki, 2019). However, Rogers and Cheung's (2021) results resembled those of Kasprowicz et al. (2019), another study with enhanced ecological validity that examined the effects of distributed practice among young learners of French studying L2 grammar (verb morphology). The interval being examined by Kasprowicz et al. (2019), although quite constrained (3.5 vs. 7 days), emulated "the most common lesson frequency in UK primary schools (one or two lessons per week)" (p. 585); yet it did not yield differences in learning between groups. By contrast, Li and DeKeyser (2019) found enhanced procedural knowledge retention among adult learners of Mandarin Chinese exposed to instruction with shorter rather than longer intervals between sessions (1 day vs. 1 week), but these results did not hold for declarative knowledge. Similarly complex results were reported by Serrano and Huang (2018). In their study, intense repeated L2 reading sessions (1-day interval) were more beneficial for short-term gains in L2 vocabulary for teenage EFL learners in Taiwan (following the immediate posttest), whereas spaced sessions (1-week interval) resulted in higher retention long-term (from the immediate to the delayed posttest). Nonetheless, the difference between the two groups was negligible when vocabulary gains were compared from the pretest to the delayed posttest.
To make the picture even more complex, Suzuki and DeKeyser's (2017) study of L2 Japanese morphology in adult learners found no advantages of distributed practice (7-day interval) over massed practice (1-day interval) for utterance accuracy; moreover, distributed practice was less conducive to utterance fluency than massed practice. However, the authors emphasized that the results of their lab-based study may not be generalizable to real-life classrooms. Notably, Kim and Webb's (2022) meta-analysis of distributed practice revealed the advantage of spaced learning in SLA, as indicated by small-to-medium and medium-to-large effect sizes (i.e., g = 0.58 and 0.80 for L2 learning and retention, respectively) across 37 eligible studies (48 independent experiments) in their sample. Critically, the researchers argued that the spacing effect varied depending on the methodological features of the study design, which explains in part the complex and conflicting results of the primary studies reviewed above.

Natural Experiment
Despite the comprehensiveness of Kim and Webb's (2022) meta-analysis, one methodological variable that was not examined as a moderator in their sample was that of ecological validity of primary studies on distributed spacing. Apart from experimental designs, high ecological validity bears relevance for observational designs as well. One example of an observational study with increased ecological validity is a natural experiment, which refers to "any event not under the control of a researcher that divides a population into exposed and unexposed groups"; this allows natural experiments to "use this naturally occurring variation in exposure to identify the impact of the event on some outcome of interest" (Craig et al., 2017, p. 2). Although natural experiments have been particularly embraced in public health research, they have immediate applicability in instructed SLA and applied linguistics research more generally, especially in situations where experimental manipulations are not deemed reasonable or ethical. As such, natural experiments have arguably higher ecological validity compared to more traditional, rigidly controlled experiments. However, to our knowledge, no study to date has conducted a natural experiment to investigate the effectiveness of mobile-assisted language learning.

Mobile-Assisted Language Learning
As noted by Loewen (2020), the use of technology is now considered to be one of four major contexts in instructed SLA along with traditional classroom, study abroad, and immersion instruction. One of the advantages of instructional technology is that it has "the potential to speed up or enhance the process" of L2 learning and, most importantly, "deliver individualized instructional materials that meet learners at their specific levels of proficiency" (Loewen, 2020, p. 193). A second-order synthesis by Plonsky and Ziegler (2016) found a small advantage of computer-assisted language learning over traditional classroom instruction (based on the results of 14 meta-analyses in this domain, which included a total of 408 primary studies and over 14,000 language learners). Nonetheless, this synthesis was unable to examine the effectiveness of mobile-assisted language learning due to the lack of primary studies in this area.
The situation, however, is rapidly changing, and research focusing on the role of mobile- and app-based technology in instructed SLA is currently on the rise (e.g., García Botero et al., 2019; Jiang, Rollinson, et al., 2021; Loewen et al., 2019, 2020). In fact, several recent studies have focused specifically on the relative effectiveness of Duolingo vs. university-based language instruction. The beginner and intermediate Duolingo learners attained similar (and in some cases, superior) L2 proficiency levels compared with university students who studied foreign languages for four and five semesters, respectively (see Jiang et al., 2020; Jiang, Chen, et al., 2021). Of particular relevance to the present investigation is a study by Jiang, Rollinson, Plonsky, et al. (2021) that explored a possible relationship between app usage and gains in proficiency and observed modest correlations between the two (ρ = .02−.14 for L2 French listening and reading; ρ = .01−.06 for L2 Spanish listening and reading, respectively). However, only one of many possible temporal/exposure-related indicators was used (total hours). Furthermore, the sample included only novices, and no pretest data were collected, features the present study seeks to improve on. [...] investigation of Duolingo users' overall L2 Turkish proficiency along with five subareas (listening, speaking, writing, reading, and lexicogrammar); although comprehensive and informative, this study did not involve a control or comparison group and included only nine participants. Clearly, the field stands to benefit from more research on L2 learners' overall proficiency in the domain of mobile-assisted language learning.

L2 Grit and Motivation
Although research on individual differences has long established its niche in SLA overall (Gass et al., 2020) and computer-assisted language learning in particular (see Pawlak & Kruk, 2022), much remains uncertain about the role of individual differences in mobile-assisted language learning. For example, Loewen et al. (2020) raised concerns over participants' attitudes to some app-related features (e.g., lack of interaction and limited variation in Duolingo tasks). In the same vein, García Botero et al. (2019) noted inconsistencies between Duolingo learners' questionnaire responses, which pointed to students' motivation in and positive attitude towards using the app out of class, and their interview data, which demonstrated students' mixed views on engagement and lack of long-term interest while using the app.
To investigate these issues further, research into mobile-assisted language learning would benefit from examining language app users' academic perseverance, or grit, as well as their motivated learning behavior. Grit has been defined as "perseverance and passion for long-term goals" (Duckworth et al., 2007, p. 1087); some researchers have also conceptualized it as a facet of the personality trait of conscientiousness (Park et al., 2018; Schmidt et al., 2018). Nonetheless, it has become increasingly common in SLA research to conceptualize and measure grit as a domain-specific construct by tailoring scale items and instructions to a specific language learning context (see Teimouri et al., 2021 for more). Indeed, a growing body of research into L2 grit has found evidence of a positive relationship between language-domain-specific grit and achievement (e.g., Sudina & Plonsky, 2021a; Teimouri et al., 2022).
One of the constructs that is conceptually related to L2 grit is intended effort, defined as "the amount of time, effort, and energy L2 learners expend in the process of L2 learning" (Teimouri, 2017, p. 686) and recently reconceptualized into current L2 motivated learning behavior by Papi et al. (2019). This was done in order to avoid bias in favor of promotion-focus rather than prevention-focus learners and tap into learners' actual motivational behavior rather than their hypothetical disposition (see Papi et al., 2019).

The Duolingo Course Structure
All Duolingo courses are aligned with the Common European Framework of Reference (CEFR). Both French and Spanish courses start with a brief Intro section (also known as A1.0, where A corresponds to the beginner or Basic User level; see Council of Europe, 2001) followed by the A1 content, which has two sections (A1.1 and A1.2) and covers both communicatively functional as well as grammatical topics, and the A2 content, which also consists of two sections (A2.1 and A2.2) and covers more advanced vocabulary and grammar. The last section of each Duolingo course includes B1 content (where B corresponds to the intermediate or Independent User level; see Council of Europe, 2001). The B1 content has four sections (B1.1 through B1.4), at the end of which language learners are expected to have mastered even more advanced communicatively functional and grammatical topics (e.g., "World news," "Learning," "Subjunctive with common conjunctions," and "Past conditional" for French; "World news," "Gossip," "Imperfect subjunctive," and "Passive" for Spanish). In addition to allowing for multiple "opportunities for practice and repeated exposure to target language structures," the Duolingo courses combine "more implicit, comprehension-based learning with explicit feedback and explanations" (Jiang, Rollinson, Plonsky, et al., 2021, p. 981). Notably, Duolingo encourages a high "degree of user autonomy in navigating the platform," which translates into "substantial variation among individual learners on both the percentage of content they complete before reaching the end" of the B1.4 level and "on the total amount of time spent learning" (Jiang, Chen, et al., 2021, p. 2).

The Present Study
Expanding on previous research on distributed learning in SLA, the present study in the form of a natural experiment addressed the following research questions (

Method
Participants
In the Fall of 2021, a group of 787 participants studying Spanish (k = 406) or French (k = 381) on Duolingo were invited to participate and completed a pretest (completion rate = 34%). They were recruited at the beginning of the A1.2 section among beginner-level learners (see The Duolingo Course Structure in the Literature Review section for more). More specifically, the participants were at Row 18 of the French course tree structure (out of a total of 106 rows) and Row 21 of the Spanish course tree structure (out of a total of 121 rows), respectively. This suggests that the Duolingo course participants were already familiar with the basics (e.g., family and travel-related vocabulary and expressions, the present tense) as well as slightly more advanced topics (e.g., shopping and routines-related vocabulary, grammatical agreement).
Six months later, in the Spring of 2022, a group of 288 participants completed the posttest (out of a total of 427 of those who met the selection criteria and eligibility requirements, see Procedure; response rate = 67%). One participant was excluded due to completing a pretest in Spanish and a posttest in French. Therefore, the final sample comprised 287 participants (Spanish: k = 148; French: k = 139; age: M = 44.01, SD = 14.44, range: 19−77; gender: 61% female; 38% male; 1% other). Although all participants in the final sample were L1 English speakers, 10% of the respondents reported having been exposed to one or more other languages at home in early childhood. Participants' demographic and language-related characteristics by group are summarized in Table 1.

Instruments and Materials
Three different types of data and corresponding data sources were used in the study.
1. Exposure/Behavioral data. The first data source shed light on learners' exposure to the target language and related behavior (in-app engagement). Of particular interest were (a) the duration of app usage measured as total minutes per participant across the 6-month period of study (i.e., "Minutes"), (b) the number of times the learner opened the app in a given week (i.e., "Logins") and the number of days the learner completed at least one lesson (i.e., "Sessions"), two frequency measures, and (c) the following content-related/curriculum-oriented intensity variables: "Lessons" (i.e., the number of lessons completed), "Level reviews" (i.e., the final lesson for a given Level/Skill combination), "Skill practice" (i.e., when a learner goes back to review skills that they have already "gilded"), "Stories" (i.e., the number of stories completed), and "Tests" (i.e., the number of tests completed).
All of these indicators were used in their raw forms as predictors of learner gains.
2. Self-report data. To understand learner demographics as well as participants' language learning history, an instrument that largely mirrored [...] Second are a set of practical considerations. These proficiency measures are highly efficient and can be completed independently and online in approximately 20−30 minutes. Upon completion, these tests can then be scored quickly and accurately. The instruments are also freely available and do not carry any proprietary restrictions.
Finally, both C-tests and EITs have been developed and are available in a range of languages (Arabic, Chinese, German, Japanese, Russian).Therefore, the present study could be replicated in other L2s without changing this critical design feature (i.e., the dependent measure).The instruments used to collect self-report and L2 proficiency data are available in Appendix A.
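The exposure measures defined under the first data source can be illustrated with a minimal sketch. It assumes a hypothetical per-learner event log in which each record has a date, an event kind, and a duration in minutes; the record layout, the `Event` type, and the `summarize_exposure` function are illustrative assumptions and do not describe Duolingo's actual data format or the authors' processing pipeline.

```python
from collections import namedtuple
from datetime import date

# Hypothetical log record: calendar day, event kind, minutes spent.
Event = namedtuple("Event", ["day", "kind", "minutes"])


def summarize_exposure(events):
    """Aggregate a learner's log into raw counts mirroring the definitions above."""
    lesson_events = [e for e in events if e.kind == "lesson"]
    return {
        # Duration: total minutes across the whole period.
        "Minutes": sum(e.minutes for e in events),
        # Frequency: number of times the app was opened.
        "Logins": sum(1 for e in events if e.kind == "login"),
        # Frequency: number of distinct days with at least one completed lesson.
        "Sessions": len({e.day for e in lesson_events}),
        # Intensity: number of lessons completed.
        "Lessons": len(lesson_events),
    }


# Toy data for illustration only (not study data).
log = [
    Event(date(2021, 9, 1), "login", 0),
    Event(date(2021, 9, 1), "lesson", 6),
    Event(date(2021, 9, 1), "lesson", 5),
    Event(date(2021, 9, 2), "login", 0),
    Event(date(2021, 9, 3), "login", 0),
    Event(date(2021, 9, 3), "lesson", 7),
]
summary = summarize_exposure(log)
print(summary)
```

In this toy log, three logins produce only two "Sessions" because one login involved no completed lesson, which shows how the two frequency measures can diverge for the same learner.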

Procedure
Following IRB approval, the self-report and L2 proficiency measures were pilot-tested, and the data were collected using the online survey platform Gorilla (https://gorilla.sc). Eligible app users (see above) were invited to be part of the study starting on August 3, 2021; the first round of data collection lasted until November 5, 2021. Those who expressed interest were asked to begin the study by completing an online survey that included a consent form followed by (a) instruments for L2 grit and motivation (items for each scale were randomized to control for order effect), (b) a language background questionnaire, (c) an EIT, and (d) a C-test for their chosen language (Spanish or French). This battery of instruments was completed remotely, without a proctor, in about an hour. Upon completion, participants were reminded of the minimum app engagement required for participation (i.e., at least 2 logins to the app per week for the following 26 weeks). After all data were collected and de-identified, the C-tests were scored automatically, whereas the EITs were scored by four trained raters, all highly proficient in the target language (two raters per language: the lead rater scored all of the EIT items, whereas the second rater scored approximately 10% of the sample's EITs). Following rater training and norming sessions, which lasted approximately two hours, the raters for each language (French vs. Spanish) calibrated themselves and proceeded to independently score the EITs. To avoid potential rater bias, raters were kept unaware of which audio files had come from the pretest and which were from the posttest.
During the data clean-up, 30 cases (10% out of a total of 287; 17 Spanish, 13 French) were excluded listwise due to issues with recordings or participants' misinterpretation of the task (several produced English translations rather than French/Spanish imitations). There were no missing data on other variables.
RQ1a. The assumptions for paired samples t-tests by language were satisfactorily met (i.e., the dependent variable of proficiency gain scores was continuously scaled; the distribution of the differences in proficiency gain scores followed the normal curve and contained no extreme univariate outliers; the independent variable of test mode, oral vs. written, consisted of categorical data from two related groups).
RQ1b. There were no major violations of assumptions for the two-sample t-tests. For the EIT gains in French vs. Spanish, the dependent variable was approximately normally distributed for each language group, but Levene's test was statistically significant, suggesting a lack of homogeneity of variances; nonetheless, the sample sizes for the two language groups were roughly equal, in which case equal population variances are not required.
RQ4b. To check the assumptions for multiple regression analyses, first, all extreme univariate outliers were removed from the variables of interest (see RQ4a). Five multivariate outliers on three continuous predictor variables were removed as well. The assumptions of linearity, the absence of multicollinearity, the absence of autocorrelation, and the normality, linearity, and homoscedasticity of residuals were all met.

Preliminary Analyses: Scale Data
The inspection of the scale data revealed two items with low corrected item-total correlations (ITCs < .40) on the L2 Grit Consistency of Interest subscale: CI7R and CI8R, which considerably affected reliability of the scale and were, therefore, removed from further analyses. The rest of the corrected ITCs for all constructs and subconstructs were > .40 on both the pretest and the posttest. The stability of constructs over time was assessed by test-retest reliability: r(L2 grit) = .68; r(L2 perseverance of effort) = .69; r(L2 consistency of interest) = .58; r(L2 motivation) = .61, p < .001. As demonstrated in Table 2, internal-consistency reliability of the scales was also acceptable. Additionally, descriptive statistics indicated that the participants had the highest mean score on L2 consistency of interest and the lowest mean score on L2 motivation on both the pretest and the posttest.
Note. k = number of items; M = mean; SD = standard deviation; 95% CI = 95% confidence intervals of coefficient alphas; L2 = second language. Spearman's rho = .59 (a), .65 (b).

Preliminary Analyses: Proficiency Data
As shown in Table 3, the participants' mean scores on the EIT (i.e., oral proficiency test) were higher in Spanish than in French, which was observed during both the pretest and the posttest. However, Spanish EIT scores were more spread out, as demonstrated by higher standard deviations. The participants' average C-test scores (i.e., written proficiency test) were overall slightly higher than the EIT scores, except for Spanish pretest mean scores, which were virtually the same in both modes. Nonetheless, the two groups' proficiency in the written mode appeared to be at about the same level on both the pretest and the posttest (see Figure 1).
RQ1a. The results of dependent-samples t-tests for RQ1 demonstrated that there were no statistically significant differences in learner proficiency gains in the written vs. oral mode (see also Figure 1), and the effect sizes adjusted for the within-sample correlation were small (see Plonsky [...]
RQ2a. As demonstrated in Table 4, learners' self-determined exposure and behavior (i.e., frequency, duration, and intensity of the Duolingo app usage) were weakly to moderately correlated with their language proficiency gains made over a six-month period. More specifically, the exposure-related variables most strongly associated with gains measured in the oral mode (i.e., via EIT) were the number of lessons, level reviews, and logins. Raw duration measured in total minutes of exposure during the 6-month period of study exhibited almost no association with gains in the oral mode. As with EIT gains, in-app exposure/behaviors associated with gains in proficiency in the written mode included the number of lessons completed and total number of level reviews (although the latter association was neither particularly strong nor statistically significant, as shown by a confidence interval that crossed zero). Unlike EIT gains, however, minutes of exposure was found to be associated with C-test gains.
While addressing RQ2, it occurred to us that some of the observed correlations may have been artificially attenuated due to a restricted range of observed values. In particular, requiring at least two logins per week, though necessary to ensure regular exposure to the target language, seems to have yielded an unusual level of homogeneity in login data across the sample: The mean number of logins was 165 (6.35 logins per week) with a standard deviation of only 11.30. The relationships between logins and gains on the EIT and C-test were then re-examined using Thorndike's formula for correction for range restriction, resulting in substantially larger correlations in both cases (i.e., .46 and .11, respectively).
Notes. 1 Bias-corrected and accelerated 95% confidence intervals for correlation coefficients. *When adjusted for range restriction, the correlations for logins x EIT (r = .14) and logins x C-test gains (r = .03) increase to .46 and .11, respectively.
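The range-restriction adjustment can be sketched as follows. The sketch assumes the commonly cited form of Thorndike's correction (often labelled Case 2), in which the observed correlation is rescaled by the ratio of the unrestricted to the restricted standard deviation; the text does not state which variant was applied or what unrestricted standard deviation was used. The ratio below (3.66) is therefore a hypothetical value, back-solved so that the sketch reproduces the reported corrected correlations, and should not be read as the authors' input.

```python
import math


def correct_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case 2 correction (assumed variant).

    r_restricted: correlation observed in the range-restricted sample.
    sd_ratio: unrestricted SD divided by restricted SD of the predictor.
    """
    numerator = r_restricted * sd_ratio
    denominator = math.sqrt(1 + r_restricted ** 2 * (sd_ratio ** 2 - 1))
    return numerator / denominator


# Hypothetical SD ratio, back-solved for illustration; not reported in the text.
ASSUMED_SD_RATIO = 3.66

eit_corrected = correct_range_restriction(0.14, ASSUMED_SD_RATIO)
ctest_corrected = correct_range_restriction(0.03, ASSUMED_SD_RATIO)
print(round(eit_corrected, 2), round(ctest_corrected, 2))
```

Under this assumed ratio, the same correction applied to both observed correlations (.14 and .03) yields values that round to the two reported figures, which suggests a single unrestricted standard deviation for logins was used for both tests. A ratio of 1 leaves the correlation unchanged, as expected when there is no restriction.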
RQ2b. The results of the first multiple regression analysis suggested that when the three variables most strongly associated with EIT gains were entered into the model as predictors along with the target language, which was added as a covariate, the model explained 4−6% of the variance in EIT gains and was statistically significant: F(4, 238) = 3.56, p = .008, R2 = .06, adjusted R2 = .04. 'Level Review' emerged as the only meaningful positive predictor (see Table 5). The results of the second multiple regression analysis demonstrated that when the two variables most strongly correlated with C-test gains were entered in the model as predictors along with the target language, which was added as a covariate, the model explained 5−6% of the variance in C-test gains and was statistically significant: F(3, 257) = 5.21, p = .002, R2 = .06, adjusted R2 = .05. 'Lessons' emerged as the only meaningful positive predictor (see Table 6).
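The relationship between the reported R2 and adjusted R2 values can be checked with a short sketch. It assumes the standard adjustment formula and infers each model's sample size from the F-test degrees of freedom (n = residual df + number of predictors + 1); the function names are illustrative, and because the inputs are the rounded R2 values from the text, the results are approximate.

```python
def n_from_f_df(df_model, df_residual):
    """Sample size implied by the F-test degrees of freedom (model with intercept)."""
    return df_residual + df_model + 1


def adjusted_r2(r2, n, n_predictors):
    """Standard adjusted R-squared: penalizes R2 for the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# EIT model: F(4, 238), R2 = .06 (rounded value from the text).
eit_n = n_from_f_df(4, 238)
eit_adj = adjusted_r2(0.06, eit_n, 4)

# C-test model: F(3, 257), R2 = .06 (rounded value from the text).
ctest_n = n_from_f_df(3, 257)
ctest_adj = adjusted_r2(0.06, ctest_n, 3)

print(eit_n, round(eit_adj, 2))
print(ctest_n, round(ctest_adj, 2))
```

The two models share the same rounded R2, yet the EIT model's adjusted value is lower because it has one more predictor and a smaller sample, which accounts for the "4−6%" versus "5−6%" ranges of explained variance.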

RQ3: To what extent are L2 grit and motivation associated with the frequency, duration, and intensity with which learners use Duolingo?
As shown in Table 7, individual difference variables of L2 grit and motivation measured at the pretest were weakly to moderately correlated with learners' frequency, duration, and intensity of the Duolingo app usage. Although the magnitude of effect sizes was small, meaningful positive relationships were observed between the two frequency variables (i.e., "Logins" and "Sessions") and L2 grit perseverance of effort as well as L2 motivation. Additionally, duration (i.e., "Minutes") and content-related intensity (i.e., "Lessons" and "Stories") were positively correlated with the two subcomponents of L2 grit as well as L2 motivation.

Discussion
The current study sought to examine the predictive power of two sets of variables on L2 gains made in app-based language learning via Duolingo.Specifically, we were interested in better understanding L2 development as a function of both (a) learners' app-based exposure/behavior (e.g., instructional frequency, duration) as well as (b) learners' L2 grit and motivation.On a broad, theoretical level, these sets of variables represent the two main types of factors (learner-internal and learner-external) known to influence L2 learning (Gass et al., 2020).On a practical level, the results have the potential to inform the instructional design of Duolingo's curriculum and to provide implications for in-app experience that increase learner efficiency.
The study is unique in at least two respects.First, to our knowledge, this is the only study to consider spaced vs. massed learning effects in the context of mobile-based language learning.Moreover, we have done so by means of a 'natural experiment' thereby greatly increasing the study's ecological validity.Second, although a growing body of evidence has begun to accumulate on the role of grit in L2 development (e.g., Teimouri et al., 2021), no study to date has done so with mobile language learners.It is also the first study to employ a longitudinal design to examine the power of grit in predicting gains over time.
One challenge to these goals, which we want to be upfront about, were the relatively modest gains observed on both the written and oral proficiency tests (i.e., C-test and EIT) in both languages.These gains appear in conflict with previous findings on the effectiveness of Duolingo (e.g., Jiang et al., 2021).However, there are several alternate explanations.For example, unlike other standardized proficiency tests (e.g., ACTFL's Oral Proficiency Interview), the dependent measures in the present study were not developed with lower proficiency levels in mind and may have been too difficult, as noted to us by several participants.Another explanation for the modest gains observed may be a lack of effort on pre-and post-assessments on the part of learners.Finally, we need to account for the user autonomy and the amount of the course content covered by the participants after six months of learning.A follow-up analysis of participants' maximum course tree depth (i.e., the furthest row of each section of the courses the participants were at) within seven days of completing the posttest suggested that our participants did not cover enough new content over six months of learning on Duolingo.The results revealed that the majority of our learners remained at the beginning level at the posttest based on the amount of material they covered.In light of this finding, one suggestion for efficient use of Duolingo is moving forward and learning new content.Regardless of the source, the lack of target language gains that were observed over time (i.e., our main dependent variable) imposed a limitation on the study's findings.In short, less gains necessarily means less for the predictor or independent variables to explain.
As stated above, one of our main interests in this study was to examine language gains made in relation to learners' in-app exposure (frequency, duration, intensity) and associated behaviors (e.g., content-related choices). Somewhat surprisingly, total minutes of exposure during the period of study correlated with gains only when gains were measured with the C-test. Rather, the more consistent and dependable (i.e., across modes and dependent measures) correlates of gains were the more frequency- and curriculum-oriented measures such as the number of lessons, reviews, and logins.
This finding carries several important ramifications and interpretations. First, these results generally confirm the lack of a relationship between hours of exposure on Duolingo and language gains observed in Jiang et al. (2021). Such a finding might be perceived as counter-intuitive or even disappointing in that greater time spent by learners does not necessarily yield greater gains. However, we view these findings more optimistically in that they indicate that gains can be made even with shorter, more frequent and purposeful in-app engagement. Moreover, the lack of a relationship between gains and time spent using the app is even more noteworthy given the relative homogeneity of logins across the participants. In other words, although the participants were free to log in as frequently as they liked as long as they did so at least twice per week during the 26-week period of study, most logged in on a daily or almost daily basis (>6 logins/week), thus providing a kind of built-in control for the role of frequency of exposure. This finding also aligns well with Kim and Webb's recent (2022) meta-analysis showing a marked advantage for more frequent exposure to the target language as opposed to longer (in minutes, hours) periods of exposure. The implication here for learners is fairly straightforward: Log in to the app frequently with the goal of completing entire lessons and reviews, even if those sessions do not last for long periods of time.
Research questions 3 and 4 both involved the two learner individual differences variables of L2 grit and motivation. RQ3 was concerned with the association between these two variables and the learners' in-app behaviors and exposure. It is natural to expect that learners who possess higher levels of (L2) grit and/or motivation would engage more often and/or more thoroughly with the Duolingo curriculum. This supposition was indeed the case at least for some of the individual differences and for some of the learner behaviors. Not surprisingly, motivation exhibited the strongest associations with in-app exposure. However, we also observed meaningful and statistically significant correlations between L2 perseverance of effort and several in-app behaviors.
The findings for RQ3 are noteworthy on multiple levels. First of all, we have to understand that, on a theoretical level, grit and motivation do not in and of themselves induce greater learning. Rather, as this study shows, these qualities are associated with the types of activities that lead to learning such as seeking out target language interlocutors, spending time studying or reading in the target language, or (most pertinent to the present study) engaging with target language instruction via a language learning app such as Duolingo. Thus, the findings of the present study provide one of the first pieces of evidence of the predictive validity of the L2 grit scale for language learning behaviors. From an educational standpoint, these results demonstrate that greater motivation and grit may lead to higher frequency of the types of activities shown in response to RQ2 to be associated with language gains (see Figure 2). Duolingo may be interested, therefore, in seeking to foster learner motivation and grit as a means to enhance linguistic development, if only indirectly. RQ4 addressed the same relationship modeled in Figure 2 but without the mediating effect of in-app exposure. The findings for this relationship were modest but provide additional evidence of the predictive validity of L2 grit in the context of app-based language learning.

Conclusion
The findings of this study carry relevance and potential benefits on multiple levels. First, this study allowed us to gain a better understanding of the role of technology in instructed second-language acquisition. This is critical as technological advances have the potential to make language learning not only "a lifelong (spanning one's lifetime) but also a lifewide (not confined to a particular location, such as a school) activity" (Reinders & Stockwell, 2017, p. 372). Second, the present investigation contributed to the growing line of evidence of Duolingo's effectiveness by assessing L2 learners' proficiency in both written and oral modes using high validity and high practicality measures. The study also shed further light on our understanding of the individual and combined effects of frequency, duration, and intensity of instruction on L2 development and, critically, on the learner-internal factors that lead to choices to engage with the app. Finally, on a practical level, the results of the present study may also inform Duolingo lesson design and recommendations provided to learners with respect to the frequency, duration, and intensity of app usage.

This form asks for background information about you. Although we ask for your name and email, we do so only because we want to associate your answers to this questionnaire with your other data. Your answers will be treated confidentially. Only the researchers will have access to the information you provide.

Introduction
This language test will ask you to listen to several short audio files in Spanish and make a recording in response. (Please be patient as recordings may take time to load.) Note that some of the items might be quite challenging. Please try to complete each of them to the best of your ability. <Click here to start>

Instructions-1
You are going to hear several sentences in English (6 in total). After each sentence, there will be a short pause, followed by a tone sound {TONE}. Your task is to try to repeat exactly what you hear. You will have only one attempt to do so. You will be given sufficient time after the tone to repeat the sentence. Repeat as much as you can. Remember, don't start repeating the sentence until after you hear the tone sound {TONE}. Now let's begin.

Practice stimuli
1. We drove to the park.
2. I'll call her tomorrow night.
3. You can buy meat at the butcher shop.
4. My brother just bought a brand new computer.
5. Sometimes they take their dog for a walk in the park.
6. We're going to play volleyball at the gym that I told you about.

Instructions-2
Now, you are going to hear a number of sentences in Spanish (36 in total). Once again, after each sentence, there will be a short pause, followed by a tone sound {TONE}. Your task is to try to repeat exactly what you hear in Spanish. You will have only one attempt to do so. You will be given sufficient time after the tone to repeat the sentence. Repeat as much as you can. Remember, don't start repeating the sentence until after you hear the tone sound {TONE}. Now let's begin.

Main stimuli
1. Quiero cortarme el pelo. (7 syllables)
2. El libro está en la mesa. (7 syllables)

FRENCH ELICITED IMITATION TEST (Gaillard & Tremblay, 2016)

Introduction
This language test will ask you to listen to several short audio files in French and make a recording in response. (Please be patient as recordings may take time to load.) Note that some of the items might be quite challenging. Please try to complete each of them to the best of your ability. <Click here to start>

Instructions-1
You are going to hear several sentences in English (6 in total). After each sentence, there will be a short pause, followed by a tone sound {TONE}. Your task is to try to repeat exactly what you hear. You will have only one attempt to do so. You will be given sufficient time after the tone to repeat the sentence. Repeat as much as you can. Remember, don't start repeating the sentence until after you hear the tone sound {TONE}. Now let's begin.

<I'm ready>
Practice stimuli
1. We drove to the park.
2. I'll call her tomorrow night.
3. You can buy meat at the butcher shop.
4. My brother just bought a brand new computer.
5. Sometimes they take their dog for a walk in the park.
6. We're going to play volleyball at the gym that I told you about.

Instructions-2
Now, you are going to hear a number of sentences in French (50 in total). Once again, after each sentence, there will be a short pause, followed by a tone sound {TONE}. Your task is to try to repeat exactly what you hear in French. You will have only one attempt to do so. You will be given sufficient time after the tone to repeat the sentence. Repeat as much as you can. Remember, don't start repeating the sentence until after you hear the tone sound {TONE}. Now let's begin.

Notes on EIT administration and scoring
For the sake of comparability of the findings and the testing procedures, several modifications were made:
1. The original French stimuli were amplified in Audacity.
2. Sample items (practice stimuli) for both tests were taken from the Spanish version of the EIT and provided in English.
3. Both Spanish and French stimuli were presented in increasing length (not randomized as was the case with the original French stimuli).
4. An introduction for both French and Spanish EIT was added to explain the nature of the test.
5. The original instructions had to be slightly modified (due to the self-paced nature of both tests in the present study).
6. All instructions were recorded by a female speaker with a standard American dialect. After recording in Audacity, the peak amplitude was normalized to −1.0 dB (to help with the volume); a noise reduction for extraneous background noises was performed; and the beep sounds were added where indicated in the script.
7. For both French and Spanish EITs, a tone sound (.25 s) from the original French EIT was used. A 3-second pause was inserted between the auditory sentence and the tone.

Introduction
In this language test, you will be presented with short Spanish texts in which parts of words are deleted. The deletions correspond to the final portions of the words. Please do your best to fill in the missing part of the word.
Complete the words as accurately as possible, paying attention to the spelling and grammatical features like accents or agreement in gender and number.
You may put a zero in the blank if you do not know the answer and do not want to guess. There will be a total of 5 texts, each taking about 3 to 5 minutes to complete. Please try to finish each text in under 6 minutes.

Main Part
Below you will be presented with short Spanish texts in which parts of words are deleted. The deletions correspond to the final portions of the words. Please do your best to fill in the missing part of the word. Complete the words as accurately as possible, paying attention to the spelling and grammatical features like accents or agreement in gender and number. You may put a zero in the blank if you do not know the answer and do not want to guess. There will be a total of 5 texts, each taking about 3 to 5 minutes to complete. Please try to finish each text in under 6 minutes. Spanish accents (if you do not have a Spanish keyboard): á, é, í, ó, ú, ñ, ü.

Example:
On Sunday, the weather was beautiful, and we went for a walk. On Monday, it was raining, and we stay at home.

Introduction
In this language test, you will be presented with short French texts in which parts of words are deleted. The deletions correspond to the final portions of the words. Please do your best to fill in the missing part of the word.
Complete the words as accurately as possible, paying attention to the spelling and grammatical features like accents or agreement in gender and number. Words with hyphens or apostrophes like "celui-ci" or "l'ami" count as one word.
You may put a zero in the blank if you do not know the answer and do not want to guess. There will be a total of 5 texts, each taking about 3 to 5 minutes to complete. Please try to finish each text in under 6 minutes.

Main Part
Below you will be presented with short French texts in which parts of words are deleted. The deletions correspond to the final portions of the words. Please do your best to fill in the missing part of the word. Complete the words as accurately as possible, paying attention to the spelling and grammatical features like accents or agreement in gender and number. Words with hyphens or apostrophes like "celui-ci" or "l'ami" count as one word. You may put a zero in the blank if you do not know the answer and do not want to guess. There will be a total of 5 texts, each taking about 3 to 5 minutes to complete. Please try to finish each text in under 6 minutes.

Example:
On Sunday, the weather was beautiful, and we went for a walk. On Monday, it was raining, and we stay at home.

Note. *The answer key was slightly different from the test version and had "comme toujours" instead. **The answer key version ("la cuisine. Elle fait vraiment partie de ma vie") was preferred to the test version ("la d_______ cuisine. El_______ fait vrai_______ partie de ma vie").
Six months later (i.e., in February through May 2022), each participant who had met the eligibility requirement was contacted again and invited to retake the two individual difference scales as well as the two proficiency tests. Participants who met the selection criteria and the eligibility requirements received a $100 Amazon gift card. The selection criteria included: (a) being a Duolingo user studying either Spanish or French and (b) being a native speaker of English residing in the US. The requirements included: (a) completion of a survey and two language tests (at the time of the pretest and posttest 6 months later) and (b) a minimum of 52 logins on the Duolingo app (2 per week × 26 weeks) to ensure minimally sufficient engagement with the target language and Duolingo content.
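The login requirement above is simple arithmetic (2 logins per week over 26 weeks gives 52), and it can be expressed as a small eligibility check. The following is only an illustrative sketch: the function names and the boolean inputs are hypothetical, and it assumes that only the total number of logins was checked, which the text does not state explicitly.

```python
def required_logins(logins_per_week: int = 2, weeks: int = 26) -> int:
    """Minimum total logins implied by the stated requirement (2 per week x 26 weeks)."""
    return logins_per_week * weeks


def met_requirements(total_logins: int,
                     completed_pretest: bool,
                     completed_posttest: bool) -> bool:
    """Hypothetical check: both testing rounds completed and enough logins overall.

    Assumption: only the total is compared against the threshold; per-week
    compliance is not examined here.
    """
    return (completed_pretest
            and completed_posttest
            and total_logins >= required_logins())


if __name__ == "__main__":
    print(required_logins())                  # 2 * 26
    print(met_requirements(60, True, True))   # above the threshold, both rounds done
```

A participant with 51 logins would fall just short under this reading, regardless of how those logins were distributed across the 26 weeks.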

Figure 1 Proficiency Test Gains in the Written vs. Oral Mode and by Language

Figure 2 Hypothesized Model of the Effect of Individual Differences on App-Based Learning, Mediated by In-App Exposure and Engagement

Text 5:
L'humour belge
L'humour est belge, assurément ! Voilà une affirmation qui ne manquera pas de faire s'esclaffer la France entière. Mieux enc_______, elle se_______ l'objet d_______ quolibets div_______. C'est qu'_______ adorent ç_______, nos ch_______ voisins, ridic_______, critiquer, cho_______. À Paris rè_______ la gra_______ tyrannie d_______ persiflage. S_______ montrer d_______ et méc_______ est dev_______ gage d_______ réussite po_______ un humo_______. Goût d_______ scandale. Esca_______ de l_______ méchanceté. No_______ les Bel_______, avons l'hu_______ plus tendre, plus bon enfant. Rire ne signifie nullement gouailler, railler. Humour gentil, candide, humour belge.

The final message at the end of the pretest
This is the end of Part I of the study. As a reminder, please do your best to use Duolingo at least twice a week for a total of 26 weeks. Six months later, you will be invited to participate in Part II of the study. Thank you!

The final message at the end of the posttest
This is the end of Part II of the study. If you are eligible for compensation, you will be contacted by researchers within a month. Thank you!

Notes on C-test administration and scoring

Table 1
Participant Characteristics

[…]sky, et al.'s (2021) background questionnaire was used. Additionally, we collected data using scales for measuring two individual difference variables: L2 grit (adapted from Teimouri et al., 2022) and L2 motivated learning behavior (adapted from Papi et al., 2019). These variables, individually and in tandem, allowed the study to examine these two individual differences as additional predictors of both in-app engagement and gains in learning.

Teimouri et al. (2022) validated their instrument with a sample of 191 learners of English in a foreign language context (Iran) and reported Cronbach's α reliability of .80 for the full L2 grit scale, .86 for the perseverance of effort subscale (PE, five positively keyed items), and .66 for the consistency of interest subscale (CI, four negatively keyed items). Papi et al.'s (2019) L2 motivated learning behavior questionnaire (five positively keyed items) was first used with a sample of 257 learners of English in a second language context (the US) and had internal consistency reliability of .86 as measured by Cronbach's α. Both scales were employed after implementing minor adjustments. Specifically, the word English in the original scales was replaced with French or Spanish in the present study in order to tailor the item wording to participants' target languages. Additionally, for the sake of consistency, Papi et al.'s (2019) original 5-point Likert-type scale (endpoints: 1 = never true of me; 5 = always true of me) was replaced with Teimouri et al.'s (2022) 5-point fully verbal and numerical Likert-type scale (endpoints: 1 = not like me at all; 5 = very much like me). An example item for L2 grit: "I am a diligent French/Spanish learner"; an example item for L2 motivated learning behavior: "I work hard at studying French/Spanish."

3. L2 proficiency. Two different types of language tests were used to measure participants' L2 proficiency: a C-test (Spanish: Riggs & Maimone, 2018; French: Counsell, 2018) and an elicited imitation test (EIT; Spanish: Solon et al., 2019; French: Gaillard & Tremblay, 2016). These instruments were chosen based on a number of considerations. First, Spanish and French C-tests and EITs have undergone rigorous development and possess strong validity arguments. To illustrate, Riggs and Maimone (2018) reported a high correlation between Spanish C-test scores and (a) self-assessed proficiency (r = .81, p < .001) as well as (b) class level (r = .73, p < .001). Counsell (2018) also reported sizeable and positive correlations between French C-test scores and (a) self-assessed […] from .78 to .97 for the 30-item EIT and from .84 to .97 for the 36-item EIT for L2 learners at different proficiency levels.

Moreover, the validity of these two groups of tests is supported not only by primary studies but also by two recent meta-analyses. Synthesizing results across 239 studies, McKay (2019) found an almost perfect correlation between C-tests and tests of general language proficiency (r = .94). The evidence for EITs is likewise very strong.
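For readers less familiar with the internal-consistency index cited for the grit and motivation scales, the sketch below shows how Cronbach's α is conventionally computed from item responses, and how negatively keyed items (such as those of the consistency of interest subscale) would be reverse-scored on a 1 to 5 scale beforehand. It is an illustration only: the function names and the toy responses are hypothetical, and it is not the procedure or data of the cited validation studies.

```python
from statistics import variance


def reverse_score(responses, scale_min=1, scale_max=5):
    """Reverse-score negatively keyed Likert responses (e.g., 1 <-> 5, 2 <-> 4)."""
    return [scale_min + scale_max - r for r in responses]


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    `items` is a list of item columns; each column holds one response per
    participant, in the same participant order. Negatively keyed items are
    assumed to have been reverse-scored already.
    """
    k = len(items)
    n = len(items[0])
    # Total scale score for each participant.
    totals = [sum(col[i] for col in items) for i in range(n)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


if __name__ == "__main__":
    # Hypothetical responses from six participants to three items.
    item_a = [4, 5, 3, 2, 4, 5]
    item_b = [4, 4, 3, 2, 5, 5]
    item_c_negatively_keyed = [2, 1, 3, 4, 2, 1]
    scale = [item_a, item_b, reverse_score(item_c_negatively_keyed)]
    print(round(cronbach_alpha(scale), 2))
```

Sample variances (n − 1 in the denominator) are used for both the items and the total score; using population variances throughout would give the same ratio.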
For the C-test gains in French vs. Spanish, the dependent variable was, again, approximately normally distributed for each language group, and Levene's test was not statistically significant (i.e., equal variances assumed). However, univariate outlier analysis revealed two extreme outliers (|z| > 3.29) on the EIT gains variable and six additional extreme outliers on the C-test gains variable. A close inspection of these scores did not indicate any red flags in participants' performance. Therefore, the analyses were conducted twice, with and without outliers, to allow for comparisons.

RQ2a. To meet the assumptions for Pearson correlations, all extreme outliers (|z| > 3.29) were removed from the variables of interest (i.e., 6 from the Login and C-test gains variables, 4 from the Session, Minutes, Level review, Skill practice, and Tests variables, 2 from the Lessons and EIT gains variables, and 1 from the Stories variable) as they were found to affect the correlation estimates. The assumption of linearity was satisfied as indicated by the matrix scatterplot. Q-Q plots and histograms suggested minor deviations from normality. Therefore, bootstrapped Pearson correlations (based on 1,000 samples) with bias-corrected and accelerated confidence intervals were performed (final N = 233).

RQ2b. Prior to performing multiple regression analyses, all extreme univariate outliers were removed from the variables of interest. The strongest predictors were chosen based on correlational analyses (see RQ2a). However, in the model predicting EIT gains, the Sessions and Logins variables were highly correlated. To ensure the absence of multicollinearity, the Sessions variable was removed from the model because it had a weaker correlation with EIT gains than the Logins variable. Following the removal of 10 multivariate outliers on three continuous predictor variables in the model predicting EIT gains and the removal of 19 multivariate outliers on two continuous predictor variables in the model predicting C-test gains, the assumptions of linearity; absence of multicollinearity; absence of autocorrelation; and normality, linearity, and homoscedasticity of residuals were met.

RQ3. The assumptions for Pearson correlations between Duolingo app usage variables (i.e., frequency, duration, and intensity) and individual differences (i.e., L2 grit and motivation) were satisfied after removing extreme outliers (|z| > 3.29) from the variables of interest. The inspection of the matrix scatterplot supported the assumption of linearity. To account for occasional deviations from normality, which were indicated by Q-Q plots and histograms, bootstrapped Pearson correlations (based on 1,000 samples) with bias-corrected and accelerated confidence intervals were performed (final N = 260).

RQ4a. Concerning the assumptions for Pearson correlations, first, eight extreme outliers (|z| > 3.29) were removed from the gains variables (i.e., two from the EIT and six from the C-test gains), and four outliers were eliminated from L2 grit consistency of interest because they were found to affect the correlation values. A matrix scatterplot did not indicate any violations of linearity. To account for minor deviations from normality (as suggested by Q-Q plots and histograms), bootstrapped Pearson correlations (based on 1,000 samples) with bias-corrected and accelerated confidence intervals were performed (final N = 248).
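The two recurring steps in these assumption checks, removing extreme univariate outliers (|z| > 3.29) and bootstrapping a Pearson correlation over 1,000 resamples, can be sketched as follows. This is a simplified illustration rather than the analysis reported here: it computes a percentile bootstrap interval instead of the bias-corrected and accelerated (BCa) interval used in the study, it removes cases pairwise, and all function names, the seed, and the example data are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def drop_extreme_outliers(pairs, z_cut=3.29):
    """Keep (x, y) pairs whose x and y both lie within z_cut SDs of their means."""
    xs, ys = zip(*pairs)
    mx, sx = mean(xs), stdev(xs)
    my, sy = mean(ys), stdev(ys)
    return [(x, y) for x, y in pairs
            if abs((x - mx) / sx) <= z_cut and abs((y - my) / sy) <= z_cut]


def pearson_r(pairs):
    """Pearson correlation; returns None if either variable is constant."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(pairs, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for Pearson r (a simpler stand-in for BCa)."""
    rng = random.Random(seed)
    n = len(pairs)
    estimates = []
    for _ in range(n_boot):
        # Resample participants (pairs) with replacement.
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        r = pearson_r(sample)
        if r is not None:  # skip degenerate resamples with no variance
            estimates.append(r)
    estimates.sort()
    lo_idx = int((1 - level) / 2 * len(estimates))
    hi_idx = int((1 + level) / 2 * len(estimates)) - 1
    return estimates[lo_idx], estimates[hi_idx]


if __name__ == "__main__":
    # Hypothetical (logins, gain) pairs plus one extreme login count.
    data = [(i, 2 * i + (i % 3)) for i in range(30)] + [(1000, 5)]
    cleaned = drop_extreme_outliers(data)
    print(len(data), len(cleaned))
    print(pearson_r(cleaned), bootstrap_ci(cleaned))
```

The BCa interval additionally adjusts the percentile cut-offs for bias and skewness in the bootstrap distribution, so its limits will generally differ somewhat from the percentile limits computed here.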

Table 3
Descriptive Statistics for Proficiency Test Scores (N = 287)

Table 5
Regression Analysis Summary for Variables Predicting EIT Gains
Thank you! Your responses have been recorded.