Volume 5, Issue 2
  • ISSN: 2215-1931
  • E-ISSN: 2215-194X



Researchers have increasingly turned to Amazon Mechanical Turk (AMT) to crowdsource speech data, predominantly in English. Although AMT and similar platforms are well positioned to advance the state of the art in L2 research, it remains unclear whether crowdsourced L2 speech ratings are reliable, particularly for languages other than English. The present study describes the development and deployment of an AMT task to crowdsource comprehensibility, fluency, and accentedness ratings of L2 Spanish speech samples. Fifty-four AMT workers, all native Spanish speakers from 11 countries, provided the ratings. Intraclass correlation coefficients were used to estimate group-level interrater reliability, and Rasch analyses were undertaken to examine individual differences in rater severity and fit. Reliability was excellent for the comprehensibility and fluency ratings but slightly lower for accentedness, leading to recommendations for improving the task in future data collection.
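The group-level reliability analysis described above rests on intraclass correlation coefficients. As an illustration only (this is not the authors' analysis script, and the fully crossed speakers × raters design is an assumption), a two-way random-effects, average-measures ICC — ICC(2,k) in Shrout and Fleiss's (1979) taxonomy — can be computed from a matrix of ratings as follows:

```python
import numpy as np

def icc_2k(ratings):
    """ICC(2,k): two-way random-effects, average-measures intraclass
    correlation (Shrout & Fleiss, 1979). `ratings` is an n x k array:
    n targets (speech samples) rated by the same k raters."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per speech sample
    col_means = ratings.mean(axis=0)   # per rater
    # ANOVA decomposition of total variability
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)                # between-targets mean square
    ms_c = ss_cols / (k - 1)                # between-raters mean square
    ms_e = ss_err / ((n - 1) * (k - 1))     # residual mean square
    return (ms_r - ms_e) / (ms_r + (ms_c - ms_e) / n)

# e.g. the Shrout & Fleiss (1979) demo data (6 targets x 4 judges)
# gives ICC(2,4) ≈ .62
data = np.array([[9, 2, 5, 8],
                 [6, 1, 3, 2],
                 [8, 4, 6, 8],
                 [7, 1, 2, 6],
                 [10, 5, 6, 9],
                 [6, 2, 4, 7]], dtype=float)
print(round(icc_2k(data), 2))
```

The average-measures form is the relevant one when worker ratings are pooled into a mean score per sample, as is typical in crowdsourced speech rating; it estimates the reliability of that averaged crowd judgment rather than of any single rater.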






  • Article Type: Research Article
Keyword(s): many-facet Rasch measurement; reliability; research methods; Spanish; speech ratings