Exploring automatic speech recognition for corrective and confirmative pronunciation feedback

Abstract

Given that second language pronunciation errors are typically variable, learners would benefit from feedback that both flags errors (corrective feedback) and confirms correct pronunciation (confirmative feedback). We investigated the transcription accuracy of Google Translate (GT) automatic speech recognition (ASR) to determine its capacity to provide such feedback, based on recordings of Quebec francophones producing correctly and incorrectly realized English th-initial, h-initial and vowel-initial items in predictable and unpredictable sentence contexts. Recordings from male and female speakers were used to check for possible gender bias. In predictable contexts, transcription accuracy rates were higher for correct than for incorrect pronunciations; in unpredictable contexts, rates for both correct and incorrect pronunciations fell midway between the two. GT ASR is thus better at providing confirmative feedback in predictable contexts but corrective feedback in unpredictable contexts. Regardless of context, accuracy was considerably higher on errors leading to real-word output than on errors leading to nonword output. Contrary to the anticipated pattern, female speakers were transcribed with higher accuracy than male speakers.

/content/journals/10.1075/jslp.24035.joh
2025-04-01
2025-04-25