Volume 11, Issue 3 · ISSN 2215-1931 · E-ISSN 2215-194X

Abstract

This study examines the validity of WER as a proxy for pronunciation quality in EFL contexts. Human ratings of comprehensibility and accentedness were compared with WER and automated pronunciation scores from six ASR systems — Kaldi, wav2vec 2.0, HuBERT, Whisper (Base and Large-v3), and Microsoft Azure — using 190 read-aloud recordings by Korean elementary learners. With respect to pronunciation scoring, Azure’s phoneme-level accuracy scores demonstrated moderate correlations with human judgments, while Kaldi’s GOP scores showed no meaningful association. Analysis of WER revealed a critical trade-off between ASR accuracy and perceptual sensitivity: high-performing systems such as Whisper Large-v3 and Azure produced near-zero WERs but correlated only weakly with human ratings. In contrast, mid-performing systems such as Whisper Base and HuBERT showed stronger correlations, indicating that moderate WER values may better reflect pronunciation variation. These results underscore the limitations of WER in advanced ASR systems and the need for perceptually grounded, interpretable metrics.
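To make the metric concrete, WER is the word-level Levenshtein edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is illustrative only and is not the study's implementation; the function name and toy sentences are invented for the example.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N via dynamic-programming alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat down", "the cat sat down"))  # 0.0
print(wer("the cat sat down", "the cat sad down"))  # 0.25
```

Note how the metric saturates: once a robust recognizer normalizes away most learner variation, scores cluster near zero and carry little information about perceived pronunciation, which is the trade-off the abstract describes.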

/content/journals/10.1075/jslp.25012.won
2025-07-07
References

  1. Alphacephei
    (2025) Vosk speech recognition toolkit. https://alphacephei.com/vosk/
  2. Baevski, A., Zhou, H., Mohamed, A., & Auli, M.
    (2020) wav2vec 2.0: A framework for self-supervised learning of speech representations. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
  3. Baker, A.
    (2014) Exploring teachers’ knowledge of second language pronunciation techniques: Teacher cognitions, observed classroom practices, and student perceptions. TESOL Quarterly, 48(1), 136–163. https://doi.org/10.1002/tesq.99
  4. Cámara-Arenas, E., Tejedor-García, C., Tomas-Vázquez, C. J., & Escudero-Mancebo, D.
    (2023) Automatic pronunciation assessment vs. automatic speech recognition: A study of conflicting conditions for L2-English. Language Learning & Technology, 27(1), 1–19. https://hdl.handle.net/10125/73512
  5. Crowther, D., Trofimovich, P., Saito, K., & Isaacs, T.
    (2018) Linguistic dimensions of L2 accentedness and comprehensibility vary across speaking tasks. Studies in Second Language Acquisition, 40(2), 443–457. https://doi.org/10.1017/S027226311700016X
  6. Dai, Y., & Wu, Z.
    (2023) Mobile-assisted pronunciation learning with feedback from peers and/or automatic speech recognition: A mixed-methods study. Computer Assisted Language Learning, 36(5–6), 861–884. https://doi.org/10.1080/09588221.2021.1952272
  7. Deadman, J.
    (2023) Simulating realistic multiparty speech data: For the development of distant microphone ASR systems [Doctoral dissertation, University of Sheffield]. https://etheses.whiterose.ac.uk/id/eprint/33420/
  8. Derwing, T. M., & Munro, M. J.
    (1997) Accent, intelligibility, and comprehensibility: Evidence from four L1s. Studies in Second Language Acquisition, 19(1), 1–16. https://doi.org/10.1017/S0272263197001010
  9. Derwing, T. M., & Munro, M. J.
    (2005) Second language accent and pronunciation teaching: A research-based approach. TESOL Quarterly, 39(3), 379–397. https://doi.org/10.2307/3588486
  10. Derwing, T. M., & Munro, M. J.
    (2015) Pronunciation fundamentals: Evidence-based perspectives for L2 teaching and research. John Benjamins Publishing Company. https://doi.org/10.1075/lllt.42
  11. Dizon, G.
    (2020) Evaluating intelligent personal assistants for L2 listening and speaking development. Language Learning & Technology, 24(1), 16–26. https://hdl.handle.net/10125/44705
  12. Eckes, T.
    (2015) Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Peter Lang GmbH.
  13. El Kheir, Y., Ali, A., & Chowdhury, S. A.
    (2023) Automatic pronunciation assessment: A review. Findings of the Association for Computational Linguistics: EMNLP 2023, 8304–8324. https://doi.org/10.18653/v1/2023.findings-emnlp.557
  14. Farrús, M.
    (2023) Automatic speech recognition in L2 learning: A review based on PRISMA methodology. Languages, 8(4), 242. https://doi.org/10.3390/languages8040242
  15. Ferraro, A., Galli, A., La Gatta, V., & Postiglione, M.
    (2023) Benchmarking open source and paid services for speech to text: An analysis of quality and input variety. Frontiers in Big Data, 6, 1210559. https://doi.org/10.3389/fdata.2023.1210559
  16. Geng, H., Saito, D., & Minematsu, N.
    (2024) Simulating native speaker shadowing for nonnative speech assessment with latent speech representations. arXiv. https://doi.org/10.48550/arXiv.2409.11742
  17. Gong, Y., Chen, Z., Chu, I.-H., Chang, P., & Glass, J.
    (2022) Transformer-based multi-aspect multi-granularity non-native English speaker pronunciation assessment. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7262–7266. https://doi.org/10.1109/ICASSP43922.2022.9746743
  18. Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., & Kingsbury, B.
    (2012) Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97. https://doi.org/10.1109/MSP.2012.2205597
  19. Hirai, A., & Kovalyova, A.
    (2023) Using speech-to-text applications for assessing English language learners’ pronunciation: A comparison with human raters. In M.-d.-M. Suárez & W. M. El-Henawy (Eds.), Optimizing online English language learning and teaching (pp. 337–355). Springer International Publishing. https://doi.org/10.1007/978-3-031-27825-9_17
  20. Hosseini-Kivanani, N., Gretter, R., Matassoni, M., & Falavigna, G. D.
    (2021) Experiments of ASR-based mispronunciation detection for children and adult English learners. arXiv. https://doi.org/10.48550/arXiv.2104.05980
  21. Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A.
    (2021) HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460. https://doi.org/10.1109/TASLP.2021.3122291
  22. Inceoglu, S., Chen, W.-H., & Lim, H.
    (2023) Assessment of L2 intelligibility: Comparing L1 listeners and automatic speech recognition. ReCALL, 35(1), 89–104. https://doi.org/10.1017/S0958344022000192
  23. Isbell, D. R., & Lee, J.
    (2022) Self-assessment of comprehensibility and accentedness in second language Korean. Language Learning, 72(3), 806–852. https://doi.org/10.1111/lang.12497
  24. Jelinek, F.
    (1976) Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64(4), 532–556. https://doi.org/10.1109/PROC.1976.10159
  25. Jenkins, J.
    (2000) The phonology of English as an international language: New models, new norms, new goals. Oxford University Press.
  26. Jurafsky, D., & Martin, J. H.
    (2024) Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition with language models (3rd ed.). https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
  27. Kang, O., & Rubin, D. L.
    (2009) Reverse linguistic stereotyping: Measuring the effect of listener expectations on speech evaluation. Journal of Language and Social Psychology, 28(4), 441–456. https://doi.org/10.1177/0261927X09341950
  28. Karhila, R., Smolander, A.-R., Ylinen, S., & Kurimo, M.
    (2019) Transparent pronunciation scoring using articulatorily weighted phoneme edit distance. Proceedings of INTERSPEECH 2019, 1866–1870. https://doi.org/10.21437/Interspeech.2019-1785
  29. Khabbazbashi, N., Xu, J., & Galaczi, E. D.
    (2021) Opening the black box: Exploring automated speaking evaluation. In B. Lanteigne, C. Coombe, & J. D. Brown (Eds.), Challenges in language testing around the world (pp. 333–343). Springer. https://doi.org/10.1007/978-981-33-4232-3_25
  30. Kheddar, H., Hemis, M., & Himeur, Y.
    (2024) Automatic speech recognition using advanced deep learning approaches: A survey. Information Fusion, 109, 102422. https://doi.org/10.1016/j.inffus.2024.102422
  31. Kim, M.
    (2023) Digital enhancement of pronunciation assessment: Automated speech recognition and human raters. Phonetics and Speech Sciences, 15(2), 13–20. https://doi.org/10.13064/KSSS.2023.15.2.013
  32. Kim, S.-E., Chernyak, B. R., Seleznova, O., Keshet, J., Goldrick, M., & Bradlow, A. R.
    (2024) Automatic recognition of second language speech-in-noise. JASA Express Letters, 4(2), 025204. https://doi.org/10.1121/10.0024877
  33. Knight, P.
    (2021) ‘Smart speaker, tell me about your acoustic sensor’. Physics World, 33(12), 25. https://doi.org/10.1088/2058-7058/33/12/27
  34. Koizumi, R., Okabe, Y., & Kashimada, Y.
    (2017) A multifaceted Rasch analysis of rater reliability of the speaking section of the GTEC CBT. ARELE: Annual Review of English Language Education in Japan, 28, 241–256. https://doi.org/10.20581/arele.28.0_241
  35. Kumalija, E., & Nakamoto, Y.
    (2022) Performance evaluation of automatic speech recognition systems on integrated noise-network distorted speech. Frontiers in Signal Processing, 2, 999457. https://doi.org/10.3389/frsip.2022.999457
  36. Kunal, G.
    (2023, August 24) Advancements in automatic speech recognition (ASR): Revolutionizing speech recognition technology. https://www.softobotics.com/blogs/advancements-in-automatic-speech-recognition-asr-revolutionizing-speech-recognition-technology/
  37. Levis, J.
    (2005) Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly, 39(3), 369–377. https://doi.org/10.2307/3588485
  38. Levis, J.
    (2020) Revisiting the intelligibility and nativeness principles. Journal of Second Language Pronunciation, 6(3), 310–328. https://doi.org/10.1075/jslp.20050.lev
  39. Liakin, D., Cardoso, W., & Liakina, N.
    (2017) Mobilizing instruction in a second-language context: Learners’ perceptions of two speech technologies. Languages, 2(3), 11. https://doi.org/10.3390/languages2030011
  40. Likhomanenko, T., Xu, Q., Pratap, V., Tomasello, P., Kahn, J., Avidov, G., Collobert, R., & Synnaeve, G.
    (2021) Rethinking evaluation in ASR: Are our models robust enough? Proceedings of INTERSPEECH 2021, 311–315. https://doi.org/10.21437/Interspeech.2021-1758
  41. Linacre, J. M.
    (2014) A user’s guide to FACETS (Version 3.80). https://www.winsteps.com/a/Facets-ManualPDF.zip
  42. Lindemann, S.
    (2002) Listening with an attitude: A model of native-speaker comprehension of non-native speakers in the United States. Language in Society, 31(3), 419–441. https://doi.org/10.1017/S0047404502020286
  43. Lounis, M., Dendani, B., & Bahi, H.
    (2024) Mispronunciation detection and diagnosis using deep neural networks: A systematic review. Multimedia Tools and Applications, 83, 62793–62827. https://doi.org/10.1007/s11042-023-17899-x
  44. Ma, M.
    (2023, February 14) Speech service update: Hierarchical Transformer for pronunciation assessment. https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/speech-service-update-hierarchical-transformer-for-pronunciation/ba-p/3740866
  45. McGuire, M.
    (2025) Automatic speech recognition for non-native English: Accuracy and disfluency handling. arXiv. https://doi.org/10.48550/arXiv.2503.06924
  46. Meeker, M.
    (2017, May 31) Internet trends 2017. https://www.bondcap.com/report/it17
  47. Mehrish, A., Majumder, N., Bharadwaj, R., Mihalcea, R., & Poria, S.
    (2023) A review of deep learning techniques for speech processing. Information Fusion, 99, 101869. https://doi.org/10.1016/j.inffus.2023.101869
  48. Microsoft
  49. Munro, M. J., & Derwing, T. M.
    (1995a) Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 45(1), 73–97. https://doi.org/10.1111/j.1467-1770.1995.tb00963.x
  50. Munro, M. J., & Derwing, T. M.
    (1995b) Processing time, accent, and comprehensibility in the perception of native and foreign-accented speech. Language and Speech, 38(3), 289–306. https://doi.org/10.1177/002383099503800305
  51. Munro, M. J., & Derwing, T. M.
    (2011) The foundations of accent and intelligibility in pronunciation research. Language Teaching, 44(3), 316–327. https://doi.org/10.1017/S0261444811000103
  52. NCH Software
    (2022) WavePad audio editor (Version 16.01) [Computer software]. https://www.nch.com.au/wavepad/index.html
  53. Neri, A., Cucchiarini, C., & Strik, H.
    (2008) The effectiveness of computer-based speech corrective feedback for improving segmental quality in L2 Dutch. ReCALL, 20(2), 225–243. https://doi.org/10.1017/S0958344008000724
  54. O’Shaughnessy, D.
    (2024) Trends and developments in automatic speech recognition research. Computer Speech & Language, 83, 101538. https://doi.org/10.1016/j.csl.2023.101538
  55. Ockey, G. J., Chukharev-Hudilainen, E., & Hirch, R. R.
    (2023) Assessing interactional competence: ICE versus a human partner. Language Assessment Quarterly, 20(4–5), 377–398. https://doi.org/10.1080/15434303.2023.2237486
  56. Ortega, M., Mora, J. C., & Mora-Plaza, I.
    (2022) L2 learners’ self-assessment of comprehensibility and accentedness: Over/under-estimation, effects of rating peers, and attention to speech features. Proceedings of the 12th Pronunciation in Second Language Learning and Teaching Conference. https://doi.org/10.31274/psllt.13354
  57. Panayotov, V., Chen, G., Povey, D., & Khudanpur, S.
    (2015) Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. https://doi.org/10.1109/ICASSP.2015.7178964
  58. Patman, C., & Chodroff, E.
    (2024) Speech recognition in adverse conditions by humans and machines. JASA Express Letters, 4(11), 115204. https://doi.org/10.1121/10.0032473
  59. Pieraccini, R.
    (2012) The voice in the machine: Building computers that understand speech. MIT Press. https://doi.org/10.7551/mitpress/9072.001.0001
  60. Povey, D.
    (2020) Librispeech ASR model. https://kaldi-asr.org/models/m13
  61. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., & Schwarz, P.
    (2011) The Kaldi speech recognition toolkit. Proceedings of ASRU 2011, IEEE Signal Processing Society.
  62. Rabiner, L. R.
    (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286. https://doi.org/10.1109/5.18626
  63. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I.
    (2023) Robust speech recognition via large-scale weak supervision. Proceedings of the 40th International Conference on Machine Learning, 28492–28518. https://proceedings.mlr.press/v202/radford23a.html
  64. Saito, K., Webb, S., Trofimovich, P., & Isaacs, T.
    (2016) Lexical correlates of comprehensibility versus accentedness in second language speech. Bilingualism: Language and Cognition, 19(3), 597–609. https://doi.org/10.1017/S1366728915000255
  65. Sun, W.
    (2023) The impact of automatic speech recognition technology on second language pronunciation and speaking skills of EFL learners: A mixed methods investigation. Frontiers in Psychology, 14, 1210187. https://doi.org/10.3389/fpsyg.2023.1210187
  66. Tergujeff, E.
    (2021) Second language comprehensibility and accentedness across oral proficiency levels: A comparison of two L1s. System, 100, 102567. https://doi.org/10.1016/j.system.2021.102567
  67. Thi-Nhu Ngo, T., Hao-Jan Chen, H., & Kuo-Wei Lai, K.
    (2023) The effectiveness of automatic speech recognition in ESL/EFL pronunciation: A meta-analysis. ReCALL, 36(1), 4–21. https://doi.org/10.1017/S0958344023000113
  68. Thomson, R. I., & Derwing, T. M.
    (2015) The effectiveness of L2 pronunciation instruction: A narrative review. Applied Linguistics, 36(3), 326–344. https://doi.org/10.1093/applin/amu076
  69. Trofimovich, P., & Isaacs, T.
    (2012) Disentangling accent from comprehensibility. Bilingualism: Language and Cognition, 15(4), 905–916. https://doi.org/10.1017/S1366728912000168
  70. Yu, D., & Deng, L.
    (2015) Automatic speech recognition: A deep learning approach. Springer. https://doi.org/10.1007/978-1-4471-5779-3
  71. Zhang, Y., & Ai, J.
    (2024) Semantic-weighted word error rate based on BERT for evaluating automatic speech recognition models. 2024 11th International Conference on Dependable Systems and Their Applications (DSA), 189–198. https://doi.org/10.1109/DSA63982.2024.00034
  72. Zou, B., Du, Y., Wang, Z., Chen, J., & Zhang, W.
    (2023) An investigation into artificial intelligence speech evaluation programs with automatic feedback for developing EFL learners’ speaking skills. Sage Open, 13(3), 21582440231193818. https://doi.org/10.1177/21582440231193818