Abstract
This study examines the validity of WER as a proxy for pronunciation quality in EFL contexts. Human ratings of comprehensibility and accentedness were compared with WER and automated pronunciation scores from six ASR systems — Kaldi, wav2vec 2.0, HuBERT, Whisper (Base and Large-v3), and Microsoft Azure — using 190 read-aloud recordings by Korean elementary learners. With respect to pronunciation scoring, Azure’s phoneme-level accuracy scores demonstrated moderate correlations with human judgments, while Kaldi’s GOP scores showed no meaningful association. Analysis of WER revealed a critical trade-off between ASR accuracy and perceptual sensitivity: high-performing systems such as Whisper Large-v3 and Azure produced near-zero WERs that correlated only weakly with human ratings. In contrast, mid-performing systems such as Whisper Base and HuBERT showed stronger correlations, indicating that moderate WER values may better reflect pronunciation variation. These results underscore the limitations of WER in advanced ASR systems and the need for perceptually grounded, interpretable metrics.
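For reference, WER is conventionally computed as the minimum number of word-level substitutions, deletions, and insertions needed to turn the ASR hypothesis into the reference transcript, divided by the reference length. A minimal sketch of this standard definition follows; the example sentences are illustrative and not drawn from the study's data.

```python
# Minimal sketch of Word Error Rate (WER), the metric under study.
# WER = (substitutions + deletions + insertions) / reference word count,
# computed via a standard Levenshtein alignment over word tokens.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over a
# six-word reference give WER = 2/6.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason it is an imperfect stand-in for perceptual judgments of pronunciation.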