, Xiaolei Lu, Weiwei Wang and Shirong Chen
Abstract
We have recently witnessed a number of studies employing n-gram-based machine-translation evaluation metrics, such as BLEU, to assess human interpreting automatically. A major limitation of this research lies in the non-probabilistic sampling of a limited number of renditions. Consequently, the correlation coefficients calculated between machine and human assessments, which serve as a proxy for machine–human parity, lack generalizability. Against this background, we conducted a battery of replications of Han and Lu (2023) to evaluate the efficacy of three n-gram-based automated metrics (BLEU, NIST and METEOR) in the assessment of interpreting. Our replications drew on a self-curated corpus of 1,695 interpretations of various source speeches, spanning different modes and directions of interpreting. Following the replications, we conducted a four-level meta-analysis to produce an overall estimate of the machine–human correlation and to identify potential moderators. Our main findings are that the replication success rate for BLEU was above 95%, followed by NIST (about 70%) and METEOR (about 40%); that the overall machine–human correlation was rs = .638; and that the three significant moderators identified were the direction of interpreting, the reliability of human scoring and the type of automated metric. Our study has methodological and practical implications for interpreting research and assessment.
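To illustrate the kind of machine–human comparison described above, the sketch below scores candidate interpretations against a reference rendition with sentence-level BLEU and then correlates those scores with human ratings using Spearman's rho. This is a minimal, self-contained example using NLTK and SciPy; the toy sentences, the smoothing choice and the 10-point rating scale are assumptions for illustration only, not the authors' corpus or scoring pipeline.

```python
# Illustrative sketch: correlate automated BLEU scores with human ratings.
# Uses nltk.translate.bleu_score.sentence_bleu and scipy.stats.spearmanr.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical toy data: one tokenized reference rendition and three
# candidate interpretations, each paired with a human quality rating.
references = [["the", "delegates", "approved", "the", "proposal", "unanimously"]]
candidates = [
    ["the", "delegates", "approved", "the", "proposal", "unanimously"],
    ["delegates", "agreed", "to", "the", "proposal"],
    ["the", "proposal", "was", "rejected"],
]
human_scores = [9.0, 6.5, 2.0]  # assumed ratings on a 10-point scale

# Smoothing avoids zero BLEU on short segments with missing higher-order n-grams.
smooth = SmoothingFunction().method1
bleu_scores = [
    sentence_bleu(references, cand, smoothing_function=smooth)
    for cand in candidates
]

# Spearman's rank correlation between automated and human assessments.
rho, p_value = spearmanr(bleu_scores, human_scores)
print(f"BLEU scores: {bleu_scores}")
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```

A study-level correlation of this kind, computed over many renditions, is the quantity that the meta-analysis aggregates across replications.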