Applying n-gram-based evaluation metrics to assess human interpreting

Abstract
We have recently witnessed a number of studies that employ n-gram-based machine-translation evaluation metrics such as BLEU to assess human interpreting automatically. A major limitation of this research lies in the non-probabilistic sampling of a limited number of renditions. Consequently, the correlation coefficients calculated between machine and human assessments, which serve as a proxy for machine–human parity, lack generalizability. Against this background, we conducted a battery of replications of Han and Lu (2023) to evaluate the efficacy of three n-gram-based automated metrics (BLEU, NIST and METEOR) in the assessment of interpreting. Our replications draw on a self-curated corpus of 1,695 interpretations, produced from various source speeches across different modes and directions of interpreting. Following the replications, we also conducted a four-level meta-analysis to produce an overall estimate of the machine–human correlation and to identify potential moderators. Our main findings are that the replication success rate was above 95% for BLEU, followed by NIST (about 70%) and METEOR (about 40%); that the overall machine–human correlation was r = .638; and that the three significant moderators were the direction of interpreting, the reliability of human scoring and the type of automated metric. Our study has methodological and practical implications for conducting interpreting research and assessment.
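The core computation behind BLEU can be illustrated with a short, self-contained sketch. This is a simplified, add-one-smoothed sentence-level version for illustration only, not the implementation used in the study (which would typically rely on a standard toolkit such as NLTK, cited in the references): clipped n-gram precisions against a reference rendition are combined by a geometric mean and multiplied by a brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n), add-one smoothed, times a
    brevity penalty for hypotheses shorter than the reference."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each hypothesis n-gram count at its reference count,
        # so repeating a correct word cannot inflate the score.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: renditions shorter than the reference are
    # penalized exponentially; longer ones are not rewarded.
    bp = 1.0 if len(hypothesis) >= len(reference) else math.exp(
        1 - len(reference) / max(len(hypothesis), 1))
    return bp * geo_mean

# Hypothetical token sequences for illustration.
ref = "the interpreter rendered the speech accurately".split()
hyp = "the interpreter rendered the speech accurately".split()
print(sentence_bleu(ref, hyp))  # identical renditions score 1.0
```

Applying such a metric to interpreting, as in the study, means treating a transcript of the interpretation as the hypothesis and one or more reference renditions as the references, then correlating the resulting scores with human ratings.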

DOI: 10.1075/intp.00127.han (2025-11-20, 2025-12-06)

References

  1. Anderson, S. F. & Maxwell, S. E.
    (2016) There’s more than one way to conduct a replication study: Beyond statistical significance. Psychological Methods (), –. 10.1037/met0000051
    https://doi.org/10.1037/met0000051 [Google Scholar]
  2. Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., Fiedler, S., Funder, D. C., Kliegl, R., Nosek, B. A., Perugini, M., Roberts, B. W., Schmitt, M., van Aken, M. A. G., Weber, H. & Wicherts, J. M.
    (2013) Recommendations for increasing replicability in psychology. European Journal of Personality (), –. 10.1002/per.1919
    https://doi.org/10.1002/per.1919 [Google Scholar]
  3. Banerjee, S. & Lavie, A.
    (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, –. https://www.aclweb.org/anthology/W05-0909/
    [Google Scholar]
  4. Bonett, D. G.
    (2012) Replication-extension studies. Current Directions in Psychological Science (), –. 10.1177/0963721412459512
    https://doi.org/10.1177/0963721412459512 [Google Scholar]
  5. Borenstein, M., Hedges, L. V., Higgins, J. P. T. & Rothstein, H. R.
    (2009) Introduction to meta-analysis. Chichester: John Wiley & Sons. 10.1002/9780470743386
    https://doi.org/10.1002/9780470743386 [Google Scholar]
  6. Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., Grange, J. A., Perugini, M., Spies, J. R. & van’t Veer, A.
    (2014) The Replication Recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, –. 10.1016/j.jesp.2013.10.005
    https://doi.org/10.1016/j.jesp.2013.10.005 [Google Scholar]
  7. Braver, S. L., Thoemmes, F. J. & Rosenthal, R.
    (2014) Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science (), –. 10.1177/1745691614529796
    https://doi.org/10.1177/1745691614529796 [Google Scholar]
  8. Cheung, M. W-L.
    (2019) A guide to conducting a meta-analysis with non-independent effect sizes. Neuropsychology Review, –. 10.1007/s11065-019-09415-6
    https://doi.org/10.1007/s11065-019-09415-6 [Google Scholar]
  9. Cochran, W. G.
    (1950) The comparison of percentages in matched samples. Biometrika (), –. 10.1093/biomet/37.3-4.256
    https://doi.org/10.1093/biomet/37.3-4.256 [Google Scholar]
  10. Coughlin, D.
    (2003) Correlating automated and human assessments of machine translation quality. Proceedings of Machine Translation Summit IX: Papers. https://aclanthology.org/2003.mtsummit-papers.9.pdf
    [Google Scholar]
  11. Cumming, G.
    (2012) Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge. 10.4324/9780203807002
    https://doi.org/10.4324/9780203807002 [Google Scholar]
  12. (2014) The new statistics: Why and how. Psychological Science (), –. 10.1177/0956797613504966
    https://doi.org/10.1177/0956797613504966 [Google Scholar]
  13. Doddington, G.
    (2002) Automatic evaluation of machine translation quality using N-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research, –. 10.3115/1289189.1289273
    https://doi.org/10.3115/1289189.1289273 [Google Scholar]
  14. Easley, R. W., Madden, C. S. & Dunn, M. G.
    (2000) Conducting marketing science: The role of replication in the research process. Journal of Business Research (), –. 10.1016/S0148-2963(98)00079-4
    https://doi.org/10.1016/S0148-2963(98)00079-4 [Google Scholar]
  15. Frankenberg-Garcia, A.
    (2022) Can a corpus-driven lexical analysis of human and machine translation unveil discourse features that set them apart? Target (), –. 10.1075/target.20065.fra
    https://doi.org/10.1075/target.20065.fra [Google Scholar]
  16. Ghiselli, S.
    (2022) Working memory tasks in interpreting studies: A meta-analysis. Translation, Cognition & Behavior (), –. 10.1075/tcb.00063.ghi
    https://doi.org/10.1075/tcb.00063.ghi [Google Scholar]
  17. Gile, D.
    (1990) Scientific research vs. personal theories in the investigation of interpretation. In L. Gran & C. Taylor (Eds.), Aspects of applied and experimental research on conference interpretation. Udine: Campanotto, –.
    [Google Scholar]
  18. Goh, J. X., Hall, J. A. & Rosenthal, R.
    (2016) Mini meta-analysis of your own studies: Some arguments on why and a primer on how. Social and Personality Psychology Compass (), –. 10.1111/spc3.12267
    https://doi.org/10.1111/spc3.12267 [Google Scholar]
  19. Higgins, J. P. T. & Thompson, S. G.
    (2002) Quantifying heterogeneity in a meta-analysis. Statistics in Medicine (), –. 10.1002/sim.1186
    https://doi.org/10.1002/sim.1186 [Google Scholar]
  20. Han, C. & Lu, X-L.
    (2021) Interpreting quality assessment re-imagined: The synergy between human and machine scoring. Interpreting and Society (), –. 10.1177/27523810211033670
    https://doi.org/10.1177/27523810211033670 [Google Scholar]
    (2023) Can automated machine translation evaluation metrics be used to assess students’ interpretation in the language learning classroom? Computer Assisted Language Learning (), –. 10.1080/09588221.2021.1968915
    https://doi.org/10.1080/09588221.2021.1968915 [Google Scholar]
  22. Han, C. & Lu, X-L.
    (2025) Beyond BLEU: Repurposing neural-based metrics to assess interlingual interpreting in tertiary-level language learning settings. Research Methods in Applied Linguistics (), . 10.1016/j.rmal.2025.100184
    https://doi.org/10.1016/j.rmal.2025.100184 [Google Scholar]
  23. Han, C. & Wang, Y-Q.
    (2025) Conducting replication in translation and interpreting studies: Stakeholders’ perceptions, practices, and expectations. Target (), –. 10.1075/target.23164.han
    https://doi.org/10.1075/target.23164.han [Google Scholar]
  24. Han, C. & Yang, L-Y.
    (2023) Relating utterance fluency to perceived fluency of interpreting: A partial replication and a mini meta-analysis. Translation and Interpreting Studies (), –. 10.1075/tis.20091.han
    https://doi.org/10.1075/tis.20091.han [Google Scholar]
  25. Hoeppner, S.
    (2019) A note on replication analysis. International Review of Law and Economics, –. 10.1016/j.irle.2019.05.004
    https://doi.org/10.1016/j.irle.2019.05.004 [Google Scholar]
  26. Koehn, P.
    (2010) Statistical machine translation. Cambridge: Cambridge University Press. 10.1017/CBO9780511815829
    https://doi.org/10.1017/CBO9780511815829 [Google Scholar]
  27. Liu, M-H.
    (2016) Putting the horse before the cart: Righting the experimental approach in interpreting studies. In C. Bendazzoli & C. Monacelli (Eds.), Addressing methodological challenges in interpreting studies research. Newcastle upon Tyne: Cambridge Scholars, –.
    [Google Scholar]
  28. Liu, Y-B. & Zhang, W.
    (2022) Exploring the predictive validity of an interpreting aptitude test battery: An approximate replication. Interpreting (), –. 10.1075/intp.00078.liu
    https://doi.org/10.1075/intp.00078.liu [Google Scholar]
  29. Loper, E. & Bird, S.
    (2002) NLTK: The Natural Language Toolkit. Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, –. 10.3115/1118108.1118117
    https://doi.org/10.3115/1118108.1118117 [Google Scholar]
  30. López-López, J. A., Page, M. J., Lipsey, M. W. & Higgins, J. P. T.
    (2018) Dealing with effect size multiplicity in systematic reviews and meta-analyses. Research Synthesis Methods (), –. 10.1002/jrsm.1310
    https://doi.org/10.1002/jrsm.1310 [Google Scholar]
  31. Lu, X-L. & Han, C.
    (2023) Automatic assessment of spoken-language interpreting based on machine-translation evaluation metrics: A multi-scenario exploratory study. Interpreting (), –. 10.1075/intp.00076.lu
    https://doi.org/10.1075/intp.00076.lu [Google Scholar]
  32. McShane, B. B. & Böckenholt, U.
    (2017) Single-paper meta-analysis: Benefits for study summary, theory testing, and replicability. Journal of Consumer Research (), –. 10.1093/jcr/ucw085
    https://doi.org/10.1093/jcr/ucw085 [Google Scholar]
  33. Mellinger, C. D. & Hanson, T. A.
    (2017) Quantitative research methods in translation and interpreting studies. Abingdon: Routledge. 10.4324/9781315647845
    https://doi.org/10.4324/9781315647845 [Google Scholar]
  34. (2019) Meta-analyses of simultaneous interpreting and working memory. Interpreting (), –. 10.1075/intp.00026.mel
    https://doi.org/10.1075/intp.00026.mel [Google Scholar]
  35. (2020) Meta-analysis and replication in interpreting studies. Interpreting (), –. 10.1075/intp.00037.mel
    https://doi.org/10.1075/intp.00037.mel [Google Scholar]
  36. Olalla-Soler, C.
    (2020) Practices and attitudes toward replication in empirical translation and interpreting studies. Target (), –. 10.1075/target.18159.ola
    https://doi.org/10.1075/target.18159.ola [Google Scholar]
  37. Open Science Collaboration
    (2015) Estimating the reproducibility of psychological science. Science (). 10.1126/science.aac4716
    https://doi.org/10.1126/science.aac4716 [Google Scholar]
  38. Papineni, K., Roukos, S., Ward, T. & Zhu, W-J.
    (2002) BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, –. https://www.aclweb.org/anthology/P02-1040.pdf
    [Google Scholar]
  39. Pöchhacker, F.
    (2011) Replication in research on quality in conference interpreting. T&I Review, –.
    [Google Scholar]
  40. Rosenthal, R.
    (1997) Some issues in the replication of social science research. Labour Economics (), –. 10.1016/S0927-5371(97)00012-2
    https://doi.org/10.1016/S0927-5371(97)00012-2 [Google Scholar]
  41. Schenker, N. & Gentleman, J. F.
    (2001) On judging the significance of differences by examining the overlap between confidence intervals. The American Statistician, –. 10.1198/000313001317097960
    https://doi.org/10.1198/000313001317097960 [Google Scholar]
  42. Valentine, J. C., Biglan, A., Boruch, R. F., Castro, F. G., Collins, L. M., Flay, B. R., Kellam, S., Mościcki, E. K. & Schinke, S. P.
    (2011) Replication in prevention science. Prevention Science, –. 10.1007/s11121-011-0217-6
    https://doi.org/10.1007/s11121-011-0217-6 [Google Scholar]
  43. Van den Noortgate, W., López-López, J. A., Marín-Martínez, F. & Sánchez-Meca, J.
    (2013) Three-level meta-analysis of dependent effect sizes. Behavior Research Methods, –. 10.3758/s13428-012-0261-6
    https://doi.org/10.3758/s13428-012-0261-6 [Google Scholar]
  44. Viechtbauer, W.
    (2010) Conducting meta-analyses in R with the metafor package. Journal of Statistical Software (), –. 10.18637/jss.v036.i03
    https://doi.org/10.18637/jss.v036.i03 [Google Scholar]
  45. Wang, X-M. & Yuan, L.
    (2023) Machine-learning based automatic assessment of communication in interpreting. Frontiers in Communication. 10.3389/fcomm.2023.1047753
    https://doi.org/10.3389/fcomm.2023.1047753 [Google Scholar]
  46. Wen, H. & Dong, Y.
    (2019) How does interpreting experience enhance working memory and short-term memory: A meta-analysis. Journal of Cognitive Psychology (), –. 10.1080/20445911.2019.1674857
    https://doi.org/10.1080/20445911.2019.1674857 [Google Scholar]