Volume 25, Issue 1
  • ISSN 1384-6647
  • E-ISSN: 1569-982X



Automated metrics for machine translation (MT) such as BLEU are customarily used because they are quick to compute and sufficiently valid to be useful in MT assessment. Whereas the instantaneity and reliability of such metrics are made possible by automatic computation based on predetermined algorithms, their validity is primarily dependent on a strong correlation with human assessments. Despite the popularity of such metrics in MT, little research has been conducted to explore their usefulness in the automatic assessment of human translation or interpreting. In the present study, we therefore seek to provide an initial insight into the way MT metrics would function in assessing spoken-language interpreting by human interpreters. Specifically, we selected five representative metrics – BLEU, NIST, METEOR, TER and BERT – to evaluate 56 bidirectional consecutive English–Chinese interpretations produced by 28 student interpreters of varying abilities. We correlated the automated metric scores with the scores assigned by different types of raters using different scoring methods (i.e., multiple assessment scenarios). The major finding is that BLEU, NIST, and METEOR had moderate-to-strong correlations with the human-assigned scores across the assessment scenarios, especially for the English-to-Chinese direction. Finally, we discuss the possibility and caveats of using MT metrics in assessing human interpreting.
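The correlation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: `simple_bleu` is a simplified, smoothed sentence-level BLEU-style score written for this example (a real study would use an established implementation of each metric), and the reference, renditions, and human scores are hypothetical toy data invented purely for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU-style score (an assumption of this sketch):
    geometric mean of clipped n-gram precisions for orders 1..max_n, multiplied
    by a brevity penalty. Add-one smoothing keeps the score non-zero when an
    n-gram order has no match."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Penalise candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / max_n)


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data, invented for illustration only.
reference = "the committee approved the new budget on friday".split()
renditions = [
    "the committee approved the new budget on friday".split(),
    "the committee approved a new budget friday".split(),
    "committee agreed budget".split(),
]
human_scores = [9.0, 7.0, 3.0]  # made-up rater scores, one per rendition

# Score each rendition against the reference, then correlate with the raters.
metric_scores = [simple_bleu(r, reference) for r in renditions]
r = pearson(metric_scores, human_scores)
print(metric_scores, r)
```

In this sketch a higher correlation would indicate closer agreement between the automated score and the human-assigned scores; with only three invented data points the value has no evidential weight and serves only to show the mechanics.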




