Volume 27, Issue 2
  • ISSN: 1384-6647
  • E-ISSN: 1569-982X

Abstract

Interpreting corpora serve as the descriptive foundation of research and the ‘ground truth’ against which machine interpreting technologies are evaluated. However, access to corpora remains a critical bottleneck in interpreting studies due to data collection and processing challenges and the absence of interpreting- and translation-specific corpus publication venues. In this article, we present two technical infrastructures that facilitate corpus access: a metadata schema which standardises corpus description and the Unified Interpreting Corpus (UNIC) platform for data and metadata search and publication. Guided by the internationally established FAIR (findability, accessibility, interoperability and reusability) and CARE (collective benefit, authority to control, responsibility and ethics) principles for scientific data management and stewardship, we designed the infrastructures using open-source technologies, drawing on a review of 125 spoken and signed language interpreting corpora, relevant international standards and community knowledge. Feedback obtained from interpreting students, researchers and interpreters demonstrates greater perceived usefulness of and satisfaction with UNIC compared to general-purpose search portals. Overall, we illustrate a value- and consensus-driven path towards optimising the use of interpreting corpora and the careful curation of new ones, which avoids the duplication of effort, helps to chart research directions and fosters co-design with communities.


/content/journals/10.1075/intp.00123.liu
2025-08-29
2026-05-19
