Volume 36, Issue 2
  • ISSN 0213-2028
  • E-ISSN: 2254-6774



Modern Standard Arabic makes extensive use of coordination particles, whereas punctuation marks are scarce and erratic, which leads to long clauses. This is generally assumed to hinder Sentence Boundary Detection and to promote sentence splitting when translating from Arabic into English. Previous literature on translation from Arabic into Spanish is practically nonexistent. We tested this hypothesis for translation from Arabic into Spanish on a sample of 282,714 graphic words extracted from a bilingual corpus of 8,681,110 graphic words and found that each Arabic sentence yielded an average of 1.5 Spanish sentences. Furthermore, our data show the potential impact of directionality, in that sentence splitting when translating from Arabic into Spanish is 50% more frequent than from English into Arabic. We also determined statistically that five elements ([و], [حيث], [كما], [وقد], and [وذلك]) are the most salient potential markers for sentence splitting in the resulting Spanish translations. Our findings should be of particular interest for Computational Linguistics and translator training.




