International Journal of Corpus Linguistics - Volume 6, Issue 3, 2001
-
Automatic Extraction of Terminological Translation Lexicon from Czech-English Parallel Texts
Author(s): Martin Cmejrek and Jan Curín
pp.: 1–12 (12)
We present experimental results of the automatic extraction of a Czech-English translation dictionary. Two bilingual corpora were created: a computer-oriented corpus of 119,886 sentence pairs and a journalistic corpus of 58,137. For sentence alignment we used the length-based statistical method (Gale and Church 1991); for dictionary extraction we used a noun phrase marker based on a regular grammar, together with a probabilistic model (Brown et al. 1993). The resulting dictionaries each contain around 6,000 entries. After significance filtering, weighted precision is 86.4% for the computer-oriented and 70.7% for the journalistic Czech-English dictionary.
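As an illustration of what length-based sentence alignment involves, the following Python sketch aligns two texts using only sentence lengths in characters. It is a simplified variant in the spirit of Gale and Church, not the authors' implementation: the function names, the prior probabilities in `PRIORS`, and the constants `C` and `S2` are assumptions chosen for the example.

```python
import math

# Assumed prior probabilities for each alignment type
# (number of source sentences, number of target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,
          (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01}
C = 1.0    # assumed expected target characters per source character
S2 = 6.8   # assumed variance of the length difference per character


def length_cost(len_src, len_tgt):
    """Negative log probability that two spans of these lengths match."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-tailed probability of a deviation at least this large under N(0, 1).
    p = math.erfc(abs(delta) / math.sqrt(2))
    return -math.log(max(p, 1e-300))


def align(src_lens, tgt_lens):
    """Return the cheapest list of (source indices, target indices) groups."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = length_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                total = cost[i][j] + step - math.log(prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the alignment.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups


one_to_one = align([20, 50, 30], [22, 48, 31])
merged = align([20, 30, 40], [52, 41])
```

In the second call the first two source sentences are grouped against a single target sentence, because their combined length fits it far better than either does alone.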
-
Words from Bononia Legal Corpus
Author(s): Rema Rossini Favretti, F. Tamburini and E. Martelli
pp.: 13–34 (22)
The analysis of special multilingual corpora is still in its infancy, but it may serve a particularly important role for the directions it offers both in cross-linguistic investigation and in the selection of the most typical features of text types and genres. To exemplify the information which can be obtained from corpus evidence, the paper reports on an on-going corpus-driven research project, named Bononia Legal Corpus (BOLC). The main aim of BOLC is to build multilingual machine-readable law corpora. Data are at present limited to English and Italian, but an extension is envisaged to include other languages. Before the first sample, a preliminary pilot corpus was constructed to consider European legislation and create a conceptual framework to be used as a first-level experience. In the paper, sections 2 and 3 describe the corpus design and formatting as well as the corpus access tools. Sections 4 and 5 discuss two case studies and analyse two semantic areas which can be seen as two ends of the same variational continuum. At one end, we consider the words contratto and contract, which through the extension of international transactions and circulation may be supposed to have acquired transnational traits. At the other, we focus on a semantic area which may be expected to present translation problems for the differences existing in the two socio-institutional systems. Reference is made to the English words tax and duty and to the Italian words tassa and imposta.
-
Hybrid Approaches for Automatic Segmentation and Annotation of a Chinese Text Corpus
Author(s): Zhiwei Feng
pp.: 35–42 (8)
This paper describes hybrid approaches to the automatic segmentation and annotation of a Chinese text corpus. Some experimental results are given. The hybrid approaches combine a rule-based method, a statistics-based method, and an automatic learning method. This combination can clearly improve the precision of segmentation and annotation of a Chinese text corpus.
-
Distance Between Languages as Measured by the Minimal-Entropy Model; Plato’s Republic—Slovenian Versus 15 Other translations
Author(s): Primoz Jakopin
pp.: 43–53 (11)
In this paper, a language model based on probabilities of text n-grams is used as a measure of distance between Slovenian and 15 other European languages. During the construction of the model, a Huffman tree is generated from all the n-grams (n = 1 to 32, frequency 2 or more) in the training corpus of Slovenian literary texts (2.7 million words), and appropriate Huffman codes are computed for every leaf in the tree. To apply the model to a new text sample, the sample is cut into n-grams (1–32) in such a way that the sum of the model's Huffman code lengths for all the obtained n-grams is minimal.
The above model, applied to all 16 translations of Plato’s Republic from the TELRI CD-ROM, produced the following language order (average coding length in bits per character): Slovenian (2.37), Serbocroatian (3.77), Croatian (3.84), Bulgarian (3.96), Czech (4.10), Polish (4.32), Russian (4.46), Slovak (4.46), Latvian (4.74), Lithuanian (4.94), English (5.40), French (5.67), German (5.69), Romanian (5.76), Finnish (6.11), and Hungarian (6.47).
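The minimal-length cut described in this abstract can be found by dynamic programming. The Python sketch below shows the idea under stated assumptions: `code_bits` is a hypothetical toy table that stands in for the Huffman code lengths of the trained model, and the function name and error handling are illustrative rather than taken from the paper.

```python
def min_code_length(text, code_bits, max_n=32):
    """Cut text into known n-grams so that the summed code length is minimal.

    code_bits maps an n-gram to its code length in bits (a stand-in for
    the model's Huffman code lengths). Returns (total_bits, segments).
    """
    inf = float("inf")
    best = [0.0] + [inf] * len(text)   # best[k]: cheapest coding of text[:k]
    prev = [0] * (len(text) + 1)       # prev[k]: start of the last n-gram
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_n), end):
            gram = text[start:end]
            if gram in code_bits and best[start] + code_bits[gram] < best[end]:
                best[end] = best[start] + code_bits[gram]
                prev[end] = start
    if best[-1] == inf:
        raise ValueError("text cannot be covered by the known n-grams")
    # Recover the chosen n-grams by walking back from the end of the text.
    segments = []
    end = len(text)
    while end > 0:
        segments.append(text[prev[end]:end])
        end = prev[end]
    segments.reverse()
    return best[-1], segments


# Hypothetical code lengths, for illustration only.
toy_bits = {"a": 3, "b": 3, "ab": 4, "ba": 5, "aba": 9}
total_bits, segments = min_code_length("abab", toy_bits)
bits_per_char = total_bits / len("abab")
```

Dividing the total by the number of characters gives a figure of the same kind as the bits-per-character values listed above: the more cheaply the Slovenian-trained table codes a text, the smaller the value.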
-
The Importance of the Syntagmatic Dimension in the Multilingual Lexical Database
Author(s): Rūta Petrauskaitė
pp.: 55–65 (11)
This paper describes the idea behind a multilingual database (Muldi) designed to incorporate five constituent parts: monolingual and multilingual corpora, monolingual lexicons, lists of translation equivalents, and terminological records. The emphasis in Muldi is on the presentation, analysis, and description of syntagmatic information contained in lexical items. Types of translation equivalents, as well as the problem of the relationship between dictionary and corpus translation equivalents, are also considered.
-
Compiling Parallel Text Corpora: Towards Automation of Routine Procedures
Author(s): Mihail Mihailov and Hannu Tommola
pp.: 67–77 (11)
The aim of the research project running at the Department of Translation Studies of the University of Tampere is to collect a Russian-Finnish parallel corpus of fiction. The corpus will be equipped with efficient search and analysis tools. The texts of the corpus will be stored as ordinary text files. Each text will be registered in a Microsoft Access database and supplied with a description. Automated parallel concordancing is being developed for the corpus. The program will find the keywords in text A (Russian), then look for possible translation equivalents of the keywords in language B (Finnish), and then search for the portion of text B (Finnish) where most of the keywords in question can be found.
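The three-step search outlined in this abstract can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name, the toy equivalents table, and the sample sentences are hypothetical, and matching is done on exact word forms, which a real Russian-Finnish tool would have to replace with some handling of inflection.

```python
import re


def find_parallel_passage(keywords, equivalents, segments_b):
    """Return (index, hits) for the text B segment matching most keywords.

    keywords: keywords found in a passage of text A.
    equivalents: maps each keyword to a set of candidate translations.
    segments_b: text B split into segments (for example, sentences).
    """
    best_index, best_hits = None, 0
    for index, segment in enumerate(segments_b):
        words = set(re.findall(r"\w+", segment.lower()))
        # Count the keywords that have at least one equivalent in the segment.
        hits = sum(1 for kw in keywords if words & equivalents.get(kw, set()))
        if hits > best_hits:
            best_index, best_hits = index, hits
    return best_index, best_hits


# Hypothetical toy data, for illustration only.
keywords = ["дом", "книга"]
equivalents = {"дом": {"talo", "koti"}, "книга": {"kirja"}}
segments_b = [
    "Aurinko paistoi.",
    "Talo oli vanha ja kirja oli pöydällä.",
    "Kirja katosi.",
]
result = find_parallel_passage(keywords, equivalents, segments_b)
```

The function returns `None` as the index when no segment contains any equivalent, so a caller can tell "no match" apart from a match in the first segment.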
-
Data-derived Multilingual Lexicons
Author(s): John McH. Sinclair
pp.: 79–94 (16)
This paper first appeared in Arcaini (ed.) 2000: La Traduzione (IV). Quaderni di Libre e Riviste d’Italia, 43; Roma: Ministerio per i bene e le attività culturali. For this publication it has been lightly revised, and the bibliography updated.
-
Bridge Dictionaries as Bridges Between Languages
Author(s): Hana Skoumalová
pp.: 95–105 (11)
Bridge dictionaries are a new sort of dictionary for learners of English. They are based on the monolingual Cobuild learners’ dictionaries, and they are partly translated: they contain translated definitions and translation equivalents. This paper shows the possible ways of exploiting Bridge dictionaries for creating new bilingual or multilingual dictionaries.
One possible way is to extract corresponding translation equivalents, edit them, and make a new printed dictionary. As both sides of such a dictionary were originally created as translations from English, the dictionary requires quite a lot of lexicographic work.
Another possibility is to create an electronic version of the dictionary “as is”. For this purpose, it is necessary first to convert the dictionary into SGML format and define its DTD. This format can then serve as a standard for future Bridge dictionaries, and adding new language modules to existing dictionaries would be quite easy.
-
Procedures in Building the Croatian-English Parallel Corpus
Author(s): Marko Tadic
pp.: 107–123 (17)
This contribution gives a survey of the procedures and formats used in building the Croatian-English parallel corpus, which is being collected at the Institute of Linguistics at the Philosophical Faculty, University of Zagreb. The primary text source is the newspaper Croatia Weekly, which has been published since the beginning of 1998 by HIKZ (Croatian Institute for Information and Culture). After a quick survey of existing English-Croatian parallel corpora, the article deals with the procedures involved in text conversion and text encoding, particularly alignment. Several recent suggestions for alignment encoding are listed and elaborated at the end of the article.
-
Analysing the Fluency of Translators
Author(s): Rafal Uzar and Jacek Tadeusz Waliński
pp.: 155–166 (12)
The paper discusses problems involved in analysing the quality of student translation and the type of errors made by students in translation. The authors have developed a TEI-lite conformant corpus of student translations which also includes error category mark-up. This project has allowed the authors to objectively analyse student translation work and has also allowed the students themselves to gain valuable insights into translation problems.
-
Equivalence and Non-equivalence in Parallel Corpora
Author(s): Tamás Váradi and Gábor Kiss
pp.: 167–177 (11)
The present paper shows how an aligned parallel corpus can be used to investigate the consistency of translation equivalence across the two languages in a parallel corpus. The particular issues addressed are the bidirectionality of translation equivalence, the coverage of multiword units, and the amount of implicit knowledge presupposed on the part of the user in interpreting the data. Three lexical items belonging to different word classes were chosen for analysis: the noun head, the verb give, and the preposition with. George Orwell’s novel 1984 was used as source material, as it is available in English-Hungarian sentence-aligned form. It is argued that the analysis of translation equivalents displayed in sets of concordances with aligned sentences in the target language holds important implications for bilingual lexicography and automatic word alignment methodology.