International Journal of Corpus Linguistics - Volume 6, Issue 1, 2001
-
Policy and Practice in the Anonymisation of Linguistic Data
Author(s): Frances Rock
pp.: 1–26 (26)
What is anonymisation? This paper addresses this question, its relationship to linguistic data and its potential importance to corpus builders and users. It examines attitudes towards anonymisation such as hostility and disinterest and investigates relevant rights, responsibilities, and obligations. The paper then overviews and critiques methods of anonymisation and seeks to assess which items should be anonymised and which maintained. Finally, some troublesome and noteworthy cases are presented as evidence of the need for sensitive, realistic consideration of this issue. The paper was developed through consultation with researchers from the international community of corpus builders and users and, therefore, reflects the diversity of attitude and practice currently at large. It addresses this variability by finally proposing methods for systematic assessment of the need for anonymisation within individual corpora.
-
Communicative Constraints in EFL Pre-School Settings: A Corpus-Driven Approach
Author(s): Jesús Romero-Trillo and Ana Llinares García
pp.: 27–46 (20)
The present article investigates the use of interrogatives made by teachers and the responses given by learners in two different (bilingual and non-bilingual) English language classroom contexts in two Spanish nursery schools. The analysis shows the relevance of the type of functions made by the teachers through interrogatives, rather than the quantity of input in the target language. The study classifies the functions of interrogatives in the pre-school context and makes a statistical corpus-driven analysis of the questions and responses in the two schools. Finally, the article makes some suggestions, based on the data, about the kind of questions that can lead to a more natural L2 development in the classroom context.
-
Tagging a Corpus of Spoken Swedish
Author(s): Joakim Nivre and Leif Grönqvist
pp.: 47–78 (32)
In this article, we present and evaluate a method for training a statistical part-of-speech tagger on data from written language and then adapting it to the requirements of tagging a corpus of transcribed spoken language, in our case spoken Swedish. This is currently a significant problem for many research groups working with spoken language, since the availability of tagged training data from spoken language is still very limited for most languages. The overall accuracy of the tagger developed for spoken Swedish is quite respectable, varying from 95% to 97% depending on the tagset used. In conclusion, we argue that the method presented here gives good tagging accuracy with relatively little effort.
-
Semantic Encoding of Electronic Documents
Author(s): Caroline Brun and Frédérique Segond
pp.: 79–96 (18)
This paper presents an unsupervised, all-words, word sense disambiguation system for English. The system associates a word with its meaning in a given context using an electronic dictionary as a tagged corpus in order to extract semantic disambiguation rules. The methodology attempts to avoid the data acquisition bottleneck observed in word sense disambiguation techniques. Semantic rules are used as input to a semantic application program encoding a linguistic strategy in order to select the best rule to apply. The semantic rule extraction process and the application program are described. The methodology is developed in a client/server architecture, which enables the treatment of large corpora. The evaluation of the system is then detailed and some possible extensions and perspectives are finally proposed.
-
Comparing Corpora
Author(s): Adam Kilgarriff
pp.: 97–133 (37)
Corpus linguistics lacks strategies for describing and comparing corpora. Currently most descriptions of corpora are textual, and questions such as ‘what sort of a corpus is this?’, or ‘how does this corpus compare to that?’ can only be answered impressionistically. This paper considers various ways in which different corpora can be compared more objectively. First we address the issue, ‘which words are particularly characteristic of a corpus?’, reviewing and critiquing the statistical methods which have been applied to the question and proposing the use of the Mann-Whitney ranks test. Results of two corpus comparisons using the ranks test are presented. Then, we consider measures for corpus similarity. After discussing limitations of the idea of corpus similarity, we present a method for evaluating corpus similarity measures. We consider several measures and establish that a χ²-based one performs best. All methods considered in this paper are based on word and n-gram frequencies; the strategy is defended.
-
The Grammar and Use of Korean Reflexives
Author(s): Beom-Mo Kang
pp.: 134–150 (17)
This paper discusses the relationship between grammar as linguistic knowledge, as envisaged in Generative Grammar, and usage, the result of performance. Concretely, I analyze the use of the Korean reflexives ‘caki’, ‘casin’, and ‘cakicasin’ by examining the occurrences of these reflexives in a 5-million-word Korean corpus, taken from a 10-million-word Korean corpus called the “KOREA-1 Corpus”, compiled at Korea University (H. Kim and B. Kang 1996). This corpus is composed of various genres of Korean texts, including 10% of spoken material. From the KWIC concordances of the accusative forms of these reflexives, ‘caki-lul’, ‘casin-ul’, and ‘cakicasin-ul’, I examined whether a reflexive has a local antecedent or a long-distance antecedent. The result is that ‘caki’ is almost evenly split between local and long-distance antecedents, but ‘casin’ has more, and ‘cakicasin’ many more, local antecedents. I also examined the thematic roles of the local antecedents of reflexives, which shows that ‘casin’ has relatively more Experiencer antecedents than ‘caki’ has, although in both cases Agent antecedents dominate. The outcome of this frequency analysis suggests that a tendency (probably not yet grammaticalized), or degree of “naturalness”, is real and can be captured in the usage data, provided that we have a sizable amount of material which can be handled efficiently, as the corpus linguistic methods of the present day allow. At the least, the result of such an investigation can provide a solid base from which further theorizing may proceed.