Lingvisticæ Investigationes - Volume 30, Issue 1, 2007
-
A survey of named entity recognition and classification
Author(s): David Nadeau and Satoshi Sekine
pp.: 3–26 (24)
This survey covers fifteen years of research in the Named Entity Recognition and Classification (NERC) field, from 1991 to 2006. We report observations about languages, named entity types, domains and textual genres studied in the literature. From the start, NERC systems have been developed using hand-made rules, but now machine learning techniques are widely used. These techniques are surveyed along with other critical aspects of NERC such as features and evaluation methods. Features are word-level, dictionary-level and corpus-level representations of words in a document. Evaluation techniques, ranging from intuitive exact match to very complex matching techniques with adjustable cost of errors, are an indisputable key to progress.
-
Diversity in logarithmic opinion pools
Author(s): Andrew D.M. Smith and Miles Osborne
pp.: 27–47 (21)
Conditional random fields are state-of-the-art models for sequencing tasks such as named entity recognition. However, being globally conditioned, they have a tendency to overfit to a greater extent than other sequencing models. We introduce an approach to combat this overfitting called a logarithmic opinion pool (LOP). A LOP consists of a weighted combination of constituent models. We present the theory behind LOPs, and show that effective LOPs require constituent models that are diverse from one another. We examine different ways to introduce such diversity, including an approach that involves training the constituent models together, interactively. Our results show that, as expected from the underlying theory, explicitly optimising for constituent model diversity can improve performance over standard approaches to regularisation.
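As a sketch of the "weighted combination of constituent models" mentioned in this abstract, a logarithmic opinion pool is commonly written as a normalised weighted product of the constituent distributions. The symbols below (constituent models p_alpha, weights w_alpha, normaliser Z_LOP) are notation assumed here for illustration and are not taken from the abstract.

```latex
p_{\mathrm{LOP}}(y \mid x) \;=\; \frac{1}{Z_{\mathrm{LOP}}(x)} \prod_{\alpha} p_{\alpha}(y \mid x)^{\,w_{\alpha}},
\qquad
Z_{\mathrm{LOP}}(x) \;=\; \sum_{y'} \prod_{\alpha} p_{\alpha}(y' \mid x)^{\,w_{\alpha}},
\qquad
w_{\alpha} \ge 0,\quad \sum_{\alpha} w_{\alpha} = 1 .
```

Taking logarithms shows where the name comes from: up to the normaliser, the log-probability of the pool is a weighted average of the constituents' log-probabilities.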
-
Handling conjunctions in named entities
Author(s): Pawel Mazur and Robert Dale
pp.: 49–68 (20)
Although the literature contains reports of very high accuracy figures for the recognition of named entities in text, there are still some named entity phenomena that remain problematic for existing text processing systems. One of these is the ambiguity of conjunctions in candidate named entity strings, an all-too-prevalent problem in corporate and legal documents. In this paper, we distinguish four uses of the conjunction in these strings, and explore the use of a supervised machine learning approach to conjunction disambiguation trained on a very limited set of ‘name internal’ features that avoids the need for expensive lexical or semantic resources. We achieve 84% correctly classified examples using k-fold evaluation on a data set of 600 instances. We argue that further improvements are likely to require the use of wider domain knowledge and name external features.
-
Complex named entities in Spanish texts: Structures and properties
Author(s): Sofía N. Galicia-Haro and Alexander Gelbukh
pp.: 69–94 (26)
We present a linguistic analysis of Named Entities in Spanish texts. Our work is focused on the determination of the structure of complex proper names: names with coordinated constituents, names with prepositional phrases and names formed by several content words initialized by a capital letter. We present the analysis of circa 49,000 examples obtained from Mexican newspapers. We detail their structure and give some notions about the context surrounding them. Since named entities belong to an open class of words, they are being created daily, so the challenge for a named entity recognizer is to precisely determine the boundaries of new entity names in any text and to analyze thoroughly their components for deep semantic analysis. Knowing their general classes of structure, it should be possible to derive useful heuristics or a specific grammar for natural language processing applications.
-
Named Entity Recognition and transliteration in Bengali
Author(s): Asif Ekbal, Sudip Kumar Naskar and Sivaji Bandyopadhyay
pp.: 95–114 (20)
The paper reports on the development of a Named Entity Recognition (NER) system in Bengali using a tagged Bengali news corpus and the subsequent transliteration of the recognized Bengali Named Entities (NEs) into English. Three different models of the NER have been developed. A semi-supervised learning method has been adopted to develop the first two models, one without linguistic features (Model A) and the other with linguistic features (Model B). The third one (Model C) is based on a statistical Hidden Markov Model. A modified joint source-channel model has been used along with a number of alternatives to generate the English transliterations of Bengali NEs and vice-versa. The transliteration models learn the mappings from the bilingual training sets optionally guided by linguistic knowledge in the form of conjuncts and diphthongs in Bengali and their representations in English. The NER system has demonstrated the highest average Recall, Precision and F-Score values of 89.62%, 78.67% and 83.79% respectively in Model C. Evaluation of the proposed transliteration models demonstrated that the modified joint source-channel model performs best in terms of evaluation metrics for person and location names for both Bengali to English (B2E) transliteration and English to Bengali transliteration (E2B). The use of the linguistic knowledge during training of the transliteration models improves performance.
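The three Model C figures quoted in this abstract can be related with the usual F-score formula, the harmonic mean of precision and recall. The sketch below assumes the balanced F1 variant (beta = 1); the abstract does not state which variant was used, and the function name `f_score` is hypothetical.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        # Both scores are zero, so the harmonic mean is defined as zero.
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Figures quoted in the abstract for Model C, in percent.
recall, precision = 89.62, 78.67

# Assumption: balanced F1, which the abstract does not state explicitly.
f1 = f_score(precision, recall)
print(round(f1, 2))
```

Under that assumption the computed value rounds to the 83.79% reported in the abstract.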
-
A note on the semantic and morphological properties of proper names in the Prolex project
Author(s): Duško Vitas, Cvetana Krstev and Denis Maurel
pp.: 115–133 (19)
In this paper we present a linguistic approach to the analysis of proper names. The basic assumption of our approach is that proper names are linguistic units of text that should be treated using the same methods that are applied to text in its totality. We illustrate the inflectional and derivational properties of simple and multi-word proper names on the example of Serbian, and describe how these properties have been formalized in order to develop e-dictionaries of the DELA type. In order to support multi-lingual applications we have developed a model of a multilingual relational dictionary of proper names based on an ontology, as well as an actual database. Finally, we outline how the developed dictionaries and database can be used in real monolingual and multi-lingual applications, such as information extraction.
-
Cross-lingual Named Entity Recognition
Author(s): Ralf Steinberger and Bruno Pouliquen
pp.: 135–162 (28)
Named Entity Recognition and Classification (NERC) is a known and well-explored text analysis application that has been applied to various languages. We present an automatic, highly multilingual news analysis system that fully integrates NERC for locations, persons and organisations with document clustering, multi-label categorisation, name attribute extraction, name variant merging and the calculation of social networks. The proposed application goes beyond the state-of-the-art by automatically merging the information found in news written in ten different languages, and by using the aggregated name information to automatically link related news documents across languages for all 45 language pair combinations. While state-of-the-art approaches for cross-lingual name variant merging and document similarity calculation require bilingual resources, the methods proposed here are mostly language-independent and require a minimal amount of monolingual language-specific effort. The development of resources for additional languages is therefore kept to a minimum and new languages can be plugged into the system effortlessly. The presented online news analysis application is fully functional and had, at the end of the year 2006, reached average usage statistics of 600,000 hits per day.