Volume 10, Issue 3
  • ISSN 1384-6655
  • E-ISSN: 1569-9811

Abstract

Relational databases can be used to create large corpora that provide both very good search performance and a wide range of queries. This paper outlines how this approach has been used to create the Corpus del Español, which contains 100 million words of Spanish text from the 1200s-1900s. The main databases are composed of n-gram tables (all unique 1-, 2-, 3-, and 4-word sequences) and the associated frequency of each n-gram in each century (historical Spanish) and register (Modern Spanish). These tables are then joined to other tables containing part of speech, lemma, synonyms, and user-defined lists of words and lemmas. There is essentially no limit to the amount of annotation that can be added in additional tables (with little or no impact on performance), and the SQL-based queries allow a wide range of searches that are not available with traditional corpora.
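The design summarized in the abstract, an n-gram frequency table joined to annotation tables, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the table names (`lexicon`, `bigrams`), the column names, the encoding of a century as an integer (13 for the 1200s), and the toy frequencies are invented and do not reproduce the actual Corpus del Español schema or data. SQLite is used through Python's standard library only to keep the example self-contained.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the actual Corpus del Español design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon (
    word_id INTEGER PRIMARY KEY,
    word    TEXT NOT NULL,
    lemma   TEXT NOT NULL,
    pos     TEXT NOT NULL
);
CREATE TABLE bigrams (
    w1_id   INTEGER NOT NULL REFERENCES lexicon(word_id),
    w2_id   INTEGER NOT NULL REFERENCES lexicon(word_id),
    century INTEGER NOT NULL,   -- assumed encoding: 13 = 1200s, 19 = 1800s
    freq    INTEGER NOT NULL,
    PRIMARY KEY (w1_id, w2_id, century)
);
""")

# Toy rows with made-up frequencies, for demonstration only.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (1, "dijo", "decir", "VERB"),
    (2, "dice", "decir", "VERB"),
    (3, "que", "que", "CONJ"),
    (4, "verdad", "verdad", "NOUN"),
])
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?, ?)", [
    (1, 3, 13, 40),
    (1, 3, 19, 25),
    (2, 3, 19, 60),
    (2, 4, 19, 5),
])


def lemma_pos_by_century(conn, lemma, pos):
    """Sum bigram frequency per century where the first word is any form
    of `lemma` and the second word is tagged `pos`."""
    return conn.execute(
        """
        SELECT b.century, SUM(b.freq)
        FROM bigrams AS b
        JOIN lexicon AS l1 ON l1.word_id = b.w1_id
        JOIN lexicon AS l2 ON l2.word_id = b.w2_id
        WHERE l1.lemma = ? AND l2.pos = ?
        GROUP BY b.century
        ORDER BY b.century
        """,
        (lemma, pos),
    ).fetchall()


result = lemma_pos_by_century(conn, "decir", "CONJ")
print(result)  # [(13, 40), (19, 85)]
```

In a layout like this, further annotation (synonyms, user-defined word lists) would be one more table joined on `word_id`, which is presumably what allows annotation to grow without changing the n-gram tables themselves.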

/content/journals/10.1075/ijcl.10.3.02dav
2005-01-01
2019-12-05

  • Article Type: Research Article
Keyword(s): historical, n-grams, relational databases, Spanish, SQL