The advantage of using relational databases for large corpora: Speed, advanced queries, and unlimited annotation

Mark Davies

doi:10.1075/ijcl.10.3.02dav

ISSN 1384-6655
E-ISSN: 1569-9811


Author(s): Mark Davies ¹

Affiliations:
¹ Brigham Young University
Source: International Journal of Corpus Linguistics, Volume 10, Issue 3, Jan 2005, p. 307 - 334
DOI: https://doi.org/10.1075/ijcl.10.3.02dav

Abstract

Relational databases can be used to create large corpora that provide both very good search performance and a wide range of queries. This paper outlines how this approach has been used to create the Corpus del Español, which contains 100 million words of Spanish text from the 1200s to the 1900s. The main databases are composed of n-gram tables (all unique 1-, 2-, 3-, and 4-word sequences) and the associated frequency of each n-gram in each century (historical Spanish) and register (Modern Spanish). These tables are then joined to other tables containing part of speech, lemma, synonyms, and user-defined lists of words and lemmas. There is essentially no limit to the amount of annotation that can be added in additional tables (with little or no impact on performance), and the SQL-based queries allow a wide range of searches that are not available with traditional corpora.
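The design described in the abstract (n-gram frequency tables joined to separate annotation tables) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the table names `bigrams` and `lexicon`, their columns, and the frequency counts are invented toy values, not the actual schema or data of the Corpus del Español. The sketch uses Python's standard `sqlite3` module and searches for adjectives that follow any form of a given noun lemma, with frequencies summed per century.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Annotation table: one row per word form (hypothetical schema).
        CREATE TABLE lexicon (word TEXT PRIMARY KEY, lemma TEXT, pos TEXT);
        -- N-gram table: one row per unique bigram per century.
        CREATE TABLE bigrams (w1 TEXT, w2 TEXT, century INTEGER, freq INTEGER);
        CREATE INDEX idx_bigrams_w1 ON bigrams (w1);
        CREATE INDEX idx_bigrams_w2 ON bigrams (w2);
    """)
    # Toy annotation rows, invented for this example.
    conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
        ("casa", "casa", "noun"),
        ("casas", "casa", "noun"),
        ("blanca", "blanco", "adj"),
        ("blancas", "blanco", "adj"),
        ("grande", "grande", "adj"),
        ("la", "el", "det"),
    ])
    # Toy frequency rows, invented for this example.
    conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?, ?)", [
        ("casa", "blanca", 1800, 12),
        ("casas", "blancas", 1800, 5),
        ("casa", "grande", 1800, 7),
        ("casa", "blanca", 1900, 20),
        ("la", "casa", 1900, 90),
    ])
    return conn


def adjectives_after_lemma(conn, lemma):
    """Sum bigram frequencies per century for adjectives following a lemma."""
    query = """
        SELECT l2.lemma, b.century, SUM(b.freq) AS total
        FROM bigrams AS b
        JOIN lexicon AS l1 ON l1.word = b.w1
        JOIN lexicon AS l2 ON l2.word = b.w2
        WHERE l1.lemma = ? AND l2.pos = 'adj'
        GROUP BY l2.lemma, b.century
        ORDER BY b.century, total DESC
    """
    return conn.execute(query, (lemma,)).fetchall()


conn = build_demo_db()
rows = adjectives_after_lemma(conn, "casa")
for row in rows:
    print(row)
```

Because the annotation lives in its own table, adding a further layer (synonyms, say, or a user-defined word list) would mean adding another table and another join, leaving the n-gram table unchanged. This is consistent with the abstract's point about extensible annotation, though the real system's layout may differ.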




Article Type: Research Article

Keyword(s): historical; n-grams; relational databases; Spanish; SQL
