A new computing method for extracting contiguous phraseological sequences from academic text corpora

Naixing Wei; Jingjie Li

doi:10.1075/ijcl.18.4.03wei

ISSN 1384-6655
E-ISSN: 1569-9811


Author(s): Naixing Wei ¹ and Jingjie Li ²

Affiliations:
¹ Beihang University

² Donghua University
Source: International Journal of Corpus Linguistics, Volume 18, Issue 4, Jan 2013, p. 506 - 535
DOI: https://doi.org/10.1075/ijcl.18.4.03wei

Abstract

This study aims to develop a new computing method for extracting contiguous phraseological sequences (PSs) of various lengths from academic text corpora by measuring the internal associations of n-grams. We construct a new normalizing algorithm, based on a probability-weighted average, to refine the mutual information (MI) measure and enhance precision in extracting PSs from corpora. The method is applied to data from a medium-sized corpus of academic English. Results indicate that the new MI measure provides statistics that better reveal the internal associations within an n-gram, regardless of its size. Lexico-grammatical sequences extracted with this method are more complete and less arbitrary in terms of grammar and semantics. The method can be applied to a variety of linguistic phenomena, ranging from well-established phrases to likely phrasal entities, and thus has potential practical applications in corpus-based studies of phraseology and in natural language processing.
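The abstract does not give the formula, so the following is only an illustrative sketch of one way an MI score could be extended from bigrams to longer n-grams. It assumes that an n-gram is split at each internal position into a left and a right part (a "pseudo-bigram"), that pointwise MI is computed for each split, and that the split scores are averaged with weights proportional to the product of the two parts' probabilities. The function names (`ngram_counts`, `split_scores`, `weighted_mi`) and the weighting scheme are hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every contiguous n-gram of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def split_scores(counts, gram, total):
    """Return (weight, MI) for each binary split of `gram`.

    Probabilities are approximated as count / total tokens for all
    n-gram lengths (a simplifying assumption of this sketch).
    """
    if counts[gram] == 0:
        raise ValueError("n-gram not observed in the corpus")
    p_whole = counts[gram] / total
    scores = []
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total
        p_right = counts[gram[k:]] / total
        mi = math.log2(p_whole / (p_left * p_right))
        # Assumed weight: probability mass of the two parts.
        scores.append((p_left * p_right, mi))
    return scores


def weighted_mi(counts, gram, total):
    """Probability-weighted average of the split MI scores."""
    scores = split_scores(counts, gram, total)
    weight_sum = sum(w for w, _ in scores)
    return sum(w * mi for w, mi in scores) / weight_sum


tokens = "a b a b c d".split()
counts = ngram_counts(tokens, max_n=3)
bigram_score = weighted_mi(counts, ("a", "b"), len(tokens))
trigram_score = weighted_mi(counts, ("a", "b", "c"), len(tokens))
```

For a bigram there is only one split, so under these assumptions the score reduces to ordinary pointwise MI; for longer n-grams it always lies between the lowest and highest split score.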


Article Type: Research Article

Keyword(s): internal association; n-grams; phraseology; probability-weighted average; pseudo-bigram transformation

