Down-sampling from hierarchically structured corpus data

Abstract

Resource constraints often force researchers to downsize the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: year, gender, genre, frequency, and phonological context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 subsamples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
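The two selection schemes described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the "author- and verb-specific token count" is the number of hits sharing the same author and verb, and it draws the weighted sample without replacement using Efraimidis-Spirakis random keys, a technique swapped in here for convenience. The function names, the `author` and `verb` fields, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def simple_downsample(tokens, k, rng):
    """Simple down-sampling: every hit has the same selection probability."""
    return rng.sample(tokens, k)


def structured_downsample(tokens, k, rng):
    """Structured down-sampling (sketch): weight each hit by the inverse of
    the number of hits that share its author and verb, then draw k hits
    without replacement.

    ASSUMPTION: the cell is the (author, verb) combination.
    ASSUMPTION: Efraimidis-Spirakis keys are used for the weighted draw.
    """
    cell_sizes = Counter((t["author"], t["verb"]) for t in tokens)
    keyed = []
    for t in tokens:
        weight = 1.0 / cell_sizes[(t["author"], t["verb"])]
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids u == 0
        keyed.append((u ** (1.0 / weight), t))
    # The k largest keys form a weighted sample without replacement.
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in keyed[:k]]


# Hypothetical toy data: one prolific author-verb cell and ten small cells.
tokens = [{"author": "A", "verb": "say", "id": i} for i in range(900)]
for j in range(10):
    tokens += [
        {"author": f"B{j}", "verb": f"verb{j}", "id": 900 + 10 * j + i}
        for i in range(10)
    ]

rng = random.Random(1)
simple = simple_downsample(tokens, 100, rng)
structured = structured_downsample(tokens, 100, rng)

# Count how many selected hits come from the dominant cell under each scheme.
dominant_simple = sum(t["author"] == "A" for t in simple)
dominant_structured = sum(t["author"] == "A" for t in structured)
print(dominant_simple, dominant_structured)
```

In this toy setting the dominant cell supplies most of the simple subsample, while the structured subsample spreads its hits more evenly across authors and verbs. Repeating both draws many times and refitting a model to each subsample would mirror, in outline, the comparison of 500 subsamples per scheme reported in the abstract.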

/content/journals/10.1075/ijcl.23079.son
2024-03-25
2024-10-04
Article Type: Research Article
Keywords: data structure; down-sampling; study design; thinning; methodology