Volume 3, Issue 1
  • ISSN 2542-9477
  • E-ISSN: 2542-9485



This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as enTenTen, ukWaC or ruWac, by means of automatic text classification. First, the paper presents a deep learning model suitable for classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying the automatic classification model to these corpora and compares their composition. Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across languages.






  • Article Type: Research Article
  • Keyword(s): automatic genre identification; deep learning; interpreting neural networks