Volume 47, Issue 1
  • ISSN 0378-4169
  • E-ISSN: 1569-9927

Abstract

Summary

La Nature (1873–1962) is a French popular science magazine that spanned a long period and a wide range of topics. It is available as OCR-processed archives, which form a corpus that is simultaneously diachronic, heterogeneous, and noisy. Although these characteristics make it complex to analyze, La Nature is of great interest for studies on the evolution of thought in science, technology, and even politics. The work presented in this article is part of research on the semantic annotation of these archives, which aims to discover clues for exploring them. One type of clue that has not been explored in a complex corpus such as La Nature is named entities, or more specifically, the binomial names that refer to the Linnean classification of life. To overcome this complexity, the concept of a Competent Reader, who can detect binomial names even when they are obsolete, non-standard, or defaced by OCR, is introduced. By imitating a Competent Reader, our approach, which we call the Competent Reader Imitation (CRI), combines a rule-based approach with a frequency argument. We show that this innovative method is robust to numerous variations and consistently achieves an F-measure of about 70% despite diachronicity, heterogeneity, and noise, which are all known to impede named entity recognition. Our method has many potential applications, such as the study of chemical names and names of scientific and technical artifacts, which could benefit from the Competent Reader imitation approach. Beyond our work on La Nature, we hope this paper provides a set of tools and methods that are easily understandable, frugal, and usable by a general public interested in exploring similar historical corpora.
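
The idea of pairing a surface rule with a frequency argument can be illustrated with a minimal sketch. This is not the method described in the article, whose rules are not given in the abstract. The pattern, the function name `find_binomial_candidates`, and the threshold `max_lowercase_count` are all assumptions made for illustration. The rule keeps a capitalized token followed by a lowercase token. The frequency check then discards candidates whose first token also occurs in lowercase in the text, on the assumption that a genus name is almost never written in lowercase while an ordinary sentence-initial word usually is.

```python
import re
from collections import Counter

# Hypothetical surface rule: a capitalized genus-like token followed by a
# lowercase epithet-like token. ASCII letters only, so accented characters
# and OCR damage are NOT handled by this sketch.
CANDIDATE = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")
WORD = re.compile(r"[A-Za-z]+")


def find_binomial_candidates(text, max_lowercase_count=0):
    """Return (genus, epithet) pairs that pass the rule and the frequency check.

    The frequency check is an illustrative assumption: a candidate is kept
    only if the lowercase form of its first token occurs at most
    `max_lowercase_count` times in the text.
    """
    # Count every all-lowercase word once per occurrence.
    lowercase_counts = Counter(w for w in WORD.findall(text) if w.islower())
    kept = []
    for match in CANDIDATE.finditer(text):
        genus, epithet = match.group(1), match.group(2)
        if lowercase_counts[genus.lower()] <= max_lowercase_count:
            kept.append((genus, epithet))
    return kept
```

On a short English passage that mentions "Canis lupus" and "Canis familiaris", the rule alone would also match pairs such as "The wolf", and the frequency check removes them because "the" occurs in lowercase elsewhere in the passage. A real corpus such as the one described above would need a far more tolerant rule to cope with obsolete spellings and OCR errors.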

/content/journals/10.1075/li.00107.mor
2024-10-31
2025-06-18

  • Article Type: Research Article
Keyword(s): binomial names; digital sufficiency; historical corpus; named-entity recognition