Volume 11, Issue 1
  • ISSN: 2213-8706
  • E-ISSN: 2213-8714



Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance the number of tokens and types to enhance model adaptability while controlling complexity. Although subword tokenizers such as Byte Pair Encoding (BPE) overcome many limitations of word tokenizers, they have difficulty handling non-Latin languages and depend heavily on extensive training data and computational resources to grasp the nuances of multiword expressions (MWEs). This article argues that tokenizers, more than mere technical tools, should draw inspiration from cognitive science research on human language processing. It then introduces the “Principle of Least Effort” from cognitive science, which holds that humans naturally seek to reduce cognitive effort, and discusses the benefits of this principle for tokenizer development. Based on this principle, the paper proposes the Less-is-Better (LiB) model as a new approach to tokenization for large language models (LLMs). The LiB model can autonomously learn an integrated vocabulary consisting of subwords, words, and MWEs, which effectively reduces both the number of tokens and the number of types. Comparative evaluations show that the LiB tokenizer outperforms existing word and BPE tokenizers, presenting an innovative method for tokenizer development and hinting that future tokenizers grounded in cognitive science may be more efficient.






  • Article Type: Research Article
Keyword(s): language model; tokenization; tokenizer