Volume 24, Issue 1
  • ISSN 0929-9971
  • E-ISSN: 1569-9994


In this paper, we present NWJC2Vec, a word embedding dataset constructed from the ‘NINJAL Web Japanese Corpus’ (NWJC), a Web-crawled text corpus containing 25.8 billion tokens. We construct two versions of the dataset: one based on surface forms, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the ‘Word List by Semantic Principles’ (Bunrui Goihyo).






  • Article Type: Research Article
Keyword(s): Japanese language; thesaurus; web corpus; word embedding