Volume 24, Issue 1
  • ISSN 0929-9971
  • E-ISSN: 1569-9994

Abstract

In this paper, we present NWJC2Vec, a word embedding dataset constructed from the ‘NINJAL Web Japanese Corpus (NWJC)’. NWJC is a Web-crawled text corpus that contains 25.8 billion tokens. We construct two versions of the dataset: one based on surface forms, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the ‘Word List by Semantic Principles (Bunrui Goihyo)’.

/content/journals/10.1075/term.00011.asa
2018-05-31
2019-10-22
  • Article Type: Research Article
  • Keyword(s): Japanese language, thesaurus, web corpus, word embedding