Release
WAP's AI Research Institutes Announce Industry-Academia Joint Research Results
2022/02/25
Works Applications (Head Office: Chiyoda-ku, Tokyo; CEO: Osamu Hata) announced today the results of joint industry-academia research between Works Tokushima Artificial Intelligence NLP Laboratory, the AI research institute of Works Applications Group, and the National Institute for Japanese Language and Linguistics (hereinafter "NINJAL") of the National Institutes for the Humanities, an Inter-University Research Institution. The laboratory has developed "chiTra," a large-scale Japanese pre-trained model trained on the NINJAL Web Japanese Corpus (NWJC), the largest Japanese-language corpus in Japan, using the laboratory's morphological analyzer "Sudachi," and released it today free of charge as open data.

1. Background of the release
Works Tokushima Artificial Intelligence NLP Laboratory has released large-scale language resources for Japanese language processing free of charge, including the large-scale Japanese morphological analysis dictionary "SudachiDict"*1 and the large-scale Japanese word distributed representation "chiVe"*2. These language resources are used not only by Works Applications Group but also by a wide range of companies, organizations, and groups to improve the accuracy of information retrieval and text classification in big data applications.
To promote further development of natural language processing research and business applications, we have developed the large-scale Japanese pre-trained model "chiTra" and released it free of charge under a license that permits commercial use.
URL for public release: https://github.com/WorksApplications/SudachiTra
2. What is the large-scale Japanese pre-trained model "chiTra"?
Pre-trained language models improve the accuracy of AI text understanding by learning a language from large datasets, for example by learning to predict a word from the words around it. This technology has attracted attention in recent years because it is highly versatile and can be applied to a wide variety of natural language processing tasks, but training a large-scale model from scratch on one's own is extremely costly.
The newly released "chiTra" is a large-scale, practical pre-trained model for Japanese. It was built with BERT (Bidirectional Encoder Representations from Transformers)*5 on the NINJAL Web Japanese Corpus (NWJC)*3, the largest Japanese-language corpus in Japan, analyzed with Sudachi*4, which has the largest vocabulary in Japan. It makes advanced natural language processing easier to achieve.
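As a rough illustration of the word-prediction idea described above, the following minimal Python sketch guesses a hidden word from its left and right neighbors by simple counting. It is a count-based stand-in, not BERT and not chiTra's actual training procedure. The tiny corpus and the function names `train` and `predict_masked` are hypothetical and exist only for this example.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus, already split into words.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "cat", "slept", "on", "the", "mat"],
    ["the", "cat", "sat", "on", "the", "sofa"],
]


def train(sentences):
    """Count which word appears between each (left, right) neighbor pair."""
    table = defaultdict(Counter)
    for words in sentences:
        for i in range(1, len(words) - 1):
            context = (words[i - 1], words[i + 1])
            table[context][words[i]] += 1
    return table


def predict_masked(table, left, right):
    """Return the word seen most often between left and right, or None."""
    candidates = table.get((left, right))
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


model = train(corpus)
# "the [MASK] sat": "cat" was seen twice in this context, "dog" once.
print(predict_masked(model, "the", "sat"))  # prints: cat
```

A real pre-trained model replaces this lookup table with a neural network trained on billions of words, which is where the training cost mentioned above comes from.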
Features of the large-scale Japanese pre-trained model "chiTra"
■ Support for diverse documents
By using the NINJAL Web Japanese Corpus (NWJC), the largest Japanese-language corpus in Japan, as training data, it supports diverse expressions and documents from a variety of domains.
■ Support for diverse vocabulary
In Japanese, the same word can be written in several different ways, as with the spelling variants of "mikka" or of the word for moving (a home, office, etc.), and this orthographic variation can have a negative impact on pre-trained models. chiTra uses the rich vocabulary and notation normalization functions of the morphological analyzer Sudachi to suppress the negative effects of such variation.
■ Easy-to-use packaging
It is compatible with Hugging Face*6, a deep learning framework for natural language processing, and can be used smoothly for various NLP tasks. In addition, chiTra's models are publicly available on Open Data on AWS*7 for easy access.
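The notation normalization described under "Support for diverse vocabulary" can be pictured with a minimal Python sketch. This is not Sudachi's API. The variant table `NORMALIZED_FORM` and the function `normalize` are hypothetical, and the spelling variants were chosen for illustration only; a real analyzer derives the normalized form from its dictionary rather than from a hand-written map.

```python
# Hypothetical table mapping spelling variants to one normalized form.
# Assumption: the input has already been split into words.
NORMALIZED_FORM = {
    "引越し": "引っ越し",
    "引越": "引っ越し",
    "打合せ": "打ち合わせ",
    "打合わせ": "打ち合わせ",
}


def normalize(tokens):
    """Replace each known variant with its normalized form; keep the rest."""
    return [NORMALIZED_FORM.get(token, token) for token in tokens]


a = normalize(["引越し", "の", "打合せ"])
b = normalize(["引っ越し", "の", "打ち合わせ"])

# Both spellings now map to the same token sequence.
print(a == b)  # prints: True
```

Because both inputs become the same sequence, a model trained on normalized text sees one word where it would otherwise see several unrelated spellings.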
In the future, we will address the following issues and continuously update the model to better capture the nature of the Japanese language.
■ Tokenization suitable for Japanese graphemes
Instead of simple character-based segmentation, we will use Sudachi's multi-granular segmentation feature to create a model more suited to Japanese graphemes through subword segmentation that takes into account Japanese word structure and script types.
■ Support for diverse expressions
By using information from the Sudachi synonym dictionary, we will try to cover expressions that do not appear in the training data, combining machine learning with human knowledge.
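To make the idea of multi-granular segmentation under "Tokenization suitable for Japanese graphemes" more concrete, the sketch below shows one way a tokenizer could choose among coarse, medium, and fine splits of a compound word. It is an assumption about how such a scheme might work, not the planned chiTra implementation and not Sudachi's API. The table `SEGMENTATIONS` is hand-written, loosely modeled on Sudachi's three split modes (C, B, A), and the function `tokenize` and the vocabularies are hypothetical.

```python
# Hypothetical segmentations of one compound, ordered coarse to fine
# (loosely modeled on Sudachi's split modes C, B, and A).
SEGMENTATIONS = {
    "国家公務員": [
        ["国家公務員"],          # coarse: one unit
        ["国家", "公務員"],      # medium
        ["国家", "公務", "員"],  # fine
    ],
}


def tokenize(word, vocab):
    """Return the coarsest segmentation whose pieces are all in vocab.

    Falls back to single characters when no segmentation fits.
    """
    for pieces in SEGMENTATIONS.get(word, [[word]]):
        if all(piece in vocab for piece in pieces):
            return pieces
    return list(word)


vocab = {"国家", "公務員", "公務"}
print(tokenize("国家公務員", vocab))  # prints: ['国家', '公務員']
```

Compared with falling back to characters immediately, this keeps word-like units such as 公務員 intact whenever the vocabulary allows it.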
3. Paid Support Services
We provide paid maintenance services for the open source software and language resources released by Works Tokushima Artificial Intelligence NLP Laboratory, as well as consulting services on applying natural language processing with these resources. We support the business use of natural language processing, including "chiTra," for example through consulting on using SudachiDict and chiVe to improve search accuracy.
For inquiries about paid support services, please contact:
SaaS Business Division, Works Applications Enterprise, Inc.
E-Mail : [email protected]
*1 Large-scale Japanese morphological analysis dictionary SudachiDict
A high-quality dictionary for Japanese natural language processing that contains a vocabulary of approximately 3 million words maintained by experts. Through continuous expansion, the dictionary also covers okurigana (kana suffix) variants, abbreviations, old-form kanji, synonyms, and misnomers.
https://github.com/WorksApplications/SudachiDict
*2 Large-scale Japanese Word Distributed Representation chiVe
A large-scale word distributed representation (word embedding) resource trained on the largest corpus in Japan, containing 25.8 billion words.
https://github.com/WorksApplications/chiVe
*3 NINJAL Web Japanese Corpus (NWJC)
A large-scale corpus built by collecting Japanese text from the Web on a scale of more than 10 billion words, with the aim of making it possible to investigate rare linguistic phenomena from linguistic, psychological, and information-processing perspectives. With 25.8 billion words, it is the largest corpus in Japan.
*4 Sudachi
https://github.com/WorksApplications/Sudachi
*5 BERT (Bidirectional Encoder Representations from Transformers)
A machine learning method for natural language processing announced by Google in 2018. It became a hot topic because it broke the record for the highest accuracy at that time in various natural language processing tasks.
*6 Hugging Face
A deep learning framework specialized for natural language processing provided by Hugging Face
https://huggingface.co/
*7 Open Data on AWS
A sponsorship program where AWS hosts publicly available data with public benefit value
https://aws.amazon.com/jp/opendata/
About Works Tokushima Artificial Intelligence NLP Laboratory
We are a research institute specializing in natural language processing (NLP), established in Tokushima Prefecture in February 2017. We are conducting research and development to propose a new way of working by realizing business efficiency and productivity improvement through the use of artificial intelligence, especially natural language processing.
The research results are used in the ERP package software "HUE" and SaaS product "HUE Works Suite" developed by the Works Applications Group. In addition, some of the results have been released as open source software under a license that allows commercial use, and are being used by many companies and research institutions.
About Works Applications Group
Since its establishment in 1996, Works Applications Group has provided products and services mainly to major Japanese companies as Japan's first packaged software company for business applications. Guided by its corporate philosophy of changing the concept of "work," making work more creative, increasing corporate productivity, and expanding corporate value, the Group will continue to develop as a solution provider with ERP at its core, aiming to be a partner in promoting DX for small, medium, and start-up companies as well as major corporations.
*Company names, product names and service names are trademarks or registered trademarks of their respective companies.
*The information in this release is current as of the date of publication, and is subject to change or withdrawal without notice. Please be aware that the forecasts and other forward-looking information in this release are based on uncertainties and may differ from actual results.
For inquiries about this article, please contact:
Public Relations, Works Applications, Inc.
TEL : 03-3512-1400
FAX : 03-3512-1401
E-mail: [email protected]