Title: GSK: Development and Distribution of Resources
1 GSK: Development and Distribution of Resources
Licensing and Distribution of Resources and
Applications
- Hitoshi ISAHARA
- GSK: Gengo Shigen Kyokai (Language Resource
Association)
- National Institute of Information and
Communications Technology (NICT)
2 Organizing Creation and Utilization of Language
Corpora
- Creating language corpora incurs costs.
- Using them requires a system for distributing corpora.
- Some distribution activities started in the early 1990s:
- 1992: LDC in the U.S.A.
- 1995: ELRA in Europe
3 Japanese Activities
- GSK: Gengo Shigen Kyokai (Language Resource Association)
- Launched in 1999.
- Reorganized as an NPO in 2003.
- Project accepted in 2005 for 3 years.
- Text corpora are its main concern at present.
- NII-SRC distributes speech corpora.
4 GSK and NII-SRC
- Language Resource Association (GSK)
- A nonprofit organization that collects and
distributes text and speech corpora.
- http://www.gsk.or.jp/
- NII-Speech Resources Consortium (NII-SRC)
- Collects and distributes most major speech
corpora.
- http://research.nii.ac.jp/src/eng/
- These two organizations aim to play central roles
in collecting and distributing speech and
language corpora in Japan.
5 JEITA (Japan Electronics and Information
Technology Industries Association)
[Diagram of related organizations; labels:]
- GSK
- NII-SRC
- Knowledge Information Processing Technologies
Committee
- NII: National Institute of Informatics
- NICT: National Institute of Information and
Communications Technology
- Language Resource Sub-committee
- TCL
- Natural Language Processing Portal Site
- SHACHI: Language Resource Metadata DB
6 Purpose of GSK
- Collection, distribution, investigation,
research, and standardization of electronic data
and software tools necessary for the promotion of
science, technology, education and industry
concerning natural language.
7 GSK Organization
- President
- Two vice presidents
- 11 board members
- 25 steering committee members
- All are volunteers.
8 No-fee Distribution
[Diagram labels: Corpus, Provider, User, Distribution permission, GSK, Payment, Agreement]
As a rule, the user bears the cost of handling the
corpora, though the corpus itself is free of
charge.
9 Agency
[Diagram labels: Agency, Request, GSK, Provider, User, Form, Commission, Payment, Agreement]
Corpus providers entrust GSK with handling
requests received from users. GSK mediates
between users and providers.
10 Advertising
[Diagram labels: Provider, User, Ad request, GSK, Publicity, Ad rate, Payment, Agreement]
Corpus providers entrust GSK with advertising
useful information about their data or corpora.
11 Some Examples of GSK Corpora
- JEITA Multimodal Corpus
- Japanese Web N-gram Version 1
- CICC Multilingual Dictionary
- IPAL Lexicon of Basic Japanese
12 JEITA Multimodal Corpus
- A corpus of person-to-person, task-oriented
dialogues. It includes 80 minutes of video covering 9
conversations on the topics of faces and
travel. The speech data are transcribed and
annotated with morphemes, dialogue structure,
and prosody.
- Contained in 1 DVD-R (800 MB).
13 Japanese Web N-gram Version 1
- N-grams extracted from publicly available
Japanese web pages crawled by Google. Pages that
require special permission to browse or that are
marked noarchive/noindex are not included.
N-grams (1- to 7-grams) with frequency greater
than 20 were extracted from approximately 20
billion sentences.
- Contained in 6 DVD-Rs (26 GB after gzip
compression).
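As a rough illustration of the extraction described on this slide, the sketch below counts all 1- to 7-grams over pre-tokenized sentences and keeps those whose frequency exceeds a threshold. It is not the procedure used to build the distributed corpus. The function name `count_ngrams`, its parameters, and the toy data are hypothetical, and the sketch assumes the sentences have already been segmented into word tokens (Japanese text would first need a morphological analyzer).

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

NGram = Tuple[str, ...]


def count_ngrams(
    sentences: Iterable[Sequence[str]],
    max_n: int = 7,
    min_exclusive: int = 20,
) -> Dict[NGram, int]:
    """Count 1..max_n-grams and keep those seen more than min_exclusive times.

    Assumption: each sentence is already a sequence of word tokens.
    """
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence; the range is
            # empty when the sentence is shorter than n tokens.
            for start in range(len(tokens) - n + 1):
                counts[tuple(tokens[start:start + n])] += 1
    # "Frequency greater than 20" on the slide is read here as a strict cutoff.
    return {gram: c for gram, c in counts.items() if c > min_exclusive}


# Toy usage with a lowered threshold (hypothetical data).
toy = [["gengo", "shigen", "kyokai"]] * 3
frequent = count_ngrams(toy, max_n=2, min_exclusive=2)
```

With the toy input, every unigram and bigram occurs three times, so all of them pass the lowered threshold. At web scale the counts would not fit in a single in-memory `Counter`, so a real pipeline would need sharded or streaming aggregation.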
14 CICC Multilingual Dictionary
- A collection of Malay, Indonesian, Chinese, and
Thai dictionaries containing 50,000 basic words
with POS tags; some also contain English translations.
A technical term dictionary for each language is
also available.
- Contained in 1 CD-ROM for each language.
- CICC: Center of the International Cooperation
for Computerization
15 IPAL Lexicon of Basic Japanese
- Contains 861 verbs, 136 adjectives, and 1,081 nouns,
plus a glossary. English translations are also provided
for the nouns contained in the glossary.
- Contained in 1 CD-ROM.
16 Summary
- 1. There are several distributors of language
resources in Japan.
- 2. GSK is the only language resource consortium
qualified as an NPO in Japan.
- 3. GSK plans to collaborate with the Language Grid
Project.