GSK: Development and Distribution of Resources

1
GSK: Development and Distribution of Resources
Licensing and Distribution of Resources and
Applications

Hitoshi ISAHARA
GSK: Gengo Shigen Kyokai (Language Resource
Association)
National Institute of Information and
Communications Technology (NICT)

2
Organizing Creation and Utilization of Language
Corpora

Creating language corpora incurs costs.
Utilizing them requires a system for distributing corpora.
Such activities started in the early 1990s.
1992: LDC in the U.S.A.
1995: ELRA in Europe

3
Japanese Activities

GSK: Gengo Shigen Kyokai
(Language Resource Association)
Launched in 1999.
Reformed as an NPO in 2003.
Project accepted in 2005 for 3 years.
Text corpora are its main concern at present.
NII-SRC distributes speech corpora.

4
GSK and NII-SRC

Language Resource Association (GSK)
A nonprofit organization collecting and
distributing text and speech corpora.
http://www.gsk.or.jp/
NII-Speech Resources Consortium (NII-SRC)
Collects and distributes most major speech
corpora.
http://research.nii.ac.jp/src/eng/
These two organizations aim to play central roles
in collecting and distributing speech and
language corpora in Japan.

5
[Diagram of related organizations and resources; labels only]
JEITA (Japan Electronics and Information
Technology Industries Association)
GSK
NII-SRC
Knowledge Information Processing Technologies
Committee
NII: National Institute of Informatics
NICT: National Institute of Information and
Communications Technology
Language Resource Sub-committee
TCL
Natural Language Processing Portal Site
SHACHI Language Resource Metadata DB

6
Purpose of GSK

Collection, distribution, investigation,
research, and standardization of electronic data
and software tools necessary for the promotion of
science, technology, education and industry
concerning natural language.

7
GSK Organization

President
Two vice presidents
11 board members
25 steering committee members
All are volunteers.

8
No-fee Distribution

[Diagram: Provider, GSK, User. Labels: Corpus, Distribution permission, Payment, Agreement]
As a rule, the cost of handling corpora falls on
the user, though the corpus itself is free of
charge.

9
Agency

[Diagram: User, GSK, Provider. Labels: Request, Form, Commission, Payment, Agreement]
The providers of the corpora entrust GSK with
handling requests received from users. GSK mediates
between users and providers.

10
Advertising

[Diagram: Provider, GSK, User. Labels: Ad request, Publicity, Ad rate, Payment, Agreement]
Corpus providers entrust GSK with advertising
useful information on their data or corpora.

11
Some Examples of GSK Corpora

JEITA Multimodal Corpus
Japanese Web N-gram Version 1
CICC Multilingual Dictionary
IPAL Lexicon of Basic Japanese

12
JEITA Multimodal Corpus

A corpus of person-to-person task-oriented
dialogues. It includes 80 min. of video covering 9
conversations on the topics of faces and
travel. The speech data are transcribed and
annotated with morphemes,
dialogue structure, and prosody. Contained in 1
DVD-R (800 MB).

13
Japanese Web N-gram Version 1

N-grams extracted from publicly available
Japanese webpages crawled by Google.
Pages requiring special permission to browse or
marked with noarchive/noindex are not
included. N-grams (1-7) with frequency greater
than 20 were extracted from approximately 20
billion sentences.
Contained in 6 DVD-Rs (26 GB after gzip
compression).
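
As a rough illustration of the extraction described on this slide, the sketch below counts all n-grams of length 1 to 7 over pre-tokenized sentences and keeps those whose frequency exceeds a threshold. It is only a toy, in-memory sketch: the function names (count_ngrams, keep_frequent), the toy data, and the assumption that sentences arrive already split into tokens are all illustrative and are not taken from the actual corpus construction, which is not described here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

NGram = Tuple[str, ...]


def count_ngrams(sentences: Iterable[List[str]], max_n: int = 7) -> Counter:
    """Count every n-gram of length 1..max_n in pre-tokenized sentences.

    Tokenization is assumed to have been done elsewhere (hypothetical input).
    """
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n over the sentence; a sentence
            # shorter than n yields an empty range and contributes nothing.
            for start in range(len(tokens) - n + 1):
                counts[tuple(tokens[start:start + n])] += 1
    return counts


def keep_frequent(counts: Counter, threshold: int = 20) -> Dict[NGram, int]:
    """Keep only n-grams whose frequency is strictly greater than threshold."""
    return {gram: c for gram, c in counts.items() if c > threshold}


# Toy data with a low threshold so the filter is visible.
toy_sentences = [["a", "b", "c"], ["a", "b"], ["a"]]
toy_counts = count_ngrams(toy_sentences, max_n=7)
frequent = keep_frequent(toy_counts, threshold=1)
```

With the toy data, only the n-grams seen more than once survive the filter: ("a",), ("b",), and ("a", "b"). A collection on the scale of billions of sentences would not fit in a single in-memory counter, so this shows the counting and thresholding logic only.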

14
CICC Multilingual Dictionary

A collection of Malay, Indonesian, Chinese, and
Thai dictionaries containing 50,000 basic words
with POS tags; some contain English translations.
A technical term dictionary for each language is
also available.
Contained in 1 CD-ROM for each language.
CICC: Center for the International Cooperation
for Computation

15
IPAL Lexicon of Basic Japanese

Contains
861 verbs, 136 adjectives, and 1,081 nouns, plus a
glossary. English translations are also provided for
nouns contained in the glossary.
Contained in 1 CD-ROM.