GSK: Development and Distribution of Resources - PowerPoint PPT Presentation

Slides: 17
Provided by: yama3

1
GSK: Development and Distribution of Resources
Licensing and Distribution of Resources and
Applications
  • Hitoshi ISAHARA
  • GSK: Gengo Shigen Kyokai (Language Resource
    Association)
  • National Institute of Information and
    Communications Technology (NICT)

2
Organizing Creation and Utilization of Language
Corpora
  • Creating language corpora incurs costs.
  • Utilizing them requires a system for distributing corpora.
  • Some activities started in the early 1990s.
  • 1992: LDC in the U.S.A.
  • 1995: ELRA in Europe

3
Japanese Activities
  • GSK: Gengo Shigen Kyokai
  • (Language Resource Association)
  • Launched in 1999.
  • Reorganized as an NPO in 2003.
  • A three-year project was accepted in 2005.
  • Text corpora are its main concern at present.
  • NII-SRC distributes speech corpora.

4
GSK and NII-SRC
  • Language Resource Association (GSK)
  • A nonprofit organization collecting and
    distributing text and speech corpora.
  • http://www.gsk.or.jp/
  • NII-Speech Resources Consortium (NII-SRC)
  • Collects and distributes most major speech
    corpora.
  • http://research.nii.ac.jp/src/eng/
  • These two organizations aim to play central roles
    in collecting and distributing speech and
    language corpora in Japan.

5
[Diagram of related organizations. Labels:]
  • JEITA (Japan Electronics and Information
    Technology Industries Association)
  • Knowledge Information Processing Technologies
    Committee
  • Language Resource Sub-committee
  • GSK
  • NII-SRC
  • NII: National Institute of Informatics
  • NICT: National Institute of Information and
    Communications Technology
  • TCL
  • Natural Language Processing Portal Site
  • SHACHI Language Resource Metadata DB

6
Purpose of GSK
  • Collection, distribution, investigation,
    research, and standardization of electronic data
    and software tools necessary for the promotion of
    science, technology, education and industry
    concerning natural language.

7
GSK Organization
  • President
  • Two vice presidents
  • 11 board members
  • 25 steering committee members
  • All are voluntary workers.

8
No-fee Distribution
[Diagram labels: Corpus, Provider, User, Distribution permission, GSK, Payment, Agreement]
  • As a rule, the cost of handling corpora falls on
    the user, though the corpus itself is free of
    charge.

9
Agency
[Diagram labels: Request, GSK, Provider, User, Form, Commission, Payment, Agreement]
  • The providers of the corpora entrust GSK with
    requests received from users. GSK mediates
    between users and providers.

10
Advertising
[Diagram labels: Provider, User, Ad request, GSK, Publicity, Ad rate, Payment, Agreement]
  • Corpora providers entrust GSK with advertising
    useful information on their data or corpora.

11
Some Examples of GSK Corpora
  • JEITA Multimodal Corpus
  • Japanese Web N-gram Version 1
  • CICC Multilingual Dictionary
  • IPAL Lexicon of Basic Japanese

12
JEITA Multimodal Corpus
  • A corpus of person-to-person task-oriented
    dialogues. It includes 80 min. of video for 9
    conversations on the topics of faces and
    travel. The speech data are transcribed and
    annotated with morphemes, dialogue structure,
    and prosody. Contained in 1 DVD-R (800 MB).

13
Japanese Web N-gram Version 1
  • N-grams extracted from publicly available
    Japanese web pages crawled by Google. Pages
    requiring special permission to browse or
    marked noarchive/noindex are not included.
    N-grams (1-7) with frequency greater than 20
    were extracted from approximately 20 billion
    sentences.
  • Contained in 6 DVD-Rs (26 GB after gzip
    compression).
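
The extraction described above (N-grams of length 1 to 7, kept only when their frequency exceeds 20) can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used to build the corpus: the function name `count_ngrams` and its parameters are hypothetical, and the input is assumed to be already split into tokens (Japanese text would first need word segmentation, which is not shown).

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_ngrams(
    sentences: Iterable[List[str]],
    max_n: int = 7,       # longest N-gram length, per the slide (1-7)
    threshold: int = 20,  # keep N-grams with frequency greater than this
) -> Dict[Tuple[str, ...], int]:
    """Count all 1..max_n-grams and keep those occurring more than `threshold` times."""
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n over the tokenized sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c > threshold}


# Toy input (assumed to be pre-tokenized); thresholds are lowered for illustration.
sample = [["a", "b", "c"]] * 3 + [["a", "b"]]
frequent = count_ngrams(sample, max_n=3, threshold=3)
print(sorted(frequent.items()))
```

Holding every count in memory works for a toy example only. At the scale quoted on the slide (about 20 billion sentences), counting would have to be distributed or done out of core.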

14
CICC Multilingual Dictionary
  • A collection of Malay, Indonesian, Chinese, and
    Thai dictionaries containing 50,000 basic words
    with POS tags; some contain English translations.
    A Technical Term Dictionary for each language is
    also available.
  • Contained in 1 CD-ROM for each language.
  • CICC: Center of the International Cooperation
    for Computerization

15
IPAL Lexicon of Basic Japanese
  • Contains 861 verbs, 136 adjectives, and 1,081
    nouns, plus a glossary. English translations are
    also provided for nouns contained in the glossary.
  • Contained in 1 CD-ROM.

16
Summary
  • 1. There are several distributors of language
    resources in Japan.
  • 2. GSK is the only consortium of language
    resources qualified as an NPO in Japan.
  • 3. GSK plans to collaborate with the Language
    Grid Project.