Title: Infrastructures for the Korean Language
1Infrastructures for the Korean Language
2Academic Society
- SIG-Korean Language Computing under Korea Information Science Society
- 300 members
- Korea Information Society
- linguistics oriented
3KIBS: Korea Information Base and Systems
- Purpose
- To improve Korean Language Processing Technology
- To promote Korean Software Industry
- In the planning phase (1993), targeted at Hangul word processor, machine translation and Korean linguistic research
- 1995 - 1997 (Phase 1): word
- Two-ministry joint project, Industry
- Ministry of Science and Technology, Ministry of Culture
- 1998 - 2000 (Phase 2): sentence
- Only by Ministry of Science and Technology, Industry
- To be evaluated in October 2000
- 2001 - 2003 (Phase 3): discourse
- Not decided
- http://kibs.kaist.ac.kr/
4King Sejong Project
- Purpose
- To promote Korean language research on the linguistics side
- To prepare for language planning
- for unification of South and North Korea
- for international use of Korean
- Sponsor: Ministry of Culture
- Period: 1998 - 2007 (10 years)
- Items
- corpus, dictionary, internationalization, terminology, education, font, old Korean, old Chinese characters
- http://www.sejong.or.kr/
5KIBS Architecture
[Figure: KIBS architecture diagram. Surviving labels: Terminology DB; User (Dictionary)]
6KIBS Introduction
- Title of Project
- KIBS I: Integrated Korean Information Base
- KIBS II: On Development of Deep-Level Processing and Quality Management Technology for Very Large Korean Information Base
- Outline
- Term: 1994.12.4 - 2004.9.30 (10 years)
- Sponsor: Ministry of Science and Technology
- Staff: 50 persons/year
7The Goal of First step
The Development of an Integrated Environment and Support Management System
- Standard Module Interface
- Corpus and Electronic Dictionary Development and Management System
- Korean Part-of-Speech Tagging System
- Korean Syntactic Tagging System
- Korean/English Alignment System
The Standardization of the Specification for Korean Information Base
- Terminological Data Base Development and Management System
- Standard Korean Input/Output Environment
- Standardized Methodology for the Construction of a Balanced Corpus
- Part-Of-Speech Transfer Dictionary Rules and an Example Package
The Construction of Korean Information Base
- Tree-Tagged Corpus
- Word-Level Narrative Speech Data Base
- Hand-written Hangul scripts of high frequency
8The Goal of Second step
Development/Management System of Electronic Dictionary for Sentence Analysis/Generation (100,000 entries)
- Syntactic Information Base for Syntactic Analysis/Generation
- Semantic Information Base for Semantic Analysis/Generation
- Additional Information on Language and GUI for Developing Applications
Terminology Dictionary and Development/Management System
- Terminology Entries
- Domain-specific Corpus for Terminology Building
- Sublanguage Analysis and Extraction of Terminology
Quality Management System for Language Information Processing
- Development/Management System for Information Base
- Development of Integrated Management System for Distributed Resources
9Development Tools
- Korean Concordance Program (KCP)
- Compound Noun Browser
- Corpus Browser
- Corpus Browser by Category
- Automatic English-to-Korean Transliteration System (TLEK)
- KAIST Ontology Browser
- Korean Morphological Analyser
- Korean Tagger
- Korean Syntactic Analyser
- Editing Support Tools for Electronic Dictionary
10Results Distribution
- Major Results
- The first (KIBS I): 1997.6 - present (80 sites)
- Text corpus: 10 million word phrases
- POS-tagged corpus: 1 million word phrases
- Syntactic-structure-tagged corpus: 10 thousand sentences
- TDMS, Speech DB samples, Hand-written character DB samples
- The second (KIBS II): 1998.12 - present (140 sites)
- Raw corpus: 10 million word phrases; POS-tagged corpus: 200 thousand word phrases
- The third (KIBS III): 2000 (pending)
- Proper noun: 10 thousand entries; Compound noun: 20 thousand entries; Verb sentence pattern dictionary: 3 thousand entries, ...
- Plan to maintain and distribute ...
11Integration of Electronic Dictionaries
- Dictionaries: total 420K entries (current estimate)
- Machine Readable Dictionary (Hangul Society): 200K entries
- Compound Noun, Proper Noun Classification, Internal Semantic Structure: 50K entries
- Searched Compound Noun, Proper Noun: open
- Verb Subcategorization: 10K frames (K-J comparison)
- Thesaurus (Korean-Japanese-Chinese-English, quality not so good): 150K entries
- Usage from corpus for each sense
- Functional words
- Problem
- Sense classification standardization
- Character code for Korean, Japanese, Chinese (most important problem): now being converted to Unicode
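The character-code problem named above can be illustrated with a minimal Python sketch. The slides do not say which legacy encodings the source dictionaries used; EUC-KR, Shift_JIS and GB2312 are assumptions chosen as typical encodings of the period, and the function names are hypothetical.

```python
# Assumed legacy encodings per language (not stated in the slides).
LEGACY_CODECS = {"ko": "euc_kr", "ja": "shift_jis", "zh": "gb2312"}


def to_unicode(raw: bytes, language: str) -> str:
    """Decode one dictionary entry from its assumed legacy encoding."""
    return raw.decode(LEGACY_CODECS[language])


def merge_entries(entries):
    """Decode (language, raw_bytes) pairs into one list of Unicode strings."""
    return [to_unicode(raw, language) for language, raw in entries]


if __name__ == "__main__":
    sample = [
        ("ko", "사전".encode("euc_kr")),
        ("ja", "辞書".encode("shift_jis")),
        ("zh", "词典".encode("gb2312")),
    ]
    # Once decoded, all entries share one character repertoire and can be
    # stored together, for example in a single UTF-8 file.
    print(merge_entries(sample))
```

The point of the sketch is only that entries from three incompatible byte encodings become comparable once each is decoded with the right codec; choosing the right codec per source is the part the slides flag as the hard problem.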
12Open through web
- Corpus KWIC for Korean and Japanese
- http://morph.kaist.ac.kr/kcp/
- Korean morphological analysis service
- http://morph.kaist.ac.kr/
- By email: send a text file and receive its POS tagging in reply
- Graphic editor/debugger for Korean morphology
- Project Status
- http://kibs.kaist.ac.kr/
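For readers unfamiliar with KWIC (keyword in context), the idea behind the corpus service listed above can be sketched as follows. This is not the KCP implementation; the function name, the whitespace tokenization and the substring match are assumptions. Substring matching is used because Korean word phrases carry attached particles; Japanese text, which has no spaces, would need a segmenter first.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, hit, right) contexts for every token containing keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if keyword in token:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    text = "세종대왕은 한글을 만들었다 그리고 한글은 과학적이다"
    for left, hit, right in kwic(text.split(), "한글", width=1):
        print(f"{left} [{hit}] {right}")
```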
13KORTERM
Korea Terminology Center for Language and Knowledge Engineering
http://korterm.org/ (English)
http://korterm.or.kr/ (Korean)
http://eafterm.org/ (East Asian Terminology)
14Goals of KORTERM
- Through World-Wide Terminology Collection and Its Standardization and Harmonization in Local Society,
- Distribution, Publication and Application in Language and Knowledge Engineering are promoted.
- Through Education and Consultation on Terminology R&D Methodology for Each Subject Field,
- High-Quality, Highly Reliable Terminology and Its Infrastructure and System are achieved.
Center of Terminology and Knowledge Engineering
15Phases and Subjects of KORTERM
Phase 4 (2008 - ): Maintenance and Extension
Phase 3 (2004-2007): Operation
- Continuous Extension and Management
- Terminology Study Promotion
- Distribution of Terminology Information Base
- Continuous Terminology Extension and Management
- Multi-lingual Terminology Integration
- Terminology Collection (Humanities and Social Science)
- Maintenance and Extension
- Large-Scale Knowledge Base for Terminology
- Terminology Education Curriculum Development
- Application Product Development
Phase 2 (2001-2003): Value-Added Working System
- Value-Added Terminology Integration
- Terminology Collection (Extended S&T)
- Extension Maintenance (Industry Standards)
- High-Quality Terminology
- Application in Language Industry
- Verification for High-Reliability and Distribution
Phase 1 (1998-2000): R&D Environment and Basic Data Collection
- Integration of Working Terminology
- Terminology Collection (Basic S&T, Industry Standard, Economics)
- Electronic Terminology (Publication)
- R&D Environment (System Standardization)
- Terminology Theory and Education Infrastructure
16R&D (1)
- Basic Data (Corpus)
- Corpus for Each Subject Domain
- Electronic Dictionary for Basic Vocabulary
- Everyday Vocabulary consists of General Vocabulary and Everyday Terminology
- Internationalization of Korean Language
- South-North Korean Terminology Standardization, Korean Language Input Methods
- Korean Language Engineering
- Standardized Term Use for Information Retrieval, Machine Translation and Document Classification
17R&D (2)
- Language Engineering
- Information Retrieval
- Effective Internet Information Creation and Information/Knowledge Acquisition
- Multi-lingualism
- Machine Translation
- Efficient Information Generation through Terminology and Vocabulary Collection and Standardization
- Wordprocessor
- High Productivity by Spelling Correction, Summarization and Efficient Use.
18R&D (3)
- Language, Information and Terminology
- Language Education
- Technical Thinking and Technical Communication
- Terminology-based Education
- Language Study
- Domain-specific Language Study
19Terminology Sponsors
- Support from Government, Organizations and Industry according to each specialty
- Ministry of Culture and Tourism (KORTERM Center Operation)
- Ministry of Science and Technology (R&D Fund)
- Ministry of Information and Telecommunication (R&D Fund)
- Ministry of Diplomacy and Trade
- Ministry of Industry and Resource
- Ministry of Education
- Korea Science and Technology Foundation (Event Support)
20Task Configuration
[Figure: task configuration diagram. Surviving labels: R&D; Industry; Living; Communication; Use; Terminology Information Environment; Application; Application-Specific Dictionary; Language Education Adaptable to Student; Language Education Environment; Language Knowledge Product; Terminology Symbolization; Grid Size Controller; Terminology Access Standard Channel; R&D Environment; International Term Standard; Terminology Standard; Terminological Conceptual Space; Standardization; Harmonization; Terminology Base (Collection); Non-standards]
21Large-Scale Speech/Language/Image DB Construction and Evaluation
Supported by Ministry of Science and Technology
Two-Year Project (1999.10-2001.10)
22Goals
Final Goal: Speech/Language/Image Evaluation Standardization
Organization
- Working Group Organization
- Survey and Planning
Specification Standardization
- Language: IR Test Suite and Evaluation Model Recommendation; MT Test Suite and Evaluation Model Recommendation
- Speech: Sentence-unit Speech DB; Prosody for Speech Synthesis
- Image: Image Attribute Format; Color-Lexical Entry; MPEG-7 Specification
Test Suite
- Language: IR/QA 90 queries/200K documents; MT 5,000 sentences
- Speech: word-unit telephone speech DB, 100 token 500
- Image: Image 300 kinds; Meta Data
23Question-Answering IR Test Suites
- Test Suites for IR/QA
- Documents
- 207,067 records (370MB)
- Newspapers
- Query Generation
- 90 queries (through analysis of 300 quiz queries)
- Queries for WH-questions and various other types of answers
- for NLP problem solving
- Relevant document set that includes the answer
- by using four commercial IR systems with 16 kinds of methods
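The slides do not describe how the outputs of the four IR systems were combined into a relevant document set. One common approach is pooling: take the top-ranked documents from every system run, merge them per query, and keep those that contain the answer. The sketch below assumes that approach; the function names, the `depth` cutoff and the plain substring check are all hypothetical.

```python
from collections import defaultdict


def pool_candidates(runs, depth=10):
    """Merge the top `depth` document ids of every run, per query.

    runs: {run_name: {query_id: [doc_id, ...] ranked best first}}
    """
    pool = defaultdict(set)
    for run in runs.values():
        for query_id, ranked_docs in run.items():
            pool[query_id].update(ranked_docs[:depth])
    return dict(pool)


def keep_answer_bearing(pool, documents, answers):
    """Keep pooled documents whose text contains the query's known answer."""
    return {
        query_id: sorted(d for d in doc_ids if answers[query_id] in documents[d])
        for query_id, doc_ids in pool.items()
    }
```

With four systems and 16 methods, `runs` would hold one entry per system and method combination that was actually executed; in practice a human judge would review the pooled candidates rather than rely on string matching alone.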
24English-Korean MT Test Suites
- Type Classification: about 300 kinds
- Test Sentences and Test Queries: 5,000 records
- Extracted from textbooks and grammar books (1999-2000)
- To be extracted from real usage such as the web and newspapers (2000-2001)
- Evaluation by Yes/No Question
- Tested on 4 commercial English-Korean MT systems
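Evaluation by yes/no question lends itself to a simple tally: each test sentence yields one pass or fail judgment, and results are summarized per system and per sentence type. The following is a minimal sketch under that reading; the record layout and the function name are assumptions, not the project's actual scoring procedure.

```python
from collections import defaultdict


def pass_rates(judgments):
    """Compute the pass rate for each (system, sentence_type) pair.

    judgments: iterable of (system, sentence_type, passed) where `passed`
    is the yes/no answer recorded for one test sentence.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for system, sentence_type, passed in judgments:
        cell = counts[(system, sentence_type)]
        cell[0] += int(passed)
        cell[1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}
```

Grouping by sentence type matters because an overall average would hide which of the roughly 300 types a given system handles poorly.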
25MT Evaluation Workbench
26Image Meta Data Editor
Meta data Input Workbench by XML
27Image Retrieval by Meta data
28http://korterm.kaist.ac.kr/ksurimal/