Title: Virach Sornlertlamvanich
1 Thai Linguistic Resources
Virach Sornlertlamvanich
Information R&D Division (iTech)
National Electronics and Computer Technology Center (NECTEC), Thailand
virach_at_nectec.or.th
Symposium on Language Resources in Asia, 19 January 2001, Tokyo, Japan
2 How Important!
Language Processing
- Linguistic resources are necessary even in top-down and bottom-up design
- Exploitable in modeling and evaluation
3 In Thai (1)
Thai Morphology
[Sample Thai text; the original characters were lost in extraction]
44 consonants
21 vowels
5 tone marks
- No word boundary. Ex: GODISNOWHERE
  1) God is now here.
  2) God is nowhere.
  3) God is no where.
- No explicit sentence marker; the space character is used for pausing
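
The GODISNOWHERE example can be reproduced with a small sketch that enumerates every dictionary-based segmentation of an unspaced string. The function name `segmentations` and the toy `LEXICON` are assumptions made for illustration; this is exhaustive enumeration, not the segmentation method used at NECTEC.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon (assumption): just enough words to cover the slide's example.
LEXICON = {"god", "is", "now", "here", "no", "where", "nowhere"}

for words in segmentations("godisnowhere", LEXICON):
    print(" ".join(words))
```

With this lexicon the sketch finds the same three readings listed on the slide, which is why a segmenter for Thai needs some way to rank competing candidates rather than a dictionary alone.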
4 In Thai (2)
Thai Syntax
- Simple grammar
  - Written and spoken texts are not much different.
- Sentence pattern
  - (S)(V)(O) Ex: I love you. [Thai example lost in extraction]
- No inflection forms for
  - tenses <> auxiliary verbs
  - plural or singular nouns <> quantifiers, classifiers or determiners
  - subject-verb agreement
- No syntactic marker
  - word position
5 In Thai (3)
Thai Phonology
A Thai syllable: / C(C) V(V) C T /
- C(C): initial consonant
- V(V): vowel
- C: final consonant
- T: tonal level
21 consonants, 18 vowels, 5 tonal levels
Different tones convey different meanings:
- /s ua j 4/ beautiful
- /s ua j 0/ terrible
No liaison: a word has the same pronunciation no matter where it occurs.
No strict pronunciation rule:
- [Thai word lost in extraction] /t u k 3/ k a 1/ t a 0/
- [Thai word lost in extraction] /t u k 3/ k x 0/
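
The syllable template can be illustrated by splitting one transcribed syllable into its slots. This is a minimal sketch: the `Syllable` class, the `parse_syllable` function and the `VOWEL_CHARS` set are assumptions for illustration, and treating `x` as a vowel symbol is a guess based on the transcriptions shown on the slide.

```python
from dataclasses import dataclass

# Assumption: a token starting with one of these characters is a vowel symbol.
VOWEL_CHARS = set("aeioux")


@dataclass
class Syllable:
    initial: str  # C(C): initial consonant or cluster
    vowel: str    # V(V): vowel
    final: str    # final consonant ("" when absent)
    tone: int     # T: tonal level


def parse_syllable(transcription):
    """Split a space-separated transcription such as 's ua j 4' into slots."""
    *phones, tone = transcription.split()
    # The vowel is the first token that starts with a vowel character.
    vowel_index = next(i for i, p in enumerate(phones) if p[0] in VOWEL_CHARS)
    return Syllable(
        initial="".join(phones[:vowel_index]),
        vowel=phones[vowel_index],
        final="".join(phones[vowel_index + 1:]),
        tone=int(tone),
    )


beautiful = parse_syllable("s ua j 4")
terrible = parse_syllable("s ua j 0")
print(beautiful)
print(terrible)
```

The two parsed syllables share the same initial, vowel and final and differ only in the tone slot, which is the minimal pair the slide uses to show that tone is contrastive.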
6 What we need?
- Lexicon / Dictionary (30k)
- Tagged Text (2MB) / Speech Corpora
- Word Extraction (ML: precision 85, recall 56)
- Word Segmentation / POS tagger (ML 96-97)
- Sentence Segmentation (ML 85-89)
- Grapheme-to-Phoneme Conversion (PGLR 73-90)
- Word Sense Disambiguation
- Corpus / UNL / UW (concept) Editor
- MT (ParSit, http://come.to/parsit) / UNL
- Speech Recognition / Synthesis
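
The two figures quoted for word extraction can be folded into a single score for comparison with the other components. The sketch below assumes that they denote precision 85% and recall 56%; the slide does not spell this out, and the F-measure itself is a standard summary that the slide does not report.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Assumed reading of the word extraction figures: precision 0.85, recall 0.56.
print(round(f_measure(0.85, 0.56), 3))  # 0.675
```

The harmonic mean sits closer to the lower of the two values, so under this reading the score is held down by recall.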
7 Our Workbench
8 Open Linguistic Resources
- LEXiTRON v 1.1 (a corpus-based T-E dictionary, 1994)
  - About 11,000 Thai entries, 9,000 English entries
  - http://www.links.nectec.or.th/lexit
- Thai Royal Institute Dictionary (T-T dictionary)
  - Basic terms: 32,000 entries
  - Technical terms: 15,339 entries
  - http://www.royin.go.th/
- ORCHID POS-Tagged Corpus (supported by CRL, 1997)
  - 160 documents, 2MB text, 400K words
  - XML tagged for Paragraph, Sentence, Word, Part-of-Speech (47 tags)
  - http://www.links.nectec.or.th/orchid
- ParSit (http://come.to/parsit, 2000)
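
A corpus tagged at the paragraph, sentence, word and part-of-speech levels, as described for ORCHID, could be read along the following lines. The element names, the `pos` attribute and the tag values `PRON` and `VERB` are all invented for this sketch; they are not ORCHID's actual markup or tagset.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: element, attribute and tag names are assumptions.
SAMPLE = """
<document>
  <paragraph>
    <sentence>
      <word pos="PRON">I</word>
      <word pos="VERB">love</word>
      <word pos="PRON">you</word>
    </sentence>
  </paragraph>
</document>
"""


def read_tagged_words(xml_text):
    """Return one list of (word, pos) pairs per sentence."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("sentence"):
        sentences.append([(w.text, w.get("pos")) for w in sentence.iter("word")])
    return sentences


sentences = read_tagged_words(SAMPLE)
# Tag frequencies are a typical first statistic drawn from a tagged corpus.
tag_counts = Counter(pos for sentence in sentences for _, pos in sentence)
print(sentences)
print(tag_counts)
```

Keeping the sentence grouping intact, rather than flattening to a word list, preserves the boundaries that a sentence segmenter or POS tagger would be trained on.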
9 Ongoing Thai Speech Corpus 1
Scope (2001)
- Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
  - Phonetically-balanced sentences
  - 5K vocabulary coverage sentences
- Corpus for Text-to-Speech Synthesis
  - 400 phonetically and prosodically balanced sentences
  - For probabilistic prosody generation
- Dialog speech corpus (collaboration with ATR)
  - 50 conversations, 2,099 sentences
  - 5,000 words, 866 phonetically-balanced sentences
  - 40 speakers (males and females)
10 Ongoing Thai Speech Corpus 2
Procedure
11 Ongoing Thai Speech Corpus 3
Tools
Corpus Editor
XML Corpus
Plain Text
12 Ongoing Thai Speech Corpus 4
Text Sources
- Technology Promotion Association (Thailand-Japan)
- Amarin Printing Co., Ltd.
- Matichon Public Co., Ltd.
Project Collaboration
- Kasetsart University
- Thammasat University
- King Mongkut's University of Technology Thonburi
- Prince of Songkla University
13 Ongoing Thai Speech Corpus 5
14 Ongoing LEXiTRON v 2.0 1
Scope (2001)
- Entries
  - 25,000 Thai-English
  - 25,000 English-Thai
- Fields
  - Translation
  - Phonetics
  - Root of vocabulary
  - Part-of-speech
  - Synonym
  - Antonym
  - Sample sentence
Procedure
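
The fields planned for LEXiTRON v 2.0 suggest a record shape like the one sketched below. The class name `LexiconEntry`, the field names, the types and the sample values are assumptions for illustration only and do not describe the actual LEXiTRON database schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexiconEntry:
    """One dictionary entry carrying the fields named in the scope list."""
    headword: str
    translations: List[str]
    phonetics: str = ""
    root: str = ""
    part_of_speech: str = ""
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    sample_sentences: List[str] = field(default_factory=list)


# Illustrative entry; every value here is invented for the example.
entry = LexiconEntry(
    headword="beautiful",
    translations=["<Thai translation>"],
    part_of_speech="adjective",
    antonyms=["ugly"],
    sample_sentences=["The view is beautiful."],
)

# A simple in-memory index keyed by headword.
index: Dict[str, LexiconEntry] = {entry.headword: entry}
print(index["beautiful"].part_of_speech)
```

List-valued fields use `default_factory` so that each entry gets its own empty list rather than sharing one mutable default.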
15 Ongoing LEXiTRON v 2.0 2
WordNet
Tools
Dictionary DB
Phonetic Symbols
Corpus-based Sample Sentences
16 Discussion
- Language difficulties: 13 Tai-family languages
- Text sources
- Common tagset
- Resource center
- Institutional collaboration