Virach Sornlertlamvanich - PowerPoint PPT Presentation

1
Thai Linguistic Resources
Virach Sornlertlamvanich
Information R&D Division (iTech)
National Electronics and Computer Technology Center (NECTEC), THAILAND
virach@nectec.or.th
Symposium on Language Resources in Asia
19 January 2001, Tokyo, Japan
2
How Important!
Language Processing
  • Linguistic resources are necessary even in
    top-down and bottom-up design
  • Exploitable in modeling and evaluation

3
In Thai (1)
Thai Morphology
  • A Thai paragraph (sample text lost in extraction)
  • Alphabetical system: 44 consonants, 21 vowels, 5 tone marks
  • No word boundary. Ex: GODISNOWHERE
    1) God is now here. 2) God is nowhere.
    3) God is no where.
  • No explicit sentence marker: the space character is
    used for pausing
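
The GODISNOWHERE example can be reproduced with a small sketch that enumerates every way to split an unspaced string into dictionary words. The word list and the function name `segmentations` are assumptions made for illustration; this is not the NECTEC word segmenter, which the slides describe only as ML-based.

```python
from functools import lru_cache

# Toy dictionary, assumed for illustration only.
WORDS = frozenset({"god", "is", "no", "now", "here", "where", "nowhere"})


def segmentations(text, words=WORDS):
    """Return every split of `text` into words taken from `words`."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def split_from(start):
        # A single empty segmentation marks a successful end of string.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words:
                for rest in split_from(end):
                    results.append([piece] + rest)
        return results

    return split_from(0)


for seg in segmentations("GODISNOWHERE"):
    print(" ".join(seg))
```

With this toy dictionary the string has exactly three readings, which is why a segmenter needs a lexicon plus some way to rank the candidates.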

4
In Thai (2)
Thai Syntax
  • Simple grammar
    - Written and spoken texts are not much different.
  • Sentence pattern: (S)(V)(O). Ex: I love you.
    ฉัน รัก คุณ
  • No inflection forms for:
    - tenses (auxiliary verbs are used instead)
    - plural or singular nouns (quantifiers, classifiers or
      determiners are used instead)
    - subject-verb agreements
  • No syntactic marker: word position is used instead

5
In Thai (3)
Thai Phonology
A Thai syllable: / C(C) V(V) C T /
  • C(C): initial consonant
  • V(V): vowel
  • C: final consonant
  • T: tonal level
21 consonants, 18 vowels, 5 tonal levels
Different tones convey different meanings:
  /s ua j 4/ beautiful   /s ua j 0/ terrible
No liaison: a word has the same pronunciation, no matter where it is.
No strict pronunciation rule:
  ตุ๊กตา /t u k 3/ k a 1/ t a 0/   ตุ๊กแก /t u k 3/ k x 0/
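
The syllable template can be sketched as a small parser for the transcription notation used on this slide. The sketch assumes space-separated symbols with clusters and long vowels each written as one symbol, and tone levels numbered 0 to 4; the names `Syllable` and `parse_syllable` are hypothetical.

```python
from typing import NamedTuple, Optional


class Syllable(NamedTuple):
    initial: str          # C(C): initial consonant or cluster
    vowel: str            # V(V): short or long vowel
    final: Optional[str]  # C: final consonant, may be absent
    tone: int             # T: tonal level, assumed to be 0-4


def parse_syllable(transcription: str) -> Syllable:
    """Parse a string such as '/s ua j 4/' into its four parts."""
    symbols = transcription.strip().strip("/").split()
    if len(symbols) not in (3, 4):
        raise ValueError(f"unexpected syllable: {transcription!r}")
    tone = int(symbols[-1])
    if not 0 <= tone <= 4:
        raise ValueError(f"tone out of range: {tone}")
    initial, vowel = symbols[0], symbols[1]
    final = symbols[2] if len(symbols) == 4 else None
    return Syllable(initial, vowel, final, tone)


beautiful = parse_syllable("/s ua j 4/")
terrible = parse_syllable("/s ua j 0/")
# Same segments, different tone: the pair differs in meaning by tone alone.
print(beautiful[:3] == terrible[:3], beautiful.tone != terrible.tone)
```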
6
What we need?
  • Lexicon / Dictionary (30k)
  • Tagged Text (2MB) / Speech Corpora
  • Language Model
  • Word Extraction (ML p85 r56)
  • Word Segmentation / POS tagger (ML 96-97)
  • Sentence Segmentation (ML 85-89)
  • Grapheme-to-Phoneme Conversion (PGLR 73-90)
  • Word Sense Disambiguation
  • Corpus / UNL / UW (concept) Editor
  • MT (ParSit, http://come.to/parsit) / UNL
  • Text Summarization
  • Speech Recognition / Synthesis

7
Our Workbench
8
Open Linguistic Resources
  • LEXiTRON v 1.1 (a corpus-based T-E dictionary, 1994)
    - About 11,000 Thai entries, 9,000 English entries
    - http://www.links.nectec.or.th/lexit
  • Thai Royal Institute Dictionary (T-T dictionary)
    - Basic term: 32,000 entries
    - Technical term: 15,339 entries
    - http://www.royin.go.th/
  • ORCHID POS-Tagged Corpus (supported by CRL, 1997)
    - 160 documents, 2MB text, 400K words
    - XML tagged for Paragraph, Sentence, Word,
      Part-of-Speech (47 tags)
    - http://www.links.nectec.or.th/orchid
  • ParSit (http://come.to/parsit, 2000)
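
A corpus tagged at the paragraph, sentence, word and part-of-speech levels could be read roughly as follows. The element and attribute names (`paragraph`, `sentence`, `word`, `pos`) and the sample words and POS labels are placeholders assumed for illustration; the actual ORCHID markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real ORCHID element names may differ.
SAMPLE = """
<corpus>
  <paragraph id="1">
    <sentence id="1">
      <word pos="PPRS">ฉัน</word>
      <word pos="VACT">รัก</word>
      <word pos="PPRS">คุณ</word>
    </sentence>
  </paragraph>
</corpus>
"""


def tagged_sentences(xml_text):
    """Yield each sentence as a list of (word, pos) pairs."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        yield [(w.text, w.get("pos")) for w in sentence.iter("word")]


for sentence in tagged_sentences(SAMPLE):
    print(sentence)
```

Keeping word boundaries in the markup matters here because, as noted earlier, Thai text itself carries no explicit word delimiter.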
9
Ongoing: Thai Speech Corpus (1)
Scope (2001)
  • Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
    - Phonetically balanced sentences
    - 5K vocabulary coverage sentences
  • Corpus for Text-to-Speech Synthesis
    - 400 phonetically and prosodically balanced sentences
    - For probabilistic prosody generation
  • Dialog speech corpus (collaboration with ATR)
    - 50 conversations, 2,099 sentences
    - 5,000 words, 866 phonetically balanced sentences
    - 40 speakers (males and females)
10
Ongoing: Thai Speech Corpus (2)
Procedure
11
Ongoing: Thai Speech Corpus (3)
Tools
Corpus Editor
XML Corpus
Plain Text
12
Ongoing: Thai Speech Corpus (4)
Text Sources
  • Technology Promotion Association (Thailand-Japan)
  • Amarin Printing Co., Ltd.
  • Matichon Public Co., Ltd.
Project Collaboration
  • Kasetsart University
  • Thammasat University
  • King Mongkut's University of Technology Thonburi
  • Prince of Songkla University
13
Ongoing: Thai Speech Corpus (5)
14
Ongoing: LEXiTRON v 2.0 (1)
Scope (2001)
  • Entries
    - 25,000 Thai-English
    - 25,000 English-Thai
  • Fields
    - Translation
    - Phonetics
    - Root of vocabulary
    - Part-of-speech
    - Synonym
    - Antonym
    - Sentence sample
Procedure
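
The fields listed for LEXiTRON v 2.0 could be modeled as a simple record. This is a sketch only: the class name `DictionaryEntry`, the field names and their types are assumptions, not the actual LEXiTRON schema, and the sample entry reuses the "beautiful" example from the phonology slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DictionaryEntry:
    """One bilingual entry carrying the fields listed on the slide."""
    headword: str
    translation: str
    phonetics: str = ""
    root: str = ""
    part_of_speech: str = ""
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    sample_sentences: List[str] = field(default_factory=list)


# Illustrative entry; values other than the transcription are placeholders.
entry = DictionaryEntry(
    headword="สวย",
    translation="beautiful",
    phonetics="s ua j 4",
)

# A Thai-English index keyed by headword.
thai_english = {entry.headword: entry}
print(thai_english["สวย"].translation)
```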
15
Ongoing: LEXiTRON v 2.0 (2)
Wordnet
Tools
Dictionary DB
Phonetic Symbols
Corpus-based Sample Sentences
16
Discussion
  • Language difficulties: 13 Tai-family languages
  • Text sources
  • Common tagset
  • Resource center
  • Institutional collaboration