Title: Building a Dictionary from WWW
1Building a Dictionary from WWW
- Virach Sornlertlamvanich
- Information Research and Development Division
- National Electronics and Computer Technology
Center - Thailand
- virach_at_nectec.or.th
- Languages and Cultures of the East and West
(LACEW) - July 25-27, 2001, Tsukuba University
2Motivations
- The WWW is the only huge, up-to-date online resource with thorough coverage of languages and areas
- Lexicon and terminology databases are needed in
- Electronic Dictionary
- Machine Translation
- Text Summarization, etc.
- Lack of open sharable resource
- No standard formats
- Legal issues
Collaborative Open Lexicon Development
3Concepts
- Open Format
- XML-based
- Open Protocol
- XML-based request/response
- DICT (RFC 2229)
- Open Participation
- Data entry
- Approver
- Open Source
- Software Tools
- Dictionary Content
- Corpus-based
- To reflect the uses and meanings of terms in real life
- To assist the human thinking process
4Format Standards Survey
- Models
- ISO 12620:1999: Terminology data categories
- ISO 12200:1999: Machine-Readable Terminology Interchange Format (MARTIF)
- Variations
- OTELO OLIF (Open Lexicon Interchange Format): a format for MT dictionary interchange
- OSCAR (LISA) TMX (Translation Memory eXchange format)
- SALT (Standards-based Access to multilingual Lexicons and Terminologies) XLT (XML representation of Lexicons and Terminologies)
5Development Procedure
- Document collection with robot (with language identification)
- Term candidate extraction (C4.5 on MI, entropy, etc.)
- Context-based concept classification (text classification)
- Syntactic structure extraction (POS tagger)
- Semantic correlation discovery (ontology)
Flow: WWW -> Robot -> Text Corpus (Sample Texts) -> Term Candidate Extraction -> Terms -> Concept Classification -> Term-Concepts -> Syntactic Structure Analysis -> Annotated Concepts -> Concept Correlation Discovery -> Ontology
6A Thai Running Text
[Thai sample paragraph as running text; the Thai characters were lost in extraction. The paragraph ends with the year 1989.]
Word/Sentence Segmentation
[The same paragraph with word and sentence boundaries marked by spaces; the Thai characters were lost in extraction.]
7Writing System
- 46 consonants, 18 vowels, 4 tones, 9 symbols, 10 digits, written on 4 levels
- No punctuation
- No word/sentence marker
- No upper/lower case letter
- No inflection
(Figure: the four writing levels relative to the baseline)
Result: single/compound words, phrases, and sentences are hard to identify
8Sentence Extraction
Training: POS-tagged training corpus -> Winnow (feature-based ML) -> trained network
Extraction: input paragraph -> word segmentation and POS tagging -> word sequence with tagged POS -> Winnow (using the trained network) -> paragraph with sentence breaks
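The slide names Winnow as the learner but does not give its features or parameters. As a rough illustration only, the following is a minimal sketch of the textbook Winnow update (multiplicative promotion and demotion of the weights of active binary features). The feature names, the toy examples, the threshold choice, and the factor alpha=2 are all assumptions for the sketch, not the author's setup.

```python
# Textbook Winnow over binary features. Feature names and data are hypothetical.
FEATURES = ["prev=PARTICLE", "next=CONJ", "prev=NOUN", "next=NOUN"]

# Each example: (active features at a candidate position, is it a sentence break?)
EXAMPLES = [
    (["prev=PARTICLE", "prev=NOUN"], True),
    (["prev=NOUN", "next=NOUN"], False),
    (["next=CONJ", "next=NOUN"], True),
    (["next=NOUN"], False),
    (["prev=PARTICLE"], True),
    (["prev=NOUN"], False),
]


class Winnow:
    def __init__(self, features, alpha=2.0):
        self.weights = {f: 1.0 for f in features}  # all weights start at 1
        self.threshold = float(len(self.weights))  # assumed threshold: number of features
        self.alpha = alpha

    def predict(self, active):
        # Positive when the summed weight of the active features exceeds the threshold.
        return sum(self.weights[f] for f in active) > self.threshold

    def update(self, active, label):
        """Apply one mistake-driven update; return True if a mistake was made."""
        if self.predict(active) == label:
            return False
        # Missed break: promote active weights. False alarm: demote them.
        factor = self.alpha if label else 1.0 / self.alpha
        for f in active:
            self.weights[f] *= factor
        return True


def train(model, examples, max_epochs=20):
    for _ in range(max_epochs):
        mistakes = sum(model.update(active, label) for active, label in examples)
        if mistakes == 0:  # stop once a full pass makes no mistakes
            break
    return model


model = train(Winnow(FEATURES), EXAMPLES)
```

In the pipeline above, the active features would come from the POS-tagged word sequence around each candidate break position.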
9Accuracy in Word/Sentence Segmentation
- Word Segmentation
- Longest matching (92%)
- Maximal matching (93%)
- POS tri-gram (96%)
- Machine learning (97%)
- Sentence Segmentation
- POS tri-gram (85%)
- Machine learning (89%)
Supervised approaches
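To make the two dictionary-based baselines concrete, here is a minimal sketch under common textbook definitions: longest matching greedily takes the longest lexicon entry at each position, and maximal matching picks a segmentation with the fewest words. The toy lexicon and the abstract input string are assumptions for illustration; the slides do not specify the implementations that were evaluated.

```python
def longest_matching(text, lexicon):
    """Greedy left-to-right: at each position take the longest lexicon entry."""
    max_len = max(map(len, lexicon))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                break
        else:
            candidate = text[i]  # no entry matches: emit the single character
        words.append(candidate)
        i += len(candidate)
    return words


def maximal_matching(text, lexicon):
    """Dynamic programming: a lexicon-only segmentation with the fewest words.

    Returns None when the text cannot be covered by lexicon entries.
    """
    n = len(text)
    best = [None] * (n + 1)  # best[k] = best segmentation of text[:k]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in lexicon:
                candidate = best[start] + [text[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


LEXICON = {"ab", "abc", "cde", "d", "e"}  # hypothetical toy lexicon
greedy = longest_matching("abcde", LEXICON)
fewest = maximal_matching("abcde", LEXICON)
```

On this toy input the greedy pass commits to "abc" and is left with two single-character words, while the fewest-words search finds the two-word segmentation.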
10Term Candidate Extraction
- Virach Sornlertlamvanich et al. (COLING 2000)
- Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm
- C4.5-trained decision tree for determining potential word boundaries from MI, entropy, and linguistic information
- Capable of discovering new words in documents without assistance from a static dictionary
11Mutual Information
\[
Lm(xyz) = \frac{p(xyz)}{p(x)\,p(yz)}, \qquad Rm(xyz) = \frac{p(xyz)}{p(xy)\,p(z)}
\]
where x is the leftmost character of string xyz, y is the middle substring of xyz, z is the rightmost character of string xyz, and p( ) is the probability function.
High mutual information implies that xyz co-occurs more often than expected by chance. If xyz is a word, its Lm and Rm must be high.
E.g. "function" and "...function..."
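As a small worked illustration, the sketch below estimates left and right mutual information from raw substring counts, assuming they are the ratios p(xyz) / (p(x) p(yz)) and p(xyz) / (p(xy) p(z)) and that p(s) is approximated by count(s) / corpus length. The helper names and the toy corpus are assumptions made for this example.

```python
def count(corpus, s):
    """Number of (possibly overlapping) occurrences of s in corpus."""
    return sum(1 for i in range(len(corpus) - len(s) + 1) if corpus.startswith(s, i))


def left_mi(corpus, xyz):
    """Lm(xyz) = p(xyz) / (p(x) * p(yz)), with p(s) ~ count(s) / len(corpus)."""
    c_xyz = count(corpus, xyz)
    if c_xyz == 0:
        return 0.0  # string never occurs, so there is no evidence of association
    c_x, c_yz = count(corpus, xyz[0]), count(corpus, xyz[1:])
    # Algebraically equal to the ratio of the three estimated probabilities.
    return c_xyz * len(corpus) / (c_x * c_yz)


def right_mi(corpus, xyz):
    """Rm(xyz) = p(xyz) / (p(xy) * p(z)), with p(s) ~ count(s) / len(corpus)."""
    c_xyz = count(corpus, xyz)
    if c_xyz == 0:
        return 0.0
    c_xy, c_z = count(corpus, xyz[:-1]), count(corpus, xyz[-1])
    return c_xyz * len(corpus) / (c_xy * c_z)


CORPUS = "thecatandthedogandthecow"  # hypothetical unsegmented toy corpus
lm_the = left_mi(CORPUS, "the")
rm_the = right_mi(CORPUS, "the")
```

A corpus this small only demonstrates the arithmetic; the values become informative as word-boundary evidence only on a large corpus.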
12Entropy
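\[
Le(y) = -\sum_{x \in A} p(xy \mid y)\,\log_2 p(xy \mid y), \qquad Re(y) = -\sum_{z \in A} p(yz \mid y)\,\log_2 p(yz \mid y)
\]
where A is the set of characters, x is the leftmost character of string xyz, y is the middle substring of xyz, z is the rightmost character of string xyz, and p( ) is the probability function.
Entropy shows the variety of characters before and after a word. If y is a word, its left and right entropy must be high.
E.g. "...?function..." and "...?unction...", where ? stands for the neighboring character.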
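The idea can be sketched as follows, assuming left and right entropy are the Shannon entropy (base 2) of the distribution of characters immediately before and after each occurrence of y. The function name and the toy corpus are assumptions; occurrences at the very edge of the corpus, which have no neighbor, are simply skipped here.

```python
import math
from collections import Counter


def neighbour_entropy(corpus, y, side):
    """Entropy of the characters adjacent to y; side is "left" or "right"."""
    neighbours = Counter()
    start = corpus.find(y)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(y)
        if 0 <= idx < len(corpus):  # skip occurrences with no neighbor
            neighbours[corpus[idx]] += 1
        start = corpus.find(y, start + 1)
    total = sum(neighbours.values())
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())


# Hypothetical corpus: "function" is preceded by four different characters,
# while "unction" is always preceded by "f".
CORPUS = "afunction bfunction cfunction dfunction"
le_function = neighbour_entropy(CORPUS, "function", "left")
le_unction = neighbour_entropy(CORPUS, "unction", "left")
```

A true word boundary shows varied neighbors (high entropy), whereas a string cut inside a word is almost always preceded by the same character (entropy near zero).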
13Other Features
- Frequency: Words tend to be used more often than non-word string sequences.
- Length: Short strings are likely to happen by chance. Long and short strings should be treated differently.
- Functional Words: Functional words are used mostly in phrases. They are useful to disambiguate words and phrases.
Result of subjective test: word precision 85%, word recall 56%
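For reference, precision and recall over extracted words can be computed as below. This is a generic sketch: it assumes a set of strings the extractor proposed and a set of strings judged to be real words, both hypothetical here. The slides do not describe how the subjective test was scored.

```python
def precision_recall(extracted, accepted):
    """Precision = correct / extracted; recall = correct / accepted."""
    extracted, accepted = set(extracted), set(accepted)
    correct = extracted & accepted
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(accepted) if accepted else 0.0
    return precision, recall


# Hypothetical toy data: 4 strings proposed, 6 strings judged to be words.
precision, recall = precision_recall(
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e", "f", "g"],
)
```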
14Evaluation Result of Word Extraction
RID: Royal Institute Dictionary (a Thai-Thai dictionary of 30,000 words)
15Concept Classification
- Words and their contexts in the corpora
- Manual word-sense disambiguation
- Unsupervised word sense disambiguation (Yarowsky
1995)
16Concept Classification
17Syntactic Structure Analysis
- Sentence/word segmentation by POS trigram tagger
- POS assignment
- Word co-occurrence
- Parser
- Pattern of usages
18Ontologies
- EDR
- Approach: Word description as employed in dictionaries
- Problem: Ambiguities and incomputability
- WordNet
- Approach: Synonym sets and simple semantic relations to other words
- Problem: Ambiguities
- UW
- Approach: Headwords and semantic restrictions
- Advantage: Computability and no ambiguity
19Ontologies
Representation of the concept "tired" in different schemes
- EDR
- having or displaying a need for rest
- having lost interest
- lack of imagination
- WordNet 1.5
- A1: tired (vs. rested)
- A2: bromidic, commonplace, hackneyed, ...
- V1: tire, pall, grow weary, fatigue
- V2: tire, wear upon, fag out
- V3: run down, exhaust, sap, ...
- V4: bore, tire, ...
- UW
- tired
- tired(icl>physical)
- tired(icl>mental)
20Universal Word (UW)
- UW format: <headword> ( <list of restrictions> ), e.g. book(icl>do, obj>room)
- Headword: An English word that roughly describes the UW sense.
- Restrictions
- Inclusion (icl>) indicates the class of the sense, e.g. car(icl>movable thing)
21Universal Word (UW)
- Restrictions (continued)
- UNL semantic relations, e.g. eat(agt>volitional thing, obj>food)
- The agent of this UW is restricted to be a volitional thing.
- The object of this UW is restricted to be food.
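The UW notation is regular enough to parse mechanically. The following is a minimal sketch that splits a UW string into its headword and a list of (relation, value) restrictions, assuming the flat form headword(rel>value, rel>value) used in these examples. The function name is hypothetical, and nested restrictions are not handled.

```python
import re


def parse_uw(text):
    """Split 'headword(rel>value, ...)' into (headword, [(rel, value), ...])."""
    match = re.fullmatch(r"\s*([^()]+?)\s*(?:\((.*)\)\s*)?", text)
    if match is None:
        raise ValueError(f"not a UW string: {text!r}")
    headword, body = match.group(1), match.group(2)
    restrictions = []
    if body:
        for part in body.split(","):
            # Each restriction is 'relation>value', e.g. 'icl>movable thing'.
            relation, _, value = part.partition(">")
            restrictions.append((relation.strip(), value.strip()))
    return headword, restrictions


eat = parse_uw("eat(agt>volitional thing, obj>food)")
tired = parse_uw("tired")
```

A bare headword such as "tired" parses to an empty restriction list, which matches the least specific of the three UW senses.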
UW Class Hierarchy
22Architecture
- Centralized activities
- For data integrity and consistency
- Distributed sites
- For open participation
- For backing up
- Job-based
- Jobs generated by corpus analysis tools
- Participants download jobs, work on them off-line, and submit them back when done
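One way to picture the job-based workflow is as a small state machine for each job. The sketch below is an assumption-laden illustration: the state names, the action names, and in particular the "reject" transition that returns a job to the pool are hypothetical and do not come from the slides.

```python
from enum import Enum


class State(Enum):
    IN_POOL = "in pool"                      # generated and waiting for a participant
    CHECKED_OUT = "checked out"              # downloaded for off-line work
    AWAITING_APPROVAL = "awaiting approval"  # submitted, waiting for an approver
    APPROVED = "approved"                    # ready to enter the central database


# Allowed transitions; "reject" is an assumed path, not shown in the slides.
TRANSITIONS = {
    (State.IN_POOL, "download"): State.CHECKED_OUT,
    (State.CHECKED_OUT, "submit"): State.AWAITING_APPROVAL,
    (State.AWAITING_APPROVAL, "approve"): State.APPROVED,
    (State.AWAITING_APPROVAL, "reject"): State.IN_POOL,
}


class Job:
    def __init__(self, job_id, payload):
        self.job_id = job_id
        self.payload = payload
        self.state = State.IN_POOL

    def apply(self, action):
        """Move the job to its next state, or raise if the action is not allowed."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action!r} a job that is {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state


job = Job("A-001", {"term_candidates": ["example"]})  # hypothetical job content
for action in ("download", "submit", "approve"):
    job.apply(action)
```

Keeping the transitions in one table makes it easy for a central site to refuse out-of-order submissions from distributed participants.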
23Job-based Working
Diagram: a shared Job Billboard over two parallel job types, A and B. For each job type, the boxes appear in this order: Generator, Pool, Participant, Submittal, Acceptor, Approval Pool, Approver, Approved Pool, Approval Acceptor. Both job types feed the Central DB.
24Network Connection
- Committee nodes
- Replicate same database
- Closely synchronized
- Provide service to neighbor participants
- Agents (optional)
- Propagate communications between committee and
participants
25Contents to Develop
- Thai word list
- Corpus-based Thai lexicon
- Co-occurrence dictionary
- Thai ontology
26SNLPO-COCOSDA
The Fifth Symposium on Natural Language Processing and Oriental COCOSDA Workshop 2002
9-11 May 2002, Hua Hin, Prachuapkirikhan, Thailand
http://kind.siit.tu.ac.th/snlp-o-cocosda2002/ or http://www.links.nectec.or.th/itech/snlp-o-cocosda2002/