Building a Dictionary from WWW

1
Building a Dictionary from WWW
  • Virach Sornlertlamvanich
  • Information Research and Development Division
  • National Electronics and Computer Technology
    Center
  • Thailand
  • virach_at_nectec.or.th
  • Languages and Cultures of the East and West
    (LACEW)
  • July 25-27, 2001, Tsukuba University

2
Motivations
  • The WWW is the only huge, up-to-date online resource that
    covers languages and subject areas thoroughly
  • Lexicon and terminology databases are needed in
    • Electronic Dictionary
    • Machine Translation
    • Text Summarization, etc.
  • Lack of open, sharable resources
    • No standard formats
    • Legal issues

Collaborative Open Lexicon Development
3
Concepts
  • Open Format
    • XML-based
  • Open Protocol
    • XML-based request/response
    • DICT (RFC 2229)
  • Open Participation
    • Data entry
    • Approver
  • Open Source
    • Software Tools
    • Dictionary Content
  • Corpus-based
    • To reflect the uses and meanings of terms in real
      life
    • To assist the human thinking process
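As a rough illustration of the DICT option named above, the sketch below formats a DEFINE command and splits a status line in the style of RFC 2229. The helper names build_define and parse_status are hypothetical, and the sketch assumes the looked-up word contains no double quotes or line breaks. It does not open a network connection.

```python
def build_define(word, database="*"):
    """Format a DICT DEFINE command line; "*" asks the server to search all databases."""
    if any(ch in word for ch in '"\r\n'):
        # Assumption of this sketch: such characters are rejected, not escaped.
        raise ValueError("word must not contain quotes or line breaks")
    return f'DEFINE {database} "{word}"\r\n'


def parse_status(line):
    """Split a status line such as '552 no match' into (code, text)."""
    code, _, text = line.rstrip("\r\n").partition(" ")
    return int(code), text


if __name__ == "__main__":
    print(repr(build_define("dictionary")))
    print(parse_status("552 no match\r\n"))
```

A client would send the command over a TCP connection to a DICT server and read status lines until a final status arrives. That part is omitted here.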

4
Format Standards Survey
  • Models
    • ISO 12620:1999, Terminology data categories
    • ISO 12200:1999, Machine-Readable Terminology
      Interchange Format (MARTIF)
  • Variations
    • OTELO: OLIF (Open Lexicon Interchange Format),
      a format for MT dictionary interchange
    • OSCAR (LISA): TMX (Translation Memory eXchange
      format)
    • SALT (Standards-based Access to multilingual
      Lexicons and Terminologies): XLT (XML
      representation of Lexicons and Terminologies)

5
Development Procedure
Flow (from the diagram): WWW → Robot → Text Corpus (Sample Texts) →
Term Candidate Extraction → Terms → Concept Classification →
Term-Concepts → Syntactic Structure Analysis → Annotated Concepts →
Concept Correlation Discovery → Ontology
  • Document collection with robot (with language
    identification)
  • Term candidate extraction (C4.5 on MI,
    entropy, etc.)
  • Context-based concept classification (text
    classification)
  • Syntactic structure extraction (POS tagger)
  • Semantic correlation discovery (ontology)

6
A Thai Running Text
[Sample paragraph of unsegmented Thai text, ending with the year 1989; the Thai characters did not survive extraction.]
Word/Sentence Segmentation
[The same paragraph with spaces inserted at word and sentence boundaries; the Thai characters did not survive extraction.]
7
Writing System
  • 46 consonants, 18 vowels, 4 tones, 9 symbols, 10
    digits, written on 4 levels
  • No punctuation
  • No word/sentence marker
  • No upper/lower case letter
  • No inflection

[Figure labels: consonant, tone, vowel, vowel, baseline]
Hard to identify (single/compound) word/phrase/sentence
8
Sentence Extraction
Training: POS-tagged corpus → Winnow (feature-based ML) → Trained network
Extraction: Input paragraph → Word segmentation and POS tagging →
Word sequence with tagged POS → Winnow (using the trained network) →
Paragraph with sentence breaks
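The slide names Winnow as the learner but gives no details, so the following is only a minimal sketch of the textbook Winnow update (multiplicative promotion and demotion over binary features). The feature indices, the threshold choice, and the toy examples are all hypothetical. They stand in for POS-context features around a candidate sentence break.

```python
def predict(weights, theta, active):
    """Return 1 (sentence break) if the summed weights of the active features exceed theta."""
    return int(sum(weights[i] for i in active) > theta)


def train_winnow(examples, n_features, alpha=2.0, epochs=10):
    """Train on (active_feature_indices, label) pairs with labels in {0, 1}."""
    weights = [1.0] * n_features
    theta = n_features / 2  # assumed threshold; other choices are possible
    for _ in range(epochs):
        for active, label in examples:
            if predict(weights, theta, active) == label:
                continue
            # Promote on a missed break, demote on a false break.
            factor = alpha if label == 1 else 1.0 / alpha
            for i in active:
                weights[i] *= factor
    return weights, theta


# Hypothetical toy data: a break occurs whenever feature 0 or feature 1 is active.
EXAMPLES = [((0, 2), 1), ((1, 3), 1), ((2, 3), 0), ((2,), 0), ((0,), 1), ((1,), 1)]

if __name__ == "__main__":
    w, t = train_winnow(EXAMPLES, n_features=4)
    print(w, t, [predict(w, t, active) for active, _ in EXAMPLES])
```

Only the weights of features that are active in a misclassified example change, which is why Winnow copes well with large, sparse feature sets.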
9
Accuracy in Word/Sentence Segmentation
  • Word Segmentation
    • Longest matching (92%)
    • Maximal matching (93%)
    • POS tri-gram (96%)
    • Machine learning (97%)
  • Sentence Segmentation
    • POS tri-gram (85%)
    • Machine learning (89%)

Supervised approaches
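To make the two dictionary-based baselines concrete, here is a small sketch. It assumes that longest matching greedily takes the longest lexicon entry at each position, and that maximal matching picks the segmentation with the fewest tokens. The Latin-script toy lexicon is hypothetical and stands in for a Thai word list, and characters not covered by the lexicon become single-character tokens.

```python
def longest_matching(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest known word."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # unknown character: emit it as its own token
        tokens.append(text[i:j])
        i = j
    return tokens


def maximal_matching(text, lexicon):
    """Dynamic programming: segmentation of the whole text with the fewest tokens."""
    max_len = max(map(len, lexicon))
    best = [None] * (len(text) + 1)  # best[i] segments text[:i]
    best[0] = []
    for i in range(1, len(text) + 1):
        for k in range(max(0, i - max_len), i):
            piece = text[k:i]
            if piece in lexicon or i - k == 1:
                candidate = best[k] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(text)]


LEXICON = {"the", "theme", "menu"}  # hypothetical toy lexicon

if __name__ == "__main__":
    print(longest_matching("themenu", LEXICON))
    print(maximal_matching("themenu", LEXICON))
```

On "themenu" the greedy method commits to "theme" and is left with unknown characters, while the fewest-token search recovers "the" + "menu". Ties between equally short segmentations are broken arbitrarily here. Practical systems add further heuristics.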
10
Term Candidate Extraction
  • Virach Sornlertlamvanich et al. (COLING 2000),
    "Automatic Corpus-Based Thai Word Extraction with
    the C4.5 Learning Algorithm"
  • C4.5-trained decision tree for determining
    potential word boundaries from MI, entropy, and
    linguistic information
  • Capable of discovering new words in documents
    without assistance from a static dictionary
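C4.5 chooses its splits by gain ratio (information gain divided by split information). The sketch below evaluates one threshold split on a single numeric feature. The feature values and labels are hypothetical toy data, not figures from the paper, and the sketch covers only the split criterion, not tree building or pruning.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split 'value <= threshold' versus 'value > threshold'."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


if __name__ == "__main__":
    # Hypothetical feature values for four candidate strings; label 1 = word, 0 = non-word.
    print(gain_ratio([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0], threshold=0.5))
```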

11
Mutual Information
\[
Lm(xyz) = \frac{p(xyz)}{p(x)\,p(yz)}, \qquad
Rm(xyz) = \frac{p(xyz)}{p(xy)\,p(z)}
\]
where
  • x is the leftmost character of string xyz
  • y is the middle substring of xyz
  • z is the rightmost character of string xyz
  • p( ) is the probability function.
High mutual information implies that xyz
co-occurs more than expected by chance. If xyz is
a word, its Lm and Rm must be high.
Example: Efunction and ...Function...
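A minimal sketch of how these two scores could be computed from raw text, assuming the left and right mutual information are Lm(xyz) = p(xyz) / (p(x) p(yz)) and Rm(xyz) = p(xyz) / (p(xy) p(z)). Estimating p(s) as the number of occurrences of s divided by the corpus length is an assumption of this sketch, and the function names are hypothetical.

```python
def count(corpus, s):
    """Count occurrences of s in corpus, including overlapping ones."""
    return sum(corpus.startswith(s, i) for i in range(len(corpus) - len(s) + 1))


def mutual_information(corpus, xyz):
    """Return (Lm, Rm) for the candidate string xyz."""
    if len(xyz) < 2:
        raise ValueError("xyz needs at least two characters")
    if count(corpus, xyz) == 0:
        return 0.0, 0.0  # unseen string: no evidence of co-occurrence
    n = len(corpus)

    def p(s):
        # Assumed estimator: relative frequency over corpus positions.
        return count(corpus, s) / n

    x, yz = xyz[0], xyz[1:]
    xy, z = xyz[:-1], xyz[-1]
    lm = p(xyz) / (p(x) * p(yz))
    rm = p(xyz) / (p(xy) * p(z))
    return lm, rm


if __name__ == "__main__":
    print(mutual_information("abcabc", "abc"))
```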
12
Entropy
\[
Le(y) = -\sum_{x \in A} p(xy \mid y)\,\log p(xy \mid y), \qquad
Re(y) = -\sum_{z \in A} p(yz \mid y)\,\log p(yz \mid y)
\]
where
  • A is the set of characters
  • x is the leftmost character of string xyz
  • y is the middle substring of xyz
  • z is the rightmost character of string xyz
  • p( ) is the probability function.
Entropy shows the variety of characters before
and after a word. If y is a word, its left and
right entropy must be high.
Example: ...?function... , ...?unction...
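The left and right entropy of a candidate string y can be estimated by tallying the characters that appear immediately before and after each occurrence of y. The sketch below does that with base-2 logarithms and normalizes by the number of neighbors actually observed. Both choices are assumptions of this sketch, as are the function names.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbor characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def neighbor_entropy(corpus, y):
    """Return (left entropy, right entropy) of y in corpus."""
    left, right = Counter(), Counter()
    start = corpus.find(y)
    while start != -1:
        end = start + len(y)
        if start > 0:
            left[corpus[start - 1]] += 1
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(y, start + 1)  # allow overlapping occurrences
    return _entropy(left), _entropy(right)


if __name__ == "__main__":
    print(neighbor_entropy("xabyabzab", "ab"))
```

In the toy corpus, "ab" is preceded by three different characters and followed by two, so its left entropy is higher than its right entropy.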
13
Other Features
  • Frequency: Words tend to be used more often than
    non-word string sequences.
  • Length: Short strings are likely to happen by
    chance. Long and short strings should be
    treated differently.
  • Functional words: Functional words are used
    mostly in phrases. They are useful to
    disambiguate words and phrases.

Result of subjective test: word precision 85%,
word recall 56%
14
Evaluation Result of Word Extraction
RID: Royal Institute Dictionary (a Thai-Thai
dictionary of 30,000 words)
15
Concept Classification
  • Words and their contexts in the corpora
  • Manual word-sense disambiguation
  • Unsupervised word sense disambiguation (Yarowsky
    1995)

16
Concept Classification
17
Syntactic Structure Analysis
  • Sentence/word segmentation by POS trigram tagger
  • POS assignment
  • Word co-occurrence
  • Parser
  • Pattern of usages
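The word co-occurrence step can be sketched as counting unordered word pairs that fall within a fixed window in already segmented sentences. The window size and the English toy sentence are hypothetical. The slides do not say how co-occurrence was actually measured.

```python
from collections import Counter


def cooccurrence(sentences, window=2):
    """Count unordered word pairs that appear within `window` words of each other."""
    counts = Counter()
    for words in sentences:
        for i, w in enumerate(words):
            for v in words[i + 1:i + 1 + window]:
                counts[tuple(sorted((w, v)))] += 1
    return counts


if __name__ == "__main__":
    segmented = [["build", "a", "dictionary"], ["a", "dictionary", "entry"]]
    for pair, n in cooccurrence(segmented, window=1).most_common():
        print(pair, n)
```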

18
Ontologies
  • EDR
    • Approach: Word description as employed in
      dictionaries
    • Problem: Ambiguities and incomputability
  • WordNet
    • Approach: Synonym sets and simple
      semantic relations to other words
    • Problem: Ambiguities
  • UW
    • Approach: Headwords and semantic restrictions
    • Advantage: Computability and no ambiguity

19
Ontologies
Representation of the concept "tired" in different
schemes
  • EDR
    • having or displaying a need for rest
    • having lost of interest
    • lack of imagination
  • WordNet 1.5
    • A1: tired (vs. rested)
    • A2: bromidic, commonplace, hackneyed, ...
    • V1: tire, pall, grow weary, fatigue
    • V2: tire, wear upon, fag out
    • V3: run down, exhaust, sap, ...
    • V4: bore, tire, ...
  • UW
    • tired
    • tired(icl>physical)
    • tired(icl>mental)
20
Universal Word (UW)
  • UW format: <headword> ( <list of restrictions> ),
    e.g. book(icl>do, obj>room)
  • Headword: An English word that roughly describes
    the UW sense.
  • Restrictions
    • Inclusion (icl>) indicates the class of the
      sense, e.g. car(icl>movable thing)

21
Universal Word (UW)
  • Restrictions (continued)
    • UNL semantic relations, e.g.
      eat(agt>volitional thing, obj>food).
      The agent of this UW is restricted to be a
      volitional thing. The object of this UW is
      restricted to be food.
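The UW notation, a headword followed by a parenthesized list of label>value restrictions, maps naturally onto a small record type. The parser below is a sketch for flat restriction lists such as eat(agt>volitional thing, obj>food). The class and function names are hypothetical, and nested restrictions or repeated labels are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class UniversalWord:
    headword: str
    restrictions: dict = field(default_factory=dict)  # relation label -> restricting concept


def parse_uw(text):
    """Parse 'headword(label>value, ...)' into a UniversalWord (flat lists only)."""
    text = text.strip()
    if "(" not in text:
        return UniversalWord(text)
    if not text.endswith(")"):
        raise ValueError(f"unbalanced parentheses in {text!r}")
    head, _, body = text[:-1].partition("(")
    restrictions = {}
    for item in body.split(","):
        label, sep, value = item.partition(">")
        if not sep:
            raise ValueError(f"restriction without '>' in {text!r}")
        restrictions[label.strip()] = value.strip()
    return UniversalWord(head.strip(), restrictions)


if __name__ == "__main__":
    print(parse_uw("eat(agt>volitional thing, obj>food)"))
    print(parse_uw("car(icl>movable thing)"))
```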

UW Class Hierarchy
22
Architecture
  • Centralized activities
    • For data integrity and consistency
  • Distributed sites
    • For open participation
    • For backing up
  • Job-based
    • Jobs generated by corpus analysis tools
    • Participants download jobs to work off-line and
      submit them back when done

23
Job-based Working
Job Billboard
Flow for each job type (A and B): Job Generator → Job Pool →
Participant → Job Submittal → Job Acceptor → Job Approval Pool →
Approver → Job Approved Pool → Job Approval Acceptor → Central DB
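One way to picture the job flow (job pool, participant, approval pool, approver, approved pool, central database) is as a small state machine. The stage and action names below are hypothetical labels chosen for this sketch, and it assumes a strictly linear flow. The slides do not describe rejection or resubmission paths.

```python
from enum import Enum


class Stage(Enum):
    JOB_POOL = 1
    WITH_PARTICIPANT = 2
    APPROVAL_POOL = 3
    WITH_APPROVER = 4
    APPROVED_POOL = 5
    CENTRAL_DB = 6


# (action, current stage) -> next stage; the action names are illustrative only.
TRANSITIONS = {
    ("download", Stage.JOB_POOL): Stage.WITH_PARTICIPANT,
    ("submit", Stage.WITH_PARTICIPANT): Stage.APPROVAL_POOL,
    ("review", Stage.APPROVAL_POOL): Stage.WITH_APPROVER,
    ("approve", Stage.WITH_APPROVER): Stage.APPROVED_POOL,
    ("commit", Stage.APPROVED_POOL): Stage.CENTRAL_DB,
}


def advance(stage, action):
    """Return the next stage, or raise ValueError if the action is not allowed."""
    try:
        return TRANSITIONS[(action, stage)]
    except KeyError:
        raise ValueError(f"{action!r} is not allowed in stage {stage.name}") from None


if __name__ == "__main__":
    stage = Stage.JOB_POOL
    for action in ["download", "submit", "review", "approve", "commit"]:
        stage = advance(stage, action)
        print(action, "->", stage.name)
```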
24
Network Connection
  • Committee nodes
    • Replicate the same database
    • Closely synchronized
    • Provide service to neighbor participants
  • Agents (optional)
    • Propagate communications between committee and
      participants

25
Contents to Develop
  • Thai word list
  • Corpus-based Thai lexicon
  • Co-occurrence dictionary
  • Thai ontology

26
SNLPO-COCOSDA
The Fifth Symposium on Natural Language
Processing and Oriental COCOSDA Workshop 2002
9-11 May 2002
Hua Hin, Prachuapkirikhan, Thailand
http://kind.siit.tu.ac.th/snlp-o-cocosda2002/ or
http://www.links.nectec.or.th/itech/snlp-o-cocosda2002/