Lexicons, Corpora - PowerPoint PPT Presentation

About This Presentation
Title:

Lexicons, Corpora

Description:

Restricted vocabulary of an NLP system. A list of all expected or allowed ... Root.GeneralMeaning + Pattern.SpecificMeaning. Only further circumfixation allowed ...

Slides: 33
Provided by: yousefsi

Transcript and Presenter's Notes

1
Lexicons, Corpora & Morphology
  • Yousef S. I. Elarian

2
(No Transcript)
3
Outline
  • Lexicons
  • Lexicon
  • Lexicon Extraction
  • Corpora
  • Corpus
  • Evaluation (Zipf's Law)
  • Morphology
  • Arabic Morphology
  • Templatic
  • Roots, Patterns, Stems
  • Concatenative
  • Concatenative vs. Templatic
  • Practical Stuff
  • Xerox
  • Buckwalter's

4
Lexicon
  • Restricted vocabulary of an NLP system
  • A list of all expected or allowed valid words.
  • The backbone of any NLP application.
  • Generated
  • Manually (many people)
  • With computers (today's trend)
  • Extract from corpora
  • Reduce (Stem)
  • Synthesized??
  • Examples
  • Bare
  • With description

Bare: One, Two, Three, Four, Five, Here, Mars, Days, Name, Go
With description (conjunctions):
و وَ Pref-Wa and <pos>wa/CONJ+</pos>
ف فَ Pref-Wa and;so <pos>fa/CONJ+</pos>
5
Lexicon Extraction
  • The computational-linguistics community is
    converging on extracting the lexicon from
    naturally occurring text (newspapers, phone calls).
  • A large amount of representative text is gathered
    and processed (Corpus).
  • Typically involves normalizing surface-words into
    a common basic form (e.g. roots or stems)
  • Reduce the number of entries.
  • Need Morphology!

Corpus → Tokenize → Stem → Add if new → Lexicon
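The pipeline above (corpus, tokenize, stem, add if new) can be sketched in a few lines of Python. This is only an illustration: `toy_stem` is a hypothetical stand-in that strips two English suffixes, not a real morphological analyzer, and the tokenizer simply keeps runs of letters.

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens (runs of letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def toy_stem(token):
    """Hypothetical stemmer: strip 'ing' or 's' if at least 3 letters remain."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def extract_lexicon(corpus, stem=toy_stem):
    """Corpus -> tokenize -> stem -> add if new -> lexicon."""
    lexicon = set()
    for token in tokenize(corpus):
        lexicon.add(stem(token))  # a set only grows when the entry is new
    return lexicon

print(sorted(extract_lexicon("Books and book; reading reads read.")))
```

Six surface words collapse into three entries here, which is the reduction in lexicon size that stemming is meant to achieve.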
6
Lexicon Extraction from Corpora
  • Corpus
  • pl. corpuses or corpora
  • A very large amount of NL representative text.
  • Typically (but not exclusively) from newspapers.
  • Pros
  • Captures the frequencies of natural-language
    usage (utterances).
  • Cons
  • Never complete.
  • Typos.
  • Example
  • CCA

[Arabic sample text from the CCA]
7
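Evaluating Corpora: Zipf's law
  • Empirical law
  • Measures corpus quality
  • Theory
  • f × r = k (a word's frequency f times its
    frequency rank r is roughly a constant k).
  • On a log-log plot of frequency against rank, the
    points fall near a straight line.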
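As a rough illustration of the rank-frequency relation, the sketch below ranks words by frequency and reports the product of rank and frequency for each. The function name `rank_frequency` and the toy token list are assumptions made for this example; on a real corpus the product is only approximately constant.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Toy token list chosen so that the products are easy to check by hand.
tokens = "a a a a a a b b b c c d".split()
for row in rank_frequency(tokens):
    print(row)
```

Plotting the logarithm of the frequency against the logarithm of the rank for these rows gives the log-log plot; a corpus that follows the law yields points close to a line of slope -1.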

8
Morphology
9
Morphology
  • علم الصرف
  • The (grammatical) study of the (internal)
    structure of words.
  • A morpheme is defined as the minimal meaningful
    unit of a language.
  • Types (by August Schleicher)
  • Analytic (Isolating)
  • Concatenative (Agglutinative)
  • Prefix: informal ?????
  • Suffix: formalize ?????
  • Circumfix: informalize ??????
  • Templatic (Fusional)
  • Root: mouse ???
  • Pattern (infix): mice ????

Chinese English Arabic Turkish
10
Templatic Morphology
  • Starts from roots + patterns
  • Examples

11
Roots
Templatic Morphology
  • Primary lexical unit of a word
  • Carries semantic content.
  • Cannot be reduced.
  • What is left when all morphologically added
    structure, including internal structure, has
    been wrung out.
  • In Arabic
  • An ordered sequence of 3, 4, or 5 letters.
  • Corresponds to the bare verb.

12
Patterns
Templatic Morphology
  • AKA measures or forms.
  • Inflectional morphemes
  • Not purely concatenative.
  • General moulds.
  • A sequence of constant and variable characters.
  • Variable characters (ف، ع، ل) (1, 2, 3).
  • To be substituted by the letters of the Arabic
    root.
  • ??? ????
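Substituting root letters into a pattern's variable slots can be sketched as follows. The digit notation (1, 2, 3 for the root letters) follows the slide; the function name `apply_pattern`, the pattern strings, and the use of Buckwalter transliteration are assumptions made for this example.

```python
def apply_pattern(root, pattern):
    """Fill the digit slots of a pattern with the letters of the root.

    '1' stands for the first root letter, '2' for the second, and so on;
    every other character is a constant of the pattern and is copied as is.
    """
    out = []
    for ch in pattern:
        if ch in "12345":
            index = int(ch) - 1
            if index >= len(root):
                raise ValueError(f"slot {ch} exceeds root length {len(root)}")
            out.append(root[index])
        else:
            out.append(ch)
    return "".join(out)

print(apply_pattern("ktb", "1a2a3"))  # katab
print(apply_pattern("ktb", "1A2i3"))  # kAtib
```

The same root yields different stems under different patterns, which is why a root-and-pattern lexicon can stay small while still covering many surface forms.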

13
Concatenative Morphology
14
Concatenative Morphology
  • Starts from stems.
  • Minimal surface-form
  • Nouns, verbs, Particles.
  • But not all surface-words are stems.
  • Roots + Patterns
  • Root.GeneralMeaning + Pattern.SpecificMeaning
  • Only further circumfixation allowed
  • No further infixation.

15
Concatenative Morphology
  • Noun Examples

16
Concatenative Morphology (Cont. )
  • Verb Examples

17
Arabic Morphology
  • Statistics

5,000
400
6,000
700
37 (!)
90,000
130,000
Beesley
Chalabi
150,000
Xerox
DINAR.1
61010 (!)
El-Sadany et al.
18
Concatenative vs. Templatic
Templatic vs. Concatenative
19
Practical Stuff
20
Xerox
  • Worked on both kinds of morphological analysis
  • Internal (Root-pattern)
  • Beesley
  • ((Roots + patterns) + circumfixes)
  • 5,000 roots and 400 patterns
  • External (Stems)
  • Buckwalter
  • (Stems + circumfixes)
  • Starts from over 80,000 stems
  • Has a very popular transliteration system.
  • Named after Buckwalter.
  • A semi-standard now.
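To give a feel for the transliteration, here is a minimal sketch that maps a handful of Arabic characters to their Buckwalter symbols. The table is deliberately partial (the full scheme covers every letter and diacritic), and the function name `to_buckwalter` is an assumption made for this example.

```python
# Partial Arabic-to-Buckwalter table: a few letters and short vowels only.
BUCKWALTER = {
    "\u0627": "A",  # alif
    "\u0628": "b",  # ba
    "\u062A": "t",  # ta
    "\u0641": "f",  # fa
    "\u0643": "k",  # kaf
    "\u0644": "l",  # lam
    "\u0648": "w",  # waw
    "\u064E": "a",  # fatha (short a)
    "\u064F": "u",  # damma (short u)
    "\u0650": "i",  # kasra (short i)
    "\u0652": "o",  # sukun (no vowel)
}

def to_buckwalter(text):
    """Transliterate character by character; unknown characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)

print(to_buckwalter("\u0643\u062A\u0628"))              # kaf, ta, ba
print(to_buckwalter("\u0643\u0650\u062A\u0627\u0628"))  # kaf, kasra, ta, alif, ba
```

Because the mapping is one character to one ASCII symbol, it is reversible, which is part of why the scheme is convenient for lexicon files.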

21
Buckwalter's AraMorph as an example
22
Goals
  • Morphotactics and morphophonemic rules built into
    the lexicon
  • A single lexicon of prefixes/suffixes including
    all valid concatenations
  • Orthographic variations: additional dictionary
    entries.
  • Lexical tagging
  • Stems rather than roots and patterns.

23
Buckwalter's AraMorph
  • Available for free
  • Original Perl version
  • Java version (Pierrick Brihaye)
  • Morphotactics and orthographic rules built in (in
    the lexicons).
  • E.g. contains ?? ??? ??
  • 3 Morpheme Lexicons
  • Stems, prefixes, suffixes.
  • 3 Compatibility tables
  • specify allowed concatenations
  • Prefix-Stem
  • Stem-Suffix
  • Prefix-Suffix

24
Buckwalter's Files
  • Abstract

25
Buckwalter's Files
  • Sample from dictPrefixes

w   wa    Pref-Wa   and <pos>wa/CONJ+</pos>
f   fa    Pref-Wa   and;so <pos>fa/CONJ+</pos>
b   bi    NPref-Bi  by;with <pos>bi/PREP+</pos>
k   ka    NPref-Bi  like;such as <pos>ka/PREP+</pos>
wb  wabi  NPref-Bi  and + by/with <pos>wa/CONJ+bi/PREP+</pos>
fb  fabi  NPref-Bi  and + by/with <pos>fa/CONJ+bi/PREP+</pos>
wk  waka  NPref-Bi  and + like/such as <pos>wa/CONJ+ka/PREP+</pos>
fk  faka  NPref-Bi  and + like/such as <pos>fa/CONJ+ka/PREP+</pos>
Al  Al    NPref-Al  the <pos>Al/DET+</pos>
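Each dictionary entry carries a surface form, a vocalized form, a morphological category, and a gloss with an embedded part-of-speech tag. The sketch below parses one such line. It assumes the fields are tab-separated, that lines starting with `;` are comments, and that the tag is wrapped in `<pos>...</pos>`; the `Entry` type and `parse_entry` function are hypothetical names, not part of AraMorph.

```python
import re
from typing import NamedTuple, Optional

class Entry(NamedTuple):
    form: str            # unvocalized form, e.g. "wb"
    vocalized: str       # vocalized form, e.g. "wabi"
    category: str        # morphological category, e.g. "NPref-Bi"
    gloss: str           # English gloss with the POS tag removed
    pos: Optional[str]   # contents of <pos>...</pos>, or None if absent

POS_TAG = re.compile(r"<pos>(.*?)</pos>")

def parse_entry(line):
    """Parse one dictionary line; return None for blank or comment lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith(";"):
        return None
    form, vocalized, category, gloss = line.split("\t", 3)
    match = POS_TAG.search(gloss)
    pos = match.group(1) if match else None
    gloss = POS_TAG.sub("", gloss).strip()
    return Entry(form, vocalized, category, gloss, pos)

# Sample line written for this example, modelled on the prefix entries above.
print(parse_entry("wb\twabi\tNPref-Bi\tand + by/with <pos>wa/CONJ+bi/PREP+</pos>"))
```

Keeping the category as its own field matters later, because the compatibility tables are keyed on categories rather than on individual morphemes.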
26
Buckwalter's Files
  • Sample from dictStems

;--- ktb
;; katab-u_1
ktb      katab      PV          write
ktb      kotub      IV          write
ktb      kutib      PV_Pass     be written;be fated;be destined
ktb      kotab      IV_Pass_yu  be written;be fated;be destined
;; kAtab_1
kAtb     kAtab      PV          correspond with
kAtb     kAtib      IV_yu       correspond with
;; >akotab_1
>ktb     >akotab    PV          dictate;make write
Aktb     >akotab    PV          dictate;make write
ktb      kotib      IV_yu       dictate;make write
ktb      kotab      IV_Pass_yu  be dictated
;; kitAboxAnap_1
ktAbxAn  kitAboxAn  NapAt       library;bookstore
27
Buckwalter's Files
  • Sample from dictSuffixes

p      ap         NSuff-ap   fem.sg. <pos>+ap/NSUFF_FEM_SG</pos>
ty     atayo      NSuff-tay  two <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS</pos>
tyh    atayohi    NSuff-tay  his/its two <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hu/POSS_PRON_3MS</pos>
tyhmA  atayohimA  NSuff-tay  their two <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+humA/POSS_PRON_3D</pos>
tyhm   atayohim   NSuff-tay  their two <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hum/POSS_PRON_3MP</pos>
tyhA   atayohA    NSuff-tay  its/their/her two <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hA/POSS_PRON_3FS</pos>
tyhn   atayohina  NSuff-tay  their two <pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+huna/POSS_PRON_3FP</pos>
28
Buckwalter's Files
  • Sample from
  • TableAB, TableAC, and TableBC

29
1st Step: Existence
  • Arabic dictionary look-up consists of asking, for
    each segmentation:
  • does the prefix exist in the lexicon of prefixes?
  • if so, does the stem exist in the lexicon of
    stems?
  • if so, does the suffix exist in the lexicon of
    suffixes?
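The existence check can be sketched as: enumerate every way of cutting the word into prefix, stem, and suffix, then keep the cuts whose three parts are all found. The miniature lexicons below are hypothetical (in Buckwalter transliteration, with the empty string standing for "no prefix" or "no suffix"), and the length limits of 4 and 6 are assumptions made for this example.

```python
# Hypothetical miniature lexicons; "" means no prefix / no suffix.
PREFIXES = {"", "w", "f", "b", "k", "wb", "Al"}
STEMS = {"ktb", "kAtb", "ktAbxAn"}
SUFFIXES = {"", "p", "ty", "tyh"}

def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split that leaves a non-empty stem."""
    n = len(word)
    for p in range(0, min(max_prefix, n - 1) + 1):
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]

def existing_segmentations(word):
    """Keep the splits whose prefix, stem, and suffix all exist in the lexicons."""
    return [
        (prefix, stem, suffix)
        for prefix, stem, suffix in segmentations(word)
        if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES
    ]

print(existing_segmentations("wktb"))
print(existing_segmentations("AlktAbxAnp"))
```

A word may survive this step with more than one segmentation; the category check that follows is what filters out combinations that exist piecewise but cannot occur together.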

30
2nd Step: Compatibility
  • If all three word elements (prefix, stem, suffix)
    are found, ask:
  • is the morphological category of the prefix
    compatible with the morphological category of the
    stem?
  • if so, is the morphological category of the
    prefix compatible with the morphological category
    of the suffix?
  • if so, is the morphological category of the stem
    compatible with the morphological category of the
    suffix?
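The three questions above amount to three set-membership tests on category pairs. In the sketch below the lexicons map each form to a category and each table is a set of allowed pairs; all entries, the category names `Pref-0` and `Suff-0` for the empty affixes, and the function `is_valid` are hypothetical choices made for this example.

```python
# Hypothetical miniature lexicons: form -> morphological category.
PREFIX_CAT = {"": "Pref-0", "w": "Pref-Wa", "Al": "NPref-Al"}
STEM_CAT = {"ktb": "PV", "ktAbxAn": "NapAt"}
SUFFIX_CAT = {"": "Suff-0", "p": "NSuff-ap"}

# Hypothetical compatibility tables: sets of allowed category pairs.
TABLE_AB = {  # prefix-stem
    ("Pref-0", "PV"), ("Pref-Wa", "PV"),
    ("Pref-0", "NapAt"), ("Pref-Wa", "NapAt"), ("NPref-Al", "NapAt"),
}
TABLE_AC = {  # prefix-suffix
    ("Pref-0", "Suff-0"), ("Pref-Wa", "Suff-0"),
    ("Pref-0", "NSuff-ap"), ("Pref-Wa", "NSuff-ap"), ("NPref-Al", "NSuff-ap"),
}
TABLE_BC = {  # stem-suffix
    ("PV", "Suff-0"), ("NapAt", "NSuff-ap"),
}

def is_valid(prefix, stem, suffix):
    """Apply the existence step, then the three compatibility checks in order."""
    if prefix not in PREFIX_CAT or stem not in STEM_CAT or suffix not in SUFFIX_CAT:
        return False
    a, b, c = PREFIX_CAT[prefix], STEM_CAT[stem], SUFFIX_CAT[suffix]
    return (a, b) in TABLE_AB and (a, c) in TABLE_AC and (b, c) in TABLE_BC

print(is_valid("w", "ktb", ""))         # conjunction + perfect verb
print(is_valid("Al", "ktb", ""))        # definite article + verb
print(is_valid("Al", "ktAbxAn", "p"))   # article + noun + feminine suffix
```

Storing allowed pairs per category, rather than per morpheme, keeps the tables small: a new stem only needs a category label to inherit all of that category's combinations.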

31
Links
  • http://www.qamus.org/morphology.htm
  • http://students.cs.byu.edu/jonsafar/cgi-bin/aramorph_fast.cgi
  • http://www.AraMorph.nongnu.org

32
Finally
  • A Program run.
  • Questions?
  • Main References
  • Elarian YS. Lexicon Generation for Arabic Optical
    Text Recognition [dissertation]. Jordan
    University of Science and Technology; 2006
    August.
  • Habash N. Introduction to Arabic natural language
    processing. ACL'05 Tutorial; 2005 June 25; Ann
    Arbor, USA.
  • Wikipedia, the free encyclopedia [Online].
    Accessed 2006 December. Available from URL:
    http://en.wikipedia.org/wiki/.
  • Thanks.