1
A computational Lexicon for Contemporary Hebrew
  • Alon Itai CS Technion
  • Shuly Wintner CS Haifa University
  • Shlomo Yona CS Haifa University

2
Outline
  • Modern Hebrew
  • What is a lexicon?
  • What is in our lexicon?
  • Why do we need it?
  • How did we acquire it?

3
Modern Hebrew
  • Official Language of the State of Israel
  • Spoken by 7 M people
  • Related to, but linguistically distinct from,
    Biblical Hebrew.

4
Semitic Word Formation
  • root + pattern → word

  root   CaCaC               hitCaCCeC                  yiCCoC
  ktb    katab (he wrote)    hitkatteb (corresponded)   yiktob (he will write)
  šbr    šabar (he broke)    hištabber (refract)        yišbor (he will break)

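The root-and-pattern combination above can be sketched in a few lines. This is only an illustration, not the lexicon's actual generator: the numbered-slot notation (1, 2, 3 for the first, second and third root consonant) is an assumption introduced here so that a doubled middle consonant, as in hitCaCCeC, can be written explicitly, and `apply_pattern` is a hypothetical helper name.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Interdigitate a consonantal root into a pattern.

    Digits in the pattern are slots: '1' is the first root consonant,
    '2' the second, and so on. All other characters are copied as-is.
    (Slot notation is an assumption of this sketch, not the slide's.)
    """
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch
        for ch in pattern
    )


# Patterns from the slide, rewritten with numbered slots.
CACAC = "1a2a3"          # CaCaC
HITCACCEC = "hit1a22e3"  # hitCaCCeC (middle consonant doubled)
YICCOC = "yi12o3"        # yiCCoC

print(apply_pattern("ktb", CACAC))      # katab
print(apply_pattern("ktb", HITCACCEC))  # hitkatteb
print(apply_pattern("šbr", YICCOC))     # yišbor
```

Plain slot filling does not produce hištabber from šbr: that form also swaps the t of the prefix with the first root consonant, which this sketch does not model.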
5
Writing System
  • Most vowels are omitted
  • Particles are prepended to words.
  • Example:
  • h: definite article (the)
  • b: preposition (in)
  • w: conjunction (and)
  • wbbyt = w + b + ha + byt
  • "and in the house"
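Because particles attach directly to the word, a parser has to consider every way of peeling them off. The sketch below enumerates those candidates for a transliterated token. It is a simplification under stated assumptions: the prefix inventory is just the three particles listed on the slide, any order of particles is allowed, and the unwritten article (the "ha" in the example) is ignored. `segmentations` is a hypothetical name.

```python
# Prefix particles from the slide: conjunction, preposition, definite article.
PREFIXES = ("w", "b", "h")


def segmentations(token: str):
    """Return every (prefixes, stem) split obtained by stripping zero or
    more prefix particles from the front of the token."""
    results = [((), token)]
    for p in PREFIXES:
        # Keep at least one character for the stem.
        if token.startswith(p) and len(token) > len(p):
            for prefixes, stem in segmentations(token[len(p):]):
                results.append(((p,) + prefixes, stem))
    return results


for prefixes, stem in segmentations("wbbyt"):
    print("+".join(prefixes + (stem,)))
```

Most of the candidates are spurious (for instance w+b+b+yt); deciding which stems are real words is the job of the lexicon.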

6
Morphological Ambiguity
  • Most words are morphologically ambiguous
  • Example: šbth (שבתה)
  • šavta = šbt + CaCCa: stopped working
  • šavta = šbh + CaCCa: took prisoner
  • šabatah: her Saturday
  • še-b-te: that in tea
  • še-b-ha-te: that in the tea
  • še-bit-h: that her daughter

7
How to morphologically parse?
  • Create all patterns.
  • Given a token, check whether it fits a
    pattern.
  • This creates a lot of superfluous parses.
  • Use a lexicon to reduce the number of parses.

Example (in English): xxxs → xxx (noun) + s
houses → house
bosses → bosse, rejected because bosse ∉ lexicon
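The English example can be made concrete with a small over-generate-then-filter sketch. The rule set and the two-word lexicon are toy assumptions for illustration (the second rule, stripping "es", is not on the slide), and the function names are hypothetical.

```python
# Toy lexicon: lexeme -> part of speech (illustrative only).
LEXICON = {"house": "noun", "boss": "noun"}

# Plural suffixes to strip. "s" is the slide's rule; "es" is an
# additional rule assumed here for the sake of the example.
PLURAL_SUFFIXES = ("s", "es")


def candidate_stems(token: str):
    """Over-generate: strip every suffix that matches, with no lexicon check."""
    return [
        token[: -len(suffix)]
        for suffix in PLURAL_SUFFIXES
        if token.endswith(suffix) and len(token) > len(suffix)
    ]


def parses(token: str, lexicon=LEXICON):
    """Keep only the candidates whose stem is a known lexeme."""
    return [stem for stem in candidate_stems(token) if stem in lexicon]


print(candidate_stems("bosses"))  # ['bosse', 'boss']
print(parses("bosses"))           # ['boss']
print(parses("houses"))           # ['house']
```

Pattern matching alone proposes bosse; the lexicon lookup is what discards it.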
8
Acquisition
  • Started with the lexicons of previous morphological
    analyzers (HSPELL, Segal).
  • Added missing conjugations, such as passives, and
    nominalizations (manually verified).
  • Parsed corpora and listed tokens that had no
    morphologically valid parse (mainly proper
    names). Added them manually to the lexicon.
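The last acquisition step, listing corpus tokens that receive no valid parse, might look roughly like the following. `has_parse` stands in for the morphological analyzer and is a hypothetical callback; the token list and the set of known words are made-up toy data.

```python
from collections import Counter


def unparsed_tokens(tokens, has_parse):
    """Count the corpus tokens for which the analyzer finds no valid parse."""
    return Counter(t for t in tokens if not has_parse(t))


def coverage(tokens, has_parse):
    """Fraction of corpus tokens that receive at least one parse."""
    tokens = list(tokens)
    parsed = sum(1 for t in tokens if has_parse(t))
    return parsed / len(tokens)


# Toy data: a stand-in analyzer that only knows two words.
known = {"byt", "wbbyt"}
corpus = ["byt", "wbbyt", "xyz", "xyz", "abc"]

missing = unparsed_tokens(corpus, known.__contains__)
print(missing.most_common())                 # most frequent gaps first
print(coverage(corpus, known.__contains__))  # 0.4
```

Sorting the gaps by frequency lets a lexicographer add the most common missing items first.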

9
GUI for editing the lexicon
11
Size of the lexicon by part of speech
Part of speech    Entries
noun               10,332
verb                4,485
proper name         4,227
adjective           1,612
adverb                352
quantifier            132
preposition           100
conjunction            62
pronoun                60
interjection           40
interrogative           9
negation                6
Total              21,417
12
Organization
  • Ordered by lexeme, not root.
  • Similar to nearly all dictionaries.
  • Most laymen cannot identify the root.
  • The semantics is associated with the lexeme and
    only loosely with the root
  • Example (root pqd): paqad (visited), hitpaqqed,
    nifqad (missing), hifqid (deposited), piqqed (commanded)

13
Structure of an entry
  • Unique ID
  • Nominals (nouns, adjectives):
  • The lexical item: dotted, undotted,
    transliterated
  • POS
  • Gender / number
  • Plural suffix (im, ot).
  • Inflection base (if different)
  • Exceptions (if inflection has exceptions)

14
Structure of an entry (2)
  • Verbs:
  • Root
  • Inflection pattern: binyan, plus the pattern of the 1st
    binyan, e.g. škb + tiCCaC → tiškab (tiškav);
    psl + tiCCoC → tipsol (tifsol)
  • Valency

15
XML
  • The lexicon is represented in XML
  • Readable both by machines and by humans
  • Enables using off-the-shelf tools for on-screen
    presentation and validation
  • Example:
    <item id="17580" script="formal"
          transliterated="bwqr" undotted="בוקר"
          dotted="??????">
      <noun gender="masculine" number="singular"
            plural="im">
        <replace gender="masculine"
                 number="plural" script="formal"
                 transliterated="bqarim" undotted="בקרים"/>
      </noun>
    </item>

Info for the morphological parser
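
As a rough illustration of how a program could read such an entry, the sketch below parses a copy of the example with Python's standard xml.etree.ElementTree. The Hebrew-script attributes are left out to keep the sample ASCII-only. Treating `replace` as an exception that overrides the regular plural suffix is an assumption made here, not something the slide states, and `plural_form` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The slide's example entry, minus the dotted/undotted attributes.
ENTRY = """
<item id="17580" script="formal" transliterated="bwqr">
  <noun gender="masculine" number="singular" plural="im">
    <replace gender="masculine" number="plural" script="formal"
             transliterated="bqarim"/>
  </noun>
</item>
"""


def plural_form(item):
    """Return the transliterated plural of a noun entry.

    Assumption: a <replace number="plural"> child overrides the regular
    form, which is otherwise the base plus the plural suffix.
    """
    noun = item.find("noun")
    override = noun.find("replace[@number='plural']")
    if override is not None:
        return override.get("transliterated")
    return item.get("transliterated") + noun.get("plural")


item = ET.fromstring(ENTRY.strip())
print(item.get("id"), item.get("transliterated"))  # 17580 bwqr
print(plural_form(item))                           # bqarim
```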
16
License
  • Available under the GPL (GNU General Public License).
    You get it for free if all products derived from it
    are also under the GPL.
  • A non-exclusive license is available for commercial
    use.

17
Conclusions
  • Created a comprehensive lexicon of Modern Hebrew.
  • Identifies 96% of all tokens in the corpus.
  • Missing: proper names, typos, nonstandard
    spellings.
  • Open for research under GPL
  • Created within the Knowledge Center for
    Processing Hebrew

18
Acknowledgements
  • Knowledge Center for Processing Hebrew
  • Israel Ministry for Science and Technology
  • People:
  • Shuly Wintner, Haifa University
  • Shlomo Yona, Haifa University
  • Yoad Winter, Technion
  • Shira Schwartz, lexicographer
  • Dalia Bojan, software engineer