Arabic WordNet: Semi-automatic Extensions using Bayesian Inference

Slides: 21
Provided by: davi967
Learn more at: http://www.lrec-conf.org
1
Arabic WordNet: Semi-automatic Extensions using
Bayesian Inference
  • H. Rodríguez (1), D. Farwell (1), J. Farreres (1), M.
    Bertran (1), M. Alkhalifa (2), M.A. Martí (2)

(1) TALP Research Center, UPC, Barcelona, Spain
(2) UB, Barcelona, Spain
2
Index of the talk
  • The AWN project
  • Semi-automatic Extensions of AWN
  • Intuitive basis
  • Previous work using heuristics
  • Using Bayesian Networks
  • Empirical evaluation
  • Conclusions

3
The AWN project
  • Funded by the USA REFLEX program (2005-2007)
  • Partners
    • Universities
      • Princeton, Manchester, UPC, UB
    • Companies
      • Articulate Software, Irion
  • Description
    • Black et al, 2006
    • Elkateb et al, 2006
    • Rodríguez et al, 2008

4
The AWN project
  • Objectives
    • 10,000 synsets, including some amount of
      domain-specific data
    • linked to PWN 2.0
    • finally to PWN 3.0
    • linked to SUMO
    • 1,000 named entities (NE)
    • manually built (or revised)
    • vowelized entries
    • including the root of each entry

5
The AWN project
  • Current figures

Arabic synsets                                11270
Arabic words                                  23496

POS                                      DB content
adj                                             661
nouns                                          7961
adv                                             110
verbs                                          2538

Named entities
Synsets that are named entities                1142
Synsets that are not named entities           10028
Words in synsets that are named entities       1656
6
Semi-automatic Extensions of AWN
  • Intuitive basis
  • In Arabic (and other Semitic languages), many
    words sharing a common root (i.e. a sequence of
    typically three consonants) have related meanings
    and can be derived from a base verbal form by
    means of a small set of lexical rules

7
Semi-automatic Extensions of AWN
8
Semi-automatic Extensions of AWN
  • Lexical rules
    • regular verbal derivative forms
    • regular nominal and adjectival derivative forms
      • masdar (verbal noun)
      • masculine and feminine active and passive
        participles
    • inflected verbal forms
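
The root-and-pattern idea behind these lexical rules can be sketched as follows. This is a minimal illustration, not the AWN rule set: the template names, the template strings and the simplified Latin transliteration (loosely Buckwalter-like, without sukun marks) are all assumptions made for the example.

```python
# Toy root-and-pattern derivation in a simplified Latin transliteration.
# The templates are illustrative assumptions, not the rules used in AWN.
# Digits 1, 2, 3 stand for the first, second and third root consonants.
TEMPLATES = {
    "form_I_perfect": "1a2a3a",       # e.g. kataba
    "form_II_perfect": "1a22a3a",     # e.g. kattaba (doubled middle consonant)
    "form_III_perfect": "1A2a3a",     # e.g. kAtaba
    "active_participle": "1A2i3",     # e.g. kAtib
    "passive_participle": "ma12uw3",  # e.g. maktuwb
}


def apply_template(root, template):
    """Fill a template's numbered slots with the consonants of a triliteral root."""
    if len(root) != 3:
        raise ValueError("this sketch only handles triliteral roots")
    slots = {"1": root[0], "2": root[1], "3": root[2]}
    return "".join(slots.get(ch, ch) for ch in template)


def derive_candidates(root):
    """Return every candidate form the templates produce for one root."""
    return {name: apply_template(root, t) for name, t in TEMPLATES.items()}


candidates = derive_candidates(("k", "t", "b"))
print(candidates["active_participle"])
```

Each generated string is only a candidate word form; whether it is an attested word still has to be checked, which is what the filtering step of the procedure is for.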

9
Semi-automatic Extensions of AWN
  • Procedure for generating a set of likely <Arabic
    word, English synset, score> tuples
    • produce an initial list of candidate word forms
    • filter out the less likely candidates from this
      list
    • generate an initial list of attachments
    • score the reliability of these candidates
    • manually review the best-scored candidates and
      include the valid associations in AWN
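
The automatic steps of this procedure might be chained as in the sketch below. Everything in it is hypothetical: the toy dictionaries stand in for a corpus, a bilingual lexicon and PWN, the synset labels are made up, filtering by corpus frequency is only one plausible reading of "less likely", and the scorer is a placeholder for the heuristic or Bayesian scoring described later.

```python
# Hypothetical toy data: none of these values come from the AWN resources.
CORPUS_COUNTS = {"kAtib": 120, "maktuwb": 45, "kAtaba": 0}
BILINGUAL_LEXICON = {"kAtib": ["writer", "clerk"], "maktuwb": ["letter"]}
ENGLISH_SYNSETS = {
    "writer": ["writer.n.01"],
    "clerk": ["clerk.n.01"],
    "letter": ["letter.n.01"],
}


def filter_candidates(forms, min_count=1):
    """Keep only forms seen at least min_count times (assumed filter criterion)."""
    return [f for f in forms if CORPUS_COUNTS.get(f, 0) >= min_count]


def generate_attachments(forms):
    """Pair each Arabic form with the synsets of its English translations."""
    pairs = []
    for form in forms:
        for english in BILINGUAL_LEXICON.get(form, []):
            for synset in ENGLISH_SYNSETS.get(english, []):
                pairs.append((form, synset))
    return pairs


def placeholder_scorer(word, synset):
    """Placeholder score: relative corpus frequency of the Arabic form."""
    total = sum(CORPUS_COUNTS.values())
    return CORPUS_COUNTS.get(word, 0) / total


def score_attachments(pairs, scorer):
    """Attach a score to each pair and sort the tuples best-first."""
    scored = [(word, synset, scorer(word, synset)) for word, synset in pairs]
    return sorted(scored, key=lambda t: t[2], reverse=True)


forms = ["kAtib", "maktuwb", "kAtaba"]
ranked = score_attachments(
    generate_attachments(filter_candidates(forms)), placeholder_scorer
)
for word, synset, score in ranked:
    print(word, synset, round(score, 3))
```

The ranked tuples are what a lexicographer would then review by hand before anything is added to AWN.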

10
Semi-automatic Extensions of AWN
  • Resources
  • PWN
  • AWN
  • LOGOS database of conjugated Arabic verbs
  • NMSU bilingual Arabic-English lexicon
  • Arabic Gigaword Corpus
  • UN (2000-2002) bilingual Arabic-English Corpus

11
Semi-automatic Extensions of AWN
  • Score the reliability of the candidates
    • build a graph representing the words, synsets and
      their associations
      • synset-synset associations
        • explicit in WN2.0
        • path-based
    • apply a set of heuristic rules that directly use
      the structure of the graph
      • GWC 2008
    • apply Bayesian inference
      • LREC 2008

12
Using Bayesian Inference
13
Using Bayesian Inference
14
Using Bayesian Inference
  • Building the CPT for each node in the BN
    • edges between English words (EW) and Arabic
      words (AW)
      • probabilities from statistical translation
        models built from the UN corpus using GIZA
        (word-word probabilities), filtered to avoid
        pairs having Arabic expressions with invalid
        Buckwalter encodings
      • all the probability mass is distributed among
        the pairs occurring in the BN
    • other edges (EW-S, S-S)
      • linear distribution on priors
      • noisy-OR model
15
Using Bayesian Inference
  • Performing Bayesian Inference in the BN
    • Assign probability 1 to nodes in layer 1
    • Infer the probabilities of nodes in layer 3
    • For each word in layer 1, select as candidates
      the synsets in layer 3 that are connected to it
      and have a probability over a threshold
    • Score each candidate pair with this probability
    • Select the candidates scored over a threshold
16
Empirical Evaluation
  • 10 verbs randomly selected from AWN

17
Empirical Evaluation
  • Results

18
Conclusions
  • The BN approach roughly doubles the number of
    candidates produced by the previous heuristic
    (HEU) approach (554 vs 272).
  • The sample is clearly insufficient.
  • The overlap between HEU and BN seems to improve
    the results.
  • An analysis of the errors shows that a substantial
    number were due to the lack of the shadda
    diacritic or the feminine ending form (ta
    marbuta, ة).

19
Further work
  • Repeat the entire procedure, relying when possible
    on dictionaries containing diacritics
  • Refine the scoring procedure by assigning
    different weights to the different relations
  • Include additional relations (e.g. path-based)
  • Use additional knowledge sources for weighting
    the relations
    • related entries already included in AWN
    • SUMO
    • Magnini's domain codes

20
  • Thank you for your attention