Title: Arabic WordNet: Semi-automatic Extensions using Bayesian Inference
1Arabic WordNet Semi-automatic Extensions using
Bayesian Inference
- H. Rodríguez (1), D. Farwell (1), J. Farreres (1), M. Bertran (1), M. Alkhalifa (2), M.A. Martí (2)
- (1) TALP Research Center, UPC, Barcelona, Spain
- (2) UB, Barcelona, Spain
2Index of the talk
- The AWN project
- Semi-automatic Extensions of AWN
- Intuitive basis
- Previous work using heuristics
- Using Bayesian Networks
- Empirical evaluation
- Conclusions
3The AWN project
- Funded by the USA REFLEX program (2005-2007)
- Partners
- Universities
- Princeton, Manchester, UPC, UB
- Companies
- Articulate Software, Irion
- Description
- Black et al, 2006
- Elkateb et al, 2006
- Rodríguez et al, 2008
4The AWN project
- Objectives
- 10,000 synsets, including some domain-specific data
- linked to PWN 2.0
- eventually to PWN 3.0
- linked to SUMO
- 1,000 named entities (NE)
- manually built (or revised)
- vowelized entries
- including root of each entry
5The AWN project
Arabic synsets: 11270
Arabic words: 23496

POS      DB content
adj      661
nouns    7961
adv      110
verbs    2538

Named entities
Synsets that are named entities             1142
Synsets that are not named entities         10028
Words in synsets that are named entities    1656
6Semi-automatic Extensions of AWN
- Intuitive basis
- In Arabic (and other Semitic languages), many
words that share a common root (i.e. a sequence of
typically three consonants) have related meanings
and can be derived from a base verbal form by
means of a small set of lexical rules
7Semi-automatic Extensions of AWN
8Semi-automatic Extensions of AWN
- Lexical rules
- regular verbal derivative forms
- regular nominal and adjectival derivative forms
- masdar (verbal noun)
- masculine and feminine active and passive participles
- inflected verbal forms
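
As a rough illustration of how lexical rules of this kind can operate on a root, the sketch below fills the three consonants of a root into templates. The function names `apply_pattern` and `derive`, the digit-slot template notation, the rule table and the loosely Buckwalter-like transliteration are all assumptions made for this example, not the rule format used in the AWN work. Only a few regular Form I patterns are shown.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill root consonants into a template; digits 1-3 mark the consonant slots."""
    if len(root) != 3:
        raise ValueError("this sketch only handles triliteral roots")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)


# Hypothetical rule table (rule name -> template), for illustration only.
# "A" stands for a long a, "uw" for a long u, "p" for the ta marbuta ending.
PATTERNS = {
    "base verb (Form I, perfective)": "1a2a3a",
    "active participle (masc.)": "1A2i3",
    "active participle (fem.)": "1A2i3ap",
    "passive participle (masc.)": "ma12uw3",
}


def derive(root: str) -> dict:
    """Apply every rule in the table to one root."""
    return {name: apply_pattern(root, tpl) for name, tpl in PATTERNS.items()}


if __name__ == "__main__":
    for name, form in derive("ktb").items():
        print(f"{name}: {form}")
```

Forms produced this way are only candidates: a rule can generate a string that is not an attested word, which is why a filtering step is needed afterwards.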
9Semi-automatic Extensions of AWN
- Procedure for generating a set of likely <Arabic word, English synset, score> triples
- produce an initial list of candidate word forms
- filter out the less likely candidates from this list
- generate an initial list of attachments
- score the reliability of these candidates
- manually review the best-scored candidates and include the valid associations in AWN
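
The steps of this procedure can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: the function names, the corpus-frequency filter, the toy lexicon, the placeholder synset identifiers (not real PWN identifiers) and the stand-in scoring function. The actual filtering and scoring used in the project are not reproduced.

```python
from typing import Callable, Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (Arabic word form, English synset id)


def filter_candidates(forms: List[str], corpus_counts: Dict[str, int],
                      min_count: int = 1) -> List[str]:
    """Keep only the forms attested at least min_count times (assumed filter)."""
    return [f for f in forms if corpus_counts.get(f, 0) >= min_count]


def attach(forms: List[str], lexicon: Dict[str, List[str]],
           synsets_of: Dict[str, List[str]]) -> Set[Pair]:
    """Link each form to the synsets of its English translations."""
    pairs: Set[Pair] = set()
    for form in forms:
        for english in lexicon.get(form, []):
            for synset in synsets_of.get(english, []):
                pairs.add((form, synset))
    return pairs


def rank(pairs: Set[Pair], score: Callable[[str, str], float],
         threshold: float) -> List[Tuple[str, str, float]]:
    """Score every pair, drop those under the threshold, best first."""
    scored = [(aw, s, score(aw, s)) for aw, s in pairs]
    kept = [t for t in scored if t[2] >= threshold]
    return sorted(kept, key=lambda t: (-t[2], t[0], t[1]))


def demo(threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    # Toy data, entirely hypothetical.
    forms = ["kataba", "kAtib", "maktuwb", "xyz"]
    counts = {"kataba": 120, "kAtib": 45, "maktuwb": 30}
    lexicon = {"kataba": ["write"], "kAtib": ["writer", "clerk"]}
    synsets_of = {"write": ["write.v.01"], "writer": ["writer.n.01"],
                  "clerk": ["clerk.n.01"]}
    # Stand-in score: fewer translations means a more reliable attachment.
    score = lambda aw, s: 1.0 / len(lexicon[aw])
    kept = filter_candidates(forms, counts)
    return rank(attach(kept, lexicon, synsets_of), score, threshold)


if __name__ == "__main__":
    for triple in demo():
        print(triple)
```

The ranked list is what a lexicographer would review by hand in the last step; nothing is added to the wordnet automatically.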
10Semi-automatic Extensions of AWN
- Resources
- PWN
- AWN
- LOGOS database of conjugated Arabic verbs
- NMSU bilingual Arabic-English lexicon
- Arabic Gigaword Corpus
- UN (2000-2002) bilingual Arabic-English Corpus
11Semi-automatic Extensions of AWN
- Score the reliability of the candidates
- build a graph representing the words, synsets and their associations
- associations synset-synset
- explicit in WN2.0
- path-based
- apply a set of heuristic rules that directly use the structure of the graph
- GWC 2008
- apply Bayesian inference
- LREC 2008
12Using Bayesian Inference
13Using Bayesian Inference
14Using Bayesian Inference
- Building the CPT for each node in the BN
- edges EW-AW
- probabilities from statistical translation models built from the UN corpus using GIZA (word-word probabilities), filtered to avoid pairs having Arabic expressions with invalid Buckwalter encodings
- all the probability mass is distributed between pairs occurring in the BN
- other edges (EW-S, S-S)
- linear distribution on priors
- noisy-OR model
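
Two of the ingredients above can be sketched in a few lines: redistributing the translation probability mass over the pairs that actually occur in the network, and building a noisy-OR conditional probability table. The function names, the choice to normalize per source word, and the toy numbers are assumptions for illustration; the slide does not specify these details.

```python
from itertools import product
from math import prod
from typing import Dict, List, Sequence, Set, Tuple

Pair = Tuple[str, str]  # (source word, target word)


def renormalize(trans_probs: Dict[Pair, float], pairs_in_bn: Set[Pair]) -> Dict[Pair, float]:
    """Keep only pairs present in the network and rescale them so that,
    for each source word, the kept probabilities sum to 1 (assumed scheme)."""
    kept = {pair: p for pair, p in trans_probs.items() if pair in pairs_in_bn}
    totals: Dict[str, float] = {}
    for (src, _), p in kept.items():
        totals[src] = totals.get(src, 0.0) + p
    return {pair: p / totals[pair[0]] for pair, p in kept.items() if totals[pair[0]] > 0}


def noisy_or(link_probs: Sequence[float], parent_states: Sequence[bool]) -> float:
    """P(child is true | parent states) = 1 - product of (1 - p_i) over true parents."""
    return 1.0 - prod(1.0 - p for p, on in zip(link_probs, parent_states) if on)


def noisy_or_cpt(link_probs: List[float]) -> Dict[Tuple[bool, ...], float]:
    """Full CPT of a noisy-OR node: one row per combination of parent states."""
    return {states: noisy_or(link_probs, states)
            for states in product([False, True], repeat=len(link_probs))}


if __name__ == "__main__":
    for states, p in noisy_or_cpt([0.5, 0.5]).items():
        print(states, p)
```

The appeal of the noisy-OR form is that a node with n parents needs only n link probabilities instead of a table with 2^n independently estimated rows.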
15Using Bayesian Inference
- Performing Bayesian Inference in the BN
- Assign probability 1 to nodes in layer 1
- Infer the probabilities of nodes in layer 3
- For each word in layer 1, select as candidates the synsets in layer 3 that are connected to it and have probability over a threshold
- Score the candidate pair with this probability
- Select the candidates scored over a threshold
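
A minimal sketch of these inference steps on a three-layer toy network follows. It is an approximation, not the inference engine used in the work: beliefs are pushed forward layer by layer with a noisy-OR combination that treats the parents of a node as independent, and edges inside a layer (such as S-S) are ignored. The node names, link probabilities and function names are hypothetical.

```python
from math import prod
from typing import Dict, List, Set, Tuple

# parents[node] = list of (parent node, link probability)
Parents = Dict[str, List[Tuple[str, float]]]


def propagate(layers: List[List[str]], parents: Parents) -> Dict[str, float]:
    """Set layer-1 nodes to probability 1, then push beliefs forward.
    Approximation: parents are treated as independent (noisy-OR)."""
    belief = {node: 1.0 for node in layers[0]}
    for layer in layers[1:]:
        for node in layer:
            belief[node] = 1.0 - prod(
                1.0 - w * belief[p] for p, w in parents.get(node, []))
    return belief


def reachable(word: str, layers: List[List[str]], parents: Parents) -> Set[str]:
    """Nodes of the last layer connected to a given layer-1 word."""
    frontier = {word}
    for layer in layers[1:]:
        frontier = {n for n in layer
                    if any(p in frontier for p, _ in parents.get(n, []))}
    return frontier


def select(layers: List[List[str]], parents: Parents,
           threshold: float) -> List[Tuple[str, str, float]]:
    """Candidate (word, synset, score) triples scored over the threshold."""
    belief = propagate(layers, parents)
    return [(w, s, belief[s])
            for w in layers[0]
            for s in sorted(reachable(w, layers, parents))
            if belief[s] >= threshold]


if __name__ == "__main__":
    # Toy network: one layer-1 word, two layer-2 words, three layer-3 synsets.
    layers = [["kataba"], ["write", "compose"], ["S_write", "S_compose", "S_other"]]
    parents = {
        "write": [("kataba", 0.8)],
        "compose": [("kataba", 0.4)],
        "S_write": [("write", 0.9), ("compose", 0.5)],
        "S_compose": [("compose", 0.5)],
    }
    print(select(layers, parents, threshold=0.5))
```

In this toy network only `S_write` passes a threshold of 0.5, because it is supported by two layer-2 words at once; a synset reached through a single weak link falls below the cut.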
16Empirical Evaluation
- 10 verbs randomly selected from AWN ???
17Empirical Evaluation
18Conclusions
- The BN approach doubles the number of candidates of the previous HEU approach (554 vs 272).
- The sample is clearly insufficient.
- The overlap of HEU and BN seems to improve the results.
- An analysis of the errors shows that a substantial number were due to the lack of the shadda diacritic or the feminine ending form (ta marbuta, ة).
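
The shadda problem can be made concrete with a short sketch. The verbs darasa ("to study") and darrasa ("to teach") differ only by a shadda on the middle consonant, so they collapse into the same string once diacritics are missing. The function name and the choice of example are assumptions for illustration and are not taken from the evaluation data.

```python
SHADDA = "\u0651"
# Arabic short-vowel and related marks, U+064B (fathatan) to U+0652 (sukun).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def strip_diacritics(word: str) -> str:
    """Remove vowel marks, shadda and sukun, as in unvowelized text."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


# dal + fatha, ra + fatha, sin + fatha
DARASA = "\u062F\u064E\u0631\u064E\u0633\u064E"
# dal + fatha, ra + shadda + fatha, sin + fatha
DARRASA = "\u062F\u064E\u0631\u0651\u064E\u0633\u064E"

if __name__ == "__main__":
    print(DARASA != DARRASA)                                      # distinct when vowelized
    print(strip_diacritics(DARASA) == strip_diacritics(DARRASA))  # identical when stripped
```

A lexicon or corpus without diacritics therefore cannot tell such pairs apart, and a candidate generated for one form may be attached to the synset of the other.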
19Further work
- Repeat the entire procedure relying, when possible, on dictionaries containing diacritics
- Refine the scoring procedure by assigning different weights to the different relations
- Include additional relations (e.g. path-based)
- Use additional knowledge sources for weighting the relations
- related entries already included in AWN
- SUMO
- Magnini's domain codes
20- Thank you for your attention