Arabic WordNet: Semi-automatic Extensions using Bayesian Inference

Learn more at: http://www.lrec-conf.org


1
Arabic WordNet: Semi-automatic Extensions using
Bayesian Inference

H. Rodríguez (1), D. Farwell (1), J. Farreres (1), M.
Bertran (1), M. Alkhalifa (2), M.A. Martí (2)

(1) TALP Research Center, UPC, Barcelona, Spain
(2) UB, Barcelona, Spain
2
Index of the talk

The AWN project
Semi-automatic Extensions of AWN
Intuitive basis
Previous work using heuristics
Using Bayesian Networks
Empirical evaluation
Conclusions

3
The AWN project

Funded by the USA REFLEX program (2005-2007)
Partners
Universities
Princeton, Manchester, UPC, UB
Companies
Articulate Software, Irion
Description
Black et al., 2006
Elkateb et al., 2006
Rodríguez et al., 2008

4
The AWN project

Objectives
10,000 synsets, including some domain-specific
data
linked to PWN 2.0,
and eventually to PWN 3.0
linked to SUMO
1,000 named entities (NE)
manually built (or revised)
vowelized entries
including root of each entry

5
The AWN project

Current figures

Arabic synsets    11270
Arabic words      23496

POS       DB content
adj       661
nouns     7961
adv       110
verbs     2538

Named entities
Synsets that are named entities             1142
Synsets that are not named entities         10028
Words in synsets that are named entities    1656
6
Semi-automatic Extensions of AWN

Intuitive basis
In Arabic (and other Semitic languages), many
words that share a common root (i.e. a sequence of
typically three consonants) have related meanings
and can be derived from a base verbal form by
means of a small set of lexical rules

7
Semi-automatic Extensions of AWN
8
Semi-automatic Extensions of AWN

Lexical rules
regular verbal derivative forms
regular nominal and adjectival derivative forms
masdar (verbal noun)
masculine and feminine active and passive
participles
inflected verbal forms
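
As a rough illustration of root-and-pattern derivation, the sketch below fills a three-consonant root into a few templates. The pattern table is hypothetical and covers only the regular form I participles, written in a simplified Buckwalter-style transliteration (sukun omitted, "p" for ta marbuta). It is not the rule set used in the AWN project.

```python
# Hypothetical pattern table (an assumption, not the project's rules).
# {c1}, {c2}, {c3} are the three root consonants.
PATTERNS = {
    "active_participle_m": "{c1}A{c2}i{c3}",     # e.g. kAtib
    "active_participle_f": "{c1}A{c2}i{c3}ap",   # e.g. kAtibap
    "passive_participle_m": "ma{c1}{c2}uw{c3}",  # e.g. maktuwb
    "passive_participle_f": "ma{c1}{c2}uw{c3}ap",
}


def derive(root):
    """Return candidate word forms for a triliteral root such as 'ktb'."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-consonant roots")
    c1, c2, c3 = root
    return {name: pattern.format(c1=c1, c2=c2, c3=c3)
            for name, pattern in PATTERNS.items()}


forms = derive("ktb")
```

Forms produced this way are only candidates: many will not be attested words, which is why generated forms need to be filtered and reviewed before use.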

9
Semi-automatic Extensions of AWN

Procedure for generating a set of likely <Arabic
word, English synset, score> tuples
produce an initial list of candidate word forms
filter out the less likely candidates from this
list
generate an initial list of attachments
score the reliability of these candidates
manually review the best scored candidates and
include the valid associations in AWN.
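
The automatic steps of this procedure can be sketched as a small pipeline. Everything here is an assumption made for illustration: the function name, the toy dictionaries standing in for a corpus frequency list, a bilingual lexicon and an English word-to-synset index, the synset identifiers, and above all the placeholder score (the fraction of a word's translations that support a synset), which is neither the heuristic nor the Bayesian scoring of the talk.

```python
from collections import defaultdict


def propose_attachments(candidates, corpus_freq, lexicon, synsets_of, min_freq=1):
    """Return (arabic_word, synset, score) tuples, best scored first."""
    # Filter: keep only candidate forms attested in the corpus.
    kept = [w for w in candidates if corpus_freq.get(w, 0) >= min_freq]

    scored = []
    for aw in kept:
        translations = lexicon.get(aw, [])
        # Attach: every synset of every English translation is a candidate.
        votes = defaultdict(int)
        for ew in translations:
            for synset in synsets_of.get(ew, []):
                votes[synset] += 1
        # Score (placeholder): share of translations pointing to the synset.
        for synset, n in votes.items():
            scored.append((aw, synset, n / len(translations)))

    # Best scored first, so a reviewer sees the most likely pairs early.
    scored.sort(key=lambda t: (-t[2], t[0], t[1]))
    return scored


result = propose_attachments(
    candidates=["kAtib", "kAtuwb"],
    corpus_freq={"kAtib": 120},
    lexicon={"kAtib": ["writer", "author"]},
    synsets_of={"writer": ["writer.n.01"],
                "author": ["writer.n.01", "author.n.02"]},
)
```

The final step, manual review, stays outside the code: the sorted list is only a work queue for the lexicographer.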

10
Semi-automatic Extensions of AWN

Resources
PWN
AWN
LOGOS database of conjugated Arabic verbs
NMSU bilingual Arabic-English lexicon
Arabic Gigaword Corpus
UN (2000-2002) bilingual Arabic-English Corpus

11
Semi-automatic Extensions of AWN

Score the reliability of the candidates
build a graph representing the words, synsets and
their associations
associations synset-synset
explicit in WN2.0
path-based
apply a set of heuristic rules that directly use
the structure of the graph (GWC 2008)
apply Bayesian inference (LREC 2008)

12
Using Bayesian Inference
13
Using Bayesian Inference
14
Using Bayesian Inference

Building the CPT for each node in the BN
edges EW-AW
probabilities from statistical translation models
built from the UN corpus using GIZA++ (word-word
probabilities), filtered to avoid pairs having
Arabic expressions with invalid Buckwalter
encodings.
all the probability mass is distributed between
pairs occurring in the BN
other edges (EW-S, S-S)
linear distribution on priors
noisy-OR model
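
Two of these steps lend themselves to a short sketch: redistributing the translation probability mass over the pairs that actually occur in the network, and filling a conditional probability table (CPT) with a noisy-OR model. The function names are hypothetical, and grouping the renormalization by Arabic word is an assumption, since the slide does not say how the mass is grouped. The linear distribution on priors is not covered.

```python
from collections import defaultdict
from itertools import product


def renormalize(probs, bn_pairs):
    """Keep only (ew, aw) pairs present in the BN and rescale them so the
    probabilities for each Arabic word sum to 1 (grouping is an assumption)."""
    totals = defaultdict(float)
    for (ew, aw), p in probs.items():
        if (ew, aw) in bn_pairs:
            totals[aw] += p
    return {(ew, aw): p / totals[aw]
            for (ew, aw), p in probs.items()
            if (ew, aw) in bn_pairs and totals[aw] > 0}


def noisy_or(weights, states):
    """P(child is true) given which parents are active (1) or not (0)."""
    q = 1.0
    for w, on in zip(weights, states):
        if on:
            q *= 1.0 - w  # each active parent independently fails to fire
    return 1.0 - q


def noisy_or_cpt(weights):
    """Full CPT: one entry per combination of parent states."""
    return {states: noisy_or(weights, states)
            for states in product((0, 1), repeat=len(weights))}


cpt = noisy_or_cpt([0.5, 0.5])
scaled = renormalize(
    {("write", "ktb"): 0.2, ("book", "ktb"): 0.2, ("desk", "ktb"): 0.6},
    bn_pairs={("write", "ktb"), ("book", "ktb")},
)
```

A noisy-OR table needs only one weight per parent, instead of one entry per combination of parent states, which keeps nodes with many incoming edges manageable.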

15
Using Bayesian Inference

Performing Bayesian Inference in the BN
Assign probability 1 to nodes in layer 1
Infer the probabilities of nodes in layer 3
For each word in layer 1, select as candidates
the synsets in layer 3 that are connected to it
and whose probability is over a threshold
Score the candidate pair with this probability
Select the candidates scored over a threshold
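
The inference steps above can be approximated with a simple forward pass. This is a sketch under several assumptions, not the exact inference performed in the work: layer 2 is taken to hold English words, each node combines its parents with a noisy-OR while treating them as independent, synset-synset edges are ignored, and a single threshold is used. All names, weights and synset labels are hypothetical.

```python
from collections import defaultdict


def noisy_or(probs):
    """Probability that at least one independent cause fires."""
    q = 1.0
    for p in probs:
        q *= 1.0 - p
    return 1.0 - q


def infer_candidates(aw_to_ew, ew_to_s, threshold):
    """aw_to_ew: {arabic_word: {english_word: weight}}
    ew_to_s: {english_word: {synset: weight}}
    Returns (arabic_word, synset, probability) tuples over the threshold."""
    # Layer 1: every Arabic word is assigned probability 1.
    ew_inputs = defaultdict(list)
    for aw, edges in aw_to_ew.items():
        for ew, w in edges.items():
            ew_inputs[ew].append(w * 1.0)
    p_ew = {ew: noisy_or(ws) for ew, ws in ew_inputs.items()}

    # Layer 3: synset probabilities from the English words above them.
    s_inputs = defaultdict(list)
    for ew, p in p_ew.items():
        for s, w in ew_to_s.get(ew, {}).items():
            s_inputs[s].append(w * p)
    p_s = {s: noisy_or(ws) for s, ws in s_inputs.items()}

    # For each Arabic word, keep connected synsets scored over the threshold.
    selected = []
    for aw, edges in aw_to_ew.items():
        reachable = {s for ew in edges for s in ew_to_s.get(ew, {})}
        for s in sorted(reachable):
            if p_s[s] > threshold:
                selected.append((aw, s, p_s[s]))
    return selected


pairs = infer_candidates(
    aw_to_ew={"kAtib": {"writer": 0.5, "author": 0.5}},
    ew_to_s={"writer": {"S1": 1.0}, "author": {"S1": 1.0, "S2": 0.5}},
    threshold=0.5,
)
```

In this toy network the synset supported by both translations ends up at 0.75 and is kept, while the one supported by a single weak path stays at 0.25 and is dropped.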

16
Empirical Evaluation

10 verbs randomly selected from AWN

17
Empirical Evaluation

Results

18
Conclusions

The BN approach roughly doubles the number of
candidates of the previous HEU approach (554 vs. 272).
The sample is clearly insufficient.
The overlap of HEU and BN candidates seems to
improve the results.
An analysis of the errors shows that a substantial
number were due to the lack of the shadda
diacritic or of the feminine ending form (ta
marbuta, ة).

19
Further work

Repeat the entire procedure relying when possible
on dictionaries containing diacritics
Refine the scoring procedure by assigning
different weights to the different relations.
Include additional relations (e.g. path-based)
Use additional Knowledge Sources for weighting
the relations
related entries already included in AWN
SUMO
Magnini's domain codes