Diapositive 1 - PowerPoint PPT Presentation

A multi-word term extraction program for Arabic language. Siham Boulaknadel, Béatrice Daille and Driss Aboutajdine (LINA, University of Nantes; GSCM_LRIT)

1
A multi-word term extraction program for Arabic
language
LREC 28-30 May 2008 Marrakech
Siham Boulaknadel, Béatrice Daille and Driss Aboutajdine
LINA, University of Nantes; GSCM_LRIT, University of Rabat
2
Outline
  • Multi-word term
  • Motivation
  • Approach
  • Comparing statistical methods
  • Conclusion and future work

3
Terms
  • Refer to a defined concept ... (ISO 704).
  • Belong to a limited number of parts of speech:
    nouns, verbs, adjectives, and adverbs.
  • Belong to a given subject domain.

4
Multi-word terms
  • ????? ?????? ?????????? ????? ????? ??????
    ???????? ???? ??? ?? ????? ??????? ???????
    (Wikipedia)
  • English gloss: nitrogen oxides result from all combustion
    processes taking place at high temperature
  • MWTs extracted
  • ?????? ??????????
  • ????? ??????? ???????
  • ?????? ????????

5
Motivation
  • MWTs are frequent
  • Applications
  • Building indexes from unstructured documents
  • Enhancing document retrieval systems

6
MWT extraction system (concept extraction)
  • Input: corpus
  • Identification of term candidates: linguistic
    filtering (shallow parsing)
  • Filtering of term candidates: statistical
    significance (LLR, FLR, MI3, t-score)
  • Output: candidate list
7
MWT evaluation
  • Unithood measures the strength of association
    between the constituents of a multi-word unit (MWU)
  • Example: "United Nations" (environment domain)
    shows unithood only
  • Termhood measures relatedness to existing
    domain-specific concepts
  • Example: "soil degradation" (environment domain)
    shows both termhood and unithood

8
MWT patterns
Pattern      Sub-pattern   Arabic MWT           English translation
N ADJ        -             ?????? ?????????     Chemical pollution
N1 N2        -             ???? ?????           Water pollution
N1 PREP N2   N1 ? N2       ?????? ???????       Pollution with lead
N1 PREP N2   N1 ? N2       ?????? ???????       Exposure to diseases
N1 PREP N2   N1 ?? N2      ?????? ?? ????????   Waste disposal
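
The pattern table can be read as a filter over part-of-speech tag sequences. The following is a minimal sketch, not the authors' implementation: it assumes Penn-style tags like those in the tagger output on the Preprocessing slide, and the mapping `TAG_TO_SYMBOL`, the function name `extract_candidates`, and the choice to drop the detached article `Al/DT` before matching are all illustrative assumptions.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]

# Assumed mapping from Penn-style tags to the pattern symbols of the table.
TAG_TO_SYMBOL = {"NN": "N", "NNS": "N", "JJ": "ADJ", "IN": "PREP"}

# The three top-level patterns from the table: N ADJ, N1 N2, N1 PREP N2.
PATTERNS = [("N", "ADJ"), ("N", "N"), ("N", "PREP", "N")]


def extract_candidates(tagged: List[Tagged]) -> List[Tuple[str, ...]]:
    """Return word sequences whose tag sequence matches one of PATTERNS."""
    # Drop the detached definite article (Al/DT) so that definite and
    # indefinite forms produce the same candidate (an assumption of this sketch).
    tokens = [(w, TAG_TO_SYMBOL.get(t, "O")) for w, t in tagged if t != "DT"]
    found = []
    for start in range(len(tokens)):
        for pattern in PATTERNS:
            window = tokens[start:start + len(pattern)]
            if tuple(sym for _, sym in window) == pattern:
                found.append(tuple(w for w, _ in window))
    return found


# Tokens taken from the tagger output shown on the Preprocessing slide.
sample = [("Al", "DT"), ("Hkm", "NN"), ("Al", "DT"), ("mjry", "JJ"),
          ("rklp", "NN"), ("jzA'", "NN"), ("SHyHp", "JJ")]
candidates = extract_candidates(sample)
```

Overlapping matches are all kept here; ranking and pruning are left to the statistical filtering step.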
9
MWT variations
  • Multiple forms for the same concept
  • Variation types
  • Inflectional morphology
    • Number: N1 N2 / N1 N2 + suffix (??, ??)
      • ???? ?????? (ocean pollution)
      • ???? ???????? (oceans pollution)
    • Definite form: N ADJ / prefix (??) + N, prefix (??) + ADJ
      • ???? ??????? (chemical pollution)
      • ?????? ????????? (the chemical pollution)
  • Derivational morphosyntactic phenomena
    • N1 ADJ / N1 PREP N2
      • ??? ???? > ??? ?? ????? (oil well)
  • Syntactic variation (postposed modification)
    • N1 N2 / N1 N2 ADJ
      • ???? ??????? (degree of temperature)
      • ???? ??????? ??????? (high degree of temperature)

10
Comparing statistical filtering
  • Mutual Information (MI3) (Daille, 1994) as
    baseline
  • Log-likelihood ratio (LLR) (Dunning, 1993)
  • t-score (Church, 1991)
  • FLR (Nakagawa and Mori, 2003)
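
Three of these measures can be computed from a 2x2 contingency table for a candidate pair (w1, w2). The sketch below uses commonly cited formulations and is not taken from the slides: `a` is the count of the pair, `b` the count of w1 followed by another word, `c` the count of another word followed by w2, and `d` the count of all remaining pairs. The MI3 variant on raw counts is an assumption (other variants normalise by corpus size). FLR is left out because it needs left and right noun-adjacency statistics rather than a contingency table.

```python
import math


def mi3(a: int, b: int, c: int) -> float:
    """Cubic mutual information on raw counts (assumed variant)."""
    return math.log2(a ** 3 / ((a + b) * (a + c)))


def t_score(a: int, b: int, c: int, d: int) -> float:
    """Observed minus expected pair count, scaled by sqrt(observed)."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return (a - expected) / math.sqrt(a)


def llr(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood ratio: 2 * sum(O * ln(O / E)) over the four cells."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```

A pair whose words always occur together, for example `llr(10, 0, 0, 90)`, scores far above a pair at chance level such as `llr(1, 9, 9, 81)`, which scores zero.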

11
Experiment Data
  • Arabic domain-specific corpus on the environment
  • Compiled from the web (Al-Khat Alakhdar, Akhbar
    Albiae), 2004-2006
  • 475,148 words
  • Motivation
  • The unavailability of Arabic domain-specific
    corpora

12
Gold standard
  • Reference list
  • Arabic environment terminology: Agrovoc
  • Total: 65,000 unique known terms (single-word
    terms and MWTs)
  • Dynamic search
  • Eurodicautom

13
Preprocessing
  • Removing diacritics
  • Buckwalter transliteration
  • Diab's parsing (Diab, 2004)
  • Input
  • wlm yHtsb AlHkm Almjry sAndwr bwl rklp jzA' SHyHp
    Avr Erqlp dAxl AlmnTqp mn qbl AlysAndrw.
  • Output
  • w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ
    sAndwr/NNP bwl/NNP rklp/NN jzA'/NN SHyHp/JJ
    Avr/IN Erqlp/NN dAxl/IN Al/DT mnTqp/NN mn/IN
    qbl/NN Al/DT ysAndrw/NNP ./PUNC
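
Two of these steps are easy to sketch. The code below is illustrative only and assumes that diacritics are stripped from the Arabic script before transliteration, that the diacritics in question are the marks U+064B to U+0652, and that the tagger output uses the `word/TAG` format shown above. The function names are hypothetical.

```python
import re
from typing import List, Tuple

# Arabic short-vowel and related marks (fathatan through sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, leaving the base letters."""
    return DIACRITICS.sub("", text)


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last '/', so a word containing '/' survives.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN ./PUNC")
```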

14
Evaluation and results
Measure    Precision (%)
LLR        85
FLR        60
MI3        26
t-score    57
  • For each association score
  • Examine the top-ranked term candidates
  • Compute precision (termhood) over the first 100
    term candidates
  • Precision (termhood) is the ratio of attested MWTs
    to all extracted sequences.
  • Log-likelihood is the best measure
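
The precision computation described above can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes the candidates are already sorted by decreasing association score and that the reference list is available as a set of strings.

```python
from typing import Iterable, List


def precision_at_k(ranked: List[str], reference: Iterable[str], k: int = 100) -> float:
    """Share of the top-k ranked candidates that are attested in the reference list."""
    attested_terms = set(reference)
    top = ranked[:k]
    if not top:
        return 0.0
    attested = sum(1 for term in top if term in attested_terms)
    return attested / len(top)


# Placeholder candidates and reference terms, for illustration only.
score = precision_at_k(["a b", "c d", "e f", "g h"], {"a b", "e f", "x y"}, k=4)
```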

LLR list           English translation
???? ???????       Acidity degree
?????? ???????     Dioxide
?????? ???????     Waste water
???? ???????       Sewage
?????? ???????     Climate change
?????? ??????      Nervous system
Reference lists: Agrovoc, Eurodicautom
15
Summary and future work
  • Developed MWT extraction for Arabic
  • Defined MWT patterns and variations
  • Obtained better results than for European languages
  • Improvement of the system
  • Adding new variations
  • Improving lemmatisation

16
Introduction
  • MWTs are sufficiently informative to help human
    readers get a feel of the essential topics
  • Used in many text-related applications
  • Text clustering
  • Document similarity
  • Document summarization

17
Related Work
  • Linguistic Approach
  • Based on linguistic pre-processing and
    annotations (result of taggers, shallow parsers)
  • Detect recurrent syntactic term formation
    patterns
  • Noun Noun
  • (Adj Noun) Noun

18
Systems based on linguistics
  • Ananiadou, S. (1994) recognises single-word terms
    from the domain of immunology, based on
    morphological analysis of term formation patterns
    (internal term make-up)
  • Justeson & Katz (1995, TERMS) extract complex
    terms based on two characteristics that
    distinguish them from non-terms:
  • the syntactic patterns are restricted
  • terms appear with the same form throughout the
    text; omissions of modifiers are avoided

19
Systems based on linguistics
  • The text is tagged, then a filter is applied to
    extract terms
  • ((A | N)+ | ((A | N)* (N P)?) (A | N)*) N
  • AN / NN / AAN / ANN / NAN / NNN / NPN
  • Filtering is based on a simple POS pattern
  • A pattern must occur above a certain threshold to
    be considered a valid term pattern.
  • Recall: 71%; precision: 71% to 96%
  • LEXTER (Bourigault, 1994)
  • Extracts French compound terms based on surface
    syntactic analysis and text heuristics
  • Terms are identified according to certain
    syntactic patterns

20
  • Uses a boundary method to identify the extent of
    terms
  • Categories or sequences of categories that are
    not found in term patterns form the boundaries,
    e.g. verbs, or any preposition (except de and à)
    followed by a determiner. Non-productive
    sequences become boundaries.
  • Precision: 95%, although tests have shown that
    a lot of noise is generated

21
Approaches using statistical information
  • Main measures used
  • Frequency of occurrence
  • Mutual Information
  • C/NC value
  • Experiments also with the log-likelihood
    coefficient (Dunning, 1993)

22
Frequency of occurrence
  • Simplest and most popular method: domain
    independent, requires no external resources
  • Some filtering is used in the form of syntactic
    patterns
  • Systems using frequency of occurrence
  • Dagan & Church (TERMIGHT, 1994)
  • Enguehard & Pantera (1994)
  • Lauriston (TERMINO, 1996)

23
Mutual Information
  • The amount of information provided by the
    occurrence of the event represented by $y_i$ about
    the occurrence of the event represented by $x_k$ is
    defined as
  • $I(x_k, y_i) = \log \frac{P(x_k, y_i)}{P(x_k)\,P(y_i)}$
    (Fano, 1961: 27-28)
  • This measure reflects how much one word tells us
    about the other.
  • Problems for MI come from data sparseness
  • Damerau (1993) and Daille (1994) used MI for the
    extraction of candidate terms (only for two-word
    candidate terms)
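
A small numerical sketch makes the data-sparseness problem visible. It assumes maximum-likelihood probability estimates from raw counts and a base-2 logarithm; the function name and the counts are illustrative, not drawn from the corpus.

```python
import math


def pmi(pair_count: int, x_count: int, y_count: int, total: int) -> float:
    """log2( P(x, y) / (P(x) * P(y)) ) with probabilities estimated from counts."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


# Two words seen once each, and only together, in a million pairs ...
rare = pmi(1, 1, 1, 10**6)
# ... outrank a well-attested pair observed 100 times.
frequent = pmi(100, 200, 200, 10**6)
```

The pair seen once receives the higher score, so low-frequency candidates dominate the top of the ranking unless a frequency threshold or a corrected measure is used.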

24
C/NC value (Frantzi & Ananiadou)
  • C-value is computed from:
  • total frequency of occurrence of string in
    corpus
  • frequency of string as part of longer candidate
    terms
  • number of these longer candidate terms
  • length of string (in number of words)
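
The slide lists only the inputs. The sketch below combines them using the C-value formula as it is commonly given for Frantzi and Ananiadou's method: log2 of the length times the frequency for a string that is not nested, and log2 of the length times the frequency minus the mean frequency of the longer candidates that contain it otherwise. The function signature and the example terms and counts are hypothetical.

```python
import math
from typing import Dict, List


def c_value(term: str, freq: Dict[str, int], longer: List[str]) -> float:
    """C-value of `term`, given the longer candidate terms that contain it."""
    length = len(term.split())        # length of string in words
    f = freq[term]                    # total frequency in the corpus
    if not longer:                    # not nested in any longer candidate
        return math.log2(length) * f
    nested = sum(freq[t] for t in longer)   # frequency as part of longer terms
    return math.log2(length) * (f - nested / len(longer))


# Placeholder frequencies, for illustration only.
freq = {"water pollution": 10, "ground water pollution": 4, "sea water pollution": 2}
nested_score = c_value("water pollution", freq,
                       ["ground water pollution", "sea water pollution"])
```

Because of the log2 factor, a single-word string always scores zero under this form of the measure, which suits multi-word term extraction.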

25
NC value
  • NC-value(a) = 0.8 × C-value(a) + 0.2 × CF(a)
  • a is the candidate term,
  • C-value(a) is the C-value for the candidate term
    a,
  • CF(a) is the context factor for the candidate
    term a
  • CF(a) is obtained by summing the weights of the
    term context words of a, each multiplied by its
    frequency of co-occurrence with a.
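
The combination can be sketched directly from these definitions. How the context-word weights are obtained is not stated on the slide, so the sketch takes them as given; the function names, words, weights and counts are placeholders.

```python
from typing import Dict


def context_factor(context_counts: Dict[str, int], weights: Dict[str, float]) -> float:
    """Sum over context words of (weight * frequency with the candidate term)."""
    # Words without a weight are not term context words and contribute nothing.
    return sum(count * weights.get(word, 0.0) for word, count in context_counts.items())


def nc_value(c_val: float, cf: float) -> float:
    """NC-value(a) = 0.8 * C-value(a) + 0.2 * CF(a)."""
    return 0.8 * c_val + 0.2 * cf


# Placeholder context words and weights, for illustration only.
cf = context_factor({"reduce": 3, "measure": 2, "the": 5},
                    {"reduce": 0.5, "measure": 0.25})
score = nc_value(7.0, cf)
```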

26
Hybrid approaches
  • Combination of linguistic information (filters),
    shallow parsing results and statistical measures
  • Daille, B.; Frantzi & Ananiadou

27
  • Thank You