Transcript and Presenter's Notes

Title: AUKBC Research Centre


1
AU-KBC Research Centre
Innovation Through Research
  • Madras Institute of Technology,
  • Anna University, Chennai
  • www.au-kbc.org

2
Pronominal Resolution in Tamil Using Machine
Learning Techniques
  • Narayana Murthy K, Sobha L, Muthukumari B

3
Pronominal Resolution in Tamil Using Machine
Learning Techniques
  • What is Anaphora?
  • What are the types of Anaphora?
  • Why is Anaphora Resolution required?
  • About the Tamil Language
  • Pronominal Resolution in Tamil
  • The Salience Factor Approach
  • The Machine Learning Approach

4
What is Anaphora?
  • Any entity that refers back is called an Anaphor
  • Anaphora, in discourse, is a device for making an
    abbreviated reference (containing fewer bits of
    disambiguating information, rather than being
    lexically or phonetically shorter) to some entity
    (or entities) in the expectation that the
    receiver of the discourse will be able to
    disabbreviate the reference and, thereby,
    determine the identity of the entity. (Hirst
    1981)
  • e.g. The Prime Minister is yet to arrive and he
    is expected at the central hall at any time.
    (The Times of India, Feb 2001)

5
Types of Anaphora
  • Possessive, Reflexive, Demonstrative and
    Relative Pronouns
  • he, him, she, her, it, they, them - 3rd person
    personal
  • his, her, hers, its, their, theirs - possessive
  • himself, herself, itself, themselves - reflexive
  • this, that, these, those - demonstrative
  • who, whom, which, whose - relative
  • What is a pronominal anaphor?
  • Vajpayee hit back forcefully when he told the
    opposition today, "sometimes we fall prey to the
    media and sometimes you do". (Indian Express,
    2001)

6
About Tamil Language
  • Tamil belongs to the South Dravidian family of
    languages.
  • It is a verb-final language and allows
    scrambling.
  • It has postpositions.
  • The genitive precedes the head noun in the
    genitive phrase.
  • The complementizer follows the embedded clause.
  • Adjectives, participial adjectives and free
    relatives precede the head noun.
  • It is a nominative-accusative language like the
    other Dravidian languages.

7
Tamil (contd.)
  • The subject of a Tamil sentence is mostly
    nominative.
  • Certain verbs require dative subjects.
  • Possessive noun phrases can occur as subjects.
  • There is person-number-gender (png) agreement
    between the subject and the verb.

   
8
The Pronominals in Tamil
9
Examples
  • mohan avanutaya kulantayai kantan enRu raman
    connar.
    mohan he(poss) child(acc) see(pst) compl raman
    say(pst)
    (Raman said that Mohan saw his child.)
  • sita avalai atittal enRu kavita connar.
    sita she(acc) beat(pst) compl kavita say(pst)
    (Kavita said that Sita hit her.)

10
Salience Factor Approach
  • The Lappin and Leass (1994) anaphora resolution
    algorithm uses salience weights in determining
    the antecedents of pronominals.
  • It requires a fully parsed sentence structure as
    input and uses the syntactic hierarchy in
    identifying the subject, object, etc.
  • The algorithm uses syntactic criteria to rule
    out noun phrases that cannot possibly corefer
    with the pronoun. The antecedent is then chosen
    according to a ranking based on salience
    weights.

11
Salience Weights
12
How salience weight is calculated
  • The candidates which pass the png agreement test
    are ranked according to their salience; a
    salience value is then calculated. Consider the
    sentence "Mary saw Bill"; the salience of Mary
    is
  • Mary: Wsent + Wsubj + Whead + Wnonadv
    = 100 + 80 + 80 + 50 = 310
  • The candidate with the highest salience is taken
    as the antecedent of the pronominal.
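The calculation above can be sketched as follows. The factor names and weights are the Lappin and Leass (1994) values quoted on the slide; the encoding of a candidate as a list of factors is an illustrative assumption, not the authors' implementation.

```python
# Salience factors and their weights, as listed on the slide.
WEIGHTS = {
    "sentence_recency": 100,  # candidate is in the current sentence
    "subject": 80,            # candidate is the grammatical subject
    "head_noun": 80,          # candidate is a head noun
    "non_adverbial": 50,      # candidate is not inside an adverbial phrase
}

def salience(factors):
    """Sum the weights of all factors that hold for a candidate NP."""
    return sum(WEIGHTS[f] for f in factors)

# "Mary saw Bill": Mary satisfies all four factors above.
print(salience(["sentence_recency", "subject", "head_noun", "non_adverbial"]))
# 310
```

Note that the four weights listed sum to 310, which is the value a ranker would compare against other candidates' scores.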


13
Salience Weights for Tamil
  • Deviates substantially from the Lappin and
    Leass approach
  • No in-depth parsing
  • Exploits the rich morphology of the language
  • The analysis depends on the salience weight of
    each candidate NP for the antecedent-hood of an
    anaphor, chosen from a list of probable
    candidates.

14
Salience factor weights for Tamil
15
Definitions of the factors
16
How it works
  • The salience weight of an NP is assigned in the
    following way:
  • Identify the pronoun.
  • Consider the four sentences above the sentence
    containing the pronoun.
  • All the NPs preceding the pronoun are considered
    possible candidates (this is the general rule).
  • Here we also take some NPs which follow the
    pronoun, since Tamil (like all Indian languages)
    has relatively free word order.
  • Assign salience weights.
  • The NP which gets the maximum salience weight
    and agrees in png with the anaphor is taken as
    the antecedent of the anaphor.
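The selection rule at the end of the steps above can be sketched as follows. The png codes and candidate dictionaries are invented for illustration, and the windowing and weight-assignment steps are assumed to have already produced each candidate's salience.

```python
def resolve(pronoun, candidates):
    """Pick the candidate NP with maximum salience that agrees in png.

    pronoun: dict with a 'png' code; candidates: dicts with 'form',
    'png' and a precomputed 'salience'. Returns None if nothing agrees.
    """
    agreeing = [c for c in candidates if c["png"] == pronoun["png"]]
    return max(agreeing, key=lambda c: c["salience"]) if agreeing else None

# Hypothetical candidates from the window around the pronoun.
pronoun = {"form": "avan", "png": "3SM"}  # 'he' (3rd sg. masculine)
candidates = [
    {"form": "raman", "png": "3SM", "salience": 310},
    {"form": "sita",  "png": "3SF", "salience": 260},
    {"form": "mohan", "png": "3SM", "salience": 210},
]
print(resolve(pronoun, candidates)["form"])  # raman
```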

17
The System
Input → POS Tagger → NP Chunker → Clause Identifier
→ Salience Factor Assigner → Output
18
The Pre-Processors
  • oru aalaaka oru velaiyai ceyarpatuttuvataivita
    palarum kuuti atai niraiverruvatanaal payanum
    niramba untu
  • one person(adv) one task(acc) enforce(vbnaccpsp)
    many_people(inc) gather it(acc) perform(vbnins)
    benefit(inc) lot have
  • (Instead of enforcing a task as a single person,
    performing it by many people together gives a
    lot of benefit.)

19
Pre-processing (contd.)
Morphological Analyser:
oru/Q aalaaka/CNADV oru/Q velaiyai/CN3SNACC
ceyarpatuttuvataivita/CN3SNACCDRDPSP palarum/CN3PNINC
kuuti/CNVVBP atai/PR3SNACC niraiverruvatanaal/VCND
payanum/VPST niramba/ADV untu/V3SN
POS Tagger:
oru/Q aalaaka/ADV oru/Q velaiyai/CNACC
ceyarpatuttuvataivita/CN palarum/CN kuuti/CN
atai/PRACC niraiverruvatanaal/CN payanum/VPST
niramba/ADV untu/V
20
Pre-processing (contd.)
NP Chunker:
<NP> oru/Q aalaaka/CN3SNADV </NP>
<NP> oru/Q velaiyai/CN3SNACC <ANTECEDE1,2> </NP>
<NP> ceyarpatuttuvataivita/CN3SNACCDRDPSP </NP>
<NP> palarum/CN3PNINC </NP>
<VP> kuuti/VVBP </VP>
<NP> atai/PR3SNACC <ANAPHOR1> </NP>
<VP> niraiverruvatanaal/VCND </VP>
<NP> payanum/CN3SNINC </NP>
<VP> niramba/ADV </VP>
<VP> untu/V3SN </VP>
Clause Marking:
<SCL1> <NP> oru/Q aalaaka/CN3SNADV </NP>
<NP> oru/Q velaiyai/CN3SNACC <ANTECEDE1,2> </NP>
<NP> ceyarpatuttuvataivita/CN3SNACCDRDPSP </NP>
<NP> palarum/CN3PNINC </NP> <VP> kuuti/VVBP </VP>
<NP> atai/PR3SNACC <ANAPHOR1> </NP>
<VP> niraiverruvatanaal/VCND </VP> </SCL1>
<MCL> <NP> payanum/CN3SNINC </NP>
<VP> niramba/ADV </VP> <VP> untu/V3SN </VP>
PNC </MCL>
21
The Result
  • Total number of sentences: 900
  • Total pronouns taken: 233
  • Number of possible NP candidates: 4391
  • Number of correct antecedents identified: 170
  • Number of wrong antecedents identified: 63
  • Precision: 170/223 × 100 = 76.2%

22
Machine Learning Approach
  • Multiple Linear Regression is used as a
    two-class classifier.
  • The factors, called features here, are treated
    as binary; for example, whether the candidate NP
    is in the current sentence or not.
  • During training, the weights of these features
    are computed using regression; the weights
    minimize the squared error.
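A minimal sketch of the binary feature encoding described above; the feature names are hypothetical stand-ins for the salience factors used as features, not the authors' exact set.

```python
# Hypothetical binary features for a candidate NP.
FEATURES = ["current_sentence", "current_clause", "subject", "png_agreement"]

def encode(candidate):
    """Map a candidate NP (a dict of booleans) to a 0/1 feature vector."""
    return [1 if candidate.get(f, False) else 0 for f in FEATURES]

candidate = {"current_sentence": True, "subject": True, "png_agreement": True}
print(encode(candidate))  # [1, 0, 1, 1]
```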

23
MLR
  • Multiple Linear Regression
  • Regression analysis: a statistical technique for
    investigating and modeling the relationship
    between variables in a system.
  • Suitable for a small number of features.
  • Suppose there are k features. Let x_ij denote
    the i-th observation of feature x_j, where
    i = 1,2,...,n and j = 1,2,...,k. Let y_i be the
    i-th observed value of the decision variable.
    Then
  • y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + ε_i

24
MLR Continued
  • where β_j, j = 1,2,...,k are regression
    coefficients and ε_i, i = 1,2,...,n are called
    error terms or residuals.
  • In matrix notation,
  • Y = Xβ + ε
  • where Y is an n×1 vector of observations, X is
    the n×p matrix of feature values, where
    p = k+1, β is the p×1 vector of parameters and
    ε is the n×1 vector of error terms.

25
MLR Continued
  • We estimate the values of the parameters β
    using the least squares method.
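The least-squares step can be sketched via the usual normal-equations estimate, β̂ = (XᵀX)⁻¹XᵀY. The tiny data set below (two binary features plus an intercept column, symmetric +1/-1 targets as described on the training slide) is synthetic, not the paper's data.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def fit(X, y):
    """Normal equations: solve (X^T X) beta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(r * yi for r, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)

# Design matrix: intercept, feature 1, feature 2 (p = k + 1 = 3).
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0]]
y = [1.0, -1.0, -1.0, -1.0]  # symmetric targets for the two classes
beta = fit(X, y)
print(beta)  # [-1.5, 1.0, 1.0]
```

On this data the fitted values have a positive sign only for the NP with both features set, matching the sign-based decision rule used at test time.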

26
Training
  • The training data consists of candidate NPs,
    the binary feature values and the target, that
    is, whether the NP actually is an antecedent or
    not.
  • The weights show the relative significance of
    the various features given the training data.
  • The stability of the weights can be checked by
    training on different sets of data.
  • The target value (y) is assigned symmetrical
    values for the two cases.

27
Testing
  • During testing, the target value is computed and
    its sign is checked to decide whether the
    candidate is an antecedent or not.
  • Performance can be measured in terms of
    accuracy.
  • By introducing a threshold for the unknown case,
    we can trade recall for precision.
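The decision rule above can be sketched as follows; the threshold value and scores are illustrative, not taken from the experiments.

```python
def classify(score, threshold=0.0):
    """Decide from a regression score: +1 antecedent, -1 not, 0 unknown.

    Scores within [-threshold, threshold] are abstained on; widening
    this band trades recall for precision.
    """
    if abs(score) <= threshold:
        return 0
    return 1 if score > 0 else -1

print(classify(0.5))                   # 1
print(classify(-0.4))                  # -1
print(classify(0.1, threshold=0.25))   # 0
```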

28
Experiments and Results
  • Initially, it was observed that the training
    data was highly imbalanced: about 95% of the
    candidate NPs were not antecedents. A blind
    guess can give us 90%-plus overall accuracy
    although no antecedent will be correctly
    recognized!
  • When trained and tested on 1800 data items,
    overall accuracy was 81.98% but only 39.3% of
    the antecedents were identified.

29
Experiments and Results
  • Whenever the NP is in the current clause, it is
    also in the current sentence, although the
    converse is not true. The current-sentence
    factor was largely redundant in the training
    data, and by removing this feature, 55.7% of
    the antecedents could be identified correctly.

30
Experiments and Results
  • By training on somewhat more balanced training
    data, 143 out of 168, that is, 85.1%, of the
    antecedents could be correctly recognized on
    test data of 768 items.
  • There is scope for precision-recall trade-off,
    sensitivity analysis, dimensionality reduction,
    etc.

31
Conclusions
  • The machine learning approach holds promise.
  • Further experimentation with sensitivity
    analysis of weights is planned.
  • Dimensionality reduction by eliminating the
    least contributing features may actually boost
    performance, as noise will be reduced.
  • The system can be used to develop larger tagged
    data.

32
Thank You