AUKBC Research Centre - PowerPoint PPT Presentation

1
AU-KBC Research Centre
Innovation Through Research

Madras Institute of Technology,
Anna University, Chennai
www.au-kbc.org

2
Pronominal Resolution in Tamil Using Machine
Learning techniques

Narayana Murthy K, Sobha L, Muthukumari B

3
Pronominal Resolution in Tamil Using Machine
Learning techniques

What is anaphora?
What are the types of anaphora?
Why is anaphora resolution required?
About the Tamil language
Pronominal resolution in Tamil
The salience factor approach
The machine learning approach

4
What is Anaphora?

Any entity that refers back is called an Anaphor
Anaphora, in discourse, is a device for making an
abbreviated reference (containing fewer bits of
disambiguating information, rather than being
lexically or phonetically shorter) to some entity
(or entities) in the expectation that the
receiver of the discourse will be able to
disabbreviate the reference and, thereby,
determine the identity of the entity. (Hirst
1981)
e.g. "The Prime Minister is yet to arrive and he is
expected at the central hall at any time." (The
Times of India, Feb 2001)

5
Types of Anaphora

Personal, Possessive, Reflexive, Demonstrative and
Relative Pronouns
he, him, she, her, it, they, them: 3rd person
personal
his, her, hers, its, their, theirs: possessive
himself, herself, itself, themselves: reflexive
this, that, these, those: demonstrative
who, whom, which, whose: relative
What is a pronominal anaphor?
e.g. Vajpayee hits back forcefully when he told the
opposition today "sometimes we fall prey to the
media and sometimes you do." (Indian Express, 2001)

6
About Tamil Language

Tamil belongs to the South Dravidian family of
languages.
It is a verb-final language and allows scrambling.
It has postpositions.
The genitive precedes the head noun in the
genitive phrase.
The complementizer follows the embedded clause.
Adjectives, participial adjectives and free
relatives precede the head noun.
It is a nominative-accusative language like the
other Dravidian languages.

7
Tamil (continued)

The subject of a Tamil sentence is mostly
nominative.
Certain verbs require dative subjects.
Possessive noun phrases can also occur as subjects.
There is person-number-gender (png) agreement
between the subject and the verb.

8
The Pronominals in Tamil
9
Examples

mohan avanutaya kulantayai kantan enRu raman connar.
mohan he(poss) child(acc) see(pst) compl raman say(pst).
(Raman said that Mohan saw his child.)

sita avalai atittal enRu kavita connar.
sita she(acc) beat(pst) compl kavita say(pst).
(Kavita said that Sita hit her.)

10
Salience Factor Approach

Lappin and Leass (1994) Anaphora Resolution
Algorithm
The Lappin and Leass (1994) anaphora resolution
algorithm uses salience weights to determine
the antecedent of a pronominal.
It requires as input a fully parsed sentence
structure and uses the syntactic hierarchy to
identify the subject, object, etc.
The algorithm uses syntactic criteria to rule
out noun phrases that cannot possibly corefer
with the pronoun. The antecedent is then chosen
from the remaining candidates according to a
ranking based on salience weights.

11
Salience Weights
12
How salience weight is calculated

The candidates which pass the png agreement check
are ranked according to their salience. The salience
value of a candidate is the sum of the weights of
the factors that apply to it. Consider the sentence
"Mary saw Bill". The salience of Mary is
$W_{sent} + W_{subj} + W_{head} + W_{nonadv} = 100 + 80 + 80 + 50 = 310$
The candidate with the highest salience is
considered as the antecedent for the pronominal.
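
The arithmetic above can be sketched in a few lines of Python. The four weights for Mary are the ones named on the slide. The accusative weight used for Bill (50) does not appear in this transcript; it is taken from the published Lappin and Leass (1994) weight table and is included only to show the ranking step. The factor names are labels chosen for this sketch.

```python
# Salience weights. The first four are the ones listed on the slide;
# "accusative" is an assumption taken from the published Lappin and
# Leass (1994) table, not from this transcript.
WEIGHTS = {
    "sentence_recency": 100,  # Wsent
    "subject": 80,            # Wsubj
    "head_noun": 80,          # Whead
    "non_adverbial": 50,      # Wnonadv
    "accusative": 50,         # direct object (assumed, see above)
}

def salience(factors):
    """Sum the weights of the salience factors that apply to a candidate."""
    return sum(WEIGHTS[name] for name in factors)

# Candidates for the sentence "Mary saw Bill".
candidates = {
    "Mary": ["sentence_recency", "subject", "head_noun", "non_adverbial"],
    "Bill": ["sentence_recency", "accusative", "head_noun", "non_adverbial"],
}

scores = {np: salience(factors) for np, factors in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Under these weights the subject outranks the object, so a following pronoun that agrees with both would be resolved to Mary.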

13
Salience Weights for Tamil

The approach for Tamil deviates substantially from
the Lappin and Leass approach:
No in-depth parsing
Exploits the rich morphology of the language
The analysis depends on the salience weight of
each candidate NP, taken from a list of probable
candidates, for being the antecedent of an anaphor.

14
Salience factor weights for Tamil
15
Definitions of the factors
16
How it works

The salience weight is assigned to an NP in the
following way:
Identify the pronoun.
Consider the four sentences above the sentence
containing the pronoun.
All the NPs preceding the pronoun are
considered as possible candidates (this is the
general rule).
Here we also take some NPs which follow the
pronoun, since Tamil (like all Indian languages)
has relatively free word order.
Assign salience weights.
The NP which gets the maximum salience weight and
agrees in png with the anaphor is considered as
the antecedent to the anaphor.
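
The steps above can be sketched as follows. This is a minimal illustration, not the authors' system: the weight values and factor names are hypothetical (the Tamil weight table is not part of this transcript), the `Candidate` class is invented for the example, and every NP in the pronoun's own sentence is admitted as a candidate as a simplification of "some NPs which follow the pronoun". The png tags are read off the morphological tags shown in the pre-processing slides.

```python
from dataclasses import dataclass

# HYPOTHETICAL factor weights, for illustration only; the actual Tamil
# weight table from the slides is not part of this transcript.
WEIGHTS = {"current_sentence": 50, "nominative": 40, "accusative": 30}

@dataclass
class Candidate:
    text: str
    sentence: int   # index of the sentence containing the NP
    png: str        # person-number-gender tag, e.g. "3SN"
    factors: tuple  # names of the salience factors that hold for this NP

def salience(cand):
    """Sum the weights of the factors that hold for a candidate NP."""
    return sum(WEIGHTS[name] for name in cand.factors)

def resolve(pronoun_sentence, pronoun_png, candidates, window=4):
    """Return the png-agreeing NP with the highest salience, or None."""
    # Keep NPs from the pronoun's sentence and the `window` sentences above it.
    in_window = [c for c in candidates
                 if pronoun_sentence - window <= c.sentence <= pronoun_sentence]
    # Keep only NPs that agree with the pronoun in person, number and gender.
    agreeing = [c for c in in_window if c.png == pronoun_png]
    return max(agreeing, key=salience, default=None)

# NPs from the example sentence in the pre-processing slides; the pronoun
# is "atai" (it, 3SN). Factor assignments here are assumptions.
nps = [
    Candidate("aalaaka", 0, "3SN", ("current_sentence",)),
    Candidate("velaiyai", 0, "3SN", ("current_sentence", "accusative")),
    Candidate("palarum", 0, "3PN", ("current_sentence", "nominative")),
]
antecedent = resolve(0, "3SN", nps)
print(antecedent.text)
```

With these made-up weights, "palarum" has the highest raw score but is removed by the png check, so "velaiyai" is selected, which matches the antecedent marked in the annotated example.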

17
The System
(Diagram) Input -> POS Tagger, NP Chunker, Clause Identifier -> Salience Factor Assigner -> Output
18
The Pre Processors

oru aalaaka oru velaiyai ceyarpatuttuvataivita palarum
one person(adv) one task(acc) enforce(vbnaccpsp) many_people(inc)
kuuti atai niraiverruvatanaal payanum niramba untu
gather it(acc) perform(vbnins) benefit(inc) lot have.
(Instead of one person carrying out a task alone,
many people performing it together gives a lot
of benefit.)

19
Pre-processing (continued)

Morphological Analyser:
oru/Q aalaaka/CNADV oru/Q velaiyai/CN3SNACC
ceyarpatuttuvataivita/CN3SNACCDRDPSP palarum/CN3PNINC
kuuti/CNVVBP atai/PR3SNACC niraiverruvatanaal/VCND
payanum/VPST niramba/ADV untu/V3SN

POS Tagger:
oru/Q aalaaka/ADV oru/Q velaiyai/CNACC
ceyarpatuttuvataivita/CN palarum/CN kuuti/CN atai/PRACC
niraiverruvatanaal/CN payanum/VPST niramba/ADV untu/V
20
Pre-processing (continued)

NP Chunker:
<NP> oru/Q aalaaka/CN3SNADV </NP>
<NP> oru/Q velaiyai/CN3SNACC <ANTECEDE1,2> </NP>
<NP> ceyarpatuttuvataivita/CN3SNACCDRDPSP </NP>
<NP> palarum/CN3PNINC </NP>
<VP> kuuti/VVBP </VP>
<NP> atai/PR3SNACC <ANAPHOR1> </NP>
<VP> niraiverruvatanaal/VCND </VP>
<NP> payanum/CN3SNINC </NP>
<VP> niramba/ADV </VP>
<VP> untu/V3SN </VP>

Clause Marking:
<SCL1>
<NP> oru/Q aalaaka/CN3SNADV </NP>
<NP> oru/Q velaiyai/CN3SNACC <ANTECEDE1,2> </NP>
<NP> ceyarpatuttuvataivita/CN3SNACCDRDPSP </NP>
<NP> palarum/CN3PNINC </NP>
<VP> kuuti/VVBP </VP>
<NP> atai/PR3SNACC <ANAPHOR1> </NP>
<VP> niraiverruvatanaal/VCND </VP>
</SCL1>
<MCL>
<NP> payanum/CN3SNINC </NP>
<VP> niramba/ADV </VP>
<VP> untu/V3SN </VP> PNC
</MCL>
21
The Result

Total number of sentences: 900
Total pronouns taken: 233
Number of possible NP candidates: 4391
Number of correct antecedents identified: 170
Number of wrong antecedents identified: 63
Precision: 170/223 x 100 = 76.2%

22
Machine Learning Approach

Multiple Linear Regression is used as a two-class
classifier.
The factors, called features here, are treated as
binary: for example, the candidate NP is either in
the current sentence or not.
During training, the weights of these features
are computed using regression; the weights
minimize the squared error.

23
MLR

Multiple Linear Regression
Regression analysis:
A statistical technique for investigating and
modeling the relationship between variables in
a system.
Suitable for a small number of features.
Suppose there are k features. Let $x_{ij}$
denote the i-th observation of feature $x_j$, where
$i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, k$. Let $y_i$ be the i-th
observed value of the decision variable.
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$

24
MLR Continued

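where $\beta_0$ is the intercept, $\beta_j$, $j = 1, 2, \ldots, k$, are the
regression coefficients and $\varepsilon_i$, $i = 1, 2, \ldots, n$, are called
error terms or residuals.
In matrix notation,
$y = X\beta + \varepsilon$
where $y$ is an $n \times 1$ vector of observations, $X$ is
the $n \times p$ matrix of feature values with
$p = k + 1$ (its first column is all ones, for the
intercept), $\beta$ is the $p \times 1$ vector of parameters
and $\varepsilon$ is the $n \times 1$ vector of error terms.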

25
MLR Continued

We estimate the values of the parameters $\beta$
using the least squares method.
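
The formula that presumably accompanied this slide is not in the transcript. For reference, the standard least squares derivation in the notation of the previous slide is sketched below; it is the textbook result, not a reconstruction of the slide itself.

```latex
\begin{align*}
S(\beta) &= (y - X\beta)^{\top}(y - X\beta) \\
\frac{\partial S}{\partial \beta} &= -2\,X^{\top}(y - X\beta) = 0 \\
X^{\top}X\,\hat{\beta} &= X^{\top}y \\
\hat{\beta} &= (X^{\top}X)^{-1}X^{\top}y
  \quad \text{(when } X^{\top}X \text{ is invertible)}
\end{align*}
```

The invertibility condition fails when one feature column is a linear combination of the others, and the estimates become unstable when features are nearly redundant.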

26
Training

The training data consists of candidate NPs, their
binary feature values and the target, that is, whether
the candidate is actually an antecedent or not.
The weights show the relative significance of the
various features given the training data.
Stability of the weights can be checked by
training on different sets of data.
The target value (y) is assigned symmetrical values
for the two cases.
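
A minimal sketch of this training step, in pure Python, is given below. It solves the normal equations by Gauss-Jordan elimination. The targets +1 and -1 are an assumption (the slides only say the values are symmetrical), and the two feature names and the four training rows are invented toy data.

```python
def fit_least_squares(X, y):
    """Least squares weights from the normal equations (X^T X) b = X^T y."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: features are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]

# Toy training data (invented). Columns: constant 1 for the intercept,
# in_current_clause, is_nominative. Target: +1 = antecedent, -1 = not.
X = [[1, 1, 0],
     [1, 0, 0],
     [1, 1, 1],
     [1, 0, 1]]
y = [+1, -1, +1, -1]

weights = fit_least_squares(X, y)
print(weights)
```

In this toy set the target depends only on the first feature, so the fitted weight for the second feature comes out at about zero, which is the sense in which the weights show relative significance.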

27
Testing

During testing, the target value is computed and
its sign is checked to decide whether the candidate
is an antecedent or not.
Performance can be measured in terms of accuracy.
By introducing a threshold for the unknown case,
we can trade recall for precision.
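
The decision rule can be sketched as below, assuming targets of +1 (antecedent) and -1 (not an antecedent) during training. The weight values are hypothetical, and the symmetric band around zero is one possible way to realise the "unknown" case; the slides do not say how the threshold was applied.

```python
def classify(weights, features, threshold=0.0):
    """Classify a candidate NP from the sign of the regression output.

    weights[0] is the intercept; the rest pair up with the binary features.
    Scores inside [-threshold, +threshold] are reported as "unknown".
    """
    score = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    if score > threshold:
        return "antecedent"
    if score < -threshold:
        return "not antecedent"
    return "unknown"

# HYPOTHETICAL weights: intercept, then two binary features.
weights = [-0.6, 1.0, 0.4]

print(classify(weights, (1, 1)))                 # clearly positive score
print(classify(weights, (0, 1)))                 # weakly negative score
print(classify(weights, (0, 1), threshold=0.3))  # falls in the unknown band
```

Raising the threshold leaves more candidates undecided, so fewer antecedents are claimed but those that are claimed carry stronger evidence.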

28
Experiments and Results

Initially, it was observed that the training data
was highly imbalanced: about 95% of the candidate
NPs were not antecedents. A blind guess can give
90%-plus overall accuracy although no antecedent
will be correctly recognized!
When trained and tested on 1800 data items,
overall accuracy was 81.98% but only 39.3% of the
antecedents were identified.
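
The effect of the imbalance can be checked with a small calculation. The sketch below uses an invented set of 100 candidates with the 95:5 split mentioned above and a classifier that always answers "not an antecedent".

```python
def evaluate(gold, predicted):
    """Return (overall accuracy, fraction of true antecedents recovered)."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    accuracy = correct / len(gold)
    antecedents = sum(gold)
    found = sum(1 for g, p in zip(gold, predicted) if g and p)
    recall = found / antecedents if antecedents else 0.0
    return accuracy, recall

# Invented data: 5 antecedents (True) among 100 candidate NPs.
gold = [True] * 5 + [False] * 95
blind_guess = [False] * 100  # always predict "not an antecedent"

accuracy, recall = evaluate(gold, blind_guess)
print(accuracy, recall)
```

Overall accuracy is therefore a weak measure here; the fraction of antecedents identified, which the slides report alongside accuracy, is the more informative figure.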

29
Experiments and Results

Whenever the NP is in the current clause, it is
also in the current sentence, although the
converse is not true. The current-sentence factor
was therefore largely redundant in the training
data, and after removing this feature, 55.7% of the
antecedents could be identified correctly.

30
Experiments and Results

By training on somewhat more balanced training
data, 143 out of 168, that is, 85.1% of the
antecedents could be correctly recognized on
test data of 768 items.
There is scope for precision-recall trade-off,
sensitivity analysis, dimensionality reduction,
etc.

31
Conclusions

The machine learning approach holds promise.
Further experimentation with sensitivity analysis
of weights is planned.
Dimensionality reduction by elimination of the least
contributing features may actually boost
performance, as noise will be reduced.
The system can be used to develop a larger tagged
data set.

32
Thank You
