Title: AUKBC Research Centre
1AU-KBC Research Centre
Innovation Through Research
- Madras Institute of Technology,
- Anna University, Chennai
- www.au-kbc.org
2Pronominal Resolution in Tamil Using Machine
Learning techniques
- Narayana Murthy K, Sobha L, Muthukumari B
3Pronominal Resolution in Tamil Using Machine
Learning techniques
- What is Anaphora
- What are the types of Anaphora
- Why Anaphora Resolution is required
- About Tamil Language
- Pronominal Resolution in Tamil
- The Salience Factor approach
- The Machine Learning Approach
4What is Anaphora?
- Any entity that refers back is called an Anaphor
- Anaphora, in discourse, is a device for making an
abbreviated reference (containing fewer bits of
disambiguating information, rather than being
lexically or phonetically shorter) to some entity
(or entities) in the expectation that the
receiver of the discourse will be able to
disabbreviate the reference and, thereby,
determine the identity of the entity. (Hirst 1981)
- E.g. The Prime Minister is yet to arrive and he is
expected at the central hall at any time. (The
Times of India, Feb 2001)
5Types of Anaphora
- Personal, Possessive, Reflexive, Demonstrative
and Relative Pronouns
- he, him, she, her, it, they, them: 3rd person personal
- his, her, hers, its, their, theirs: possessive
- himself, herself, itself, themselves: reflexive
- this, that, these, those: demonstrative
- who, whom, which, whose: relative
- What is a pronominal anaphor?
- Vajpayee hits back forcefully when he told the
opposition today "sometimes we fall prey to the
media and sometimes you do". (Indian Express, 2001)
6About Tamil Language
- Tamil belongs to the South Dravidian family of
languages.
- It is a verb-final language and allows scrambling.
- It has postpositions.
- The genitive precedes the head noun in the
genitive phrase.
- The complementizer follows the embedded clause.
- Adjectives, participial adjectives and free
relatives precede the head noun.
- It is a nominative-accusative language like the
other Dravidian languages.
7Tamil (Contd.)
- The subject of a Tamil sentence is mostly
nominative.
- Certain verbs require dative subjects.
- Possessive noun phrases can also occur as subjects.
- There is png (person-number-gender) agreement between
the subject and the verb.
8The Pronominals in Tamil
9Examples
- mohan avanutaya kulantayai kantan enRu
raman connar.
- mohan he(poss) child(acc) see(pst) compl
raman say(pst).
- (Raman said that Mohan saw his child.)
- sita avalai atittal enRu kavita
connar.
- sita she(acc) beat(pst) compl kavita
say(pst).
- (Kavita said that Sita hit her.)
10Salience Factor Approach
- Lappin and Leass (1994) anaphora resolution
algorithm
- The Lappin and Leass (1994) anaphora resolution
algorithm uses salience weights to determine
the antecedent of a pronominal.
- It requires as input a fully parsed sentence
structure and uses the hierarchy to identify the
subject, object, etc.
- The algorithm uses syntactic criteria to rule
out noun phrases that cannot possibly corefer
with the pronominal. The antecedent is then chosen
according to a ranking based on salience weights.
11Salience Weights
12How salience weight is calculated
- The candidates that pass the png agreement check are
ranked according to their salience. A salience
value is calculated for each candidate. Consider the
sentence "Mary saw Bill"; the salience of Mary is:
- $\text{salience}(\text{Mary}) = W_{sent} + W_{subj} + W_{head} + W_{nonadv}$
- $= 100 + 80 + 80 + 50$
- $= 310$
- The candidate with the highest salience is
considered the antecedent of the pronominal.
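The calculation above can be sketched in a few lines of Python. This is only an illustration: the four weights for Mary are the ones on the slide, while the accusative weight `acc` (50) used for Bill is an assumption taken from the published Lappin and Leass (1994) weight table, and the names `WEIGHTS`, `salience` and `best_candidate` are hypothetical.

```python
# Salience weights. The first four appear on the slide; "acc" is assumed
# from the Lappin and Leass (1994) table and is not given on the slide.
WEIGHTS = {
    "sent": 100,    # sentence recency
    "subj": 80,     # subject emphasis
    "head": 80,     # head noun emphasis
    "nonadv": 50,   # non-adverbial emphasis
    "acc": 50,      # accusative (direct object) emphasis, assumed
}


def salience(factors):
    """Sum the weights of the factors that apply to one candidate NP."""
    return sum(WEIGHTS[f] for f in factors)


def best_candidate(candidates):
    """Return the name of the candidate with the highest salience."""
    return max(candidates, key=lambda name: salience(candidates[name]))


# "Mary saw Bill": Mary is the subject, Bill the direct object.
candidates = {
    "Mary": ["sent", "subj", "head", "nonadv"],
    "Bill": ["sent", "acc", "head", "nonadv"],
}

print(salience(candidates["Mary"]))   # 310
print(best_candidate(candidates))     # Mary
```

Under these assumed factor assignments Mary outranks Bill, so a following pronoun that agrees with both in png would be resolved to Mary.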
13Salience Weights for Tamil
- Deviates substantially from the Lappin and Leass
approach
- No in-depth parsing
- Exploits the rich morphology of the language
- The analysis depends on the salience weight of
each candidate NP for the antecedent-hood of an
anaphor, taken from a list of probable candidates.
14Salience factor weights for Tamil
15Definitions of the factors
16How it works
- The salience weight is assigned to an NP in the
following way:
- Identify the pronoun.
- Consider the four sentences above the sentence
containing the pronoun.
- All the NPs preceding the pronoun are
considered possible candidates (this is the
general rule).
- Here we also take some NPs that follow the
pronoun, since Tamil (like all Indian languages) has
relatively free word order.
- Assign salience weights.
- The NP that gets the maximum salience weight and
agrees in png with the anaphor is considered
the antecedent of the anaphor.
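A minimal sketch of the selection step described above, assuming the salience weights have already been assigned. The `NP` class, the `resolve` function and the salience numbers are hypothetical; the Tamil factor weights themselves are not reproduced here. As a simplification, every NP in the current sentence counts as a candidate, whether it precedes or follows the pronoun.

```python
from dataclasses import dataclass


@dataclass
class NP:
    text: str
    sentence: int   # index of the sentence containing the NP
    png: str        # person-number-gender tag, e.g. "3SN"
    salience: int   # weight already assigned by the salience factors


def resolve(pronoun_sentence, pronoun_png, nps, window=4):
    """Pick the png-compatible NP with the highest salience.

    Candidates come from the pronoun's sentence and the `window`
    sentences above it. Returns None if no candidate agrees in png.
    """
    first = pronoun_sentence - window
    pool = [c for c in nps
            if first <= c.sentence <= pronoun_sentence and c.png == pronoun_png]
    return max(pool, key=lambda c: c.salience) if pool else None


# Hypothetical salience values for NPs around the pronoun "atai" (3SN).
nps = [
    NP("aalaaka", 5, "3SN", 60),
    NP("velaiyai", 5, "3SN", 120),
    NP("palarum", 5, "3PN", 200),   # highest weight, but fails png agreement
    NP("payan", 0, "3SN", 300),     # outside the four-sentence window
]

antecedent = resolve(pronoun_sentence=5, pronoun_png="3SN", nps=nps)
print(antecedent.text)   # velaiyai
```

Filtering on png before ranking means a highly salient NP that disagrees with the pronoun can never be selected.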
17The System
[Figure: system pipeline. Input, then the pre-processors (POS tagger, NP chunker, clause identifier), then the salience factor assigner, then output.]
18The Pre Processors
- oru aalaaka oru velaiyai ceyarpatuttuvataivita
palarum kuuti atai niraiverruvatanaal payanum niramba
untu
- one person(adv) one task(acc)
enforce(vbnaccpsp) many_people(inc) gather it(acc)
perform(vbnins) benefit(inc) lot have.
- Instead of enforcing a task as a single person,
performing it by many people together gives a lot
of benefit.
19Pre-processing (Contd.)
Morphological Analyser: oru/Q aalaaka/CNADV
oru/Q velaiyai/CN3SNACC ceyarpatuttuvataivita/CN3SNACCDRDPSP
palarum/CN3PNINC kuuti/CNVVBP atai/PR3SNACC
niraiverruvatanaal/VCND payanum/VPST niramba/ADV untu/V3SN
POS Tagger: oru/Q aalaaka/ADV oru/Q
velaiyai/CNACC ceyarpatuttuvataivita/CN
palarum/CN kuuti/CN atai/PRACC
niraiverruvatanaal/CN payanum/VPST niramba/ADV
untu/V
20Pre-processing (Contd.)
NP Chunker:
NP oru/Q aalaaka/CN3SNADV /NP
NP oru/Q velaiyai/CN3SNACC <ANTECEDE1,2> /NP
NP ceyarpatuttuvataivita/CN3SNACCDRDPSP /NP
NP palarum/CN3PNINC /NP
VP kuuti/VVBP /VP
NP atai/PR3SNACC <ANAPHOR1> /NP
VP niraiverruvatanaal/VCND /VP
NP payanum/CN3SNINC /NP
VP niramba/ADV /VP
VP untu/V3SN /VP
Clause Marking:
SCL1
NP oru/Q aalaaka/CN3SNADV /NP
NP oru/Q velaiyai/CN3SNACC <ANTECEDE1,2> /NP
NP ceyarpatuttuvataivita/CN3SNACCDRDPSP /NP
NP palarum/CN3PNINC /NP
VP kuuti/VVBP /VP
NP atai/PR3SNACC <ANAPHOR1> /NP
VP niraiverruvatanaal/VCND /VP
/SCL1
MCL
NP payanum/CN3SNINC /NP
VP niramba/ADV /VP
VP untu/V3SN /VP PNC
/MCL
21The Result
- Total number of sentences: 900
- Total pronouns taken: 233
- Number of possible NP candidates: 4391
- Number of correct antecedents identified: 170
- Number of wrong antecedents identified: 63
- The precision is 170/233 x 100 = 73.0%
22Machine Learning Approach
- Multiple Linear Regression is used as a two-class
classifier.
- The factors, called features here, are treated as
binary; for example, the candidate NP is in the
current sentence or not.
- During training, the weights of these features
are computed using regression; the weights
minimize the squared error.
23MLR
- Multiple Linear Regression
- Regression analysis
- A statistical technique for investigating and
modeling the relationship between variables in
a system.
- Suitable for a small number of features
- Suppose there are $k$ features. Let $x_{ij}$
denote the $i$th observation of feature $x_j$, where
$i = 1, 2, \dots, n$ and $j = 1, 2, \dots, k$. Let $y_i$ be the $i$th
observed value of the decision variable.
- $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \epsilon_i$
24MLR Continued
- where $\beta_j$, $j = 0, 1, \dots, k$, are the regression
coefficients and $\epsilon_i$, $i = 1, 2, \dots, n$, are called
error terms or residuals
- In matrix notation,
- $Y = X\beta + \epsilon$
- where $Y$ is an $n \times 1$ vector of observations, $X$ is
the $n \times p$ matrix of feature values with
$p = k + 1$, $\beta$ is the $p \times 1$ vector of parameters and
$\epsilon$ is the $n \times 1$ vector of error terms
25MLR Continued
- We estimate the values of the parameters $\beta$
using the least squares method
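As an illustration of the least squares step, the sketch below solves the normal equations $(X^T X)\beta = X^T Y$, the standard closed form for this estimate, in plain Python. The function name `fit_mlr` and the toy data are hypothetical, and the targets +1 and -1 are one assumed choice of symmetrical values. The authors' actual implementation is not described in the slides.

```python
def fit_mlr(features, targets):
    """Estimate beta for y = X beta + eps by solving (X^T X) beta = X^T y."""
    # Prepend a column of ones so beta[0] plays the role of beta_0.
    X = [[1.0] + [float(v) for v in row] for row in features]
    p = len(X[0])
    # Augmented normal-equation system [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(X, targets))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


# Toy data (hypothetical): two binary features, target +1 for an
# antecedent and -1 otherwise. Here the target depends only on feature 1.
features = [[1, 1], [1, 0], [0, 1], [0, 0]]
targets = [1, 1, -1, -1]
beta = fit_mlr(features, targets)
print([round(b, 6) for b in beta])
```

On this toy data the fit is exact: the intercept is -1, the first feature gets weight 2 and the uninformative second feature gets weight 0, which is the sense in which the weights reflect the relative significance of the features.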
26Training
- The training data consists of candidate NPs, the
binary feature values and the target, that is, whether
the candidate is actually an antecedent or not.
- The weights show the relative significance of the
various features given the training data.
- Stability of the weights can be checked by
training on different sets of data.
- The target value (y) is assigned symmetrical values
for the two cases.
27Testing
- During testing, the target value is computed and
its sign is checked to decide whether the candidate is an
antecedent or not.
- Performance can be measured in terms of accuracy.
- By introducing a threshold for the unknown case,
we can trade recall for precision.
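The decision rule can be sketched as follows, assuming the two classes were trained with targets of opposite sign. The function `predict`, the weight vector `beta` and the threshold values are hypothetical, chosen only to show how a band around zero produces the "unknown" case.

```python
def predict(beta, features, threshold=0.0):
    """Classify one candidate NP from its binary feature values.

    The score is beta_0 + sum(beta_j * x_j). Scores inside the band
    [-threshold, +threshold] are reported as "unknown".
    """
    score = beta[0] + sum(b * x for b, x in zip(beta[1:], features))
    if score > threshold:
        return "antecedent"
    if score < -threshold:
        return "not antecedent"
    return "unknown"


beta = [-0.5, 1.0, 0.25]   # hypothetical weights: intercept, feature 1, feature 2

print(predict(beta, [1, 0]))                  # antecedent (score 0.5)
print(predict(beta, [0, 1]))                  # not antecedent (score -0.25)
print(predict(beta, [0, 1], threshold=0.3))   # unknown
```

Widening the band makes the classifier abstain on low-confidence candidates, so fewer antecedents are reported (lower recall) while the reported ones are more reliable (higher precision).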
28Experiments and Results
- Initially, it was observed that the training data
was highly imbalanced: about 95% of the candidate
NPs were not antecedents. A blind guess can give us
90%-plus overall accuracy although no antecedent
will be correctly recognized!
- When trained and tested on 1800 data items,
overall accuracy was 81.98% but only 39.3% of the
antecedents were identified.
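The effect of the imbalance is easy to reproduce. The sketch below uses a made-up sample of 100 candidates with the roughly 95/5 split mentioned above; the function name and the label encoding (+1 for antecedent, -1 otherwise) are assumptions for illustration.

```python
def accuracy_and_recall(gold, predicted):
    """Overall accuracy, and recall on the antecedent class (label +1)."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    true_pos = sum(g == 1 and p == 1 for g, p in zip(gold, predicted))
    positives = sum(g == 1 for g in gold)
    recall = true_pos / positives if positives else 0.0
    return correct / len(gold), recall


gold = [1] * 5 + [-1] * 95    # about 5% of candidates are antecedents
blind = [-1] * 100            # always guess "not antecedent"

accuracy, recall = accuracy_and_recall(gold, blind)
print(accuracy, recall)       # 0.95 0.0
```

This is why overall accuracy alone is misleading here, and why the results are also reported as the share of antecedents identified.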
29Experiments and Results
- Whenever the NP is in the current clause, it is
also in the current sentence, although the
converse is not true. The current-sentence factor
was largely redundant in the training data, and by
removing this feature, 55.7% of the antecedents
could be identified correctly.
30Experiments and Results
- By training on somewhat more balanced training
data, 143 out of 168, that is, 85.1% of the
antecedents could be correctly recognized on
test data of 768 items.
- There is scope for precision-recall trade-off,
sensitivity analysis, dimensionality reduction,
etc.
31Conclusions
- The machine learning approach holds promise.
- Further experimentation with sensitivity analysis
of weights is planned.
- Dimensionality reduction by elimination of the least
contributing features may actually boost
performance, as noise will be reduced.
- The system can be used to develop larger tagged
data.
32Thank You