A Supervised Machine Learning Approach to Conjunction Disambiguation in Named Entities

Paweł Mazur (University of Technology, Wrocław, Poland) Pawel.Mazur_at_pwr.wroc.pl and Robert Dale (Macquarie University, Sydney, Australia) Robert.Dale_at_mq.edu.au

1
A Supervised Machine Learning Approach to
Conjunction Disambiguation in Named Entities

Pawel Mazur (University of Technology, Wroclaw, Poland) Pawel.Mazur_at_pwr.wroc.pl
and
Robert Dale (Macquarie University, Sydney, Australia) Robert.Dale_at_mq.edu.au

2
Agenda

Conjunction in Named Entities
Our approach
Experiments
Results of the experiment
Results analysis
Conclusions
Further work

3
Conjunction in Candidate Named Entity String

Examples:
Fujitsu Australia and New Zealand
Australia and New Zealand Banking Group Limited
Peter Smith and Ann Arbor Software Council
Candidate named entity string:
Any sequence of words starting with initial capitals
A single instance of the word "and" or the "&" form of conjunction
In 45 documents out of 13460, 5.7% of candidate named entity strings contained a conjunction; in some documents the frequency is as high as 23%; in MUC-7 it is 4.5%
Many candidate named entity strings in this domain contain company names and person names

4
Our Approach - A Classification Task

We distinguish 4 categories of conjunction in a candidate NE string:
Category A, Name Internal Conjunction: Copper Mines and Metals Limited; Herbert P Cooper & Son; Ernst and Young
Category B, Name External Conjunction (the most common): Proxy Form and Explanatory Memorandum; Hardware & Operating Systems; EchoStar and News Corporation
Category C, Right-Copy Separator: William and Alma Ford; Connel and Bent Streets; Eastern and Western Australia
Category D, Left-Copy Separator: Hospital Equipment & Systems; J H Blair Company Secretary & Corporate Counsel

Could be seen as one linguistic category

5
Our Approach - Candidate NE String Pattern

String: Australia and New Zealand Banking Group Limited
Pattern: (Loc and Loc Org CompDesig)
String: Peter Smith and Ann Arbor Software Council
Pattern: (GivenName FamilyName and GivenName FamilyName Noun Org)
Patterns are created using gazetteers and simple keyword-based heuristics.
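
The pattern-building step on this slide can be illustrated with a minimal sketch. It assumes a greedy longest-match lookup against a toy gazetteer; the gazetteer entries, the `make_pattern` name, and the fallback to `InitCapped` for unknown words are illustrative assumptions, not the authors' actual resources or heuristics.

```python
# Toy gazetteer (hypothetical entries, chosen so the slide's first example
# comes out with the same pattern). Keys are lowercased phrases.
GAZETTEER = {
    "australia": "Loc",
    "new zealand": "Loc",
    "banking group": "Org",
    "limited": "CompDesig",
    "pty ltd": "CompDesig",
}
CONJUNCTIONS = {"and", "&"}  # "&" is an assumed second conjunction form
MAX_PHRASE_LEN = 3           # longest gazetteer phrase we try to match


def make_pattern(candidate):
    """Map a candidate NE string to a sequence of tags."""
    tokens = candidate.split()
    tags = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in CONJUNCTIONS:
            # The conjunction itself is kept in the pattern.
            tags.append(tokens[i].lower())
            i += 1
            continue
        # Try the longest phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in GAZETTEER:
                tags.append(GAZETTEER[phrase])
                i += n
                break
        else:
            # No gazetteer hit: fall back to the generic tag.
            tags.append("InitCapped")
            i += 1
    return " ".join(tags)


print(make_pattern("Australia and New Zealand Banking Group Limited"))
```

With a gazetteer this small, most words fall through to `InitCapped`, which is also the most frequent tag reported on the Tag Set slide.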

6
Tag Set

PersDesig: Mr, Mrs, Ms, Miss, Dr, Prof, Sir, Madam, Messrs, and Jnr.
CompDesig: Ltd, Limited, Pty Ltd, GmbH, plc and many more, and also Investments Pty Ltd, Management Pty Ltd, Corporate Pty Ltd, Associates Pty Ltd, Family Trust, Co Limited, Partners, Partners Limited, Capital Limited, and Capital Pty Ltd.
CompPos: Director, Secretary, Manager, Counsel, Managing Director, Member, Chairman, Chief Executive, Chief Executive Officer, and CEO, and also some bodies within organizations, such as Board and Committee.

Tag Count %
InitCapped 925 42.24
Loc 245 11.19
Org 175 7.99
FamilyName 164 7.49
CompDesig 138 6.30
Initial 108 4.93
CompPos 99 4.52
GivenName 89 4.06
Of 76 3.47
Abbrev 73 3.33
PersDesig 39 1.78
Det 31 1.42
Dir 12 0.55
Son 7 0.32
Month 6 0.27
AlphaNum 3 0.14

7
Data Encoding

Each instance is encoded with 33 attributes:
1 binary attribute for each tag for each conjunct, signaling its presence in the string (2 x 16 = 32 attributes in total)
1 binary attribute ConjType encoding the lexical form of the conjunction in the string (0 for &, 1 for and)
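
The encoding described above can be sketched as follows. This is an illustration under assumptions: the order of the attributes, the `encode` name, and the use of "&" as the second conjunction form are not given in the slides; the 16 tag names are taken from the Tag Set slide.

```python
# The 16 tags listed on the Tag Set slide (attribute order is an assumption).
TAGS = [
    "InitCapped", "Loc", "Org", "FamilyName", "CompDesig", "Initial",
    "CompPos", "GivenName", "Of", "Abbrev", "PersDesig", "Det",
    "Dir", "Son", "Month", "AlphaNum",
]
CONJUNCTIONS = ("and", "&")


def encode(pattern):
    """Turn a tag pattern such as 'Loc and Loc Org CompDesig' into 33 bits."""
    tokens = pattern.split()
    # Position of the single conjunction in the pattern.
    conj_idx = next(i for i, t in enumerate(tokens) if t in CONJUNCTIONS)
    left = set(tokens[:conj_idx])
    right = set(tokens[conj_idx + 1:])
    # 16 presence bits for the left conjunct, then 16 for the right one.
    vector = [int(tag in left) for tag in TAGS]
    vector += [int(tag in right) for tag in TAGS]
    # Final bit: lexical form of the conjunction (0 for &, 1 for and).
    vector.append(1 if tokens[conj_idx] == "and" else 0)
    return vector


vec = encode("Loc and Loc Org CompDesig")
print(len(vec), vec)
```

Because each attribute records only presence, the order and the number of repetitions of a tag within a conjunct are lost in this representation.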

8
Corpus Data Sets

Corpus: 13460 text documents, from 8 to 1000 lines long
Our corpus is a subcorpus drawn from a collection of company announcements from the Australian Stock Exchange
Selection of candidate named entity strings: a sequence of initcapped words and a single conjunction ("and" or "&"), with optional "of", "a", "an", "the"
This gave a set of 10925 strings, 6437 of which are unique
Hand elimination of strings wrongly identified due to typographic features of the documents (tables)
Random selection of 600 examples from the unique set
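
The selection of candidate strings could be approximated with a regular expression, as in the sketch below. It assumes the second conjunction form is "&" and treats a word as initcapped if it starts with an uppercase ASCII letter; the names `CANDIDATE` and `find_candidates` are hypothetical. It does not reject longer strings with several conjunctions (it would return the first "A and B" part), and it does nothing about table artifacts, which the slide says were removed by hand.

```python
import re

WORD = r"[A-Z][A-Za-z]*"          # an initcapped word (simplified)
FILLER = r"(?:of|a|an|the)"       # optional lowercase words inside a conjunct
# A conjunct: initcapped words, optionally joined by filler words.
CONJUNCT = rf"{WORD}(?:\s+(?:{FILLER}\s+)*{WORD})*"
# A candidate: conjunct, one conjunction, conjunct.
CANDIDATE = re.compile(
    rf"\b{CONJUNCT}\s+(?:and|&)\s+(?:{FILLER}\s+)*{CONJUNCT}"
)


def find_candidates(text):
    """Return all candidate NE strings containing a conjunction."""
    return [m.group(0) for m in CANDIDATE.finditer(text)]


print(find_candidates("The board of Fujitsu Australia and New Zealand met today."))
```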

Category Count %
Name Internal 185 30.8
Name External 350 58.3
Right-Copy 39 6.5
Left-Copy 26 4.3
Sum 600 100

9
Machine-learned Classifiers

Naïve Bayes
Multilayered Perceptron
IBk
K*
Random Tree
Logistic Model Trees (LMT)
J4.8
SMO
Implementations in WEKA (Waikato Environment for
Knowledge Analysis), University of Waikato in New
Zealand

10
Baseline

Determined with the 0-R algorithm, which always assigns the most common category (Name External): 58.33%
A better baseline is given by the 1-R algorithm:
IF ConjForm = & THEN PredCat ← Internal
IF ConjForm = and THEN PredCat ← External
Baseline: 69.83%
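
The two baselines can be sketched in a few lines. This is not the WEKA implementation: `zero_r`, `one_r_conj_form` and the `conj_form` key are hypothetical names, and the 1-R rule is hard-coded from the slide rather than learned. The label list is rebuilt from the category counts on the data-set slide (185, 350, 39, 26), which is enough to reproduce the 0-R figure; reproducing the 1-R figure would need the actual instances.

```python
from collections import Counter


def zero_r(train_labels):
    """0-R: ignore all attributes and always predict the majority class."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda instance: majority


def one_r_conj_form(instance):
    """The slide's 1-R rule: decide on the conjunction form alone."""
    return "Name Internal" if instance["conj_form"] == "&" else "Name External"


def accuracy(predict, instances, labels):
    hits = sum(predict(x) == y for x, y in zip(instances, labels))
    return hits / len(labels)


# Labels rebuilt from the category counts in the data set.
labels = (["Name Internal"] * 185 + ["Name External"] * 350
          + ["Right-Copy"] * 39 + ["Left-Copy"] * 26)
instances = [{} for _ in labels]  # 0-R never looks at the attributes

zero_r_accuracy = round(100 * accuracy(zero_r(labels), instances, labels), 2)
print(zero_r_accuracy)
```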

11
Results

Algorithm Correctly classified (of 600)
IBk 504 (84.00%)
Random Tree 503 (83.83%)
K* 501 (83.50%)
SMO - quadratic kernel 494 (82.33%)
Mult. Perceptron 493 (82.17%)
LMT 487 (81.17%)
J4.8 477 (79.50%)
SMO - linear kernel 468 (78.00%)
Naïve Bayes 424 (70.67%)
Baseline 419 (69.83%)

12
Accuracy by Conjunction Category

Category Precision Recall F-Measure
Name Internal 0.814 0.876 0.844
Name External 0.872 0.897 0.885
Right-Copy 0.615 0.410 0.492
Left-Copy 0.800 0.462 0.585
Weighted mean 0.834 0.840 0.833

13
Confusion Matrix

Columns give the actual category; rows give the category assigned by the classifier.

Classified as    Name Internal   Name External   Right-Copy   Left-Copy
Name Internal    162             28              6            3
Name External    18              314             17           11
Right-Copy       4               6               16           0
Left-Copy        1               2               0            12
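
The per-category figures on the previous slide can be recomputed from this matrix, as the sketch below shows. It reads the rows as the assigned category and the columns as the actual category (the column sums 185, 350, 39 and 26 match the category counts in the data set, and the diagonal sums to 504, the IBk result). The function and variable names are illustrative.

```python
CATEGORIES = ["Name Internal", "Name External", "Right-Copy", "Left-Copy"]
# MATRIX[i][j]: examples of actual category j classified as category i.
MATRIX = [
    [162, 28, 6, 3],
    [18, 314, 17, 11],
    [4, 6, 16, 0],
    [1, 2, 0, 12],
]


def per_category_scores(matrix):
    """Return {category: (precision, recall, f_measure, actual_count)}."""
    scores = {}
    for k, name in enumerate(CATEGORIES):
        true_pos = matrix[k][k]
        predicted = sum(matrix[k])              # row sum: assigned to k
        actual = sum(row[k] for row in matrix)  # column sum: really k
        precision = true_pos / predicted
        recall = true_pos / actual
        f_measure = 2 * precision * recall / (precision + recall)
        scores[name] = (precision, recall, f_measure, actual)
    return scores


def weighted_mean(scores, index):
    """Mean of one score, weighted by the number of actual examples."""
    total = sum(s[3] for s in scores.values())
    return sum(s[index] * s[3] for s in scores.values()) / total


scores = per_category_scores(MATRIX)
for name, (p, r, f, n) in scores.items():
    print(f"{name:14s} {p:.3f} {r:.3f} {f:.3f}")
print(f"weighted mean  {weighted_mean(scores, 0):.3f} "
      f"{weighted_mean(scores, 1):.3f} {weighted_mean(scores, 2):.3f}")
```

The low recall for the two copy-separator categories reflects how few examples of them the data set contains (39 and 26 out of 600).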

14
Results Analysis: Conjunction Category Indicators

For Name External conjunction
Month X
X Month
CompDesig X
X PersDesig
X GivenName
X Dir
X Deter
Abbrev X
X Abbrev
For Name Internal conjunction
- X Son (note: Sons of Gwalia Ltd and Gwalia Consolidated Ltd)

15
Error Analysis: InitCapped

38 of all 96 misclassified examples (40%) have patterns consisting of InitCapped tags only
In these cases the classification ended up being determined on the basis of the ConjForm attribute (just as the baseline was determined).
There were 134 InitCapped-only patterns in the data set
96 of them (71.64%) were classified correctly (comparable to the overall baseline result of 69.83%).
There were also 11 misclassified examples consisting mainly of InitCapped tags.
Ex.: Australian Labor Party and Independent Members
Loc InitCapped Org and InitCapped InitCapped
16
Error Analysis: Long Patterns

In 2 cases the misclassification was due to the long patterns of the examples:
Fellow of the Australian Institute of Geoscientists and The Australasian Institute of Mining
CompPos Of Det Loc Org Of InitCapped and Det Loc Org Of InitCapped
(Left-Copy > Name Internal)
Fellow of the Royal College of Pathologists of Australasia and Chairman of Scientific Services Limited
Pos Of Deter Org Of InitCapped Of Loc and Pos Of InitCapped InitCapped Desig
(Name External > Name Internal)

17
Error Analysis: Other Cases

2 cases of extended patterns: a pattern is built as another (common) pattern + an additional tag
WD & HO Wills Holdings Limited
Initial Initial & Initial Initial FamilyName CompDesig (Name Internal) vs Initial Initial & Initial Initial FamilyName (Right-Copy)
A conjunction of a person name and a company name: Wayne Jones and Topsfield Pty Ltd; ambiguous even for humans without contextual information
A conjunction of two person names: in our domain there is only one case where this is of the name external type
There are around 20 examples where it is difficult to judge the reason for the misclassification; perhaps the reason is the model we have built
Influence of k-fold evaluation: different classification for the same pattern in different folds

18
Conclusions

Distinguished 4 categories of conjunctions in NEs
Presented the problem as one of classification
Experiment with machine-learned classifiers
Results: F = 0.833
Simple tag set used
Some examples are truly ambiguous even for humans

19
Further Work

Multiple conjunctions
Human supervised N-gram based preprocessing
Abbreviation preprocessing
Limit the number of InitCapped tags
Take into account the syntactic number of tokens
Use contextual information (e.g. syntactic number of the associated verb)
Extend the evaluation data
Evaluation with full named entity recognition
process
