Title: A Supervised Machine Learning Approach to Conjunction Disambiguation in Named Entities
1. A Supervised Machine Learning Approach to Conjunction Disambiguation in Named Entities
- Pawel Mazur (University of Technology, Wroclaw, Poland), Pawel.Mazur_at_pwr.wroc.pl
- Robert Dale (Macquarie University, Sydney, Australia), Robert.Dale_at_mq.edu.au
2. Agenda
- Conjunction in Named Entities
- Our approach
- Experiments
- Results of the experiment
- Results analysis
- Conclusions
- Further work
3. Conjunction in Candidate Named Entity String
- Fujitsu Australia and New Zealand
- Australia and New Zealand Banking Group Limited
- Peter Smith and Ann Arbor Software Council
- Candidate named entity string:
- Any sequence of words starting with initial capitals
- Single instance of a conjunction, in the form of the word "and" or "&"
- 45 documents out of 13460: 5.7% of candidate named entity strings contained a conjunction; in some documents the frequency is as high as 23%; in MUC-7 it is 4.5%
- A lot of candidate named entity strings in this domain contain company names and person names
4. Our Approach - A Classification Task
- We distinguish 4 categories of conjunction in a candidate NE string
- Category A, Name Internal Conjunction: Copper Mines and Metals Limited; Herbert P Cooper & Son; Ernst and Young
- Category B, Name External Conjunction: Proxy Form and Explanatory Memorandum; Hardware & Operating Systems; EchoStar and News Corporation
- Category C, Right-Copy Separator: William and Alma Ford; Connel and Bent Streets; Eastern and Western Australia
- Category D, Left-Copy Separator: Hospital Equipment & Systems; J H Blair Company Secretary & Corporate Counsel
The most common
Could be seen as one linguistic category
5. Our Approach - Candidate NE String Pattern
- String: Australia and New Zealand Banking Group Limited
  Pattern: (Loc and Loc Org CompDesig)
- String: Peter Smith and Ann Arbor Software Council
  Pattern: (GivenName FamilyName and GivenName FamilyName Noun Org)
- Patterns are created using gazetteers and simple keyword-based heuristics.
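As an illustration of how a string might be turned into a pattern, here is a minimal sketch of gazetteer lookup with greedy longest match. The lexicon entries, the `to_pattern` name and the two-word phrase limit are assumptions made for this example; the slides do not describe the actual gazetteers or heuristics. Unknown tokens fall back to the InitCapped tag.

```python
# Hypothetical toy lexicon standing in for the real gazetteers and keyword lists.
LEXICON = {
    "Australia": "Loc", "New Zealand": "Loc",
    "Banking Group": "Org", "Council": "Org",
    "Limited": "CompDesig",
    "Peter": "GivenName", "Ann": "GivenName",
    "Smith": "FamilyName", "Arbor": "FamilyName",
    "Software": "Noun",
}

CONJUNCTIONS = ("and", "&")


def to_pattern(string, lexicon=LEXICON, max_len=2):
    """Map a candidate NE string to a tag pattern (greedy longest match)."""
    tokens = string.split()
    tags = []
    i = 0
    while i < len(tokens):
        if tokens[i] in CONJUNCTIONS:
            # The conjunction is kept verbatim in the pattern.
            tags.append(tokens[i])
            i += 1
            continue
        # Try the longest phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                tags.append(lexicon[phrase])
                i += n
                break
        else:
            # No gazetteer hit: fall back to the generic tag.
            tags.append("InitCapped")
            i += 1
    return " ".join(tags)


pattern_1 = to_pattern("Australia and New Zealand Banking Group Limited")
pattern_2 = to_pattern("Peter Smith and Ann Arbor Software Council")
```

With this toy lexicon the two calls reproduce the two patterns shown on the slide. Note that the second result depends on "Ann Arbor" not being listed as a location, which is exactly the kind of ambiguity the classifier has to cope with.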
6. Tag Set
- PersDesig: Mr, Mrs, Ms, Miss, Dr, Prof, Sir, Madam, Messrs, and Jnr.
- CompDesig: Ltd, Limited, Pty Ltd, GmbH, plc and many more, and Investments Pty Ltd, Management Pty Ltd, Corporate Pty Ltd, Associates Pty Ltd, Family Trust, Co Limited, Partners, Partners Limited, Capital Limited, and Capital Pty Ltd.
- CompPos: Director, Secretary, Manager, Counsel, Managing Director, Member, Chairman, Chief Executive, Chief Executive Officer, and CEO, and also some bodies within organizations, such as Board and Committee.
Tag Count Percent
InitCapped 925 42.24%
Loc 245 11.19%
Org 175 7.99%
FamilyName 164 7.49%
CompDesig 138 6.30%
Initial 108 4.93%
CompPos 99 4.52%
GivenName 89 4.06%
Of 76 3.47%
Abbrev 73 3.33%
PersDesig 39 1.78%
Det 31 1.42%
Dir 12 0.55%
Son 7 0.32%
Month 6 0.27%
AlphaNum 3 0.14%
7. Data Encoding
- Each instance is encoded with 33 attributes
- 1 binary attribute for each tag for each conjunct, signaling its presence in the string (2 x 16 = 32 attributes in total)
- 1 binary attribute ConjType encoding the lexical form of the conjunction in the string (0 for &, 1 for and)
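The encoding can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the attribute order (left conjunct, right conjunct, then ConjType), the `encode` name, and the assumption that the two lexical forms are "and" and "&" are choices made for this example. The 16 tag names come from the tag set slide.

```python
# The 16 tags from the tag set slide; the ordering here is an assumption.
TAGS = [
    "InitCapped", "Loc", "Org", "FamilyName", "CompDesig", "Initial",
    "CompPos", "GivenName", "Of", "Abbrev", "PersDesig", "Det",
    "Dir", "Son", "Month", "AlphaNum",
]


def encode(pattern):
    """Encode a tag pattern as 33 binary attributes.

    Layout (assumed): 16 presence flags for the left conjunct,
    16 presence flags for the right conjunct, then ConjType
    (0 for "&", 1 for "and").
    """
    tokens = pattern.split()
    # Position of the single conjunction in the pattern.
    conj_idx = next(i for i, t in enumerate(tokens) if t in ("and", "&"))
    left = set(tokens[:conj_idx])
    right = set(tokens[conj_idx + 1:])
    vector = [int(tag in left) for tag in TAGS]
    vector += [int(tag in right) for tag in TAGS]
    vector.append(1 if tokens[conj_idx] == "and" else 0)
    return vector


example = encode("Loc and Loc Org CompDesig")
```

For the pattern above, the vector has one flag set for the left conjunct (Loc), three for the right conjunct (Loc, Org, CompDesig), and ConjType set to 1. Because the flags only record presence, the order and repetition of tags within a conjunct are lost.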
8. Corpus Data Sets
- Corpus: 13460 text documents, from 8 to 1000 lines long
- Our corpus is a subcorpus drawn from a collection of company announcements from the Australian Stock Exchange
- Selection of candidate named entity strings: a sequence of initcapped words and a single conjunction (and or &), also optionally of, a, an, the
- We got a set of 10925 strings, 6437 of which are unique
- Hand elimination of wrongly identified strings due to typographic features of the documents (tables)
- Random selection of 600 examples from the unique set
Name Internal Name External Right-Copy Left-Copy Sum
185 (30.8%) 350 (58.3%) 39 (6.5%) 26 (4.3%) 600 (100%)
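The candidate selection step could be approximated with a regular expression. The sketch below is only one possible reading of the rule on this slide: the exact definition of an initcapped word, the handling of punctuation, and the names `WORD`, `FILLER`, `SIDE` and `CANDIDATE` are assumptions. It matches up to the first conjunction only, so strings with several conjunctions would need extra handling.

```python
import re

# An initcapped word: a capital letter followed by word characters,
# apostrophes or hyphens (an assumed definition).
WORD = r"[A-Z][\w'-]*"
# Lowercase function words allowed inside a candidate string.
FILLER = r"(?:of|a|an|the)"
# One side of the conjunction: initcapped words, optionally joined by fillers.
SIDE = rf"{WORD}(?:\s+(?:{FILLER}\s+)*{WORD})*"
# A full candidate: side, a single conjunction, side.
CANDIDATE = re.compile(rf"{SIDE}\s+(?:and|&)\s+(?:{FILLER}\s+)*{SIDE}")

text = "shares in Fujitsu Australia and New Zealand were sold"
match = CANDIDATE.search(text)
candidate = match.group(0) if match else None
```

On the sample sentence the expression picks out "Fujitsu Australia and New Zealand", leaving the surrounding lowercase words outside the candidate.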
9. Machine-learned Classifiers
- Naïve Bayes
- Multilayered Perceptron
- IBk
- K*
- Random Tree
- Logistic Model Trees (LMT)
- J4.8
- SMO
- Implementations in WEKA (Waikato Environment for
Knowledge Analysis), University of Waikato in New
Zealand
10. Baseline
- Determined with the 0-R algorithm, which always assigns the most common category (Name External): 58.33%
- A better baseline is given by the 1-R algorithm:
  IF ConjForm = & THEN PredCat = Internal
  IF ConjForm = and THEN PredCat = External
  baseline: 69.83%
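The two baselines can be sketched in a few lines. This is an illustrative version, not WEKA's implementation: `zero_r` and `fit_one_r` are names chosen for this example, and `fit_one_r` builds a rule for a single given attribute, whereas full 1-R also selects the best attribute. The class counts are those reported on the corpus slide; the five-item toy sample is hypothetical.

```python
from collections import Counter, defaultdict


def zero_r(labels):
    """0-R: always predict the most frequent class."""
    return Counter(labels).most_common(1)[0][0]


def fit_one_r(values, labels):
    """1-R for one attribute: map each value to its majority class."""
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1
    return {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}


# Class distribution of the 600 examples, as reported on the corpus slide.
labels = (["Name Internal"] * 185 + ["Name External"] * 350
          + ["Right-Copy"] * 39 + ["Left-Copy"] * 26)
majority = zero_r(labels)
zero_r_accuracy = labels.count(majority) / len(labels)

# Hypothetical toy sample, only to show the shape of the learned rule.
toy_forms = ["&", "&", "and", "and", "and"]
toy_labels = ["Name Internal", "Name Internal",
              "Name External", "Name External", "Name Internal"]
rule = fit_one_r(toy_forms, toy_labels)
```

The 0-R accuracy follows directly from the class counts (350 of 600). The 69.83% figure for 1-R cannot be recomputed here because it requires the per-example conjunction forms, which the slides do not list.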
11. Results
Algorithm Correctly classified
IBk 504 (84.00%)
Random Tree 503 (83.83%)
K* 501 (83.50%)
SMO - quadratic kernel 494 (82.33%)
Mult. Perceptron 493 (82.17%)
LMT 487 (81.17%)
J4.8 477 (79.50%)
SMO - linear kernel 468 (78.00%)
Naïve Bayes 424 (70.67%)
Baseline 419 (69.83%)
12. Accuracy by Conjunction Category
Category Precision Recall F-Measure
Name Internal 0.814 0.876 0.844
Name External 0.872 0.897 0.885
Right-Copy 0.615 0.410 0.492
Left-Copy 0.800 0.462 0.585
weighted mean 0.834 0.840 0.833
13. Confusion Matrix
Name Internal Name External Right-Copy Left-Copy Classified as
162 28 6 3 Name Internal
18 314 17 11 Name External
4 6 16 0 Right-Copy
1 2 0 12 Left-Copy
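The per-category figures can be recomputed from the confusion matrix. The sketch below assumes that rows are the predicted category ("Classified as") and columns are the actual category; under that reading the column sums match the class counts of the data set (185, 350, 39, 26). The function name `per_class_metrics` is chosen for this example.

```python
LABELS = ["Name Internal", "Name External", "Right-Copy", "Left-Copy"]

# Rows: predicted category ("Classified as"); columns: actual category.
CM = [
    [162, 28, 6, 3],
    [18, 314, 17, 11],
    [4, 6, 16, 0],
    [1, 2, 0, 12],
]


def per_class_metrics(cm):
    """Return {class index: (precision, recall, F-measure)}."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        true_pos = cm[k][k]
        predicted = sum(cm[k])                      # row sum
        actual = sum(cm[r][k] for r in range(n))    # column sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics[k] = (precision, recall, f_measure)
    return metrics


metrics = per_class_metrics(CM)
total = sum(sum(row) for row in CM)
accuracy = sum(CM[k][k] for k in range(len(CM))) / total
# F-measure averaged over classes, weighted by the actual class sizes.
class_sizes = [sum(CM[r][k] for r in range(len(CM))) for k in range(len(CM))]
weighted_f = sum(class_sizes[k] * metrics[k][2] for k in metrics) / total
```

Rounded to three decimals, these values agree with the precision, recall and F-measure table, and the accuracy equals the 504 of 600 reported for IBk.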
14. Results Analysis - Conjunction Category Indicators
- For Name External conjunction
- Month X
- X Month
- CompDesig X
- X PersDesig
- X GivenName
- X Dir
- X Deter
- Abbrev X
- X Abbrev
- For Name Internal conjunction
- X Son (note: Sons of Gwalia Ltd and Gwalia Consolidated Ltd)
15. Error Analysis - InitCapped
- 38 of all 96 misclassified examples (40%) have patterns consisting of InitCapped tags only
- In these cases the classification ended up being determined on the basis of the ConjForm attribute (just as the baseline was determined).
- There were 134 InitCapped-only patterns in the data set
- 96 of them (71.64%) were classified correctly (compared with the overall baseline result of 69.83%).
- There were also 11 misclassified examples consisting mainly of InitCapped tags. Example: Australian Labor Party and Independent Members, with pattern Loc InitCapped Org and InitCapped InitCapped
16. Error Analysis - Long Patterns
- In 2 cases the misclassification was due to the long patterns of the examples
- Fellow of the Australian Institute of Geoscientists and The Australasian Institute of Mining
- CompPos Of Det Loc Org Of InitCapped and Det Loc Org Of InitCapped
- (Left-Copy > Name Internal)
- Fellow of the Royal College of Pathologists of Australasia and Chairman of Scientific Services Limited
- Pos Of Deter Org Of InitCapped Of Loc and Pos Of InitCapped InitCapped Desig
- (Name External > Name Internal)
17. Error Analysis - Other Cases
- 2 cases of extended patterns: a pattern is built as another (common) pattern plus an additional tag. WD & HO Wills Holdings Limited: Initial Initial & Initial Initial FamilyName CompDesig (Name Internal) vs Initial Initial & Initial Initial FamilyName (Right-Copy)
- A conjunction of a person name and a company name, e.g. Wayne Jones and Topsfield Pty Ltd: ambiguous even for humans without contextual information
- A conjunction of two person names: in our domain there is only one case where this is of the name external type
- There are around 20 examples where it is difficult to judge the reason for misclassification; perhaps the reason is the model we have built
- Influence of k-fold evaluation: different classifications for the same pattern in different folds
18. Conclusions
- Distinguished 4 categories of conjunctions in NEs
- Presented the problem as one of classification
- Experiment with machine-learned classifiers
- Results: F = 0.833
- Simple tag set used
- Some examples are truly ambiguous even for humans
19. Further Work
- Multiple conjunctions
- Human supervised N-gram based preprocessing
- Abbreviation preprocessing
- Limit the number of InitCapped tags
- Take into account the syntactic number of tokens
- Use contextual information (e.g. syntactic number of associated verb)
- Extend the evaluation data
- Evaluation with full named entity recognition process