A Supervised Machine Learning Approach to Conjunction Disambiguation in Named Entities

1
A Supervised Machine Learning Approach to
Conjunction Disambiguation in Named Entities
  • Pawel Mazur (University of Technology, Wroclaw,
    Poland), Pawel.Mazur_at_pwr.wroc.pl
  • and
  • Robert Dale (Macquarie University, Sydney,
    Australia), Robert.Dale_at_mq.edu.au

2
Agenda
  • Conjunction in Named Entities
  • Our approach
  • Experiments
  • Results of the experiment
  • Results analysis
  • Conclusions
  • Further work

3
Conjunction in Candidate Named Entity String
  • Fujitsu Australia and New Zealand
  • Australia and New Zealand Banking Group Limited
  • Peter Smith and Ann Arbor Software Council
  • Candidate named entity string:
  • Any sequence of words starting with initial
    capitals
  • A single instance of the word "and" or the "&" form
    of conjunction
  • In 45 documents out of 13460, 5.7% of candidate
    named entity strings contained a conjunction; in
    some documents the frequency is as high as 23%;
    in MUC-7 it is 4.5%
  • Many candidate named entity strings in this
    domain contain company names and person names

4
Our Approach - A Classification Task
  • We distinguish 4 categories of conjunction in a
    candidate NE string:
  • Category A, Name Internal Conjunction: Copper
    Mines and Metals Limited; Herbert P Cooper & Son;
    Ernst and Young
  • Category B, Name External Conjunction: Proxy Form
    and Explanatory Memorandum; Hardware & Operating
    Systems; EchoStar and News Corporation
  • Category C, Right-Copy Separator: William and Alma
    Ford; Connel and Bent Streets; Eastern and
    Western Australia
  • Category D, Left-Copy Separator: Hospital
    Equipment & Systems; J H Blair Company Secretary
    & Corporate Counsel

The most common
Could be seen as one linguistic category
5
Our Approach - Candidate NE String Pattern
  • String: Australia and New Zealand Banking
    Group Limited
    Pattern: (Loc and Loc Org CompDesig)
  • String: Peter Smith and Ann Arbor Software
    Council
    Pattern: (GivenName FamilyName and GivenName
    FamilyName Noun Org)
  • Patterns are created using gazetteers and simple
    keyword-based heuristics.
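
The pattern-building step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the gazetteer entries, the greedy longest-match strategy, and the function name `tag_pattern` are all assumptions made for the example, and the fallback tag for unknown capitalised words is taken to be InitCapped.

```python
# Hypothetical miniature gazetteer; the real lists are not given on the slides.
GAZETTEER = {
    ("australia",): "Loc",
    ("new", "zealand"): "Loc",
    ("banking", "group"): "Org",
    ("limited",): "CompDesig",
    ("peter",): "GivenName",
    ("ann",): "GivenName",
    ("smith",): "FamilyName",
    ("council",): "Org",
}
CONJUNCTIONS = {"and", "&"}
MAX_ENTRY_LEN = max(len(key) for key in GAZETTEER)


def tag_pattern(candidate: str) -> list[str]:
    """Map a candidate NE string to a tag pattern by greedy longest match."""
    tokens = candidate.split()
    pattern = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in CONJUNCTIONS:
            pattern.append(tokens[i].lower())  # keep the conjunction itself
            i += 1
            continue
        # Try the longest gazetteer entry first, then shorter ones.
        for length in range(min(MAX_ENTRY_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + length])
            if key in GAZETTEER:
                pattern.append(GAZETTEER[key])
                i += length
                break
        else:
            # No gazetteer entry matched: fall back to the generic tag.
            pattern.append("InitCapped")
            i += 1
    return pattern


print(" ".join(tag_pattern("Australia and New Zealand Banking Group Limited")))
```

With this toy gazetteer the first example string maps to the pattern shown on the slide. Words missing from the gazetteer simply receive the fallback tag.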

6
Tag Set
  • PersDesig: Mr, Mrs, Ms, Miss, Dr, Prof, Sir,
    Madam, Messrs, and Jnr.
  • CompDesig: Ltd, Limited, Pty Ltd, GmbH, plc and
    many more, including Investments Pty Ltd,
    Management Pty Ltd, Corporate Pty Ltd,
    Associates Pty Ltd, Family Trust, Co Limited,
    Partners, Partners Limited, Capital Limited, and
    Capital Pty Ltd.
  • CompPos: Director, Secretary, Manager, Counsel,
    Managing Director, Member, Chairman, Chief
    Executive, Chief Executive Officer, and CEO, and
    also some bodies within organizations, such as
    Board and Committee.
  • Tag frequencies:

Tag         Count      %
InitCapped    925  42.24
Loc           245  11.19
Org           175   7.99
FamilyName    164   7.49
CompDesig     138   6.30
Initial       108   4.93
CompPos        99   4.52
GivenName      89   4.06
Of             76   3.47
Abbrev         73   3.33
PersDesig      39   1.78
Det            31   1.42
Dir            12   0.55
Son             7   0.32
Month           6   0.27
AlphaNum        3   0.14

7
Data Encoding
  • Each instance is encoded with 33 attributes:
  • 1 binary attribute for each tag for each conjunct,
    signaling its presence in the
    string (2 x 16 = 32 attributes in total)
  • 1 binary attribute, ConjType, encoding the lexical
    form of the conjunction in the string (0 for &, 1
    for and)
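
A minimal sketch of this encoding, assuming the 16 tags are those listed on the Tag Set slide. The attribute order, the function name `encode`, and the representation of a pattern as a list of tags with the conjunction in place are assumptions made for illustration.

```python
# The 16 tags from the Tag Set slide; the attribute order is an assumption.
TAGS = ["InitCapped", "Loc", "Org", "FamilyName", "CompDesig", "Initial",
        "CompPos", "GivenName", "Of", "Abbrev", "PersDesig", "Det", "Dir",
        "Son", "Month", "AlphaNum"]
CONJUNCTIONS = ("&", "and")  # the index in this tuple is the ConjType value


def encode(pattern: list[str]) -> list[int]:
    """Encode a tag pattern as 2 x 16 presence bits plus one ConjType bit."""
    positions = [i for i, item in enumerate(pattern) if item in CONJUNCTIONS]
    if len(positions) != 1:
        raise ValueError("expected exactly one conjunction in the pattern")
    split = positions[0]
    left, right = set(pattern[:split]), set(pattern[split + 1:])
    left_bits = [int(tag in left) for tag in TAGS]    # left conjunct
    right_bits = [int(tag in right) for tag in TAGS]  # right conjunct
    conj_type = CONJUNCTIONS.index(pattern[split])    # 0 for "&", 1 for "and"
    return left_bits + right_bits + [conj_type]


vector = encode(["Loc", "and", "Loc", "Org", "CompDesig"])
print(len(vector), vector)
```

Because the attributes only record presence, the order and number of tags inside each conjunct are lost in this representation.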

8
Corpus & Data Sets
  • Corpus: 13460 text documents, from 8 to 1000
    lines long
  • Our corpus is a subcorpus drawn from a collection
    of company announcements from the Australian
    Stock Exchange
  • Selection of candidate named entity strings:
    a sequence of initcapped words and a single
    conjunction (and or &), with optional of, a, an,
    the
  • This yielded a set of 10925 strings, 6437 of
    which are unique
  • Hand elimination of wrongly identified strings
    due to typographic features of the documents
    (tables)
  • Random selection of 600 examples from the unique
    set
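
The selection rule for candidate strings might look like the sketch below. It is an assumption-laden illustration rather than the extraction code used for the corpus: the tokenisation, the names `candidate_strings` and `LINKERS`, and the decision to discard runs with more than one conjunction are all choices made here.

```python
import re

FILLERS = {"of", "a", "an", "the"}
CONJUNCTIONS = {"and", "&"}
LINKERS = FILLERS | CONJUNCTIONS  # lower-case items allowed inside a run


def is_initcapped(token: str) -> bool:
    return token[:1].isupper()


def candidate_strings(text: str) -> list[str]:
    """Return runs of initcapped words that contain exactly one conjunction."""
    # Words (including "&") become tokens; punctuation marks become their own
    # tokens so that they terminate a run.
    tokens = re.findall(r"[A-Za-z0-9&']+|[^\sA-Za-z0-9&']", text)
    candidates, run = [], []
    for token in tokens + [""]:  # the empty sentinel flushes the final run
        if is_initcapped(token) or (run and token in LINKERS):
            run.append(token)
            continue
        # The run has ended: drop trailing fillers or conjunctions.
        while run and not is_initcapped(run[-1]):
            run.pop()
        if sum(t in CONJUNCTIONS for t in run) == 1:
            candidates.append(" ".join(run))
        run = []
    return candidates


print(candidate_strings(
    "The board of Fujitsu Australia and New Zealand met Peter Smith in Sydney."
))
```

A rule this simple also picks up sentence-initial capitals and table fragments, which is consistent with the hand elimination step listed above.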

Category       Count      %
Name Internal    185   30.8
Name External    350   58.3
Right-Copy        39    6.5
Left-Copy         26    4.3
Sum              600  100
9
Machine-learned Classifiers
  • Naïve Bayes
  • Multilayer Perceptron
  • IBk
  • K*
  • Random Tree
  • Logistic Model Trees (LMT)
  • J4.8
  • SMO
  • Implementations in WEKA (Waikato Environment for
    Knowledge Analysis), University of Waikato in New
    Zealand

10
Baseline
  • Determined with the 0-R algorithm, which always
    assigns the most common category (Name External):
    58.33%
  • A better baseline is given by the 1-R algorithm:
    IF ConjForm = & THEN PredCat ← Internal
    IF ConjForm = and THEN PredCat ← External
    Baseline: 69.83%
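
The two baselines can be sketched as below. This is not WEKA's code: the function names are hypothetical, the 1-R rule is built on a single attribute only (the conjunction form, which is the attribute the slide's rule uses), and the eight toy examples are invented for illustration.

```python
from collections import Counter, defaultdict


def zero_r(labels):
    """0-R: always predict the most frequent category."""
    return Counter(labels).most_common(1)[0][0]


def one_r(values, labels):
    """1-R on one attribute: map each value to its majority category."""
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1
    return {value: counts.most_common(1)[0][0]
            for value, counts in by_value.items()}


def accuracy(predicted, labels):
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


# Toy data, invented for illustration only (not the 600 corpus examples).
conj_form = ["&", "&", "&", "and", "and", "and", "and", "and"]
category = ["Internal", "Internal", "External", "External", "External",
            "External", "Internal", "Right-Copy"]

majority = zero_r(category)
rule = one_r(conj_form, category)
zero_r_acc = accuracy([majority] * len(category), category)
one_r_acc = accuracy([rule[v] for v in conj_form], category)
print(majority, zero_r_acc)
print(rule, one_r_acc)
```

On the real data the same two procedures correspond to 350 of 600 (58.33%) and 419 of 600 (69.83%) correctly classified examples.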

11
Results
Algorithm                 Correctly classified
IBk                       504 (84.00%)
Random Tree               503 (83.83%)
K*                        501 (83.50%)
SMO - quadratic kernel    494 (82.33%)
Mult. Perceptron          493 (82.17%)
LMT                       487 (81.17%)
J4.8                      477 (79.50%)
SMO - linear kernel       468 (78.00%)
Naïve Bayes               424 (70.67%)
Baseline                  419 (69.83%)
12
Accuracy by Conjunction Category
Category         Precision  Recall  F-Measure
Name Internal        0.814   0.876      0.844
Name External        0.872   0.897      0.885
Right-Copy           0.615   0.410      0.492
Left-Copy            0.800   0.462      0.585

Weighted mean        0.834   0.840      0.833
13
Confusion Matrix
Rows: category assigned by the classifier. Columns: actual category.

Classified as    Name Internal  Name External  Right-Copy  Left-Copy
Name Internal              162             28           6          3
Name External               18            314          17         11
Right-Copy                   4              6          16          0
Left-Copy                    1              2           0         12
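
The per-category figures on the previous slide can be recomputed from this matrix. The sketch below assumes that rows hold the assigned category and columns the actual category. That reading is inferred from the column totals (185, 350, 39, 26), which match the category distribution of the 600 examples. The function name is hypothetical.

```python
LABELS = ["Name Internal", "Name External", "Right-Copy", "Left-Copy"]
# Rows: category assigned by the classifier; columns: actual category.
MATRIX = [
    [162, 28, 6, 3],
    [18, 314, 17, 11],
    [4, 6, 16, 0],
    [1, 2, 0, 12],
]


def per_class_scores(matrix):
    """Return (precision, recall, F-measure, actual count) per category."""
    scores = []
    for k in range(len(matrix)):
        true_pos = matrix[k][k]
        predicted = sum(matrix[k])              # row total
        actual = sum(row[k] for row in matrix)  # column total
        precision = true_pos / predicted
        recall = true_pos / actual
        f_measure = 2 * precision * recall / (precision + recall)
        scores.append((precision, recall, f_measure, actual))
    return scores


scores = per_class_scores(MATRIX)
total = sum(n for _, _, _, n in scores)
for label, (p, r, f, n) in zip(LABELS, scores):
    print(f"{label:14s} {p:.3f} {r:.3f} {f:.3f}")
# Weight each category's F-measure by its number of actual examples.
weighted_f = sum(f * n for _, _, f, n in scores) / total
print(f"weighted F-measure {weighted_f:.3f}")
```

The diagonal sums to 504, which is the number of correctly classified examples reported for IBk.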
14
Results Analysis: Conjunction Category Indicators
  • For Name External conjunction:
  • Month X
  • X Month
  • CompDesig X
  • X PersDesig
  • X GivenName
  • X Dir
  • X Det
  • Abbrev X
  • X Abbrev
  • For Name Internal conjunction:
  • X Son (note: Sons of Gwalia Ltd and Gwalia
    Consolidated Ltd)

15
Error Analysis: InitCapped
  • 38 of all 96 misclassified examples (40%) have
    patterns consisting of InitCapped tags only
  • In these cases the classification ended up being
    determined on the basis of the ConjForm attribute
    (just as the baseline was determined).
  • There were 134 InitCapped-only patterns in the
    data set
  • 96 of them (71.64%) were classified correctly
    (comparable to the overall baseline result of
    69.83%).
  • There were also 11 misclassified examples
    consisting mainly of InitCapped tags.
    Example: Australian Labor Party and Independent
    Members
    Pattern: (Loc InitCapped Org and InitCapped
    InitCapped)

16
Error Analysis: Long Patterns
  • In 2 cases the misclassification was due to the
    long patterns of the examples
  • Fellow of the Australian Institute of
    Geoscientists and The Australasian Institute of
    Mining
  • CompPos Of Det Loc Org Of InitCapped and Det Loc
    Org Of InitCapped
  • (Left-Copy > Name Internal)
  • Fellow of the Royal College of Pathologists of
    Australasia and Chairman of Scientific Services
    Limited
  • Pos Of Deter Org Of InitCapped Of Loc and Pos Of
    InitCapped InitCapped Desig
  • (Name External > Name Internal)

17
Error Analysis: Other Cases
  • 2 cases of extended patterns: a pattern is built
    as another (common) pattern plus an additional tag
    Example: WD & HO Wills Holdings Limited
    Initial Initial & Initial Initial FamilyName
    CompDesig (Name Internal) vs. Initial Initial &
    Initial Initial FamilyName (Right-Copy)
  • A conjunction of a person name and a company name:
    Wayne Jones and Topsfield Pty Ltd
  • ambiguous even for humans without contextual
    information
  • A conjunction of two person names: in our domain
    there is only one case where this is of the Name
    External type
  • There are around 20 examples where it is
    difficult to judge the reason for the
    misclassification; perhaps the reason is the
    model we have built
  • Influence of k-fold evaluation: different
    classification for the same pattern in different
    folds

18
Conclusions
  • Distinguished 4 categories of conjunctions in NEs
  • Presented the problem as one of classification
  • Experiment with machine-learned classifiers
  • Results: F = 0.833
  • Simple tag set used
  • Some examples are truly ambiguous even for humans

19
Further Work
  • Multiple conjunctions
  • Human supervised N-gram based preprocessing
  • Abbreviation preprocessing
  • Limit the number of InitCapped tags
  • Take into account the syntactic number of tokens
  • Use contextual information (e.g. syntactic number
    of associated verb)
  • Extend the evaluation data
  • Evaluation with full named entity recognition
    process