Named Entity Recognition based on three different machine learning techniques
1
Named Entity Recognition based on three different
machine learning techniques
  • Zornitsa Kozareva
  • zkozareva@dlsi.ua.es
  • JRC Workshop
  • September 27, 2005

Research Group on Language Processing and
Information Systems
GPLSI
2
Outline
  • Named Entity Recognition
  • task definition
  • applications
  • Machine learning approach
  • Classifier combination
  • Feature description and experimental evaluation
  • for NE detection
  • for NE classification
  • NERUA at GeoCLEF
  • Conclusions and future work

3
Named Entity Recognition task definition
Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O
  • Identification of proper names in text, using
    the BIO scheme (a decoding sketch follows below)
  • B starts an entity
  • I continues the entity
  • O marks words outside any entity
  • Classification into a predefined set of
    categories
  • Person names
  • Organizations (companies, governmental
    organizations, etc.)
  • Locations (cities, countries, etc.)
  • Miscellaneous (movie titles, sport events, etc.)
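
A minimal decoding sketch (not from the original work) showing how a
BIO tag sequence is turned back into typed entity spans; the function
name and simplifications (e.g. treating a stray I- tag after O as O)
are illustrative assumptions.

  def bio_to_entities(tokens, tags):
      """Collect (entity_text, type) pairs from parallel token/tag lists."""
      entities, current, etype = [], [], None
      for token, tag in zip(tokens, tags):
          if tag.startswith("B-"):                 # B- opens a new entity
              if current:
                  entities.append((" ".join(current), etype))
              current, etype = [token], tag[2:]
          elif tag.startswith("I-") and current:   # I- extends the open one
              current.append(token)
          else:                                    # O (or stray I-): close it
              if current:
                  entities.append((" ".join(current), etype))
              current, etype = [], None
      if current:
          entities.append((" ".join(current), etype))
      return entities

  tokens = ["Adam", "Smith", "works", "for", "IBM", ",", "London", "."]
  tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]
  print(bio_to_entities(tokens, tags))
  # [('Adam Smith', 'PER'), ('IBM', 'ORG'), ('London', 'LOC')]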

4
Named Entity Recognition applications
  • Information Extraction
  • Question Answering
  • Document classification
  • Automatic indexing of books
  • Increase the accuracy of Internet search
    results (e.g. the location Clinton, South
    Carolina vs. President Clinton)

5
Outline
  • Named Entity Recognition
  • task definition
  • applications
  • Machine learning approach
  • Classifier combination
  • Feature description and experimental evaluation
  • for NE detection
  • for NE classification
  • NERUA at GeoCLEF
  • Conclusions and future work

6
Machine learning approach
  • Given
  • NER task
  • tagged corpus
  • Select classification methods
  • Memory-based learning
  • Maximum Entropy
  • Hidden Markov Models
  • Construct feature sets
  • for the detection phase
  • for the classification phase

7

[System architecture figure: the input text goes through a detection
stage (TiMBL and HMM combined by voting) and then a classification
stage (TiMBL, MxE and HMM combined by a second vote), producing the
NER-annotated text.]
NERUA: a system for named entity detection and classification using
machine learning, Ferrández et al.
8
Classification method 1
  • Memory-based learning (k-nearest neighbours)
  • toolkit
  • TiMBL package
  • time performance
  • quick training phase
  • slow testing phase
  • features
  • handles various feature types
  • irrelevant features impede performance
    (a toy k-NN sketch follows below)
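
As a toy illustration (a hedged sketch, not TiMBL itself): memory-based
learning stores all training examples and classifies by the k nearest
ones. TiMBL's default IB1 algorithm uses an overlap metric like the one
below, plus feature weighting that this sketch omits. Note how training
is trivial while testing scans the whole memory, which matches the time
profile above.

  from collections import Counter

  def overlap_distance(a, b):
      # Overlap metric: count mismatching symbolic feature values.
      return sum(x != y for x, y in zip(a, b))

  def knn_classify(memory, query, k=3):
      # "Training" is just storing (features, label) pairs in memory;
      # testing scans all of it, hence fast training but slow testing.
      nearest = sorted(memory, key=lambda ex: overlap_distance(ex[0], query))[:k]
      return Counter(label for _, label in nearest).most_common(1)[0][0]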

9
Classification method 2
  • Maximum Entropy
  • toolkit
  • MaxEnt
  • time performance
  • slow training phase
  • slow testing phase
  • feature management
  • handles string-valued and missing feature
    values (a training sketch follows below)
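
A maximum entropy classifier is equivalent to multinomial logistic
regression. The sketch below is an assumption about the general
technique, not the MaxEnt toolkit's algorithm: it trains weights for
binary indicator features by plain stochastic gradient ascent, whereas
real toolkits use GIS or L-BFGS.

  import math

  def train_maxent(data, classes, feats, epochs=50, lr=0.1):
      # data: list of (active_feature_set, label); feats: full inventory.
      w = {(c, f): 0.0 for c in classes for f in feats}
      for _ in range(epochs):
          for x, y in data:
              # Conditional distribution p(c | x) under the current weights.
              scores = {c: math.exp(sum(w[(c, f)] for f in x)) for c in classes}
              z = sum(scores.values())
              for c in classes:
                  p = scores[c] / z
                  for f in x:
                      # Gradient of the log-likelihood: observed - expected.
                      w[(c, f)] += lr * ((c == y) - p)
      return w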

10
Classification method 3
  • Hidden Markov Models
  • toolkit
  • ICOPOST
  • time performance
  • quick training phase
  • quick testing phase
  • feature management
  • cannot handle as many features as the other two
    methods
  • needs corpus or label transformation
    (a decoding sketch follows below)
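
For illustration, an HMM tagger decodes the most likely label sequence
with the Viterbi algorithm. This compact sketch shows the general
technique, not ICOPOST's implementation; the probability tables would
be estimated from the tagged corpus, and the unseen-word floor is an
assumed crude smoothing.

  def viterbi(words, states, start, trans, emit):
      # start[s], trans[prev][s], emit[s][word] are estimated probabilities.
      V = [{s: start[s] * emit[s].get(words[0], 1e-6) for s in states}]
      back = []
      for w in words[1:]:
          col, ptr = {}, {}
          for s in states:
              prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
              col[s] = V[-1][prev] * trans[prev][s] * emit[s].get(w, 1e-6)
              ptr[s] = prev
          V.append(col)
          back.append(ptr)
      # Follow the backpointers right-to-left to recover the best path.
      path = [max(states, key=lambda s: V[-1][s])]
      for ptr in reversed(back):
          path.append(ptr[path[-1]])
      return list(reversed(path))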

11
Outline
  • Named Entity Recognition
  • task definition
  • applications
  • Machine learning approach
  • Classifier combination
  • Feature description and experimental evaluation
  • for NE detection
  • for NE classification
  • NERUA at GeoCLEF
  • Conclusions and future work

12
Classifier combination
  • Majority voting
  • give each classifier one vote (a sketch follows
    after the table)

CL 1   CL 2   CL 3   Vote
PER    PER    PER    PER
ORG    LOC    ORG    ORG
PER    LOC    LOC    LOC
PER    ORG    MISC   no majority (tie-break needed)
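
A minimal sketch of the one-vote-per-classifier scheme; the tie-break
(falling back to the first, presumably strongest, classifier) is an
assumption, since the slide leaves the no-majority case open.

  from collections import Counter

  def majority_vote(predictions):
      # predictions: one label per classifier, ordered best-first.
      label, votes = Counter(predictions).most_common(1)[0]
      # No majority (e.g. PER/ORG/MISC): assumed fallback to classifier 1.
      return label if votes > 1 else predictions[0]

  print(majority_vote(["ORG", "LOC", "ORG"]))   # ORG
  print(majority_vote(["PER", "ORG", "MISC"]))  # PER (tie-break)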

13
Outline
  • Named Entity Recognition
  • task definition
  • applications
  • Machine learning approach
  • Classifier combination
  • Feature description and experimental evaluation
  • for NE detection
  • for NE classification
  • NERUA at GeoCLEF
  • Conclusions and future work

14
Features for NE detection
  • Contextual
  • anchor word (i.e. the word to be classified)
  • words in a [-3, +3] window
  • Orthographic
  • capitalization at positions -3,...,+3
  • whole anchor word in capitals (e.g. IBM)
  • position of anchor word in the sentence
  • Substring extraction
  • 2- and 3-letter prefixes and suffixes of the
    anchor word
  • Gazetteer list
  • word at positions 0,...,+3 seen in the list
  • Trigger word list
  • word at positions -3,...,+3 seen in the list
    (a feature-extraction sketch follows below)
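
Putting the list above together, a hedged sketch of the detection
feature vector; the function name, padding token, and dictionary layout
are illustrative assumptions.

  def detection_features(sent, i, gazetteer, triggers):
      feats = {}
      for off in range(-3, 4):                      # [-3, +3] context window
          j = i + off
          w = sent[j] if 0 <= j < len(sent) else "<pad>"
          feats[f"w[{off}]"] = w.lower()            # contextual
          feats[f"cap[{off}]"] = w[:1].isupper()    # orthographic
          feats[f"trig[{off}]"] = w.lower() in triggers
      w = sent[i]
      feats["allcaps"] = w.isupper()                # whole word in capitals
      feats["position"] = i                         # position in the sentence
      feats["pre2"], feats["pre3"] = w[:2], w[:3]   # substring extraction
      feats["suf2"], feats["suf3"] = w[-2:], w[-3:]
      for off in range(0, 4):                       # gazetteer lookups, 0..+3
          j = i + off
          feats[f"gaz[{off}]"] = 0 <= j < len(sent) and sent[j] in gazetteer
      return feats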

Using Language Resource Independent Detection for
Spanish NER, Kozareva et al., RANLP05
15
Results for NE detection
Data   Size       Train     Test
Sp     tokens     264,715   51,533
Sp     entities   18,794    3,558
Pt     tokens     68,597    22,624
Pt     entities   3,094     1,013

Spanish (F-score)   B       I       BIO
TMB-ALL             94.81   86.45   92.56
TMB-CO (1)          94.62   86.14   92.34
TMB-COS (2)         94.72   86.48   92.51
HMM (3)             93.19   82.33   90.29
Voting (1,2,3)      95.07   87.17   92.96

Portuguese (F-score)   B       I       BIO
TMB-CO                 82.91   68.53   78.41
TMB-COS                81.65   63.80   76.20
HMM                    72.93   59.81   68.53
Voting                 83.32   69.09   78.86
16
Outline
  • Named Entity Recognition
  • task definition
  • applications
  • Machine learning approach
  • Classifier combination
  • Feature description and experimental evaluation
  • for NE detection
  • for NE classification
  • NERUA at GeoCLEF
  • Conclusions and future work

17
Features for NE classification
  • Contextual
  • whole entity
  • first word of the entity
  • second word of the entity, if present
  • words around the entity in a [-3, +3] window
  • Orthographic
  • position of the anchor word in the sentence
  • capital, lowercase, or other first symbol
  • Gazetteer list
  • part of the entity in the list
  • whole entity in the list
  • whole entity not in any of these lists
  • Trigger lists
  • anchor word
  • words in a [-1, +1] window
    (a feature-extraction sketch follows below)
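
A matching sketch for the entity-level classification features; again,
the helper names and data layout are assumptions, not the system's code.

  def classification_features(sent, start, end, gazetteers, triggers):
      # The entity span sent[start:end] has already been detected.
      entity = sent[start:end]
      feats = {"entity": " ".join(entity).lower(),       # whole entity
               "first": entity[0].lower(),               # first word
               "second": entity[1].lower() if len(entity) > 1 else "<none>",
               "position": start,                        # place in sentence
               "shape": ("cap" if entity[0][:1].isupper() else
                         "low" if entity[0][:1].islower() else "other")}
      feats["trig[0]"] = entity[0].lower() in triggers   # anchor word
      for off in list(range(-3, 0)) + list(range(1, 4)): # context window
          j = (start if off < 0 else end - 1) + off
          w = sent[j] if 0 <= j < len(sent) else "<pad>"
          feats[f"ctx[{off}]"] = w.lower()
          if -1 <= off <= 1:                             # trigger window
              feats[f"trig[{off}]"] = w.lower() in triggers
      for name, gaz in gazetteers.items():               # e.g. PER/LOC/ORG
          feats[f"gaz_{name}_whole"] = " ".join(entity) in gaz
          feats[f"gaz_{name}_part"] = any(w in gaz for w in entity)
      return feats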

18
Results for NE classification
Classification   LOC     MISC    ORG     PER
MxE24 (1)        77.81   57.49   78.83   85.41
TMB24            75.49   53.19   77.44   83.89
MxE25            78.27   58.22   78.64   85.60
TMB25 (2)        75.15   52.94   77.79   85.36
HMM (3)          71.15   45.69   72.95   70.20
Voting (1,2,3)   78.46   57.00   78.93   86.52
F-scores for Spanish classification
19
Outline
  • Named Entity Recognition task definition,
    applications
  • Machine learning approach
  • Classifier combination
  • Feature description and experimental evaluation
  • for NE detection
  • for NE classification
  • NERUA at GeoCLEF
  • Conclusions and future work

20
NERUA at GeoCLEF
Language          Run           Result
English           IRnNERUA      34.95
English           IRnDramneri   29.77
Spanish-English   IRnNERUA      26.06
Spanish-English   IRnDramneri   23.65
  • The English runs directly used the feature sets
    constructed for Spanish
  • NERUA outperformed the rule-based system
    Dramneri, although both consulted the same
    gazetteer and trigger word lists
  • NERUA took more processing time

University of Alicante at GeoCLEF 2005, Ferrández
et al., CLEF05
21
Conclusions and future work
  • We found a language-resource-independent feature
    set for NE detection
  • 92.96 F-score on Spanish entities
  • 78.86 F-score on Portuguese entities
  • Classifier combination improved NE
    classification
  • Good coverage of the PER, LOC and ORG classes is
    maintained
  • Machine learning systems may outperform
    rule-based systems; however, they need more
    processing time and hand-labeled resources,
    which are not available for all languages

22
Future work
  • Find discriminative features for MISC class
  • Tackle NER using unlabeled data
  • Divide the four categories into more detailed
    ones
  • Adapt the system for other languages
  • Study ways of automatic gazetteer construction

23
Thank you for your attention! Questions?
  • Named Entity Recognition based on three different
    machine learning techniques
  • Zornitsa Kozareva
  • zkozareva@dlsi.ua.es
  • JRC Workshop
  • September 27, 2005

Research Group on Language Processing and
Information Systems
GPLSI