Title: Named Entity Recognition based on three different machine learning techniques
1. Named Entity Recognition based on three different machine learning techniques
- Zornitsa Kozareva
- zkozareva_at_dlsi.ua.es
- JRC Workshop
- September 27, 2005
Research Group on Language Processing and Information Systems (GPLSI)
2. Outline
- Named Entity Recognition
- task definition
- applications
- Machine learning approach
- Classifier combination
- Feature description and experimental evaluation
- for NE detection
- for NE classification
- NERUA at GeoCLEF
- Conclusions and future work
3. Named Entity Recognition: task definition

Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O

- Identification of proper names in text, using the BIO scheme
  - B starts an entity
  - I continues the entity
  - O marks words outside any entity
- Classification into a predefined set of categories
  - Person names
  - Organizations (companies, governmental organizations, etc.)
  - Locations (cities, countries, etc.)
  - Miscellaneous (movie titles, sport events, etc.)
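The BIO scheme above can be decoded back into entity spans; a minimal sketch (illustrative code, not the paper's implementation), using the slide's own example sentence:

```python
def decode_bio(tokens, tags):
    """Return (entity_text, category) spans from parallel token/tag lists."""
    entities, current, category = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # B- opens a new entity
            if current:
                entities.append((" ".join(current), category))
            current, category = [token], tag[2:]
        elif tag.startswith("I-") and current:   # I- continues the open entity
            current.append(token)
        else:                                    # O closes any open entity
            if current:
                entities.append((" ".join(current), category))
            current, category = [], None
    if current:
        entities.append((" ".join(current), category))
    return entities

print(decode_bio(
    ["Adam", "Smith", "works", "for", "IBM", ",", "London", "."],
    ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]))
# → [('Adam Smith', 'PER'), ('IBM', 'ORG'), ('London', 'LOC')]
```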
4. Named Entity Recognition: applications
- Information Extraction
- Question Answering
- Document classification
- Automatic indexing of books
- Increasing the accuracy of Internet search results (the location Clinton, South Carolina vs. President Clinton)
5. Outline
- Named Entity Recognition
- task definition
- applications
- Machine learning approach
- Classifier combination
- Feature description and experimental evaluation
- for NE detection
- for NE classification
- NERUA at GeoCLEF
- Conclusions and future work
6. Machine learning approach
- Given
- NER task
- tagged corpus
- Select classification methods
- Memory-based learning
- Maximum Entropy
- Hidden Markov Models
- Construct set of characteristics
- detection phase
- classification phase
7. NERUA architecture
- The input text first passes through the detection phase, where the outputs of HMM and TiMBL are combined by voting
- The detected entities then enter the classification phase, where the outputs of HMM, MaxEnt and TiMBL are combined by voting
- The output is the NE-tagged text

NERUA: a system for detecting and classifying named entities using machine learning, Ferrández et al.
8. Classification method 1
- Memory-based learning (k-nearest neighbours)
- toolkit
- TiMBL package
- time performance
- quick training phase
- slow during testing
- features
- various types of features
- irrelevant features impede performance
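Memory-based learning as implemented in TiMBL is essentially k-nearest neighbours over symbolic features with the overlap distance (the default metric of its IB1 algorithm). A minimal sketch; the toy feature vectors and the value of k are illustrative assumptions, not the paper's settings:

```python
from collections import Counter

def overlap_distance(a, b):
    """Count mismatching feature values between two instances."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(train, instance, k=3):
    """train: list of (features, label); return majority label of k nearest."""
    nearest = sorted(train, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy training memory: (capitalized?, previous word, next word) -> B/I/O tag
train = [
    ((True,  "for", ","),    "B"),
    ((True,  "Mr.", "said"), "B"),
    ((True,  "New", "City"), "I"),
    ((False, "the", "of"),   "O"),
    ((False, "a",   "in"),   "O"),
]
print(knn_classify(train, (True, "for", "said"), k=3))
# → B
```

Because all training instances are simply stored, training is fast but every test instance must be compared against the whole memory, which matches the slow-testing behaviour noted above.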
9. Classification method 2
- Maximum Entropy
- toolkit
- MaxEnt
- time performance
- slow training phase
- slow testing phase
- feature management
- string, missing values
10. Classification method 3
- Hidden Markov Models
- toolkit
- ICOPOST
- time performance
- quick training phase
- quick testing phase
- feature management
- cannot handle as many features as the other two methods
- needs corpus or label transformations
11. Outline
- Named Entity Recognition
- task definition
- applications
- Machine learning approach
- Classifier combination
- Feature description and experimental evaluation
- for NE detection
- for NE classification
- NERUA at GeoCLEF
- Conclusions and future work
12. Classifier combination
- Majority voting
  - give each classifier one vote

CL 1   CL 2   CL 3   | Vote
PER    PER    PER    | PER
ORG    LOC    ORG    | ORG
PER    LOC    LOC    | LOC
PER    ORG    MISC   | (three-way tie)
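The voting scheme above can be sketched as follows (a minimal illustration; the tie-breaking rule, falling back to the earliest classifier's label, is an assumption, since the slide does not specify one):

```python
from collections import Counter

def majority_vote(labels):
    """Each classifier casts one vote; the most frequent label wins.
    Ties fall back to the first classifier's label (assumed rule)."""
    counts = Counter(labels)
    best = max(counts.values())
    return next(label for label in labels if counts[label] == best)

# The rows from the slide's example (one label per classifier):
for row in [["PER", "PER", "PER"], ["ORG", "LOC", "ORG"], ["PER", "LOC", "LOC"]]:
    print(row, "->", majority_vote(row))
```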
13. Outline
- Named Entity Recognition
- task definition
- applications
- Machine learning approach
- Classifier combination
- Feature description and experimental evaluation
- for NE detection
- for NE classification
- NERUA at GeoCLEF
- Conclusions and future work
14. Features for NE detection
- Contextual
  - anchor word (i.e. the word to be classified)
  - words in a [-3, +3] window
- Orthographic
  - capitalization at positions -3, ..., 0, ..., +3
  - whole anchor word in capitals (e.g. IBM)
  - position of the anchor word in the sentence
- Substring extraction
  - 2- and 3-letter substrings from the left and right sides of the anchor word
- Gazetteer list
  - word at position 0, 1, 2, 3 seen in the list
- Trigger word list
  - word at position -3, ..., 0, ..., +3 seen in the list

Using Language Resource Independent Detection for Spanish NER, Kozareva et al., RANLP 2005
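The detection feature set can be sketched as a feature-extraction function; the feature names and the exact encoding are illustrative assumptions, not the paper's representation:

```python
def detection_features(tokens, i, window=3):
    """Build a feature dict for the anchor word tokens[i]."""
    feats = {}
    for off in range(-window, window + 1):       # contextual [-3, +3] window
        j = i + off
        word = tokens[j] if 0 <= j < len(tokens) else "_PAD_"
        feats[f"word[{off:+d}]"] = word.lower()
        feats[f"cap[{off:+d}]"] = word[:1].isupper()  # capitalization per position
    anchor = tokens[i]
    feats["all_caps"] = anchor.isupper()          # whole word in capitals, e.g. IBM
    feats["position"] = i                         # position in the sentence
    # 2- and 3-letter substrings from both sides of the anchor word
    feats["prefix2"], feats["prefix3"] = anchor[:2], anchor[:3]
    feats["suffix2"], feats["suffix3"] = anchor[-2:], anchor[-3:]
    return feats

feats = detection_features(["Adam", "Smith", "works", "for", "IBM"], 4)
print(feats["word[+0]"], feats["all_caps"], feats["prefix2"])
# → ibm True IB
```

Gazetteer and trigger-word lookups would add further boolean features of the same shape, keyed on list membership at each window position.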
15. Results for NE detection

Data          Train    Test
Sp tokens     264715   51533
Sp entities   18794    3558
Pt tokens     68597    22624
Pt entities   3094     1013

Spanish          B      I      BIO
TMB-ALL         94.81  86.45  92.56
TMB-CO (1)      94.62  86.14  92.34
TMB-COS (2)     94.72  86.48  92.51
HMM (3)         93.19  82.33  90.29
Voting (1,2,3)  95.07  87.17  92.96

Portuguese       B      I      BIO
TMB-CO          82.91  68.53  78.41
TMB-COS         81.65  63.80  76.20
HMM             72.93  59.81  68.53
Voting          83.32  69.09  78.86
16. Outline
- Named Entity Recognition
- task definition
- applications
- Machine learning approach
- Classifier combination
- Feature description and experimental evaluation
- for NE detection
- for NE classification
- NERUA at GeoCLEF
- Conclusions and future work
17. Features for NE classification
- Contextual
- whole entity
- first word of the entity
- second word of the entity if present
- words around the entity in a [-3, +3] window
- Orthographic
- position of anchor word in a sentence
- capital, lowercase or other symbol
- Gazetteer list
- part of entity in the list
- whole entity in the list
- whole entity is not in any of these lists
- Trigger lists
- anchor word
- words in a [-1, +1] window
18. Results for NE classification

Classification   LOC    MISC   ORG    PER
MxE24 (1)       77.81  57.49  78.83  85.41
TMB24           75.49  53.19  77.44  83.89
MxE25           78.27  58.22  78.64  85.60
TMB25 (2)       75.15  52.94  77.79  85.36
HMM (3)         71.15  45.69  72.95  70.20
Voting (1,2,3)  78.46  57.00  78.93  86.52

F-score for Spanish classification
19. Outline
- Named Entity Recognition: task definition, applications
- Machine learning approach
- Classifier combination
- Feature description and experimental evaluation
- for NE detection
- for NE classification
- NERUA at GeoCLEF
- Conclusions and future work
20. NERUA at GeoCLEF

Language          Run           Result
English           IRnNERUA      34.95
English           IRnDramneri   29.77
Spanish-English   IRnNERUA      26.06
Spanish-English   IRnDramneri   23.65

- The English runs directly used the feature sets constructed for Spanish
- NERUA outperformed the rule-based system Dramneri, although both consulted the same gazetteer and trigger word lists
- NERUA took more processing time

University of Alicante at GeoCLEF 2005, Ferrández et al., CLEF 2005
21. Conclusions and future work
- We found a language-resource-independent feature set for NE detection
  - 92.96 (BIO F-score) for Spanish entities
  - 78.86 (BIO F-score) for Portuguese entities
- Classifier combination improved NE classification
- Good coverage of the PER, LOC and ORG classes is maintained
- Machine learning systems may outperform rule-based systems; however, they need more processing time and hand-labeled resources, which are not available for all languages
22. Future work
- Find discriminative features for the MISC class
- Tackle NER using unlabeled data
- Divide the four categories into more fine-grained ones
- Adapt the system to other languages
- Study ways of automatically constructing gazetteers
23. Thank you for your attention! Questions?