Title: Named Entity Recognition
1Named Entity Recognition
2Introduction
- Presentation
- Motivation
- Contents
- Information Extraction
- Named Entity Recognition (NER)
- An experiment with NER
- Conclusions
3Information Extraction
- Automatic identification of selected types of
entities, relations or events in free text
(GRISHMAN, 2003)
- Related areas
- Information Retrieval, Knowledge Extraction
- IE x IR
4Information Extraction
- Applications
- Processing of natural language texts for the
extraction of relevant content pieces (MARTÍ AND
CASTELLÓN, 2000)
- Raw texts -> structured databases
- Templates filling
- Improving search engines
- Auxiliary tool for other language applications
5IE History
- Early projects
- Knowledge-based, rule-based
- FRUMP 1979
- Newswire
- LSP (Language String Project) 1981
- AMA American Medical Association
- Patient summaries
6IE History
- MUC Message Understanding Conferences (1987)
- DARPA, NRAD
- Standardization
- Evaluation
- Dissemination
- DARPA's TIPSTER Program: Document Detection,
Summarization and Information Extraction until
1998
- TREC (Text Retrieval Conferences)
7IE History
- MUC
- Evaluation standards (for the 1st time in MUC-2)
- Recall
- R = correct units / total units
- Precision
- P = correct units / units found
- F-Measure
- F = (β² + 1)PR / (β²P + R)
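The three MUC measures can be sketched in a few lines of Python. This is a minimal illustration, not the official MUC scorer: the function name and the counts in the example are hypothetical, and partial matches are ignored.

```python
def precision_recall_f(correct: int, found: int, total: int, beta: float = 1.0):
    """Return (precision, recall, F-measure) from unit counts.

    correct: units the system got right
    found:   units the system produced
    total:   units in the answer key
    beta:    relative weight of recall against precision (1.0 = balanced)
    """
    precision = correct / found if found else 0.0
    recall = correct / total if total else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_measure = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 60 correct units out of 80 found, 100 in the answer key.
p, r, f = precision_recall_f(correct=60, found=80, total=100)
print(f"P={p:.2f} R={r:.2f} F={f:.3f}")
```

With beta = 1 the F-measure reduces to the harmonic mean of precision and recall.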
8IE History
- MUC
- Template filling
- "Mr. John Smith was appointed CEO of ACME last
December 31."
Name: John Smith; Post: CEO; Company: ACME; Date:
December 31
- MUC-5 (1993)
- 47 slots divided among 11 different nested templates
- English and Japanese
9IE History
- MUC-6 (1995)
- Extraction of Named Entities
- names of persons, organizations, locations
- temporal expressions, currency and percentages
- Extraction of Template Elements
- grouping of entity attributes together into
entity objects
- Extraction of events (or Scenario Templates)
- Extraction of coreferences
10IE History
- MUC-6
- ENAMEX (entity name expression) tag
- persons, organizations and locations
- NUMEX (numeric expression) tag
- currency and percentages
- TIMEX (time expression) tag
- temporal expressions: dates and times
11IE History
- MUC-6
- Andrew Johnson was appointed last Sunday
president of ACME, the biggest company in Santa
Barbara, California, with an estimated 300
million market capacity.
- <ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX>
was appointed <TIMEX TYPE="DATE">last
Sunday</TIMEX> president of <ENAMEX
TYPE="ORGANIZATION">ACME</ENAMEX>, the biggest
company in <ENAMEX TYPE="LOCATION">Santa
Barbara</ENAMEX>, <ENAMEX
TYPE="LOCATION">California</ENAMEX>, with an
estimated <NUMEX TYPE="MONEY">300
million</NUMEX> market capacity.
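To make the tag inventory concrete, here is a small sketch that reads MUC-style inline annotations back into (tag, type, text) triples. It assumes the SGML form `<ENAMEX TYPE="PERSON">...</ENAMEX>` with quoted attribute values and no nested tags. The function names are illustrative and not part of any MUC tool.

```python
import re

# Matches one ENAMEX / NUMEX / TIMEX element; nesting is not handled.
TAG_RE = re.compile(r'<(ENAMEX|NUMEX|TIMEX)\s+TYPE="([A-Z]+)">(.*?)</\1>', re.DOTALL)


def extract_entities(annotated: str):
    """Return (tag, type, text) triples found in MUC-style annotated text."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG_RE.finditer(annotated)]


def strip_tags(annotated: str) -> str:
    """Recover the raw sentence by dropping the markup."""
    return TAG_RE.sub(lambda m: m.group(3), annotated)


sample = (
    '<ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX> was appointed '
    '<TIMEX TYPE="DATE">last Sunday</TIMEX> president of '
    '<ENAMEX TYPE="ORGANIZATION">ACME</ENAMEX>.'
)

for tag, ne_type, text in extract_entities(sample):
    print(tag, ne_type, text)
print(strip_tags(sample))
```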
12IE History
- MUC-7 (1998)
- Tasks
- Named Entities (NE task)
- Template Element (TE task)
- Scenario Template (ST task)
- Template Relation (TR task)
- Coreferences (CO task)
- System portability among domains
15IE History
- Other conferences
- MET (Multilingual Entity Task Evaluation)
- Japanese NEs
- IREX
- Japan, 1998
- Organization, Person, Location, Artifact, Date,
Time, Money and Percent
16IE History
- Other conferences
- HUB-4 and ACE (Automatic Content Extraction)
- NIST: National Institute of Standards and
Technology
- Spoken and printed text
- CoNLL (Conference on Natural Language Learning)
- Since 1997
- NEs in the 2002 and 2003 editions
- Multilingual
- Classes: person (PER), location (LOC), organization
(ORG) and miscellaneous (MISC); tokens outside any
entity are tagged O
17IE Techniques and tasks
- IE techniques
- Document indexing and text understanding
- Document Indexing
- Tags texts with different descriptors, giving a
kind of semantic representation of their contents
- Text Understanding
- Builds a knowledge representation of texts
- IE history
- TU -> DI
- More tractable perspective
18IE Techniques and tasks
- "FC Barcelona sold goalkeeper Valdés to Espanyol
last August 14"
Seller Team: FC Barcelona; Buying Team:
Espanyol; Player: Valdés; Position:
goalkeeper; Date: August 14
- Entities
- one person
- two clubs
- Position
- date
- Relationship
- to sell a player
19IE Techniques and tasks
- FC Barcelona, the current European champion,
unexpectedly sold goalkeeper Valdés to its main
rival Espanyol last August 14.
- Victor Valdés, goalkeeper of FC Barcelona, was
transferred to Espanyol last August 14.
- Espanyol expects a great season after hiring FC
Barcelona goalkeeper Valdés last August 14.
- FC Barcelona, the current European champion, is
looking for a new goalkeeper. The club
unexpectedly sold goalkeeper Valdés to its main
rival Espanyol last August 14. The Blaugrana
must hurry because in just a few days the
transfer market will be closed.
20IE Techniques and tasks
- Events and relations extraction
- Knowledge-based techniques
- Regular expressions and patterns
- Knowledge-poor approaches
- Machine learning, statistics
- Coreferences
- Anaphora resolution
- Cross-document
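As an illustration of the knowledge-based route, the sketch below fills the player-transfer template with one hand-written regular expression. The pattern and the slot names are hypothetical and deliberately narrow: the pattern matches the plain "X sold position Y to Z last date" wording and nothing else, which is the brittleness that motivates the knowledge-poor approaches.

```python
import re

# Hypothetical hand-written pattern for "<seller> sold <position> <player> to <buyer> last <date>".
SALE_RE = re.compile(
    r"(?P<seller>[A-Z][\w ]*?) sold (?P<position>[a-z]+) (?P<player>[A-Z]\w+) "
    r"to (?P<buyer>[A-Z]\w+) last (?P<date>[A-Z][a-z]+ \d{1,2})"
)


def fill_template(sentence: str):
    """Return the filled template as a dict, or None when the pattern does not apply."""
    m = SALE_RE.search(sentence)
    return m.groupdict() if m else None


plain = "FC Barcelona sold goalkeeper Valdés to Espanyol last August 14"
paraphrase = "Victor Valdés, goalkeeper of FC Barcelona, was transferred to Espanyol last August 14."

print(fill_template(plain))       # all five slots filled
print(fill_template(paraphrase))  # None: same event, different wording
```

Covering the paraphrase would need another rule, and so would every further variant; the cost of writing and maintaining such rules is what handcrafted systems trade for their precision.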
21IE Techniques and tasks
- Performance
- Events and relations extraction
- x
- Named entities extraction
- Why?
22Named Entity Recognition
- Recognition x Classification
- Name Identification and Classification
- NER as
- a tool or component of IE and IR
- an input module for a robust shallow parsing
engine
- Component technology for other areas
- Question Answering (QA)
- Summarization
- Automatic translation
- Document indexing
- Text data mining
- Genetics
23Named Entity Recognition
- NE Hierarchies
- Person
- Organization
- Location
- But also
- Artifact
- Facility
- Geopolitical entity
- Vehicle
- Weapon
- Etc.
- SEKINE & NOBATA (2004)
- 150 types
- Domain-dependent
24Named Entity Recognition
- Internal and external features (or evidence)
- Capitalization
- not all languages
- speech data
- trigger words
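- El senyor Balaguer vol comprar-se un cotxe nou.
(Mr. Balaguer wants to buy himself a new car.)
- La ciutat de Balaguer és tot un compendi de
història de Catalunya.
(The city of Balaguer is a whole compendium of
the history of Catalonia.)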
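The two Balaguer sentences show external evidence at work: the same capitalized word is a person after "senyor" and a location after "ciutat de". A minimal sketch of that idea follows. The trigger table is a hypothetical three-entry list for illustration, not a real gazetteer.

```python
# Hypothetical trigger phrases (lower-cased) and the class they signal.
TRIGGERS = {
    "senyor": "PERSON",
    "sr.": "PERSON",
    "ciutat de": "LOCATION",
}


def classify_by_trigger(sentence: str, name: str):
    """Classify `name` by the trigger phrase that immediately precedes it."""
    idx = sentence.find(name)
    if idx < 0:
        return None  # the name does not occur in the sentence
    left_context = sentence[:idx].lower().rstrip()
    for trigger, label in TRIGGERS.items():
        if left_context.endswith(trigger):
            return label
    return "UNKNOWN"


print(classify_by_trigger("El senyor Balaguer vol comprar-se un cotxe nou.", "Balaguer"))
print(classify_by_trigger("La ciutat de Balaguer és tot un compendi de història de Catalunya.", "Balaguer"))
```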
25Named Entity Recognition
- Handcrafted systems
- Knowledge (rule) based
- Patterns
- Gazetteers
- Automatic systems
- Statistical
- Machine learning
- Unsupervised
- Analyze char type, POS, lexical info,
dictionaries
- Hybrid systems
26Named Entity Recognition
- Handcrafted systems
- LTG
- F-measure of 93.39 in MUC-7 (the best)
- Ltquery, XML internal representation
- Tokenizer, POS-tagger, SGML transducer
- Nominator (1997)
- IBM
- Heavy heuristics
- Cross-document co-reference resolution
- Used later in IBM Intelligent Miner
27Named Entity Recognition
- Handcrafted systems
- LaSIE (Large Scale Information Extraction)
- MUC-6 (LaSIE II in MUC-7)
- Univ. of Sheffield's GATE architecture (General
Architecture for Text Engineering)
- JAPE language
- FACILE (1998)
- NEA language (Named Entity Analysis)
- Context-sensitive rules
- NetOwl (MUC-7)
- Commercial product
- C engine, extraction rules
28NER automatic approaches
- Learning of statistical models or symbolic rules
- Use of annotated text corpus
- Manually annotated
- Automatically annotated
- BIO tagging
- Tags Begin, Inside, Outside an NE
- Probabilities
- Simple
- P(tag_i | token_i)
- With external evidence
- P(tag_i | token_{i-1}, token_i, token_{i+1})
- OpenClose tagging
- Two classifiers: one for the beginning, one for
the end
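A short sketch of the BIO scheme and of the simplest probability above, P(tag_i | token_i), estimated by relative frequency. The span format (token start, exclusive end, label) and the two-sentence corpus are assumptions made for the example only.

```python
from collections import Counter, defaultdict


def bio_encode(tokens, spans):
    """Turn entity spans (start, end_exclusive, label) into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def tag_probabilities(tagged_sentences):
    """Estimate P(tag | token) by relative frequency over (token, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, tag in sentence:
            counts[token][tag] += 1
    return {
        token: {tag: n / sum(c.values()) for tag, n in c.items()}
        for token, c in counts.items()
    }


tokens = "Andrew Johnson was appointed president of ACME".split()
tags = bio_encode(tokens, [(0, 2, "PER"), (6, 7, "ORG")])
print(list(zip(tokens, tags)))

# Toy corpus: "Balaguer" is seen once as a person and once as a location.
corpus = [
    [("senyor", "O"), ("Balaguer", "B-PER")],
    [("ciutat", "O"), ("de", "O"), ("Balaguer", "B-LOC")],
]
print(tag_probabilities(corpus)["Balaguer"])
```

The toy corpus leaves "Balaguer" split evenly between two tags, which is why the second formula conditions on the neighbouring tokens as well.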
29NER automatic approaches
- Decision trees
- Tree-oriented sequence of tests in every word
- Determine probabilities of having a BIO tag
- Use training corpus
- Viterbi, ID3, C4.5 algorithms
- Select most probable tag sequence
- SEKINE et al (1998)
- BALUJA et al (1999)
- F-measure 90
30NER automatic approaches
- HMM
- Markov models, Viterbi
- Separate statistical model for each NE category,
plus a model for words outside NEs
- Nymble (1997) / IdentiFinder (1999)
- Maximum Entropy (ME)
- Separate, independent probabilities for every
piece of evidence (external and internal features)
are merged multiplicatively
- MENE (NYU - 1998)
- Capitalization, many lexical features, type of
text
- F-Measure 89
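Viterbi search, mentioned for both decision trees and HMMs, picks the most probable tag sequence given per-token scores and tag-to-tag transition scores. The sketch below is a generic textbook version over a three-tag B/I/O model. It does not reproduce Nymble or any other system named here, and every probability in the toy model is invented for illustration.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable state sequence for `tokens`.

    best[i][s] holds (probability of the best path ending in state s at
    token i, previous state on that path). Unseen words get probability `unk`.
    """
    best = [{s: (start_p[s] * emit_p[s].get(tokens[0], unk), None) for s in states}]
    for token in tokens[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (best[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(token, unk), prev)
                for prev in states
            )
        best.append(row)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


# Toy model with invented numbers; "I" can never start a sentence or follow "O".
states = ["B", "I", "O"]
start_p = {"B": 0.3, "I": 0.0, "O": 0.7}
trans_p = {
    "B": {"B": 0.1, "I": 0.5, "O": 0.4},
    "I": {"B": 0.1, "I": 0.3, "O": 0.6},
    "O": {"B": 0.3, "I": 0.0, "O": 0.7},
}
emit_p = {
    "B": {"Andrew": 0.5, "Johnson": 0.1},
    "I": {"Johnson": 0.5, "Andrew": 0.05},
    "O": {"was": 0.4, "appointed": 0.3},
}

path = viterbi(["Andrew", "Johnson", "was", "appointed"], states, start_p, trans_p, emit_p)
print(path)
```

A production implementation would work with log probabilities to avoid underflow on long sentences.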
31NER other approaches
- Hybrid systems
- Combination of techniques
- IBM's Intelligent Miner: Nominator + DB/2 data
mining
- WordNet hierarchies
- MAGNINI et al. (2002)
- Stacks of classifiers
- Adaboost algorithm
- Bootstrapping approaches
- Small set of seeds
- Memory-based ML, etc.
32Named Entity Recognition
- Handcrafted systems x automatic systems
- Ease of change
- Portability (domains and languages)
- Scalability
- Language resources
- Cost-effectiveness
33NER in various languages
- Arabic
- TAGARAB (1998)
- Pattern-matching engine + morphological analysis
- Lots of morphological info (no differences in
orthographic case)
- Bulgarian
- OSENOVA & KOLKOVSKA (2002)
- Handcrafted cascaded regular NE grammar
- Pre-compiled lexicon and gazetteers
- Catalan
- CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003)
- Extract Catalan NEs with Spanish resources
(F-measure 93)
- Bootstrap using Catalan texts
34NER in various languages
- Chinese and Japanese
- Many works
- Special characteristics
- Character or word-based
- No capitalization
- CHINERS (2003)
- Sports domain
- Machine learning
- Shallow parsing technique
- ASAHARA & MATSUMOTO (2003)
- Character-based method
- Support Vector Machine
- 87.2 F-measure in the IREX (outperformed most
word-based systems)
35NER in various languages
- Dutch
- DE MEULDER et al. (2002)
- Hybrid system
- Gazetteers, grammars of names
- Machine Learning: Ripper algorithm
- French
- BÉCHET et al. (2000)
- Decision trees
- Le Monde news corpus
- German
- Non-proper nouns also capitalized
- THIELEN (1995)
- Incremental statistical approach
- 65% of proper names correctly disambiguated
36NER in various languages
- Greek
- KARKALETSIS et al. (1998)
- English and Greek: GIE (Greek Information
Extraction) project
- GATE platform
- Italian
- CUCCHIARELLI et al. (1998)
- Merge rule-based and statistical approaches
- Gazetteers
- Context-dependent heuristics
- ECRAN (Extraction of Content Research at Near
Market)
- GATE architecture
- Lack of linguistic resources: 20% of NEs
undetected
- Korean
- CHUNG et al. (2003)
- Rule-based model, Hidden Markov Model, boosting
approach over unannotated data
37NER in various languages
- Portuguese
- SOLORIO & LÓPEZ (2004, 2005)
- Adapted CARRERAS et al. (2002b) Spanish NER
- Brazilian newspapers
- Serbo-Croatian
- NENADIC & SPASIC (2000)
- Hand-written grammar rules
- Highly inflective language
- Lots of lexical and lemmatization pre-processing
- Dual alphabet (Cyrillic and Latin)
- Pre-processing stores the text in an independent
format
38NER in various languages
- Spanish
- CARRERAS et al. (2002b)
- Machine Learning, AdaBoost algorithm
- BIO and OpenClose approaches
- Swedish
- SweNam system (DALIANIS & ASTROM, 2001)
- Perl
- Machine Learning techniques and matching rules
- Turkish
- TUR et al (2000)
- Hidden Markov Model and Viterbi search
- Lexical, morphological and context clues
39Named Entity Recognition
- Multilingual approaches
- Goals (CUCERZAN & YAROWSKY, 1999)
- To handle basic language-specific evidence
- To learn from small NE lists (about 100 names)
- To process large and small texts
- To have good class scalability (to allow the
definition of different classes of entities,
according to the language or to the purpose)
- To learn incrementally, storing learned
information for future use
40Named Entity Recognition
- Multilingual approaches
- GALLIPPI (1996)
- Machine Learning
- English, Spanish, Portuguese
- ECRAN (Extraction of Content Research at Near
Market)
- REFLEX project (2005)
- the US National Business Center
41Named Entity Recognition
- Multilingual approaches
- POIBEAU (2003)
- Arabic, Chinese, English, French, German,
Japanese, Finnish, Malagasy, Persian, Polish,
Russian, Spanish and Swedish
- UNICODE
- Language independent architecture
- Rule-based, machine-learning
- Sharing of resources (dictionary, grammar rules)
for some languages
- BOAS II (2004)
- University of Maryland Baltimore County
- Web-based
- Pattern-matching
- No large corpora
42NER other topics
- Character x word-based
- JING et al. (2003)
- Hidden Markov Model classifier
- Character-based model better than word-based
model
- NER translation
- Cross-language Information Retrieval (CLIR),
Machine Translation (MT) and Question Answering
(QA)
- NER in speech
- No punctuation, no capitalization
- KIM & WOODLAND (2000)
- Up to 88.58 F-measure
- NER in Web pages
- wrappers
43NER an experiment in Catalan
- General architecture
- Common API
- Segmentation module
- POS-tagger
- Disambiguator
- Grammar module
- Module for accessing the system dictionaries
44NER an experiment in Catalan
- General architecture
- Typographical error detection module
- Spelling error detection module
- Grammatical error detection module
- NER module
45NER an experiment in Catalan
- NER Module
- Dictionary
- Multi tokens
- WORD FORM | LEMMA | TAG | FREQUENCY | WORD
FORM | FREQUENCY | WORD FORM | FREQUENCY
- can | can | N5-FP | 444 | barbet | 42 | barça | 23 | Barceló | 4
- Categories
- PERSON
- Names and surnames
- LOCATION
- Common indicators
- ORGANIZATION
- Common indicators
- UNKNOWN
46NER an experiment in Catalan
- NER Module
- Rules
- Locations
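- Verb_viure + a + location
- Exiliat novament, Macià viu a Bélgica.
(Exiled again, Macià lives in Belgium.)
- Verb_néixer + a + location
- Joan neix a Barcelona
(Joan is born in Barcelona)
- Persons
- Sr. + person
- El Sr. Companys va sortir.
(Mr. Companys left.)
- El + position + de + location, + person
- El alcalde de Barcelona, Joan Clos.
(The mayor of Barcelona, Joan Clos.)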
47NER an experiment in Catalan
- NER Module
- Rules
- Organizations
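- El + position + de + organization
- El president de Cases Rives.
(The president of Cases Rives.)
- Organization, + verb_fundat + el
- El club Orfeas Smyrna, fundat el 1890 per jònics
que residien a la ciutat turca.
(The club Orfeas Smyrna, founded in 1890 by
Ionians who lived in the Turkish city.)
- Combinations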
- For persons, organizations and locations
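The location, person and organization rules can be approximated with regular expressions, as sketched below. This is a simplification and not the module's actual rule engine: the real rules match any form of the verbs through the POS-tagger and dictionary, whereas this sketch hard-codes the forms "viu" and "neix" and the position "president", and treats any run of capitalized words (ASCII initial letter) as the entity.

```python
import re

# One or more capitalized words, e.g. "Barcelona" or "Cases Rives".
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"

# (pattern, category) pairs; verb forms and the position word are hard-coded assumptions.
RULES = [
    (re.compile(rf"\b(?:viu|neix) a (?P<ne>{NAME})"), "LOCATION"),
    (re.compile(rf"\bSr\. (?P<ne>{NAME})"), "PERSON"),
    (re.compile(rf"\b[Ee]l president de (?P<ne>{NAME})"), "ORGANIZATION"),
]


def apply_rules(sentence: str):
    """Return (entity, category) pairs for every rule that fires on the sentence."""
    found = []
    for pattern, category in RULES:
        for match in pattern.finditer(sentence):
            found.append((match.group("ne"), category))
    return found


for s in [
    "Exiliat novament, Macià viu a Bélgica.",
    "Joan neix a Barcelona",
    "El Sr. Companys va sortir.",
    "El president de Cases Rives.",
]:
    print(s, "->", apply_rules(s))
```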
48NER an experiment in Catalan
- NER Module
- Error detection and suggestion
- Pre-defined spelling rules
- Inserting try characters before every letter of
the word
- Swapping characters one by one
- Inserting try characters in their places
- The NER correction as input for the Grammar
module
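The three suggestion strategies can be sketched as a candidate generator whose output is filtered against the dictionary. The try-character set, the two-word dictionary and the function names are assumptions for the example, and the pre-defined spelling rules are not modelled.

```python
def candidates(word: str, try_chars: str) -> set:
    """Generate single-edit variants of `word` using the three strategies."""
    out = set()
    # 1. Insert each try character before every letter of the word.
    for i in range(len(word)):
        for c in try_chars:
            out.add(word[:i] + c + word[i:])
    # 2. Swap adjacent characters, one pair at a time.
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    # 3. Put each try character in the place of every letter.
    for i in range(len(word)):
        for c in try_chars:
            out.add(word[:i] + c + word[i + 1:])
    out.discard(word)
    return out


def suggest(word: str, dictionary: set, try_chars: str = "abcdefghijklmnopqrstuvwxyz"):
    """Return the dictionary entries reachable from `word` by one edit, sorted."""
    return sorted(candidates(word, try_chars) & dictionary)


dictionary = {"Barcelona", "Balaguer"}  # hypothetical NE dictionary
print(suggest("Barclona", dictionary))   # missing letter
print(suggest("Barceolna", dictionary))  # swapped letters
print(suggest("Barcelina", dictionary))  # wrong letter
```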
49NER an experiment in Catalan
- Results
- 20 Catalan texts
- Wikipedia, El Periòdic
- 10000 words
- Various domains
- Precision: 70%
- Recall: 75%
- F-Measure: 72%
- Error correction and suggestions
50Conclusions
- Needs better tuning
- Rules
- Dictionary
- canP0
- can benetCan BenetN4BMS9canN4BMS9benetN4BMS910000000P1
- can benet deP0
- can benet de laP0
- can benet de la pruaCan Benet de la PruacanN4BMSbenetN4BMSdePelEA--FSpruaN4BFSP1
- Test a statistics-based engine?
- Treatment of gender, number
- Expand to full IE system