Title: Named Entity Recognition

1
Named Entity Recognition

Beto Boullosa

2
Introduction

Presentation
Motivation
Contents
Information Extraction
Named Entity Recognition (NER)
An experiment with NER
Conclusions

3
Information Extraction

Automatic identification of selected types of
entities, relations or events in free text
(GRISHMAN, 2003)
Related areas
Information Retrieval, Knowledge Extraction
IE vs. IR

4
Information Extraction

Applications
Processing of natural language texts for the
extraction of relevant content pieces (MARTÍ AND
CASTELLÓN, 2000)
Raw texts > structured databases
Template filling
Improving search engines
Auxiliary tool for other language applications

5
IE History

Early projects
Knowledge-based, rule-based
FRUMP 1979
Newswire
LSP (Language String Project) 1981
AMA American Medical Association
Patient summaries

6
IE History

MUC: Message Understanding Conferences (1987)
DARPA, NRAD
Standardization
Evaluation
Dissemination
DARPA's TIPSTER Program: Document Detection,
Summarization and Information Extraction, until
1998
TREC (Text Retrieval Conferences)

7
IE History

MUC
Evaluation standards (for the 1st time in MUC-2)
Recall
$R = \frac{\text{correct units}}{\text{total units}}$
Precision
$P = \frac{\text{correct units}}{\text{units found}}$
F-Measure
$F = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}$
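
The three measures above can be computed directly from unit counts. The following is a minimal sketch; the function name and the sample counts are illustrative assumptions, not MUC data.

```python
def precision_recall_f(correct, found, total, beta=1.0):
    """Return (precision, recall, F) from unit counts.

    correct: units returned by the system that match the answer key
    found:   all units returned by the system
    total:   all units in the answer key
    beta:    weight of recall relative to precision (1.0 = balanced)
    """
    precision = correct / found if found else 0.0
    recall = correct / total if total else 0.0
    denom = beta ** 2 * precision + recall
    f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Made-up counts: 8 correct units out of 10 returned, 16 in the key.
p, r, f = precision_recall_f(correct=8, found=10, total=16)
print(p, r, round(f, 3))
```

With beta = 1 the F-measure reduces to the harmonic mean of precision and recall.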

8
IE History

MUC
Template filling
Mr. John Smith was appointed CEO of ACME last
December 31.

Name: John Smith
Post: CEO
Company: ACME
Date: December 31

MUC-5 (1993)
47 slots divided into 11 different nested templates
English and Japanese
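
To make the template-filling example concrete, here is a minimal sketch that fills the four slots for the sample sentence. The regular expression is a hypothetical hand-written pattern covering this one sentence shape only; it is not how any MUC system worked.

```python
import re

# Hypothetical pattern for: "<title> <Name> was appointed <Post> of <Company> last <Date>"
APPOINTMENT = re.compile(
    r"(?:Mr\.|Mrs\.|Ms\.|Dr\.)\s+(?P<Name>[A-Z]\w+(?:\s[A-Z]\w+)*)\s+"
    r"was appointed\s+(?P<Post>[A-Za-z]+)\s+of\s+(?P<Company>[A-Z]\w+)\s+"
    r"last\s+(?P<Date>[A-Z]\w+\s+\d{1,2})"
)


def fill_template(sentence):
    """Return a slot -> value dict, or None if the pattern does not match."""
    match = APPOINTMENT.search(sentence)
    return match.groupdict() if match else None


text = "Mr. John Smith was appointed CEO of ACME last December 31."
print(fill_template(text))
```

A single pattern like this breaks as soon as the sentence is rephrased, which is one reason real template filling is much harder than the example suggests.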

9
IE History

MUC-6 (1995)
Extraction of Named Entities
names of persons, organizations, locations
temporal expressions, currency and percentages
Extraction of Template Elements
grouping of entity attributes together into
entity objects
Extraction of events (or Scenario Templates)
Extraction of coreferences

10
IE History

MUC-6
ENAMEX (entity name expression) tag
persons, organizations and locations
NUMEX (numeric expression) tag
currency and percentages
TIMEX (time expression) tag
temporal expressions: dates and times

11
IE History

MUC-6
Andrew Johnson was appointed last Sunday
president of ACME, the biggest company in Santa
Barbara, California, with an estimated 300
million market capacity.
<ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX>
was appointed <TIMEX TYPE="DATE">last Sunday</TIMEX>
president of <ENAMEX TYPE="ORGANIZATION">ACME</ENAMEX>,
the biggest company in
<ENAMEX TYPE="LOCATION">Santa Barbara</ENAMEX>,
<ENAMEX TYPE="LOCATION">California</ENAMEX>
with an estimated <NUMEX TYPE="MONEY">300 million</NUMEX>
market capacity.
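
Annotated text of this kind is easy to read back into a list of entities. The sketch below assumes the SGML form `<ENAMEX TYPE="PERSON">...</ENAMEX>` with a quoted `TYPE` attribute and no nested tags; the function name is illustrative.

```python
import re

# Matches one ENAMEX / NUMEX / TIMEX element with a quoted TYPE attribute.
TAG = re.compile(r'<(ENAMEX|NUMEX|TIMEX) TYPE="([A-Z]+)">(.*?)</\1>', re.S)


def extract_entities(marked_up):
    """Return (tag, type, text) triples in document order."""
    return [
        (m.group(1), m.group(2), " ".join(m.group(3).split()))
        for m in TAG.finditer(marked_up)
    ]


sample = (
    '<ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX> was appointed '
    '<TIMEX TYPE="DATE">last Sunday</TIMEX> president of '
    '<ENAMEX TYPE="ORGANIZATION">ACME</ENAMEX>'
)
for triple in extract_entities(sample):
    print(triple)
```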

12
IE History

MUC-7 (1998)
Tasks
Named Entities (NE task)
Template Element (TE task)
Scenario Template (ST task)
Template Relation (TR task)
Coreferences (CO task)
System portability among domains

13
IE History

Domains used in MUCs

14
IE History

Results in MUC-6

15
IE History

Other conferences
MET (Multilingual Entity Task Evaluation)
Japanese NEs
IREX
Japan, 1998
Organization, Person, Location, Artifact, Date,
Time, Money and Percent

16
IE History

Other conferences
HUB-4 and ACE (Automatic Content Extraction)
NIST (National Institute of Standards and
Technology)
Spoken and printed text
CoNLL (Conference on Natural Language Learning)
Since 1997
NEs in the 2002 and 2003 editions
Multilingual
Classes: person (PER), location (LOC), organization
(ORG) and miscellaneous (MISC)

17
IE Techniques and tasks

IE techniques
Document indexing and text understanding
Document Indexing
Tags texts with different descriptors, giving a
kind of semantic representation for its contents
Text Understanding
Builds a knowledge representation of texts
IE history
TU > DI
More tractable perspective

18
IE Techniques and tasks

FC Barcelona sold goalkeeper Valdés to Espanyol
last August 14

Seller Team: FC Barcelona
Buying Team: Espanyol
Player: Valdés
Position: goalkeeper
Date: August 14

Entities
one person
two clubs
Position
date
Relationship
to sell a player

19
IE Techniques and tasks

Compare with

FC Barcelona, the current European champion,
unexpectedly sold goalkeeper Valdés to its main
rival Espanyol last August 14.

Victor Valdés, goalkeeper of FC Barcelona, was
transferred to Espanyol last August 14.

Espanyol expects a great season after hiring FC
Barcelona goalkeeper Valdés last August 14.

FC Barcelona, the current European champion, is
looking for a new goalkeeper. The club
unexpectedly sold goalkeeper Valdés to its main
rival Espanyol last August 14. The Blaugrana
must hurry because in just a few days the
transfer market will be closed.

20
IE Techniques and tasks

Events and relations extraction
Knowledge-based techniques
Regular expressions and patterns
Knowledge-poor approaches
Machine learning, statistics
Coreferences
Anaphora resolution
Cross-document

21
IE Techniques and tasks

Performance
Events and relations extraction
vs.
Named entities extraction
Why?

22
Named Entity Recognition

Recognition vs. Classification
Name Identification and Classification
NER as
a tool or component of IE and IR
an input module for a robust shallow parsing
engine
Component technology for other areas
Question Answering (QA)
Summarization
Automatic translation
Document indexing
Text data mining
Genetics

23
Named Entity Recognition

NE Hierarchies
Person
Organization
Location
But also
Artifact
Facility
Geopolitical entity
Vehicle
Weapon
Etc.
SEKINE & NOBATA (2004)
150 types
Domain-dependent

24
Named Entity Recognition

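Internal and external features (or evidence)
Capitalization
not available in all languages
absent from speech data
Trigger words
El senyor Balaguer vol comprar-se un cotxe nou.
(Mr. Balaguer wants to buy himself a new car.)
La ciutat de Balaguer és tot un compendi de
història de Catalunya.
(The city of Balaguer is a whole compendium of
the history of Catalonia.)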
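
The two Balaguer sentences show how a trigger word to the left of a capitalized token can decide its class. A minimal sketch follows, assuming tiny hypothetical trigger lists and a two-token left window; neither comes from the presentation.

```python
# Hypothetical trigger-word lists; a real system would use far larger ones.
PERSON_TRIGGERS = {"senyor", "senyora", "sr.", "sra."}
LOCATION_TRIGGERS = {"ciutat", "poble", "riu"}


def classify_by_trigger(tokens, index):
    """Guess the class of tokens[index] from the two tokens to its left."""
    window = [t.lower() for t in tokens[max(0, index - 2):index]]
    if any(t in PERSON_TRIGGERS for t in window):
        return "PERSON"
    if any(t in LOCATION_TRIGGERS for t in window):
        return "LOCATION"
    return "UNKNOWN"


s1 = "El senyor Balaguer vol comprar-se un cotxe nou .".split()
s2 = "La ciutat de Balaguer és tot un compendi de història de Catalunya .".split()
print(classify_by_trigger(s1, s1.index("Balaguer")))  # PERSON
print(classify_by_trigger(s2, s2.index("Balaguer")))  # LOCATION
```

The name itself (internal evidence) is identical in both sentences; only the context (external evidence) separates the person from the city.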

25
Named Entity Recognition

Handcrafted systems
Knowledge (rule) based
Patterns
Gazetteers
Automatic systems
Statistical
Machine learning
Unsupervised
Analyze char type, POS, lexical info,
dictionaries
Hybrid systems
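
As an illustration of the gazetteer component of handcrafted systems, here is a minimal sketch of greedy longest-match lookup. The gazetteer entries and the function name are hypothetical.

```python
# Hypothetical gazetteer: lowercased token tuples -> entity class.
GAZETTEER = {
    ("santa", "barbara"): "LOCATION",
    ("california",): "LOCATION",
    ("acme",): "ORGANIZATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def gazetteer_match(tokens):
    """Greedy longest-match lookup; returns (start, end_exclusive, class)."""
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                matches.append((i, i + n, GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on one token
    return matches


tokens = "ACME , the biggest company in Santa Barbara , California".split()
print(gazetteer_match(tokens))
```

Trying the longest candidate first keeps "Santa Barbara" together instead of matching a shorter entry inside it.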

26
Named Entity Recognition

Handcrafted systems
LTG
F-measure of 93.39 in MUC-7 (the best)
Ltquery, XML internal representation
Tokenizer, POS-tagger, SGML transducer
Nominator (1997)
IBM
Heavy heuristics
Cross-document co-reference resolution
Used later in IBM Intelligent Miner

27
Named Entity Recognition

Handcrafted systems
LaSIE (Large Scale Information Extraction)
MUC-6 (LaSIE II in MUC-7)
Univ. of Sheffield's GATE architecture (General
Architecture for Text Engineering)
JAPE language
FACILE (1998)
NEA language (Named Entity Analysis)
Context-sensitive rules
NetOwl (MUC-7)
Commercial product
C engine, extraction rules

28
NER: automatic approaches

Learning of statistical models or symbolic rules
Use of annotated text corpus
Manually annotated
Automatically annotated
BIO tagging
Tags: Begin, Inside, Outside an NE
Probabilities
Simple
$P(\text{tag}_i \mid \text{token}_i)$
With external evidence
$P(\text{tag}_i \mid \text{token}_{i-1}, \text{token}_i, \text{token}_{i+1})$
OpenClose tagging
Two classifiers: one for the beginning, one for
the end
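
A minimal sketch of BIO tagging and of the simple per-token probability follows. The span format, the label names and the relative-frequency estimate are assumptions made for illustration; the slide does not say how the probabilities are estimated.

```python
from collections import Counter, defaultdict


def to_bio(tokens, spans):
    """Convert entity spans (start, end_exclusive, label) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def estimate_tag_given_token(tagged_sentences):
    """Relative-frequency estimate of P(tag_i | token_i)."""
    counts = defaultdict(Counter)
    for tokens, tags in tagged_sentences:
        for token, tag in zip(tokens, tags):
            counts[token][tag] += 1
    return {
        token: {tag: n / sum(c.values()) for tag, n in c.items()}
        for token, c in counts.items()
    }


tokens = "Andrew Johnson was appointed president of ACME".split()
tags = to_bio(tokens, [(0, 2, "PER"), (6, 7, "ORG")])
print(list(zip(tokens, tags)))

model = estimate_tag_given_token([(tokens, tags)])
print(model["ACME"])
```

Adding external evidence means conditioning on the neighbouring tokens as well, for example by using the triple (previous token, token, next token) as the dictionary key instead of the token alone.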

29
NER: automatic approaches

Decision trees
Tree-oriented sequence of tests in every word
Determine probabilities of having a BIO tag
Use training corpus
Viterbi, ID3, C4.5 algorithms
Select most probable tag sequence
SEKINE et al (1998)
BALUJA et al (1999)
F-measure 90
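
Once a classifier has produced per-token BIO probabilities, Viterbi search picks the most probable tag sequence that is also well formed. The sketch below is an assumption-laden illustration: the probabilities are invented, and the only transition constraint is that an I tag must follow a B or I tag of the same class.

```python
import math


def bio_transition(prev, tag):
    """1.0 for allowed moves, 0.0 for an I-X that does not continue an X entity."""
    if tag.startswith("I-"):
        return 1.0 if prev in (f"B-{tag[2:]}", f"I-{tag[2:]}") else 0.0
    return 1.0


def viterbi(token_probs, transition):
    """token_probs: one {tag: probability} dict per token. Returns the best path."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][tag] = (log score of the best path ending in tag, previous tag).
    # The sentence start is treated as if it followed an "O" tag.
    best = [{t: (log(p) + log(transition("O", t)), None)
             for t, p in token_probs[0].items()}]
    for probs in token_probs[1:]:
        column = {}
        for tag, p in probs.items():
            column[tag] = max(
                (best[-1][prev][0] + log(transition(prev, tag)) + log(p), prev)
                for prev in best[-1]
            )
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Invented probabilities for the three tokens "the Johnson said".
probs = [
    {"O": 0.8, "B-PER": 0.1, "I-PER": 0.1},
    {"O": 0.1, "B-PER": 0.4, "I-PER": 0.5},
    {"O": 0.9, "B-PER": 0.05, "I-PER": 0.05},
]
print(viterbi(probs, bio_transition))
```

Picking the best tag token by token would give O, I-PER, O, which is not a valid BIO sequence; the search returns O, B-PER, O instead.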

30
NER: automatic approaches

HMM
Markov models, Viterbi
Separate statistical model for each NE category
model for words outside NEs
Nymble (1997) / IdentiFinder (1999)
Maximum Entropy (ME)
Separate, independent probabilities for every
evidence (external and internal features) are
merged multiplicatively
MENE (NYU - 1998)
Capitalization, many lexical features, type of
text
F-Measure 89
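
The multiplicative merging of evidence in a maximum entropy model can be sketched as a normalized product of per-feature weights. The feature names and weight values below are invented for illustration and are not taken from MENE.

```python
def merge_multiplicatively(feature_weights, active_features, tags):
    """Return P(tag | active features) as a normalized product of weights.

    feature_weights maps (feature, tag) to a positive weight; a pair that is
    missing from the table contributes a neutral factor of 1.0.
    """
    scores = {}
    for tag in tags:
        score = 1.0
        for feature in active_features:
            score *= feature_weights.get((feature, tag), 1.0)
        scores[tag] = score
    z = sum(scores.values())  # normalization constant
    return {tag: score / z for tag, score in scores.items()}


# Hypothetical weights: one internal feature, one external feature.
weights = {
    ("capitalized", "PERSON"): 3.0,
    ("capitalized", "O"): 0.5,
    ("prev=Mr.", "PERSON"): 4.0,
    ("prev=Mr.", "O"): 0.25,
}
dist = merge_multiplicatively(weights, ["capitalized", "prev=Mr."], ["PERSON", "O"])
print(dist)
```

Each piece of evidence contributes its own factor, so new features can be added without redesigning the model; training consists of finding the weights.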

31
NER: other approaches

Hybrid systems
Combination of techniques
IBM's Intelligent Miner: Nominator + DB/2 data
mining
WordNet hierarchies
MAGNINI et al. (2002)
Stacks of classifiers
Adaboost algorithm
Bootstrapping approaches
Small set of seeds
Memory-based ML, etc.

32
Named Entity Recognition

Handcrafted systems vs. automatic systems
Ease of change
Portability (domains and languages)
Scalability
Language resources
Cost-effectiveness

33
NER in various languages

Arabic
TAGARAB (1998)
Pattern-matching engine + morphological analysis
Lots of morphological info (no differences in
orthographic case)
Bulgarian
OSENOVA & KOLKOVSKA (2002)
Handcrafted cascaded regular NE grammar
Pre-compiled lexicon and gazetteers
Catalan
CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003)
Extract Catalan NEs with Spanish resources
(F-measure 93)
Bootstrap using Catalan texts

34
NER in various languages

Chinese & Japanese
Many works
Special characteristics
Character or word-based
No capitalization
CHINERS (2003)
Sports domain
Machine learning
Shallow parsing technique
ASAHARA & MATSUMOTO (2003)
Character-based method
Support Vector Machine
87.2 F-measure in the IREX (outperformed most
word-based systems)

35
NER in various languages

Dutch
DE MEULDER et al. (2002)
Hybrid system
Gazetteers, grammars of names
Machine Learning: Ripper algorithm
French
BÉCHET et al. (2000)
Decision trees
Le Monde news corpus
German
Non-proper nouns also capitalized
THIELEN (1995)
Incremental statistical approach
65% of proper names correctly disambiguated

36
NER in various languages

Greek
KARKALETSIS et al. (1998)
English-Greek GIE (Greek Information
Extraction) project
GATE platform
Italian
CUCCHIARELLI et al. (1998)
Merge rule-based and statistical approaches
Gazetteers
Context-dependent heuristics
ECRAN (Extraction of Content Research at Near
Market)
GATE architecture
Lack of linguistic resources: 20% of NEs
undetected
Korean
CHUNG et al. (2003)
Rule-based model, Hidden Markov Model, boosting
approach over unannotated data

37
NER in various languages

Portuguese
SOLORIO & LÓPEZ (2004, 2005)
Adapted CARRERAS et al. (2002b) Spanish NER
Brazilian newspapers
Serbo-Croatian
NENADIC & SPASIC (2000)
Hand-written grammar rules
Highly inflective language
Lots of lexical and lemmatization pre-processing
Dual alphabet (Cyrillic and Latin)
Pre-processing stores the text in an independent
format

38
NER in various languages

Spanish
CARRERAS et al. (2002b)
Machine Learning, AdaBoost algorithm
BIO and OpenClose approaches
Swedish
SweNam system (DALIANIS & ÅSTRÖM, 2001)
Perl
Machine Learning techniques and matching rules
Turkish
TUR et al (2000)
Hidden Markov Model and Viterbi search
Lexical, morphological and context clues

39
Named Entity Recognition

Multilingual approaches
Goals - CUCERZAN & YAROWSKY (1999)
To handle basic language-specific evidence
To learn from small NE lists (about 100 names)
To process large and small texts
To have good class scalability (to allow the
definition of different classes of entities,
according to the language or to the purpose)
To learn incrementally, storing learned
information for future use

40
Named Entity Recognition

Multilingual approaches
GALLIPPI (1996)
Machine Learning
English, Spanish, Portuguese
ECRAN (Extraction of Content Research at Near
Market)
REFLEX project (2005)
the US National Business Center

41
Named Entity Recognition

Multilingual approaches
POIBEAU (2003)
Arabic, Chinese, English, French, German,
Japanese, Finnish, Malagasy, Persian, Polish,
Russian, Spanish and Swedish
UNICODE
Language independent architecture
Rule-based, machine-learning
Sharing of resources (dictionary, grammar rules)
for some languages
BOAS II (2004)
University of Maryland Baltimore County
Web-based
Pattern-matching
No large corpora

42
NER: other topics

Character-based vs. word-based
JING et al. (2003)
Hidden Markov Model classifier
Character-based model better than word-based
model
NER translation
Cross-language Information Retrieval (CLIR),
Machine Translation (MT) and Question Answering
(QA)
NER in speech
No punctuation, no capitalization
KIM & WOODLAND (2000)
Up to 88.58 F-measure
NER in Web pages
wrappers

43
NER: an experiment in Catalan

General architecture
Common API
Segmentation module
POS-tagger
Disambiguator
Grammar module
Module for accessing the system dictionaries

44
NER: an experiment in Catalan

General architecture
Typographical error detection module
Spelling error detection module
Grammatical error detection module
NER module

45
NER: an experiment in Catalan

NER Module
Dictionary
Multi tokens
WORD FORMLEMMATAGFREQUENCYWORD
FORMFREQUENCYWORD FORMFREQUENCY
cancanN5-FP444barbet42barça23Barceló4
Categories
PERSON
Names and surnames
LOCATION
Common indicators
ORGANIZATION
Common indicators
UNKNOWN

46
NER: an experiment in Catalan

NER Module
Rules
Locations
Verb_viure a location
Exiliat novament, Macià viu a Bélgica.
(Exiled again, Macià lives in Belgium.)
Verb_néixer a location
Joan neix a Barcelona
(Joan is born in Barcelona)
Persons
Sr. person
El Sr. Companys va sortir.
(Mr. Companys went out.)
El position de location, person
El alcalde de Barcelona, Joan Clos.
(The mayor of Barcelona, Joan Clos.)

47
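NER: an experiment in Catalan

NER Module
Rules
Organizations
El position de organization
El president de Cases Rives.
(The president of Cases Rives.)
Organization, verb_fundat el
El club Orfeas Smyrna, fundat el 1890 per jònics
que residien a la ciutat turca.
(The club Orfeas Smyrna, founded in 1890 by
Ionians who lived in the Turkish city.)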
Combinations
For persons, organizations and locations
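
Some of the rules above can be approximated with regular expressions. This is only a sketch of four of them: the verb forms and position words are limited to those that appear in the examples, the name pattern assumes an unaccented capital initial, and the actual rule engine of the experiment is not described in the slides.

```python
import re

# A name: one or more capitalized words (assumes an ASCII capital initial).
NAME = r"[A-Z]\w+(?:\s[A-Z]\w+)*"

# (class, pattern) pairs; group 1 is the recognized entity.
RULES = [
    ("LOCATION", re.compile(rf"\bviu\s+a\s+({NAME})")),    # Verb_viure a location
    ("LOCATION", re.compile(rf"\bneix\s+a\s+({NAME})")),   # Verb_néixer a location
    ("PERSON", re.compile(rf"\bSr\.\s+({NAME})")),         # Sr. person
    ("PERSON", re.compile(rf"\balcalde\s+de\s+{NAME},\s+({NAME})")),  # position de location, person
]


def apply_rules(sentence):
    """Return (entity text, class) pairs for every rule that fires."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(sentence):
            found.append((match.group(1), label))
    return found


for sentence in [
    "Exiliat novament, Macià viu a Bélgica.",
    "Joan neix a Barcelona",
    "El Sr. Companys va sortir.",
    "El alcalde de Barcelona, Joan Clos.",
]:
    print(sentence, "->", apply_rules(sentence))
```

A fuller version would match verbs by lemma (as the Verb_viure notation suggests) using the POS-tagger output rather than listing surface forms.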

48
NER: an experiment in Catalan

NER Module
Error detection and suggestion
Pre-defined spelling rules
Inserting try characters before every letter of
the word
Swapping characters one by one
Inserting try characters in their places
The NER correction as input for the Grammar
module
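
The three candidate-generation operations listed above can be sketched as follows. This is one reading of the slide, stated as an assumption: "swapping characters one by one" is taken to mean swapping adjacent letters, and "inserting try characters in their places" is taken to mean substitution. The try-character string and the name list are hypothetical.

```python
def candidates(word, try_chars):
    """Generate correction candidates for a possibly misspelled word."""
    out = set()
    for i in range(len(word)):
        for c in try_chars:
            out.add(word[:i] + c + word[i:])       # insert c before letter i
            out.add(word[:i] + c + word[i + 1:])   # put c in place of letter i
    for i in range(len(word) - 1):
        # swap letter i with its right-hand neighbour
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    out.discard(word)
    return out


def suggest(word, dictionary, try_chars="aeioulrstnmcd"):
    """Keep only the candidates that are known dictionary entries."""
    return sorted(candidates(word, try_chars) & dictionary)


names = {"Barcelona", "Balaguer", "Macià"}
print(suggest("Barcelna", names))  # ['Barcelona']
print(suggest("Balageur", names))  # ['Balaguer']
```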

49
NER: an experiment in Catalan

Results
20 Catalan texts
Wikipedia, El Periòdic
10,000 words
Various domains
Precision: 70%
Recall: 75%
F-Measure: 72%
Error correction and suggestions
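
As a consistency check on the figures above, assuming they are percentages and that the balanced F-measure (β = 1) was used, neither of which the slide states:

\[
F = \frac{2 P R}{P + R} = \frac{2 \times 70 \times 75}{70 + 75} = \frac{10500}{145} \approx 72.4
\]

This agrees with the reported F-measure of 72.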

50
Conclusions

Needs better tuning
Rules
Dictionary
canP0
can benetCan BenetN4BMS9canN4BMS9benetN4BM
S910000000P1
can benet deP0
can benet de laP0
can benet de la pruaCan Benet de la
PruacanN4BMSbenetN4BMSdePelEA--FSpruaN4B
FSP1
Test a statistics-based engine?
Treatment of gender, number
Expand to full IE system