Named Entity Recognition - PowerPoint PPT Presentation

1
Named Entity Recognition
  • Beto Boullosa

2
Introduction
  • Presentation
  • Motivation
  • Contents
  • Information Extraction
  • Named Entity Recognition (NER)
  • An experiment with NER
  • Conclusions

3
Information Extraction
  • Automatic identification of selected types of
    entities, relations or events in free text
    (GRISHMAN, 2003)
  • Related areas
  • Information Retrieval, Knowledge Extraction
  • IE x IR

4
Information Extraction
  • Applications
  • Processing of natural language texts for the
    extraction of relevant content pieces (MARTÍ AND
    CASTELLÓN, 2000)
  • Raw texts -> structured databases
  • Template filling
  • Improving search engines
  • Auxiliary tool for other language applications

5
IE History
  • Early projects
  • Knowledge-based, rule-based
  • FRUMP (1979)
  • Newswire
  • LSP (Linguistic String Project), 1981
  • AMA (American Medical Association)
  • Patient summaries

6
IE History
  • MUC: Message Understanding Conferences (1987)
  • DARPA, NRAD
  • Standardization
  • Evaluation
  • Dissemination
  • DARPA's TIPSTER Program: Document Detection,
    Summarization and Information Extraction, until
    1998
  • TREC (Text Retrieval Conferences)

7
IE History
  • MUC
  • Evaluation standards (for the 1st time in MUC-2)
  • Recall: $R = \frac{\text{correct units}}{\text{total units}}$
  • Precision: $P = \frac{\text{correct units}}{\text{units found}}$
  • F-Measure: $F = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
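
The three measures above can be computed directly from unit counts. The following is a minimal sketch; the function name `precision_recall_f` and the example counts are made up for illustration and do not come from any MUC scorer.

```python
def precision_recall_f(correct: int, found: int, total: int, beta: float = 1.0):
    """Return (precision, recall, F-measure) from unit counts.

    correct: units produced by the system that match the answer key
    found:   all units produced by the system
    total:   all units in the answer key
    """
    precision = correct / found if found else 0.0
    recall = correct / total if total else 0.0
    denom = beta ** 2 * precision + recall
    f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Hypothetical counts: 8 correct units out of 10 found, 16 units in the key.
p, r, f = precision_recall_f(correct=8, found=10, total=16)
print(p, r, round(f, 4))
```

With beta = 1 precision and recall are weighted equally; other values of beta shift the weighting between them.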

8
IE History
  • MUC
  • Template filling
  • Mr. John Smith was appointed CEO of ACME last
    December 31.

Name: John Smith
Post: CEO
Company: ACME
Date: December 31
  • MUC-5 (1993)
  • 47 slots divided in 11 different nested templates
  • English and Japanese

9
IE History
  • MUC-6 (1995)
  • Extraction of Named Entities
  • names of persons, organizations, locations
  • temporal expressions, currency and percentages
  • Extraction of Template Elements
  • grouping of entity attributes together into
    entity objects
  • Extraction of events (or Scenario Templates)
  • Extraction of coreferences

10
IE History
  • MUC-6
  • ENAMEX (entity name expression) tag
  • persons, organizations and locations
  • NUMEX (numeric expression) tag
  • currency and percentages
  • TIMEX (time expression) tag
  • temporal expressions: dates and times

11
IE History
  • MUC-6
  • Andrew Johnson was appointed last Sunday
    president of ACME, the biggest company in Santa
    Barbara, California, with an estimated 300
    million market capacity.
  • <ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX> was appointed
    <TIMEX TYPE="DATE">last Sunday</TIMEX> president of
    <ENAMEX TYPE="ORGANIZATION">ACME</ENAMEX>, the biggest company in
    <ENAMEX TYPE="LOCATION">Santa Barbara</ENAMEX>,
    <ENAMEX TYPE="LOCATION">California</ENAMEX>, with an estimated
    <NUMEX TYPE="MONEY">300 million</NUMEX> market capacity.
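
Annotated text of this kind is easy to read back into a list of entities. The sketch below assumes SGML-style markup of the form `<ENAMEX TYPE="PERSON">...</ENAMEX>` with no nested tags; `ENTITY_RE` and `extract_entities` are hypothetical names, not part of any MUC tool.

```python
import re

TAGGED = (
    '<ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX> was appointed '
    '<TIMEX TYPE="DATE">last Sunday</TIMEX> president of '
    '<ENAMEX TYPE="ORGANIZATION">ACME</ENAMEX>'
)

# One MUC-style element; the backreference \1 forces the closing tag
# to have the same name as the opening tag.
ENTITY_RE = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')


def extract_entities(text):
    """Return (tag, type, surface string) triples in document order."""
    return [(m.group(1), m.group(2), m.group(3)) for m in ENTITY_RE.finditer(text)]


for entity in extract_entities(TAGGED):
    print(entity)
```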

12
IE History
  • MUC-7 (1998)
  • Tasks
  • Named Entities (NE task)
  • Template Element (TE task)
  • Scenario Template (ST task)
  • Template Relation (TR task)
  • Coreferences (CO task)
  • System portability among domains

13
IE History
  • Domains used in MUCs

14
IE History
  • Results in MUC-6

15
IE History
  • Other conferences
  • MET (Multilingual Entity Task Evaluation)
  • Japanese NEs
  • IREX
  • Japan, 1998
  • Organization, Person, Location, Artifact, Date,
    Time, Money and Percent

16
IE History
  • Other conferences
  • HUB-4 and ACE (Automatic Content Extraction)
  • NIST (National Institute of Standards and
    Technology)
  • Spoken and printed text
  • CoNLL (Conference on Natural Language Learning)
  • Since 1997
  • NEs in the 2002 and 2003 editions
  • Multilingual
  • NEs of the person (PER), location (LOC),
    organization (ORG) and miscellaneous (MISC)
    classes; other tokens are tagged O

17
IE Techniques and tasks
  • IE techniques
  • Document indexing and text understanding
  • Document Indexing
  • Tags texts with different descriptors, giving a
    kind of semantic representation for its contents
  • Text Understanding
  • Builds a knowledge representation of texts
  • IE history
  • TU -> DI
  • More tractable perspective

18
IE Techniques and tasks
  • FC Barcelona sold goalkeeper Valdés to Espanyol
    last August 14

Seller Team: FC Barcelona
Buying Team: Espanyol
Player: Valdés
Position: goalkeeper
Date: August 14
  • Entities
  • one person
  • two clubs
  • Position
  • date
  • Relationship
  • to sell a player

19
IE Techniques and tasks
  • Compare with

FC Barcelona, the current European champion,
unexpectedly sold goalkeeper Valdés to its main
rival Espanyol last August 14.

Victor Valdés, goalkeeper of FC Barcelona, was
transferred to Espanyol last August 14.

Espanyol expects a great season after hiring FC
Barcelona goalkeeper Valdés last August 14.

FC Barcelona, the current European champion, is
looking for a new goalkeeper. The club
unexpectedly sold goalkeeper Valdés to its main
rival Espanyol last August 14. The Blaugrana
must hurry because in just a few days the
transfer market will be closed.

20
IE Techniques and tasks
  • Events and relations extraction
  • Knowledge-based techniques
  • Regular expressions and patterns
  • Knowledge-poor approaches
  • Machine learning, statistics
  • Coreferences
  • Anaphora resolution
  • Cross-document
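
To illustrate the knowledge-based option, here is a minimal sketch of a hand-written regular expression that fills the sale template from the sentence "FC Barcelona sold goalkeeper Valdés to Espanyol last August 14". The pattern `SALE_RE` and its group names are assumptions made for this example only.

```python
import re

# Hypothetical hand-written pattern for one phrasing of a player sale.
SALE_RE = re.compile(
    r"(?P<seller>[A-Z][\w ]+?) sold (?P<position>\w+) (?P<player>[A-Z]\w+) "
    r"to (?P<buyer>[A-Z]\w+) last (?P<date>[A-Z]\w+ \d{1,2})"
)


def extract_sale(sentence):
    """Return the filled template as a dict, or None if the pattern does not match."""
    match = SALE_RE.search(sentence)
    return match.groupdict() if match else None


simple = "FC Barcelona sold goalkeeper Valdés to Espanyol last August 14"
variant = ("Victor Valdés, goalkeeper of FC Barcelona, has been transferred "
           "to Espanyol last August 14.")

print(extract_sale(simple))   # template filled
print(extract_sale(variant))  # None: same event, different wording
```

The second call shows why such patterns are brittle: each paraphrase of the same event needs its own rule, which is part of the motivation for machine-learning approaches.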

21
IE Techniques and tasks
  • Performance
  • Events and relations extraction
  • x
  • Named entities extraction
  • Why?

22
Named Entity Recognition
  • Recognition x Classification
  • Name Identification and Classification
  • NER as
  • as a tool or component of IE and IR
  • as an input module for a robust shallow parsing
    engine
  • Component technology for other areas
  • Question Answering (QA)
  • Summarization
  • Automatic translation
  • Document indexing
  • Text data mining
  • Genetics

23
Named Entity Recognition
  • NE Hierarchies
  • Person
  • Organization
  • Location
  • But also
  • Artifact
  • Facility
  • Geopolitical entity
  • Vehicle
  • Weapon
  • Etc.
  • SEKINE & NOBATA (2004)
  • 150 types
  • Domain-dependent

24
Named Entity Recognition
  • Internal and external features (or evidence)
  • Capitalization
  • not all languages
  • speech data
  • trigger words
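  • El senyor Balaguer vol comprar-se un cotxe nou.
    ("Mr. Balaguer wants to buy himself a new car":
    person)
  • La ciutat de Balaguer és tot un compendi de
    història de Catalunya.
    ("The city of Balaguer is a whole compendium of
    the history of Catalonia": location)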
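
The two Balaguer sentences show how a trigger word decides the class of an otherwise ambiguous name. A minimal sketch of that idea follows; the trigger lists and the function `classify_by_trigger` are hypothetical and far smaller than a real system would need.

```python
# Hypothetical trigger-word lists (external evidence).
PERSON_TRIGGERS = {"senyor", "senyora", "sr.", "sra."}
LOCATION_TRIGGERS = {"ciutat", "poble", "riu"}


def classify_by_trigger(tokens, index, window_size=2):
    """Guess the class of the name at `index` from the words just before it."""
    window = [t.lower() for t in tokens[max(0, index - window_size):index]]
    if any(w in PERSON_TRIGGERS for w in window):
        return "PERSON"
    if any(w in LOCATION_TRIGGERS for w in window):
        return "LOCATION"
    return "UNKNOWN"


s1 = "El senyor Balaguer vol comprar-se un cotxe nou .".split()
s2 = "La ciutat de Balaguer és tot un compendi de història de Catalunya .".split()

print(classify_by_trigger(s1, s1.index("Balaguer")))
print(classify_by_trigger(s2, s2.index("Balaguer")))
```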

25
Named Entity Recognition
  • Handcrafted systems
  • Knowledge (rule) based
  • Patterns
  • Gazetteers
  • Automatic systems
  • Statistical
  • Machine learning
  • Unsupervised
  • Analyze char type, POS, lexical info,
    dictionaries
  • Hybrid systems

26
Named Entity Recognition
  • Handcrafted systems
  • LTG
  • F-measure of 93.39 in MUC-7 (the best)
  • Ltquery, XML internal representation
  • Tokenizer, POS-tagger, SGML transducer
  • Nominator (1997)
  • IBM
  • Heavy heuristics
  • Cross-document co-reference resolution
  • Used later in IBM Intelligent Miner

27
Named Entity Recognition
  • Handcrafted systems
  • LaSIE (Large Scale Information Extraction)
  • MUC-6 (LaSIE II in MUC-7)
  • Univ. of Sheffield's GATE architecture (General
    Architecture for Text Engineering)
  • JAPE language
  • FACILE (1998)
  • NEA language (Named Entity Analysis)
  • Context-sensitive rules
  • NetOwl (MUC-7)
  • Commercial product
  • C engine, extraction rules

28
NER: automatic approaches
  • Learning of statistical models or symbolic rules
  • Use of annotated text corpus
  • Manually annotated
  • Automatically annotated
  • BIO tagging
  • Tags: Begin, Inside, Outside an NE
  • Probabilities
  • Simple
  • $P(\text{tag}_i \mid \text{token}_i)$
  • With external evidence
  • $P(\text{tag}_i \mid \text{token}_{i-1}, \text{token}_i, \text{token}_{i+1})$
  • OpenClose tagging
  • Two classifiers: one for the beginning, one for
    the end
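
As an illustration of BIO tagging and of the simple per-token probability, the sketch below encodes entity spans as B/I/O tags and estimates P(tag | token) by relative frequency. The helper names and the one-sentence "corpus" are assumptions for the example; a real system would train on a large annotated corpus.

```python
from collections import Counter, defaultdict


def bio_encode(tokens, spans):
    """Turn (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def estimate_tag_given_token(tagged_sentences):
    """Relative-frequency estimate of P(tag_i | token_i)."""
    counts = defaultdict(Counter)
    for tokens, tags in tagged_sentences:
        for token, tag in zip(tokens, tags):
            counts[token][tag] += 1
    return {
        token: {tag: n / sum(c.values()) for tag, n in c.items()}
        for token, c in counts.items()
    }


tokens = "Andrew Johnson was appointed president of ACME".split()
tags = bio_encode(tokens, [(0, 2, "PER"), (6, 7, "ORG")])
probs = estimate_tag_given_token([(tokens, tags)])
print(list(zip(tokens, tags)))
print(probs["ACME"])
```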

29
NER: automatic approaches
  • Decision trees
  • Tree-oriented sequence of tests in every word
  • Determine probabilities of having a BIO tag
  • Use training corpus
  • Viterbi, ID3, C4.5 algorithms
  • Select most probable tag sequence
  • SEKINE et al (1998)
  • BALUJA et al (1999)
  • F-measure 90
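
Selecting the most probable tag sequence with Viterbi search can be sketched as follows. This is a generic first-order implementation, not the code of any cited system: the probability tables are made-up toy values (in the approach above they would come from the trained decision tree), and `viterbi` and the `unseen` smoothing constant are hypothetical.

```python
def viterbi(tokens, tags, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable tag sequence under a first-order model."""
    # best[i][t] = (probability of the best path ending in tag t at i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(tokens[0], unseen), None) for t in tags}]
    for token in tokens[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(token, unseen), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model: all numbers below are invented for illustration.
TAGS = ["B", "I", "O"]
START = {"B": 0.3, "I": 0.0, "O": 0.7}
TRANS = {
    "B": {"B": 0.1, "I": 0.5, "O": 0.4},
    "I": {"B": 0.1, "I": 0.3, "O": 0.6},
    "O": {"B": 0.3, "I": 0.0, "O": 0.7},  # "I" can never follow "O"
}
EMIT = {
    "B": {"Santa": 0.5, "Andrew": 0.5},
    "I": {"Barbara": 0.6, "Johnson": 0.4},
    "O": {"lives": 0.5, "in": 0.5},
}

print(viterbi(["lives", "in", "Santa", "Barbara"], TAGS, START, TRANS, EMIT))
```

Setting the O-to-I transition to zero is one way to rule out ill-formed sequences in which an entity continues without having begun.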

30
NER: automatic approaches
  • HMM
  • Markov models, Viterbi
  • Separate statistical model for each NE category,
    plus a model for words outside NEs
  • Nymble (1997) / IdentiFinder (1999)
  • Maximum Entropy (ME)
  • Separate, independent probabilities for every
    evidence (external and internal features) are
    merged multiplicatively
  • MENE (NYU - 1998)
  • Capitalization, many lexical features, type of
    text
  • F-Measure 89
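
The multiplicative merging of evidence in a maximum entropy model can be sketched like this: each active feature contributes a weight for each tag, the weights are multiplied, and the products are normalized into a distribution. The feature names and weights are invented for illustration; in a real system such as MENE they would be estimated from training data.

```python
def maxent_distribution(active_features, weights, tags):
    """P(tag | features) proportional to the product of per-feature weights."""
    scores = {}
    for tag in tags:
        score = 1.0
        for feature in active_features:
            # A (feature, tag) pair without a weight is neutral (factor 1.0).
            score *= weights.get((feature, tag), 1.0)
        scores[tag] = score
    z = sum(scores.values())  # normalization constant
    return {tag: s / z for tag, s in scores.items()}


# Hypothetical weights, not learned from any corpus.
WEIGHTS = {
    ("capitalized", "PER"): 3.0,
    ("capitalized", "O"): 0.5,
    ("prev=Mr.", "PER"): 4.0,
}

dist = maxent_distribution(["capitalized", "prev=Mr."], WEIGHTS, ["PER", "O"])
print(dist)
```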

31
NER: other approaches
  • Hybrid systems
  • Combination of techniques
  • IBM's Intelligent Miner: Nominator + DB/2 data
    mining
  • WordNet hierarchies
  • MAGNINI et al. (2002)
  • Stacks of classifiers
  • AdaBoost algorithm
  • Bootstrapping approaches
  • Small set of seeds
  • Memory-based ML, etc.

32
Named Entity Recognition
  • Handcrafted systems x automatic systems
  • Ease of change
  • Portability (domains and languages)
  • Scalability
  • Language resources
  • Cost-effectiveness

33
NER in various languages
  • Arabic
  • TAGARAB (1998)
  • Pattern-matching engine + morphological analysis
  • Lots of morphological info (no differences in
    orthographic case)
  • Bulgarian
  • OSENOVA & KOLKOVSKA (2002)
  • Handcrafted cascaded regular NE grammar
  • Pre-compiled lexicon and gazetteers
  • Catalan
  • CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003)
  • Extract Catalan NEs with Spanish resources
    (F-measure 93)
  • Bootstrap using Catalan texts

34
NER in various languages
  • Chinese & Japanese
  • Many works
  • Special characteristics
  • Character or word-based
  • No capitalization
  • CHINERS (2003)
  • Sports domain
  • Machine learning
  • Shallow parsing technique
  • ASAHARA & MATSUMOTO (2003)
  • Character-based method
  • Support Vector Machine
  • 87.2 F-measure in the IREX (outperformed most
    word-based systems)

35
NER in various languages
  • Dutch
  • DE MEULDER et al. (2002)
  • Hybrid system
  • Gazetteers, grammars of names
  • Machine learning: Ripper algorithm
  • French
  • BÉCHET et al. (2000)
  • Decision trees
  • Le Monde news corpus
  • German
  • Non-proper nouns also capitalized
  • THIELEN (1995)
  • Incremental statistical approach
  • 65% of proper names correctly disambiguated

36
NER in various languages
  • Greek
  • KARKALETSIS et al. (1998)
  • English & Greek: GIE (Greek Information
    Extraction) project
  • GATE platform
  • Italian
  • CUCCHIARELLI et al. (1998)
  • Merge rule-based and statistical approaches
  • Gazetteers
  • Context-dependent heuristics
  • ECRAN (Extraction of Content Research at Near
    Market)
  • GATE architecture
  • Lack of linguistic resources: 20% of NEs
    undetected
  • Korean
  • CHUNG et al. (2003)
  • Rule-based model, Hidden Markov Model, boosting
    approach over unannotated data

37
NER in various languages
  • Portuguese
  • SOLORIO & LÓPEZ (2004, 2005)
  • Adapted CARRERAS et al. (2002b) Spanish NER
  • Brazilian newspapers
  • Serbo-Croatian
  • NENADIC & SPASIC (2000)
  • Hand-written grammar rules
  • Highly inflective language
  • Lots of lexical and lemmatization pre-processing
  • Dual alphabet (Cyrillic and Latin)
  • Pre-processing stores the text in an independent
    format

38
NER in various languages
  • Spanish
  • CARRERAS et al. (2002b)
  • Machine Learning, AdaBoost algorithm
  • BIO and OpenClose approaches
  • Swedish
  • SweNam system (DALIANIS & ÅSTRÖM, 2001)
  • Perl
  • Machine Learning techniques and matching rules
  • Turkish
  • TUR et al (2000)
  • Hidden Markov Model and Viterbi search
  • Lexical, morphological and context clues

39
Named Entity Recognition
  • Multilingual approaches
  • Goals - CUCERZAN & YAROWSKY (1999)
  • To handle basic language-specific evidence
  • To learn from small NE lists (about 100 names)
  • To process large and small texts
  • To have a good class-scalability (to allow the
    definition of different classes of entities,
    according to the language or to the purpose)
  • To learn incrementally, storing learned
    information for future use

40
Named Entity Recognition
  • Multilingual approaches
  • GALLIPPI (1996)
  • Machine Learning
  • English, Spanish, Portuguese
  • ECRAN (Extraction of Content Research at Near
    Market)
  • REFLEX project (2005)
  • the US National Business Center

41
Named Entity Recognition
  • Multilingual approaches
  • POIBEAU (2003)
  • Arabic, Chinese, English, French, German,
    Japanese, Finnish, Malagasy, Persian, Polish,
    Russian, Spanish and Swedish
  • UNICODE
  • Language independent architecture
  • Rule-based, machine-learning
  • Sharing of resources (dictionary, grammar rules)
    for some languages
  • BOAS II (2004)
  • University of Maryland Baltimore County
  • Web-based
  • Pattern-matching
  • No large corpora

42
NER: other topics
  • Character x word-based
  • JING et al. (2003)
  • Hidden Markov Model classifier
  • Character-based model better than word-based
    model
  • NER translation
  • Cross-language Information Retrieval (CLIR),
    Machine Translation (MT) and Question Answering
    (QA)
  • NER in speech
  • No punctuation, no capitalization
  • KIM & WOODLAND (2000)
  • Up to 88.58 F-measure
  • NER in Web pages
  • wrappers

43
NER: an experiment in Catalan
  • General architecture
  • Common API
  • Segmentation module
  • POS-tagger
  • Disambiguator
  • Grammar module
  • Module for accessing the system dictionaries

44
NER: an experiment in Catalan
  • General architecture
  • Typographical error detection module
  • Spelling error detection module
  • Grammatical error detection module
  • NER module

45
NER: an experiment in Catalan
  • NER Module
  • Dictionary
  • Multi tokens
  • Fields: WORD FORM, LEMMA, TAG, FREQUENCY, WORD
    FORM, FREQUENCY, WORD FORM, FREQUENCY
  • cancanN5-FP444barbet42barça23Barceló4
  • Categories
  • PERSON
  • Names and surnames
  • LOCATION
  • Common indicators
  • ORGANIZATION
  • Common indicators
  • UNKNOWN

46
NER: an experiment in Catalan
  • NER Module
  • Rules
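  • Locations
  • Verb_viure a location
  • Exiliat novament, Macià viu a Bélgica.
    ("Exiled again, Macià lives in Belgium.")
  • Verb_néixer a location
  • Joan neix a Barcelona
    ("Joan is born in Barcelona")
  • Persons
  • Sr. person
  • El Sr. Companys va sortir.
    ("Mr. Companys went out.")
  • El position de location, person
  • El alcalde de Barcelona, Joan Clos.
    ("The mayor of Barcelona, Joan Clos.")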
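
Context rules of this kind can be approximated with regular expressions. The sketch below covers only the "viure/néixer a location" and "Sr. person" rules, and only the verb forms that appear in the examples; `NAME`, `RULES` and `apply_rules` are hypothetical names and not the module's actual rule language.

```python
import re

# A capitalized word, optionally followed by more capitalized words.
NAME = r"[A-ZÀ-Þ]\w+(?: [A-ZÀ-Þ]\w+)*"

# (pattern, class) pairs; group 1 is the recognized name.
RULES = [
    (re.compile(r"\b(?:viu|neix) a (" + NAME + ")"), "LOCATION"),
    (re.compile(r"\bSr\. (" + NAME + ")"), "PERSON"),
]


def apply_rules(sentence):
    """Return (name, class) pairs for every rule match in the sentence."""
    found = []
    for pattern, label in RULES:
        for match in pattern.finditer(sentence):
            found.append((match.group(1), label))
    return found


for s in ["Exiliat novament, Macià viu a Bélgica.",
          "Joan neix a Barcelona",
          "El Sr. Companys va sortir."]:
    print(apply_rules(s))
```

A fuller version would match on the verb lemma (as the `Verb_viure` notation suggests) rather than on fixed surface forms, which requires the POS-tagger output.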

47
NER: an experiment in Catalan
  • NER Module
  • Rules
  • Organizations
  • El position de organization
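  • El president de Cases Rives.
    ("The president of Cases Rives.")
  • Organization, verb_fundat el
  • El club Orfeas Smyrna, fundat el 1890 per jònics
    que residien a la ciutat turca.
    ("The club Orfeas Smyrna, founded in 1890 by
    Ionians who lived in the Turkish city.")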
  • Combinations
  • For persons, organizations and locations

48
NER: an experiment in Catalan
  • NER Module
  • Error detection and suggestion
  • Pre-defined spelling rules
  • Inserting try characters before every letter of
    the word
  • Swapping characters one by one
  • Inserting try characters in their places
  • The NER correction as input for the Grammar
    module
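
The suggestion steps listed above resemble the candidate generation of classic spelling checkers. A minimal sketch follows, under these assumptions: "swapping characters one by one" is read as swapping adjacent characters, the try-character list `TRY_CHARS` is a placeholder (a real list would be language-specific), and `candidates` and `suggest` are hypothetical names.

```python
TRY_CHARS = "aeiou"  # placeholder list of characters to try


def candidates(word, try_chars=TRY_CHARS):
    """Generate variants by insertion, replacement and adjacent swap."""
    result = set()
    for i in range(len(word)):
        for c in try_chars:
            result.add(word[:i] + c + word[i:])      # insert c before letter i
            result.add(word[:i] + c + word[i + 1:])  # put c in place of letter i
    for i in range(len(word) - 1):
        # swap letters i and i + 1
        result.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    result.discard(word)
    return result


def suggest(word, dictionary, try_chars=TRY_CHARS):
    """Keep only the variants that are known names."""
    return sorted(candidates(word, try_chars) & dictionary)


NAMES = {"Barcelona", "Badalona"}
print(suggest("Barcelna", NAMES))   # missing letter
print(suggest("Bracelona", NAMES))  # swapped letters
```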

49
NER: an experiment in Catalan
  • Results
  • 20 Catalan texts
  • Wikipedia, El Periòdic
  • 10000 words
  • Various domains
  • Precision: 70%
  • Recall: 75%
  • F-Measure: 72%
  • Error correction and suggestions
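
As a consistency check on the reported figures, assuming the balanced F-measure (beta = 1, which the slides do not state explicitly):

```latex
F = \frac{2PR}{P + R} = \frac{2 \cdot 70 \cdot 75}{70 + 75} = \frac{10500}{145} \approx 72.4
```

This agrees with the reported F-measure of 72.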

50
Conclusions
  • Needs better tuning
  • Rules
  • Dictionary
  • canP0
  • can benetCan BenetN4BMS9canN4BMS9benetN4BM
    S910000000P1
  • can benet deP0
  • can benet de laP0
  • can benet de la pruaCan Benet de la
    PruacanN4BMSbenetN4BMSdePelEA--FSpruaN4B
    FSP1
  • Test a statistics-based engine?
  • Treatment of gender, number
  • Expand to full IE system