1
Named Entity Recognition
  • Mitch Marcus
  • Slides adapted from
  • Lance Ramshaw, BBN Technologies
  • Salim Roukos, IBM Research (not distributed)
  • Elizabeth Boschee, Michael Levit, Marjorie
    Freedman, BBN Technologies (not distributed)

2
Review: Named Entities
  • The who, where, when, and how much in a sentence
  • The task: identify atomic elements of information
    in text
  • person names
  • company/organization names
  • locations
  • dates/times
  • percentages
  • monetary amounts

3
Extraction Example
  • George Garrick, 40 years old, president of the
    London-based European Information Services Inc.,
    was appointed chief executive officer of Nielsen
    Marketing Research, USA.

George Garrick, 40 years old,
Nielsen Marketing Research, USA.
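One common way to represent the atomic elements extracted from such a sentence is as typed spans over the text. A minimal sketch, with hand-annotated spans chosen for illustration (not the output of any particular system):

```python
# Hypothetical representation of extraction output: each entity is a
# piece of surface text plus a named-entity type; character offsets are
# recovered from the sentence itself.
sentence = ("George Garrick, 40 years old, president of the London-based "
            "European Information Services Inc., was appointed chief "
            "executive officer of Nielsen Marketing Research, USA.")

# Hand-annotated (text, type) pairs for this example sentence.
entities = [
    ("George Garrick", "PERSON"),
    ("London", "LOCATION"),
    ("European Information Services Inc.", "ORGANIZATION"),
    ("Nielsen Marketing Research", "ORGANIZATION"),
    ("USA", "LOCATION"),
]

for text, etype in entities:
    start = sentence.find(text)
    print(f"{etype:13s} {text!r} @ [{start}:{start + len(text)}]")
```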
4
Review: Applications
  • Information Extraction
  • Summary generation
  • Machine Translation
  • Document organization/classification
  • Automatic indexing of books
  • Improve Internet search results (location:
    Clinton / South Carolina vs. President Clinton)

5
Machine Learning Approaches
  • ML approaches frequently break down the NE task
    into two parts:
  • Find entity boundaries
  • Classify entities into NE categories
  • Or: reduce NE boundary detection and
    classification to IOB tagging
  • O = outside, B-XXX = first word in NE, I-XXX =
    all other words in NE
  • Argentina/B-LOC played/O with/O Del/B-PER
    Bosque/I-PER
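The IOB reduction above is easy to invert: maximal B-XXX/I-XXX runs in the tag sequence correspond to entity spans. A minimal sketch of that decoding step, using the slide's Argentina example:

```python
# Recover (phrase, type) spans from a token/IOB-tag sequence.
def iob_to_spans(tokens, tags):
    """Collect maximal B-XXX/I-XXX runs into (phrase, type) pairs."""
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):             # first word of a new entity
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and ctype == tag[2:]:
            current.append(token)            # continuation of the entity
        else:                                # O (or inconsistent) tag:
            if current:                      # close any open entity
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

tokens = ["Argentina", "played", "with", "Del", "Bosque"]
tags   = ["B-LOC", "O", "O", "B-PER", "I-PER"]
print(iob_to_spans(tokens, tags))
# → [('Argentina', 'LOC'), ('Del Bosque', 'PER')]
```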

6
IdentiFinder (Nymble) (Bikel et al., 1999)
  • Based on Hidden Markov Models
  • Features
  • Capitalisation
  • Numeric symbols
  • Punctuation marks
  • Position in the sentence
  • 14 features in total, combining above info, e.g.,
    containsDigitAndDash (e.g. 09-96),
    containsDigitAndComma (e.g. 23,000.00)
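Features of this kind map each word to exactly one surface class. A hedged sketch of a few such feature functions, covering the two examples named on the slide plus a capitalization fallback (the names, ordering, and patterns here are illustrative, not the exact 14-feature set of the original system):

```python
import re

# Map a word to a single IdentiFinder-style surface-feature class.
def word_feature(word):
    if re.fullmatch(r"\d+-\d+", word):
        return "containsDigitAndDash"      # e.g. 09-96
    if re.fullmatch(r"[\d,.]*\d,[\d,.]*", word):
        return "containsDigitAndComma"     # e.g. 23,000.00
    if word[0].isupper():
        return "initCap"                   # capitalized word
    if word.isdigit():
        return "otherNum"                  # plain number
    return "lowercase"                     # default fallback

for w in ["09-96", "23,000.00", "Nielsen", "president"]:
    print(w, "->", word_feature(w))
```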

7
Nymble's structure (simplified): hidden states
  • START_OF_SENTENCE
  • END_OF_SENTENCE
  • PERSON
  • ORGANIZATION
  • 5 other name classes
  • NOT_A_NAME
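In an HMM of this shape, the name classes are the hidden states, words are emissions, and decoding picks the most likely state sequence. A toy sketch with Viterbi decoding over three of the states (all probabilities below are invented for illustration; the real model also conditions on the previous word and uses the full state set):

```python
import math

STATES = ["PERSON", "ORGANIZATION", "NOT_A_NAME"]

# P(state | previous state): a state tends to persist (made-up numbers).
trans = {s: {t: (0.6 if s == t else 0.2) for t in STATES} for s in STATES}

# P(word | state): tiny hand-made vocabulary (hypothetical numbers).
emit = {
    "PERSON":       {"Garrick": 0.5,  "Nielsen": 0.2,  "was": 0.01},
    "ORGANIZATION": {"Garrick": 0.1,  "Nielsen": 0.5,  "was": 0.01},
    "NOT_A_NAME":   {"Garrick": 0.01, "Nielsen": 0.01, "was": 0.5},
}

def viterbi(words):
    # Log-space Viterbi with a uniform start distribution over states.
    best = {s: (math.log(1 / len(STATES)) + math.log(emit[s][words[0]]), [s])
            for s in STATES}
    for w in words[1:]:
        best = {s: max(((p + math.log(trans[prev][s]) + math.log(emit[s][w]),
                         path + [s])
                        for prev, (p, path) in best.items()),
                       key=lambda x: x[0])
                for s in STATES}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["Garrick", "was"]))
# → ['PERSON', 'NOT_A_NAME']
```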
8
IdentiFinder (2)
  • MUC-6 (English) and MET-1 (Spanish) corpora used
    for evaluation
  • Mixed-case English:
  • IdentiFinder: 94.9 F-measure
  • Best rule-based: 96.4
  • Spanish mixed case:
  • IdentiFinder: 90
  • Best rule-based: 93
  • Lower-case names, noisy training data, less
    training data
  • Training data: 650,000 words, but similar
    performance with half of the data. Less than
    100,000 words reduces the performance to below 90
    on English
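The F-measure scores quoted above are the harmonic mean of precision and recall over the extracted entities. A minimal sketch of the computation (the counts below are hypothetical, chosen only to reproduce a score near 94.9):

```python
# F-measure from entity counts: correct = entities the system got right,
# guessed = entities the system emitted, actual = gold-standard entities.
def f_measure(correct, guessed, actual, beta=1.0):
    precision = correct / guessed
    recall = correct / actual
    return ((1 + beta**2) * precision * recall
            / (beta**2 * precision + recall))

# Hypothetical counts: 930 correct out of 980 guessed and 980 gold.
print(round(100 * f_measure(correct=930, guessed=980, actual=980), 1))
# → 94.9
```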

9
(No Transcript)
10
(No Transcript)
11
(No Transcript)
12
One out-of-date but simple method
13
NER as a benchmark task
  • CoNLL (Conference on Natural Language Learning)
    shared task, 2002 and 2003
  • Two languages: English and German
  • Following information from:
  • "Introduction to the CoNLL-2003 Shared Task:
    Language-Independent Named Entity Recognition"
  • Erik F. Tjong Kim Sang and Fien De Meulder

14
Data Sets and Annotation
15
Feature Representations Used
16
Methods Used
  • Maximum Entropy: Bender; Chieu & Ng; Curran & Clark
  • Maximum Entropy + others: Florian et al.; Klein et al.
  • HMMs: Florian et al.; Klein et al.; Mayfield et al.;
    Whitelaw & Patrick
  • CRFs: Klein et al.
  • Winnow: Zhang & Johnson
  • Voted Perceptron: Carreras et al.
  • RNN: Hammerton
  • TBL: Florian et al.
  • SVM: Mayfield et al.

17
Evaluation
18
Additional Data and System Availability Matters