Transcript and Presenter's Notes

Title: Named-Entity Recognition with Character-Level Models


1
Named-Entity Recognition with Character-Level Models
  • Dan Klein, Joseph Smarr, Huy Nguyen,
    and Christopher D. Manning
  • Stanford University
  • CoNLL-2003: Seventh Conference on Natural
    Language Learning

klein@cs.stanford.edu  jsmarr@stanford.edu  htnguyen@stanford.edu  manning@cs.stanford.edu
2
Unknown Words are a Central Challenge for NER
  • Recognizing known named-entities (NEs) is
    relatively simple and accurate
  • Recognizing novel NEs requires recognizing
    context and/or word-internal features
  • External context and frequent internal words
    (e.g. Inc.) are the most commonly used features
  • Internal composition of NEs alone provides
    surprisingly strong evidence for classification
    (Smarr & Manning, 2002), e.g.:
  • Staffordshire
  • Abdul-Karim al-Kabariti
  • CentrInvest

3
Are Names Self-Describing?
  • NO: names can be opaque/ambiguous
  • Word-Level: "Washington" occurs as LOC, PER, and
    ORG
  • Char-Level: "-ville" suggests LOC, but there are
    exceptions like "Neville"
  • YES: names can be highly distinctive/descriptive
  • Word-Level: "National Bank" is a bank (i.e. ORG)
  • Char-Level: "Cotramoxazole" is clearly a drug
    name
  • Question: Overall, how informative are names
    alone?

4
How Internally Descriptive are Isolated Named
Entities?
  • Classification accuracy of pre-segmented CoNLL
    NEs without context is 90%
  • Using character n-grams as features instead of
    words yields a 25% error reduction
  • On single-word unknown NEs, the word model is at
    chance; the char n-gram model fixes 38% of errors

NE Classification Accuracy (%), not the CoNLL task
5
Exploiting Word-Internal Features
  • Many existing systems use some word-internal
    features (suffix, capitalization, punctuation,
    etc.)
  • e.g. Mikheev 1997, Wacholder et al. 1997, Bikel
    et al. 1997
  • Features are usually language-dependent (e.g.
    morphology)
  • Our approach: use char n-grams as the primary
    representation
  • Use all substrings as classification features
    (see the sketch after this list)
  • Char n-grams subsume word features
  • Features are language-independent (assuming the
    language is alphabetic)
  • Similar in spirit to Cucerzan and Yarowsky (1999),
    but uses ALL char n-grams vs. just prefix/suffix
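
Below is a minimal sketch of the "all substrings as features" idea. The function name, the n-gram length cap, and the boundary markers are illustrative assumptions, not details taken from the slides.

```python
def char_ngram_features(word, max_len=6):
    """Return all character n-grams of a word (length 2..max_len) as
    feature strings. '<' and '>' mark word boundaries so prefixes and
    suffixes are distinguishable from word-internal substrings."""
    padded = "<" + word + ">"
    feats = set()
    for n in range(2, max_len + 1):
        for i in range(len(padded) - n + 1):
            feats.add("ngram=" + padded[i:i + n])
    return feats

# Suffix evidence such as 'ville>' (suggesting LOC) shows up
# automatically as one of the extracted substrings.
print(sorted(char_ngram_features("Neville")))
```

Because every substring (including the whole word between the boundary markers) becomes a feature, word identity is subsumed, as the slide notes.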

6
Character-Feature Based Classifier
  • Model I: Independent classification at each word
  • maxent classifiers, trained using conjugate
    gradient
  • equal-scale Gaussian priors for smoothing
  • trained models with >800K features in 2 hrs
  • POS tags and contextual features complement the
    n-grams (see the sketch after the table below)

Description         Added Features                              Overall F1 (English Dev.)
Words               w0                                          52.29
Official Baseline   -                                           71.18
Char N-Grams        n(w0)                                       73.10
POS Tags            t0                                          74.17
Simple Context      w-1, w0, t-1, t1                            82.39
More Context        <w-1, w0>, <w0, w1>, <t-1, t0>, <t0, w1>    83.09
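
The slides describe conjugate-gradient training of maxent models with equal-scale Gaussian priors. As a hedged stand-in, the sketch below uses scikit-learn's L2-penalized LogisticRegression (an L2 penalty corresponds to a Gaussian prior) over char n-gram, word, and POS features; the toy training set and helper names are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(word, pos):
    # word identity, POS tag, and all character n-grams (length 2-6)
    padded = "<" + word + ">"
    feats = {"word=" + word: 1, "pos=" + pos: 1}
    for n in range(2, 7):
        for i in range(len(padded) - n + 1):
            feats["ngram=" + padded[i:i + n]] = 1
    return feats

# Toy examples; the real model is trained on the CoNLL data.
train = [("Washington", "NNP", "LOC"), ("Ministry", "NNP", "ORG"),
         ("spokesman", "NN", "O"), ("Reuters", "NNP", "ORG")]

vec = DictVectorizer()
X = vec.fit_transform([features(w, p) for w, p, _ in train])
y = [label for _, _, label in train]

# The L2 penalty plays the role of the equal-scale Gaussian prior;
# lbfgs stands in for the conjugate-gradient trainer on the slide.
clf = LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000)
clf.fit(X, y)

print(clf.predict(vec.transform([features("Staffordshire", "NNP")])))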
7
Character-Based CMM
  • Model II: Joint classification along the
    sequence
  • Previous classification decisions are clearly
    relevant
  • "Grace Road" is a single location, not a person
    followed by a location
  • Include neighboring classification decisions as
    features
  • Perform joint inference across the chain of
    classifiers (see the decoding sketch after this
    list)
  • Conditional Markov Model (CMM, a.k.a. maxent
    Markov model)
  • Borthwick 1999; McCallum et al. 2000
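
A hedged sketch of feeding the previous decision back in as a feature. Greedy left-to-right decoding is shown for brevity; the actual system performs joint inference over the chain. The helper names (make_features, clf, vec) refer to the classifier sketch above and are assumptions.

```python
def cmm_decode_greedy(tokens, clf, vec, make_features):
    """Left-to-right decoding for a conditional Markov model: each
    word's feature set includes the previously predicted label (an
    s-1 feature), so earlier decisions inform later ones."""
    labels = []
    prev = "<START>"
    for word, pos in tokens:
        feats = make_features(word, pos)
        feats["prev_label=" + prev] = 1   # neighboring decision as a feature
        pred = clf.predict(vec.transform([feats]))[0]
        labels.append(pred)
        prev = pred
    return labels

# Note: for the prev_label features to have any effect, they must also
# be included when fitting vec and clf at training time.
# e.g. cmm_decode_greedy([("Grace", "NNP"), ("Road", "NNP")], clf, vec, features)
```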

8
Character-Based CMM
Description         Added Features                              Overall F1 (English Dev.)
More Context        <w-1, w0>, <w0, w1>, <t-1, t0>, <t0, w1>    83.09
Simple Sequence     s-1, <s-1, t-1, t0>                         85.44
More Sequence       <s-2, s-1>, <s-2, s-1, t-1, t0>             87.21
Final               misc. extra features                        92.27
  • Final extra features (a signature sketch follows
    this list)
  • Letter-type patterns for each word
  • United → Xx, 12-month → d-x, etc.
  • Conjunction features
  • E.g., previous state and current signature
  • Repeated last words of multi-word names
  • E.g., "Jones" after having seen "Doug Jones"
  • and a few more
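
A minimal sketch of a letter-type signature function consistent with the examples above. The exact character classes and collapsing rules of the actual system are not given on the slide, so the regexes below are assumptions.

```python
import re

def word_signature(word):
    """Collapse a word to its letter-type pattern,
    e.g. 'United' -> 'Xx', '12-month' -> 'd-x'."""
    sig = re.sub(r"[A-Z]+", "X", word)   # runs of capitals  -> X
    sig = re.sub(r"[a-z]+", "x", sig)    # runs of lowercase -> x
    sig = re.sub(r"[0-9]+", "d", sig)    # runs of digits    -> d
    return sig

print(word_signature("United"), word_signature("12-month"))  # Xx d-x

# A conjunction feature could then combine, e.g., the previous state
# with the current word's signature: "prev=" + prev + "&sig=" + word_signature(w)
```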

9
Final Results
  • The drop from English dev to test is largely due
    to inconsistent labeling
  • Lack of capitalization cues in German hurts
    recall more, because the maxent classifier is
    precision-biased when faced with weak evidence

10
Conclusions
  • Character substrings are valuable and
    underexploited model features
  • Named entities are internally quite descriptive
  • 25-30% error reduction vs. word-level models
  • Discriminative maxent models allow productive
    feature engineering
  • 30% error reduction vs. basic model
  • What distinguishes our approach?
  • More and better features
  • Regularization is crucial for preventing
    overfitting

11
References
  • Daniel M. Bikel, Scott Miller, Richard Schwartz,
    and Ralph Weischedel. 1997. Nymble: a
    high-performance learning name-finder. In
    Proceedings of ANLP 97, pages 194--201.
  • Andrew Borthwick. 1999. A Maximum Entropy
    Approach to Named Entity Recognition. Ph.D.
    thesis, New York University.
  • Silviu Cucerzan and David Yarowsky. 1999.
    Language independent named entity recognition
    combining morphological and contextual evidence.
    In Joint SIGDAT Conference on EMNLP and VLC.
  • Shai Fine, Yoram Singer, and Naftali Tishby.
    1998. The hierarchical hidden Markov model:
    Analysis and applications. Machine Learning,
    32:41--62.

12
References (cont.)
  • Andrew McCallum, Dayne Freitag, and Fernando
    Pereira. 2000. Maximum entropy Markov models for
    information extraction and segmentation. In
    ICML 2000.
  • Andrei Mikheev. 1997. Automatic rule induction
    for unknown-word guessing. Computational
    Linguistics, 23(3):405--423.
  • Adwait Ratnaparkhi. 1996. A maximum entropy model
    for part-of-speech tagging. In EMNLP 1, pages
    133--142.
  • Joseph Smarr and Christopher D. Manning. 2002.
    Classifying unknown proper noun phrases without
    context. Technical Report dbpubs/2002-46,
    Stanford University, Stanford, CA.
  • Nina Wacholder, Yael Ravin, and Misook Choi.
    1997. Disambiguation of proper names in text. In
    ANLP 5, pages 202--208.

13
CoNLL Named Entity Recognition
  • Task: Predict the semantic label of each word in
    text (columns: word, POS tag, chunk tag, NE label;
    a parsing sketch follows this list)
  • Foreign NNP I-NP ORG
  • Ministry NNP I-NP ORG
  • spokesman NN I-NP O
  • Shen NNP I-NP PER
  • Guofang NNP I-NP PER
  • told VBD I-VP O
  • Reuters NNP I-NP ORG
  • . . O O
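
A small sketch of reading rows in the four-column format shown above (word, POS tag, chunk tag, NE label); the function name and the in-memory representation are assumptions for illustration.

```python
def read_conll(lines):
    """Group 'word POS chunk NE' rows into sentences;
    blank lines separate sentences."""
    sentences, sentence = [], []
    for line in lines:
        line = line.strip()
        if not line:                      # sentence boundary
            if sentence:
                sentences.append(sentence)
                sentence = []
            continue
        word, pos, chunk, label = line.split()
        sentence.append((word, pos, chunk, label))
    if sentence:
        sentences.append(sentence)
    return sentences

rows = ["Foreign NNP I-NP ORG", "Ministry NNP I-NP ORG",
        "spokesman NN I-NP O", "Shen NNP I-NP PER",
        "Guofang NNP I-NP PER", "told VBD I-VP O",
        "Reuters NNP I-NP ORG", ". . O O"]
print(read_conll(rows))
```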

14
Final Results: English Dev.
15
More Results
Description         Added Features                              ALL    LOC    MISC   ORG    PER
Words               w0                                          52.29  41.03  70.18  60.43  60.14
Official Baseline   -                                           71.18  80.52  83.52  66.43  55.20
Char N-Grams        n(w0)                                       73.10  80.95  71.67  59.06  77.23
POS Tags            t0                                          74.17  81.27  74.46  59.61  78.73
Simple Context      w-1, w0, t-1, t1                            82.39  87.77  82.91  70.62  85.77
More Context        <w-1, w0>, <w0, w1>, <t-1, t0>, <t0, w1>    83.09  89.13  83.51  71.31  85.89

Description         Added Features                              ALL    LOC    MISC   ORG    PER
More Context        <w-1, w0>, <w0, w1>, <t-1, t0>, <t0, w1>    83.09  89.13  83.51  71.31  85.89
Simple Sequence     s-1, <s-1, t-1, t0>                         85.44  90.09  80.95  76.40  89.66
More Sequence       <s-2, s-1>, <s-2, s-1, t-1, t0>             87.21  90.76  81.01  81.71  90.80
Final               misc. error-driven features                 92.27  94.39  87.10  88.44  95.41
16
Complete Final Results
English dev. Precision Recall F1
LOC 94.44 94.34 94.39
MISC 90.62 83.84 87.10
ORG 87.63 89.26 88.44
PER 93.86 97.01 95.41
Overall 92.15 92.39 92.27
English test Precision Recall F1
LOC 90.04 89.93 89.98
MISC 83.49 77.07 78.85
ORG 82.49 78.57 80.48
PER 86.66 95.18 90.72
Overall 86.12 86.49 86.31
German dev. Precision Recall F1
LOC 75.53 66.13 70.52
MISC 78.71 47.23 59.03
ORG 77.57 53.51 63.33
PER 72.36 71.02 71.69
Overall 75.36 60.36 67.03
German test Precision Recall F1
LOC 78.01 69.57 73.54
MISC 75.90 47.01 58.06
ORG 73.26 51.75 60.65
PER 87.68 79.83 83.57
Overall 80.38 65.04 71.90