CSA405: Advanced Topics in NLP - PowerPoint PPT Presentation

Slides: 59
Provided by: MikeR2


1
CSA405: Advanced Topics in NLP
  • Information Extraction II
  • Named Entity Recognition

2
Sources
  • D. Appelt and D. Israel, Introduction to IE
    Technology, tutorial given at IJCAI-99
  • Mikheev et al., Named Entity Recognition
    without Gazetteers, EACL 1999
  • Daniel M. Bikel, Richard Schwartz and Ralph M.
    Weischedel. 1999. An Algorithm that Learns What's
    in a Name

3
Outline
  • NER: what is involved
  • The MUC-6/7 task definition
  • Two approaches
    • Mikheev 1999 (rule based)
    • Bikel 1999 (NER based on HMMs)

4
The Named Entity Recognition Task
  • Named Entity task introduced as part of MUC-6
    (1995), and continued at MUC-7 (1998)
  • Different kinds of named entity
    • temporal expressions
    • numeric expressions
    • name expressions

5
Temporal Expressions (TIMEX tag)
  • DATE: complete or partial date expression
  • TIME: complete or partial expression of time of
    day
  • Absolute temporal expressions only, i.e.
    • "Monday",
    • "10th of October",
    • but not "first day of the month".

6
More TIMEX Examples
  • "twelve o'clock noon":
    <TIMEX TYPE="TIME">twelve o'clock noon</TIMEX>
  • "January 1990":
    <TIMEX TYPE="DATE">January 1990</TIMEX>
  • "third quarter of 1991":
    <TIMEX TYPE="DATE">third quarter of 1991</TIMEX>
  • "the fourth quarter ended Sept. 30":
    <TIMEX TYPE="DATE">the fourth quarter ended Sept. 30</TIMEX>

7
Time Expressions - Difficulties
  • Problems interpreting some task specs: relative
    time expressions are not to be tagged, but any
    absolute times expressed as part of the entire
    expression are to be tagged
    • this <TIMEX TYPE="DATE">June</TIMEX>
    • thirty days before the end of the year (no
      markup)
    • the end of <TIMEX TYPE="DATE">1991</TIMEX>

8
Temporal Expressions
  • DATE/TIME distinction relatively straightforward
    to handle
  • Can typically be captured by regular expressions
  • Need to handle missing elements properly, e.g.
    Jan 21st → Jan 21st 2002
9
Number Expressions (NUMEX)
  • Monetary expressions
  • Percentages
  • Numbers may be expressed in either numeric or
    alphabetic form.
  • Categorized as MONEY or PERCENT via the TYPE
    attribute.

10
NUMEX Tag
  • The entire string is to be tagged:
    <NUMEX TYPE="MONEY">20 million New Pesos</NUMEX>
  • Modifying words are to be excluded from the NUMEX
    tag: over <NUMEX TYPE="MONEY">$90,000</NUMEX>
  • Nested tags are allowed:
    <NUMEX TYPE="MONEY"><ENAMEX TYPE="LOCATION">US</ENAMEX>$43.6 million</NUMEX>
  • Numeric expressions that do not use
    currency/percentage terms are not to be
    tagged: 12 points (no markup)

11
NUMEX Examples
  • "about 5%":
    about <NUMEX TYPE="PERCENT">5%</NUMEX>
  • "over $90,000":
    over <NUMEX TYPE="MONEY">$90,000</NUMEX>
  • "several million dollars":
    <NUMEX TYPE="MONEY" ALT="million dollars">several million dollars</NUMEX>
  • "US$43.6 million":
    <NUMEX TYPE="MONEY"><ENAMEX TYPE="LOCATION">US</ENAMEX>$43.6 million</NUMEX>
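
A rough sketch of NUMEX tagging with regular expressions is given below. It assumes that money is signalled by a dollar sign and percentages by a percent sign. It does not handle spelled-out amounts ("several million dollars") or nested ENAMEX markup, and the function name tag_numex is invented for the example.

```python
import re

# "$90,000", "$43.6 million": a dollar sign, digits, optional magnitude word.
MONEY_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s+(?:million|billion))?")
# "5%", "12.5 %"
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s?%")

def tag_numex(text):
    """Tag money and percentage expressions, leaving modifiers outside the tag."""
    text = MONEY_RE.sub(
        lambda m: f'<NUMEX TYPE="MONEY">{m.group(0)}</NUMEX>', text)
    return PERCENT_RE.sub(
        lambda m: f'<NUMEX TYPE="PERCENT">{m.group(0)}</NUMEX>', text)
```

Because the patterns start at the currency sign or the number, modifying words such as "over" and "about" stay outside the markup, and bare numbers such as "12 points" are not tagged.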

12
Name Expressions
  • Two related subtasks
  • Identification: which piece of text
  • Classification: what kind of name

13
Name Recognition: Identification and Classification
  • The delegation, which included the commander of
    the U.N. troops in Bosnia, Lt. Gen. Sir Michael
    Rose, went to the Serb stronghold of Pale, near
    Sarajevo, for talks with Bosnian Serb leader
    Radovan Karadzic.
  • Locations
  • Persons
  • Organizations

14
Annotator Guidelines
15
MUC-6 Output Format
  • Output in terms of SGML markup:
    <ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX>
  • ENAMEX is the tag; TYPE is the type attribute

16
Name Expressions: Problems
  • Recognition
    • Sentence-initial uppercase is unreliable
  • Delimitation
    • Conjunctions: to bind or not to bind, e.g.
      Victoria and Albert (Museum)
  • Type ambiguity
    • Persons versus Organisations versus Locations,
      e.g. J. Arthur Rank, Washington

17
Example 2
  1. MATSUSHITA ELECTRIC INDUSTRIAL CO . HAS REACHED
     AGREEMENT
  2. IF ALL GOES WELL, MATSUSHITA AND ROBERT BOSCH
     WILL
  3. VICTOR CO. OF JAPAN ( JVC ) AND SONY CORP.
  4. IN A FACTORY OF BLAUPUNKT WERKE , A ROBERT BOSCH
     SUBSIDIARY ,
  5. TOUCH PANEL SYSTEMS , CAPITALIZED AT 50 MILLION
     YEN, IS OWNED
  6. MATSUSHITA EILL DECIDE ON THE PRODUCTION SCALE.

18
Example 2
  1. EASY: keyword present
  2. EASY: shortened form is computable
  3. EASY: acronym is computable
  4. HARD: difficult to tell that ROBERT BOSCH is an
     organisation name
  5. HARD: cf. 4.
  6. HARD: spelling error difficult to spot.

19
Name Expressions: Sources of Information
  • Occurrence specific
    • capitalisation; presence of immediately
      surrounding clue words (e.g. Mr.)
  • Document specific
    • Previous mention of a name (cf. symbol tables)
    • same document; same collection
  • External
    • Gazetteers, e.g. person names, place names, zip
      codes.

20
Gazetteers
  • System that recognises only entities stored in
    its lists (gazetteers).
  • Advantages: simple, fast, language independent,
    easy to retarget (just create lists)
  • Disadvantages: impossible to enumerate all
    names, cannot deal with name variants, cannot
    resolve ambiguity.
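
A gazetteer-only recogniser can be sketched as a greedy longest-match lookup over tokens. The function name, the max_len limit and the entries in the usage example below are hypothetical; real gazetteers are far larger and are usually normalised for case and punctuation.

```python
def gazetteer_tag(tokens, gazetteer, max_len=4):
    """Greedy longest-match lookup.

    gazetteer maps a tuple of tokens to a class label.
    Returns a list of (start, end, label) spans, end exclusive.
    """
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label:
                spans.append((i, i + n, label))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans
```

With entries such as ("New", "York") → LOCATION and ("Sony", "Corp.") → ORGANIZATION, the sentence "Sony Corp. opened an office in New York" yields the spans (0, 2, ORGANIZATION) and (6, 8, LOCATION). The disadvantages listed above show up directly: an entry ("Washington",) → LOCATION would label every occurrence as a location, including the person, and any name missing from the lists is silently skipped.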

21
Gazetteers
  • Limited availability
  • Maintenance (organisations change)
  • Criteria for building effective gazetteers are
    unclear, e.g. size, but
  • Better to use small gazetteers of well-known
    names than large ones of low-frequency names
    (Mikheev et al. 1999).

22
Sources for Creation of Gazetteers
  • Yellow pages for person and organisation names.
  • US GEOnet Names Server (GNS) data: 3.9 million
    locations with 5.37 million names
    http://earth-info.nga.mil/gns/html/
  • UN site: http://unstats.un.org/unsd/citydata
  • Automatic collection from annotated training data

23
Recognising Names
  • Two main approaches
    • Rule-based system, usually based on FS
      (finite-state) methods
    • Automatically trained system, usually based on
      HMMs
  • Rule-based systems tend to have a performance
    advantage

24
Mikheev et al. 1999
  • How important are gazetteers?
  • Is it important that they are big?
  • If gazetteers are important but their size isn't,
    what are the criteria for building gazetteers?

25
Mikheev Experiment
  • Learned list
    • Training data (200 articles from MUC-7)
    • 1228 persons, 809 organisations, 770 locations
  • Common lists
    • CIA World Factbook
    • 33K organisations, 27K persons, 5K locations
  • Combined

26
Mikheev Results of Experiment
27
Mikheev's System
  • Hybrid approach: c. 100 rules
  • Rules make heavy use of capitalisation
  • Rules based on internal structure which reveals
    the type, e.g. "Word Word plc", "Prof. Word Word"
  • Modest but well-chosen gazetteer: 5000 company
    names, 1000 human names, 20,000 locations; 2-3
    weeks' effort

28
Mikheev et al. (1999) Architecture
  1. Sure-fire Rules
  2. Partial Match 1
  3. Rule Relaxation
  4. Partial Match 2
  5. Title Assignment

29
Sure-Fire Rules
  • Fire when a possible candidate expression is
    surrounded by a suggestive context

30
Partial Match 1
  • Collect all named entities already identified,
    e.g. Adam Kluver Ltd.
  • Generate all subsequences: Adam, Adam Kluver,
    Kluver, Kluver Ltd, Ltd.
  • Check for occurrences of subsequences and mark as
    possible items of the same class as the original
    named entity
  • Check against pre-trained maximum entropy model.
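
The subsequence-generation step can be sketched as below. The function names are invented for illustration, and the final check against the maximum entropy model is left out.

```python
def subsequences(name):
    """All contiguous proper subsequences of a multi-word name."""
    tokens = name.split()
    n = len(tokens)
    return [" ".join(tokens[i:j])
            for i in range(n)
            for j in range(i + 1, n + 1)
            if j - i < n]  # skip the full name itself

def partial_match_candidates(entities):
    """Map each subsequence to the classes of the entities it came from.

    entities: dict of already identified full name -> class.
    """
    candidates = {}
    for name, cls in entities.items():
        for sub in subsequences(name):
            candidates.setdefault(sub, set()).add(cls)
    return candidates
```

For "Adam Kluver Ltd" this produces Adam, Adam Kluver, Kluver, Kluver Ltd and Ltd, matching the list above. A later mention of "Kluver" alone then becomes a candidate ORGANIZATION, subject to confirmation by the model.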

31
Maximum Entropy Model
  • This model takes into account contextual
    information for named entities:
    • sentence position
    • whether they exist in lowercase in general
    • whether they are used in lowercase elsewhere in
      the same document, etc.
  • These features are passed to the model as
    attributes of the partially matched words.
  • If the model provides a positive answer for a
    partial match, the system makes a definite
    assignment.

32
Rule Relaxation
  • More relaxed contextual constraints
  • Make use of information from existing markup and
    from previous stages to
    • Resolve conjunctions within named entities, e.g.
      China Import and Export Co.
    • Resolve ambiguity of e.g. Murdoch's News Corp

33
Partial Match 2
  • Handle single-word names not covered by Partial
    Match 1 (e.g. Hughes, from Hughes Communication Ltd)
  • "U7ited States and Russia": if there is evidence
    for two items in a conjunction "XXX and YYY" and
    one item has already been tagged Location, then
    XXX and YYY are likely to be of the same type.
    Hence conclude that "U7ited States" is of type
    Location

34
Title Assignment
  • Newswire titles are uppercase
  • Mark up entities in title by matching or
    partially matching entities found in text

35
Mikheev System Results
36
Use of Gazetteers
37
Mikheev - Conclusions
  • Locations suffer without gazetteers, but the
    addition of a small number of certain entries
    (e.g. country names) makes a big difference.
  • Main point: relatively small gazetteers are
    sufficient to give good precision and recall.
  • Experiments were based on one particular text type
    (journalistic English with mixed case)

38
Bikel 99 - Trainable Systems: Hidden Markov Models
  • An HMM is a probabilistic model based on a sequence
    of events, in this case words.
  • Whether a word is part of a name is an event with
    an estimable probability that can be determined
    from a training corpus.
  • With HMM we assume that there is an underlying
    probabilistic FSM that changes state with each
    input event.
  • Probability that a word is part of a name is
    conditional also on the state of the machine.

39
Creating HMMs
  • Constructing an HMM depends upon
    • Having a good hidden state model
    • Having enough training data to estimate the
      probabilities of the state transitions given
      sequences of words.
  • When the recogniser is run, it computes the
    maximum likelihood path through the hidden state
    model, given the input word sequence.
  • Viterbi Algorithm finds the path.
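
The path search can be illustrated with a textbook Viterbi implementation for a plain first-order HMM. This is a generic sketch and not Bikel's model, which also conditions on the previous word. The two states and all probabilities in the example are invented toy numbers.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for obs under a first-order HMM (log space)."""
    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
             for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            row[s] = (score + math.log(emit_p[s][o]), prev)
        best.append(row)
    # Backtrace from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]

# Toy parameters (assumed for illustration only; all probabilities non-zero).
states = ["NAME", "OTHER"]
start_p = {"NAME": 0.3, "OTHER": 0.7}
trans_p = {"NAME": {"NAME": 0.4, "OTHER": 0.6},
           "OTHER": {"NAME": 0.5, "OTHER": 0.5}}
emit_p = {"NAME": {"Mr.": 0.05, "Jones": 0.9, "eats": 0.05},
          "OTHER": {"Mr.": 0.5, "Jones": 0.05, "eats": 0.45}}
path = viterbi(["Mr.", "Jones", "eats"], states, start_p, trans_p, emit_p)
print(path)  # ['OTHER', 'NAME', 'OTHER']
```

Working in log space avoids numerical underflow on long sentences. A real system also needs smoothing so that no probability is exactly zero.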

40
The HMM for NER (Bikel)
  • [Figure: state diagram linking start-of-sentence,
    person, organisation, (other name classes),
    not-a-name and end-of-sentence]

41
Name Class Categories
  • Eight name classes, including not-a-name (NAN).
  • Within each category, use a bigram language model
    (number of states in each category is |V|).
  • Aim, for a given sentence, is to find the most
    likely sequence of name-classes (NC) given a
    sequence of words (W):
    NC* = argmax over NC of P(NC | W)

42
Model of Word Production
  • Select a name class NC, conditioning on the
    previous name-class (NC-1) and previous word w-1.
  • Generate the first word inside NC, conditioning
    on the NC and NC-1.
  • Generate all subsequent words inside NC, where
    each subsequent word is conditioned on its
    immediate predecessor (using standard bigram
    language model).

43
Example
  • Sentence: Mr. Jones eats.
  • According to MUC-6 rules, the correct labelling
    is: Mr. <ENAMEX TYPE="PERSON">Jones</ENAMEX> eats.
  • Name classes: Mr. (NAN), Jones (PERSON),
    eats. (NAN)
  • According to the model, the likelihood of this
    word/name-class sequence is given by the
    following expression (which should turn out to be
    the most likely, given sufficient training).

44
Likelihood Under the Model
  • Pr(NOT-A-NAME | START-OF-SENTENCE, end) ×
  • Pr(Mr. | NOT-A-NAME, START-OF-SENTENCE) ×
  • Pr(end | Mr., NOT-A-NAME) ×
  • Pr(PERSON | NOT-A-NAME, Mr.) ×
  • Pr(Jones | PERSON, NOT-A-NAME) ×
  • Pr(end | Jones, PERSON) ×
  • Pr(NOT-A-NAME | PERSON, Jones) ×
  • Pr(eats | NOT-A-NAME, PERSON) ×
  • Pr(. | eats, NOT-A-NAME) ×
  • Pr(end | ., NOT-A-NAME) ×
  • Pr(END-OF-SENTENCE | NOT-A-NAME, .)
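
The way this product is assembled from a labelled sentence can be sketched in code. The function below only enumerates the factors as strings and does not compute probabilities. It assumes that consecutive words with the same name class belong to one segment, so it cannot represent two adjacent names of the same class. The function name is invented for the example.

```python
from itertools import groupby

def likelihood_factors(labelled):
    """List the factors of the model likelihood for a labelled sentence.

    labelled: list of (word, name_class) pairs.
    """
    factors = []
    prev_nc, prev_word = "START-OF-SENTENCE", "end"
    for nc, group in groupby(labelled, key=lambda pair: pair[1]):
        words = [w for w, _ in group]
        # 1. Choose the name class given the previous class and previous word.
        factors.append(f"Pr({nc} | {prev_nc}, {prev_word})")
        # 2. Generate the first word given the class and the previous class.
        factors.append(f"Pr({words[0]} | {nc}, {prev_nc})")
        # 3. Generate later words with a bigram model inside the class.
        for w_prev, w in zip(words, words[1:]):
            factors.append(f"Pr({w} | {w_prev}, {nc})")
        # 4. Close the class with the distinguished end word.
        factors.append(f"Pr(end | {words[-1]}, {nc})")
        prev_nc, prev_word = nc, words[-1]
    factors.append(f"Pr(END-OF-SENTENCE | {prev_nc}, {prev_word})")
    return factors

sentence = [("Mr.", "NOT-A-NAME"), ("Jones", "PERSON"),
            ("eats", "NOT-A-NAME"), (".", "NOT-A-NAME")]
for factor in likelihood_factors(sentence):
    print(factor)
```

For "Mr. Jones eats." this prints eleven factors in the same order as the list above.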

45
Words and Word Features
  • Word features are a language-dependent part of
    the model
  • Feature: example (intuition)
    • twoDigitNum: 90 (two-digit year)
    • fourDigitNum: 1990 (four-digit year)
    • containsDigitAndAlpha: A8956-67 (product code)
    • containsDigitAndDash: 09-96 (date)
    • containsDigitAndSlash: 11/9/89 (date)
    • containsDigitAndComma: 23,000.00 (monetary amount)
    • containsDigitAndPeriod: 1.00 (monetary amount)
    • allCaps: BBN (organization)
    • capPeriod: M. (person name initial)
    • initCap: Sally (capitalized word)
    • other: , (punctuation marks, all other words)
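
A word-feature function covering only the features listed above might look like the following sketch. The order in which the tests are applied is an assumption, chosen so that each example above receives the feature shown next to it. Any feature of the original system not listed above is omitted.

```python
import re

def word_feature(word):
    """Return a single feature name for a token (first matching test wins)."""
    has_digit = any(ch.isdigit() for ch in word)
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word):
        return "fourDigitNum"
    if has_digit:
        if any(ch.isalpha() for ch in word):
            return "containsDigitAndAlpha"
        for symbol, name in (("-", "containsDigitAndDash"),
                             ("/", "containsDigitAndSlash"),
                             (",", "containsDigitAndComma"),
                             (".", "containsDigitAndPeriod")):
            if symbol in word:
                return name
    if word.isalpha() and word.isupper():
        return "allCaps"
    if re.fullmatch(r"[A-Z]\.", word):
        return "capPeriod"
    if word[:1].isupper():
        return "initCap"
    return "other"
```

Each token is paired with its feature to form the <w,f> pairs used by the model, so an unseen token such as a new product code can still be treated like other tokens of the same shape.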

46
Three Sub Models
  • Model to generate a name class
  • Model to generate first word
  • Model to generate subsequent words

47
How the Model Works
Model to generate a name class
Model to generate first word
Model to generate subsequent words
48
Generate First Word in NC
  • Likelihood = P(transition from NC-1 to NC) ×
    P(generate word w)
    = P(NC | NC-1, w-1) × P(<w,f>first | NC, NC-1)
  • N.B. Underlying intuitions
    • Transition to NC is strongly influenced by the
      previous word and the previous name class
    • First word of a name class is strongly
      influenced by the preceding name class.

49
Generate Subsequent Words in Name Class
  • Here there are two cases
    • Normal: likelihood of w following w-1 within a
      particular NC: P(<w,f> | <w,f>-1, NC)
    • Final word: likelihood of w in NC being the
      final word of the class. This uses a
      distinguished end word with feature "other":
      P(<end,other> | <w,f>final, NC)

50
Estimating Probabilities
  • P(NC | NC-1, w-1) = c(NC, NC-1, w-1) / c(NC-1, w-1)
  • P(<w,f>first | NC, NC-1)
    = c(<w,f>first, NC, NC-1) / c(NC, NC-1)
  • P(<w,f> | <w,f>-1, NC)
    = c(<w,f>, <w,f>-1, NC) / c(<w,f>-1, NC)
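
These are plain relative-frequency estimates. A minimal sketch for the first one, P(NC | NC-1, w-1), is shown below. The other two follow the same pattern with different event tuples. The function name and the toy events in the usage note are invented.

```python
from collections import Counter

def estimate_name_class_probs(events):
    """Estimate P(NC | NC-1, w-1) by counting.

    events: iterable of (nc, prev_nc, prev_word) tuples seen in training.
    Returns a dict mapping each seen event to its relative frequency.
    """
    events = list(events)
    joint = Counter(events)                        # c(NC, NC-1, w-1)
    context = Counter(e[1:] for e in events)       # c(NC-1, w-1)
    return {e: joint[e] / context[e[1:]] for e in joint}
```

If the context (NOT-A-NAME, "Mr.") is followed three times by PERSON and once by ORGANIZATION, the estimates are 0.75 and 0.25. Events never seen in training get no entry at all, which is why backoff and smoothing are needed.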

51
Backoff Models and Smoothing
  • System knows about all words/bigrams encountered
    during training.
  • However, in real applications, unknown words are
    also encountered, and mapped to _UNK_
  • System must therefore handle bigram probabilities
    involving _UNK_: as first word, as second word,
    or as both.

52
Constructing Unknown Word Model
  • Based on "held out" data.
  • Divide data into 2 halves.
  • Use first half to create vocabulary, and train on
    second half.
  • When performing name recognition, the unknown
    word model is used whenever either or both words
    of a bigram is unknown.
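
The held-out idea can be sketched as follows: build the vocabulary from the first half, then rewrite the second half so that every out-of-vocabulary word becomes _UNK_. Counts collected from the rewritten half then include bigrams that involve _UNK_. The function name is an assumption, and the sketch covers only the mapping step, not training.

```python
def map_unknowns(first_half, second_half, unk="_UNK_"):
    """Replace words of second_half not seen in first_half with unk.

    Both arguments are lists of sentences, each a list of tokens.
    """
    vocab = {w for sentence in first_half for w in sentence}
    return [[w if w in vocab else unk for w in sentence]
            for sentence in second_half]
```

For example, with first half [["Mr.", "Jones", "eats"]] and second half [["Mr.", "Smith", "eats"]], the second half becomes [["Mr.", "_UNK_", "eats"]], giving the bigrams (Mr., _UNK_) and (_UNK_, eats) to train on.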

53
Backoff Strategy
  • However, even with the unknown word model (UWM),
    it is possible to be faced with a bigram that
    has never been encountered. In this case a
    backoff strategy is used.
  • Underlying such a strategy is a series of
    fallback models.
  • Data for successive members of the series are
    easier to obtain, but of lower quality.

54
Backoff Models for Name Class Bigrams
  • P(NC | NC-1, w-1)
  • P(NC | NC-1)
  • P(NC)
  • 1 / (number of name classes)

55
Backoff Weighting
  • The weight for each backoff model is computed on
    the fly
  • If computing P(X | Y), assign weight λ to the
    direct estimate and a weight (1 - λ) to the
    backoff model, where
    λ = (1 - old c(Y) / c(Y)) × 1 / (1 + (unique outcomes of Y) / c(Y))
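
The weighting can be made concrete with a small numeric sketch. It assumes the formula reads λ = (1 - old c(Y)/c(Y)) × 1/(1 + unique outcomes of Y / c(Y)), which is a reconstruction from the slide text. It treats "old c(Y)" as a plain input number, with a default of 0 chosen only for the example. Both function names are invented.

```python
def backoff_weight(c_y, unique_outcomes, old_c_y=0):
    """Weight lambda given to the direct estimate of P(X | Y).

    c_y: number of times context Y was seen.
    unique_outcomes: number of distinct X observed after Y.
    old_c_y: the "old c(Y)" term of the formula (assumed 0 by default).
    """
    if c_y == 0:
        return 0.0  # context never seen: rely entirely on the backoff model
    return (1 - old_c_y / c_y) * (1 / (1 + unique_outcomes / c_y))

def interpolate(direct, backoff, lam):
    """Mix the direct estimate with the backoff estimate."""
    return lam * direct + (1 - lam) * backoff
```

A context seen 10 times with 2 distinct outcomes gets λ = 1 / 1.2 ≈ 0.83, while a context seen 10 times with 10 distinct outcomes gets λ = 0.5. The more varied the outcomes relative to the count, the less the direct estimate is trusted.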

56
Results of Evaluation
57
How Much Data is Needed?
  • Performance increase of 1.5 F-points for each
    doubling in the quantity of training data.
  • 1.2 million words of training data corresponds to
    200 hours of broadcast news or 1777 Wall Street
    Journal articles (about 20 person-weeks)
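
As a back-of-the-envelope illustration of the doubling rule, the expected gain between two corpus sizes is 1.5 × log2(new size / old size). The helper below simply extrapolates that stated trend and is not a result from the paper. Its name and default value are assumptions.

```python
import math

def projected_gain(old_words, new_words, points_per_doubling=1.5):
    """F-measure points gained when training data grows from old to new size."""
    return points_per_doubling * math.log2(new_words / old_words)
```

Going from 150,000 to 1.2 million words is three doublings, so the rule suggests a gain of about 4.5 F-points. The trend need not hold outside the range that was measured.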

58
Bikel - Conclusion
  • Old fashioned techniques
  • Simple probabilistic
  • Near human performance
  • Higher F-measure than any other system when case
    information is missing.