Title: CSA405: Advanced Topics in NLP
1 CSA405 Advanced Topics in NLP
- Information Extraction II
- Named Entity Recognition
2 Sources
- D. Appelt and D. Israel, Introduction to IE Technology, tutorial given at IJCAI-99
- Mikheev et al. 1999. Named Entity Recognition without Gazetteers. EACL 1999
- Daniel M. Bikel, Richard Schwartz and Ralph M. Weischedel. 1999. An Algorithm that Learns What's in a Name
3 Outline
- NER: what is involved
- The MUC-6/7 task definition
- Two approaches:
- Mikheev et al. 1999 (rule-based)
- Bikel et al. 1999 (NER based on HMMs)
4 The Named Entity Recognition Task
- Named Entity task introduced as part of MUC-6 (1995) and continued at MUC-7 (1998)
- Different kinds of named entity:
- temporal expressions
- numeric expressions
- name expressions
5 Temporal Expressions (TIMEX tag)
- DATE: complete or partial date expression
- TIME: complete or partial expression of time of day
- Absolute temporal expressions only, e.g.
- "Monday"
- "10th of October"
- but not "first day of the month"
6 More TIMEX Examples
- "twelve o'clock noon": <TIMEX TYPE="TIME">twelve o'clock noon</TIMEX>
- "January 1990": <TIMEX TYPE="DATE">January 1990</TIMEX>
- "third quarter of 1991": <TIMEX TYPE="DATE">third quarter of 1991</TIMEX>
- "the fourth quarter ended Sept. 30": <TIMEX TYPE="DATE">the fourth quarter ended Sept. 30</TIMEX>
7 Time Expressions - Difficulties
- Problems interpreting some task specs: relative time expressions are not to be tagged, but any absolute times expressed as part of the entire expression are to be tagged
- this <TIMEX TYPE="DATE">June</TIMEX>
- thirty days before the end of the year (no markup)
- the end of <TIMEX TYPE="DATE">1991</TIMEX>
8 Temporal Expressions
- DATE/TIME distinction relatively straightforward to handle
- Can typically be captured by regular expressions (see the sketch below)
- Need to handle missing elements properly, e.g. Jan 21st → Jan 21st 2002
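A minimal sketch, not from the lecture, of how a regular expression might capture simple DATE expressions and fill in a missing year; the pattern and the default document year are illustrative assumptions:

import re

MONTHS = r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
# Matches e.g. "Jan 21st", "January 21st 2002"
DATE_RE = re.compile(
    rf"\b{MONTHS}\.?\s+\d{{1,2}}(st|nd|rd|th)?(\s+\d{{4}})?\b", re.IGNORECASE
)

def tag_dates(text, default_year="2002"):
    # Wrap matches in TIMEX tags; append a default year if no year is present.
    def repl(m):
        span = m.group(0)
        if not re.search(r"\d{4}", span):        # missing year element
            span = f"{span} {default_year}"
        return f'<TIMEX TYPE="DATE">{span}</TIMEX>'
    return DATE_RE.sub(repl, text)

print(tag_dates("The deal closed on Jan 21st."))
# The deal closed on <TIMEX TYPE="DATE">Jan 21st 2002</TIMEX>.

In practice the missing year would be resolved from the document's dateline rather than a fixed default.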
9 Number Expressions (NUMEX)
- Monetary expressions
- Percentages
- Numbers may be expressed in either numeric or alphabetic form
- Categorised as MONEY or PERCENT via the TYPE attribute
10 NUMEX Tag
- The entire string is to be tagged: <NUMEX TYPE="MONEY">20 million New Pesos</NUMEX>
- Modifying words are to be excluded from the NUMEX tag: over <NUMEX TYPE="MONEY">$90,000</NUMEX>
- Nested tags allowed: <NUMEX TYPE="MONEY"><ENAMEX TYPE="LOCATION">US</ENAMEX>$43.6 million</NUMEX>
- Numeric expressions that do not use currency/percentage terms are not to be tagged: 12 points (no markup)
11 NUMEX Examples
- "about 5%": about <NUMEX TYPE="PERCENT">5%</NUMEX>
- "over $90,000": over <NUMEX TYPE="MONEY">$90,000</NUMEX>
- "several million dollars": <NUMEX TYPE="MONEY" ALT="million dollars">several million dollars</NUMEX>
- "US$43.6 million": <NUMEX TYPE="MONEY"><ENAMEX TYPE="LOCATION">US</ENAMEX>$43.6 million</NUMEX>
12 Name Expressions
- Two related subtasks:
- Identification: which piece of text
- Classification: what kind of name
13 Name Recognition: Identification and Classification
- "The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."
- Locations
- Persons
- Organizations
14 Annotator Guidelines
15 MUC-6 Output Format
- Output in terms of SGML markup: <ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX>
- ENAMEX is the tag; TYPE is its attribute
16 Name Expressions: Problems
- Recognition
- sentence-initial uppercase is unreliable
- Delimitation
- conjunctions: to bind or not to bind? Victoria and Albert (Museum)
- Type Ambiguity
- Persons versus Organisations versus Locations, e.g. J. Arthur Rank, Washington
17 Example 2
- MATSUSHITA ELECTRIC INDUSTRIAL CO. HAS REACHED AGREEMENT
- IF ALL GOES WELL, MATSUSHITA AND ROBERT BOSCH WILL
- VICTOR CO. OF JAPAN (JVC) AND SONY CORP.
- IN A FACTORY OF BLAUPUNKT WERKE, A ROBERT BOSCH SUBSIDIARY,
- TOUCH PANEL SYSTEMS, CAPITALIZED AT 50 MILLION YEN, IS OWNED
- MATSUSHITA EILL DECIDE ON THE PRODUCTION SCALE.
18 Example 2
- EASY: keyword present
- EASY: shortened form is computable
- EASY: acronym is computable
- HARD: difficult to tell that ROBERT BOSCH is an organisation name
- HARD: cf. 4
- HARD: spelling error ("EILL") difficult to spot
19 Name Expressions: Sources of Information
- Occurrence specific
- capitalisation, presence of immediately surrounding clue words (e.g. Mr.)
- Document specific
- previous mention of a name (cf. symbol tables)
- same document, same collection
- External
- gazetteers, e.g. person names, place names, zip codes
20 Gazetteers
- System that recognises only entities stored in its lists (gazetteers); a minimal lookup sketch follows below
- Advantages: simple, fast, language independent, easy to retarget (just create lists)
- Disadvantages: impossible to enumerate all names, cannot deal with name variants, cannot resolve ambiguity
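A minimal sketch of a purely list-driven recogniser using longest-match-first gazetteer lookup; the entries and function name are made up for illustration:

GAZETTEER = {
    ("New", "York"): "LOCATION",
    ("International", "Business", "Machines"): "ORGANIZATION",
    ("John", "Smith"): "PERSON",
}
MAX_LEN = max(len(k) for k in GAZETTEER)

def tag_with_gazetteer(tokens):
    # Return (entity text, type) pairs for every gazetteer entry found.
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):   # longest match first
            span = tuple(tokens[i:i + n])
            if span in GAZETTEER:
                entities.append((" ".join(span), GAZETTEER[span]))
                i += n
                break
        else:
            i += 1                    # no entry starts here; move on
    return entities

print(tag_with_gazetteer("John Smith joined International Business Machines".split()))
# [('John Smith', 'PERSON'), ('International Business Machines', 'ORGANIZATION')]

Anything not in the lists is simply missed, which is exactly the weakness listed above.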
21 Gazetteers
- Limited availability
- Maintenance (organisations change)
- Criteria for building effective gazetteers unclear, e.g. size, but:
- Better to use small gazetteers of well-known names than large ones of low-frequency names (Mikheev et al. 1999)
22 Sources for Creation of Gazetteers
- Yellow pages for person and organisation names
- US GEOnet Names Server (GNS) data: 3.9 million locations with 5.37 million names, http://earth-info.nga.mil/gns/html/
- UN site: http://unstats.un.org/unsd/citydata
- Automatic collection from annotated training data
23 Recognising Names
- Two main approaches:
- Rule-based systems
- usually based on finite-state (FS) methods
- Automatically trained systems
- usually based on HMMs
- Rule-based systems tend to have a performance advantage
24 Mikheev et al. 1999
- How important are gazetteers?
- Is it important that they are big?
- If gazetteers are important but their size isn't, what are the criteria for building them?
25 Mikheev Experiment
- Learned lists
- training data (200 articles from MUC-7)
- 1228 persons, 809 organisations, 770 locations
- Common lists
- CIA World Factbook
- 33K organisations, 27K persons, 5K locations
- Combined
26 Mikheev: Results of Experiment
27 Mikheev's System
- Hybrid approach, c. 100 rules
- Rules make heavy use of capitalisation
- Rules based on internal structure which reveals the type, e.g. "Word Word plc", "Prof. Word Word"
- Modest but well-chosen gazetteer: 5000 company names, 1000 human names, 20,000 locations; 2-3 weeks of effort
28 Mikheev et al. (1999) Architecture
1. Sure-fire Rules
2. Partial Match 1
3. Rule Relaxation
4. Partial Match 2
5. Title Assignment
29 Sure-Fire Rules
- Fire when a possible candidate expression is surrounded by a suggestive context (an illustrative sketch follows below)
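Some illustrative sure-fire-style rules in the spirit of this stage; the patterns and labels below are assumptions for the sketch, not Mikheev et al.'s actual rule set:

import re

CAP = r"(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"    # one or more capitalised words

SURE_FIRE_RULES = [
    # "Adam Kluver Ltd" / "Foo plc"  ->  ORGANIZATION
    (re.compile(rf"({CAP})\s+(?:Ltd\.?|plc|Inc\.?|Corp\.?)"), "ORGANIZATION"),
    # "Mr. John Smith" / "Prof. Ada Lovelace"  ->  PERSON
    (re.compile(rf"(?:Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.)\s+({CAP})"), "PERSON"),
    # "the city of Edinburgh"  ->  LOCATION
    (re.compile(rf"(?:city|town|province)\s+of\s+({CAP})"), "LOCATION"),
]

def apply_sure_fire(text):
    # A candidate only fires when its suggestive context is present.
    hits = []
    for pattern, label in SURE_FIRE_RULES:
        for m in pattern.finditer(text):
            hits.append((m.group(1), label))
    return hits

print(apply_sure_fire("A spokesman for Adam Kluver Ltd said Mr. John Smith agreed."))
# [('Adam Kluver', 'ORGANIZATION'), ('John Smith', 'PERSON')]

The sketch returns only the capitalised head of the match; the full MUC markup conventions (e.g. whether "Ltd" is inside the tag) are not modelled here.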
30 Partial Match 1
- Collect all named entities already identified, e.g. Adam Kluver Ltd
- Generate all subsequences: Adam, Adam Kluver, Kluver, Kluver Ltd, Ltd
- Check for occurrences of these subsequences and mark them as possible items of the same class as the original named entity (see the sketch below)
- Check against a pre-trained maximum entropy model
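A small sketch of the subsequence-generation step; the helper name is hypothetical:

def subsequences(name):
    # All contiguous sub-sequences of an already-identified entity,
    # excluding the full name itself (it is already tagged).
    tokens = name.split()
    subs = set()
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            subs.add(" ".join(tokens[i:j]))
    subs.discard(name)
    return subs

print(sorted(subsequences("Adam Kluver Ltd")))
# ['Adam', 'Adam Kluver', 'Kluver', 'Kluver Ltd', 'Ltd']

Later occurrences of any of these strings in the document become candidates of the same class as the original entity, pending the maximum entropy check.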
31 Maximum Entropy Model
- The model takes into account contextual information for named entities:
- sentence position
- whether the word exists in lowercase in general
- whether it is used in lowercase elsewhere in the same document, etc.
- These features are passed to the model as attributes of the partially matched words (see the sketch below)
- If the model gives a positive answer for a partial match, the system makes a definite assignment
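A sketch of the kind of attributes that might be passed to the model for a partially matched word; the function and feature names are illustrative assumptions, not the paper's actual feature set:

def partial_match_features(word, position_in_sentence, document_tokens):
    # Contextual attributes for a partially matched word: sentence position,
    # lowercase use elsewhere in the same document, and capitalisation.
    # (A "lowercase in a general lexicon" feature is omitted for brevity.)
    return {
        "sentence_initial": position_in_sentence == 0,
        "seen_lowercase_in_document": word.lower() in document_tokens,
        "is_capitalised": word[:1].isupper(),
    }

doc = "The kluver group met . Kluver Ltd announced profits .".split()
print(partial_match_features("Kluver", 0, doc))
# {'sentence_initial': True, 'seen_lowercase_in_document': True, 'is_capitalised': True}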
32 Rule Relaxation
- More relaxed contextual constraints
- Make use of information from existing markup and from previous stages to:
- resolve conjunctions within named entities, e.g. China Import and Export Co.
- resolve ambiguity, e.g. Murdoch's News Corp
33 Partial Match 2
- Handle single-word names not covered by Partial Match 1 (e.g. Hughes, from Hughes Communication Ltd)
- "U7ited States and Russia": if there is evidence for two conjoined items and one of them has already been tagged Location, then XXX and YYY are likely to be of the same type; hence conclude that U7ited States is of type Location
34 Title Assignment
- Newswire titles are uppercase
- Mark up entities in the title by matching or partially matching entities found in the text
35 Mikheev System Results
36 Use of Gazetteers
37 Mikheev - Conclusions
- Locations suffer without gazetteers, but the addition of a small number of certain entries (e.g. country names) makes a big difference
- Main point: relatively small gazetteers are sufficient to give good precision and recall
- Experiments were on a single text type (journalistic English with mixed case)
38 Bikel 99 - Trainable Systems: Hidden Markov Models
- An HMM is a probabilistic model based on a sequence of events, in this case words
- Whether a word is part of a name is an event with an estimable probability that can be determined from a training corpus
- With an HMM we assume that there is an underlying probabilistic FSM that changes state with each input event
- The probability that a word is part of a name is conditional also on the state of the machine
39 Creating HMMs
- Constructing an HMM depends upon:
- having a good hidden state model
- having enough training data to estimate the probabilities of the state transitions given sequences of words
- When the recogniser is run, it computes the maximum likelihood path through the hidden state model, given the input word sequence
- The Viterbi algorithm finds this path (a generic sketch follows below)
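A generic Viterbi sketch over a toy two-state model; the states mirror the "Mr. Jones eats" example used later in the deck, but the probabilities are invented for illustration and are not Bikel's trained parameters:

def viterbi(words, states, start_p, trans_p, emit_p):
    # V[t][s] = (probability of the best path ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-6), None) for s in states}]
    for t in range(1, len(words)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            prob = V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s].get(words[t], 1e-6)
            V[t][s] = (prob, prev)
    # Trace back the maximum likelihood state path.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

states = ["PERSON", "NOT-A-NAME"]
start_p = {"PERSON": 0.2, "NOT-A-NAME": 0.8}
trans_p = {"PERSON": {"PERSON": 0.5, "NOT-A-NAME": 0.5},
           "NOT-A-NAME": {"PERSON": 0.3, "NOT-A-NAME": 0.7}}
emit_p = {"PERSON": {"Jones": 0.4}, "NOT-A-NAME": {"Mr.": 0.3, "eats": 0.2}}
print(viterbi(["Mr.", "Jones", "eats"], states, start_p, trans_p, emit_p))
# ['NOT-A-NAME', 'PERSON', 'NOT-A-NAME']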
40 The HMM for NER (Bikel)
- (Diagram: HMM states include start-of-sentence, person, organisation, (other name classes), not-a-name, and end-of-sentence)
41 Name Class Categories
- Eight name classes, including not-a-name (NAN)
- Within each name class, use a bigram language model (the number of states in each class is |V|, the vocabulary size)
- The aim, for a given sentence, is to find the most likely sequence of name classes (NC) given a sequence of words (W):
- NC* = argmax P(NC | W)
42 Model of Word Production
- Select a name class NC, conditioning on the previous name class (NC-1) and the previous word (w-1)
- Generate the first word inside NC, conditioning on NC and NC-1
- Generate all subsequent words inside NC, where each subsequent word is conditioned on its immediate predecessor (using a standard bigram language model)
43 Example
- Sentence: "Mr. Jones eats."
- According to the MUC-6 rules, the correct labelling is: Mr. <ENAMEX TYPE="PERSON">Jones</ENAMEX> eats. (NAN PERSON NAN)
- According to the model, the likelihood of this word/name-class sequence is given by the following expression (which should turn out to be the most likely, given sufficient training)
44 Likelihood Under the Model
- Pr(NOT-A-NAME | START-OF-SENTENCE, end)
- × Pr(Mr. | NOT-A-NAME, START-OF-SENTENCE)
- × Pr(end | Mr., NOT-A-NAME)
- × Pr(PERSON | NOT-A-NAME, Mr.)
- × Pr(Jones | PERSON, NOT-A-NAME)
- × Pr(end | Jones, PERSON)
- × Pr(NOT-A-NAME | PERSON, Jones)
- × Pr(eats | NOT-A-NAME, PERSON)
- × Pr(. | eats, NOT-A-NAME)
- × Pr(end | ., NOT-A-NAME)
- × Pr(END-OF-SENTENCE | NOT-A-NAME, .)
45 Words and Word Features
- Word features are a language dependent part of the model (a classification sketch follows the list):
- twoDigitNum (90): two-digit year
- fourDigitNum (1990): four-digit year
- containsDigitAndAlpha (A8956-67): product code
- containsDigitAndDash (09-96): date
- containsDigitAndSlash (11/9/89): date
- containsDigitAndComma (23,000.00): monetary amount
- containsDigitAndPeriod (1.00): monetary amount
- allCaps (BBN): organization
- capPeriod (M.): person name initial
- initCap (Sally): capitalized word
- other (,): punctuation, all other words
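A sketch of how such word features could be computed; the feature inventory follows the slide, but the regular expressions and the ordering of the checks are assumptions:

import re

def word_feature(word):
    # Order of checks matters; this ordering is an assumption, not the paper's.
    if re.fullmatch(r"\d{2}", word):                      return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word):                      return "fourDigitNum"
    if re.fullmatch(r"\d{1,3}(,\d{3})+(\.\d+)?", word):   return "containsDigitAndComma"
    if re.fullmatch(r"\d+\.\d+", word):                   return "containsDigitAndPeriod"
    if re.fullmatch(r"\d+(/\d+)+", word):                 return "containsDigitAndSlash"
    if re.fullmatch(r"\d+(-\d+)+", word):                 return "containsDigitAndDash"
    if re.search(r"\d", word) and re.search(r"[A-Za-z]", word):
        return "containsDigitAndAlpha"
    if re.fullmatch(r"[A-Z]+", word):                     return "allCaps"
    if re.fullmatch(r"[A-Z]\.", word):                    return "capPeriod"
    if re.fullmatch(r"[A-Z][a-z]+", word):                return "initCap"
    return "other"

for w in ["90", "1990", "A8956-67", "09-96", "11/9/89",
          "23,000.00", "1.00", "BBN", "M.", "Sally", ","]:
    print(w, "->", word_feature(w))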
46 Three Sub-Models
- Model to generate a name class
- Model to generate first word
- Model to generate subsequent words
47 How the Model Works
- (Diagram: the three sub-models applied in sequence: generate a name class, generate its first word, generate subsequent words)
48 Generate First Word in NC
- Likelihood = P(transition from NC-1 to NC) × P(generate first word w), i.e. P(NC | NC-1, w-1) × P(<w,f>_first | NC, NC-1)
- N.B. underlying intuitions:
- the transition to NC is strongly influenced by the previous word and the previous name class
- the first word of a name class is strongly influenced by the preceding name class
49 Generate Subsequent Words in Name Class
- Here there are two cases:
- Normal: likelihood of w following w-1 within a particular NC: P(<w,f> | <w,f>-1, NC)
- Final word: likelihood of w in NC being the final word of the class; this uses a distinguished "end" word with feature "other": P(<end, other> | <w,f>_final, NC)
50 Estimating Probabilities
- P(NC | NC-1, w-1) = c(NC, NC-1, w-1) / c(NC-1, w-1)
- P(<w,f>_first | NC, NC-1) = c(<w,f>_first, NC, NC-1) / c(NC, NC-1)
- P(<w,f> | <w,f>-1, NC) = c(<w,f>, <w,f>-1, NC) / c(<w,f>-1, NC)
- (a toy count-based sketch follows below)
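A minimal sketch of the maximum likelihood estimate for the first of these quantities, with toy counts standing in for counts collected from annotated training data:

from collections import Counter

# Toy counts: c(NC, NC-1, w-1) and c(NC-1, w-1)
count_nc_prev = Counter({("PERSON", "NOT-A-NAME", "Mr."): 90})
count_prev = Counter({("NOT-A-NAME", "Mr."): 100})

def p_name_class(nc, prev_nc, prev_w):
    # P(NC | NC-1, w-1) = c(NC, NC-1, w-1) / c(NC-1, w-1)
    denom = count_prev[(prev_nc, prev_w)]
    return count_nc_prev[(nc, prev_nc, prev_w)] / denom if denom else 0.0

print(p_name_class("PERSON", "NOT-A-NAME", "Mr."))   # 0.9

The two word-generation probabilities are estimated in exactly the same way, as ratios of the corresponding event counts.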
51 Backoff Models and Smoothing
- The system knows about all words/bigrams encountered during training
- However, in real applications, unknown words are also encountered; these are mapped to _UNK_
- The system must therefore handle bigram probabilities involving _UNK_: as first word, as second word, as both
52 Constructing the Unknown Word Model
- Based on "held out" data
- Divide the training data into two halves
- Use the first half to create the vocabulary, and train on the second half
- When performing name recognition, the unknown word model is used whenever either or both words of a bigram are unknown
53 Backoff Strategy
- However, even with the unknown word model, it is possible to be faced with a bigram that has never been encountered; in this case a backoff strategy is used
- Underlying such a strategy is a series of fallback models
- Data for successive members of the series are easier to obtain, but of lower quality
54 Backoff Models for Name Class Bigrams
- P(NC | NC-1, w-1) → P(NC | NC-1) → P(NC) → 1 / (number of name classes)
55 Backoff Weighting
- The weight for each backoff model is computed on the fly
- If computing P(X | Y), assign weight λ to the direct estimate and weight (1 - λ) to the backoff model, where λ = (1 - old c(Y)/c(Y)) × 1 / (1 + (unique outcomes of Y) / c(Y))
- (a small numeric sketch follows below)
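A sketch of the interpolation implied by this weighting, using the formula as reconstructed above; the numbers are toy values, and treating "old c(Y)" as the count seen by the model being backed off from is an assumption of this sketch:

def backoff_lambda(count_y, old_count_y, unique_outcomes_y):
    # lambda = (1 - old c(Y)/c(Y)) * 1 / (1 + unique outcomes of Y / c(Y))
    if count_y == 0:
        return 0.0        # no direct evidence: rely entirely on the backoff model
    discount = 1.0 - old_count_y / count_y
    return discount / (1.0 + unique_outcomes_y / count_y)

def smoothed(p_direct, p_backoff, count_y, old_count_y, unique_outcomes_y):
    # Weight lambda goes to the direct estimate, (1 - lambda) to the backoff model.
    lam = backoff_lambda(count_y, old_count_y, unique_outcomes_y)
    return lam * p_direct + (1.0 - lam) * p_backoff

print(smoothed(p_direct=0.9, p_backoff=0.3,
               count_y=100, old_count_y=40, unique_outcomes_y=10))

Each backoff level is smoothed against the next one in the chain, down to the uniform 1 / (number of name classes) estimate.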
56 Results of Evaluation
57 How Much Data is Needed?
- Performance increases by about 1.5 F-points for each doubling of the quantity of training data
- 1.2 million words of training data ≈ 200 hours of broadcast news, or 1777 Wall Street Journal articles ≈ 20 person-weeks
58 Bikel - Conclusion
- Old-fashioned techniques
- Simple probabilistic model
- Near-human performance
- Higher F-measure than any other system when case information is missing