Title: Predicting Accuracy of Extracting Information from Unstructured Text Collections
1. Predicting Accuracy of Extracting Information from Unstructured Text Collections
- Eugene Agichtein and Silviu Cucerzan
- Microsoft Research
2. Extracting and Managing Information in Text
- Text/document collections have varying properties: different languages, varying consistency, noise/errors
- Sources: blogs, news, alerts
- Targets: entities, relations, events
- A complex problem: usually many parameters, often tuning required
[Diagram: an extraction system maps document collections to extracted entities E1-E4; success is measured as extraction accuracy]
3. The Goal: Predict Extraction Accuracy
- Estimate the expected success of an IE system that relies on contextual patterns before:
  - running expensive experiments
  - tuning parameters
  - training the system
- Useful when adapting an IE system to:
  - a new task
  - a new document collection
  - a new language
4. Specific Extraction Tasks
- Named Entity Recognition (NER): Person, Location, Misc
  Example: "European champions Liverpool paved the way to the group stages of the Champions League taking a 3-1 lead over CSKA Sofia on Wednesday ... Gerard Houllier's men started the match in Sofia on fire with Steven Gerrard scoring ..."
- Relation Extraction (RE)
  Example: "Abraham Lincoln was born on Feb. 12, 1809, in a log cabin in Hardin (now Larue) County, Ky"

  BORN:  Who              When           Where
         Abraham Lincoln  Feb. 12, 1809  Hardin County, KY
5. Contextual Clues
- NER: left and right contexts around the entity
  "yesterday, Mrs [Clinton] told reporters the move to the East Room"
- RE: left, middle, and right contexts around the relation arguments
  "engineers [Orville and Wilbur Wright] built the first working [airplane] in 1903."
A sketch of context-window extraction follows.
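As a minimal sketch (ours, not from the talk), the context windows can be cut out of a token sequence; whitespace tokenization and hand-set span indices are simplifying assumptions:

    # Sketch: k-word context windows around an entity span, and the
    # left/middle/right contexts around two relation-argument spans.
    def entity_contexts(tokens, start, end, k=3):
        """Left and right contexts of the entity at tokens[start:end]."""
        return tokens[max(0, start - k):start], tokens[end:end + k]

    def relation_contexts(tokens, arg1, arg2, k=3):
        """Left, middle, right contexts around spans arg1, arg2 (arg1 first)."""
        left = tokens[max(0, arg1[0] - k):arg1[0]]
        middle = tokens[arg1[1]:arg2[0]]
        right = tokens[arg2[1]:arg2[1] + k]
        return left, middle, right

    tokens = ("engineers Orville and Wilbur Wright built the "
              "first working airplane in 1903 .").split()
    # "Orville and Wilbur Wright" = tokens[1:5], "airplane" = tokens[9:10]
    print(relation_contexts(tokens, (1, 5), (9, 10)))
    # (['engineers'], ['built', 'the', 'first', 'working'], ['in', '1903', '.'])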
6. Approach: Language Modelling
- The presence of contextual clues for a task appears related to extraction difficulty
  - The more obvious the clues, the easier the task
- Clues can be modelled as the unexpectedness of a word
- Use Language Modelling (LM) techniques to quantify this intuition
7. Language Models (LM)
- An LM is a summary of the word distribution in text
- Can define unigram, bigram, trigram, ..., n-gram models
- More complex models exist
  - Distance-based, syntactic, word-class models
  - But these are not robust for the web, other languages, ...
- LMs are used in IR, ASR, text classification, and clustering
  - Query Clarity: predicting query performance (Cronen-Townsend et al., SIGIR 2002)
  - Context modelling for NER (Cucerzan et al., EMNLP 1999; Klein et al., CoNLL 2003)
8. Document Language Models
- A basic LM is a normalized word histogram for the document collection (see the sketch below)
- Unigram (word) models are commonly used
- Higher-order n-grams (bigrams, trigrams) can also be used

  word       freq
  the        0.0584
  to         0.0269
  and        0.0199
  said       0.0147
  ...        ...
  's         0.0018
  company    0.0014
  mrs        0.0003
  won        0.0003
  president  0.0003
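A minimal sketch of such a unigram background model, assuming whitespace tokenization (the function name and toy documents are illustrative):

    # Sketch: a unigram LM is the normalized word histogram of the collection.
    from collections import Counter

    def unigram_lm(documents):
        counts = Counter(w.lower() for doc in documents for w in doc.split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    docs = ["The company said Mrs Clinton won .",
            "The president said the company won ."]
    lm_bg = unigram_lm(docs)
    print(lm_bg["the"], lm_bg["company"])  # relative frequencies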
9. Context Language Models
NER: PERSON
- "Senator Christopher Dodd, D-Conn., named general chairman of the Democratic National Committee last week by President Bill Clinton, said it was premature to talk about lifting the U.S. embargo against Cuba"
- "Although the Clintons' health plan failed to make it through Congress this year, Mrs Clinton vowed continued support for the proposal."
- "A senior White House official, who accompanied Clinton, told reporters ..."
RE: INVENTIONS
- "By the fall of 1905, the Wright brothers' experimental period ended. With their third powered airplane, they now routinely made flights of several ..."
- "Against this backdrop, we see the Wright brothers' efforts to develop an airplane"
A sketch of pooling such contexts into LM_C follows.
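A minimal sketch of building the context model LM_C; single-token entity matching and whitespace tokenization are simplifying assumptions:

    # Sketch: pool k-word windows around every occurrence of the sample
    # entities into one context model LM_C (single-token matching only).
    from collections import Counter

    def context_lm(docs, entities, k=3):
        targets = {e.lower() for e in entities}
        counts = Counter()
        for doc in docs:
            toks = [w.lower() for w in doc.split()]
            for i, tok in enumerate(toks):
                if tok in targets:
                    counts.update(toks[max(0, i - k):i] + toks[i + 1:i + 1 + k])
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    docs = ["Mrs Clinton told reporters the move to the East Room",
            "a senior official who accompanied Clinton told reporters"]
    print(context_lm(docs, {"Clinton"}, k=2))  # 'told', 'reporters' dominate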
10. Key Observation
- If normally rare words consistently appear in the contexts around entities, the extraction task tends to be easier.
- The contexts for a task are an intrinsic property of the collection and the extraction task; they are not tied to a specific information extraction system.
11. Divergence Measures
- Cosine divergence (one minus the cosine similarity of the two probability vectors):
  1 - ( Σ_w LM_C(w) · LM_BG(w) ) / ( ||LM_C|| · ||LM_BG|| )
- Relative entropy (KL divergence):
  KL(LM_C || LM_BG) = Σ_w LM_C(w) · log( LM_C(w) / LM_BG(w) )
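A sketch of both measures in Python; the add-epsilon handling of words unseen in the background model is our assumption, since the slides do not specify a smoothing scheme:

    # Sketch: divergences between a context model p (LM_C) and a
    # background model q (LM_BG).
    import math

    def kl_divergence(p, q, eps=1e-9):
        return sum(pw * math.log(pw / q.get(w, eps))
                   for w, pw in p.items() if pw > 0)

    def cosine_divergence(p, q):
        dot = sum(pw * q.get(w, 0.0) for w, pw in p.items())
        norm_p = math.sqrt(sum(v * v for v in p.values()))
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        return 1.0 - dot / (norm_p * norm_q)

    p = {"told": 0.4, "reporters": 0.4, "the": 0.2}
    q = {"the": 0.6, "told": 0.2, "reporters": 0.1, "said": 0.1}
    print(kl_divergence(p, q), cosine_divergence(p, q))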
12. Interpreting Divergence: the Reference LM
- Need to calibrate the observed divergence
- Compute a Reference Model LM_R:
  - Pick K random non-stopwords R_i and compute the context language model around each R_i
  - Example: "the five-star Hotel Astoria is a symbol of elegance and comfort. With an unbeatable location in St Isaac's Square in the heart of St Petersburg, ..."
- Normalized KL (see the sketch below):
  NormalizedKL(LM_C) = KL(LM_C || LM_BG) / KL(LM_R || LM_BG)
- Normalization corrects for the bias introduced by small sample sizes
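A short sketch of the calibration step. The ratio form is our reading of the slides (it matches the absolute/normalized numbers reported later), not a formula quoted verbatim; kl_divergence is repeated from the earlier sketch:

    # Sketch: calibrate the context model's divergence by the reference
    # model's divergence from the background model.
    import math

    def kl_divergence(p, q, eps=1e-9):  # as in the earlier sketch
        return sum(pw * math.log(pw / q.get(w, eps))
                   for w, pw in p.items() if pw > 0)

    def normalized_kl(lm_c, lm_r, lm_bg):
        return kl_divergence(lm_c, lm_bg) / kl_divergence(lm_r, lm_bg)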
13. Reference LM (cont'd)
- LM_R converges to LM_BG for large sample sizes
- The divergence of LM_R is substantial for small samples
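A tiny synthetic simulation (ours, not the authors') of why this happens: the empirical distribution of a growing random sample converges to the distribution it is drawn from, so its KL divergence shrinks toward zero:

    # Tiny simulation: the empirical LM of n words sampled from a fixed
    # distribution drifts toward that distribution as n grows.
    import math, random
    from collections import Counter

    def kl(p, q, eps=1e-9):
        return sum(pw * math.log(pw / q.get(w, eps))
                   for w, pw in p.items() if pw > 0)

    random.seed(0)
    truth = {"the": 0.5, "said": 0.2, "company": 0.15, "won": 0.1, "mrs": 0.05}
    words, weights = zip(*truth.items())
    for n in (50, 500, 5000, 50000):
        sample = Counter(random.choices(words, weights=weights, k=n))
        empirical = {w: c / n for w, c in sample.items()}
        print(n, round(kl(empirical, truth), 4))  # divergence shrinks with n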
14. Predicting Extraction Accuracy: The Algorithm
- Start with a small sample S of entities (or relation tuples) to be extracted
- Find the occurrences of S in the given collection
- Compute LM_BG for the collection
- Compute LM_C for S and the collection
- Pick |S| random words R from LM_BG
- Compute the context LM for R → LM_R
- Compute KL(LM_C || LM_BG) and KL(LM_R || LM_BG)
- Return NormalizedKL(LM_C)
An end-to-end sketch follows.
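An end-to-end sketch of the procedure under the simplifying assumptions used throughout (whitespace tokenization, single-token entity matching, epsilon smoothing, ratio normalization, an illustrative stopword list). It is illustrative, not the authors' implementation:

    # End-to-end sketch of the prediction algorithm.
    import math
    import random
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", ".", ","}

    def normalize(counts):
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def background_lm(docs):
        return normalize(Counter(w.lower() for d in docs for w in d.split()))

    def context_lm(docs, targets, k=3):
        targets = {t.lower() for t in targets}
        counts = Counter()
        for d in docs:
            toks = [w.lower() for w in d.split()]
            for i, t in enumerate(toks):
                if t in targets:
                    counts.update(toks[max(0, i - k):i] + toks[i + 1:i + 1 + k])
        return normalize(counts)

    def kl(p, q, eps=1e-9):
        return sum(pw * math.log(pw / q.get(w, eps))
                   for w, pw in p.items() if pw > 0)

    def predict_ease(docs, sample, k=3, seed=0):
        """Higher scores suggest an easier extraction task (cf. slide 18)."""
        lm_bg = background_lm(docs)
        lm_c = context_lm(docs, sample, k)
        vocab = [w for w in lm_bg if w not in STOPWORDS]
        random.seed(seed)
        r = random.sample(vocab, min(len(sample), len(vocab)))
        lm_r = context_lm(docs, r, k)
        return kl(lm_c, lm_bg) / kl(lm_r, lm_bg)

On a real collection, docs would be the corpus documents and sample a handful of known entities or tuple arguments.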
15. Experimental Evaluation
- How to measure success?
  - Compare the predicted ease of a task vs. the observed extraction accuracy
- Extraction tasks: NER and RE
  - NER: datasets from the CoNLL 2002 and 2003 evaluations
  - RE: binary relations between NEs and generic phrases
16. Extraction Task Accuracy

NER accuracy by entity type and language:
  Entity   English  Spanish  Dutch
  LOC      90.21    79.84    79.19
  MISC     78.83    55.82    73.90
  ORG      81.86    79.69    69.48
  PER      91.47    86.83    78.83
  Overall  86.77    79.20    75.24

RE accuracy by relation:
  Relation  Strict  Partial  Task Difficulty
  BORN      0.73    0.96     Easy
  DIED      0.34    0.97     Easy
  INVENT    0.35    0.64     Hard
  WROTE     0.12    0.50     Hard
17. Document Collections

  Task  Collection                                  Size
  NER   Reuters RCV1, 1/100                         3,566,125 words
  NER   Reuters RCV1, 1/10                          35,639,471 words
  NER   EFE newswire articles, May 2000 (Spanish)   367,589 words
  NER   De Morgen articles (Dutch)                  268,705 words
  RE    Encarta document collection                 64,187,912 words

- Note that the Spanish and Dutch corpora are much smaller
18. Predicting NER Performance (English)

NER accuracy of five systems:
  Entity   Florian et al.  Chieu et al.  Klein et al.  Zhang et al.  Carreras et al.  Average
  LOC      91.15           91.12         89.98         89.54         89.26            90.21
  MISC     80.44           79.16         80.15         75.87         78.54            78.83
  ORG      84.67           84.32         80.48         80.46         79.41            81.86
  PER      93.85           93.44         90.72         90.44         88.93            91.47
  Overall  88.76           88.31         86.31         85.50         85.00            86.77

Absolute and normalized KL divergence (Reuters 1/10, 3-word contexts, stopwords discarded, averaged):
  Entity  Absolute  Normalized
  LOC     0.98      1.07
  MISC    1.29      1.40
  ORG     2.83      3.08
  PER     4.10      4.46
  RANDOM  0.92      0.92

- LOC is an exception: there is a large overlap between the locations in the training and test collections (i.e., simple gazetteers are effective).
19. NER Robustness Across Different Dimensions
- Counting stopwords (w) or not (w/o)
- Context size
- Corpus size

With vs. without stopwords (Reuters 1/100, context 3, averaged):
       LOC   MISC  ORG   PER   RAND
  F    90.2  78.8  81.9  91.5  -
  w    0.93  1.09  2.68  3.91  0.78
  w/o  1.48  1.83  3.81  5.62  1.27

Context size (Reuters 1/100, no stopwords, averaged):
       LOC   MISC  ORG   PER   RAND
  1    0.88  1.26  2.12  2.94  2.43
  2    1.06  1.47  2.95  4.11  1.14
  3    1.07  1.40  3.08  4.46  0.92

Corpus size (Reuters, context 3, no stopwords, averaged):
         LOC   MISC  ORG   PER   RAND
  1/10   1.07  1.40  3.08  4.46  0.92
  1/100  1.48  1.83  3.81  5.62  1.27
20. Other Dimensions: Sample Size
- The normalized divergence of LM_C remains high
- Contrast this with LM_R at larger sample sizes
21. Other Dimensions: N-gram Size

Actual NER accuracy (English):
  Entity  Actual
  LOC     90.21
  MISC    78.83
  ORG     81.86
  PER     91.47

- Higher-order n-grams may help in some cases.
22. Other Languages

Spanish (EFE newswire):
  Entity  Actual  Context 1  Context 2  Context 3
  LOC     79.84   1.18       1.39       1.42
  MISC    55.82   1.73       2.12       2.35
  ORG     79.69   1.42       1.59       1.64
  PER     86.83   2.01       2.31       2.56
  RANDOM  -       2.42       1.82       1.53

Dutch (De Morgen):
  Entity  Actual  Context 1  Context 2  Context 3
  LOC     79.19   1.44       1.65       1.61
  MISC    73.90   1.97       2.02       1.91
  ORG     69.48   1.53       1.86       1.92
  PER     78.83   2.25       2.63       2.60
  RANDOM  -       2.59       1.89       1.71

- Problem: the collections are very small
23. Predicting RE Performance (English)

Divergence by context size:
  Relation  Context 1  Context 2  Context 3
  BORN      2.02       2.17       2.39
  DIED      1.89       1.86       1.83
  INVENT    1.94       1.75       1.72
  WROTE     1.59       1.59       1.53
  RANDOM    6.87       6.24       5.79

Actual accuracy:
  Relation  Strict  Partial
  BORN      0.73    0.96
  DIED      0.34    0.97
  INVENT    0.35    0.64
  WROTE     0.12    0.50

- 2- and 3-word contexts correctly distinguish the easy tasks (BORN, DIED) from the difficult tasks (INVENT, WROTE)
- A 1-word context appears insufficient for predicting RE performance
24. Other Dimensions: Sample Size
- Divergence increases with sample size
25. Results Summary
- Context models can be effective in predicting the success of information extraction systems
- Even a small sample of the available entities can be sufficient for making accurate predictions
- The size of the available collection is the most important limiting factor
26. Other Applications and Future Work
- Could use these results for:
  - Active learning / training of IE systems
  - Improved boundary detection for NER
  - Improved confidence estimation for extraction (e.g., Culotta and McCallum, HLT 2004)
- For better results, could incorporate:
  - Internal contexts and gazetteers, e.g., for LOC entities (Agichtein and Ganti, KDD 2004; Cohen and Sarawagi, KDD 2004)
  - Syntactic/logical distance
  - Coreference resolution
  - Word classes
27. Summary
- Presented the first attempt to predict information extraction accuracy for a given task and collection
- Developed a general, system-independent method utilizing Language Modelling techniques
- Estimates of extraction accuracy can help:
  - Deploy information extraction systems
  - Port information extraction systems to new tasks, domains, collections, and languages
28. For More Information
- Text Mining, Search, and Navigation Group: http://research.microsoft.com/tmsn/