Title: Predicting Accuracy of Extracting Information from Unstructured Text Collections
1. Predicting Accuracy of Extracting Information from Unstructured Text Collections
- Eugene Agichtein and Silviu Cucerzan
- Microsoft Research
2. Extracting and Managing Information in Text
- Text/document collections have varying properties: different languages, varying consistency, noise/errors
- Sources: blogs, news, alerts
- Targets: entities, relations, events
- A complex problem: usually many parameters, often tuning required
[Diagram: an extraction system maps document collections to extracted entities E1-E4; success is measured as extraction accuracy]
3. The Goal: Predict Extraction Accuracy
- Estimate the expected success of an IE system that relies on contextual patterns before:
  - running expensive experiments
  - tuning parameters
  - training the system
- Useful when adapting an IE system to:
  - a new task
  - a new document collection
  - a new language
4. Specific Extraction Tasks
- Named Entity Recognition (NER): Person, Location, Misc
  Example: "European champions Liverpool paved the way to the group stages of the Champions League taking a 3-1 lead over CSKA Sofia on Wednesday ... Gerard Houllier's men started the match in Sofia on fire with Steven Gerrard scoring ..."
- Relation Extraction (RE)
  Example: "Abraham Lincoln was born on Feb. 12, 1809, in a log cabin in Hardin (now Larue) County, Ky"

  BORN:  Who              When           Where
         Abraham Lincoln  Feb. 12, 1809  Hardin County, KY
5. Contextual Clues
- NER: left and right contexts around the entity
  "yesterday, Mrs [Clinton] told reporters the move to the East Room"
- RE: left, middle, and right contexts around the relation arguments
  "engineers [Orville and Wilbur Wright] built the first working [airplane] in 1903."
A sketch of context-window extraction follows.
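As a minimal sketch (ours, not from the talk), the context windows can be cut out of a token sequence; whitespace tokenization and hand-set span indices are simplifying assumptions:

    # Sketch: k-word context windows around an entity span, and the
    # left/middle/right contexts around two relation-argument spans.
    def entity_contexts(tokens, start, end, k=3):
        """Left and right contexts of the entity at tokens[start:end]."""
        return tokens[max(0, start - k):start], tokens[end:end + k]

    def relation_contexts(tokens, arg1, arg2, k=3):
        """Left, middle, right contexts around spans arg1, arg2 (arg1 first)."""
        left = tokens[max(0, arg1[0] - k):arg1[0]]
        middle = tokens[arg1[1]:arg2[0]]
        right = tokens[arg2[1]:arg2[1] + k]
        return left, middle, right

    tokens = ("engineers Orville and Wilbur Wright built the "
              "first working airplane in 1903 .").split()
    # "Orville and Wilbur Wright" = tokens[1:5], "airplane" = tokens[9:10]
    print(relation_contexts(tokens, (1, 5), (9, 10)))
    # (['engineers'], ['built', 'the', 'first', 'working'], ['in', '1903', '.'])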
6. Approach: Language Modelling
- The presence of contextual clues for a task appears related to extraction difficulty
  - The more obvious the clues, the easier the task
- Clues can be modelled as the unexpectedness of a word
- Use Language Modelling (LM) techniques to quantify this intuition
7. Language Models (LM)
- An LM is a summary of the word distribution in text
- Can define unigram, bigram, trigram, ..., n-gram models
- More complex models exist
  - Distance-based, syntactic, word-class models
  - But these are not robust for the web, other languages, ...
- LMs are used in IR, ASR, text classification, and clustering
  - Query Clarity: predicting query performance (Cronen-Townsend et al., SIGIR 2002)
  - Context modelling for NER (Cucerzan et al., EMNLP 1999; Klein et al., CoNLL 2003)
8. Document Language Models
- A basic LM is a normalized word histogram for the document collection (see the sketch below)
- Unigram (word) models are commonly used
- Higher-order n-grams (bigrams, trigrams) can also be used

  word       freq
  the        0.0584
  to         0.0269
  and        0.0199
  said       0.0147
  ...        ...
  's         0.0018
  company    0.0014
  mrs        0.0003
  won        0.0003
  president  0.0003
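A minimal sketch of such a unigram background model, assuming whitespace tokenization (the function name and toy documents are illustrative):

    # Sketch: a unigram LM is the normalized word histogram of the collection.
    from collections import Counter

    def unigram_lm(documents):
        counts = Counter(w.lower() for doc in documents for w in doc.split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    docs = ["The company said Mrs Clinton won .",
            "The president said the company won ."]
    lm_bg = unigram_lm(docs)
    print(lm_bg["the"], lm_bg["company"])  # relative frequencies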
9. Context Language Models
NER: PERSON
- "Senator Christopher Dodd, D-Conn., named general chairman of the Democratic National Committee last week by President Bill Clinton, said it was premature to talk about lifting the U.S. embargo against Cuba"
- "Although the Clintons' health plan failed to make it through Congress this year, Mrs Clinton vowed continued support for the proposal."
- "A senior White House official, who accompanied Clinton, told reporters ..."
RE: INVENTIONS
- "By the fall of 1905, the Wright brothers' experimental period ended. With their third powered airplane, they now routinely made flights of several ..."
- "Against this backdrop, we see the Wright brothers' efforts to develop an airplane"
A sketch of pooling such contexts into LM_C follows.
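A minimal sketch of building the context model LM_C; single-token entity matching and whitespace tokenization are simplifying assumptions:

    # Sketch: pool k-word windows around every occurrence of the sample
    # entities into one context model LM_C (single-token matching only).
    from collections import Counter

    def context_lm(docs, entities, k=3):
        targets = {e.lower() for e in entities}
        counts = Counter()
        for doc in docs:
            toks = [w.lower() for w in doc.split()]
            for i, tok in enumerate(toks):
                if tok in targets:
                    counts.update(toks[max(0, i - k):i] + toks[i + 1:i + 1 + k])
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    docs = ["Mrs Clinton told reporters the move to the East Room",
            "a senior official who accompanied Clinton told reporters"]
    print(context_lm(docs, {"Clinton"}, k=2))  # 'told', 'reporters' dominate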
10. Key Observation
- If normally rare words consistently appear in the contexts around entities, the extraction task tends to be easier.
- The contexts for a task are an intrinsic property of the collection and the extraction task; they are not tied to a specific information extraction system.
11. Divergence Measures
- Cosine divergence (one minus the cosine similarity of the two probability vectors):
  1 - ( Σ_w LM_C(w) · LM_BG(w) ) / ( ||LM_C|| · ||LM_BG|| )
- Relative entropy (KL divergence):
  KL(LM_C || LM_BG) = Σ_w LM_C(w) · log( LM_C(w) / LM_BG(w) )
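A sketch of both measures in Python; the add-epsilon handling of words unseen in the background model is our assumption, since the slides do not specify a smoothing scheme:

    # Sketch: divergences between a context model p (LM_C) and a
    # background model q (LM_BG).
    import math

    def kl_divergence(p, q, eps=1e-9):
        return sum(pw * math.log(pw / q.get(w, eps))
                   for w, pw in p.items() if pw > 0)

    def cosine_divergence(p, q):
        dot = sum(pw * q.get(w, 0.0) for w, pw in p.items())
        norm_p = math.sqrt(sum(v * v for v in p.values()))
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        return 1.0 - dot / (norm_p * norm_q)

    p = {"told": 0.4, "reporters": 0.4, "the": 0.2}
    q = {"the": 0.6, "told": 0.2, "reporters": 0.1, "said": 0.1}
    print(kl_divergence(p, q), cosine_divergence(p, q))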
12. Interpreting Divergence: the Reference LM
- Need to calibrate the observed divergence
- Compute a Reference Model LM_R:
  - Pick K random non-stopwords R_i and compute the context language model around each R_i
  - Example: "the five-star Hotel Astoria is a symbol of elegance and comfort. With an unbeatable location in St Isaac's Square in the heart of St Petersburg, ..."
- Normalized KL (see the sketch below):
  NormalizedKL(LM_C) = KL(LM_C || LM_BG) / KL(LM_R || LM_BG)
- Normalization corrects for the bias introduced by small sample sizes
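A short sketch of the calibration step. The ratio form is our reading of the slides (it matches the absolute/normalized numbers reported later), not a formula quoted verbatim; kl_divergence is repeated from the earlier sketch:

    # Sketch: calibrate the context model's divergence by the reference
    # model's divergence from the background model.
    import math

    def kl_divergence(p, q, eps=1e-9):  # as in the earlier sketch
        return sum(pw * math.log(pw / q.get(w, eps))
                   for w, pw in p.items() if pw > 0)

    def normalized_kl(lm_c, lm_r, lm_bg):
        return kl_divergence(lm_c, lm_bg) / kl_divergence(lm_r, lm_bg)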
13. Reference LM (cont'd)
- LM_R converges to LM_BG for large sample sizes
- The divergence of LM_R is substantial for small samples
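A tiny synthetic simulation (ours, not the authors') of why this happens: the empirical distribution of a growing random sample converges to the distribution it is drawn from, so its KL divergence shrinks toward zero:

    # Tiny simulation: the empirical LM of n words sampled from a fixed
    # distribution drifts toward that distribution as n grows.
    import math, random
    from collections import Counter

    def kl(p, q, eps=1e-9):
        return sum(pw * math.log(pw / q.get(w, eps))
                   for w, pw in p.items() if pw > 0)

    random.seed(0)
    truth = {"the": 0.5, "said": 0.2, "company": 0.15, "won": 0.1, "mrs": 0.05}
    words, weights = zip(*truth.items())
    for n in (50, 500, 5000, 50000):
        sample = Counter(random.choices(words, weights=weights, k=n))
        empirical = {w: c / n for w, c in sample.items()}
        print(n, round(kl(empirical, truth), 4))  # divergence shrinks with n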
14. Predicting Extraction Accuracy: The Algorithm
- Start with a small sample S of entities (or relation tuples) to be extracted
- Find the occurrences of S in the given collection
- Compute LM_BG for the collection
- Compute LM_C for S and the collection
- Pick |S| random words R from LM_BG
- Compute the context LM for R → LM_R
- Compute KL(LM_C || LM_BG) and KL(LM_R || LM_BG)
- Return NormalizedKL(LM_C)
An end-to-end sketch follows.
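An end-to-end sketch of the procedure under the simplifying assumptions used throughout (whitespace tokenization, single-token entity matching, epsilon smoothing, ratio normalization, an illustrative stopword list). It is illustrative, not the authors' implementation:

    # End-to-end sketch of the prediction algorithm.
    import math
    import random
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", ".", ","}

    def normalize(counts):
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def background_lm(docs):
        return normalize(Counter(w.lower() for d in docs for w in d.split()))

    def context_lm(docs, targets, k=3):
        targets = {t.lower() for t in targets}
        counts = Counter()
        for d in docs:
            toks = [w.lower() for w in d.split()]
            for i, t in enumerate(toks):
                if t in targets:
                    counts.update(toks[max(0, i - k):i] + toks[i + 1:i + 1 + k])
        return normalize(counts)

    def kl(p, q, eps=1e-9):
        return sum(pw * math.log(pw / q.get(w, eps))
                   for w, pw in p.items() if pw > 0)

    def predict_ease(docs, sample, k=3, seed=0):
        """Higher scores suggest an easier extraction task (cf. slide 18)."""
        lm_bg = background_lm(docs)
        lm_c = context_lm(docs, sample, k)
        vocab = [w for w in lm_bg if w not in STOPWORDS]
        random.seed(seed)
        r = random.sample(vocab, min(len(sample), len(vocab)))
        lm_r = context_lm(docs, r, k)
        return kl(lm_c, lm_bg) / kl(lm_r, lm_bg)

On a real collection, docs would be the corpus documents and sample a handful of known entities or tuple arguments.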
15. Experimental Evaluation
- How to measure success?
  - Compare the predicted ease of a task vs. the observed extraction accuracy
- Extraction tasks: NER and RE
  - NER: datasets from the CoNLL 2002 and 2003 evaluations
  - RE: binary relations between NEs and generic phrases
16. Extraction Task Accuracy

NER accuracy by entity type and language:
  Entity   English  Spanish  Dutch
  LOC      90.21    79.84    79.19
  MISC     78.83    55.82    73.90
  ORG      81.86    79.69    69.48
  PER      91.47    86.83    78.83
  Overall  86.77    79.20    75.24

RE accuracy by relation:
  Relation  Strict  Partial  Task Difficulty
  BORN      0.73    0.96     Easy
  DIED      0.34    0.97     Easy
  INVENT    0.35    0.64     Hard
  WROTE     0.12    0.50     Hard
17. Document Collections

  Task  Collection                                  Size
  NER   Reuters RCV1, 1/100                         3,566,125 words
  NER   Reuters RCV1, 1/10                          35,639,471 words
  NER   EFE newswire articles, May 2000 (Spanish)   367,589 words
  NER   De Morgen articles (Dutch)                  268,705 words
  RE    Encarta document collection                 64,187,912 words

- Note that the Spanish and Dutch corpora are much smaller
18. Predicting NER Performance (English)

NER accuracy of five systems:
  Entity   Florian et al.  Chieu et al.  Klein et al.  Zhang et al.  Carreras et al.  Average
  LOC      91.15           91.12         89.98         89.54         89.26            90.21
  MISC     80.44           79.16         80.15         75.87         78.54            78.83
  ORG      84.67           84.32         80.48         80.46         79.41            81.86
  PER      93.85           93.44         90.72         90.44         88.93            91.47
  Overall  88.76           88.31         86.31         85.50         85.00            86.77

Absolute and normalized KL divergence (Reuters 1/10, 3-word contexts, stopwords discarded, averaged):
  Entity  Absolute  Normalized
  LOC     0.98      1.07
  MISC    1.29      1.40
  ORG     2.83      3.08
  PER     4.10      4.46
  RANDOM  0.92      0.92

- LOC is an exception: there is a large overlap between the locations in the training and test collections (i.e., simple gazetteers are effective).
19. NER Robustness Across Different Dimensions
- Counting stopwords (w) or not (w/o)
- Context size
- Corpus size

With vs. without stopwords (Reuters 1/100, context 3, averaged):
       LOC   MISC  ORG   PER   RAND
  F    90.2  78.8  81.9  91.5  -
  w    0.93  1.09  2.68  3.91  0.78
  w/o  1.48  1.83  3.81  5.62  1.27

Context size (Reuters 1/100, no stopwords, averaged):
       LOC   MISC  ORG   PER   RAND
  1    0.88  1.26  2.12  2.94  2.43
  2    1.06  1.47  2.95  4.11  1.14
  3    1.07  1.40  3.08  4.46  0.92

Corpus size (Reuters, context 3, no stopwords, averaged):
         LOC   MISC  ORG   PER   RAND
  1/10   1.07  1.40  3.08  4.46  0.92
  1/100  1.48  1.83  3.81  5.62  1.27
20. Other Dimensions: Sample Size
- The normalized divergence of LM_C remains high
- Contrast this with LM_R at larger sample sizes
21. Other Dimensions: N-gram Size

Actual NER accuracy (English):
  Entity  Actual
  LOC     90.21
  MISC    78.83
  ORG     81.86
  PER     91.47

- Higher-order n-grams may help in some cases.
22. Other Languages

Spanish (EFE newswire):
  Entity  Actual  Context 1  Context 2  Context 3
  LOC     79.84   1.18       1.39       1.42
  MISC    55.82   1.73       2.12       2.35
  ORG     79.69   1.42       1.59       1.64
  PER     86.83   2.01       2.31       2.56
  RANDOM  -       2.42       1.82       1.53

Dutch (De Morgen):
  Entity  Actual  Context 1  Context 2  Context 3
  LOC     79.19   1.44       1.65       1.61
  MISC    73.90   1.97       2.02       1.91
  ORG     69.48   1.53       1.86       1.92
  PER     78.83   2.25       2.63       2.60
  RANDOM  -       2.59       1.89       1.71

- Problem: the collections are very small
23. Predicting RE Performance (English)

Divergence by context size:
  Relation  Context 1  Context 2  Context 3
  BORN      2.02       2.17       2.39
  DIED      1.89       1.86       1.83
  INVENT    1.94       1.75       1.72
  WROTE     1.59       1.59       1.53
  RANDOM    6.87       6.24       5.79

Actual accuracy:
  Relation  Strict  Partial
  BORN      0.73    0.96
  DIED      0.34    0.97
  INVENT    0.35    0.64
  WROTE     0.12    0.50

- 2- and 3-word contexts correctly distinguish the easy tasks (BORN, DIED) from the difficult tasks (INVENT, WROTE)
- A 1-word context appears insufficient for predicting RE performance
24. Other Dimensions: Sample Size
- Divergence increases with sample size
25. Results Summary
- Context models can be effective in predicting the success of information extraction systems
- Even a small sample of the available entities can be sufficient for making accurate predictions
- The size of the available collection is the most important limiting factor
26. Other Applications and Future Work
- Could use these results for:
  - Active learning / training of IE systems
  - Improved boundary detection for NER
  - Improved confidence estimation for extraction (e.g., Culotta and McCallum, HLT 2004)
- For better results, could incorporate:
  - Internal contexts and gazetteers, e.g., for LOC entities (Agichtein and Ganti, KDD 2004; Cohen and Sarawagi, KDD 2004)
  - Syntactic/logical distance
  - Coreference resolution
  - Word classes
27. Summary
- Presented the first attempt to predict information extraction accuracy for a given task and collection
- Developed a general, system-independent method utilizing Language Modelling techniques
- Estimates of extraction accuracy can help:
  - Deploy information extraction systems
  - Port information extraction systems to new tasks, domains, collections, and languages
28. For More Information
- Text Mining, Search, and Navigation Group: http://research.microsoft.com/tmsn/