1
Predicting Accuracy of Extracting Information
from Unstructured Text Collections
  • Eugene Agichtein and Silviu Cucerzan
  • Microsoft Research

2
Extracting and Managing Information in Text
[Diagram: text document collections (blogs, news, alerts) with varying properties (different languages, varying consistency, noise/errors) feed an Information Extraction System, which outputs tables of extracted entities, relations, and events (E1, E2, E3, E4, ...). Extraction is a complex problem: usually many parameters, often tuning required. Success = accuracy.]
3
The Goal: Predict Extraction Accuracy
  • Estimate the expected success of an IE system that relies on contextual patterns before
    • running expensive experiments
    • tuning parameters
    • training the system
  • Useful when adapting an IE system to
    • a new task
    • a new document collection
    • a new language

4
Specific Extraction Tasks
  • Named Entity Recognition (NER): label entities in text with types such as Person, Location, and Misc:
    "European champions Liverpool paved the way to the group stages of the Champions League taking a 3-1 lead over CSKA Sofia on Wednesday ... Gerard Houllier's men started the match in Sofia on fire with Steven Gerrard scoring ..."
  • Relation Extraction (RE): extract structured tuples from text:
    "Abraham Lincoln was born on Feb. 12, 1809, in a log cabin in Hardin (now Larue) County, Ky."

    BORN:  Who               When            Where
           Abraham Lincoln   Feb. 12, 1809   Hardin County, KY
5
Contextual Clues
  • NER: left and right contexts around the entity:
    "yesterday, Mrs [Clinton] told reporters the move to the East Room"
    (left context: "yesterday, Mrs"; right context: "told reporters")
  • RE: left, middle, and right contexts around the argument pair:
    "engineers [Orville and Wilbur Wright] built the first working [airplane] in 1903."
    (left context: "engineers"; middle context: "built the first working"; right context: "in 1903")
6
Approach: Language Modelling
  • The presence of contextual clues for a task appears related to extraction difficulty
  • The more obvious the clues, the easier the task
  • "Obviousness" can be modelled as the unexpectedness of a word in context
  • Use Language Modelling (LM) techniques to quantify this intuition

7
Language Models (LM)
  • An LM is a summary of the word distribution in text
  • Can define unigram, bigram, trigram, ..., n-gram models
  • More complex models exist
    • Distance, syntax, word classes
    • But not robust for the web, other languages, ...
  • LMs are used in IR, ASR, text classification, and clustering
    • Query Clarity: predicting query performance (Cronen-Townsend et al., SIGIR 2002)
    • Context modelling for NER (Cucerzan et al., EMNLP 1999; Klein et al., CoNLL 2003)

8
Document Language Models
  • A basic LM is a normalized word histogram for the document collection
  • Unigram (word) models commonly used
  • Higher-order n-grams (bigrams, trigrams) can be used

    word        Freq
    the         0.0584
    to          0.0269
    and         0.0199
    said        0.0147
    ...         ...
    's          0.0018
    company     0.0014
    mrs         0.0003
    won         0.0003
    president   0.0003
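Such a background model is straightforward to compute; below is a minimal sketch in Python (function and variable names are illustrative, not from the paper):

    from collections import Counter

    def unigram_lm(tokens):
        """Normalized word histogram: P(w) = count(w) / total token count."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    # e.g., lm_bg = unigram_lm(open("collection.txt").read().lower().split())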
9
Context Language Models
NER: PERSON
  • Senator Christopher Dodd, D-Conn., named general chairman of the Democratic National Committee last week by President Bill Clinton, said it was premature to talk about lifting the U.S. embargo against Cuba
  • Although the Clintons' health plan failed to make it through Congress this year, Mrs Clinton vowed continued support for the proposal.
  • A senior White House official, who accompanied Clinton, told reporters ...

RE: INVENTIONS
  • By the fall of 1905, the Wright brothers' experimental period ended. With their third powered airplane, they now routinely made flights of several ...
  • Against this backdrop, we see the Wright brothers' efforts to develop an airplane
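Concretely, the context LM is built from the windows of words around each occurrence of the sample entities; a minimal sketch reusing unigram_lm from above (single-token entities and a fixed window size are simplifying assumptions):

    def context_lm(tokens, entities, k=3):
        """Unigram LM over the k words to the left and right of each entity occurrence."""
        window_words = []
        for i, tok in enumerate(tokens):
            if tok in entities:
                window_words += tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        return unigram_lm(window_words)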

10
Key Observation
  • If normally rare words consistently appear in the contexts around entities, the extraction task tends to be easier.
  • The contexts for a task are an intrinsic property of the collection and the extraction task, not of any specific information extraction system.

11
Divergence Measures
  • Cosine divergence
  • Relative entropy (KL divergence)
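Presumably these are the standard definitions; writing P_C and P_BG for the word probabilities under the context model LM_C and the background model LM_BG:

    \mathrm{CosDiv}(LM_C, LM_{BG}) = 1 - \frac{\sum_w P_C(w)\,P_{BG}(w)}{\sqrt{\sum_w P_C(w)^2}\,\sqrt{\sum_w P_{BG}(w)^2}}

    \mathrm{KL}(LM_C \,\|\, LM_{BG}) = \sum_w P_C(w)\,\log\frac{P_C(w)}{P_{BG}(w)}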

12
Interpreting Divergence: Reference LM
  • Need to calibrate the observed divergence
  • Compute a Reference Model LM_R
    • Pick K random non-stopwords R and compute the context language model around each R_i, e.g.:
      "... the five-star Hotel Astoria is a symbol of elegance and comfort. With an unbeatable location in St Isaac's Square in the heart of St Petersburg, ..."
  • Normalized KL(LM_C)
    • Normalization corrects for the bias introduced by small sample size
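The precise normalization is not spelled out here; one consistent reading of "Normalized KL(LM_C)", matching the Absolute vs. Normalized columns on slide 18 (this exact form is an assumption), is the ratio of the context model's divergence to the reference model's:

    \mathrm{NormKL}(LM_C) = \frac{\mathrm{KL}(LM_C \,\|\, LM_{BG})}{\mathrm{KL}(LM_R \,\|\, LM_{BG})}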

13
Reference LM (cont)
  • LM_R converges to LM_BG for large sample sizes
  • The divergence of LM_R is substantial for small samples

14
Predicting Extraction Accuracy: The Algorithm
  1. Start with a small sample S of entities (or relation tuples) to be extracted
  2. Find the occurrences of S in the given collection
  3. Compute LM_BG for the collection
  4. Compute LM_C for S and the collection
  5. Pick |S| random words R from LM_BG
  6. Compute the context LM for R → LM_R
  7. Compute KL(LM_C || LM_BG) and KL(LM_R || LM_BG)
  8. Return the normalized KL(LM_C)
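A minimal end-to-end sketch of these steps in Python, reusing unigram_lm and context_lm from above (the ratio form of the normalization, the epsilon smoothing, and the omitted stopword filtering are all simplifying assumptions):

    import math
    import random

    def kl(p, q, eps=1e-9):
        """KL(p || q) over p's support; eps smooths words unseen in q."""
        return sum(pw * math.log(pw / q.get(w, eps)) for w, pw in p.items())

    def predict_extraction_ease(tokens, sample_entities, k=3):
        lm_bg = unigram_lm(tokens)                        # step 3
        lm_c = context_lm(tokens, sample_entities, k)     # steps 1, 2, 4
        rand_words = random.sample(sorted(lm_bg), len(sample_entities))  # step 5
        lm_r = context_lm(tokens, set(rand_words), k)     # step 6
        # steps 7-8: a higher normalized divergence suggests an easier task
        return kl(lm_c, lm_bg) / kl(lm_r, lm_bg)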

15
Experimental Evaluation
  • How to measure success?
    • Compare the predicted ease of a task vs. the observed extraction accuracy
  • Extraction tasks: NER and RE
    • NER: datasets from the CoNLL 2002 and 2003 evaluations
    • RE: binary relations between NEs and generic phrases

16
Extraction Task Accuracy
  • NER (F1):

    Entity    English   Spanish   Dutch
    LOC       90.21     79.84     79.19
    MISC      78.83     55.82     73.90
    ORG       81.86     79.69     69.48
    PER       91.47     86.83     78.83
    Overall   86.77     79.20     75.24

  • RE (accuracy):

    Relation   Strict   Partial   Task Difficulty
    BORN       0.73     0.96      Easy
    DIED       0.34     0.97      Easy
    INVENT     0.35     0.64      Hard
    WROTE      0.12     0.50      Hard
17
Document Collections

    Task   Collection                                   Size
    NER    Reuters RCV1, 1/100                          3,566,125 words
    NER    Reuters RCV1, 1/10                           35,639,471 words
    NER    EFE newswire articles, May 2000 (Spanish)    367,589 words
    NER    De Morgen articles (Dutch)                   268,705 words
    RE     Encarta document collection                  64,187,912 words

Note that the Spanish and Dutch corpora are much smaller.
18
Predicting NER Performance (English)
  • Observed F1 of CoNLL 2003 systems:

    Entity    Florian et al.   Chieu et al.   Klein et al.   Zhang et al.   Carreras et al.   Average
    LOC       91.15            91.12          89.98          89.54          89.26             90.21
    MISC      80.44            79.16          80.15          75.87          78.54             78.83
    ORG       84.67            84.32          80.48          80.46          79.41             81.86
    PER       93.85            93.44          90.72          90.44          88.93             91.47
    Overall   88.76            88.31          86.31          85.50          85.00             86.77

  • Absolute and normalized KL-divergence (Reuters 1/10, context of 3 words, stopwords discarded, averaged):

    Entity    Absolute   Normalized
    LOC       0.98       1.07
    MISC      1.29       1.40
    ORG       2.83       3.08
    PER       4.10       4.46
    RANDOM    0.92       0.92

  • LOC is the exception: there is large overlap between the locations in the training and test collections (i.e., simple gazetteers are effective).
19
NER Robustness / Different Dimensions
  • Counting stopwords (w) or not (w/o); Reuters 1/100, context 3, avg:

              LOC    MISC   ORG    PER    RAND
    F1        90.2   78.8   81.9   91.5   -
    w         0.93   1.09   2.68   3.91   0.78
    w/o       1.48   1.83   3.81   5.62   1.27

  • Context size; Reuters 1/100, no stopwords, avg:

    Context   LOC    MISC   ORG    PER    RAND
    1         0.88   1.26   2.12   2.94   2.43
    2         1.06   1.47   2.95   4.11   1.14
    3         1.07   1.40   3.08   4.46   0.92

  • Corpus size; Reuters, context 3, no stopwords, avg:

    Corpus    LOC    MISC   ORG    PER    RAND
    1/10      1.07   1.40   3.08   4.46   0.92
    1/100     1.48   1.83   3.81   5.62   1.27
20
Other Dimensions: Sample Size
  • Normalized divergence of LM_C remains high
    • Contrast with LM_R for larger sample sizes

21
Other Dimensions: N-gram Size
  • Actual F1 for reference:

    Entity   F1
    LOC      90.21
    MISC     78.83
    ORG      81.86
    PER      91.47

  • Higher-order n-grams may help in some cases.

22
Other Languages
  • Spanish (KL-divergence by context size, plus actual F1):

    Entity    Context 1   Context 2   Context 3   Actual F1
    LOC       1.18        1.39        1.42        79.84
    MISC      1.73        2.12        2.35        55.82
    ORG       1.42        1.59        1.64        79.69
    PER       2.01        2.31        2.56        86.83
    RANDOM    2.42        1.82        1.53        -

  • Dutch (KL-divergence by context size, plus actual F1):

    Entity    Context 1   Context 2   Context 3   Actual F1
    LOC       1.44        1.65        1.61        79.19
    MISC      1.97        2.02        1.91        73.90
    ORG       1.53        1.86        1.92        69.48
    PER       2.25        2.63        2.60        78.83
    RANDOM    2.59        1.89        1.71        -

  • Problem: very small collections
23
Predicting RE Performance (English)
  • KL-divergence by context size:

    Relation   Context 1   Context 2   Context 3
    BORN       2.02        2.17        2.39
    DIED       1.89        1.86        1.83
    INVENT     1.94        1.75        1.72
    WROTE      1.59        1.59        1.53
    RANDOM     6.87        6.24        5.79

  • Observed accuracy:

    Relation   Strict   Partial
    BORN       0.73     0.96
    DIED       0.34     0.97
    INVENT     0.35     0.64
    WROTE      0.12     0.50

  • 2- and 3-word contexts correctly distinguish between the easy tasks (BORN, DIED) and the difficult tasks (INVENT, WROTE).
  • A 1-word context appears insufficient for predicting RE performance.

24
Other Dimensions: Sample Size
  • Divergence increases with sample size

25
Results Summary
  • Context models can be effective in predicting the success of information extraction systems
  • Even a small sample of the available entities can be sufficient for making accurate predictions
  • The availability of a large collection is the most important limiting factor

26
Other Applications and Future Work
  • Could use the results for
    • active learning / training IE systems
    • improved boundary detection for NER
    • improved confidence estimation for extraction (e.g., Culotta and McCallum, HLT 2004)
  • For better results, could incorporate
    • internal contexts and gazetteers, e.g., for LOC entities (Agichtein and Ganti, KDD 2004; Cohen and Sarawagi, KDD 2004)
    • syntactic/logical distance
    • coreference resolution
    • word classes

27
Summary
  • Presented the first attempt to predict information extraction accuracy for a given task and collection
  • Developed a general, system-independent method utilizing Language Modelling techniques
  • Estimates of extraction accuracy can help
    • deploy information extraction systems
    • port information extraction systems to new tasks, domains, collections, and languages

28
For More Information
  • Text Mining, Search, and Navigation Group: http://research.microsoft.com/tmsn/