1. Sources of Success for Information Extraction Methods
- Seminar for Computational Learning and Adaptation
- Stanford University
- October 25, 2001
- Joseph Smarr
- jsmarr@stanford.edu
- Based on research conducted at UC San Diego in Summer 2001 with Charles Elkan and David Kauchak
2. Overview and Themes: Identifying Sources of Success
- Brief overview of the Information Extraction (IE) paradigm and current methods
- Getting under the hood of current systems to understand the source of their performance and limitations
- Identifying new sources of information to exploit for increased performance and usefulness
3. Motivation for Information Extraction
- Abundance of freely available text in digital form (WWW, MEDLINE, etc.)
- Information contained in un-annotated text is largely inaccessible to computers
- Much of this information appears ripe for the plucking without having to do full text understanding
4. Highly Structured Example: Amazon.com Book Info Pages
Desired info: title, author(s), price, availability, etc.
5. Partially Structured Example: SCLA Speaker Announcement Emails
Desired info: title, speaker, date, abstract, etc.
6. Natural Text Example: MEDLINE Journal Abstracts
BACKGROUND: The most challenging aspect of
revision hip surgery is the management of bone
loss. A reliable and valid measure of bone loss
is important since it will aid in future studies
of hip revisions and in preoperative planning. We
developed a measure of femoral and acetabular
bone loss associated with failed total hip
arthroplasty. The purpose of the present study
was to measure the reliability and the
intraoperative validity of this measure and to
determine how it may be useful in preoperative
planning. METHODS: From July 1997 to December
1998, forty-five consecutive patients with a
failed hip prosthesis in need of revision surgery
were prospectively followed. Three general
orthopaedic surgeons were taught the radiographic
classification system, and two of them classified
standardized preoperative anteroposterior and
lateral hip radiographs with use of the system.
Interobserver testing was carried out in a
blinded fashion. These results were then compared
with the intraoperative findings of the third
surgeon, who was blinded to the preoperative
ratings. Kappa statistics (unweighted and
weighted) were used to assess correlation.
Interobserver reliability was assessed by
examining the agreement between the two
preoperative raters. Prognostic validity was
assessed by examining the agreement between the
assessment by either Rater 1 or Rater 2 and the
intraoperative assessment (reference standard).
RESULTS: With regard to the assessments of both
the femur and the acetabulum, there was
significant agreement (p < 0.0001) between the
preoperative raters (reliability), with weighted
kappa values of >0.75. There was also significant
agreement (p < 0.0001) between each rater's
assessment and the intraoperative assessment
(validity) of both the femur and the acetabulum,
with weighted kappa values of >0.75. CONCLUSIONS:
With use of the newly developed classification
system, preoperative radiographs are reliable and
valid for assessment of the severity of bone loss
that will be found intraoperatively.
Desired info: subject size, study type, condition studied, etc.
7. Current Types of IE Systems
- Hand-built systems
  - Often effective, but slow and expensive to build and adapt
- Stochastic generative models
  - HMMs, N-Grams, PCFGs, etc.
  - Keep separate distributions for content and filler states
- Induced rule-based systems
  - Learn to identify local landmarks for the beginning and end of target information
8. Formalization of Information Extraction
- Performance task
  - Extract specific tokens from a set of documents that contain the desired information
- Performance measure (see the sketch below)
  - Precision = correct returned / total returned
  - Recall = correct returned / total correct
  - F1 = harmonic mean of precision and recall
- Learning paradigm
  - Supervised learning on a set of documents with target fields manually labeled
  - Usually train/test on one field at a time
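
A minimal Python sketch of these metrics, treating extractions as sets of spans (the span representation here is my own illustration, not from the talk):

# IE evaluation metrics as defined on the slide above.
def precision_recall_f1(returned, correct):
    # returned/correct: sets of extracted spans, e.g. (doc_id, start, end)
    true_pos = len(returned & correct)
    precision = true_pos / len(returned) if returned else 0.0
    recall = true_pos / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 3 spans returned, 2 of them correct, 4 true fields in the corpus.
returned = {("doc1", 5, 7), ("doc1", 12, 14), ("doc2", 0, 2)}
correct = {("doc1", 5, 7), ("doc2", 0, 2), ("doc2", 9, 11), ("doc3", 3, 4)}
print(precision_recall_f1(returned, correct))  # approximately (0.67, 0.50, 0.57)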
9. IE as a Classification Task: Token Extraction as Boundary Detection

Input: a linear sequence of tokens

  Date Thursday , October 25 Time 4 15 - 5 30 PM

Method: binary classification of inter-token boundaries, labeling each as a start of content, an end of content, or an unimportant boundary

  Date | Thursday , October 25 | Time 4 15 - 5 30 PM
      start                   end

Output: the tokens between an identified start/end boundary pair (sketched below)

  Thursday , October 25
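
A small sketch of the output step. The index convention is an assumption of mine: boundary i sits just before tokens[i].

def extract_spans(tokens, start_boundaries, end_boundaries):
    # For each start boundary, take the tokens up to the nearest later end boundary.
    spans = []
    for s in sorted(start_boundaries):
        ends = [e for e in end_boundaries if e > s]
        if ends:
            spans.append(tokens[s:min(ends)])
    return spans

tokens = ["Date", "Thursday", ",", "October", "25", "Time", "4", "15", "-", "5", "30", "PM"]
print(extract_spans(tokens, {1}, {5}))  # [['Thursday', ',', 'October', '25']]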
10. Representation of Boundary Classifiers
- Boundary detectors are pairs of token sequences <p, s>
- A detector matches a boundary iff p matches the text before the boundary and s matches the text after the boundary
- Detectors can contain wildcards, e.g. capitalized word, number, etc.
- Example: <Date, CapitalizedWord> matches the beginning of "Date Thursday, October 25" (see the matcher sketch below)
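
A minimal matcher for such detectors. The wildcard names and this code are illustrative, not the BWI implementation:

WILDCARDS = {
    "<Cap>": lambda t: t[:1].isupper(),  # capitalized word
    "<Num>": str.isdigit,                # number
}

def token_matches(pattern, token):
    return WILDCARDS[pattern](token) if pattern in WILDCARDS else pattern == token

def detector_matches(prefix, suffix, tokens, boundary):
    # Boundary i sits just before tokens[i]; prefix must match the tokens
    # before the boundary and suffix the tokens after it.
    if boundary < len(prefix) or boundary + len(suffix) > len(tokens):
        return False
    before = tokens[boundary - len(prefix):boundary]
    after = tokens[boundary:boundary + len(suffix)]
    return (all(token_matches(p, t) for p, t in zip(prefix, before))
            and all(token_matches(s, t) for s, t in zip(suffix, after)))

tokens = ["Date", "Thursday", ",", "October", "25"]
print(detector_matches(["Date"], ["<Cap>"], tokens, 1))  # True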
11. Boosted Wrapper Induction (BWI): Exemplar of Current Rule-Based Systems
- Wrapper induction is a high-precision, low-recall learner that performs well on highly structured tasks
- Boosting is a technique for combining multiple weak learners into a strong learner by re-weighting examples
- Boosted Wrapper Induction (BWI) was proposed by Freitag and Kushmerick in 2000 as the marriage of these two techniques
12. BWI Algorithm
- Given a set of documents with labeled fore and aft boundaries, induce <F, A, H>
  - F: set of fore detectors
  - A: set of aft detectors
  - H: histogram of field lengths (for pairing fore and aft detectors)
- To learn each boundary detector:
  - Start with an empty rule
  - Exhaustively enumerate all extensions up to lookahead length L
  - Add the best-scoring token extension
  - Repeat until no extension improves the score
- After learning a new detector:
  - Re-weight documents according to AdaBoost (down-weight correctly covered docs, up-weight incorrectly covered docs, normalize all weights); see the sketch below
- Repeat the process, learning a new rule and re-weighting each time
- Stop after a predetermined number of iterations
13. Summary of Original BWI Results
- BWI gives state-of-the-art performance on highly structured and partially structured tasks
- No systematic analysis of why BWI performs well
- BWI proposed as a solution for natural text IE, but no tests conducted
14. Goals of Our Research
- Understand specifically how boosting contributes to BWI's performance
- Investigate the relationship between performance and task regularity
- Identify new sources of information to improve performance, particularly for natural language tasks
15. Comparison Algorithm: Sequential Wrapper Induction (SWI)
- Same formulation as BWI, but uses set covering instead of boosting to learn multiple rules:
  - Find the highest-scoring rule
  - Remove all positive examples covered by the new rule
  - Stop when all positive examples have been removed
- Scoring function, two choices (both sketched below):
  - Greedy-SWI: most positive examples covered without covering any negative examples
  - Root-SWI: sqrt(W+) - sqrt(W-), where W+ and W- are the total weights of positive and negative examples covered
- BWI uses root scoring, but many set covering methods use greedy scoring
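
A sketch of the two scoring functions (the variable names are mine):

import math

def greedy_score(pos_covered, neg_covered):
    # Greedy-SWI: count positives covered, but only rules covering no negatives qualify.
    return len(pos_covered) if not neg_covered else float("-inf")

def root_score(pos_covered, neg_covered, weights):
    # Root-SWI / BWI: sqrt(W+) - sqrt(W-) over the total covered example weight.
    w_pos = sum(weights[i] for i in pos_covered)
    w_neg = sum(weights[i] for i in neg_covered)
    return math.sqrt(w_pos) - math.sqrt(w_neg)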
16. Component Matrix of Algorithms

  Scoring \ Accumulation    Boosting    Set Covering
  Root                      BWI         Root-SWI
  Greedy                    --          Greedy-SWI

(Rows: method for scoring individual detectors; columns: method for accumulating multiple detectors.)
17. Question 1: Does BWI Outperform the Greedy Approach of SWI?
- BWI has higher F1 than Greedy-SWI
- Greedy-SWI tends to have slightly higher precision, but BWI has considerably higher recall
- Does this difference come from the scoring function or the accumulation method?
(Average of 8 partially structured IE tasks)
18. Question 2: How Does Performance Differ by Choice of Scoring Function?
- Greedy-SWI and Root-SWI differ only in their scoring function
- Greedy-SWI has higher precision, Root-SWI has higher recall; their F1 is similar
- BWI still outperforms Root-SWI, even though they use identical scoring functions
- Remaining differences: boosting vs. set covering, and the total number of rules learned
(Average of 8 partially structured IE tasks)
19. Question 3: How Does the Number of Rules Learned Affect Performance?
- BWI learns a predetermined number of rules, but SWI stops when all examples are covered
- Usually BWI learns many more rules than Root-SWI
- Fixed-BWI: stop BWI after it has learned as many rules as Root-SWI
- This reproduces the precision-recall tradeoff of Root-SWI
- BWI outperforms Fixed-BWI
(Average of 8 partially structured IE tasks)
20. Analysis of Experimental Results: Why Does BWI Outperform SWI?
- Key insight: the source of BWI's success is the interaction of two complementary effects, both due to boosting:
  - Re-weighting examples causes increasingly specific rules to be learned to cover exceptional cases (high precision)
  - Re-weighting examples instead of removing them means rules can be learned even after all examples have been covered (high recall)
21. Performance vs. Task Regularity Reveals an Important Interaction
- All methods perform better on tasks with more structure
- The relative power of different algorithmic components varies with task regularity
22. How Do We Quantify Task Regularity?
- Goal: measure the relationship between task regularity and performance
- Proposed solution: the SWI-Ratio
  - SWI-Ratio = (number of iterations Greedy-SWI takes to cover all positive examples) / (total number of positive examples)
  - Most regular case: 1 rule covers all N examples, giving 1/N ≈ 0
  - Least regular case: a separate rule for each example, giving N/N = 1
- Since each new rule must cover at least one example, SWI learns at most N rules for N examples (and usually far fewer), so the SWI-Ratio is always between 0 and 1 (smaller = more regular); see the sketch below
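
A minimal sketch of the measure (the function names are mine; learn_greedy_rule stands in for one Greedy-SWI iteration):

def swi_ratio(positive_examples, learn_greedy_rule):
    # Run Greedy-SWI's set covering to exhaustion; ratio = rules learned / N.
    remaining = set(positive_examples)
    num_rules = 0
    while remaining:
        rule = learn_greedy_rule(remaining)           # best rule on what is left
        covered = {x for x in remaining if rule(x)}
        assert covered, "each rule must cover at least one example"
        remaining -= covered
        num_rules += 1
    return num_rules / len(positive_examples)         # in (0, 1]; smaller = more regular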
23. Desirable Properties of SWI-Ratio
- Relative to the size of the document collection, so it is suitable for comparison across collections of different sizes
- General and objective: SWI is very simple and doesn't allow any negative examples, so it gives an unbiased account of how many non-overlapping rules are needed to perfectly cover all examples
- Quick and easy to run: no free parameters to set (except lookahead, which we kept fixed in all tests)
24. Performance of BWI and Greedy-SWI (F1) vs. Task Regularity (SWI-Ratio)
(Figure: F1 plotted against SWI-Ratio; dotted lines separate the highly structured, partially structured, and natural text domains)
25. Improving IE Performance on Natural Text Documents
- Goal: compensate for weak IE performance on natural language tasks
- Need to look elsewhere for regularities to exploit
- Idea: consider grammatical structure
  - Run a shallow parser on each sentence
  - Flatten the output into a sequence of typed phrase segments, using XML tags to mark the text (see the sketch below)
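
An illustrative sketch of the flattening step; the chunker output format assumed here (phrase-type/token pairs) is mine, not from the talk:

def flatten_chunks(chunks):
    # chunks: list of (phrase_type, tokens) pairs from a shallow parser.
    segments = []
    for phrase_type, tokens in chunks:
        segments.append(f"<{phrase_type}> {' '.join(tokens)} </{phrase_type}>")
    return " ".join(segments)

chunks = [("NP", ["forty-five", "consecutive", "patients"]),
          ("PP", ["with"]),
          ("NP", ["a", "failed", "hip", "prosthesis"])]
print(flatten_chunks(chunks))
# <NP> forty-five consecutive patients </NP> <PP> with </PP> <NP> a failed hip prosthesis </NP>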
26. Typed Phrase Segments Improve BWI's Performance on Natural Text IE Tasks
(Chart: F1 increases of 21%, 65%, and 45% on the natural text tasks)
27. Typed Phrase Segments Increase Regularity of Natural Text IE Tasks
(Chart: SWI-Ratio decreases by 21% on average)
28. Encouraging Results Suggest Exploiting Other Sources of Regularity
- Key insight: we can improve performance on natural text while maintaining the simple IE framework if we expose the right regularities
- Suggests other linguistic abstractions may be useful: more grammatical info, semantic categories, lexical features, etc.
29. Conclusions and Summary
- Boosting is the key source of BWI's success: it learns specific rules, but learns many of them
- IE performance is sensitive to task regularity: the SWI-Ratio is a quantitative, objective measure of regularity (vs. subjective document classes)
- Exploiting more regularities in text is key to IE's future, particularly in natural text
  - Canonical formatting and keywords are often sufficient in structured text documents
  - Exposing grammatical information boosts performance on natural text IE tasks
30. Acknowledgements
- Dayne Freitag, for making the BWI code available
- Mark Craven, for giving us natural text MEDLINE documents with annotated phrase segments
- MedExpert International, Inc., for financial support of this research
- Charles Elkan and David Kauchak, for hosting me at UCSD this summer

This work was conducted as part of the California Institute for Telecommunications and Information Technology, Cal-(IT)2.