Title: IE by Candidate Classification: Jansche
1IE by Candidate Classification: Jansche & Abney, Cohen et al.
2Administrative stuff
- Reminders
- Turn in your insightful critiques of this week's papers.
- Think about your project!
3Landscape of IE Tasks: Complexity (Review)
E.g. word patterns
- Closed set: U.S. states
  - He was born in Alabama
  - The big Wyoming sky
- Regular set: U.S. phone numbers
  - Phone (413) 545-1323
  - The CALD main office can be reached at 412-268-1299
- Complex pattern: U.S. postal addresses
  - University of Arkansas P.O. Box 140 Hope, AR 71802
  - Headquarters 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
- Ambiguous patterns, needing context and many sources of evidence: person names
  - ... was among the six houses sold by Hope Feldman that year.
  - Pawel Opalinski, Software Engineer at WhizBang Labs.
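The two simplest cases above can be handled without any learning. The following is a minimal sketch, assuming a simplified phone-number pattern and a lexicon truncated to the four states named on the slides; the names `PHONE_RE`, `US_STATES`, `find_phones` and `find_states` are illustrative, not taken from the lecture.

```python
import re

# Regular set: an illustrative (not exhaustive) pattern for U.S. phone numbers.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")

# Closed set: a lexicon. Only the four states named on the slides are listed;
# a real lexicon would contain all fifty.
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}


def find_phones(text):
    """Return every substring that matches the phone-number pattern."""
    return PHONE_RE.findall(text)


def find_states(text):
    """Return every alphabetic token that is a member of the state lexicon."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in US_STATES]
```

Single-token lookup like this would miss multi-word names, which is one reason the harder categories on this slide need context and several sources of evidence.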
4Landscape of IE Tasks: Single Field/Record (Review)
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
- Single entity
  - Person: Jack Welch
  - Person: Jeffrey Immelt
  - Location: Connecticut
- Binary relationship
  - Relation: Person-Title; Person: Jack Welch; Title: CEO
  - Relation: Company-Location; Company: General Electric; Location: Connecticut
- N-ary record
  - Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
5Landscape of IE Tasks: Models (Review)
Lexicons
Abraham Lincoln was born in Kentucky.
member?
Alabama Alaska Wisconsin Wyoming
This is often treated as a structured prediction
problem: classifying tokens sequentially
HMMs, CRFs, ...
6Background on JA paper
7SCAN: Search and Summarization for Audio
Collections (AT&T Labs)
9Why IE from personal voicemail?
- Unified interface for email, voicemail, fax, ... requires uniform headers
- Sender, Time, Subject, ...
- Headers are key for uniform interface
- Independently, voicemail access is slow
- Useful to have fast access to important parts of message (contact number, caller)
10Background on JA, cont.
- Quick review of Huang, Zweig & Padmanabhan (IBM Yorktown), "Information Extraction from Voicemail"
- Goal: find identity and contact number of callers in voicemail (NER + role classification)
- Evaluated three systems on 5000 labeled, manually transcribed messages
- Baseline system
- 200 hand-coded rules based on trigger phrases
- State-of-the-art Ratnaparkhi-style MaxEnt tagger
- Lexical unigrams, bigrams, dictionary features for names, numbers, trigger phrases; feature selection
- Poor results
- On manually transcribed data, F1 in the 80s for both tasks (good!)
- On ASR data, F1 about 50 for caller names, 80 for contact numbers, even with a very loose performance metric
- Best learning method barely beat the baseline rule-based system.
11What's interesting in this paper
- How and when do we use ML?
- Robust information extraction
- Generalizing from manual transcripts (i.e., human-produced written version of voicemail) to automatic (ASR) transcripts
- Place of hand-coding vs learning in information extraction
- How to break up task
- Where and how to use engineering
Candidate Generator -> Candidate phrase -> Learned filter -> Extracted phrase
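The candidate-generator and learned-filter split can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: `capitalized_runs` is a toy generator and `near_start` is a hand-written stand-in for the learned filter; neither comes from the paper.

```python
def capitalized_runs(tokens):
    """Toy high-recall generator: every maximal run of capitalized tokens."""
    spans, start = [], None
    for i, tok in enumerate(tokens + [""]):  # sentinel closes a final run
        if tok[:1].isupper():
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))  # end index is exclusive
            start = None
    return spans


def near_start(tokens, span, max_start=5):
    """Stand-in for a learned filter: keep candidates that begin early."""
    return span[0] < max_start


def propose_and_filter(tokens, generate, keep):
    """Run the generator, then keep only the candidates the filter accepts."""
    return [span for span in generate(tokens) if keep(tokens, span)]
```

The design point is that the generator only has to be high-recall; precision is the filter's job, so the two parts can be engineered and trained separately.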
12Voicemail corpus
- About 10,000 manually transcribed and annotated voice messages.
- 1869 used for evaluation
13Observation: caller phrases are short and near
the beginning of the message.
14Caller-phrase extraction
- Propose start positions i_1, ..., i_N
- Use a learned decision tree to pick the best i
- Propose end positions i+j_1, i+j_2, ..., i+j_M
- Use a learned decision tree to pick the best j
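The two-stage choice of start and end can be sketched as follows. The scoring functions here are hand-written stand-ins for the learned decision trees, and all names (`extract_phrase`, `toy_score_start`, `toy_score_end`) are assumptions made for illustration.

```python
def extract_phrase(tokens, start_candidates, score_start, end_offsets, score_end):
    """Pick the best start i, then the best end among i + offset candidates."""
    if not start_candidates:
        return None
    i = max(start_candidates, key=lambda s: score_start(tokens, s))
    ends = [i + k for k in end_offsets if i + k <= len(tokens)]
    if not ends:
        return None
    j = max(ends, key=lambda e: score_end(tokens, i, e))
    return tokens[i:j]  # j is exclusive


def toy_score_start(tokens, s):
    """Stand-in for the start classifier: prefer a position right after 'is'."""
    return 1.0 if s > 0 and tokens[s - 1] == "is" else 0.0


def toy_score_end(tokens, i, e):
    """Stand-in for the end classifier: prefer ending right before 'calling'."""
    return 1.0 if e < len(tokens) and tokens[e] == "calling" else 0.0
```

Restricting the start candidates to the first few positions reflects the observation that caller phrases are short and occur near the beginning of the message.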
15Baseline (HZP, Collins log-linear)
- IE as tagging
- Pr(tag_i | word_i, word_{i-1}, ..., word_{i+1}, ..., tag_{i-1}, ...) estimated via MAXENT model
- Beam search to find best tag sequence given word sequence
- Features of model are words, word pairs, word pair + tag trigrams, ...
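Beam search over tag sequences can be sketched as below. The local model `toy_log_prob` is a hand-set stand-in for the trained MaxEnt distribution, and the two-tag set is an assumption; only the search procedure is the point.

```python
import math


def beam_search(words, tags, log_prob, beam_width=3):
    """Return the highest-scoring tag sequence found with a fixed-width beam."""
    beam = [(0.0, [])]  # (cumulative log-probability, tags so far)
    for i in range(len(words)):
        expanded = [
            (score + log_prob(words, i, seq[-1] if seq else "<s>", tag), seq + [tag])
            for score, seq in beam
            for tag in tags
        ]
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
    return beam[0][1]


def toy_log_prob(words, i, prev_tag, tag):
    """Hand-set local model: log Pr(tag_i | words, previous tag)."""
    p_name = 0.1
    if i > 0 and words[i - 1] == "is":
        p_name = 0.9  # a name is likely right after "is"
    elif prev_tag == "NAME" and words[i] != "calling":
        p_name = 0.8  # names tend to continue
    return math.log(p_name if tag == "NAME" else 1.0 - p_name)
```

For example, `beam_search("this is jack welch calling".split(), ["O", "NAME"], toy_log_prob)` tags the two tokens after "is" as NAME.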
16Performance
17Observation: caller names are really short and
near the beginning of the message.
18What about ASR transcripts?
19Extracting phone numbers
- Phase 1: hand-coded grammar proposes candidate phone numbers
- Not too hard, due to limited vocabulary
- Optimize recall (96%), not precision (30%)
- Phase 2: a learned decision tree filters candidates
- Use length, position, context, ...
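A minimal sketch of the two phases, assuming transcripts spell digits out as words. The "grammar" here is just maximal runs of digit words, and `plausible_length` is a hand-written stand-in for the learned decision tree; the actual grammar and features in the paper are not reproduced.

```python
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def propose_digit_runs(tokens, min_len=4):
    """Phase 1 (high recall): every maximal run of at least min_len digit words."""
    runs, start = [], None
    for i, tok in enumerate(tokens + ["<end>"]):  # sentinel closes a final run
        if tok in DIGITS:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))  # end index is exclusive
            start = None
    return runs


def to_digits(tokens, span):
    """Render a candidate span as a digit string."""
    return "".join(DIGITS[t] for t in tokens[span[0]:span[1]])


def plausible_length(tokens, span):
    """Phase 2 stand-in: keep 7- or 10-digit candidates only."""
    return span[1] - span[0] in (7, 10)
```

Because phase 1 deliberately over-generates, its precision is low; the filter in phase 2 is what restores it.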
20Results
21Their Conclusions
22Cohen, Wang, Murphy
- Another paper with a similar flavor
- IE for a particular task
- IE using similar propose-and-filter approach
- When and how do you engineer, and when and how do
you use learning?
24Background: subcellular localization
The most important tool for studying protein
localizations is fluorescence microscopy.
New image processing techniques can automatically
produce a quantitative description of subcellular
localization.
25Background: subcellular localization
26Background: subcellular localization
- Entrez: a new 376kD Golgi complex outer membrane protein. SWISS-PROT: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE
- Entrez: GPP130 type II Golgi membrane protein. SWISS-PROT: nothing
27Background: subcellular localization
- Some other interesting facts
- Primary structure is a poor indicator of localization
- Many possible localizations with image analysis
- Tens of thousands of images in open literature
28Overview of SLIF: image analysis of existing
images from online publications
[Pipeline diagram. Components: Figure finder, Panel Splitter, Panel Classifier, Scale Finder. Data passed between them: On-line paper, Figure, Image, Fl. Micr. Panel, Micr. Scale.]
29Background: overview of SLIF1
30Background: overview of SLIF1
Figure 1. (A) Single confocal 0-GFP fusion.
Bars, 5 μm. Movement of Coiled Bodies Vol. 10,
July 1999
Find scale bar and scale measurement
Rescale image of each cell, adjust contrast, and
compute subcellular localization features as if
it were an ordinary microscope image. Of course,
you still don't know what it's an image of
31Overview of SLIF: image analysis of existing
images from online publications
End result: collection of on-line fluorescence
microscope images, with quantitative description
of localization.
E.g. we know this figure section shows a
tubulin-like protein, but not which one!
32Background: overview of SLIF2.0
[Pipeline diagram. Components: Image Pointer Finder, Panel Splitter, Panel Label Matcher, Scope Finder, Panel Classifier, Name Finder, Scale Finder. Data: Caption, Image, Fl. Micr. Panel, Micr. Scale, Cell Type, Protein Name.]
33Background: overview of SLIF2.0
Figure 1. (A) Single confocal optical section of
BY-2 cells expressing U2B″-GFP, double labeled
with GFP (left panel) and autoantibody against
p80 coilin (right panel). Three nuclei are shown,
and the bright GFP spots colocalize with bright
foci of anti-coilin labeling. There is some
labeling of the cytoplasm by anti-p80 coilin. (B)
Single confocal optical section of BY-2 cells
expressing U2B″-GFP, double labeled with GFP
(left panel) and 4G3 antibody (right panel).
Three nuclei are shown. Most coiled bodies are
in the nucleoplasm, but occasionally are seen in
the nucleolus (arrows). All coiled bodies that
contain U2B″ also express the U2B″-GFP fusion.
Bars, 5 μm. Movement of Coiled Bodies Vol. 10,
July 1999 2299
A new issue: caption understanding. Where are
the entities in the image?
34Why caption understanding?
- Location proteomics.
- Remove extraneous junk from caption text for ordinary IE, NLP, indexing, ...
- Better text- or content-based image retrieval for scientific images.
35Identify image pointers: substrings that refer to
parts of the image
Will focus on text issues, not matching
36Identify image pointers: substrings that refer to
parts of the image
Classify image pointers as citation-style or
bullet-style.
37 Classify image pointers as citation-style or
bullet-style.
38Compute scopes
- The scope of a bullet-style image pointer is all words after it, but before the next bullet
- The scope of a citation-style image pointer is some set of words near it (heuristically determined by separating words and punctuation)
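The bullet-style rule is simple enough to sketch directly. This illustration assumes every parenthesized capital letter is a bullet-style pointer, which skips the classification step described on the previous slides; `BULLET_RE` and `bullet_scopes` are hypothetical names.

```python
import re

# Assumption: a bullet-style pointer looks like "(A)", "(B)", ...
BULLET_RE = re.compile(r"\(([A-Z])\)")


def bullet_scopes(caption):
    """Map each bullet label to the text between it and the next bullet."""
    matches = list(BULLET_RE.finditer(caption))
    scopes = {}
    for k, m in enumerate(matches):
        end = matches[k + 1].start() if k + 1 < len(matches) else len(caption)
        scopes[m.group(1)] = caption[m.end():end].strip()
    return scopes
```

Citation-style scopes are not covered here, since the slides only say they are determined heuristically from nearby words and punctuation.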
39Image pointers share all entities in their
scope. Entities are assigned to panels based
on matches of image-pointers to annotations in
panels.
40Outline
- Details on caption understanding
- Baseline hand-coded methods
- Learning methods
- Experimental results
41Task
- Identify image pointers in captions.
- Classify image pointers
- bullet-style, citation-style, or NP-style
- E.g., "Panels A and C show the ..."
- Won't talk about scoping
- Will focus first on extracting image pointers, i.e., binary classification of substrings: is this an image pointer?
- Data: 100 captions from 100 papers, about 600 positive examples.
42Baseline methods
- Labeled 100 sample figure captions.
- HANDCODE-1: patterns like (A), (B-E), (c and d), etc.
- HANDCODE-2: all short parenthesized expressions, plus patterns like "panel A" or "in B-C"
Some plausible tricks (like filtering HC-2) don't
help much
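The two hand-coded baselines can be approximated with regular expressions. These patterns are a guess at the spirit of HANDCODE-1 and HANDCODE-2, not the authors' rules; in particular the 15-character cutoff for "short" is an assumption.

```python
import re

_SEP = r"\s*(?:-|,|and)\s*"  # separators inside "(B-E)" or "(c and d)"

# HANDCODE-1 style: (A), (B-E), (c and d)
HANDCODE1 = re.compile(r"\([A-Za-z](?:" + _SEP + r"[A-Za-z])?\)")

# HANDCODE-2 style: any short parenthesized expression, "panel A", "in B-C"
HANDCODE2 = re.compile(
    r"\([^()]{1,15}\)"
    + r"|\b[Pp]anels?\s+[A-Z](?:" + _SEP + r"[A-Z])?\b"
    + r"|\bin\s+[A-Z]-[A-Z]\b"
)


def handcode1(text):
    """Return the substrings matched by the narrow pattern set."""
    return HANDCODE1.findall(text)


def handcode2(text):
    """Return the substrings matched by the broad, high-recall pattern set."""
    return HANDCODE2.findall(text)
```

The broad set also fires on things like "(left panel)", which is the kind of false positive a later filter has to remove.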
43How hard is the problem?
Some citation-style image pointers
44How hard is the problem?
NP-style
non-image pointers
The difficulty of the task suggests using a
learning approach
45Another use of propose-and-filter
Note that HANDCODE-2 (recall 98%) is a natural
candidate generator. We'll start with off-the-shelf
features.
Candidate Generator -> Candidate phrase -> Learned filter -> Extracted phrase
46Learning methods: features
- Start with named sets of labeled substrings
- Image pointers and tokens (not marked)
Fig. 1. Kinase inactive Plk inhibits Golgi
fragmentation by mitotic cytosol. (A) NRK cells
were grown on coverslips and treated with
2 mM thymidine for 8 to 14 h. Cells were
subsequently permeabilized with digitonin, washed
with 1 M KCl-containing buffer, and incubated with
either 7 mg/ml interphase cytosol (IE), 7 mg/ml
mitotic extract (ME), or mitotic extract to
which 20 mg/ml kinase inactive Plk (ME + Plk-KD)
was added. After a 60-min incubation at 32°C,
cells were fixed and stained with anti-mannosidase
II antibody to visualize the Golgi apparatus
by fluorescence microscopy. (B) Percentage of
cells with fragmented Golgi after incubation with
mitotic extract (ME) in the absence or the
presence of kinase inactive Plk (ME + Plk-KD).
The histogram represents the average of four
independent experiments.
47Learning methods: features
- Start with named sets of labeled substrings
- Image pointers (label = y/n) and tokens (label = token)
- Substrings act as examples and features
- To create features, use a little language
- emit( token, before, -1, label ),
- emit( token, before, -2, label ), ...
48Learning methods: features
- emit( token, before, -1, label ),
- emit( token, before, -2, label ), ...
Reading the arguments of emit in order:
- token: kind of substring to look for
- before: direction to go
- -1, -2: distance to go
- label: what to emit (substring label, distance in chars to substring, ...)
E.g. emit "inactive"
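One way to read the emit instructions is as a function from a candidate span to a feature string. The sketch below is an interpretation, not the authors' implementation: it supports only the `token` kind and the `label` output, and the feature-string format is made up.

```python
def emit(tokens, span, kind, direction, distance, what):
    """Emit the label of the token at a given distance from a candidate span."""
    if kind != "token" or what != "label":
        raise ValueError("this sketch only handles emit(token, ..., label)")
    start, end = span  # end is exclusive
    if direction == "before":
        idx = start + distance    # distance is negative, e.g. -1
    elif direction == "after":
        idx = end - 1 + distance  # distance is positive, e.g. +1
    else:
        raise ValueError(direction)
    if not 0 <= idx < len(tokens):
        return None  # nothing to emit past the caption boundary
    return f"token.{direction}.{distance}={tokens[idx]}"
```

For the candidate "(A)" in `["by", "mitotic", "cytosol", ".", "(A)", "NRK", "cells"]`, looking two tokens before emits a feature for "cytosol".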
49Learning methods: boosting
Generalized version of AdaBoost (Singer & Schapire,
99). Allows real-valued predictions for each
base hypothesis, including a value of zero.
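A single boosting round with an abstaining base hypothesis can be sketched as follows. The rule predicts a real-valued confidence on the examples it covers and zero elsewhere; the smoothing term 1/(2n) is an assumption made here to avoid division by zero, and the function names are illustrative.

```python
import math


def rule_confidence(weights, labels, covered, smooth):
    """Confidence 0.5 * ln(W+ / W-) over covered examples, with smoothing."""
    w_pos = sum(w for w, y, c in zip(weights, labels, covered) if c and y == +1)
    w_neg = sum(w for w, y, c in zip(weights, labels, covered) if c and y == -1)
    return 0.5 * math.log((w_pos + smooth) / (w_neg + smooth))


def boost_round(weights, labels, covered):
    """Return the rule's confidence and the renormalized example weights."""
    n = len(weights)
    conf = rule_confidence(weights, labels, covered, smooth=1.0 / (2 * n))
    # Uncovered examples get prediction 0, so their weight is unchanged
    # before normalization.
    new = [
        w * math.exp(-y * (conf if c else 0.0))
        for w, y, c in zip(weights, labels, covered)
    ]
    z = sum(new)
    return conf, [w / z for w in new]
```

After the update, covered examples the rule gets wrong gain weight and covered examples it gets right lose weight, which is what steers later rounds.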
50Learning methods: boosting rules
- Weak learner: to find weak hypothesis t
- Split data into Growing and Pruning sets
- Let Rt be an empty conjunction
- Greedily add conditions to Rt, guided by Growing set
- Greedily remove conditions from Rt, guided by Pruning set
- Convert to weak hypothesis
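The growing step can be sketched as greedy search over conditions. The objective used here, sqrt(W+) - sqrt(W-) on the covered examples, is an assumption chosen to fit confidence-rated boosting; pruning is omitted, and `grow_rule` and `z_tilde` are hypothetical names.

```python
import math


def z_tilde(weights, labels, covered):
    """Growing objective (assumed): sqrt(W+) - sqrt(W-) over covered examples."""
    w_pos = sum(w for w, y, c in zip(weights, labels, covered) if c and y == +1)
    w_neg = sum(w for w, y, c in zip(weights, labels, covered) if c and y == -1)
    return math.sqrt(w_pos) - math.sqrt(w_neg)


def grow_rule(examples, labels, weights, conditions):
    """Greedily add the condition that most improves the objective."""
    rule = []
    covered = [True] * len(examples)  # the empty conjunction covers everything
    best = z_tilde(weights, labels, covered)
    while True:
        choice, choice_cov = None, None
        for name, test in conditions.items():
            if name in rule:
                continue
            trial = [c and test(e) for c, e in zip(covered, examples)]
            score = z_tilde(weights, labels, trial)
            if score > best + 1e-12:  # require a strict improvement
                best, choice, choice_cov = score, name, trial
        if choice is None:
            return rule, covered
        rule.append(choice)
        covered = choice_cov
```

A full weak learner would then drop conditions that do not help on the held-out pruning set before converting the rule into a weak hypothesis.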
51Learning methods: boosting rules
SLIPPER also produces fairly compact rule sets.
52Learning methods: BWI
- Boosted wrapper induction (BWI) learns to extract substrings from a document.
- Learns three concepts: firstToken(x), lastToken(x), substringLength(k)
- Conditions are tests on tokens before/after x
- E.g., tok_{i-2}=from, isNumber(tok_{i+1})
- SLIPPER weak learner, no pruning.
- Greedy search extends window size by at most L in each iteration, uses lookahead L, no fixed limit on window size.
- Good results in (Kushmerick and Freitag, 2000)
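How the three learned concepts combine at extraction time can be sketched as a product of scores. The detectors below are hand-written stand-ins for the boosted firstToken and lastToken detectors, and the length table stands in for substringLength; the threshold value is arbitrary.

```python
def bwi_extract(tokens, fore, aft, length_prob, threshold):
    """Return (i, j) spans, j inclusive, whose combined score beats threshold."""
    spans = []
    for i in range(len(tokens)):
        f = fore(tokens, i)
        if f <= 0:
            continue
        for j in range(i, len(tokens)):
            score = f * aft(tokens, j) * length_prob.get(j - i + 1, 0.0)
            if score > threshold:
                spans.append((i, j))
    return spans


def toy_fore(tokens, i):
    """Stand-in firstToken detector: fires right after a colon."""
    return 1.0 if i > 0 and tokens[i - 1] == ":" else 0.0


def toy_aft(tokens, j):
    """Stand-in lastToken detector: fires right before 'call'."""
    return 1.0 if j + 1 < len(tokens) and tokens[j + 1] == "call" else 0.0
```

With `length_prob = {1: 0.2, 2: 0.8}` and a threshold of 0.5, only two-token spans bracketed by both detectors are extracted.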
53BWI example rules
54Learning methods: ABWI
- Almost boosted wrapper induction (ABWI) learns to extract substrings
- Learns to filter candidate substrings (HANDCODE-2)
- Conditions are the same tests on tokens near x
- E.g., tok_{i-2}=from, isNumber(tok_{i+1})
- SLIPPER weak learner, no pruning.
- Greedy search extends window size any amount, uses no lookahead, has fixed limit on window size.
- Optimal window sizes for this problem seem to be small
55Learning methods
- Features: W tokens before/after, all tokens inside.
- Learner: 100 rounds of boosting conjunctions of feature tests
- Inspired by BWI (Freitag & Kushmerick)
- Implemented with SLIPPER learner
56Other learning methods
All learning methods are competitive with
hand-coded methods
57Additional features
- Check if candidate contains certain special substrings
- Matches color name: labeled "color"
- Matches HANDCODE-1 pattern: "handcode1"
- Matches mm, mg, etc.: "measure"
- Matches 1980, ..., 2003, et al.: "citation"
- Matches top, left, etc.: "place"
- Added sentence boundary substrings
- Feature is distance to boundary.
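The special-substring checks can be sketched as a set of labeled patterns. The word lists below extend the slide's examples ("mm, mg", "top, left") with a few guesses, and the HANDCODE-1 pattern is the same approximation as a narrow parenthesized-letter regex; none of this is the authors' actual feature code.

```python
import re

SPECIAL = {
    "color": re.compile(r"\b(?:red|green|blue|yellow)\b", re.IGNORECASE),
    "handcode1": re.compile(r"\([A-Za-z](?:\s*(?:-|,|and)\s*[A-Za-z])?\)"),
    "measure": re.compile(r"\b(?:mm|mg|ml)\b"),
    # Years 1980 through 2003, or "et al"
    "citation": re.compile(r"\b(?:19[89]\d|200[0-3])\b|\bet al\b"),
    "place": re.compile(r"\b(?:top|bottom|left|right)\b", re.IGNORECASE),
}


def special_labels(candidate):
    """Return the labels of all special patterns found in the candidate."""
    return {label for label, pattern in SPECIAL.items() if pattern.search(candidate)}
```

Labels such as "citation" or "measure" mostly mark candidates that are not image pointers, so they are useful as negative evidence.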
58Learning with expanded feature set
Many new features are inversely correlated with
class (e.g. citation), but ABWI looks only for
positively-correlated patterns.
59Learning with expanded feature set
SABWI is a symmetric version of ABWI: it can use
rules and/or conditions negatively or positively
correlated with the class
61Task
- Identify image pointers in captions.
- Classify image pointers
- bullet-style, citation-style, or NP-style
- Combine these to get a four-class problem
- bullet-style, citation-style, or NP-style, other
- no hand-coded baseline methods
62Four-class extraction results
63Further improvement is probable with additional
labeled data