IE by Candidate Classification: Jansche & Abney, Cohen et al. William Cohen, 1/19/03

Provided by: William1501
Learn more at: http://www.cs.cmu.edu

1
IE by Candidate Classification: Jansche & Abney,
Cohen et al.
  • William Cohen
  • 1/19/03

2
SCAN: Search & Summarization for Audio
Collections (AT&T Labs)
3
(No Transcript)
4
Why IE from personal voicemail?
  • Unified interface for email, voicemail, fax, ...
    requires uniform headers
  • Sender, Time, Subject, ...
  • Headers are key for a uniform interface
  • Independently, voicemail access is slow
  • Useful to have fast access to important parts of
    the message (contact number, caller)

5
Why else to read this paper
  • Robust information extraction
  • Generalizing from manual transcripts (i.e.,
    human-produced written version of voicemail) to
    automatic (ASR) transcripts
  • Place of hand-coding vs learning in information
    extraction
  • How to break up task
  • Where and how to use engineering

Candidate Generator -> candidate phrase -> Learned filter -> extracted phrase
6
Voicemail corpus
  • About 10,000 manually transcribed and annotated
    voice messages.
  • 1869 used for evaluation

7
Observation: caller phrases are short and near
the beginning of the message.
8
Caller-phrase extraction
  • Propose start positions i_1, ..., i_N
  • Use a learned decision tree to pick the best i
  • Propose end positions i+j_1, i+j_2, ..., i+j_M
  • Use a learned decision tree to pick the best j
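
The two-stage pick above can be sketched as follows. This is only an illustration: the scoring functions stand in for the learned decision trees, and the names `extract_caller_phrase`, `toy_start_score`, `toy_end_score`, `CUES`, and `FILLER` are hypothetical, not from the paper.

```python
from typing import Callable, Optional, Sequence, Tuple


def extract_caller_phrase(
    tokens: Sequence[str],
    starts: Sequence[int],
    lengths: Sequence[int],
    score_start: Callable[[Sequence[str], int], float],
    score_end: Callable[[Sequence[str], int, int], float],
) -> Optional[Tuple[int, int]]:
    """Pick the best proposed start i, then the best end i+j for that start."""
    valid_starts = [i for i in starts if 0 <= i < len(tokens)]
    if not valid_starts:
        return None
    i = max(valid_starts, key=lambda s: score_start(tokens, s))
    ends = [i + j for j in lengths if j > 0 and i + j <= len(tokens)]
    if not ends:
        return None
    end = max(ends, key=lambda e: score_end(tokens, i, e))
    return i, end


# Toy stand-ins for the learned decision trees (assumptions for illustration).
CUES = {"it's", "is"}
FILLER = {"calling", "about", "from"}


def toy_start_score(tokens: Sequence[str], i: int) -> float:
    # Prefer positions right after a cue word, and earlier positions.
    cue = 1.0 if i > 0 and tokens[i - 1] in CUES else 0.0
    return cue - 0.01 * i


def toy_end_score(tokens: Sequence[str], i: int, end: int) -> float:
    # Reward length slightly, penalize filler words inside the phrase.
    phrase = tokens[i:end]
    return 0.1 * len(phrase) - sum(t in FILLER for t in phrase)


tokens = "hi there it's bill calling about the meeting".split()
span = extract_caller_phrase(tokens, range(5), [1, 2, 3],
                             toy_start_score, toy_end_score)
print(span, tokens[span[0]:span[1]])
```

Restricting `starts` to the first few positions reflects the observation that caller phrases sit near the beginning of the message.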

9
Baseline (HZP, Col log-linear)
  • IE as tagging
  • Pr(tag_i | word_i, word_{i-1}, ..., word_{i+1}, ..., tag_{i-1}, ...)
    estimated via MAXENT model
  • Beam search to find best tag sequence given word
    sequence
  • Features of model are words, word pairs, word
    pair + tag trigrams, ...

Hi   there  it's  Bill  and
OUT  OUT    IN    IN    OUT
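
Once a tag sequence is chosen, reading off the extracted phrases is mechanical. A minimal sketch, assuming only the two tags IN and OUT shown in the example (the helper name `tags_to_phrases` is made up here):

```python
from typing import List, Sequence


def tags_to_phrases(tokens: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Collect each maximal run of IN-tagged tokens as one extracted phrase."""
    phrases = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "IN" and start is None:
            start = i                      # a new phrase begins
        elif tag != "IN" and start is not None:
            phrases.append(" ".join(tokens[start:i]))
            start = None                   # the phrase ended at i-1
    if start is not None:                  # phrase runs to the end
        phrases.append(" ".join(tokens[start:]))
    return phrases


example_tokens = ["Hi", "there", "it's", "Bill", "and"]
example_tags = ["OUT", "OUT", "IN", "IN", "OUT"]
print(tags_to_phrases(example_tokens, example_tags))
```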
10
Performance
11
Observation: caller names are really short and
near the beginning of the message.
12
What about ASR transcripts?
13
Extracting phone numbers
  • Phase 1: hand-coded grammar proposes candidate
    phone numbers
  • Not too hard, due to limited vocabulary
  • Optimize recall (96%), not precision (30%)
  • Phase 2: a learned decision tree filters
    candidates
  • Use length, position, context, ...
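
A rough sketch of the two phases, under loud assumptions: the "grammar" here is just maximal runs of spoken digit words, and `keep_candidate` is a hand-written stand-in for the learned decision tree. Neither is the grammar or tree from the paper.

```python
from typing import List, Sequence, Tuple

DIGIT_WORDS = {"zero", "oh", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"}
CUE_WORDS = {"at", "number", "call"}   # hypothetical context cues


def propose_number_candidates(tokens: Sequence[str],
                              min_len: int = 3) -> List[Tuple[int, int]]:
    """Phase 1 (high recall): every maximal run of >= min_len digit words."""
    candidates = []
    start = None
    # A sentinel token closes a run that reaches the end of the message.
    for i, tok in enumerate(list(tokens) + ["<end>"]):
        if tok.lower() in DIGIT_WORDS:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                candidates.append((start, i))
            start = None
    return candidates


def keep_candidate(tokens: Sequence[str], span: Tuple[int, int]) -> bool:
    """Phase 2 stand-in: filter on length and left context."""
    start, end = span
    length = end - start
    left_context = tokens[max(0, start - 3):start]
    has_cue = any(w in CUE_WORDS for w in left_context)
    return length in (7, 10) or (length >= 4 and has_cue)


message = ("meet in room one two three then call me at "
           "five five five one two one two").split()
proposed = propose_number_candidates(message)
kept = [s for s in proposed if keep_candidate(message, s)]
print(proposed, kept)
```

The generator deliberately over-proposes (the room number is a candidate too); the filter is where precision is recovered.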

14
Results
15
Their Conclusions
16
Cohen, Wang, Murphy
  • Another paper with a similar flavor
  • IE for a particular task
  • IE using similar propose-and-filter approach
  • When and how do you engineer, and when and how do
    you use learning?

17
Background: subcellular localization
The most important tool for studying protein
localizations is fluorescence microscopy.
New image processing techniques can automatically
produce a quantitative description of subcellular
localization.
18
Background: subcellular localization
19
Background: subcellular localization
Entrez: a new 376kD Golgi complex outer membrane protein
SWISSProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE
Entrez: GPP130 type II Golgi membrane protein
SWISSProt: nothing
20
Background: subcellular localization
  • Some other interesting facts
  • Primary structure is a poor indicator of
    localization
  • Many possible localizations with image analysis
  • Tens of thousands of images in open literature

21
Overview of SLIF: image analysis of existing
images from online publications
[Pipeline diagram. Processing stages: Figure finder, Panel Splitter, Panel Classifier, Scale Finder. Inputs and outputs: On-line paper, Figure, Image, Fl. Micr. Panel, Micr. Scale.]
22
Overview of SLIF: image analysis of existing
images from online publications
End result: collection of on-line fluorescence
microscope images, with quantitative description
of localization.
E.g., we know this figure section shows a
tubulin-like protein,
but not which one!
23
Background: overview of SLIF1
24
Background: overview of SLIF1
Figure 1. (A) Single confocal ... U2B″-GFP fusion.
Bars, 5 µm. Movement of Coiled Bodies Vol. 10,
July 1999
Find scale bar and scale measurement.
Rescale image of each cell, adjust contrast, and
compute subcellular localization features as if
it were an ordinary microscope image. Of course,
you still don't know what it's an image of.
25
Background: overview of SLIF2.0
[Pipeline diagram. Processing stages: Image Pointer Finder, Panel Splitter, Panel Label Matcher, Scope Finder, Panel Classifier, Name Finder, Scale Finder. Inputs and outputs: Caption, Image, Fl. Micr. Panel, Micr. Scale, Cell Type, Protein Name.]
26
Background: overview of SLIF2.0
Figure 1. (A) Single confocal optical section of
BY-2 cells expressing U2B″-GFP, double labeled
with GFP (left panel) and autoantibody against
p80 coilin (right panel). Three nuclei are shown,
and the bright GFP spots colocalize with bright
foci of anti-coilin labeling. There is some
labeling of the cytoplasm by anti-p80 coilin. (B)
Single confocal optical section of BY-2 cells
expressing U2B″-GFP, double labeled with GFP
(left panel) and 4G3 antibody (right panel).
Three nuclei are shown. Most coiled bodies are
in the nucleoplasm, but occasionally are seen in
the nucleolus (arrows). All coiled bodies that
contain U2B″ also express the U2B″-GFP fusion.
Bars, 5 µm. Movement of Coiled Bodies Vol. 10,
July 1999 2299
A new issue: caption understanding. Where are
the entities in the image?
27
Why caption understanding?
  • Location proteomics.
  • Remove extraneous junk from caption text for
    ordinary IE, NLP, indexing, ...
  • Better text- or content-based image retrieval for
    scientific images.
28
Identify image pointers: substrings that refer to
parts of the image.
Will focus on text issues, not matching.
29
Identify image pointers: substrings that refer to
parts of the image.
Classify image pointers as citation-style or
bullet-style.
30
Classify image pointers as citation-style or
bullet-style.
31
Compute scopes
  • The scope of a bullet-style image pointer is all
    words after it, but before the next bullet.
  • The scope of a citation-style image pointer is
    some set of words near it (heuristically
    determined by separating words and punctuation).
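
The bullet-style rule is simple enough to sketch directly. This toy version assumes every parenthesized capital letter is a bullet-style pointer, which skips the classification step entirely; `BULLET` and `bullet_scopes` are names invented for the sketch. Citation-style scoping is heuristic and is not attempted here.

```python
import re
from typing import Dict

# Assumption: "(A)", "(B)", ... are the only bullet-style pointers.
BULLET = re.compile(r"\(([A-Z])\)")


def bullet_scopes(caption: str) -> Dict[str, str]:
    """Map each bullet label to the text after it, up to the next bullet."""
    matches = list(BULLET.finditer(caption))
    scopes = {}
    for k, m in enumerate(matches):
        # Scope ends where the next bullet starts, or at the caption end.
        end = matches[k + 1].start() if k + 1 < len(matches) else len(caption)
        scopes[m.group(1)] = caption[m.end():end].strip()
    return scopes


caption = ("Figure 1. (A) Single confocal optical section. "
           "(B) Three nuclei are shown.")
print(bullet_scopes(caption))
```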
32
Image pointers share all entities in their
scope. Entities are assigned to panels based
on matches of image-pointers to annotations in
panels.
33
Outline
  • Details on caption understanding
  • Baseline hand-coded methods
  • Learning methods
  • Experimental results

34
Task
  • Identify image pointers in captions.
  • Classify image pointers:
  • bullet-style, citation-style, or NP-style
  • E.g., "Panels A and C show the ..."
  • Won't talk about scoping
  • Will focus first on extracting image
    pointers, i.e., binary classification of
    substrings: is this an image pointer?
  • Data: 100 captions from 100 papers, about 600
    positive examples.

35
Baseline methods
  • Labeled 100 sample figure captions.
  • HANDCODE-1: patterns like (A), (B-E), (c and d),
    etc.
  • HANDCODE-2: all short parenthesized expressions,
    plus patterns like "panel A" or "in B-C"
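
The slides do not give the actual HANDCODE-1 patterns, so the regular expression below is only a guess at what "patterns like (A), (B-E), (c and d)" might look like: a parenthesized single letter, optionally followed by more single letters joined by "-", "," or "and".

```python
import re

# Hypothetical approximation of HANDCODE-1, not the authors' pattern set.
HANDCODE1 = re.compile(r"\(\s*[A-Za-z](?:\s*(?:-|,|and)\s*[A-Za-z])*\s*\)")


def handcode1_candidates(caption: str) -> list:
    """Return every substring matching the assumed HANDCODE-1 shape."""
    return HANDCODE1.findall(caption)


sample = ("Fig. 1. (A) NRK cells were treated with mitotic extract (ME) "
          "and stained (B-E); see also (c and d).")
print(handcode1_candidates(sample))
```

A narrow pattern like this skips "(ME)", which fits the high-precision, low-recall profile reported for HC-1.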

               HC-1   HC-2f   HC-2
Precision (%)  98.5   89.0    74.5
Recall (%)     45.6   54.8    98.0
F1 (%)         62.3   67.8    84.6
Some plausible tricks (like filtering HC-2) don't
help much.
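
As a sanity check, F1 is the harmonic mean of precision and recall, and the reported F1 values can be recomputed from the other two rows. The dictionary below just copies the table; the tolerance of 0.1 allows for rounding in the reported figures.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), all in percent, copied from the table.
reported = {
    "HC-1": (98.5, 45.6, 62.3),
    "HC-2f": (89.0, 54.8, 67.8),
    "HC-2": (74.5, 98.0, 84.6),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: recomputed F1 = {f1(p, r):.2f}, reported = {f}")
```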
36
How hard is the problem?
Some citation-style image pointers
37
How hard is the problem?
NP-style
non-image pointers
The difficulty of the task suggests using a
learning approach
38
Another use of propose-and-filter
Note that HANDCODE-2 (recall 98%) is a natural
candidate generator. We'll start with off-the-shelf
features.
Candidate Generator -> candidate phrase -> Learned filter -> extracted phrase
39
Learning methods: features
  • Start with named sets of labeled substrings
  • Image pointers and tokens (not marked)

Fig. 1. Kinase inactive Plk inhibits Golgi
fragmentation by mitotic cytosol. (A) NRK cells
were grown on coverslips and treated with
2 mM thymidine for 8 to 14 h. Cells were
subsequently permeabilized with digitonin, washed
with 1 M KCl-containing buffer, and incubated with
either 7 mg/ml interphase cytosol (IE), 7 mg/ml
mitotic extract (ME), or mitotic extract to
which 20 mg/ml kinase inactive Plk (ME + Plk-KD)
was added. After a 60-min incubation at 32°C,
cells were fixed and stained with anti-mannosidase
II antibody to visualize the Golgi apparatus
by fluorescence microscopy. (B) Percentage of
cells with fragmented Golgi after incubation with
mitotic extract (ME) in the absence or the
presence of kinase inactive Plk (ME + Plk-KD).
The histogram represents the average of four
independent experiments.
40
Learning methods: features
  • Start with named sets of labeled substrings
  • Image pointers (label = y/n) and tokens
    (label = token)
  • Substrings act as examples and features
  • To create features, use a little language
  • emit( token, before, -1, label ),
  • emit( token, before, -2, label ),

41
Learning methods: features
  • emit( token, before, -1, label ),
  • emit( token, before, -2, label ),

Arguments of emit, in order:
  • kind of substring to look for (token)
  • direction to go (before)
  • distance to go (-1)
  • what to emit (substring label, distance in chars
    to substring, ...)
emit inactive
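
One way to read the emit(...) calls is as a lookup over labeled substrings around a candidate. The sketch below is an interpretation, not the actual feature language: `Span`, `tokenize`, and this `emit` signature are invented for illustration, and only the "label" output is implemented.

```python
import re
from typing import List, NamedTuple, Optional


class Span(NamedTuple):
    start: int   # character offset where the substring begins
    end: int     # character offset just past the substring
    label: str   # for tokens, the label is the token text itself


def tokenize(text: str) -> List[Span]:
    """Label every word or punctuation mark with its own text."""
    return [Span(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def emit(spans: List[Span], candidate: Span,
         direction: str, distance: int) -> Optional[str]:
    """Label of the substring `distance` steps before/after the candidate."""
    if direction == "before":
        side = sorted((s for s in spans if s.end <= candidate.start),
                      key=lambda s: s.end, reverse=True)   # nearest first
    else:
        side = sorted((s for s in spans if s.start >= candidate.end),
                      key=lambda s: s.start)               # nearest first
    k = abs(distance) - 1
    return side[k].label if k < len(side) else None


text = "Kinase inactive Plk (A) NRK cells"
tokens = tokenize(text)
pointer_start = text.index("(A)")
candidate = Span(pointer_start, pointer_start + 3, "y")

print(emit(tokens, candidate, "before", -1))
print(emit(tokens, candidate, "before", -2))
print(emit(tokens, candidate, "after", 1))
```

Each returned label would become one feature of the candidate, for example "token before at -2 is inactive".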
42
Learning methods: boosting
Generalized version of AdaBoost (Singer & Schapire,
99). Allows real-valued predictions for each
base hypothesis, including a value of zero.
43
Learning methods: boosting rules
  • Weak learner to find weak hypothesis t:
  • Split data into Growing and Pruning sets
  • Let R_t be an empty conjunction
  • Greedily add conditions to R_t, guided by Growing
    set
  • Greedily remove conditions from R_t, guided by
    Pruning set
  • Convert to weak hypothesis
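
The grow/prune loop can be sketched as below. Several choices are assumptions of this sketch rather than anything stated on the slides: the score sqrt(W+) - sqrt(W-) over covered example weights, pruning by keeping the best-scoring prefix of the rule, and the smoothed log-ratio confidence. All function names are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple

Example = Dict[str, bool]                       # condition name -> holds?
Data = Sequence[Tuple[Example, int, float]]     # (example, label +1/-1, weight)


def covered_weights(rule: List[str], data: Data) -> Tuple[float, float]:
    """Total weight of positive and negative examples the rule covers."""
    w_pos = w_neg = 0.0
    for x, y, w in data:
        if all(x.get(c, False) for c in rule):
            if y > 0:
                w_pos += w
            else:
                w_neg += w
    return w_pos, w_neg


def score(rule: List[str], data: Data) -> float:
    w_pos, w_neg = covered_weights(rule, data)
    return math.sqrt(w_pos) - math.sqrt(w_neg)   # assumed objective


def learn_rule(grow: Data, prune: Data, conditions: List[str]) -> List[str]:
    rule: List[str] = []                         # empty conjunction
    while True:                                  # greedily add conditions
        best = max((c for c in conditions if c not in rule),
                   key=lambda c: score(rule + [c], grow), default=None)
        if best is None or score(rule + [best], grow) <= score(rule, grow):
            break
        rule.append(best)
    # Simplified pruning: keep the prefix that scores best on the pruning set.
    best_k = max(range(len(rule) + 1), key=lambda k: score(rule[:k], prune))
    return rule[:best_k]


def confidence(rule: List[str], data: Data, eps: float = 1e-3) -> float:
    """Real-valued prediction for covered examples; uncovered ones get 0."""
    w_pos, w_neg = covered_weights(rule, data)
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))


grow_set = [({"paren": True, "short": True}, +1, 0.25),
            ({"paren": True, "short": True}, +1, 0.25),
            ({"paren": True, "short": False}, -1, 0.25),
            ({"paren": False, "short": True}, -1, 0.25)]
prune_set = [({"paren": True, "short": True}, +1, 0.5),
             ({"paren": True, "short": False}, -1, 0.5)]

rule = learn_rule(grow_set, prune_set, ["paren", "short"])
print(rule, confidence(rule, grow_set))
```

Abstaining (predicting zero) on uncovered examples is what the generalized AdaBoost setting with real-valued predictions permits.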

44
Learning methods: boosting rules
SLIPPER also produces fairly compact rule sets.
45
Learning methods: BWI
  • Boosted wrapper induction (BWI) learns to extract
    substrings from a document.
  • Learns three concepts: firstToken(x),
    lastToken(x), substringLength(k)
  • Conditions are tests on tokens before/after x
  • E.g., tok_{i-2} = from, isNumber(tok_{i+1})
  • SLIPPER weak learner, no pruning.
  • Greedy search extends window size by at most L in
    each iteration, uses lookahead L, no fixed limit
    on window size.
  • Good results in (Kushmerick and Freitag, 2000)

46
Learning methods: ABWI
  • Almost boosted wrapper induction (ABWI) learns
    to extract substrings
  • Learns to filter candidate substrings (HANDCODE-2)
  • Conditions are the same tests on tokens near x
  • E.g., tok_{i-2} = from, isNumber(tok_{i+1})
  • SLIPPER weak learner, no pruning.
  • Greedy search extends window size any amount,
    uses no lookahead, has fixed limit on window
    size.
  • Optimal window sizes for this problem seem to be
    small

47
Learning methods
               HC-1   HC-2f   HC-2   ABWI (W=2)
Precision (%)  98.5   89.0    74.5   89.7
Recall (%)     45.6   54.8    98.0   91.0
F1 (%)         62.3   67.8    84.6   90.3
  • Features: W tokens before/after, all tokens
    inside.
  • Learner: 100 rounds of boosting conjunctions of
    feature tests
  • Inspired by BWI (Freitag & Kushmerick)
  • Implemented with SLIPPER learner
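
The window features can be sketched like this. The feature-name format ("before1=...", "inside=...") and the function name are choices made for the sketch, not the representation used in the experiments.

```python
from typing import Sequence, Set


def window_features(tokens: Sequence[str], start: int, end: int,
                    w: int = 2) -> Set[str]:
    """Features for candidate tokens[start:end]: W tokens on each side, all inside."""
    feats = set()
    for d in range(1, w + 1):
        if start - d >= 0:
            feats.add(f"before{d}={tokens[start - d]}")
        if end + d - 1 < len(tokens):
            feats.add(f"after{d}={tokens[end + d - 1]}")
    for tok in tokens[start:end]:
        feats.add(f"inside={tok}")
    return feats


toks = ["labeled", "with", "GFP", "(", "left", "panel", ")",
        "and", "autoantibody"]
feats = window_features(toks, 3, 7, w=2)   # candidate is "( left panel )"
print(sorted(feats))
```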

48
Other learning methods
               HC-1   HC-2f   HC-2   ABWI (W=2)   ABWI Slipper   ABWI Ripper   ABWI SVM1   ABWI SVM2
Precision (%)  98.5   89.0    74.5   89.7         96.1           88.1          69.0        100.0
Recall (%)     45.6   54.8    98.0   91.0         85.2           87.1          78.0        75.2
F1 (%)         62.3   67.8    84.6   90.3         90.3           87.6          73.2        85.6
All learning methods are competitive with
hand-coded methods.
49
Additional features
  • Check if candidate contains certain special
    substrings
  • Matches color name: labeled "color"
  • Matches HANDCODE-1 pattern: "handcode1"
  • Matches mm, mg, etc.: "measure"
  • Matches 1980, ..., 2003, et al.: "citation"
  • Matches top, left, etc.: "place"
  • Added sentence boundary substrings
  • Feature is distance to boundary.
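
A small sketch of the special-substring labels. The word lists are guesses that extend the examples on the slide ("mm, mg, etc.", "top, left, etc."); the color list in particular is entirely an assumption, and the handcode1 and sentence-boundary features are omitted.

```python
import re
from typing import List

# Hypothetical patterns; only the label names come from the slide.
SPECIAL = {
    "color": re.compile(r"\b(?:red|green|blue|yellow)\b", re.IGNORECASE),
    "measure": re.compile(r"\b(?:mm|mg|ml)\b"),
    "citation": re.compile(r"\b(?:19[89]\d|200[0-3])\b|et al"),
    "place": re.compile(r"\b(?:top|bottom|left|right)\b", re.IGNORECASE),
}


def special_labels(candidate: str) -> List[str]:
    """Names of all special-substring classes found in the candidate."""
    return sorted(name for name, pat in SPECIAL.items()
                  if pat.search(candidate))


for cand in ["(A)", "(left panel)", "(Smith et al., 1999)", "(5 mm)"]:
    print(cand, special_labels(cand))
```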

50
Learning with expanded feature set
               HC-1   HC-2f   HC-2   ABWI (W=2)   ABWI NA
Precision (%)  98.5   89.0    74.5   89.7         85.9
Recall (%)     45.6   54.8    98.0   91.0         92.2
F1 (%)         62.3   67.8    84.6   90.3         89.0
Many new features are inversely correlated with
class (e.g. citation), but ABWI looks only for
positively-correlated patterns.
51
Learning with expanded feature set
               HC-1   HC-2f   HC-2   ABWI (W=2)   ABWI NA   SABWI NA
Precision (%)  98.5   89.0    74.5   89.7         85.9      88.6
Recall (%)     45.6   54.8    98.0   91.0         92.2      93.8
F1 (%)         62.3   67.8    84.6   90.3         89.0      91.1
SABWI is a symmetric version of ABWI: it can use
rules and/or conditions negatively or positively
correlated with the class.
52
(No Transcript)
53
Task
  • Identify image pointers in captions.
  • Classify image pointers
  • bullet-style, citation-style, or NP-style
  • Combine these to get a four-class problem
  • bullet-style, citation-style, NP-style, or other
  • No hand-coded baseline methods

54
Four-class extraction results
Error rate (%) by window size W:
Method      W=2    W=3    W=5
ABWI        24.6   27.5   26.7
ABWI NA     26.7   22.2   26.7
SABWI NA    24.2   18.2   22.6
55
Further improvement is probable with additional
labeled data