IE by Candidate Classification: Jansche


1
IE by Candidate Classification: Jansche & Abney,
Cohen et al.
  • William Cohen
  • 1/30/06

2
Administrative stuff
  • Reminders
  • Turn in your insightful critiques of this week's
    papers.
  • Think about your project!

3
Landscape of IE Tasks: Complexity
(Review)
E.g. word patterns:
  • Closed set: U.S. states
    "He was born in Alabama"
    "The big Wyoming sky"
  • Regular set: U.S. phone numbers
    "Phone: (413) 545-1323"
    "The CALD main office can be reached at 412-268-1299"
  • Complex pattern: U.S. postal addresses
    "University of Arkansas P.O. Box 140 Hope, AR 71802"
    "Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210"
  • Ambiguous patterns, needing context and many
    sources of evidence: person names
    "... was among the six houses sold by Hope Feldman that year."
    "Pawel Opalinski, Software Engineer at WhizBang Labs."
4
Landscape of IE Tasks: Single Field/Record
(Review)
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
  • Single entity
    Person: Jack Welch
    Person: Jeffrey Immelt
    Location: Connecticut
  • Binary relationship
    Relation: Person-Title; Person: Jack Welch; Title: CEO
    Relation: Company-Location; Company: General Electric; Location: Connecticut
  • N-ary record
    Relation: Succession; Company: General Electric; Title: CEO;
    Out: Jack Welch; In: Jeffrey Immelt
5
Landscape of IE Tasks: Models
(Review)
Lexicons
Abraham Lincoln was born in Kentucky.
member?
Alabama, Alaska, ..., Wisconsin, Wyoming
This is often treated as a structured prediction
problem: classifying tokens sequentially.
HMMs, CRFs, ...
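The lexicon model above amounts to a membership test on each token. A minimal sketch, assuming a whitespace tokenizer and a small illustrative subset of the state lexicon (the set contents and the helper name `lexicon_matches` are made up for this example):

```python
# Illustrative subset of a closed-set lexicon; a real list would hold all 50 states.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}


def lexicon_matches(sentence, lexicon):
    """Return the tokens of `sentence` that are members of `lexicon`."""
    # Naive tokenization: split on whitespace and strip trailing punctuation.
    tokens = [tok.strip(".,;:!?") for tok in sentence.split()]
    return [tok for tok in tokens if tok in lexicon]


matches = lexicon_matches("Abraham Lincoln was born in Kentucky.", US_STATES)
print(matches)
```

A lookup like this only handles closed sets with single-token entries; the ambiguous cases from the complexity slide (for example "Hope" as a town or a first name) are why context-sensitive models are needed.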
6
Background on the J&A paper
7
SCAN: Search & Summarization for Audio
Collections (AT&T Labs)
8
(No Transcript)
9
Why IE from personal voicemail?
  • A unified interface for email, voicemail, fax, ...
    requires uniform headers
  • Sender, Time, Subject, ...
  • Headers are key for a uniform interface
  • Independently, voicemail access is slow
  • Useful to have fast access to the important parts of
    a message (contact number, caller)

10
Background on J&A, cont.
  • Quick review of Huang, Zweig & Padmanabhan (IBM
    Yorktown), "Information Extraction from
    Voicemail"
  • Goal: find identity and contact number of callers
    in voicemail (NER + role classification)
  • Evaluated three systems on 5000 labeled,
    manually transcribed messages
  • Baseline system
  • 200 hand-coded rules based on trigger phrases
  • State-of-the-art Ratnaparkhi-style MaxEnt tagger
  • Lexical unigrams, bigrams, dictionary features
    for names, numbers, trigger phrases; feature
    selection
  • Poor results
  • On manually transcribed data, F1 in the 80s for both
    tasks (good!)
  • On ASR data, F1 about 50 for caller names, 80
    for contact numbers, even with a very loose
    performance metric
  • Best learning method barely beat the baseline
    rule-based system.

11
What's interesting in this paper
  • How and when do we use ML?
  • Robust information extraction
  • Generalizing from manual transcripts (i.e.,
    human-produced written version of voicemail) to
    automatic (ASR) transcripts
  • Place of hand-coding vs learning in information
    extraction
  • How to break up task
  • Where and how to use engineering

Candidate Generator → Candidate phrase → Learned filter → Extracted phrase
12
Voicemail corpus
  • About 10,000 manually transcribed and annotated
    voice messages.
  • 1869 used for evaluation

13
Observation: caller phrases are short and near
the beginning of the message.
14
Caller-phrase extraction
  • Propose start positions i1, ..., iN
  • Use a learned decision tree to pick the best i
  • Propose end positions i+j1, i+j2, ..., i+jM
  • Use a learned decision tree to pick the best j

15
Baseline (HZP, Collins log-linear)
  • IE as tagging
  • Pr(tag_i | word_i, word_{i-1}, ..., word_{i+1}, ..., tag_{i-1}, ...)
    estimated via a MaxEnt model
  • Beam search to find the best tag sequence given the word
    sequence
  • Features of the model are words, word pairs, word
    pair+tag trigrams, ...
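The tagging baseline can be sketched as beam search over a local conditional model. This is only an illustration: `local_prob` below is a hand-set stand-in for a trained MaxEnt model, and the tag set, weights, and example message are assumptions, not taken from the papers.

```python
import math

TAGS = ["O", "NAME"]


def local_prob(tag, prev_tag, words, i):
    """Hypothetical stand-in for Pr(tag_i | words, tag_{i-1}); not a trained model."""
    weight = {"O": 1.0, "NAME": 0.1}
    capitalized = words[i][0].isupper()
    if capitalized and i > 0:
        weight["NAME"] += 2.0  # capitalized, non-initial word
    if capitalized and prev_tag == "NAME":
        weight["NAME"] += 2.0  # continuation of a name
    return weight[tag] / sum(weight.values())


def beam_search(words, beam_width=3):
    """Return the highest-scoring tag sequence found with a fixed-width beam."""
    beam = [(0.0, [])]  # (log-probability, tags so far)
    for i in range(len(words)):
        candidates = []
        for logp, tags in beam:
            prev_tag = tags[-1] if tags else "START"
            for tag in TAGS:
                p = local_prob(tag, prev_tag, words, i)
                candidates.append((logp + math.log(p), tags + [tag]))
        # Keep only the best `beam_width` partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]


words = "hi this is Mary Smith calling".split()
tags = beam_search(words)
print(list(zip(words, tags)))
```

Beam search is approximate: a sequence pruned early cannot be recovered later, which is the usual trade-off against exact decoding.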

16
Performance
17
Observation: caller names are really short and
near the beginning of the message.
18
What about ASR transcripts?
19
Extracting phone numbers
  • Phase 1: a hand-coded grammar proposes candidate
    phone numbers
  • Not too hard, due to limited vocabulary
  • Optimize recall (96%), not precision (30%)
  • Phase 2: a learned decision tree filters
    candidates
  • Use length, position, context, ...
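The two phases can be sketched as a high-recall proposer followed by a filter. In this sketch the "grammar" is just maximal runs of spoken digit words, and `keep_candidate` is a hand-set rule standing in for the learned decision tree; both, along with the example message, are assumptions made for illustration.

```python
DIGIT_WORDS = {"zero", "oh", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"}


def propose_candidates(tokens):
    """Phase 1: propose every maximal run of digit words as a (start, end) span."""
    spans, start = [], None
    # The sentinel closes a run that reaches the end of the message.
    for i, tok in enumerate(tokens + ["<end>"]):
        if tok in DIGIT_WORDS:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    return spans


def keep_candidate(span):
    """Phase 2: stand-in for the learned filter, using only a length feature."""
    start, end = span
    return (end - start) in (7, 10)  # assumed phone-number lengths


message = ("hi it's mary i called at two please call me back at "
           "five five five one two three four")
tokens = message.split()
candidates = propose_candidates(tokens)
extracted = [" ".join(tokens[s:e]) for s, e in candidates if keep_candidate((s, e))]
print(candidates, extracted)
```

The proposer deliberately over-generates (the lone "two" is a candidate), which mirrors the slide's point about optimizing recall first and leaving precision to the filter.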

20
Results
21
Their Conclusions
22
Cohen, Wang, Murphy
  • Another paper with a similar flavor
  • IE for a particular task
  • IE using similar propose-and-filter approach
  • When and how do you engineer, and when and how do
    you use learning?

24
Background: subcellular localization
The most important tool for studying protein
localizations is fluorescence microscopy.
New image processing techniques can automatically
produce a quantitative description of subcellular
localization.
25
Background: subcellular localization
26
Background: subcellular localization
Entrez: a new 376kD Golgi complex outher
membrane protein. SWISSProt: INTEGRAL MEMBRANE
PROTEIN. GOLGI MEMBRANE
Entrez: GPP130 type II Golgi membrane
protein. SWISSProt: nothing
27
Background: subcellular localization
  • Some other interesting facts
  • Primary structure is poor indicator of
    localization
  • Many possible localizations with image analysis
  • Tens of thousands of images in open literature

28
Overview of SLIF: image analysis of existing
images from online publications
[Pipeline diagram. Processing stages: Figure Finder, Panel Splitter, Panel Classifier, Scale Finder. Inputs and outputs: on-line paper, figure, image, fluorescence microscope panel, microscope scale.]
29
Background: overview of SLIF1
30
Background: overview of SLIF1
Figure 1. (A) Single confocal 0-GFP fusion.
Bars, 5 μm. Movement of Coiled Bodies Vol. 10,
July 1999
Find scalebar and scale measurement.
Rescale image of each cell, adjust contrast, and
compute subcellular localization features as if
it were an ordinary microscope image. Of course,
you still don't know what it's an image of.
31
Overview of SLIF: image analysis of existing
images from online publications
End result: a collection of on-line fluorescence
microscope images, with a quantitative description
of localization.
E.g., we know this figure section shows a
tubulin-like protein,
but not which one!
32
Background: overview of SLIF2.0
[Pipeline diagram. Inputs: caption, image. Processing stages: Image Pointer Finder, Panel Splitter, Panel Label Matcher, Scope Finder, Panel Classifier, Name Finder, Scale Finder. Outputs: fluorescence microscope panel, microscope scale, cell type, protein name.]
33
Background: overview of SLIF2.0
Figure 1. (A) Single confocal optical section of
BY-2 cells expressing U2B″-GFP, double labeled
with GFP (left panel) and autoantibody against
p80 coilin (right panel). Three nuclei are shown,
and the bright GFP spots colocalize with bright
foci of anti-coilin labeling. There is some
labeling of the cytoplasm by anti-p80 coilin. (B)
Single confocal optical section of BY-2 cells
expressing U2B″-GFP, double labeled with GFP
(left panel) and 4G3 antibody (right panel).
Three nuclei are shown. Most coiled bodies are
in the nucleoplasm, but occasionally are seen in
the nucleolus (arrows). All coiled bodies that
contain U2B″ also express the U2B″-GFP fusion.
Bars, 5 μm. Movement of Coiled Bodies Vol. 10,
July 1999 2299
A new issue: caption understanding. Where are
the entities in the image?
34
Why caption understanding?
  • Location proteomics.
  • Remove extraneous junk from caption text for
    ordinary IE, NLP, indexing, ...
  • Better text- or content-based image retrieval for
    scientific images.
35
Identify image pointers: substrings that refer to
parts of the image.
Will focus on text issues, not matching.
36
Identify image pointers: substrings that refer to
parts of the image.
Classify image pointers as citation-style or
bullet-style.
38
Compute scopes
  • The scope of a bullet-style image pointer is all
    words after it, but before the next bullet
  • The scope of a citation-style image pointer is some
    set of words near it (heuristically determined by
    separating words and punctuation)
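The bullet-style rule is simple enough to sketch directly. The example below assumes, for illustration only, that every parenthesized capital letter is a bullet-style pointer; in the actual task that decision comes from the classification step, and the shortened caption is made up.

```python
import re

# Assumption for this sketch: "(A)", "(B)", ... are all bullet-style pointers.
BULLET = re.compile(r"\(([A-Z])\)")


def bullet_scopes(caption):
    """Map each bullet-style pointer to the text after it, up to the next bullet."""
    matches = list(BULLET.finditer(caption))
    scopes = {}
    for k, m in enumerate(matches):
        # The scope ends where the next bullet starts, or at the end of the caption.
        end = matches[k + 1].start() if k + 1 < len(matches) else len(caption)
        scopes[m.group(1)] = caption[m.end():end].strip()
    return scopes


caption = "Figure 1. (A) Cells expressing GFP. (B) Cells labeled with 4G3 antibody."
scopes = bullet_scopes(caption)
print(scopes)
```

Citation-style scopes are not covered here, since the slide only says they are determined heuristically from nearby words and punctuation.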
39
Image pointers share all entities in their
scope. Entities are assigned to panels based
on matches of image pointers to annotations in
panels.
40
Outline
  • Details on caption understanding
  • Baseline hand-coded methods
  • Learning methods
  • Experimental results

41
Task
  • Identify image pointers in captions.
  • Classify image pointers
  • bullet-style, citation-style, or NP-style
  • E.g., "Panels A and C show the ..."
  • Won't talk about scoping
  • Will focus first on extracting image
    pointers, i.e., binary classification of
    substrings: is this an image pointer?
  • Data: 100 captions from 100 papers, about 600
    positive examples.

42
Baseline methods
  • Labeled 100 sample figure captions.
  • HANDCODE-1: patterns like (A), (B-E), (c and d),
    etc.
  • HANDCODE-2: all short parenthesized expressions,
    plus patterns like "panel A" or "in B-C"

Some plausible tricks (like filtering HC-2) don't
help much.
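Patterns of this kind are easy to approximate with regular expressions. The two expressions below are guesses at the flavor of the hand-coded baselines, not the actual rules: `HANDCODE1` covers single letters joined by "-", "," or "and", and `HANDCODE2` covers only the "short parenthesized expression" part, with an assumed length limit of 15 characters. The caption is made up.

```python
import re

# Rough stand-in for HANDCODE-1: (A), (B-E), (c and d), (A, B) ...
HANDCODE1 = re.compile(r"\(\s*[A-Za-z](?:\s*(?:-|,|and)\s*[A-Za-z])*\s*\)")

# Rough stand-in for part of HANDCODE-2: any short parenthesized expression.
# The 15-character limit is an assumption made for this sketch.
HANDCODE2 = re.compile(r"\([^()]{1,15}\)")

caption = ("(A) Control cells. (B-E) Treated cells; insets (c and d) "
           "show detail (left panel) at 5 mm (1999).")

hc1 = HANDCODE1.findall(caption)
hc2 = HANDCODE2.findall(caption)
print(hc1)
print(hc2)
```

The narrower pattern misses "(left panel)", while the broader one also picks up "(1999)", which illustrates the precision and recall trade-off between the two baselines.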
43
How hard is the problem?
Some citation-style image pointers
44
How hard is the problem?
NP-style
non-image pointers
The difficulty of the task suggests using a
learning approach
45
Another use of propose-and-filter
Note that HANDCODE-2 (recall 98%) is a natural
candidate generator. We'll start with off-the-shelf
features.
Candidate Generator → Candidate phrase → Learned filter → Extracted phrase
46
Learning methods: features
  • Start with named sets of labeled substrings
  • Image pointers and tokens (not marked)

Fig. 1. Kinase inactive Plk inhibits Golgi
fragmentation by mitotic cytosol. (A) NRK cells
were grown on coverslips and treated with
2 mM thymidine for 8 to 14 h. Cells were
subsequently permeabilized with digitonin, washed
with 1 M KCl-containing buffer, and incubated with
either 7 mg/ml interphase cytosol (IE), 7 mg/ml
mitotic extract (ME), or mitotic extract to
which 20 mg/ml kinase inactive Plk (ME Plk-KD)
was added. After a 60-min incubation at 32°C,
cells were fixed and stained with anti-mannosidase
II antibody to visualize the Golgi apparatus
by fluorescence microscopy. (B) Percentage of
cells with fragmented Golgi after incubation with
mitotic extract (ME) in the absence or the
presence of kinase inactive Plk (ME Plk-KD).
The histogram represents the average of four
independent experiments.
47
Learning methods: features
  • Start with named sets of labeled substrings
  • Image pointers (label=y/n) and tokens
    (label=token)
  • Substrings act as examples and features
  • To create features, use a little language
  • emit( token, before, -1, label ),
  • emit( token, before, -2, label ),

48
Learning methods: features
  • emit( token, before, -1, label ),
  • emit( token, before, -2, label ),

Arguments, in order:
  • token: kind of substring to look for
  • before: direction to go
  • -1: distance to go
  • label: what to emit (substring label, distance in chars
    to substring, ...)
emit inactive
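One way to read the `emit` language is as a lookup over labeled spans around a candidate. The sketch below is an interpretation, not the original implementation: the `Span` class, the whitespace tokenizer, and the feature-string format are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    kind: str    # name of the labeled set, e.g. "token" or "pointer"
    start: int   # character offset where the substring starts
    end: int     # character offset just past the substring
    label: str   # for tokens, the token text; for pointers, "y" or "n"


def tokenize(text):
    """Whitespace tokenizer that records character offsets (hypothetical helper)."""
    spans, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        spans.append(Span("token", start, start + len(word), word))
        pos = start + len(word)
    return spans


def emit(spans, candidate, kind, direction, distance, what):
    """Find the |distance|-th `kind` span before/after `candidate`; emit one attribute."""
    if direction == "before":
        pool = [s for s in spans if s.kind == kind and s.end <= candidate.start]
        pool.reverse()  # nearest span first
    else:
        pool = [s for s in spans if s.kind == kind and s.start >= candidate.end]
    k = abs(distance) - 1
    if k >= len(pool):
        return None  # nothing that far away, so no feature is emitted
    return f"{kind}.{direction}.{distance}.{what}={getattr(pool[k], what)}"


text = "by mitotic cytosol. (A) NRK cells were grown"
tokens = tokenize(text)
start = text.index("(A)")
candidate = Span("pointer", start, start + 3, "y")

features = [
    emit(tokens, candidate, "token", "before", -1, "label"),
    emit(tokens, candidate, "token", "before", -2, "label"),
    emit(tokens, candidate, "token", "after", 1, "label"),
]
print(features)
```

Each emitted string can then serve as a binary feature test for the learner.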
49
Learning methods: boosting
Generalized version of AdaBoost (Singer & Schapire,
99). Allows real-valued predictions for each
base hypothesis, including a value of zero.
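A single round of this kind of boosting can be sketched for a base hypothesis that predicts one real value where it fires and zero elsewhere. The confidence is half the log ratio of covered positive to covered negative weight; the smoothing term of 1/(2n) and the function name are assumptions of this sketch.

```python
import math


def confidence_rated_round(weights, labels, covered):
    """One boosting round for a hypothesis that abstains (predicts 0) when not covered.

    weights: current example weights (summing to 1)
    labels:  +1 or -1 per example
    covered: True where the base hypothesis fires
    """
    eps = 1.0 / (2 * len(weights))  # assumed smoothing to avoid division by zero
    w_plus = sum(w for w, y, c in zip(weights, labels, covered) if c and y > 0)
    w_minus = sum(w for w, y, c in zip(weights, labels, covered) if c and y < 0)
    confidence = 0.5 * math.log((w_plus + eps) / (w_minus + eps))

    # Covered examples are reweighted by exp(-y * confidence);
    # uncovered examples see a prediction of zero and keep their weight.
    updated = [w * math.exp(-y * confidence) if c else w
               for w, y, c in zip(weights, labels, covered)]
    z = sum(updated)
    return confidence, [w / z for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
covered = [True, True, False, False]
confidence, new_weights = confidence_rated_round(weights, labels, covered)
print(confidence, new_weights)
```

After normalization, the examples the hypothesis got right lose relative weight, so later rounds concentrate on the examples it abstained on.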
50
Learning methods: boosting rules
  • Weak learner to find weak hypothesis t
  • Split Data into Growing and Pruning sets
  • Let Rt be an empty conjunction
  • Greedily add conditions to Rt guided by Growing
    set
  • Greedily remove conditions from Rt guided by
    Pruning set
  • Convert to weak hypothesis
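The growing step can be sketched as a greedy search over conjunctions. This example assumes the growing criterion is sqrt(W+) - sqrt(W-) computed over the covered examples, represents each example as a set of feature names, and leaves out the split into growing and pruning sets as well as the pruning step; the feature names are invented.

```python
import math


def rule_score(rule, examples, weights):
    """sqrt(W+) - sqrt(W-), where W+/W- are the covered positive/negative weights."""
    w_plus = w_minus = 0.0
    for (features, label), w in zip(examples, weights):
        if rule <= features:  # the conjunction holds if all its conditions are present
            if label > 0:
                w_plus += w
            else:
                w_minus += w
    return math.sqrt(w_plus) - math.sqrt(w_minus)


def grow_rule(examples, weights):
    """Greedily add the condition that most improves the score; stop when none does."""
    rule = frozenset()  # start from the empty conjunction
    best = rule_score(rule, examples, weights)
    conditions = set().union(*(features for features, _ in examples))
    while True:
        scored = [(rule_score(rule | {c}, examples, weights), c)
                  for c in sorted(conditions - rule)]
        if not scored:
            break
        top_score, top_condition = max(scored)
        if top_score <= best:
            break
        rule, best = rule | {top_condition}, top_score
    return rule


examples = [
    ({"in_parens", "single_letter"}, +1),
    ({"in_parens", "single_letter", "starts_sentence"}, +1),
    ({"in_parens", "has_year"}, -1),
    ({"in_parens", "has_unit"}, -1),
]
weights = [0.25] * 4
rule = grow_rule(examples, weights)
print(sorted(rule))
```

The resulting conjunction would then be converted to a weak hypothesis that predicts a confidence value on covered examples and zero elsewhere.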

51
Learning methods: boosting rules
SLIPPER also produces fairly compact rule sets.
52
Learning methods: BWI
  • Boosted wrapper induction (BWI) learns to extract
    substrings from a document.
  • Learns three concepts: firstToken(x),
    lastToken(x), substringLength(k)
  • Conditions are tests on tokens before/after x
  • E.g., tok_{i-2}=from, isNumber(tok_{i+1})
  • SLIPPER weak learner, no pruning.
  • Greedy search extends window size by at most L in
    each iteration, uses lookahead L, no fixed limit
    on window size.
  • Good results in (Kushmerick and Freitag, 2000)

53
BWI example rules
54
Learning methods: ABWI
  • Almost boosted wrapper induction (ABWI) learns
    to extract substrings
  • Learns to filter candidate substrings (HANDCODE-2)
  • Conditions are the same tests on tokens near x
  • E.g., tok_{i-2}=from, isNumber(tok_{i+1})
  • SLIPPER weak learner, no pruning.
  • Greedy search extends window size by any amount,
    uses no lookahead, has a fixed limit on window
    size.
  • Optimal window sizes for this problem seem to be
    small

55
Learning methods
  • Features: W tokens before/after, all tokens
    inside.
  • Learner: 100 rounds of boosting conjunctions of
    feature tests
  • Inspired by BWI (Freitag & Kushmerick)
  • Implemented with the SLIPPER learner

56
Other learning methods
All learning methods are competitive with
hand-coded methods
57
Additional features
  • Check if candidate contains certain special
    substrings
  • Matches a color name: labeled color
  • Matches a HANDCODE-1 pattern: handcode1
  • Matches mm, mg, etc.: measure
  • Matches 1980, ..., 2003, et al.: citation
  • Matches top, left, etc.: place
  • Added sentence boundary substrings
  • Feature is distance to boundary.

58
Learning with expanded feature set
Many new features are inversely correlated with
class (e.g. citation), but ABWI looks only for
positively-correlated patterns.
59
Learning with expanded feature set
SABWI is a symmetric version of ABWI: it can use
rules and/or conditions negatively or positively
correlated with the class.
60
(No Transcript)
61
Task
  • Identify image pointers in captions.
  • Classify image pointers
  • bullet-style, citation-style, or NP-style
  • Combine these to get a four-class problem
  • bullet-style, citation-style, NP-style, or other
  • No hand-coded baseline methods

62
Four-class extraction results
63
Further improvement is probable with additional
labeled data