Structured Prediction: A Large Margin Approach
(Transcript of a 75-slide PowerPoint presentation)
1
Structured Prediction: A Large Margin Approach
  • Ben Taskar
  • University of Pennsylvania
  • Joint work with
  • V. Chatalbashev, M. Collins, C. Guestrin, M.
    Jordan, D. Klein, D. Koller, S. Lacoste-Julien,
    C. Manning

2
Don't worry, Howard. The big questions are multiple choice.
3
Handwriting Recognition
x: image of a handwritten word
y: "brace"
Sequential structure
4
Object Segmentation
x: image
y: object segmentation
Spatial structure
5
Natural Language Parsing
x: "The screen was a sea of red"
y: parse tree
Recursive structure
6
Bilingual Word Alignment
x: sentence pair
  En vertu des nouvelles propositions, quel est le coût prévu de perception des droits ?
  What is the anticipated cost of collecting fees under the new proposal ?
y: word alignment
Combinatorial structure
7
Protein Structure and Disulfide Bridges
x: amino-acid sequence AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK
y: disulfide bridge pairing
Protein 1IMT
8
Local Prediction
[Figure: letters of "brace" classified one at a time]
  • Classify using local information
  • Ignores correlations and constraints!

9
Local Prediction
10
Structured Prediction
[Figure: letters of "brace" classified jointly]
  • Use local information
  • Exploit correlations

11
Structured Prediction
12
Outline
  • Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
  • Structured large margin estimation
  • Margins and structure
  • Min-max formulation
  • Linear programming inference
  • Certificate formulation

13
Structured Models
scoring function over a space of feasible outputs
  • Mild assumption: the score is a linear combination of features

14
Chain Markov Net (aka CRF)
y
x
Lafferty et al. 01
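For a chain Markov net, exact MAP inference is dynamic programming. A minimal sketch with toy scoring tables (the trained CRF would supply node and edge scores as w·f at each position; the tables here are illustrative only):

```python
def viterbi(n, labels, node_score, edge_score):
    """MAP inference on a length-n chain by the Viterbi recursion."""
    best = {y: node_score(0, y) for y in labels}   # best score ending in y
    backptrs = []
    for i in range(1, n):
        new, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + edge_score(p, y))
            new[y] = best[prev] + edge_score(prev, y) + node_score(i, y)
            ptr[y] = prev
        best, backptrs = new, backptrs + [ptr]
    # trace back the best path
    y = max(best, key=best.get)
    path = [y]
    for ptr in reversed(backptrs):
        y = ptr[y]
        path.append(y)
    return path[::-1]

labels = ["a", "b"]
node_score = lambda i, y: 1.0 if y == "a" else 0.0  # toy: prefer "a"
edge_score = lambda p, y: 2.0 if p == y else 0.0    # toy: prefer repeats
print(viterbi(3, labels, node_score, edge_score))   # ['a', 'a', 'a']
```

The same recursion underlies both prediction and the loss-augmented inference used during learning.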
16
Associative Markov Nets
Point features: spin-images, point height
Edge features: length of edge, edge orientation
Node potentials φj(yj); edge potentials φjk(yj, yk) with the associative restriction
17
CFG Parsing
(NP → DT NN) (PP → IN NP) (NN → sea)
18
Bilingual Word Alignment
En vertu des nouvelles propositions, quel est le coût prévu de perception des droits ?
What is the anticipated cost of collecting fees under the new proposal ?
Features on each candidate edge (j, k):
  • position
  • orthography
  • association
19
Disulfide Bonds Non-bipartite Matching
RSCCPCYWGGCPWGQNCYPEGCSGPKV, cysteines numbered 1-6
[Figure: alternative pairings of the six cysteines]
Fariselli & Casadio 01, Baldi et al. 04
20
Scoring Function
RSCCPCYWGGCPWGQNCYPEGCSGPKV, cysteines numbered 1-6
[Figure: two candidate bonding patterns scored by the model]
  • amino acid identities
  • phys/chem properties
21
Structured Models
scoring function over a space of feasible outputs
  • Mild assumptions: the score is a linear combination of features and a sum of part scores
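The two mild assumptions together say the score is w·f(x, y) with f summing over parts. A minimal sketch; the node/edge part decomposition and the toy feature names are illustrative, not the talk's actual features:

```python
# Hedged sketch: a linear structured score w.f(x, y) that decomposes over
# parts (here: node parts and consecutive-label edge parts of a sequence).

def part_features(x, y):
    """Emit sparse feature counts for node and edge parts."""
    feats = {}
    for xi, yi in zip(x, y):
        feats[("node", xi, yi)] = feats.get(("node", xi, yi), 0) + 1
    for yi, yj in zip(y, y[1:]):
        feats[("edge", yi, yj)] = feats.get(("edge", yi, yj), 0) + 1
    return feats

def score(w, x, y):
    """w.f(x, y): linear in the parameters, a sum of part scores."""
    return sum(w.get(f, 0.0) * v for f, v in part_features(x, y).items())

w = {("node", "b", "b"): 2.0, ("edge", "b", "r"): 1.0}
print(score(w, ["b", "r"], ["b", "r"]))  # 2.0 + 1.0 = 3.0
```

Because the score decomposes over parts, inference can exploit the structure (dynamic programming, matching, LP) instead of enumerating y.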

22
Supervised Structured Prediction
Data → Learning (estimate w) → Model → Prediction
  • Prediction: weighted matching (example); generally combinatorial optimization
  • Estimation: local (ignores structure); likelihood (intractable); margin
23
Outline
  • Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
  • Structured large margin estimation
  • Margins and structure
  • Min-max formulation
  • Linear programming inference
  • Certificate formulation

24
OCR Example
  • We want: the true word "brace" to score higher than every other labeling
  • Equivalently: score(x, "brace") > score(x, "aaaaa"), score(x, "brace") > score(x, "aaaab"), ..., score(x, "brace") > score(x, "zzzzz") (a lot of constraints!)
25
Parsing Example
  • We want: the correct parse of "It was red" to score higher than every alternative parse
  • Equivalently: one constraint per alternative parse (a lot of constraints!)
26
Alignment Example
  • We want: the correct alignment of "What is the" / "Quel est le" to score higher than every alternative alignment
  • Equivalently: one constraint per alternative alignment (a lot of constraints!)
27
Structured Loss
Hamming-type loss counts wrong parts, e.g. versus "brace":
  bcare: 2, brore: 2, broce: 1, brace: 0
Analogous per-part losses for alignments ("What is the" / "Quel est le") and parses ("It was red")
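The structured (Hamming) loss above is just a count of mismatched parts; a one-function sketch:

```python
def hamming(y, ytrue):
    """Structured (Hamming) loss: number of mismatched positions."""
    return sum(a != b for a, b in zip(y, ytrue))

print(hamming("broce", "brace"))  # 1
print(hamming("bcare", "brace"))  # 2
```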
28
Large margin estimation
  • Given training examples, we want the true output to win
  • Maximize margin
  • Mistake-weighted margin: scale the required margin by the number of mistakes in y

Collins 02, Altun et al 03, Taskar 03
29
Large margin estimation
  • Eliminate
  • Add slacks for inseparable case
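In the slack form, each example's slack is a structured hinge: the gap between the best loss-augmented competitor and the true output's score. A brute-force sketch for tiny outputs (real models replace the enumeration with inference; the toy scoring functions below are illustrative):

```python
from itertools import product

def hamming(y, ytrue):
    """Number of positions where y disagrees with the true output."""
    return sum(a != b for a, b in zip(y, ytrue))

def structured_hinge(score, x, ytrue, labels):
    """Slack for one example:
    max_y [loss(y, ytrue) + score(x, y)] - score(x, ytrue), clipped at 0."""
    best = max(hamming(y, ytrue) + score(x, y)
               for y in product(labels, repeat=len(ytrue)))
    return max(0.0, best - score(x, ytrue))

# Toy scores: reward each correctly labeled position by a fixed amount.
strong = lambda x, y: sum(3.0 for a, b in zip(y, "br") if a == b)
weak   = lambda x, y: sum(0.5 for a, b in zip(y, "br") if a == b)
print(structured_hinge(strong, None, "br", "abr"))  # 0.0 (margin satisfied)
print(structured_hinge(weak,   None, "br", "abr"))  # 1.0 (violated)
```

The min-max formulation on the next slide replaces the inner max over exponentially many y with a (continuous) optimization problem.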

30
Large margin estimation
  • Brute force enumeration
  • Min-max formulation
  • Plug in a linear program for inference

31
Min-max formulation
Structured loss (Hamming)
Key step: replace inference (discrete optimization) with LP inference (continuous optimization)
32
Outline
  • Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
  • Structured large margin estimation
  • Margins and structure
  • Min-max formulation
  • Linear programming inference
  • Certificate formulation

33
y → z Map for Markov Nets
[Figure: each labeling y corresponds to 0/1 indicator variables z: one indicator per node-label assignment and one per edge-label-pair assignment]
34
Markov Net Inference LP
[Figure: node indicators with normalization constraints; edge indicators with agreement constraints]
Has integral solutions z for chains, trees
Can be fractional for untriangulated networks
35
Associative MN Inference LP
associative restriction
  • For K = 2, solutions are always integral (optimal)
  • For K > 2, within factor of 2 of optimal

36
CFG Chart
  • CNF tree = set of two types of parts:
  • Constituents (A, s, e)
  • CF-rules (A → B C, s, m, e)

37
CFG Inference LP
root
inside
outside
Has integral solutions z
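The CFG part decomposition (constituents and rules over spans) is exactly what max-score CKY exploits. A minimal sketch; the toy lexicon, rule scores, and the "It was red"-style example are illustrative, not the talk's grammar:

```python
def cky_best(words, lexicon, rules, root="S"):
    """Max-score CKY over a CNF grammar: scores attach to rule parts
    (A -> B C, s, m, e) and lexical constituents (A, i, i+1)."""
    n = len(words)
    chart = {}  # (A, s, e) -> best score of an A spanning [s, e)
    for i, w in enumerate(words):
        for A, sc in lexicon.get(w, {}).items():
            chart[(A, i, i + 1)] = sc
    for span in range(2, n + 1):
        for s in range(n - span + 1):
            e = s + span
            for m in range(s + 1, e):
                for (A, B, C), sc in rules.items():
                    left, right = chart.get((B, s, m)), chart.get((C, m, e))
                    if left is not None and right is not None:
                        cand = left + right + sc
                        if cand > chart.get((A, s, e), float("-inf")):
                            chart[(A, s, e)] = cand
    return chart.get((root, 0, n))

lexicon = {"it": {"NP": 1.0}, "was": {"V": 1.0}, "red": {"ADJ": 1.0}}
rules = {("VP", "V", "ADJ"): 1.0, ("S", "NP", "VP"): 1.0}
print(cky_best(["it", "was", "red"], lexicon, rules))  # 5.0
```

The LP on this slide is the continuous counterpart of this recursion; its integrality is what lets it stand in for CKY inside the min-max QP.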
38
Matching Inference LP
En vertu des nouvelles propositions, quel est le coût prévu de perception des droits ?
What is the anticipated cost of collecting fees under the new proposal ?
Degree constraints on each word j, k
Has integral solutions z
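The slide's point is that the matching LP has integral optima, so LP inference recovers a true matching. A brute-force sketch of the discrete problem it solves, for intuition (toy score matrices; a real solver would use the Hungarian algorithm or the LP itself):

```python
from itertools import permutations

def max_weight_matching(score):
    """Brute-force max-weight bipartite matching on an n x n score matrix;
    returns perm with perm[j] = column matched to row j. The LP relaxation
    with degree constraints has the same (integral) optimum."""
    n = len(score)
    return list(max(permutations(range(n)),
                    key=lambda p: sum(score[j][p[j]] for j in range(n))))

print(max_weight_matching([[2.0, 0.1], [0.1, 2.0]]))  # [0, 1]
print(max_weight_matching([[0.1, 2.0], [2.0, 0.1]]))  # [1, 0]
```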
39
LP Duality
  • Linear programming duality
  • Variables ? constraints
  • Constraints ? variables
  • Optimal values are the same
  • When both feasible regions are bounded

40
Min-max Formulation
LP duality
41
Min-max formulation summary
  • Formulation produces concise QP for
  • Low-treewidth Markov networks
  • Associative MNs (K = 2)
  • Context free grammars
  • Bipartite matchings
  • Approximate for untriangulated MNs, AMNs with K > 2

Taskar et al 04
42
Unfactored Primal/Dual
QP duality
Exponentially many constraints/variables
43
Factored Primal/Dual
By QP duality
Dual inherits structure from the problem-specific inference LP: its variables correspond to a decomposition of the variables of the flat case
44
The Connection
[Figure: a distribution over candidate words (bcare .2, brore .15, broce .25, brace .4) induces per-position marginals over letters; the factored dual variables correspond to these marginals]
45
Duals and Kernels
  • Kernel trick works
  • Factored dual
  • Local functions (log-potentials) can use kernels

46
Alternatives Perceptron
  • Simple iterative method
  • Unstable for structured output: fewer instances, big updates
  • May not converge if non-separable
  • Noisy
  • Voted / averaged perceptron [Freund & Schapire 99, Collins 02]
  • Regularize / reduce variance by aggregating over iterations
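The averaged variant is a short loop. A sketch under toy assumptions: `feats` returns a sparse feature dict, and `candidates` enumerates outputs (in practice this argmax is Viterbi or another combinatorial solver):

```python
def averaged_perceptron(data, feats, candidates, epochs=5):
    """Structured perceptron in the spirit of Collins 02, with averaging:
    predict with current w, update on mistakes, return the running average."""
    w, wsum, t = {}, {}, 0
    for _ in range(epochs):
        for x, y in data:
            pred = max(candidates(x),
                       key=lambda z: sum(w.get(f, 0.0) * v
                                         for f, v in feats(x, z).items()))
            if pred != y:
                # promote the truth, demote the prediction
                for f, v in feats(x, y).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in feats(x, pred).items():
                    w[f] = w.get(f, 0.0) - v
            # accumulate for the average (the variance-reduction step)
            for f, v in w.items():
                wsum[f] = wsum.get(f, 0.0) + v
            t += 1
    return {f: v / t for f, v in wsum.items()}

feats = lambda x, y: {(x, y): 1.0}          # toy indicator features
cands = lambda x: ["p", "q"]                # toy output space
w_avg = averaged_perceptron([("p", "p"), ("q", "q")], feats, cands)
```

Averaging over iterations is what tames the "fewer instances, big updates" instability noted above.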

47
Alternatives Constraint Generation
  • Add most violated constraint
  • Handles more general loss functions
  • Only a polynomial number of constraints needed
  • Need to re-solve QP many times
  • Worst-case number of constraints larger than factored

Collins 02; Altun et al, 03; Tsochantaridis et al, 04
48
Handwriting Recognition
  • Length: 8 chars
  • Letter: 16x8 pixels
  • 10-fold train/test
  • 5000/50000 letters
  • 600/6000 words
  • Models: multiclass SVMs (Crammer & Singer 01), CRFs, M3 nets

[Chart: average per-character test error; CRFs highest, then multiclass SVMs, M3 nets lowest]
45% error reduction over linear CRFs; 33% error reduction over multiclass SVMs
49
Hypertext Classification
  • WebKB dataset
  • Four CS department websites: 1300 pages / 3500 links
  • Classify each page: faculty, course, student, project, other
  • Train on three universities / test on fourth

[Chart: test error; M3N (relaxed dual) vs. SVMs and RMNs (loopy belief propagation)]
53% error reduction over SVMs; 38% error reduction over RMNs
Taskar et al 02
50
3D Mapping
Data provided by Michael Montemerlo & Sebastian Thrun
Sensors: laser range finder, GPS, IMU
Labels: ground, building, tree, shrub
Training: 30 thousand points; Testing: 3 million points
51-54
(No transcript)
55
Segmentation results
Hand labeled 180K test points
Model: Accuracy
  SVM: 68%
  V-SVM: 73%
  M3N: 93%
56
Fly-through
57
Word Alignment Results
Data: Hansards (Canadian Parliament)
Features induced on ~1 mil unsupervised sentences
Trained on 100 sentences (10,000 edges); tested on 350 sentences (35,000 edges)
Model: Error (weighted combination of precision/recall)
  Local learning + matching: 10.0
  Our approach: 8.5  [Taskar et al 05]
  GIZA/IBM4 [Och & Ney 03]: 6.5
  Local learning + matching: 5.4
  Our approach: 4.9
  Our approach + QAP: 4.5  [Lacoste-Julien, Taskar et al 06]
58
Outline
  • Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
  • Structured large margin estimation
  • Margins and structure
  • Min-max formulation
  • Linear programming inference
  • Certificate formulation

59
Certificate formulation
  • Non-bipartite matchings
  • O(n3) combinatorial algorithm
  • No polynomial-size LP known
  • Spanning trees
  • No polynomial-size LP known
  • Simple certificate of optimality
  • Intuition
  • Verifying optimality easier than optimizing
  • Compact optimality condition of the target output w.r.t. the score
60
Certificate for non-bipartite matching
  • Alternating cycle: every other edge is in the matching
  • Augmenting alternating cycle: score of edges not in matching greater than edges in matching
  • Negate score of edges not in matching
  • Augmenting alternating cycle = negative length alternating cycle
  • Matching is optimal ⟺ no negative alternating cycles

Edmonds 65
61
Certificate for non-bipartite matching
  • Pick any node r as root
  • d(j) = length of shortest alternating path from r to j
  • Triangle inequality: d(j) ≤ d(k) + length(k, j)
  • Theorem: no negative length cycle ⟺ a distance function d exists
  • Can be expressed as linear constraints
  • O(n) distance variables, O(n2) constraints
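The "distance function exists iff no negative cycle" certificate is exactly Bellman-Ford feasibility. A minimal sketch on an ordinary directed graph (the alternating-path bookkeeping of the matching case is omitted):

```python
def has_negative_cycle(n, edges):
    """Bellman-Ford check: a distance function d with d[j] <= d[i] + w for
    every edge (i, j, w) exists iff there is no negative-length cycle; such
    a d is a compact optimality certificate, expressible as linear
    constraints."""
    d = [0.0] * n                       # all-zero start (virtual source)
    for _ in range(n - 1):
        for i, j, w in edges:
            if d[i] + w < d[j]:
                d[j] = d[i] + w
    # one more pass: any further improvement exposes a negative cycle
    return any(d[i] + w < d[j] for i, j, w in edges)

cycle = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0)]
print(has_negative_cycle(3, cycle))                                # False
print(has_negative_cycle(3, [(i, j, -1.0) for i, j, _ in cycle]))  # True
```

In the learning formulation, the triangle inequalities on d become the O(n2) linear constraints of the slide, imposed alongside the QP.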

62
Certificate formulation
  • Formulation produces compact QP for
  • Spanning trees
  • Non-bipartite matchings
  • Any problem with compact optimality condition

Taskar et al. 05
63
Disulfide Bonding Prediction
  • Data: Swiss Prot 39
  • 450 sequences (4-10 cysteines)
  • Features: windows around C-C pair; physical/chemical properties

[Figure: sequence with cysteines highlighted]
Model: Accuracy (proteins with all bonds correct)
  Local learning + matching: 41%
  Recursive Neural Net [Baldi et al 04]: 52%
  Our approach (certificate): 55%
Taskar et al 05
64
Formulation summary
  • Brute force enumeration
  • Min-max formulation: plug in a convex program for inference
  • Certificate formulation: directly guarantee optimality of the target output

65
Omissions
  • Kernels
  • Non-parametric models
  • Structured generalization bounds
  • Bounds on Hamming loss
  • Scalable algorithms (no QP solver needed)
  • Structured SMO (works for chains, trees) [Taskar 04]
  • Structured ExpGrad (works for chains, trees) [Bartlett et al 04]
  • Structured ExtraGrad (works for matchings, AMNs) [Taskar et al 06]

66
Open questions
  • Statistical consistency
  • Hinge loss not consistent for non-binary output
  • See Tewari & Bartlett 05, McAllester 07
  • Learning with approximate inference
  • Does constant factor approximate inference guarantee anything about learning?
  • No; see Kulesza & Pereira 07
  • Perhaps other assumptions needed
  • Discriminative structure learning
  • Using sparsifying priors

67
Conclusion
  • Two general techniques for structured
    large-margin estimation
  • Exact, compact, convex formulations
  • Allow efficient use of kernels
  • Tractable when other estimation methods are not
  • Efficient learning algorithms
  • Empirical success on many domains

68
References
  • Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. ICML 03.
  • M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP 02.
  • K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR 01.
  • J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML 01.
  • More papers at http://www.cis.upenn.edu/taskar

69
(No Transcript)
70
Modeling First Order Effects
Monotonicity, local inversion, local fertility
  • QAP is NP-complete
  • Sentences (≤30 words, ≤1k vars) → few seconds (Mosek)
  • Learning: use LP relaxation
  • Testing: using LP, 83.5% of sentences and 99.85% of edges integral

71
Segmentation Model → Min-Cut
Local evidence; spatial smoothness
  • Computing the MAP is hard in general, but
  • if edge potentials are attractive → min-cut algorithm
  • Multiway-cut for multiclass case → use LP relaxation

Greig et al 89, Boykov et al 99, Kolmogorov & Zabih 02, Taskar et al 04
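The binary case reduces MAP segmentation to a standard max-flow/min-cut computation. A compact Edmonds-Karp sketch on a tiny hand-made capacity matrix (the construction of capacities from evidence and smoothness potentials is not shown):

```python
from collections import deque

def source_side_min_cut(n, cap, s, t):
    """Edmonds-Karp max-flow, then return the source side of a minimum cut.
    With attractive (associative) edge potentials, binary MAP segmentation
    is exactly such a cut: source side = one label, sink side = the other."""
    flow = [[0.0] * n for _ in range(n)]

    def bfs_parents():
        par = [-1] * n
        par[s] = s
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(n):
                if par[v] == -1 and cap[u][v] - flow[u][v] > 1e-12:
                    par[v] = u
                    q.append(v)
        return par

    while True:
        par = bfs_parents()
        if par[t] == -1:          # no augmenting path: flow is maximum
            break
        b, v = float("inf"), t    # bottleneck residual along the path
        while v != s:
            b = min(b, cap[par[v]][v] - flow[par[v]][v])
            v = par[v]
        v = t
        while v != s:
            flow[par[v]][v] += b
            flow[v][par[v]] -= b
            v = par[v]
    par = bfs_parents()
    return sorted(v for v in range(n) if par[v] != -1)

cap = [[0, 3, 2, 0],
       [0, 0, 1, 2],
       [0, 0, 0, 3],
       [0, 0, 0, 0]]
print(source_side_min_cut(4, cap, 0, 3))  # [0]
```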
72
Scalable Algorithms
  • Batch and online
  • Linear in the size of the data
  • Iterate until convergence
  • For each example in the training sample
  • Run inference using current parameters (varies by
    method)
  • Online Update parameters using computed example
    values
  • Batch Update parameters using computed sample
    values

Structured SMO (Taskar et al 03; Taskar 04)
Structured Exponentiated Gradient (Bartlett et al 04)
Structured Extragradient (Taskar et al 05)
73
Experimental Setup
  • Standard Penn treebank split (2-21/22/23)
  • Generative baselines
  • Klein & Manning 03 and Collins 99
  • Discriminative
  • Basic: max-margin version of KM 03
  • Lexical, Lexical + Aux
  • Lexical features (on constituent parts only)
  • predicted tags t(s-1), t(s), t(e), t(e+1)
  • words x(s-1), x(s), x(e), x(e+1)
  • Auxiliary features
  • Flat classifier using same features
  • Prediction of KM 03 on each span

74
Results for sentences ≤ 40 words
Model: LP / LR / F1
  Generative: 86.37 / 85.27 / 85.82
  Lexical + Aux: 87.56 / 86.85 / 87.20
  Collins 99: 85.33 / 85.94 / 85.73
Trained only on sentences ≤ 20 words
Taskar et al 04
75
Example
  • "The Egyptian president said he would visit Libya today to resume the talks."
  • Generative model: "Libya today" is a base NP
  • Lexical model: "today" is a one-word constituent