Title: Structured Prediction, Dual Extragradient and Bregman Projections
1. Structured Prediction, Dual Extragradient and Bregman Projections
- Ben Taskar
- Simon Lacoste-Julien
- Michael Jordan
- UC Berkeley
2. Handwriting Recognition
[Figure: input x is an image of the handwritten word "brace"; output y is the letter sequence]
Sequential structure
3. Natural Language Parsing
[Figure: input x is the sentence "The screen was a sea of red"; output y is its parse tree]
Recursive structure
4. Object Segmentation
[Figure: input x is a scanned scene; output y is its segmentation into object classes]
Spatial structure
5. Bilingual Word Alignment
[Figure: input x is the sentence pair "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits ?" / "What is the anticipated cost of collecting fees under the new proposal?"; output y is the word alignment between them]
Combinatorial structure
6. Linear Structured Models
- Scoring function s_w(x, y) over the space of feasible outputs Y(x)
- Assumption:
- the score is a linear combination of features: s_w(x, y) = w^T f(x, y)
7. Chain Markov Net (aka CRF)
[Figure: chain-structured Markov network over the label sequence y with observations x] [Lafferty et al. 01]
9. CFG Parsing
(NP → DT NN) (PP → IN NP) (NN → sea)
10. Bilingual Word Alignment
[Figure: candidate alignment edges between English word j and French word k for the sentence pair of slide 5]
- Features for each candidate edge (j, k):
- association
- position
- orthography
11. Ising Models and Min Cuts
[Figure: binary (0/1) labeling of scan points]
- Point features: spin image
- Edge features: length, angle
- Find max y via min-cut if edge scores are non-negative
- Restrict edge features and weights accordingly
12. Linear Structured Models
- Scoring function s_w(x, y) over the space of feasible outputs Y(x)
- Assumptions:
- the score is a linear combination of features: s_w(x, y) = w^T f(x, y)
- the features decompose as a sum of part scores: f(x, y) = sum_p f_p(x, y_p) (see the sketch below)
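To make the two assumptions concrete, here is a minimal Python sketch (the helper name part_features and the chain example in the comment are illustrative, not from the slides) of a score that is linear in w and decomposes over parts:

```python
import numpy as np

def score(w, part_features):
    """Linear structured score: the sum over parts of  w . f_p(x, y_p).

    part_features: list of per-part feature vectors, already evaluated at
    (x, y_p) -- e.g. one vector per node and per edge of a chain.
    """
    return sum(np.dot(w, f_p) for f_p in part_features)
```

For a length-n chain, the parts are the n nodes and the n-1 edges, so evaluating the score is linear in n once the part features are extracted.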
13. Learning w
- Training examples: pairs (x_i, y_i)
- Probabilistic approach (maximum conditional likelihood):
- Computing the partition function Z_w(x) can be #P-complete
- Tractable models, but intractable estimation
- Large margin approach:
- Exact and efficient when prediction is tractable
14. Alignment Example: Loss
- We want a loss l(y, y') that measures how far a predicted alignment y' is from the true alignment y
- Structured loss options:
- Precision, Recall, F1, Hamming (Hamming sketched below)
[Figure: candidate alignments of "What is the" / "Quel est le" with Hamming losses 0, 1, 2, 2]
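A minimal sketch of the Hamming-style loss for alignments, assuming an alignment is represented as a set of (j, k) index pairs (a representation chosen here for illustration):

```python
def hamming_loss(true_edges, pred_edges):
    """Number of alignment edges the two alignments disagree on: edges the
    prediction misses plus edges it adds incorrectly (symmetric difference)."""
    return len(set(true_edges) ^ set(pred_edges))

# Example: if the true alignment of "What is the" / "Quel est le" is
# {(0, 0), (1, 1), (2, 2)}, predicting {(0, 0), (1, 1), (2, 0)} gives loss 2.
print(hamming_loss({(0, 0), (1, 1), (2, 2)}, {(0, 0), (1, 1), (2, 0)}))
```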
15. Alignment Example: Constraints
[Figure: the true alignment of "What is the" / "Quel est le" compared against alternative candidate alignments]
- Exponential number of constraints
16. Large Margin Estimation
- Approximation: constraint generation / sampling [Collins 02; Altun et al. 03; Tsochantaridis et al. 04; Joachims 05]
- Alternative approach:
- Hinge loss
- Min-max formulation (reconstructed below)
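A hedged reconstruction of the hinge-loss / min-max estimation problem, in the notation of the accompanying paper (m training pairs, feature map f, loss ℓ, norm-ball regularization):

```latex
% Large-margin estimation: the true output should outscore every alternative
% by a margin that grows with the loss; equivalently, minimize the structured
% hinge loss over a norm ball:
\min_{\|\mathbf{w}\|_2 \le \gamma} \;
  \sum_{i=1}^{m} \Big[
    \max_{y' \in \mathcal{Y}(x_i)}
      \big( \mathbf{w}^\top \mathbf{f}(x_i, y') + \ell(y_i, y') \big)
    \;-\; \mathbf{w}^\top \mathbf{f}(x_i, y_i) \Big]
% The inner maximization (loss-augmented inference) is the "min-max" part:
% it is rewritten as an LP on slide 18.
```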
17. Alternatives: Constraint Generation
- Add the most violated constraint at each iteration (sketched below)
- Handles more general loss functions
- Only a polynomial number of constraints needed
- But: need to re-solve the QP many times
- Worst-case number of constraints is larger than in the factored formulation
[Collins 02; Altun et al. 03; Tsochantaridis et al. 04]
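A minimal sketch of the "most violated constraint" step, done here by brute-force enumeration over an explicit candidate set (the cited methods use efficient loss-augmented inference instead; feat, loss, and candidates are user-supplied placeholders):

```python
import numpy as np

def most_violated_constraint(w, x_i, y_i, candidates, feat, loss):
    """Loss-augmented inference by brute force: return the output y' that
    maximizes  w . feat(x_i, y') + loss(y_i, y'), i.e. the margin constraint
    currently violated the most, to be added to the working set of the QP."""
    return max(candidates,
               key=lambda y: float(np.dot(w, feat(x_i, y))) + loss(y_i, y))
```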
18. Min-max Formulation
- Structured loss (Hamming)
- Key step: replace inference over discrete outputs y (discrete optimization) with LP inference over relaxed variables z (continuous optimization), reconstructed below
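A hedged reconstruction of the key step, using the y → z indicator map of the next slide (F_i collects the part features of example i, c_i is the linear encoding of the Hamming loss):

```latex
% Loss-augmented inference as a linear program over a relaxation Z_i whose
% vertices are exactly the valid discrete outputs:
\max_{y' \in \mathcal{Y}(x_i)}
  \Big[ \mathbf{w}^\top \mathbf{f}(x_i, y') + \ell(y_i, y') \Big]
  \;=\;
\max_{\mathbf{z}_i \in \mathcal{Z}_i}
  \big( F_i^\top \mathbf{w} + \mathbf{c}_i \big)^\top \mathbf{z}_i
  \;+\; \text{const}
% so the discrete optimization over y' becomes a continuous optimization over z_i.
```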
19. y → z Map for Markov Nets
20. Markov Net Inference LP
- Constraints: normalization (node marginals sum to one) and agreement (edge marginals consistent with node marginals); see the reconstruction below
- Has integral solutions z for chains and trees
- Can be fractional for untriangulated networks
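A hedged reconstruction of the LP behind the "normalization" and "agreement" labels, with node indicators z_j(α), edge indicators z_jk(α, β), and node/edge scores θ:

```latex
\max_{\mathbf{z} \ge 0} \;
  \sum_{j,\alpha} \theta_j(\alpha)\, z_j(\alpha)
  + \sum_{(j,k),\,\alpha,\beta} \theta_{jk}(\alpha,\beta)\, z_{jk}(\alpha,\beta)
\quad \text{s.t.} \quad
  \sum_{\alpha} z_j(\alpha) = 1 \;\;\forall j
  \;\;\text{(normalization)}, \qquad
  \sum_{\beta} z_{jk}(\alpha,\beta) = z_j(\alpha) \;\;\forall (j,k),\,\alpha
  \;\;\text{(agreement)}
```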
21. Matching Inference LP
[Figure: alignment variables z_jk between English word j and French word k for the sentence pair of slide 5, with degree constraints on each word]
- Constraints: degree (each word participates in at most one alignment edge); see the reconstruction below
- Has integral solutions z (the degree-constraint matrix is totally unimodular)
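A hedged reconstruction of the matching LP, with z_jk indicating that English word j is aligned to French word k and s_jk the learned edge score:

```latex
\max_{0 \le \mathbf{z} \le 1} \;
  \sum_{j,k} s_{jk}\, z_{jk}
\quad \text{s.t.} \quad
  \sum_{k} z_{jk} \le 1 \;\;\forall j, \qquad
  \sum_{j} z_{jk} \le 1 \;\;\forall k
  \quad \text{(degree constraints)}
% Total unimodularity of the degree constraints makes every vertex of this
% polytope, and hence every LP optimum, integral.
```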
22. Saddle-point Problem
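The slide's formula did not survive extraction; a hedged reconstruction from the pieces above (norm-ball constraint on w, one relaxation polytope Z_i per training example):

```latex
\min_{\|\mathbf{w}\|_2 \le \gamma} \;
\max_{\mathbf{z}_i \in \mathcal{Z}_i,\; i=1,\dots,m} \;
  \sum_{i=1}^{m} \Big[
    \mathbf{w}^\top F_i \mathbf{z}_i + \mathbf{c}_i^\top \mathbf{z}_i
    - \mathbf{w}^\top \mathbf{f}(x_i, y_i) \Big]
% A bilinear saddle-point problem over the product of a norm ball and polytopes:
% this is the object the (dual) extragradient methods below solve.
```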
23. First Try: Projected Gradient
- Gradient steps on w and z with Euclidean projection onto the feasible sets
- Can oscillate! No convergence guarantee (toy illustration below)
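A toy illustration (not from the slides) of the oscillation: plain simultaneous gradient descent-ascent on the bilinear saddle point min_u max_v uv spirals away from the solution (0, 0), and projecting onto a large enough feasible set does not fix this:

```python
import numpy as np

# Plain gradient descent-ascent on  min_u max_v  u*v  (saddle point at (0, 0)).
u, v, eta = 1.0, 1.0, 0.1
for _ in range(100):
    gu, gv = v, u                       # d(uv)/du = v,  d(uv)/dv = u
    u, v = u - eta * gu, v + eta * gv   # descent in u, ascent in v
print(np.hypot(u, v))                   # distance from (0, 0) has grown (> sqrt(2))
```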
24. Dual Extragradient for Structured Prediction
- State: cumulative gradient
- Start (initialize the cumulative gradient to zero)
- Prediction
- Correction
- Cumulative gradient update
- Output: averaged iterates
- O(1/ε) convergence rate [Nesterov 03]
(sketched below)
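A minimal sketch of the dual extragradient scheme on the same toy bilinear problem (the box projection stands in for the actual projections onto the norm ball for w and the polytopes Z_i; step size and iteration count are illustrative):

```python
import numpy as np

def project_box(u, lo=-1.0, hi=1.0):
    """Euclidean projection onto a box (stand-in for the real feasible sets)."""
    return np.clip(u, lo, hi)

def dual_extragradient(F, u_hat, eta, num_iters, project):
    """Dual extragradient sketch: prediction, correction, cumulative gradient
    update; the output is the average of the correction iterates."""
    s = np.zeros_like(u_hat)              # state: cumulative gradient
    avg = np.zeros_like(u_hat)
    for t in range(num_iters):
        v = project(u_hat + eta * s)      # prediction
        u = project(v - eta * F(v))       # correction
        s = s - F(u)                      # cumulative gradient update
        avg += (u - avg) / (t + 1)        # running average of the iterates
    return avg                            # output

# Toy bilinear saddle point min_u max_v u*v on the box [-1, 1]^2; the gradient
# operator pairs the descent direction in u with the ascent direction in v.
F = lambda uv: np.array([uv[1], -uv[0]])
print(dual_extragradient(F, np.array([1.0, 1.0]), eta=0.5,
                         num_iters=1000, project=project_box))
# -> close to the saddle point (0, 0), unlike plain gradient descent-ascent
```

In contrast to the previous sketch, the averaged iterates approach the saddle point, matching the O(1/ε) guarantee.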
25. Projection for Bipartite Matchings: Min Cost Flow
[Figure: flow network with source s and sink t, one node per word on each side]
- All capacities 1
- Min-cost quadratic flow computes the projection
- O(N^3) complexity for fixed precision (N = number of nodes)
- Well-studied problem, free code available [Guerreiro & Tseng 02]
- See the paper for the flow reduction for min-cuts
26. Non-Euclidean Dual Extragradient
- d(·, ·): a Bregman divergence replaces the squared Euclidean distance in the projections
- Same scheme: prediction, correction, cumulative dual gradient update, output
27. Bregman Divergence Updates
- Squared distance → Euclidean projections
- KL divergence → multiplicative update + normalization (sketched below)
- For sequences and trees, the updates can be computed via forward-backward and inside-outside
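A minimal sketch of the KL case for a single distribution (a node marginal over three labels in this hypothetical usage); for sequences and trees the slides note the same update is carried out implicitly via forward-backward / inside-outside:

```python
import numpy as np

def kl_prox_step(p, grad, eta):
    """KL (entropic) analogue of a projected gradient step on the simplex:
    a multiplicative update followed by normalization."""
    q = p * np.exp(-eta * grad)   # multiplicative update
    return q / q.sum()            # normalization

# Hypothetical usage: one step on a uniform 3-class marginal.
p = np.full(3, 1.0 / 3.0)
print(kl_prox_step(p, grad=np.array([0.5, -0.2, 0.0]), eta=1.0))
```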
28. Memory-efficient Version
- Required memory:
- for the parameter side (w): proportional to the number of parameters
- for the example side (z): proportional to the number and size of the examples
- Luckily, we don't need to maintain the z iterates explicitly
- Sufficient to maintain state with memory proportional to the number of parameters
- A similar trick works in Exponentiated Gradient [Bartlett et al. 04], which requires decomposable models
29. Experiments
- Word Alignment
- Training data
- 5000 sentences
- 555K edges
- Object Segmentation
- Training data
- 5 scenes
- 37K nodes
- 88K edges
- Compare to averaged perceptron
30. (Averaged) Perceptron
- Perceptron for structured output [Collins 02]
- For each example (x_i, y_i):
- Predict: y_hat = argmax_y w·f(x_i, y)
- Update: w ← w + f(x_i, y_i) − f(x_i, y_hat)
- Output averaged parameters (sketched below)
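A minimal sketch of this baseline; feat (the feature map) and predict (the argmax inference routine) are placeholders the slides leave implicit, and dim is the number of parameters:

```python
import numpy as np

def averaged_structured_perceptron(data, feat, predict, dim, num_epochs=10):
    """Structured perceptron [Collins 02] with parameter averaging.

    data:    list of (x, y) training pairs
    feat:    feat(x, y) -> feature vector of length dim
    predict: predict(w, x) -> argmax_y  w . feat(x, y)   (inference routine)
    """
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    n_steps = 0
    for _ in range(num_epochs):
        for x, y in data:
            y_hat = predict(w, x)                 # Predict
            if y_hat != y:
                w += feat(x, y) - feat(x, y_hat)  # Update on mistakes
            w_sum += w
            n_steps += 1
    return w_sum / n_steps                        # Output averaged parameters
```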
31. Matchings
32. Min-cuts
33. Conclusion
- General technique for structured large-margin estimation
- Exact, compact, convex formulations
- Allow efficient use of kernels
- Tractable when other estimation methods are not
- Memory-efficient learning algorithms
- See http://www.cs.berkeley.edu/~taskar for the paper