Learning Structured Prediction Models: A Large Margin Approach - PowerPoint PPT Presentation

1
Learning Structured Prediction Models: A Large Margin Approach
  • Ben Taskar
  • U.C. Berkeley
  • Vassil Chatalbashev, Michael Collins, Carlos Guestrin, Dan Klein
  • Daphne Koller, Chris Manning

2
Don't worry, Howard. The big questions are multiple choice.
3
Handwriting recognition
x: image of a handwritten word
y: "brace"
Sequential structure
4
Object segmentation
x
y
Spatial structure
5
Natural language parsing
x: "The screen was a sea of red"
y: parse tree
Recursive structure
6
Disulfide connectivity prediction
x
y
RSCCPCYWGGCPWGQNCYPEGCSGPKV
Combinatorial structure
7
Outline
  • Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
  • Geometric View
  • Structured model polytopes
  • Linear programming inference
  • Structured large margin estimation
  • Min-max formulation
  • Application: 3D object segmentation
  • Certificate formulation
  • Application: disulfide connectivity prediction

8
Structured models
scoring function
space of feasible outputs
  • Mild assumption: the scoring function is a linear combination of features, $w^\top f(x, y)$
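The structured-model setup on this slide (a scoring function maximized over a space of feasible outputs) can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the toy feature function, the weights, and the names `score`, `predict`, `features`, and `feasible_outputs` are all assumptions made for the example, and the argmax is done by brute-force enumeration.

```python
from itertools import product


def score(w, f, x, y):
    """Linear score w^T f(x, y)."""
    return sum(wi * fi for wi, fi in zip(w, f(x, y)))


def predict(w, f, x, feasible_outputs):
    """Return the highest-scoring feasible output by brute-force enumeration."""
    return max(feasible_outputs(x), key=lambda y: score(w, f, x, y))


# Hypothetical toy problem: give each position of x a binary label.
def feasible_outputs(x):
    return list(product([0, 1], repeat=len(x)))


def features(x, y):
    agree = sum(1 for xi, yi in zip(x, y) if xi == yi)     # node-style feature
    switches = sum(1 for a, b in zip(y, y[1:]) if a != b)  # edge-style feature
    return [agree, switches]


w = [1.0, -0.6]   # assumed weights: reward agreement, penalize label switches
x = (1, 0, 1, 1)
best = predict(w, features, x, feasible_outputs)
```

With these assumed weights the smooth labeling wins over copying x exactly, because the penalty for two label switches outweighs one extra agreement. Enumeration is exponential in the length of x, which is why the later slides turn to LP inference.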

9
Chain Markov Net (aka CRF)
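$P(y \mid x) \propto \prod_i \phi(x_i, y_i) \prod_i \phi(y_i, y_{i+1})$
$\phi(x_i, y_i) = \exp\{\sum_\alpha w_\alpha f_\alpha(x_i, y_i)\}$
$\phi(y_i, y_{i+1}) = \exp\{\sum_\beta w_\beta f_\beta(y_i, y_{i+1})\}$
$f_\beta(y, y') = I(y = \text{z}, y' = \text{a})$
$f_\alpha(x, y) = I(x_p = 1, y = \text{z})$
Lafferty et al. 01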
10
Chain Markov Net (aka CRF)
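$P(y \mid x) \propto \prod_i \phi(x_i, y_i) \prod_i \phi(y_i, y_{i+1}) = \exp\{w^\top f(x, y)\}$
$w = [\dots, w_\alpha, \dots, w_\beta, \dots]$, $f(x, y) = [\dots, f_\alpha(x, y), \dots, f_\beta(x, y), \dots]$
$\prod_i \phi(x_i, y_i) = \exp\{\sum_\alpha w_\alpha \sum_i f_\alpha(x_i, y_i)\}$
$\prod_i \phi(y_i, y_{i+1}) = \exp\{\sum_\beta w_\beta \sum_i f_\beta(y_i, y_{i+1})\}$
$f_\beta(x, y) = \sum_i I(y_i = \text{z}, y_{i+1} = \text{a})$
$f_\alpha(x, y) = \sum_i I(x_{i,p} = 1, y_i = \text{z})$
Lafferty et al. 01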
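Because the chain score is a sum of node and edge log-potentials, the best labeling can be found by dynamic programming rather than enumeration. The sketch below assumes the log-potentials are already given as tables (`node[i][k]` for position i and label k, `edge[a][b]` for consecutive labels); in the slide's notation these would be the sums of $w_\alpha f_\alpha$ and $w_\beta f_\beta$. The table values and function names are made up for illustration.

```python
from itertools import product


def chain_score(node, edge, y):
    """Sum of log-potentials: node[i][y_i] plus edge[y_i][y_{i+1}]."""
    total = sum(node[i][yi] for i, yi in enumerate(y))
    total += sum(edge[a][b] for a, b in zip(y, y[1:]))
    return total


def viterbi(node, edge):
    """Highest-scoring label sequence of a chain, by dynamic programming."""
    n, k = len(node), len(node[0])
    best = [list(node[0])]   # best[i][b]: best score of a prefix ending in label b
    back = []                # back[i-1][b]: best previous label for label b at i
    for i in range(1, n):
        row, ptr = [], []
        for b in range(k):
            a = max(range(k), key=lambda a: best[-1][a] + edge[a][b])
            row.append(best[-1][a] + edge[a][b] + node[i][b])
            ptr.append(a)
        best.append(row)
        back.append(ptr)
    last = max(range(k), key=lambda b: best[-1][b])
    y = [last]
    for ptr in reversed(back):
        y.append(ptr[y[-1]])
    return tuple(reversed(y)), best[-1][last]


# Hypothetical log-potentials for a 3-position chain with 2 labels.
node = [[1.0, 0.0], [0.0, 0.5], [0.2, 0.0]]
edge = [[0.5, -0.5], [-0.5, 0.5]]
best_y, best_score = viterbi(node, edge)
```

The cost is linear in the chain length and quadratic in the number of labels, compared with exponential cost for enumerating all labelings.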
11
Associative Markov Nets
Point features: spin-images, point height
Edge features: length of edge, edge orientation
Associative restriction
Potentials: $\phi_i$ on node $y_i$, $\phi_{ij}$ on the edge between $y_i$ and $y_j$
12
PCFG
(NP → DT NN) (PP → IN NP) (NN → sea)
13
Disulfide bonds: non-bipartite matching
RSCCPCYWGGCPWGQNCYPEGCSGPKV, with the six cysteines numbered 1 to 6
Figure: matching over nodes 1 to 6
Fariselli & Casadio 01, Baldi et al. 04
14
Scoring function
RSCCPCYWGGCPWGQNCYPEGCSGPKV, with the six cysteines numbered 1 to 6 (shown twice in the figure)
String features: residues, physical properties
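To make the matching view concrete, here is a small sketch that scores a matching as the sum of its edge scores and finds the best perfect matching over six cysteines by enumeration. The edge scores are invented for the example (in a learned model they would come from $w^\top f$ on the string features), and brute force is only workable for tiny inputs; the function names are assumptions.

```python
def perfect_matchings(nodes):
    """Yield every perfect matching of an even-length list of nodes."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching


def matching_score(matching, edge_score):
    """Score of a matching: the sum of its edge scores."""
    return sum(edge_score[edge] for edge in matching)


def best_matching(nodes, edge_score):
    """Brute-force argmax over all perfect matchings (tiny inputs only)."""
    return max(perfect_matchings(nodes), key=lambda m: matching_score(m, edge_score))


# Hypothetical edge scores for six cysteines, keyed by (i, j) with i < j.
cysteines = [1, 2, 3, 4, 5, 6]
edge_score = {(i, j): 0.0 for i in cysteines for j in cysteines if i < j}
edge_score[(1, 6)] = 2.0
edge_score[(2, 5)] = 1.5
edge_score[(3, 4)] = 1.0
best = best_matching(cysteines, edge_score)
```

Six nodes already give 15 perfect matchings, and the count grows as a double factorial, which is the combinatorial structure the talk refers to.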
15
Structured models
scoring function
space of feasible outputs
  • Mild assumption
  • Another mild assumption
  • → linear programming

16
MAP inference → linear program
  • LP inference for:
  • Chains
  • Trees
  • Associative Markov Nets
  • Bipartite Matchings

17
Markov Net Inference LP
Has integral solutions y for chains and trees.
Gives an upper bound for general networks.
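The LP itself is not in the transcript. For orientation, the standard LP relaxation of MAP inference in a pairwise Markov network is usually written as below. The symbols $\mu$ (node and edge indicator variables) and $\theta$ (log-potentials) are notation assumed here, not taken from the slide.

```latex
\begin{aligned}
\max_{\mu}\quad & \sum_{i}\sum_{k} \mu_i(k)\,\theta_i(k)
  + \sum_{(i,j)}\sum_{k,l} \mu_{ij}(k,l)\,\theta_{ij}(k,l) \\
\text{s.t.}\quad & \sum_{k} \mu_i(k) = 1, \qquad
  \sum_{l} \mu_{ij}(k,l) = \mu_i(k), \qquad
  \sum_{k} \mu_{ij}(k,l) = \mu_j(l), \qquad
  \mu \ge 0 .
\end{aligned}
```

Every integral labeling is feasible for this LP, so its optimum can only be at least as large as the true MAP score, which is the upper bound mentioned on the slide.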
18
Associative MN Inference LP
associative restriction
  • For K = 2, solutions are always integral (optimal)
  • For K > 2, within factor of 2 of optimal
  • Constraint matrix A is linear in number of nodes
    and edges, regardless of the tree-width

19
Other Inference LPs
  • Context-free parsing
  • Dynamic programs
  • Bipartite matching
  • Network flow
  • Many other combinatorial problems

20
Outline
  • Structured prediction models
  • Sequences (CRFs)
  • Trees (CFGs)
  • Associative Markov networks (Special MRFs)
  • Matchings
  • Geometric View
  • Structured model polytopes
  • Linear programming inference
  • Structured large margin estimation
  • Min-max formulation
  • Application: 3D object segmentation
  • Certificate formulation
  • Application: disulfide connectivity prediction

21
Learning w
  • Training example (x, y)
  • Probabilistic approach
  • Maximize conditional likelihood
  • Problem: computing $Z_w(x)$ is #P-complete

22
Geometric Example
Training data
Goal
Learn w s.t. $w^\top f(x, y)$ points the right way
23
OCR Example
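  • We want
  • $\arg\max_{\text{word}} w^\top f(x, \text{word}) = \text{"brace"}$
  • Equivalently
  • $w^\top f(x, \text{"brace"}) > w^\top f(x, \text{"aaaaa"})$
  • $w^\top f(x, \text{"brace"}) > w^\top f(x, \text{"aaaab"})$
  • …
  • $w^\top f(x, \text{"brace"}) > w^\top f(x, \text{"zzzzz"})$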

a lot!
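A quick count shows how many inequalities "a lot" means for this one training word. The sketch assumes a 26-letter lowercase alphabet and that every five-letter string is a feasible output; neither assumption is stated in the transcript.

```python
def num_competing_words(alphabet_size, word_length):
    """Number of wrong labelings, i.e. margin constraints for one training word."""
    return alphabet_size ** word_length - 1


# Assumed: 26 lowercase letters, and the five-letter word from the slide.
constraints_for_brace = num_competing_words(26, len("brace"))
```

The count is exponential in word length, which is why the brute-force formulation does not scale and the later slides look for compact alternatives.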
24
Large margin estimation
  • Given training example (x, y), we want
  • Maximize margin
  • Mistake weighted margin

# of mistakes in y
Taskar et al. 03
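The formulas for this slide are missing from the transcript. A hedged reconstruction in the notation of the earlier slides, following the usual statement of max-margin estimation, is given below. The symbols $\gamma$ (margin) and $\ell(y, y')$ (number of positions where $y'$ differs from $y$) are assumed names.

```latex
\begin{aligned}
\text{We want:}\quad
  & \arg\max_{y'} w^\top f(x, y') = y \\
\text{Maximize margin } \gamma:\quad
  & w^\top f(x, y) \ \ge\ w^\top f(x, y') + \gamma
    \quad \forall y' \ne y, \qquad \|w\| = 1 \\
\text{Mistake-weighted margin:}\quad
  & w^\top f(x, y) \ \ge\ w^\top f(x, y') + \gamma\,\ell(y, y')
    \quad \forall y'
\end{aligned}
```

Scaling the margin by $\ell$ asks for a larger gap against outputs with more mistakes, and the constraint for $y' = y$ holds trivially because $\ell(y, y) = 0$.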
25
Large margin estimation
  • Brute force enumeration
  • Min-max formulation
  • Plug-in linear program for inference

26
Min-max formulation
Assume linear loss (Hamming)
Inference
LP inference
27
Min-max formulation
By strong LP duality
Minimize jointly over w, z
28
Min-max formulation
  • Formulation produces compact QP for
  • Low-treewidth Markov networks
  • Associative Markov networks
  • Context-free grammars
  • Bipartite matchings
  • Any problem with compact LP inference

29
3D Mapping
Data provided by Michael Montemerlo & Sebastian Thrun
Laser Range Finder
GPS
IMU
Label: ground, building, tree, shrub
Training: 30 thousand points
Testing: 3 million points
34
Segmentation results
Hand labeled 180K test points
35
Fly-through
36
Certificate formulation
  • Non-bipartite matchings
  • $O(n^3)$ combinatorial algorithm
  • No polynomial-size LP known
  • Spanning trees
  • No polynomial-size LP known
  • Simple certificate of optimality
  • Intuition
  • Verifying optimality easier than optimizing
  • Compact optimality condition of y wrt.

37
Certificate for non-bipartite matching
  • Alternating cycle
  • Every other edge is in matching
  • Augmenting alternating cycle
  • Score of edges not in matching greater than score of edges in matching
  • Negate score of edges not in matching
  • Augmenting alternating cycle = negative-length alternating cycle
  • Matching is optimal ⇔ no negative alternating cycles

Edmonds 65
38
Certificate for non-bipartite matching
  • Pick any node r as root
  • $d_j$: length of shortest alternating path from r to j
  • Triangle inequality
  • Theorem
  • No negative-length cycle ⇔ distance function d exists
  • Can be expressed as linear constraints
  • $O(n)$ distance variables, $O(n^2)$ constraints
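The link between "no negative-length cycle" and "a distance function exists" can be illustrated on an ordinary directed graph with Bellman-Ford. This is a simplified stand-in, not the alternating-path construction used for matchings in the talk; the function name and the example graphs are assumptions.

```python
def distances_or_none(n, edges, root=0):
    """Bellman-Ford from root over directed edges (i, j, length).

    Returns distances d with d[j] <= d[i] + length for every edge, or None
    if a negative-length cycle is reachable from root.
    """
    inf = float("inf")
    d = [inf] * n
    d[root] = 0.0
    for _ in range(n - 1):
        for i, j, length in edges:
            if d[i] + length < d[j]:
                d[j] = d[i] + length
    # If an edge can still be relaxed, the triangle inequalities are infeasible.
    for i, j, length in edges:
        if d[i] + length < d[j]:
            return None
    return d


# Hypothetical graphs: the first has only non-negative cycles, the second does not.
edges_ok = [(0, 1, 2.0), (1, 2, -1.0), (2, 0, 0.5), (0, 2, 3.0)]
edges_bad = [(0, 1, 1.0), (1, 2, -3.0), (2, 0, 1.0)]
d_ok = distances_or_none(3, edges_ok)
d_bad = distances_or_none(3, edges_bad)
```

The returned distances are exactly a feasible solution of the linear constraints $d_j \le d_i + \text{length}_{ij}$, one variable per node and one constraint per edge, which mirrors the variable and constraint counts on the slide.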

39
Certificate formulation
  • Formulation produces compact QP for
  • Spanning trees
  • Non-bipartite matchings
  • Any problem with compact optimality condition

40
Disulfide connectivity prediction
  • Dataset
  • Swiss-Prot protein database, release 39
  • Fariselli & Casadio 01, Baldi et al. 04
  • 446 sequences (4-50 cysteines)
  • Features: window profiles (size 9) around each pair
  • Two modes: bonded state known/unknown
  • Comparison
  • SVM-trained weights (ignoring constraints during learning)
  • DAG Recursive Neural Network (Baldi et al. 04)
  • Our model
  • Max-margin matching using RBF kernel
  • Training: off-the-shelf LP/QP solver CPLEX (1 hour)

41
Known bonded state
Precision / Accuracy
4-fold cross-validation
42
Unknown bonded state
Precision / Recall / Accuracy
4-fold cross-validation
43
Formulation summary
  • Brute force enumeration
  • Min-max formulation
  • Plug-in convex program for inference
  • Certificate formulation
  • Directly guarantee optimality of y

44
Estimation
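  • Columns: Local, $P(z) = \prod_i P(z_i \mid z_{\pi_i})$; Global, $P(z) = \frac{1}{Z} \prod_c \phi(z_c)$
  • Generative, $P(x, y)$: HMMs and PCFGs (local), MRFs (global)
  • Discriminative, $P(y \mid x)$: MEMMs (local), CRFs (global)
  • Margin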
45
Omissions
  • Formulation details
  • Kernels
  • Multiple examples
  • Slacks for non-separable case
  • Approximate learning of intractable models
  • General MRFs
  • Learning to cluster
  • Structured generalization bounds
  • Scalable algorithms (no QP solver needed)
  • Structured SMO (works for chains, trees)
  • Structured EG (works for chains, trees)
  • Structured PG (works for chains, matchings, AMNs, …)

46
Current Work
  • Learning approximate energy functions
  • Protein folding
  • Physical processes
  • Semi-supervised learning
  • Hidden variables
  • Mixing labeled and unlabeled data
  • Discriminative structure learning
  • Using sparsifying priors

47
Conclusion
  • Two general techniques for structured
    large-margin estimation
  • Exact, compact, convex formulations
  • Allow efficient use of kernels
  • Tractable when other estimation methods are not
  • Structured generalization bounds
  • Efficient learning algorithms
  • Empirical success on many domains
  • Papers at http://www.cs.berkeley.edu/~taskar

49
Duals and Kernels
  • Kernel trick works!
  • Scoring functions (log-potentials) can use
    kernels
  • Same for certificate formulation

50
Handwriting Recognition
  • Length: 8 chars
  • Letter: 16x8 pixels
  • 10-fold train/test
  • 5000/50000 letters
  • 600/6000 words
  • Models
  • Multiclass-SVMs
  • CRFs
  • $M^3$ nets

Figure: test error (average per-character), axis 0 to 30, for CRFs, MCSVMs, and $M^3$ nets; arrow label: better
45% error reduction over linear CRFs, 33% error reduction over multiclass SVMs
Crammer & Singer 01
51
Hypertext Classification
  • WebKB dataset
  • Four CS department websites: 1300 pages/3500 links
  • Classify each page: faculty, course, student, project, other
  • Train on three universities/test on fourth

Figure labels: better, relaxed dual, loopy belief propagation
53% error reduction over SVMs, 38% error reduction over RMNs
Taskar et al. 02
52
Projected Gradient
Figure: iterates $y_k$, $y_{k+1}$, $y_{k+2}$, $y_{k+3}$, $y_{k+4}$
  • Projecting y onto constraints
  • → min-cost convex flow for Markov nets, matchings
  • Convergence same as steepest gradient
  • Conjugate gradient also possible (two-metric proj.)

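The projected-gradient idea (take a gradient step, then project back onto the constraint set) can be sketched on a much simpler constraint set than the one in the talk. The example below swaps in the probability simplex, whose Euclidean projection has a short sort-based algorithm, in place of the min-cost convex flow projection named on the slide. The objective, step size, and function names are assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p_i >= 0, sum(p) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumulative += uk
        t = (cumulative - 1.0) / k
        if uk - t > 0:          # holds for a prefix of the sorted values
            theta = t
    return [max(x - theta, 0.0) for x in v]


def projected_gradient(grad, y0, step=0.1, iters=200):
    """Minimize a function with gradient `grad` over the simplex."""
    y = list(y0)
    for _ in range(iters):
        g = grad(y)
        y = project_to_simplex([yi - step * gi for yi, gi in zip(y, g)])
    return y


# Hypothetical objective: 0.5 * ||y - target||^2, whose gradient is y - target.
target = [0.8, 0.6, -0.2]
grad = lambda y: [yi - ti for yi, ti in zip(y, target)]
y_star = projected_gradient(grad, [1 / 3, 1 / 3, 1 / 3])
```

For this objective the minimizer over the simplex is the projection of `target` itself, so the iterates should approach that point; the structured case differs only in how the projection step is computed.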
53
Min-Cost Flow for Markov Chains
Figure: flow network from source s to sink t through label nodes a to z at each chain position
  • Capacities C
  • Edge costs
  • For edges from node s and to node t, cost = 0

54
Min-Cost Flow for Bipartite Matchings
Figure: flow network from source s to sink t
  • Capacities C
  • Edge costs
  • For edges from node s and to node t, cost = 0

55
CFG Chart
  • CNF tree = set of two types of parts
  • Constituents (A, s, e)
  • CF-rules (A → B C, s, m, e)
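The decomposition of a CNF tree into constituent and rule parts can be sketched as a short recursive function. The tuple encoding of trees, the half-open span convention (a constituent covers words s up to but not including e), and the function name are assumptions made for this example.

```python
def tree_parts(tree, start=0):
    """Return (constituents, rules, end) for a CNF tree starting at word `start`.

    A tree is (label, word) for a preterminal or (label, left, right) otherwise.
    Constituents are (A, s, e); rules are ((A, B, C), s, m, e).
    """
    label = tree[0]
    if len(tree) == 2:                      # preterminal over a single word
        end = start + 1
        return {(label, start, end)}, set(), end
    _, left, right = tree
    left_cons, left_rules, mid = tree_parts(left, start)
    right_cons, right_rules, end = tree_parts(right, mid)
    constituents = left_cons | right_cons | {(label, start, end)}
    rules = left_rules | right_rules | {((label, left[0], right[0]), start, mid, end)}
    return constituents, rules, end


# Hypothetical two-word noun phrase using the rule NP → DT NN.
tree = ("NP", ("DT", "the"), ("NN", "screen"))
constituents, rules, end = tree_parts(tree)
```

A tree's feature vector can then be formed by summing features over these parts, which is what lets the parsing problem fit the same linear scoring framework as chains and matchings.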

56
CFG Inference LP
inside
outside
Has integral solutions y for trees