Title: Divergence measures and message passing
1. Divergence measures and message passing
- Tom Minka
- Microsoft Research
- Cambridge, UK
with thanks to the Machine Learning and Perception Group
2. Message-Passing Algorithms
- Mean-field (MF): Peterson, Anderson 87
- Loopy belief propagation (BP): Frey, MacKay 97
- Expectation propagation (EP): Minka 01
- Tree-reweighted message passing (TRW): Wainwright, Jaakkola, Willsky 03
- Fractional belief propagation (FBP): Wiegerinck, Heskes 02
- Power EP (PEP): Minka 04
3. Outline
- Example of message passing
- Interpreting message passing
- Divergence measures
- Message passing from a divergence measure
- Big picture
5-7. Estimation Problem
[Figure: a factor graph with variables x, y, z connected by factors a, b, c, d, e, f; the later slides zoom in on the variables x, y, z.]
8. Estimation Problem
- Queries: marginals, normalizing constant, argmax
- Want to answer these quickly
9. Belief Propagation
[Figure: messages passed along the edges of the graph over x, y, z.]
10. Belief Propagation
[Figure: final beliefs at x, y, z after message passing.]
11. Belief Propagation
[Figure: exact vs. BP marginals for x, y, z.]
- Normalizing constant: 0.45 (Exact), 0.44 (BP)
- Argmax: (0,0,0) (Exact), (0,0,0) (BP)
13. Message Passing = Distributed Optimization
- Messages represent a simpler distribution q(x) that approximates p(x)
  - a distributed representation
- Message passing = optimizing q to fit p
  - q stands in for p when answering queries
- Parameters:
  - what type of distribution to construct (approximating family)
  - what cost to minimize (divergence measure)
14. How to make a message-passing algorithm
- Pick an approximating family
  - fully factorized, Gaussian, etc.
- Pick a divergence measure
- Construct an optimizer for that measure
  - usually fixed-point iteration
- Distribute the optimization across factors
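As a concrete instance of this recipe, here is a minimal sketch in Python: fully factorized family, KL(q||p) divergence, fixed-point iteration. The two-spin model and its parameters are illustrative, not from the talk; the updates are the standard mean-field equations for an Ising pair.

```python
import math

# Mean-field for a toy two-spin model
#   p(x1, x2) proportional to exp(w*x1*x2 + b1*x1 + b2*x2),  x_i in {-1, +1},
# with fully factorized q(x1, x2) = q1(x1) q2(x2) minimizing KL(q||p).
# The fixed-point updates are m_i = tanh(b_i + w * m_j).
def mean_field(w, b1, b2, iters=100):
    m1 = m2 = 0.0           # mean parameters of the factorized q
    for _ in range(iters):  # coordinate-wise fixed-point iteration
        m1 = math.tanh(b1 + w * m2)
        m2 = math.tanh(b2 + w * m1)
    return m1, m2

# Exact means for comparison, by enumerating the 4 joint states.
def exact_means(w, b1, b2):
    states = [(s1, s2) for s1 in (-1, 1) for s2 in (-1, 1)]
    ps = [math.exp(w * s1 * s2 + b1 * s1 + b2 * s2) for s1, s2 in states]
    z = sum(ps)
    m1 = sum(p * s1 for p, (s1, _) in zip(ps, states)) / z
    m2 = sum(p * s2 for p, (_, s2) in zip(ps, states)) / z
    return m1, m2
```

When the coupling w is zero the factorized family contains p itself, so mean-field recovers the exact means; with w nonzero it returns a self-consistent approximation.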
16. Divergence measures
Let p, q be unnormalized distributions.
- Kullback-Leibler (KL) divergence:
  KL(p || q) = ∫ p(x) log(p(x)/q(x)) dx + ∫ (q(x) − p(x)) dx
- Alpha-divergence (α is any real number):
  D_α(p || q) = ∫ [α p(x) + (1−α) q(x) − p(x)^α q(x)^(1−α)] dx / (α(1−α))
- Asymmetric, convex
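To make the definitions concrete, a small numerical check (the values in p and q are arbitrary illustrative positives, not from the talk). The α-divergence is defined for all real α except 0 and 1, where it has KL limits: α → 1 recovers KL(p||q) and α → 0 recovers KL(q||p).

```python
import math

# Divergences between unnormalized discrete measures p, q (lists of positives).
def kl(p, q):
    # KL for unnormalized measures includes the correction term sum(q - p).
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

def alpha_div(p, q, a):
    # D_alpha(p||q), defined for any real a except 0 and 1 (the KL limits).
    return sum(a * pi + (1 - a) * qi - pi ** a * qi ** (1 - a)
               for pi, qi in zip(p, q)) / (a * (1 - a))

p = [0.5, 1.2, 0.3]
q = [0.6, 0.9, 0.5]
# alpha near 1 approaches KL(p||q); alpha near 0 approaches KL(q||p).
```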
17. Examples of alpha-divergence
- α → 0 gives KL(q || p); α → 1 gives KL(p || q)
- α = 0.5 gives twice the squared Hellinger distance, 2 ∫ (√p(x) − √q(x))² dx
18-22. Minimum alpha-divergence
[Figure sequence: q is Gaussian, minimizing D_α(p || q) to a bimodal p, shown for increasing α (α = −1, 0, 0.5, 1, ∞): for small α, q locks onto a single mode; as α grows, q stretches to cover the whole distribution.]
23. Properties of alpha-divergence
- α ≤ 0 seeks the mode with largest mass (not the tallest)
  - zero-forcing: p(x) = 0 forces q(x) = 0
  - underestimates the support of p
- α ≥ 1 stretches to cover everything
  - inclusive: p(x) > 0 forces q(x) > 0
  - overestimates the support of p
Frey, Patrascu, Jaakkola, Moran 00
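The zero-forcing vs. inclusive behavior can be checked numerically. The sketch below (my own illustrative setup, not from the talk) fits a Gaussian q to a bimodal p by brute-force grid search over (mean, std): a small α locks onto the heavier mode at −2, while α near 1 spreads out to match the overall mean (−0.8) and variance.

```python
import math

def gauss(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Bimodal target: the left mode (at -2) carries 70% of the mass.
DX = 0.08
xs = [i * DX - 8.0 for i in range(201)]
p = [0.7 * gauss(x, -2, 0.5) + 0.3 * gauss(x, 2, 0.5) for x in xs]

def alpha_div(pv, qv, a):
    # Discretized alpha-divergence between (unnormalized) densities on the grid.
    return sum(a * pi + (1 - a) * qi - pi ** a * qi ** (1 - a)
               for pi, qi in zip(pv, qv)) * DX / (a * (1 - a))

def best_gaussian(a):
    # Brute-force search over Gaussian (mean, std) for the D_alpha minimizer.
    best = None
    for mi in range(-20, 21):
        for si in range(2, 16):
            m, s = mi * 0.2, si * 0.2
            q = [gauss(x, m, s) for x in xs]
            d = alpha_div(p, q, a)
            if best is None or d < best[0]:
                best = (d, m, s)
    return best[1], best[2]

# Small alpha is zero-forcing: q grabs the single heaviest mode.
# Alpha near 1 is inclusive: q covers both modes, matching mean and variance.
```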
24. Structure of alpha space
[Figure: the real α axis. α ≤ 0 is zero-forcing, with MF at α → 0; α ≥ 1 is inclusive (zero-avoiding), with BP and EP at α = 1 and TRW at α > 1; FBP and PEP cover arbitrary α.]
25. Other properties
- If q is an exact minimum of the alpha-divergence:
  - the normalizing constant of q matches that of p
  - if α = 1: a Gaussian q matches the mean and variance of p
  - if α = 1: a fully factorized q matches the marginals of p
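The marginal-matching property for α = 1 is easy to verify directly: for a fully factorized q, the KL(p||q) minimizer is exactly the product of the marginals of p. The sketch below checks this on an arbitrary 3-bit distribution (the numbers are made up for illustration).

```python
import itertools
import math

# An arbitrary normalized distribution over 3 binary variables.
p = dict(zip(itertools.product([0, 1], repeat=3),
             [0.10, 0.05, 0.20, 0.05, 0.15, 0.10, 0.05, 0.30]))

def marginal(i):
    m = [0.0, 0.0]
    for state, v in p.items():
        m[state[i]] += v
    return m

def kl_to_factorized(q_marg):
    # KL(p || q) where q(x) = prod_i q_marg[i][x_i].
    return sum(v * math.log(v / math.prod(q_marg[i][s[i]] for i in range(3)))
               for s, v in p.items())

exact = [marginal(i) for i in range(3)]
# Perturbing any marginal away from the exact one can only increase KL(p||q).
```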
26. Two-node example
[Figure: two connected variables x and y.]
- q is fully factorized and minimizes the α-divergence to p
- q has correct marginals only for α = 1 (BP)
27. Two-node example
Bimodal distribution:
- α = 1 (BP). Good: marginals, mass. Bad: zeros, peak heights.
- α ≤ 0. Good: zeros, one peak. Bad: marginals, mass.
28. Two-node example
Bimodal distribution:
- α ≤ 0. Good: peak heights, zeros. Bad: marginals.
29. Lessons
- Neither method is inherently superior: it depends on what you care about
- A factorized approximation does not imply matching marginals (only for α = 1)
- Adding y to the problem can change the estimated marginal for x (though the true marginal is unchanged)
31. Distributed divergence minimization
32. Distributed divergence minimization
- Write p as a product of factors: p(x) = ∏_a f_a(x)
- Approximate the factors one by one: f_a(x) → f̃_a(x)
- Multiply to get the approximation: q(x) = ∏_a f̃_a(x)
33. Global divergence to local divergence
- Global divergence: min_q D(p(x) || q(x))
- Local divergence: min_{f̃_a} D(f_a(x) q^{\a}(x) || f̃_a(x) q^{\a}(x)), where q^{\a}(x) = ∏_{b≠a} f̃_b(x) is the product of the other approximate factors
34. Message passing
- Messages are passed between factors
- Messages are factor approximations f̃_a(x)
- Factor a receives the product of the other factors' messages, q^{\a}(x)
- Minimize the local divergence to get a new f̃_a(x)
- Send f̃_a to the other factors
- Repeat until convergence
- This recipe produces all six algorithms
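The BP example from earlier in the talk (slides 9-11) is one instance of this loop. A minimal sum-product implementation on a binary cycle over three variables can be sketched as follows; the potentials here are made up for illustration, not the ones from the talk. On this small loopy model the BP marginals land close to, but not exactly on, the exact ones.

```python
import itertools
import math

# Loopy BP (sum-product) on a binary cycle x - y - z, compared with
# exact marginals obtained by enumeration.
phi = [[1.0, 0.6], [1.0, 0.8], [1.0, 0.7]]          # singleton potentials
edges = [(0, 1), (1, 2), (0, 2)]
psi = {e: [[1.2, 0.6], [0.6, 1.2]] for e in edges}  # attractive pairwise

def pair(i, j, xi, xj):
    return psi[(i, j)][xi][xj] if (i, j) in psi else psi[(j, i)][xj][xi]

# m[(i, j)]: message from variable i to variable j (uniform start).
m = {(i, j): [1.0, 1.0] for e in edges for (i, j) in (e, e[::-1])}

for _ in range(100):                 # iterate messages to a fixed point
    new = {}
    for (i, j) in m:
        out = [sum(phi[i][xi] * pair(i, j, xi, xj)
                   * math.prod(m[(k, i)][xi]
                               for k in range(3) if k not in (i, j))
                   for xi in range(2))
               for xj in range(2)]
        z = sum(out)
        new[(i, j)] = [v / z for v in out]
    m = new

def bp_marginal(i):
    b = [phi[i][x] * math.prod(m[(k, i)][x] for k in range(3) if k != i)
         for x in range(2)]
    z = sum(b)
    return [v / z for v in b]

def exact_marginal(i):
    tot = [0.0, 0.0]
    for xs in itertools.product(range(2), repeat=3):
        w = math.prod(phi[k][xs[k]] for k in range(3))
        w *= math.prod(pair(a, b, xs[a], xs[b]) for a, b in edges)
        tot[xs[i]] += w
    z = sum(tot)
    return [v / z for v in tot]
```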
35. Global divergence vs. local divergence
[Figure: the α axis. At α → 0 (MF), local = global: no loss from message passing. Elsewhere, local ≠ global.]
- In general, local ≠ global
  - but the results are similar
- BP doesn't minimize the global KL, but comes close
36. Experiment
- Which message-passing algorithm is best at minimizing the global D_α(p || q)?
- Procedure:
  - Run FBP with various local values α_L
  - Compute the global divergence for various α_G
  - Find the best α_L (best algorithm) for each α_G
37. Results
- Average over 20 graphs, with random singleton and pairwise potentials exp(w_ij x_i x_j)
- Mixed potentials (w ~ U(−1,1)):
  - best α_L = α_G (local should match global)
  - FBP with the same α is best at minimizing D_α
  - BP is best at minimizing KL
39. Hierarchy of algorithms
- Power EP: exp family, D_α(p || q)
  - FBP: fully factorized, D_α(p || q)
    - BP: fully factorized, KL(p || q)
    - TRW: fully factorized, D_α(p || q), α > 1
  - Structured MF: exp family, KL(q || p)
    - MF: fully factorized, KL(q || p)
40. Matrix of algorithms
(rows: approximating family; columns: divergence measure)

                     KL(q || p)       KL(p || q)   D_α(p || q)
  fully factorized   MF               BP           FBP; TRW (α > 1)
  exp family         Structured MF                 Power EP

Other families? (mixtures) Other divergences?
41. Other Message-Passing Algorithms
- Do they correspond to divergence measures?
- Generalized belief propagation: Yedidia, Freeman, Weiss 00
- Iterated conditional modes: Besag 86
- Max-product belief revision
- TRW max-product: Wainwright, Jaakkola, Willsky 02
- Laplace propagation: Smola, Vishwanathan, Eskin 03
- Penniless propagation: Cano, Moral, Salmerón 00
- Bound propagation: Leisink, Kappen 03
42. Future work
- Understand existing message passing algorithms
- Understand local vs. global divergence
- New message passing algorithms
- Specialized divergence measures
- Richer approximating families
- Other ways to minimize divergence
43. Local vs. global results (2)
- Positive potentials (w ~ U(0,1)):
  - α_G < 0: best α_L = α_G
  - α_G ≥ 0: best α_L > α_G
- BP is not best at minimizing KL
  - it suffers from overcounting
- Larger α_L → smoother approximations → weaker messages → less overcounting
- Best α_L = (1 + γ_1) α_G + γ_2
  - (γ_1, γ_2) constant for a given set of factors