Divergence measures and message passing

1
Divergence measures and message passing
  • Tom Minka
  • Microsoft Research
  • Cambridge, UK

with thanks to the Machine Learning and
Perception Group
2
Message-Passing Algorithms
Mean-field (MF): Peterson, Anderson 87
Loopy belief propagation (BP): Frey, MacKay 97
Expectation propagation (EP): Minka 01
Tree-reweighted message passing (TRW): Wainwright, Jaakkola, Willsky 03
Fractional belief propagation (FBP): Wiegerinck, Heskes 02
Power EP (PEP): Minka 04
3
Outline
  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture

4
Outline
  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture

5
Estimation Problem
[Factor graph: variables x, y, z; factors a, b, c, d, e, f]
6
Estimation Problem
[Factor graph: variables x, y, z; factors a, b, c, d, e, f]
7
Estimation Problem
[Graph over variables x, y, z]
8
Estimation Problem
Queries: marginals, the normalizing constant, the argmax.
We want to compute these quickly.
9
Belief Propagation
[Messages passed on the graph over x, y, z]
10
Belief Propagation
[Final messages on the graph over x, y, z]
11
Belief Propagation
Marginals: exact vs. BP (shown on slide)
Normalizing constant: 0.45 (exact), 0.44 (BP)
Argmax: (0,0,0) (exact), (0,0,0) (BP)
12
Outline
  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture

13
Message Passing = Distributed Optimization
  • Messages represent a simpler distribution q(x)
    that approximates p(x)
  • A distributed representation
  • Message passing = optimizing q to fit p
  • q stands in for p when answering queries
  • Parameters
  • What type of distribution to construct
    (approximating family)
  • What cost to minimize (divergence measure)

14
How to make a message-passing algorithm
  • Pick an approximating family
  • fully-factorized, Gaussian, etc.
  • Pick a divergence measure
  • Construct an optimizer for that measure
  • usually fixed-point iteration
  • Distribute the optimization across factors

15
Outline
  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture

16
Let p, q be unnormalized distributions.

Kullback-Leibler (KL) divergence:
$$\mathrm{KL}(p \,\|\, q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx + \int \big(q(x) - p(x)\big)\,dx$$

Alpha-divergence (α is any real number):
$$D_\alpha(p \,\|\, q) = \frac{1}{\alpha(1-\alpha)} \int \Big(\alpha\, p(x) + (1-\alpha)\, q(x) - p(x)^\alpha\, q(x)^{1-\alpha}\Big)\,dx$$

Both are asymmetric and convex.
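These divergences are easy to evaluate numerically on a grid. A minimal sketch of the unnormalized forms above (the function name, grid, and test densities are illustrative choices, not from the talk):

```python
import math

def alpha_divergence(p, q, alpha, dx):
    """Numerical D_alpha(p || q) for unnormalized densities tabulated at spacing dx.

    Uses the integrand alpha*p + (1-alpha)*q - p^alpha * q^(1-alpha), divided
    by alpha*(1-alpha). The limits alpha -> 1 and alpha -> 0 recover
    KL(p || q) and KL(q || p), handled explicitly below.
    """
    if alpha in (0.0, 1.0):
        a, b = (p, q) if alpha == 1.0 else (q, p)
        # KL for unnormalized densities: integral of a*log(a/b) + b - a
        return sum(ai * math.log(ai / bi) + bi - ai for ai, bi in zip(a, b)) * dx
    s = sum(alpha * pi + (1 - alpha) * qi - pi**alpha * qi**(1 - alpha)
            for pi, qi in zip(p, q)) * dx
    return s / (alpha * (1 - alpha))

# Two Gaussians tabulated on a grid
gauss = lambda x, m, v: math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
dx = 0.01
xs = [i * dx - 10 for i in range(2001)]
p = [gauss(x, 0.0, 1.0) for x in xs]
q = [gauss(x, 0.5, 1.5) for x in xs]

print(alpha_divergence(p, q, 0.5, dx))  # symmetric in (p, q) at alpha = 0.5
print(alpha_divergence(p, q, 1.0, dx))  # KL(p || q)
```

Note that the α = 0.5 case is the only symmetric member of the family; all others treat p and q differently.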
17
Examples of alpha-divergence
α → 0 recovers KL(q || p); α → 1 recovers KL(p || q); α = 0.5 gives a squared-Hellinger-type distance; α = 2 is related to the χ² divergence.
18
Minimum alpha-divergence
q is Gaussian, minimizes D_α(p || q)
α = −∞
19
Minimum alpha-divergence
q is Gaussian, minimizes D_α(p || q)
α = 0
20
Minimum alpha-divergence
q is Gaussian, minimizes D_α(p || q)
α = 0.5
21
Minimum alpha-divergence
q is Gaussian, minimizes D_α(p || q)
α = 1
22
Minimum alpha-divergence
q is Gaussian, minimizes D_α(p || q)
α = ∞
23
Properties of alpha-divergence
  • α ≤ 0 seeks the mode with the largest mass (not the tallest)
  • zero-forcing: p(x) = 0 forces q(x) = 0
  • underestimates the support of p
  • α ≥ 1 stretches to cover everything
  • inclusive: p(x) > 0 forces q(x) > 0
  • overestimates the support of p

Frey, Patrascu, Jaakkola, Moran 00
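The zero-forcing vs. inclusive behaviour is easy to reproduce: fit a Gaussian to a bimodal p by brute-force search over (mean, variance), once for the α = 0 limit and once for α = 1. The target mixture and search grids below are illustrative choices, not taken from the slides:

```python
import math

gauss = lambda x, m, v: math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

dx = 0.05
xs = [i * dx - 5 for i in range(201)]
# Bimodal target: equal mixture of two narrow Gaussians at -2 and +2
p = [0.5 * gauss(x, -2, 0.3) + 0.5 * gauss(x, 2, 0.3) for x in xs]

def kl(a, b):
    """KL(a || b) for (approximately normalized) densities tabulated on xs."""
    return sum(ai * math.log(max(ai, 1e-300) / max(bi, 1e-300))
               for ai, bi in zip(a, b)) * dx

def best_gaussian(alpha):
    """Brute-force Gaussian fit minimizing the alpha = 0 or alpha = 1 limit
    of D_alpha(p || q), i.e. KL(q || p) or KL(p || q)."""
    best, best_mv = float("inf"), None
    for mi in range(-30, 31):
        for vi in range(1, 41):
            m, v = mi * 0.1, vi * 0.15
            q = [gauss(x, m, v) for x in xs]
            d = kl(q, p) if alpha == 0 else kl(p, q)
            if d < best:
                best, best_mv = d, (m, v)
    return best_mv

m0, v0 = best_gaussian(0)   # zero-forcing: locks onto a single mode
m1, v1 = best_gaussian(1)   # inclusive: stretches to cover both modes
print((m0, v0), (m1, v1))
```

The α = 0 fit sits tightly on one mode (mean near ±2, small variance), while the α = 1 fit is the moment-matched Gaussian (mean 0, variance large enough to cover both modes).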
24
Structure of alpha space
[α axis: zero-forcing for α ≤ 0, inclusive (zero-avoiding) for α ≥ 1; MF sits at α = 0, BP and EP at α = 1, TRW at α > 1, FBP and PEP at arbitrary α]
25
Other properties
  • If q is an exact minimum of the alpha-divergence, the normalizing constants match: ∫ q(x) dx = ∫ p(x)^α q(x)^{1−α} dx
  • If α = 1: a Gaussian q matches the mean and variance of p
  • If α = 1: a fully factorized q matches the marginals of p

26
Two-node example
[Two connected nodes: x, y]
  • q is fully factorized and minimizes the α-divergence to p
  • q has the correct marginals only for α = 1 (BP)
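Both behaviours can be checked on a tiny discrete example. The joint and the initialization below are illustrative choices (not from the talk): at α = 1 the best factorized fit is the product of the true marginals, while the MF fit (α = 0, coordinate descent on KL(q || p)) breaks the symmetry and overcommits to one mode:

```python
import math

# Toy two-node joint over (x, y) in {0,1}^2 with two strongly coupled modes
p = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

# alpha = 1 (BP-style fit): the optimal factorized q is the product of the
# true marginals, so its marginals are exact by construction
px = [sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)]
py = [sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)]

# alpha = 0 (MF): coordinate descent on KL(q || p); the asymmetric qy
# initialization breaks the tie between the two modes
qx, qy = [0.5, 0.5], [0.7, 0.3]
for _ in range(100):
    # q_x(x) proportional to exp( sum_y q_y(y) * log p(x, y) ), then q_y likewise
    gx = [math.exp(sum(qy[y] * math.log(p[(x, y)]) for y in (0, 1))) for x in (0, 1)]
    qx = [g / sum(gx) for g in gx]
    gy = [math.exp(sum(qx[x] * math.log(p[(x, y)]) for x in (0, 1))) for y in (0, 1)]
    qy = [g / sum(gy) for g in gy]

print(px, py)  # exact marginals: uniform [0.5, 0.5]
print(qx, qy)  # MF overcommits to the (0, 0) mode
```

This also illustrates the lesson below: the MF estimate for x depends on what q believes about y, even though the true marginal of x is exactly uniform.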

27
Two-node example
Bimodal distribution

α = 1 (BP):  Good: marginals, mass.  Bad: zeros, peak heights.
α = 0 (MF):  Good: zeros, one peak.  Bad: marginals, mass.
α = 0.5:     intermediate between the two.

28
Two-node example
Bimodal distribution
α = ∞:  Good: peak heights.  Bad: zeros, marginals.
29
Lessons
  • Neither method is inherently superior; it depends on what you care about
  • A factorized approximation does not imply matching marginals (that holds only for α = 1)
  • Adding y to the problem can change the estimated marginal for x (even though the true marginal is unchanged)

30
Outline
  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture

31
Distributed divergence minimization
32
Distributed divergence minimization
  • Write p as a product of factors: p(x) = ∏_a f_a(x)
  • Approximate each factor f_a by f̃_a, one at a time
  • Multiply the approximations to get q(x) = ∏_a f̃_a(x)

33
Global divergence to local divergence
  • Global divergence: minimize D(p || q) over the whole approximation q
  • Local divergence: each factor approximation f̃_a minimizes a divergence in which only factor a appears exactly
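In Minka's formulation, writing p(x) = ∏_a f_a(x) and q(x) = ∏_a f̃_a(x), the two objectives can be written as:

```latex
\text{global:}\quad \min_{q}\; D\!\Big(\prod_a f_a(x) \;\Big\|\; \prod_a \tilde f_a(x)\Big)

\text{local:}\quad \min_{\tilde f_a}\; D\!\Big(f_a(x)\prod_{b\neq a}\tilde f_b(x) \;\Big\|\; \tilde f_a(x)\prod_{b\neq a}\tilde f_b(x)\Big)
```

In the local problem only factor a is exact on the left-hand side; the remaining factors appear already approximated on both sides, which is what makes each minimization tractable and lets the work be distributed across factors.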

34
Message passing
  • Messages are passed between factors
  • Messages are factor approximations f̃_a
  • Factor a receives the product of the other factors' messages
  • Minimize the local divergence to get a new f̃_a
  • Send f̃_a to the other factors
  • Repeat until convergence
  • This recipe produces all six of the algorithms above
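The loop above can be sketched numerically for the α = 1 case, where minimizing the local KL amounts to moment matching (an EP-style update). The two factors, the grid, and all names below are illustrative choices, not from the talk:

```python
import math

# Grid for numerical integration
dx = 0.01
xs = [i * dx - 8 for i in range(1601)]

# p(x) = f1(x) * f2(x): a Gaussian prior factor times a non-Gaussian sigmoid factor
f1 = [math.exp(-x * x / 2) for x in xs]
f2 = [1 / (1 + math.exp(-4 * x)) for x in xs]
factors = [f1, f2]

def moments(w):
    """Mean and variance of an unnormalized density tabulated on xs."""
    Z = sum(w) * dx
    m = sum(wi * x for wi, x in zip(w, xs)) * dx / Z
    v = sum(wi * (x - m) ** 2 for wi, x in zip(w, xs)) * dx / Z
    return m, v

gauss = lambda x, m, v: math.exp(-(x - m) ** 2 / (2 * v))

# One Gaussian message per factor, stored as (mean, variance); start near-flat
msg = [(0.0, 1e6), (0.0, 1e6)]
for _ in range(20):
    for a in (0, 1):
        mo, vo = msg[1 - a]                      # receive the other factor's message
        tilted = [fa * gauss(x, mo, vo)          # exact factor times that context
                  for fa, x in zip(factors[a], xs)]
        m, v = moments(tilted)                   # KL(p||q) projection = moment matching
        prec = 1 / v - 1 / vo                    # divide out the received message
        pm = m / v - mo / vo                     # (precision / precision-mean form)
        msg[a] = (pm / prec, 1 / prec)           # send the updated approximation

# Combine the messages into q and compare with the exact moments of p
(m1, v1), (m2, v2) = msg
qprec = 1 / v1 + 1 / v2
qm, qv = (m1 / v1 + m2 / v2) / qprec, 1 / qprec
em, ev = moments([a * b for a, b in zip(f1, f2)])
print((qm, qv), (em, ev))
```

With only one non-Gaussian factor the converged q matches the exact mean and variance of p, which is the best a Gaussian can do under KL(p || q).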

35
Global divergence vs. local divergence
[α axis: MF sits at α = 0, where local = global and message passing loses nothing; elsewhere local ≠ global]
  • In general, local ≠ global
  • but the results are similar
  • BP doesn't minimize the global KL, but it comes close

36
Experiment
  • Which message-passing algorithm is best at minimizing the global D_α(p || q)?
  • Procedure:
  • Run FBP with various values of α_L
  • Compute the global divergence for various α_G
  • Find the best α_L (the best algorithm) for each α_G

37
Results
  • Average over 20 graphs with random singleton and pairwise potentials exp(w_ij x_i x_j)
  • Mixed potentials (w ~ U(−1,1)):
  • best α_L ≈ α_G (local should match global)
  • FBP with the same α is best at minimizing D_α
  • BP is best at minimizing KL

38
Outline
  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture

39
Hierarchy of algorithms
  • Power EP: exp family, D_α(p || q)
  • EP: exp family, KL(p || q)
  • FBP: fully factorized, D_α(p || q)
  • Structured MF: exp family, KL(q || p)
  • BP: fully factorized, KL(p || q)
  • MF: fully factorized, KL(q || p)
  • TRW: fully factorized, D_α(p || q) with α > 1

40
Matrix of algorithms
divergence measure \ approximation family:

                   fully factorized         exp family
KL(q || p)         MF                       Structured MF
KL(p || q)         BP                       EP
D_α(p || q)        FBP, TRW (α > 1)         Power EP

Other families? (mixtures)
Other divergences?
41
Other Message Passing Algorithms
  • Do they correspond to divergence measures?
  • Generalized belief propagation (Yedidia, Freeman, Weiss 00)
  • Iterated conditional modes (Besag 86)
  • Max-product belief revision
  • TRW max-product (Wainwright, Jaakkola, Willsky 02)
  • Laplace propagation (Smola, Vishwanathan, Eskin 03)
  • Penniless propagation (Cano, Moral, Salmerón 00)
  • Bound propagation (Leisink, Kappen 03)

42
Future work
  • Understand existing message passing algorithms
  • Understand local vs. global divergence
  • New message passing algorithms
  • Specialized divergence measures
  • Richer approximating families
  • Other ways to minimize divergence

43
Local vs. global results (2)
  • Positive potentials (w = 1):
  • α_G < 0: best α_L ≈ α_G
  • α_G ≥ 0: best α_L > α_G
  • BP is not best at minimizing KL; it suffers from overcounting
  • Larger α_L → smoother approximations → weaker messages → less overcounting
  • Best α_L ≈ (1 + γ1) α_G + γ2
  • (γ1, γ2) constant for a given set of factors