Regularization of energy-based representations
- Minimize the total energy: E(u) = \lambda E_p(u) + (1-\lambda) E_d(u,d)
- E_p(u): stabilizing function, a smoothness constraint (a small numerical check of both stabilizers follows this slide)
  - Membrane stabilizer: E_p(u) = 0.5 \sum_{i,j} [(u_{i,j+1} - u_{i,j})^2 + (u_{i+1,j} - u_{i,j})^2]
  - Thin plate stabilizer: E_p(u) = 0.5 \sum_{i,j} [(u_{i,j+1} + u_{i,j-1} - 2u_{i,j})^2 + (u_{i+1,j} + u_{i-1,j} - 2u_{i,j})^2 + 2(u_{i+1,j+1} - u_{i,j+1} - u_{i+1,j} + u_{i,j})^2]
  - Linear combinations of the two are also possible
- E_d(u,d): energy function, measures the compatibility between the solution u and the observed data d
  - E_d(u,d) = 0.5 \sum_{i,j} c_{i,j} (d_{i,j} - u_{i,j})^2
  - c_{i,j} is the inverse of the variance of measurement d_{i,j}
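As a concrete check, here is a minimal numpy sketch (ours, not from the slides) that evaluates both stabilizers on a small grid, simply truncating the sums at the grid border. It shows the qualitative difference: the membrane penalizes any tilt, while the thin plate energy is zero for a plane.

```python
import numpy as np

def membrane_energy(u):
    """E_p(u) = 0.5 * sum of squared first differences along both axes."""
    dj = np.diff(u, axis=1)   # u[i, j+1] - u[i, j]
    di = np.diff(u, axis=0)   # u[i+1, j] - u[i, j]
    return 0.5 * (np.sum(dj**2) + np.sum(di**2))

def thin_plate_energy(u):
    """E_p(u) = 0.5 * (squared second differences + 2 * squared cross differences)."""
    djj = u[:, 2:] + u[:, :-2] - 2 * u[:, 1:-1]                # second diff in j
    dii = u[2:, :] + u[:-2, :] - 2 * u[1:-1, :]                # second diff in i
    dij = u[1:, 1:] - u[:-1, 1:] - u[1:, :-1] + u[:-1, :-1]    # cross diff
    return 0.5 * (np.sum(djj**2) + np.sum(dii**2) + 2 * np.sum(dij**2))

u = np.add.outer(np.arange(5.0), np.arange(5.0))  # the plane u[i,j] = i + j
print(membrane_energy(u), thin_plate_energy(u))   # membrane > 0, thin plate == 0
```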
Stabilizing function: membrane stabilizer
- E_p(u) = 0.5 \sum_{i,j} [(u_{i,j+1} - u_{i,j})^2 + (u_{i+1,j} - u_{i,j})^2]
- (Figure: the grid of values u_{i,j}, indexed by i and j, with the computational molecule ("atom") built up term by term: each squared difference adds +1 to the diagonal entries (u_{i,j}, u_{i,j}) and (u_{i,j+1}, u_{i,j+1}) and -1 to the corresponding cross entries, giving a centre weight of 4 and four neighbour weights of -1)
- E_p(u) = 0.5 u^T A_p u
- Rows of A_p have the form
- 0 ... 0 -1 0 ... 0 -1 4 -1 0 ... 0 -1 0 ...
- (A sketch of assembling A_p in code follows below)
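A small sketch of how A_p can be assembled as a sparse matrix, assuming a row-major flattening of the n x n grid and free boundaries; the function name membrane_Ap is ours.

```python
import numpy as np
import scipy.sparse as sp

def membrane_Ap(n):
    """Sparse A_p for an n x n grid so that E_p(u) = 0.5 * u.T @ Ap @ u."""
    # 1-D first-difference operator: (D1 @ x)[k] = x[k+1] - x[k]
    D1 = sp.diags([-1.0, 1.0], [0, 1], shape=(n - 1, n))
    I = sp.identity(n)
    Dj = sp.kron(I, D1)   # differences along j, within each grid row
    Di = sp.kron(D1, I)   # differences along i, across grid rows
    # E_p(u) = 0.5 (||Dj u||^2 + ||Di u||^2) = 0.5 u^T (Dj^T Dj + Di^T Di) u
    return (Dj.T @ Dj + Di.T @ Di).tocsr()

Ap = membrane_Ap(4)
print(Ap[5].toarray())  # an interior row: centre 4, four neighbours -1
```

Boundary rows come out with smaller diagonal entries (2 or 3), which is the free-boundary version of the molecule.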
Stabilizing function: thin plate stabilizer
- E_p(u) = 0.5 \sum_{i,j} [(u_{i,j+1} + u_{i,j-1} - 2u_{i,j})^2 + (u_{i+1,j} + u_{i-1,j} - 2u_{i,j})^2 + 2(u_{i+1,j+1} - u_{i,j+1} - u_{i+1,j} + u_{i,j})^2]
- (Figure: the computational molecule ("atom") for the thin plate stabilizer: centre weight 20, edge neighbours -8, diagonal neighbours 2, and the four two-step neighbours 1; a quick check of this molecule in code follows below)
- E_p(u) = 0.5 u^T A_p u
- Rows of A_p have the form
- 0 ... 1 ... 0 2 -8 2 0 ... 1 -8 20 -8 1 ... 0 2 -8 2 0 ... 1 ... 0
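One way to verify the molecule, as a sketch: the thin plate operator is the membrane (Laplacian) operator applied twice, so convolving the five-point stencil with itself reproduces the numbers above.

```python
import numpy as np
from scipy.signal import convolve2d

laplacian = np.array([[ 0, -1,  0],
                      [-1,  4, -1],
                      [ 0, -1,  0]])
print(convolve2d(laplacian, laplacian))
# [[ 0  0  1  0  0]
#  [ 0  2 -8  2  0]
#  [ 1 -8 20 -8  1]
#  [ 0  2 -8  2  0]
#  [ 0  0  1  0  0]]
```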
Stabilizing function: examples (1-D)
(Figure: interpolation of scattered points with the membrane, thin plate, and thin plate + membrane stabilizers)
Stabilizing function: examples (2-D)
(Figure: samples from u under the membrane, thin plate, and membrane + thin plate stabilizers)
Energy function
- Data on grid
  - d_{i,j} = u_{i,j} + e_{i,j}, where e_{i,j} is N(0, \sigma^2)
  - E_d(u,d) = 0.5 \sum_{i,j} c_{i,j} (d_{i,j} - u_{i,j})^2, with c_{i,j} = \sigma^{-2}
- Data off grid
  - d_k = h_{0,0} u_{i,j} + h_{0,1} u_{i,j+1} + h_{1,0} u_{i+1,j} + h_{1,1} u_{i+1,j+1} + e_k
  - E_d(u,d) = 0.5 \sum_k c_k (d_k - H_k u)^2
- In all examples here we assume data on the grid
  - E_d(u,d) = 0.5 (u - d)^T A_d (u - d)
  - A_d = \sigma^{-2} I: the measurement variance is assumed constant for all data
Overall energy
- E(u) = \lambda E_p(u) + (1-\lambda) E_d(u,d) (\lambda is the regularization factor)
- = 0.5 [\lambda u^T A_p u + (1-\lambda)(u - d)^T A_d (u - d)]
- = 0.5 u^T A u - u^T b + const
- Where
  - A = \lambda A_p + (1-\lambda) A_d
  - b = (1-\lambda) A_d d
- The solution for u can be obtained directly by minimizing E(u): u = A^{-1} b (a sketch follows below)
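A minimal end-to-end sketch of the direct solution in 1-D, assuming a membrane prior, on-grid data, and constant measurement variance; all names and example values here are ours.

```python
import numpy as np

n, lam, sigma2 = 50, 0.5, 1.0

# 1-D membrane prior matrix: Ap = D^T D for the first-difference operator D
D = np.diff(np.eye(n), axis=0)           # (n-1) x n, rows give u[i+1] - u[i]
Ap = D.T @ D

# Noisy on-grid observations of a smooth signal
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, np.pi, n))
d = truth + rng.normal(0, np.sqrt(sigma2), n)

Ad = np.eye(n) / sigma2                  # Ad = sigma^-2 * I
A = lam * Ap + (1 - lam) * Ad
b = (1 - lam) * (Ad @ d)
u = np.linalg.solve(A, b)                # direct minimizer of E(u)
```

Solving A u = b is preferable to forming A^{-1} explicitly, though as a later slide notes, even this is inefficient for large grids.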
Minimizing the overall energy: 1-D (\lambda = 0.5)
(Figure: membrane and thin plate results, each estimated from a noisy observation and with no observation noise)
Minimizing the overall energy: 2-D (\lambda = 0.5)
(Figure: original and noisy surfaces; zero-mean, unit-variance Gaussian noise was added to all elements)
Minimizing the overall energy: 2-D (\lambda = 0.5)
(Figure: original surface, with membrane and thin plate reconstructions from the noisy data)
Minimizing energy by relaxation
- Direct computation of A^{-1} is inefficient
  - Large matrices: for a 256x256 grid, A has size 65536 x 65536
  - Sparseness of A is not utilized: only a small fraction of its elements are non-zero
- Relaxation replaces the inversion of A with many local estimates (see the sketch below)
  - u_i \leftarrow a_{i,i}^{-1} (b_i - \sum_{j \neq i} a_{i,j} u_j)
- Updates can be done in parallel
- All local computations are very simple
- Can be slow to converge
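A sketch of the local update, taking A and b as dense arrays (e.g. from the earlier 1-D sketch). This version sweeps sequentially (Gauss-Seidel style); the parallel variant mentioned on the slide would compute all u_i from the previous iterate (Jacobi).

```python
import numpy as np

def relax(A, b, iters=1000):
    """Relaxation: replace each u_i by the local estimate
    a_ii^{-1} * (b_i - sum_{j != i} a_ij * u_j), sweeping repeatedly."""
    u = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        for i in range(len(b)):
            s = A[i] @ u - A[i, i] * u[i]   # sum over j != i of a_ij * u_j
            u[i] = (b[i] - s) / A[i, i]
    return u
```

With enough iterations this converges to the same u as the direct solve, only slowly for the thin plate case, as the following figures show.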
Minimizing energy by relaxation: 1-D (\lambda = 0.5)
(Figure: membrane result after 100, 500, and 1000 iterations)
Minimizing energy by relaxation: 1-D (\lambda = 0.5)
(Figure: thin plate, much slower to converge: 1000, 10000, and 100000 iterations)
Minimizing energy by relaxation: 2-D (\lambda = 0.5)
(Figure: original and membrane results after 1000, 10000, and 100000 iterations)
Minimizing energy by relaxation: 2-D (\lambda = 0.5)
(Figure: original and thin plate results, much slower to converge: 1000, 10000, and 100000 iterations)
Prior models
- A Boltzmann distribution based on the stabilizing function
  - P(u) = K exp(-E_p(u)/T_p)
  - K is a normalizing constant, T_p is the temperature
- Samples can be generated by repeated sampling of the local distributions P(u_i | u) (see the sketch below)
  - P(u_i | u) = Z_i exp(-a_{i,i}(u_i - \bar{u}_i)^2 / 2T_p)
  - \bar{u}_i = a_{i,i}^{-1} (b_i - \sum_{j \neq i} a_{i,j} u_j)
  - This is the local estimate of u_i in the relaxation method
  - The variance of the local sample is T_p / a_{i,i}
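A sketch of the local Gibbs sampler for the prior (so b = 0), with A_p as a dense array (e.g. Ap.toarray() from the earlier sketch); names are ours.

```python
import numpy as np

def gibbs_prior_sample(Ap, Tp=1.0, sweeps=500, rng=None):
    """Sample from P(u) ~ exp(-Ep(u)/Tp) by repeatedly drawing each u_i
    from its local Gaussian: mean = relaxation estimate (with b = 0),
    variance = Tp / a_ii."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = np.zeros(Ap.shape[0])
    for _ in range(sweeps):
        for i in range(len(u)):
            mean = -(Ap[i] @ u - Ap[i, i] * u[i]) / Ap[i, i]
            u[i] = rng.normal(mean, np.sqrt(Tp / Ap[i, i]))
    return u
```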
Samples from the prior distribution: 1-D
(Figure: samples from the membrane and thin plate stabilizer based Boltzmann distributions)
Samples from the prior distribution: 2-D
(Figure: sample from the membrane prior)
Samples from the prior distribution: 2-D
(Figure: sample from the thin plate prior)
Sampling prior distributions
- Samples are fractal
- Tend to favour high frequencies
- Multi-grid sampling to get smoother samples (see the sketch below):
  - Initially generate a sample on a very coarse grid
  - Interpolate from the coarse grid to a finer grid, and use the interpolated values to initialize Gibbs sampling on the finer grid
  - Repeat the process on successively finer grids
  - The final sample covers the entire grid
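A coarse-to-fine sketch under the 2-D membrane prior, with boundaries simply held fixed and nearest-neighbour upsampling standing in for interpolation; both simplifications, and all names, are ours.

```python
import numpy as np

def gibbs_sweeps_2d(u, Tp=1.0, sweeps=20, rng=None):
    """Gibbs sweeps under the 2-D membrane prior: an interior site has
    local mean = average of its 4 neighbours and variance Tp/4."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = u.shape[0]
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                m = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                u[i, j] = rng.normal(m, np.sqrt(Tp / 4))
    return u

def multigrid_sample(levels=(8, 16, 32), Tp=1.0):
    """Sample coarsely, upsample, and reuse as the initialization at the
    next finer grid (each level assumed twice the previous size)."""
    u = gibbs_sweeps_2d(np.zeros((levels[0], levels[0])), Tp)
    for n in levels[1:]:
        u = np.kron(u, np.ones((2, 2)))   # nearest-neighbour 2x upsample
        u = gibbs_sweeps_2d(u, Tp)
    return u
```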
Multigrid sampling of the prior distribution
(Figure: multigrid samples from the membrane and thin plate priors)
Sensor models
- Sparse data model
  - Uses a simple energy function
  - Assumption: the data points are all on the grid
- Only the sparse data model is used in the examples
- Others, such as force field models, optical flow, image intensity, etc., are not simulated for this presentation
- The measurement variance is assumed constant for all data points
Posterior model
- Simple Bayes rule
- P(u | d) = K exp(-E_p(u)/T_p - E_d(u,d))
- Also a Gibbs distribution
- 1/T_p is the equivalent of the regularization factor: T_p = (1-\lambda)/\lambda
- In the following figures only the thin plate prior is considered
Sampling the posterior model (T = 1)
MAP estimation from the Gibbs posterior
- Restate the Gibbs posterior distribution as
  - P(u) = K exp(-E(u)/T)
  - E(u) is the total energy
  - T again is a temperature, not to be confused with the regularization term T_p
- Reduce T with the iterations (a sketch of the cooling loop follows below)
  - An iteration is defined as a complete sweep through the data
- Convergence to the MAP estimate is guaranteed as T goes to 0, provided T does not go down faster than 1/log(iter), where iter is the iteration number
- In practice, much faster cooling is possible
- For the simple sparse data sensor model, the MAP estimate must be identical to that obtained using relaxation or matrix inversion
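A sketch of annealed Gibbs sampling for the MAP estimate, reusing the local-Gaussian update from the prior-sampling sketch; the geometric cooling schedule is one of the "much faster" practical choices, not the 1/log(iter) guarantee.

```python
import numpy as np

def annealed_gibbs_map(A, b, sweeps=5000, T0=4.0, rng=None):
    """Sample each u_i from its local Gaussian while lowering T; as T -> 0
    the samples freeze at the minimizer of E(u) = 0.5 u^T A u - u^T b."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = np.zeros(len(b))
    for k in range(sweeps):
        T = T0 * 0.999 ** k                 # fast geometric cooling (practical)
        for i in range(len(b)):
            mean = (b[i] - (A[i] @ u - A[i, i] * u[i])) / A[i, i]
            u[i] = rng.normal(mean, np.sqrt(T / A[i, i]))
    return u
```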
MAP estimates from the posterior: 1-D
(Figure: relaxation after 100000 iterations vs annealed Gibbs sampling after 100000 iterations)
MAP estimates from the posterior: 2-D
(Figure: actual MAP solution vs annealed Gibbs sampling based MAP solution)
The contaminated Gaussian sensor model
- Also a sparse data sensor model
- Assumes the measurement error has two modes:
  1. A high-probability, low-variance Gaussian
  2. A low-probability, high-variance Gaussian
- P(d_{i,j} | u) = (1-\epsilon) N(u_{i,j}, \sigma_1^2) + \epsilon N(u_{i,j}, \sigma_2^2)
- 0.05 < \epsilon < 0.1 and \sigma_2^2 >> \sigma_1^2
- The posterior probability is also a mixture of Gaussians (see the sketch below):
  - (1-\epsilon) P_1(d_{i,j} | u) + \epsilon P_2(d_{i,j} | u)
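A sketch of the per-point sensor likelihood; the function name and the example values of eps, s1, s2 are ours, chosen within the ranges stated on this slide.

```python
import numpy as np
from scipy.stats import norm

def contaminated_loglik(d, u, eps=0.07, s1=1.0, s2=10.0):
    """Per-point log-likelihood of the contaminated Gaussian sensor model:
    P(d | u) = (1 - eps) * N(u, s1^2) + eps * N(u, s2^2)."""
    p = (1 - eps) * norm.pdf(d, loc=u, scale=s1) \
        + eps * norm.pdf(d, loc=u, scale=s2)
    return np.log(p)
```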
Samples from the posterior using the contaminated Gaussian
MAP estimates with the contaminated Gaussian: 1-D
(Figure: MAP estimate using the single Gaussian sensor model vs the contaminated Gaussian sensor model)
- For the contaminated Gaussian there is no closed-form MAP estimate
- Gibbs sampling provides a MAP estimate
MAP estimates with the contaminated Gaussian: 2-D
(Figure: MAP estimate using a single Gaussian sensor model vs a contaminated Gaussian sensor model)
- For the contaminated Gaussian, the MAP estimate is obtained using annealed Gibbs sampling
Why Bayesian?
- The Bayesian and regularization solutions are identical for some models
- The Bayesian approach provides several other advantages:
  - Handles complex sensor models, e.g. the contaminated Gaussian model
  - Provides uncertainty estimates
  - Provides a handle to estimate the optimal regularization factor
  - Provides a formalism for methods such as Kalman filtering
  - Etc.
Why Bayesian? Uncertainty measurement
- (Figure) The blue curve is the MAP estimate
- The red curves show one standard deviation on either side
Why Bayesian? Uncertainty measurement (T = 1)
- The figure is actually a sandwich
- The surface in the middle is the MAP estimate
- The bounding surfaces indicate one standard deviation
Why Bayesian? Uncertainty measurement
- Variance field
- For the thin plate prior, the variance is constant except at the boundaries
- The variance of the posterior deviates from the thin plate variance only at measured data points
- Other prior distributions would have prettier variance and covariance fields
Why Bayesian? Optimize the regularization factor
- The likelihood of each observation is a Gaussian
  - It has two factors: 1/\sqrt{2\pi\sigma^2} and exp(-0.5(u - \bar{u})^2/\sigma^2)
- Its negative log therefore has two terms:
  - E_1(d) = 0.5 log(2\pi\sigma^2)
  - E_2(d) = 0.5 (u - \bar{u})^2/\sigma^2
- Both terms are functions of \sigma^2, and \sigma^2 is a function of the regularization factor \lambda
- As \lambda increases, E_1(d) increases but E_2(d) decreases
- There is a specific value of \lambda at which E_1(d) + E_2(d) is minimum
- This is the maximum likelihood estimate of \lambda
Why Bayesian? Optimize the regularization factor
(Figure: MAP estimates for \lambda = 0.25, 0.5, and 0.75; the black curve is the MAP estimate without measurement noise)
Why Bayesian? Optimize the regularization factor: 1-D
(Figure: E_1, E_2, and E_1 + E_2 plotted against log(T); the optimal log(T) is around 1.9)
Why Bayesian? Optimize the regularization factor: 2-D
(Figure: E_1, E_2, and E_1 + E_2 plotted against T)
- The x axis is T (not log(T)); the optimal T is about 2.7
- Estimating E_1 and E_2 requires computing the determinants of A and A_p
- A_p is singular for the thin plate prior
- Diagonal loading is not sufficient
- Compute the determinant from the eigenvalues to avoid underflow/overflow (see the sketch below)
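A sketch of that computation: summing log-eigenvalues avoids forming the overflowing (or underflowing) product of tens of thousands of eigenvalues, and skipping the near-zero modes of the singular thin plate A_p amounts to a pseudo-determinant, which is our reading of the slide.

```python
import numpy as np

def log_det(M, tol=1e-10):
    """log|M| from eigenvalues, skipping (near-)zero eigenvalues such as
    the singular directions of the thin plate Ap."""
    w = np.linalg.eigvalsh(M)   # M is symmetric, so eigenvalues are real
    return np.sum(np.log(w[w > tol]))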
Why Bayesian? Optimize the regularization factor: 2-D
(Figure: reconstruction with no observation noise vs the maximum likelihood estimate)
Why Bayesian? Kalman filter
(Figure slides illustrating Kalman filtering)