Variational Bayes 101 - PowerPoint PPT Presentation

About This Presentation

Title:

Variational Bayes 101

Description:

Hansen & Rasmussen, Neural Comp (1994) Tipping 'Relevance vector machine' (1999) ... Hansen & Rasmussen, Neural Comp (1994) Approximations needed for posteriors ... – PowerPoint PPT presentation

Slides: 49

Provided by: tor121

Transcript and Presenter's Notes

Title: Variational Bayes 101

1
Variational Bayes 101
2
The Bayes scene

Exact averaging in discrete/small models (Bayes
networks)
Approximate averaging
- Monte Carlo methods
- Ensemble/mean field
- Variational Bayes methods

Variational-Bayes.org, MLpedia, Wikipedia

ISP Bayes
ICA: mean field, Kalman, dynamical systems
NeuroImaging: optimal signal detector
Approximate inference
Machine learning methods

3
Bayes methodology
Minimal error rate is obtained when the detector is
based on the posterior probability (Bayes decision
theory)
Likelihood may contain unknown parameters
4
Bayes methodology
The conventional approach is to use the most probable
parameters
However, the averaged model is generalization
optimal (Hansen, 1999), i.e.
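In standard notation, the two approaches can be sketched as follows. The symbols are assumptions rather than taken from the slide: $\theta$ for the parameters, $D$ for the training data, $\hat{\theta}$ for the most probable parameters.

```latex
% Conventional plug-in prediction with the most probable parameters
\[
  p(x \mid D) \approx p(x \mid \hat{\theta}),
  \qquad
  \hat{\theta} = \arg\max_{\theta} \, p(\theta \mid D)
\]
% Averaged (Bayesian) prediction
\[
  p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
\]
```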
5
The hidden agenda of learning

Typically learning proceeds by generalization
from a limited set of samples, but
we would like to identify the model that
generated the data.
Choose the least complex model compatible with
the data.

That I figured out in 1386
6
Generalization!

Generalizability is defined as the expected
performance on a random new sample ... the mean
performance of a model on a fresh data set is
an unbiased estimate of generalization
Typical loss functions:
$\langle -\log p(x) \rangle$, $\langle \text{prediction errors} \rangle$,
$\langle (g(x) - \hat{g}(x))^2 \rangle$,
$\langle \log [\, p(x,g) / (p(x)\,p(g)) \,] \rangle$, etc.
Results can be presented as bias-variance
trade-off curves or learning curves

7
Generalization optimal predictive distribution

The game of guessing a pdf
Assume: a random teacher $\theta$ drawn from $P(\theta)$, and a random
data set, D, drawn from $P(x \mid \theta)$
The prediction / generalization error is

Predictive distribution of model A
Test sample distribution
8
Generalization optimal predictive distribution

We define the generalization functional
(Hansen, NIPS 1999)
Minimized by the Bayesian averaging predictive
distribution
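A sketch of why averaging is optimal under log-loss, in assumed notation: $q(x \mid D)$ denotes an arbitrary predictive distribution, and the exact form used in Hansen (NIPS 1999) may differ in detail. Averaging the log-loss over teachers, data sets and test samples, and regrouping with $P(\theta)P(D \mid \theta) = P(D)P(\theta \mid D)$, gives:

```latex
\[
  \Gamma[q]
  = \int P(\theta) \int P(D \mid \theta) \int p(x \mid \theta)
    \bigl[ -\log q(x \mid D) \bigr] \, dx \, dD \, d\theta
  = \int P(D) \int p(x \mid D) \bigl[ -\log q(x \mid D) \bigr] \, dx \, dD ,
\]
\[
  \text{where} \quad
  p(x \mid D) = \int p(x \mid \theta)\, P(\theta \mid D)\, d\theta .
\]
% For each D the inner integral is a cross-entropy, which by Gibbs'
% inequality is minimized by q(x | D) = p(x | D).
```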

9
Bias-variance trade-off and averaging

Averaging is good, but can we average too much?
Define the family of tempered posterior
distributions
Case: univariate normal distribution with unknown mean
parameter
High temperature: widened posterior average
Low temperature: narrow average
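A minimal sketch of the univariate case, under stated assumptions: the tempered posterior is taken to be proportional to likelihood^(1/T) times prior, the noise variance is known, and the prior on the mean is N(0, prior_var). The function and parameter names are hypothetical.

```python
import numpy as np


def tempered_posterior(x, temperature, noise_var=1.0, prior_var=10.0):
    """Mean and variance of the tempered posterior over the unknown mean.

    Assumes likelihood N(x | theta, noise_var), prior N(theta | 0, prior_var),
    and a likelihood raised to the power beta = 1 / temperature.
    """
    x = np.asarray(x, dtype=float)
    beta = 1.0 / temperature
    precision = 1.0 / prior_var + beta * x.size / noise_var
    mean = beta * np.sum(x) / noise_var / precision
    return float(mean), float(1.0 / precision)


def tempered_predictive(x, temperature, noise_var=1.0, prior_var=10.0):
    """Mean and variance of the predictive distribution averaged over the
    tempered posterior (a Gaussian, since everything here is conjugate)."""
    mean, var = tempered_posterior(x, temperature, noise_var, prior_var)
    return mean, noise_var + var


if __name__ == "__main__":
    data = np.array([1.0, 2.0, 3.0])
    for temp in (0.1, 1.0, 10.0):
        print(temp, tempered_predictive(data, temp))
```

In this sketch T = 1 recovers the ordinary posterior, large T widens the average toward the prior, and small T narrows it toward the maximum likelihood estimate.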

10
Bayes model selection, example

Let three models A,B,C be given
A) x is normal $N(0, 1)$
B) x is normal $N(0, \sigma^2)$, $\sigma^2$ is uniform U(0,8)
C) x is normal $N(\mu, \sigma^2)$, $\mu$, $\sigma^2$ are uniform U(0,8)
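A minimal sketch of how the three models could be compared by their evidence, reading U(0,8) literally as a proper uniform prior on the interval (0, 8) and integrating numerically on a midpoint grid. The grid size, the function names and the equal prior over models are assumptions made for illustration.

```python
import numpy as np


def log_lik(x, mu, var):
    """Log-likelihood of samples x under N(mu, var); mu and var may be arrays."""
    n = x.size
    sq = np.sum(x ** 2) - 2.0 * mu * np.sum(x) + n * mu ** 2
    return -0.5 * n * np.log(2.0 * np.pi * var) - 0.5 * sq / var


def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a)."""
    m = np.max(a)
    return float(m + np.log(np.mean(np.exp(a - m))))


def log_evidences(x, upper=8.0, grid=400):
    """Log evidence of models A, B, C with uniform priors on (0, upper)."""
    x = np.asarray(x, dtype=float)
    # Midpoint rule: averaging the likelihood over the grid points
    # approximates the integral against the uniform prior density.
    pts = (np.arange(grid) + 0.5) * upper / grid
    log_a = float(log_lik(x, 0.0, 1.0))                # no free parameters
    log_b = log_mean_exp(log_lik(x, 0.0, pts))         # average over variance
    mu, var = np.meshgrid(pts, pts, indexing="ij")
    log_c = log_mean_exp(log_lik(x, mu, var))          # average over mean, variance
    return {"A": log_a, "B": log_b, "C": log_c}


def model_posteriors(log_ev):
    """Posterior model probabilities under an equal prior over models."""
    names = list(log_ev)
    vals = np.array([log_ev[k] for k in names])
    w = np.exp(vals - vals.max())
    return dict(zip(names, (w / w.sum()).tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    print(model_posteriors(log_evidences(sample)))
```

Because B and C spread their prior mass over many parameter values, a simpler model can win when it explains the data equally well, which is the Occam effect the example is meant to illustrate.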

11
Model A
The likelihood of N samples is given by
12
Model B
The likelihood of N samples is given by
13
Model C
The likelihood of N samples is given by
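For reference, the Gaussian likelihoods of N independent samples $x_1, \dots, x_N$ under the three models can be written as below. This is a sketch in standard notation conditional on the parameters; the slides may instead show the versions averaged over the uniform priors.

```latex
\[
  p(D \mid A) = (2\pi)^{-N/2}
    \exp\Bigl( -\tfrac{1}{2} \sum_{n=1}^{N} x_n^2 \Bigr)
\]
\[
  p(D \mid \sigma^2, B) = (2\pi\sigma^2)^{-N/2}
    \exp\Bigl( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} x_n^2 \Bigr)
\]
\[
  p(D \mid \mu, \sigma^2, C) = (2\pi\sigma^2)^{-N/2}
    \exp\Bigl( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \Bigr)
\]
```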
14
Model A maximum likelihood
The likelihood of N samples is given by
15
Model B
The likelihood of N samples is given by
16
Model C
The likelihood of N samples is given by
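A sketch of the maximum likelihood versions in standard notation, assuming the estimates fall inside the support of the priors; the symbols $\hat{\mu}$ and $\hat{\sigma}^2$ are assumptions, not taken from the slides.

```latex
% Model A has no free parameters
\[
  \log p(D \mid A) = -\frac{N}{2} \log 2\pi - \frac{1}{2} \sum_{n=1}^{N} x_n^2
\]
% Model B: maximize over the variance
\[
  \hat{\sigma}^2_B = \frac{1}{N} \sum_{n=1}^{N} x_n^2 ,
  \qquad
  \log p(D \mid \hat{\sigma}^2_B, B)
  = -\frac{N}{2} \bigl[ \log ( 2\pi \hat{\sigma}^2_B ) + 1 \bigr]
\]
% Model C: maximize over the mean and the variance
\[
  \hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n ,
  \quad
  \hat{\sigma}^2_C = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2 ,
  \quad
  \log p(D \mid \hat{\mu}, \hat{\sigma}^2_C, C)
  = -\frac{N}{2} \bigl[ \log ( 2\pi \hat{\sigma}^2_C ) + 1 \bigr]
\]
```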
17

Bayesian model selection
C (green) is the correct model;
what if only A (red) and B (blue) are known?
18

Bayesian model selection
A (red) is the correct model

19
Bayesian inference

Bayesian averaging
Caveats
Bayes can rarely be implemented exactly
Not optimal if the model family is incorrect
Bayes cannot detect bias
However, still asymptotically optimal if the
observation model is correct and the
prior is weak (Hansen, 1999).

20
Hierarchical Bayes models

Multi-level models in Bayesian averaging

C.P. Robert, The Bayesian Choice: A Decision-Theoretic Motivation. Springer Texts in Statistics, Springer Verlag, New York (1994).
G. Golub, M. Heath and G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, pp. 215-223 (1979).
K. Friston, A theory of cortical responses. Phil. Trans. R. Soc. B 360:815-836 (2005).

21
Hierarchical Bayes models
Posterior
Learning hyperparameters by adjusting prior
expectations (empirical Bayes; MacKay, 1992)
Prior
Evidence
Hansen et al. (Eusipco, 2006) Cf. Boltzmann
learning (Hinton et al. 1983)
Target at Maximal evidence
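In standard notation, the posterior, prior and evidence labels on this slide fit together as sketched below, assuming a hyperparameter $\alpha$ that controls the prior over the parameters $\theta$ (the symbols are assumptions, not taken from the slide). Empirical Bayes then selects the hyperparameter that maximizes the evidence.

```latex
\[
  \underbrace{p(\theta \mid D, \alpha)}_{\text{posterior}}
  = \frac{ p(D \mid \theta) \, \overbrace{p(\theta \mid \alpha)}^{\text{prior}} }
         { \underbrace{p(D \mid \alpha)}_{\text{evidence}} } ,
  \qquad
  p(D \mid \alpha) = \int p(D \mid \theta)\, p(\theta \mid \alpha)\, d\theta ,
\]
\[
  \hat{\alpha} = \arg\max_{\alpha} \, p(D \mid \alpha) .
\]
```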
22
Hyperparameter dynamics
Gaussian prior with adaptive hyperparameter
$\theta^2 A$ is a signal-to-noise measure; $\theta_{ML}$ is the
maximum likelihood optimum.
Discontinuity: the parameter is pruned at low
signal-to-noise. Hansen & Rasmussen, Neural Comp
(1994); Tipping, 'Relevance vector machine' (1999)
23
Hyperparameter dynamics

Hyperparameters dynamically updated implies
pruning
Pruning decisions based on SNR
Mechanism for cognitive selection, attention?
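The pruning discontinuity can be illustrated with a one-parameter toy model. This is a sketch under assumptions, not the authors' derivation: the maximum likelihood estimate theta_ml is treated as a noisy observation of the weight with precision A, the prior is N(0, 1/alpha), and alpha is set by maximizing the evidence. The function names are hypothetical.

```python
import math


def evidence_optimal_alpha(theta_ml, A):
    """Evidence-maximizing prior precision for the toy model.

    The marginal of theta_ml is N(0, 1/alpha + 1/A); its likelihood peaks
    at 1/alpha = theta_ml**2 - 1/A when that value is positive. Otherwise
    the optimum sits at alpha = infinity, i.e. the parameter is pruned.
    """
    snr = theta_ml ** 2 * A          # signal-to-noise measure
    if snr <= 1.0:
        return math.inf
    return A / (snr - 1.0)


def posterior_mean(theta_ml, A):
    """Posterior mean of the weight under the evidence-optimal prior."""
    alpha = evidence_optimal_alpha(theta_ml, A)
    if math.isinf(alpha):
        return 0.0                   # pruned at low signal-to-noise
    return theta_ml * A / (A + alpha)


if __name__ == "__main__":
    for theta in (0.5, 1.0, 1.5, 2.0):
        print(theta, evidence_optimal_alpha(theta, 1.0), posterior_mean(theta, 1.0))
```

In this toy model the weight is exactly zero whenever the signal-to-noise measure is at most one, and is shrunk by the factor 1 - 1/(theta_ml**2 * A) above that threshold.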

24
Hansen & Rasmussen, Neural Comp (1994)
35
Approximations needed for posteriors