Title: Monte Carlo methods (Tom Griffiths, UC Berkeley)
Slide 3: Two uses of Monte Carlo methods
- For solving problems of probabilistic inference involved in developing computational models
- As a source of hypotheses about how the mind might solve problems of probabilistic inference
Slide 4: Answers and expectations
- For a function f(x) and distribution P(x), the expectation of f with respect to P is
  $E_P[f(x)] = \sum_x f(x)\,P(x)$
  (an integral for continuous x)
- The expectation is the average of f when x is drawn from the probability distribution P
Slide 5: Answers and expectations
- Example 1: the average of spots on a die roll
  - x ∈ {1, ..., 6}, f(x) = x, P(x) is uniform
- Example 2: the probability that two observations belong to the same mixture component
  - x is an assignment of observations to components, f(x) = 1 if the observations belong to the same component and 0 otherwise, P(x) is the posterior over assignments
Slide 6: The Monte Carlo principle
- The expectation of f with respect to P can be approximated by
  $E_P[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)$
  where the $x_i$ are sampled from P(x)
- Example 1: the average of spots on a die roll, approximated by the average of n simulated rolls
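A minimal sketch of this estimator in Python (the function names are illustrative, not from the slides):

```python
import random

def mc_expectation(f, sample, n=100_000):
    """Estimate E_P[f(x)] by averaging f over n samples drawn from P."""
    return sum(f(sample()) for _ in range(n)) / n

# Example 1: average number of spots on a fair die roll (true value 3.5).
print(mc_expectation(lambda x: x, lambda: random.randint(1, 6)))  # ~3.5
```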
Slide 7: The Monte Carlo principle
The law of large numbers: [figure showing the running average number of spots converging as the number of rolls grows; axes: average number of spots vs. number of rolls]
Slide 8: More formally
- $\hat\theta_{MC}$ is consistent: $(\hat\theta_{MC} - \theta) \to 0$ a.s. as $n \to \infty$
- $\hat\theta_{MC}$ is unbiased: $E[\hat\theta_{MC}] = \theta$
- $\hat\theta_{MC}$ is asymptotically normal, with $\sqrt{n}\,(\hat\theta_{MC} - \theta) \to N(0, \mathrm{Var}_P[f(x)])$
Slide 9: When simple Monte Carlo fails
- Efficient algorithms for sampling only exist for a relatively small number of distributions
Slide 10: Inverse cumulative distribution
[Figure: the CDF rises from 0 to 1; draw u ~ Uniform(0, 1) on the vertical axis and return $x = F^{-1}(u)$]
- (requires the CDF to be invertible)
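A minimal sketch of inverse-CDF sampling for a distribution whose CDF inverts in closed form (the exponential, with $F(x) = 1 - e^{-\lambda x}$; an illustrative choice, not from the slides):

```python
import math
import random

def sample_exponential(lam):
    """Draw x ~ Exponential(lam) by inverting the CDF F(x) = 1 - exp(-lam * x)."""
    u = random.random()              # u ~ Uniform(0, 1)
    return -math.log(1.0 - u) / lam  # x = F^{-1}(u)

samples = [sample_exponential(2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # ~0.5, the mean 1/lam
```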
Slide 11: Rejection sampling
[Figure: a target density p(x) under a scaled envelope; proposed points are accepted only if they fall under p(x)]
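A minimal sketch of the accept/reject loop, assuming an envelope with $p(x) \le M\,q(x)$ everywhere; the Beta(2,2) target and uniform proposal are illustrative choices:

```python
import random

def rejection_sample(p, q_sample, q_pdf, M):
    """Sample from density p via proposal q, assuming p(x) <= M * q(x) for all x."""
    while True:
        x = q_sample()                            # propose x ~ q
        if random.random() < p(x) / (M * q_pdf(x)):
            return x                              # accept w.p. p(x) / (M q(x))

# Illustration: target p(x) = 6x(1-x) on [0, 1] (a Beta(2,2) density),
# uniform proposal, M = 1.5 since max p(x) = 1.5.
p = lambda x: 6 * x * (1 - x)
xs = [rejection_sample(p, random.random, lambda x: 1.0, M=1.5) for _ in range(10_000)]
print(sum(xs) / len(xs))  # ~0.5
```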
Slide 12: Rejection sampling from the posterior
- Generate samples of all variables following the generative process in the model
- Reject all samples that do not match the observed data
[Figure: graphical model with nodes X1, X2, X3, X4; samples are kept only if the observed variables match the data]
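A minimal sketch on a hypothetical two-variable model (rain causes wet grass; all probabilities are invented for illustration): simulate the generative process and keep only samples that reproduce the observation.

```python
import random

def generative_process():
    """Toy generative model (hypothetical numbers): rain -> wet grass."""
    rain = random.random() < 0.3
    wet = random.random() < (0.9 if rain else 0.2)
    return rain, wet

# Reject all samples that do not match the observed data (wet = True).
accepted = [rain for rain, wet in (generative_process() for _ in range(100_000)) if wet]
print(sum(accepted) / len(accepted))  # ~0.66 = P(rain | wet)
```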
Slide 13: When simple Monte Carlo fails
- Efficient algorithms for sampling only exist for a relatively small number of distributions
- Sampling from distributions over large discrete state spaces is computationally expensive
  - a mixture model with n observations and k components, or an HMM with n observations and k states, has $k^n$ possibilities
Slide 14: When simple Monte Carlo fails
- Efficient algorithms for sampling only exist for a relatively small number of distributions
- Sampling from distributions over large discrete state spaces is computationally expensive
  - a mixture model with n observations and k components, or an HMM with n observations and k states, has $k^n$ possibilities
- Sometimes we want to sample from distributions for which we only know the probability of each state up to a multiplicative constant
Slide 15: Why Bayesian inference is hard
Evaluating the posterior probability of a hypothesis requires summing over all hypotheses:
$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h' \in H} P(d \mid h')\,P(h')}$
(the statistical-physics analogue is computing the partition function)
Slide 16: Modern Monte Carlo methods
- Sampling schemes for distributions with large state spaces, known up to a multiplicative constant
- Two approaches:
  - importance sampling
  - Markov chain Monte Carlo
- (Major competitors: variational inference and sophisticated numerical quadrature methods)
Slide 17: Importance sampling
- Basic idea: generate samples from the wrong distribution, then assign weights to the samples to correct for this
Slide 18: Importance sampling
$E_P[f(x)] = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)\,\frac{p(x_i)}{q(x_i)}, \qquad x_i \sim q(x)$
This works when sampling from the proposal q is easy but sampling from the target p is hard.
Slide 19: An alternative scheme
$\hat\theta_{IS} = \frac{\sum_i f(x_i)\,w_i}{\sum_i w_i}, \qquad w_i = \frac{p^*(x_i)}{q(x_i)}, \qquad x_i \sim q(x)$
This works when p(x) is known only up to a multiplicative constant, since the constant cancels in the ratio.
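A minimal sketch of the self-normalized scheme (an assumed example: a standard normal target known only up to its normalizing constant, with a wider normal proposal):

```python
import math
import random

def p_unnorm(x):
    """Target known only up to a constant: exp(-x^2/2), an unnormalized N(0, 1)."""
    return math.exp(-0.5 * x * x)

def q_pdf(x, scale=2.0):
    """Proposal density N(0, scale^2): easy to sample from and to evaluate."""
    return math.exp(-0.5 * (x / scale) ** 2) / (scale * math.sqrt(2 * math.pi))

xs = [random.gauss(0.0, 2.0) for _ in range(100_000)]   # x_i ~ q
ws = [p_unnorm(x) / q_pdf(x) for x in xs]               # w_i = p*(x_i) / q(x_i)

# Self-normalized estimate of E_P[x^2]; the unknown constant cancels.
print(sum(w * x * x for w, x in zip(ws, xs)) / sum(ws))  # ~1.0
```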
Slide 20: More formally
- $\hat\theta_{IS}$ is consistent: $(\hat\theta_{IS} - \theta) \to 0$ a.s. as $n \to \infty$
- $\hat\theta_{IS}$ is asymptotically normal, with asymptotic variance $\mathrm{Var}_q\!\big[\tfrac{p(x)}{q(x)}(f(x) - \theta)\big]$
- $\hat\theta_{IS}$ is biased, with bias of order $1/n$
Slide 21: Optimal importance sampling
- The asymptotic variance is $E_q\!\big[\tfrac{p(x)^2}{q(x)^2}(f(x) - \theta)^2\big]$
- This is minimized by $q^*(x) \propto |f(x) - \theta|\,p(x)$
Slide 22: Optimal importance sampling [figure only]
Slide 24: Likelihood weighting
- A particularly simple form of importance sampling for posterior distributions
- Use the prior as the proposal distribution
- Weights: $w(x) = \frac{P(x \mid d)}{P(x)} \propto P(d \mid x)$, the likelihood
Slide 25: Likelihood weighting
- Generate samples of all variables except the observed variables
- Assign weights proportional to the probability of the observed data given the values in the sample
[Figure: graphical model with nodes X1, X2, X3, X4; unobserved variables are sampled from the prior, observed variables contribute likelihood weights]
(contrast with rejection sampling from the posterior: no samples are discarded)
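A minimal sketch on the same hypothetical rain/wet-grass model used for rejection sampling above: sample the unobserved variable from the prior and weight by the likelihood of the observation, so no samples are wasted.

```python
import random

weights, indicators = [], []
for _ in range(100_000):
    rain = random.random() < 0.3              # sample unobserved variable from prior
    weights.append(0.9 if rain else 0.2)      # weight = P(wet = True | rain)
    indicators.append(1.0 if rain else 0.0)

# Weighted (self-normalized) estimate of P(rain | wet = True).
print(sum(w * r for w, r in zip(weights, indicators)) / sum(weights))  # ~0.66
```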
Slide 26: Importance sampling
- A general scheme for sampling from complex distributions that have simpler relatives
- Simple methods for sampling from posterior distributions in some cases (easy to sample from the prior; prior and posterior are close)
- Can be more efficient than simple Monte Carlo
  - particularly for, e.g., tail probabilities
- Also provides a solution to the question of how people can update beliefs as data come in
Slide 27: Particle filtering
[Figure: hidden Markov model with states s1, ..., s4 and observations d1, ..., d4]
We want to generate samples from P(s4 | d1, ..., d4).
We can use likelihood weighting if we can sample from P(s4 | s3) and P(s3 | d1, ..., d3).
Slide 28: Particle filtering
[Figure: particles representing samples from P(s3 | d1, ..., d3) are propagated through P(s4 | s3) and weighted by the likelihood of d4]
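A minimal sketch of this cycle (propagate, weight, resample) for a hypothetical linear-Gaussian state space model; the model and all numbers are invented for illustration.

```python
import math
import random

def particle_filter(data, n_particles=1_000):
    """Bootstrap particle filter for a toy model:
    s_t ~ Normal(0.9 * s_{t-1}, 1), d_t ~ Normal(s_t, 1)."""
    particles = [random.gauss(0.0, 1.0) for _ in range(n_particles)]
    for d in data:
        # Propagate each particle through the transition P(s_t | s_{t-1}).
        particles = [random.gauss(0.9 * s, 1.0) for s in particles]
        # Weight each particle by the likelihood P(d_t | s_t).
        weights = [math.exp(-0.5 * (d - s) ** 2) for s in particles]
        # Resample particles in proportion to their weights.
        particles = random.choices(particles, weights=weights, k=n_particles)
    return particles  # approximate samples from P(s_T | d_1, ..., d_T)

states = particle_filter([0.5, 1.0, 1.5, 2.0])
print(sum(states) / len(states))  # posterior mean estimate for s_4
```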
Slide 29: Tweaks and variations
- If we can enumerate the values of s4, we can sample from $P(s_4 \mid s_3, d_4) \propto P(d_4 \mid s_4)\,P(s_4 \mid s_3)$ directly
- No need to resample at every step, since we can accumulate weights over multiple observations
  - resampling reduces diversity in the samples
  - it is only necessary when the variance of the weights is large
- Stratification and clever resampling schemes reduce variance (Fearnhead, 2001)
Slide 30: The promise of particle filters
- People need to be able to update probability distributions over large hypothesis spaces as more data become available
- Particle filters provide a way to do this with limited computing resources
  - maintain a fixed finite number of samples
- Not just for dynamic models
  - can work with a fixed set of hypotheses, although this requires some further tricks for maintaining diversity
Slide 31: Markov chain Monte Carlo
- Basic idea: construct a Markov chain that will converge to the target distribution, and draw samples from that chain
- Just uses something proportional to the target distribution (good for Bayesian inference!)
- Can work in state spaces of arbitrary (including unbounded) size (good for nonparametric Bayes)
Slide 32: Markov chains
[Figure: a chain of states $x^{(1)} \to x^{(2)} \to \cdots \to x^{(n)}$]
Transition matrix: $T = P(x^{(t+1)} \mid x^{(t)})$
- Each variable $x^{(t+1)}$ is independent of all previous variables given its immediate predecessor $x^{(t)}$
Slide 33: An example: card shuffling
- Each state x(t) is a permutation of a deck of cards (there are 52! permutations)
- The transition matrix T indicates how likely one permutation is to become another
- The transition probabilities are determined by the shuffling procedure:
  - riffle shuffle
  - overhand
  - one card
Slide 34: Convergence of Markov chains
- Why do we shuffle cards?
- Convergence to a uniform distribution takes only 7 riffle shuffles
- Other Markov chains will also converge to a stationary distribution, if certain simple conditions are satisfied (called ergodicity)
  - e.g., every state can be reached in some number of steps from every other state
Slide 35: Markov chain Monte Carlo
[Figure: a chain of states $x^{(1)} \to x^{(2)} \to \cdots \to x^{(n)}$]
Transition matrix: $T = P(x^{(t+1)} \mid x^{(t)})$
- The states of the chain are the variables of interest
- The transition matrix is chosen to give the target distribution as its stationary distribution
Slide 36: Metropolis-Hastings algorithm
- Transitions have two parts:
  - proposal distribution: $Q(x^{(t+1)} \mid x^{(t)})$
  - acceptance: take proposals with probability
    $A(x^{(t)}, x^{(t+1)}) = \min\!\left(1,\ \frac{P(x^{(t+1)})\,Q(x^{(t)} \mid x^{(t+1)})}{P(x^{(t)})\,Q(x^{(t+1)} \mid x^{(t)})}\right)$
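A minimal sketch with a symmetric Gaussian random-walk proposal, so the Q terms cancel and only the ratio of (possibly unnormalized) target probabilities matters:

```python
import math
import random

def metropolis_hastings(log_p, x0=0.0, n=50_000, step=1.0):
    """Random-walk Metropolis: the proposal is symmetric, so Q cancels."""
    x, samples = x0, []
    for _ in range(n):
        proposal = x + random.gauss(0.0, step)       # x' ~ Q(x' | x)
        log_ratio = log_p(proposal) - log_p(x)
        # Accept with probability min(1, P(x') / P(x)), on the log scale.
        if random.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return samples

# Target known up to a constant: log p*(x) = -x^2/2, a standard normal.
samples = metropolis_hastings(lambda x: -0.5 * x * x)
print(sum(samples) / len(samples))  # ~0.0
```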
Slides 37-42: Metropolis-Hastings algorithm (animation)
[Figures: a sequence of frames showing a random walk over a density p(x); one proposal is accepted with probability A(x(t), x(t+1)) = 0.5, another with probability A(x(t), x(t+1)) = 1]
Slide 43: Metropolis-Hastings in a slide [figure only]
Slide 44: Metropolis-Hastings algorithm
- For the right stationary distribution, we want $\sum_x P(x)\,T(x \to y) = P(y)$
- A sufficient condition is detailed balance: $P(x)\,T(x \to y) = P(y)\,T(y \to x)$
Slide 45: Metropolis-Hastings algorithm
$P(x)\,T(x \to y) = P(x)\,Q(y \mid x)\,\min\!\left(1,\ \frac{P(y)\,Q(x \mid y)}{P(x)\,Q(y \mid x)}\right) = \min\big(P(x)\,Q(y \mid x),\ P(y)\,Q(x \mid y)\big)$
This is symmetric in (x, y) and thus satisfies detailed balance.
Slide 46: Gibbs sampling
- A particular choice of proposal distribution
- For variables $x = x_1, x_2, \ldots, x_n$
  - draw $x_i^{(t+1)}$ from $P(x_i \mid \mathbf{x}_{-i})$
  - where $\mathbf{x}_{-i} = x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)}$
- (this is called the full conditional distribution)
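A minimal sketch for a case where the full conditionals are known exactly: a bivariate normal with correlation rho, where each conditional is a one-dimensional Gaussian (a standard textbook example, not from the slides):

```python
import math
import random

def gibbs_bivariate_normal(rho=0.8, n=20_000):
    """Gibbs sampling for (x1, x2) ~ bivariate normal with zero means, unit
    variances, correlation rho. Full conditionals: Normal(rho * other, 1 - rho^2)."""
    x1, x2, samples = 0.0, 0.0, []
    sd = math.sqrt(1 - rho ** 2)
    for _ in range(n):
        x1 = random.gauss(rho * x2, sd)  # draw x1 from P(x1 | x2)
        x2 = random.gauss(rho * x1, sd)  # draw x2 from P(x2 | x1)
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal()
print(sum(a * b for a, b in samples) / len(samples))  # ~0.8, the correlation
```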
Slide 47: In a graphical model
[Figure: four copies of a graphical model with nodes X1, X2, X3, X4, highlighting each variable in turn]
Sample each variable conditioned on its Markov blanket.
Slide 48: Gibbs sampling
[Figure: a Gibbs sampling trajectory over two variables X1 and X2 (MacKay, 2002)]
Slide 49: Gibbs sampling in mixture models
[Figure: mixture model with parameters, component assignments z, and observations x]
- Sample assignments to components given data and parameters
- Sample parameters given data and assignments to components
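A minimal sketch of this alternation for a two-component Gaussian mixture (a simplified, illustrative model: unit component variances, equal weights, and a conjugate normal prior on the means):

```python
import math
import random

def gibbs_mixture(data, n_iter=500):
    """Gibbs sampler for a two-component Gaussian mixture (illustrative):
    unit variances, equal weights, Normal(0, 10^2) prior on each mean."""
    means = [-1.0, 1.0]
    z = [0] * len(data)
    for _ in range(n_iter):
        # Sample assignments to components given data and parameters.
        for i, x in enumerate(data):
            w = [math.exp(-0.5 * (x - m) ** 2) for m in means]
            z[i] = random.choices([0, 1], weights=w)[0]
        # Sample parameters given data and assignments to components.
        for k in range(2):
            xs = [x for x, zi in zip(data, z) if zi == k]
            prec = 1 / 100 + len(xs)   # posterior precision (conjugate update)
            means[k] = random.gauss(sum(xs) / prec, 1 / math.sqrt(prec))
    return means

data = [random.gauss(-2, 1) for _ in range(50)] + [random.gauss(2, 1) for _ in range(50)]
print(sorted(gibbs_mixture(data)))  # means near -2 and 2
```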
Slide 50: MCMC vs. EM
EM converges to a single solution
MCMC converges to a distribution of solutions
Slide 51: Evaluating convergence
- The basic formal result justifying MCMC:
  - expectations over sequences of variables converge to expectations over the stationary distribution
- Under this result, just run MCMC as long as possible to get as close as possible to the truth
- In practice, a variety of heuristics are used to assess convergence
  - e.g., start overdispersed chains and check the ratio of variance between and within chains (Gelman & Rubin)
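A minimal sketch of that between/within-chain check (a simplified form of the Gelman-Rubin statistic; the full statistic has refinements this omits):

```python
import random

def gelman_rubin(chains):
    """Simplified R-hat: compares between-chain to within-chain variance;
    values near 1.0 suggest the chains have converged."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)     # between-chain
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m                 # within-chain
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5

chains = [[random.gauss(0, 1) for _ in range(1_000)] for _ in range(4)]
print(gelman_rubin(chains))  # ~1.0 for well-mixed chains
```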
Slide 52: Evaluating convergence [figure only]
Slide 53: Collapsed Gibbs sampler
[Figure: mixture model with parameters, assignments z, and observations x; after summing out the parameters, only z and x remain]
- Sum out the parameters
- Sample assignments given data and other assignments
Slide 54: Collapsed Gibbs sampler
[Figure: assignments z and observations x only]
- Sample assignments given data and other assignments
- With K components and a Dirichlet(α/K) prior on the mixture weights, this becomes a Dirichlet process mixture as K → ∞
Slide 55: The magic of MCMC
- Since we only ever need to evaluate the relative probabilities of two states, we can have huge state spaces (much of which we rarely reach)
- In fact, our state spaces can be infinite
  - common with nonparametric Bayesian models
- But the guarantees it provides are asymptotic
  - making algorithms that converge in practical amounts of time is a significant challenge
Slide 56: MCMC and cognitive science
- The main use of MCMC is for probabilistic inference in complex models (for modelers and learners)
- The Metropolis-Hastings algorithm seems like a good metaphor for aspects of development
- A form of cultural evolution can be shown to be equivalent to Gibbs sampling (Griffiths & Kalish, 2005)
- We can also use MCMC algorithms as the basis for experiments with people
  - (see breakout session tomorrow!)
Slide 57: Samples from Subject 3 (projected onto a plane from LDA) [figure only]
Slide 58: Three uses of Monte Carlo methods ("Two" is crossed out on the slide)
- For solving problems of probabilistic inference involved in developing computational models
- As a source of hypotheses about how the mind might solve problems of probabilistic inference
- As a way to explore people's subjective probability distributions