Title: Monte Carlo methods (Tom Griffiths, UC Berkeley)
Slide 3: Two uses of Monte Carlo methods
- For solving problems of probabilistic inference involved in developing computational models
- As a source of hypotheses about how the mind might solve problems of probabilistic inference
Slide 4: Answers and expectations
- For a function f(x) and distribution P(x), the expectation of f with respect to P is
  $E_P[f(x)] = \sum_x f(x)\,P(x)$
  (an integral for continuous x)
- The expectation is the average of f when x is drawn from the probability distribution P
Slide 5: Answers and expectations
- Example 1: the average of spots on a die roll
  - x ∈ {1, ..., 6}, f(x) = x, P(x) is uniform
- Example 2: the probability that two observations belong to the same mixture component
  - x is an assignment of observations to components, f(x) = 1 if the observations belong to the same component and 0 otherwise, P(x) is the posterior over assignments
Slide 6: The Monte Carlo principle
- The expectation of f with respect to P can be approximated by
  $E_P[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)$
  where the $x_i$ are sampled from P(x)
- Example 1: the average of spots on a die roll, approximated by the average of n simulated rolls
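A minimal sketch of this estimator in Python (the function names are illustrative, not from the slides):

```python
import random

def mc_expectation(f, sample, n=100_000):
    """Estimate E_P[f(x)] by averaging f over n samples drawn from P."""
    return sum(f(sample()) for _ in range(n)) / n

# Example 1: average number of spots on a fair die roll (true value 3.5).
print(mc_expectation(lambda x: x, lambda: random.randint(1, 6)))  # ~3.5
```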
Slide 7: The Monte Carlo principle
The law of large numbers: [figure showing the running average number of spots converging as the number of rolls grows; axes: average number of spots vs. number of rolls]
Slide 8: More formally
- $\hat\theta_{MC}$ is consistent: $(\hat\theta_{MC} - \theta) \to 0$ a.s. as $n \to \infty$
- $\hat\theta_{MC}$ is unbiased: $E[\hat\theta_{MC}] = \theta$
- $\hat\theta_{MC}$ is asymptotically normal, with $\sqrt{n}\,(\hat\theta_{MC} - \theta) \to N(0, \mathrm{Var}_P[f(x)])$
Slide 9: When simple Monte Carlo fails
- Efficient algorithms for sampling only exist for a relatively small number of distributions
Slide 10: Inverse cumulative distribution
[Figure: the CDF rises from 0 to 1; draw u ~ Uniform(0, 1) on the vertical axis and return $x = F^{-1}(u)$]
- (requires the CDF to be invertible)
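A minimal sketch of inverse-CDF sampling for a distribution whose CDF inverts in closed form (the exponential, with $F(x) = 1 - e^{-\lambda x}$; an illustrative choice, not from the slides):

```python
import math
import random

def sample_exponential(lam):
    """Draw x ~ Exponential(lam) by inverting the CDF F(x) = 1 - exp(-lam * x)."""
    u = random.random()              # u ~ Uniform(0, 1)
    return -math.log(1.0 - u) / lam  # x = F^{-1}(u)

samples = [sample_exponential(2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # ~0.5, the mean 1/lam
```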
Slide 11: Rejection sampling
[Figure: a target density p(x) under a scaled envelope; proposed points are accepted only if they fall under p(x)]
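A minimal sketch of the accept/reject loop, assuming an envelope with $p(x) \le M\,q(x)$ everywhere; the Beta(2,2) target and uniform proposal are illustrative choices:

```python
import random

def rejection_sample(p, q_sample, q_pdf, M):
    """Sample from density p via proposal q, assuming p(x) <= M * q(x) for all x."""
    while True:
        x = q_sample()                            # propose x ~ q
        if random.random() < p(x) / (M * q_pdf(x)):
            return x                              # accept w.p. p(x) / (M q(x))

# Illustration: target p(x) = 6x(1-x) on [0, 1] (a Beta(2,2) density),
# uniform proposal, M = 1.5 since max p(x) = 1.5.
p = lambda x: 6 * x * (1 - x)
xs = [rejection_sample(p, random.random, lambda x: 1.0, M=1.5) for _ in range(10_000)]
print(sum(xs) / len(xs))  # ~0.5
```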
Slide 12: Rejection sampling from the posterior
- Generate samples of all variables following the generative process in the model
- Reject all samples that do not match the observed data
[Figure: graphical model with nodes X1, X2, X3, X4; samples are kept only if the observed variables match the data]
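A minimal sketch on a hypothetical two-variable model (rain causes wet grass; all probabilities are invented for illustration): simulate the generative process and keep only samples that reproduce the observation.

```python
import random

def generative_process():
    """Toy generative model (hypothetical numbers): rain -> wet grass."""
    rain = random.random() < 0.3
    wet = random.random() < (0.9 if rain else 0.2)
    return rain, wet

# Reject all samples that do not match the observed data (wet = True).
accepted = [rain for rain, wet in (generative_process() for _ in range(100_000)) if wet]
print(sum(accepted) / len(accepted))  # ~0.66 = P(rain | wet)
```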
Slide 13: When simple Monte Carlo fails
- Efficient algorithms for sampling only exist for a relatively small number of distributions
- Sampling from distributions over large discrete state spaces is computationally expensive
  - a mixture model with n observations and k components, or an HMM with n observations and k states, has $k^n$ possibilities
Slide 14: When simple Monte Carlo fails
- Efficient algorithms for sampling only exist for a relatively small number of distributions
- Sampling from distributions over large discrete state spaces is computationally expensive
  - a mixture model with n observations and k components, or an HMM with n observations and k states, has $k^n$ possibilities
- Sometimes we want to sample from distributions for which we only know the probability of each state up to a multiplicative constant
Slide 15: Why Bayesian inference is hard
Evaluating the posterior probability of a hypothesis requires summing over all hypotheses:
$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h' \in H} P(d \mid h')\,P(h')}$
(the statistical-physics analogue is computing the partition function)
Slide 16: Modern Monte Carlo methods
- Sampling schemes for distributions with large state spaces, known up to a multiplicative constant
- Two approaches:
  - importance sampling
  - Markov chain Monte Carlo
- (Major competitors: variational inference and sophisticated numerical quadrature methods)
Slide 17: Importance sampling
- Basic idea: generate samples from the wrong distribution, then assign weights to the samples to correct for this
Slide 18: Importance sampling
$E_P[f(x)] = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)\,\frac{p(x_i)}{q(x_i)}, \qquad x_i \sim q(x)$
This works when sampling from the proposal q is easy but sampling from the target p is hard.
Slide 19: An alternative scheme
$\hat\theta_{IS} = \frac{\sum_i f(x_i)\,w_i}{\sum_i w_i}, \qquad w_i = \frac{p^*(x_i)}{q(x_i)}, \qquad x_i \sim q(x)$
This works when p(x) is known only up to a multiplicative constant, since the constant cancels in the ratio.
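A minimal sketch of the self-normalized scheme (an assumed example: a standard normal target known only up to its normalizing constant, with a wider normal proposal):

```python
import math
import random

def p_unnorm(x):
    """Target known only up to a constant: exp(-x^2/2), an unnormalized N(0, 1)."""
    return math.exp(-0.5 * x * x)

def q_pdf(x, scale=2.0):
    """Proposal density N(0, scale^2): easy to sample from and to evaluate."""
    return math.exp(-0.5 * (x / scale) ** 2) / (scale * math.sqrt(2 * math.pi))

xs = [random.gauss(0.0, 2.0) for _ in range(100_000)]   # x_i ~ q
ws = [p_unnorm(x) / q_pdf(x) for x in xs]               # w_i = p*(x_i) / q(x_i)

# Self-normalized estimate of E_P[x^2]; the unknown constant cancels.
print(sum(w * x * x for w, x in zip(ws, xs)) / sum(ws))  # ~1.0
```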
Slide 20: More formally
- $\hat\theta_{IS}$ is consistent: $(\hat\theta_{IS} - \theta) \to 0$ a.s. as $n \to \infty$
- $\hat\theta_{IS}$ is asymptotically normal, with asymptotic variance $\mathrm{Var}_q\!\big[\tfrac{p(x)}{q(x)}(f(x) - \theta)\big]$
- $\hat\theta_{IS}$ is biased, with bias of order $1/n$
Slide 21: Optimal importance sampling
- The asymptotic variance is $E_q\!\big[\tfrac{p(x)^2}{q(x)^2}(f(x) - \theta)^2\big]$
- This is minimized by $q^*(x) \propto |f(x) - \theta|\,p(x)$
Slide 22: Optimal importance sampling [figure only]
Slide 24: Likelihood weighting
- A particularly simple form of importance sampling for posterior distributions
- Use the prior as the proposal distribution
- Weights: $w(x) = \frac{P(x \mid d)}{P(x)} \propto P(d \mid x)$, the likelihood
Slide 25: Likelihood weighting
- Generate samples of all variables except the observed variables
- Assign weights proportional to the probability of the observed data given the values in the sample
[Figure: graphical model with nodes X1, X2, X3, X4; unobserved variables are sampled from the prior, observed variables contribute likelihood weights]
(contrast with rejection sampling from the posterior: no samples are discarded)
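A minimal sketch on the same hypothetical rain/wet-grass model used for rejection sampling above: sample the unobserved variable from the prior and weight by the likelihood of the observation, so no samples are wasted.

```python
import random

weights, indicators = [], []
for _ in range(100_000):
    rain = random.random() < 0.3              # sample unobserved variable from prior
    weights.append(0.9 if rain else 0.2)      # weight = P(wet = True | rain)
    indicators.append(1.0 if rain else 0.0)

# Weighted (self-normalized) estimate of P(rain | wet = True).
print(sum(w * r for w, r in zip(weights, indicators)) / sum(weights))  # ~0.66
```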
Slide 26: Importance sampling
- A general scheme for sampling from complex distributions that have simpler relatives
- Simple methods for sampling from posterior distributions in some cases (easy to sample from the prior; prior and posterior are close)
- Can be more efficient than simple Monte Carlo
  - particularly for, e.g., tail probabilities
- Also provides a solution to the question of how people can update beliefs as data come in
Slide 27: Particle filtering
[Figure: hidden Markov model with states s1, ..., s4 and observations d1, ..., d4]
We want to generate samples from P(s4 | d1, ..., d4).
We can use likelihood weighting if we can sample from P(s4 | s3) and P(s3 | d1, ..., d3).
Slide 28: Particle filtering
[Figure: particles representing samples from P(s3 | d1, ..., d3) are propagated through P(s4 | s3) and weighted by the likelihood of d4]
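A minimal sketch of this cycle (propagate, weight, resample) for a hypothetical linear-Gaussian state space model; the model and all numbers are invented for illustration.

```python
import math
import random

def particle_filter(data, n_particles=1_000):
    """Bootstrap particle filter for a toy model:
    s_t ~ Normal(0.9 * s_{t-1}, 1), d_t ~ Normal(s_t, 1)."""
    particles = [random.gauss(0.0, 1.0) for _ in range(n_particles)]
    for d in data:
        # Propagate each particle through the transition P(s_t | s_{t-1}).
        particles = [random.gauss(0.9 * s, 1.0) for s in particles]
        # Weight each particle by the likelihood P(d_t | s_t).
        weights = [math.exp(-0.5 * (d - s) ** 2) for s in particles]
        # Resample particles in proportion to their weights.
        particles = random.choices(particles, weights=weights, k=n_particles)
    return particles  # approximate samples from P(s_T | d_1, ..., d_T)

states = particle_filter([0.5, 1.0, 1.5, 2.0])
print(sum(states) / len(states))  # posterior mean estimate for s_4
```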
Slide 29: Tweaks and variations
- If we can enumerate the values of s4, we can sample from $P(s_4 \mid s_3, d_4) \propto P(d_4 \mid s_4)\,P(s_4 \mid s_3)$ directly
- No need to resample at every step, since we can accumulate weights over multiple observations
  - resampling reduces diversity in the samples
  - it is only necessary when the variance of the weights is large
- Stratification and clever resampling schemes reduce variance (Fearnhead, 2001)
Slide 30: The promise of particle filters
- People need to be able to update probability distributions over large hypothesis spaces as more data become available
- Particle filters provide a way to do this with limited computing resources
  - maintain a fixed finite number of samples
- Not just for dynamic models
  - can work with a fixed set of hypotheses, although this requires some further tricks for maintaining diversity
Slide 31: Markov chain Monte Carlo
- Basic idea: construct a Markov chain that will converge to the target distribution, and draw samples from that chain
- Just uses something proportional to the target distribution (good for Bayesian inference!)
- Can work in state spaces of arbitrary (including unbounded) size (good for nonparametric Bayes)
Slide 32: Markov chains
[Figure: a chain of states $x^{(1)} \to x^{(2)} \to \cdots \to x^{(n)}$]
Transition matrix: $T = P(x^{(t+1)} \mid x^{(t)})$
- Each variable $x^{(t+1)}$ is independent of all previous variables given its immediate predecessor $x^{(t)}$
Slide 33: An example: card shuffling
- Each state x(t) is a permutation of a deck of cards (there are 52! permutations)
- The transition matrix T indicates how likely one permutation is to become another
- The transition probabilities are determined by the shuffling procedure:
  - riffle shuffle
  - overhand
  - one card
Slide 34: Convergence of Markov chains
- Why do we shuffle cards?
- Convergence to a uniform distribution takes only 7 riffle shuffles
- Other Markov chains will also converge to a stationary distribution, if certain simple conditions are satisfied (called ergodicity)
  - e.g., every state can be reached in some number of steps from every other state
Slide 35: Markov chain Monte Carlo
[Figure: a chain of states $x^{(1)} \to x^{(2)} \to \cdots \to x^{(n)}$]
Transition matrix: $T = P(x^{(t+1)} \mid x^{(t)})$
- The states of the chain are the variables of interest
- The transition matrix is chosen to give the target distribution as its stationary distribution
Slide 36: Metropolis-Hastings algorithm
- Transitions have two parts:
  - proposal distribution: $Q(x^{(t+1)} \mid x^{(t)})$
  - acceptance: take proposals with probability
    $A(x^{(t)}, x^{(t+1)}) = \min\!\left(1,\ \frac{P(x^{(t+1)})\,Q(x^{(t)} \mid x^{(t+1)})}{P(x^{(t)})\,Q(x^{(t+1)} \mid x^{(t)})}\right)$
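A minimal sketch with a symmetric Gaussian random-walk proposal, so the Q terms cancel and only the ratio of (possibly unnormalized) target probabilities matters:

```python
import math
import random

def metropolis_hastings(log_p, x0=0.0, n=50_000, step=1.0):
    """Random-walk Metropolis: the proposal is symmetric, so Q cancels."""
    x, samples = x0, []
    for _ in range(n):
        proposal = x + random.gauss(0.0, step)       # x' ~ Q(x' | x)
        log_ratio = log_p(proposal) - log_p(x)
        # Accept with probability min(1, P(x') / P(x)), on the log scale.
        if random.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return samples

# Target known up to a constant: log p*(x) = -x^2/2, a standard normal.
samples = metropolis_hastings(lambda x: -0.5 * x * x)
print(sum(samples) / len(samples))  # ~0.0
```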
Slides 37-42: Metropolis-Hastings algorithm (animation)
[Figures: a sequence of frames showing a random walk over a density p(x); one proposal is accepted with probability A(x(t), x(t+1)) = 0.5, another with probability A(x(t), x(t+1)) = 1]
Slide 43: Metropolis-Hastings in a slide [figure only]
Slide 44: Metropolis-Hastings algorithm
- For the right stationary distribution, we want $\sum_x P(x)\,T(x \to y) = P(y)$
- A sufficient condition is detailed balance: $P(x)\,T(x \to y) = P(y)\,T(y \to x)$
Slide 45: Metropolis-Hastings algorithm
$P(x)\,T(x \to y) = P(x)\,Q(y \mid x)\,\min\!\left(1,\ \frac{P(y)\,Q(x \mid y)}{P(x)\,Q(y \mid x)}\right) = \min\big(P(x)\,Q(y \mid x),\ P(y)\,Q(x \mid y)\big)$
This is symmetric in (x, y) and thus satisfies detailed balance.
Slide 46: Gibbs sampling
- A particular choice of proposal distribution
- For variables $x = x_1, x_2, \ldots, x_n$
  - draw $x_i^{(t+1)}$ from $P(x_i \mid \mathbf{x}_{-i})$
  - where $\mathbf{x}_{-i} = x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)}$
- (this is called the full conditional distribution)
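A minimal sketch for a case where the full conditionals are known exactly: a bivariate normal with correlation rho, where each conditional is a one-dimensional Gaussian (a standard textbook example, not from the slides):

```python
import math
import random

def gibbs_bivariate_normal(rho=0.8, n=20_000):
    """Gibbs sampling for (x1, x2) ~ bivariate normal with zero means, unit
    variances, correlation rho. Full conditionals: Normal(rho * other, 1 - rho^2)."""
    x1, x2, samples = 0.0, 0.0, []
    sd = math.sqrt(1 - rho ** 2)
    for _ in range(n):
        x1 = random.gauss(rho * x2, sd)  # draw x1 from P(x1 | x2)
        x2 = random.gauss(rho * x1, sd)  # draw x2 from P(x2 | x1)
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal()
print(sum(a * b for a, b in samples) / len(samples))  # ~0.8, the correlation
```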
Slide 47: In a graphical model
[Figure: four copies of a graphical model with nodes X1, X2, X3, X4, highlighting each variable in turn]
Sample each variable conditioned on its Markov blanket.
Slide 48: Gibbs sampling
[Figure: a Gibbs sampling trajectory over two variables X1 and X2 (MacKay, 2002)]
Slide 49: Gibbs sampling in mixture models
[Figure: mixture model with parameters, component assignments z, and observations x]
- Sample assignments to components given data and parameters
- Sample parameters given data and assignments to components
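A minimal sketch of this alternation for a two-component Gaussian mixture (a simplified, illustrative model: unit component variances, equal weights, and a conjugate normal prior on the means):

```python
import math
import random

def gibbs_mixture(data, n_iter=500):
    """Gibbs sampler for a two-component Gaussian mixture (illustrative):
    unit variances, equal weights, Normal(0, 10^2) prior on each mean."""
    means = [-1.0, 1.0]
    z = [0] * len(data)
    for _ in range(n_iter):
        # Sample assignments to components given data and parameters.
        for i, x in enumerate(data):
            w = [math.exp(-0.5 * (x - m) ** 2) for m in means]
            z[i] = random.choices([0, 1], weights=w)[0]
        # Sample parameters given data and assignments to components.
        for k in range(2):
            xs = [x for x, zi in zip(data, z) if zi == k]
            prec = 1 / 100 + len(xs)   # posterior precision (conjugate update)
            means[k] = random.gauss(sum(xs) / prec, 1 / math.sqrt(prec))
    return means

data = [random.gauss(-2, 1) for _ in range(50)] + [random.gauss(2, 1) for _ in range(50)]
print(sorted(gibbs_mixture(data)))  # means near -2 and 2
```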
Slide 50: MCMC vs. EM
EM converges to a single solution
MCMC converges to a distribution of solutions
Slide 51: Evaluating convergence
- The basic formal result justifying MCMC:
  - expectations over sequences of variables converge to expectations over the stationary distribution
- Under this result, just run MCMC as long as possible to get as close as possible to the truth
- In practice, a variety of heuristics are used to assess convergence
  - e.g., start overdispersed chains and check the ratio of variance between and within chains (Gelman & Rubin)
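A minimal sketch of that between/within-chain check (a simplified form of the Gelman-Rubin statistic; the full statistic has refinements this omits):

```python
import random

def gelman_rubin(chains):
    """Simplified R-hat: compares between-chain to within-chain variance;
    values near 1.0 suggest the chains have converged."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)     # between-chain
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m                 # within-chain
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5

chains = [[random.gauss(0, 1) for _ in range(1_000)] for _ in range(4)]
print(gelman_rubin(chains))  # ~1.0 for well-mixed chains
```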
Slide 52: Evaluating convergence [figure only]
Slide 53: Collapsed Gibbs sampler
[Figure: mixture model with parameters, assignments z, and observations x; after summing out the parameters, only z and x remain]
- Sum out the parameters
- Sample assignments given data and other assignments
Slide 54: Collapsed Gibbs sampler
[Figure: assignments z and observations x only]
- Sample assignments given data and other assignments
- With K components and a Dirichlet(α/K) prior on the mixture weights, this becomes a Dirichlet process mixture as K → ∞
Slide 55: The magic of MCMC
- Since we only ever need to evaluate the relative probabilities of two states, we can have huge state spaces (much of which we rarely reach)
- In fact, our state spaces can be infinite
  - common with nonparametric Bayesian models
- But the guarantees it provides are asymptotic
  - making algorithms that converge in practical amounts of time is a significant challenge
Slide 56: MCMC and cognitive science
- The main use of MCMC is for probabilistic inference in complex models (for modelers and learners)
- The Metropolis-Hastings algorithm seems like a good metaphor for aspects of development
- A form of cultural evolution can be shown to be equivalent to Gibbs sampling (Griffiths & Kalish, 2005)
- We can also use MCMC algorithms as the basis for experiments with people
  - (see breakout session tomorrow!)
Slide 57: Samples from Subject 3 (projected onto a plane from LDA) [figure only]
Slide 58: Three uses of Monte Carlo methods ("Two" is crossed out on the slide)
- For solving problems of probabilistic inference involved in developing computational models
- As a source of hypotheses about how the mind might solve problems of probabilistic inference
- As a way to explore people's subjective probability distributions