Review of Bayesian Analysis and Modeling
James B. Elsner
Department of Geography, Florida State University
Bayesian analysis and modeling using MCMC is good for you because:
- It has a solid decision-theoretic framework. Output from a probability model can be linked naturally to a utility model.
- It is intuitive. It combines the prior distribution (prior beliefs and/or experience) with the likelihood (experiment) to obtain the posterior distribution (accumulated information). In this context, knowledge is equivalent to a distribution, and the precision of that knowledge can be quantified by the precision parameter.
- It is exact. Asymptotic approximations are not used, and the plug-in principle is avoided.
- Computations are tractable for practically all models.
- Focus shifts from model estimation to model correctness.
http://garnet.fsu.edu/jelsner/www/research.html
Anticipating Florida Hurricanes, November 2004.
Frequentist vs. Bayesian

- Definition of probability
  - Frequentist: Long-run expected frequency in repeated (actual or hypothetical) experiments.
  - Bayesian: Relative degree of belief in the state of the world.
- Point estimate
  - Frequentist: Maximum likelihood estimate.
  - Bayesian: Mean, mode, or median of the posterior probability distribution.
- Confidence intervals of parameters
  - Frequentist: Based on the likelihood ratio test, which is based on the expected probability distribution of the maximum likelihood over many experiments.
  - Bayesian: Credible intervals based on the posterior probability distribution.
- Confidence intervals of non-parameters
  - Frequentist: Based on the likelihood profile, or by resampling from the sampling distribution of the parameter.
  - Bayesian: Calculated directly from the distribution of parameters.
- Model selection
  - Frequentist: Discard terms that are not significantly different from a null model (nested) at a previously set confidence level.
  - Bayesian: Retain terms in models on the argument that processes are not absent simply because they are not statistically significant.
- Difficulties
  - Frequentist: Confidence intervals are confusing (a range that will contain the true value in a proportion α of repeated experiments); rejection of model terms for non-significance.
  - Bayesian: Subjectivity; the need to specify priors.
Why use Bayesian analysis and models?
- When you have good (informative) prior information that you want to use (from previous experiments, or reliable information from the literature). Frequentist meta-analysis can also be used in this case, but the Bayesian approach is natural.
- When you want to specify the distribution of non-parameters (future values), Bayesian models work well. In particular, if you are going to make a practical decision based on your model (using decision theory), for example picking a management regime that minimizes expected risks or maximizes expected returns, having the posterior distribution of all the parameters makes the decision calculations easier.
- Complicated models, such as hierarchical models or models with missing or unobserved data, can be fit particularly conveniently using Bayesian methods. Frequentists call these random-effects or mixed models and have their own estimation procedures.
How Gibbs Sampling Works

The contours in the plot represent the joint distribution of θ = (θ1, θ2), and the labels (0), (1), etc., denote the simulated values. Note that one iteration of the algorithm is complete only after both components are revised. Also notice that each component is revised along the direction of a coordinate axis. This feature can be a source of problems if the two components are correlated, because then the contours are compressed and moves along the coordinate axes tend to be small. (A code sketch follows the figure.)
[Figure: Gibbs sampling algorithm in two dimensions, starting from an initial point (0) and completing three iterations (1)-(3); the axes are θ1 and θ2.]
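For concreteness, here is a minimal sketch (mine, not from the slides) of a two-dimensional Gibbs sampler for a standard bivariate normal with correlation rho, whose full conditionals are themselves normal. With rho near 1 the compressed-contour problem described above shows up as slow mixing.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=1000, seed=42):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is normal: theta1 | theta2 ~ N(rho*theta2, 1 - rho^2),
    and symmetrically for theta2 | theta1.
    """
    rng = np.random.default_rng(seed)
    theta1, theta2 = -3.0, 3.0          # over-dispersed starting point (0)
    sd = np.sqrt(1.0 - rho**2)          # conditional standard deviation
    samples = np.empty((n_iter, 2))
    for i in range(n_iter):
        # One iteration is complete only after BOTH components are revised,
        # each along its own coordinate axis.
        theta1 = rng.normal(rho * theta2, sd)
        theta2 = rng.normal(rho * theta1, sd)
        samples[i] = theta1, theta2
    return samples

# High correlation compresses the contours, so axis-aligned moves are small
# and the chain mixes slowly -- exactly the problem noted above.
draws = gibbs_bivariate_normal(rho=0.95, n_iter=5000)
print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])
```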
- After the model has converged, samples from the conditional distributions are used to summarize the posterior distribution of the parameters of interest.
- Convergence refers to the idea that eventually the Gibbs samples will reach the stationary distribution.
- Two questions arise:
  1) How many samples are needed before convergence? The samples discarded before convergence are called the burn-in.
  2) After convergence, how many samples are needed to summarize the posterior distribution?
- Answers vary from model to model, and diagnostics are used to examine the samples for convergence. (A simple burn-in sketch follows this list.)
- Possible convergence problems:
  - The assumed model may not be realistic from a substantive point of view, or it may not fit the data.
  - Errors in calculation or programming. Often simple syntax mistakes are responsible; however, it is also possible that the algorithm does not converge to a proper distribution.
  - Slow convergence: this is the problem we are most likely to run into. The simulation can remain for many iterations in a region heavily influenced by the starting distribution. If these iterations are used to summarize the target distribution, they can yield falsely precise estimates.
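A minimal illustration of the burn-in idea (my sketch; the burn-in length is assumed for illustration, not prescribed):

```python
import numpy as np

# `draws` stands in for the (n_iter, n_params) output of a sampler such as
# the Gibbs sketch above; a synthetic array keeps this snippet self-contained.
rng = np.random.default_rng(0)
draws = rng.normal(size=(5000, 2))

burn_in = 1000                       # illustrative length, not a rule
kept = draws[burn_in:]               # discard the pre-convergence samples

# Summaries of the posterior from the retained samples.
post_mean = kept.mean(axis=0)
post_median = np.median(kept, axis=0)
ci_95 = np.percentile(kept, [2.5, 97.5], axis=0)   # central 95% credible interval
print(post_mean, post_median, ci_95)
```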
Diagnostics for examining convergence:
- One intuitive and easily implemented diagnostic tool is a trace plot (or history plot), which plots the parameter value as a function of the sample number. If the model has converged, the trace plot will move up and down around the mode of the distribution. A clear sign of non-convergence occurs when we observe trending in the trace plot. In WinBUGS, you can set up trace plots to monitor parameters while the program runs. (A plotting sketch follows the figure.)
[Figure: Trace plots illustrating convergence (a stationary band) and non-convergence (trending).]
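A trace plot is easy to produce outside WinBUGS as well; a minimal matplotlib sketch, assuming a 1-D array of monitored samples, might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# `chain`: 1-D array of sampled values for one monitored parameter
# (a synthetic chain is generated here so the snippet runs on its own).
rng = np.random.default_rng(0)
chain = rng.normal(size=5000)

plt.plot(chain, linewidth=0.5)
plt.xlabel("sample number")
plt.ylabel("parameter value")
plt.title("Trace plot")
plt.show()
# A converged chain moves up and down around a stable level;
# a trend indicates the sampler is still drifting from its start.
```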
An autocorrelation plot will reveal whether there is correlation between successive samples. The presence of strong correlation indicates that the samples are not moving effectively through the entire posterior distribution. WinBUGS plots the autocorrelation out to 50 lags. (A sketch follows the figure.)
[Figure: Autocorrelation plots; one chain with no large autocorrelation and one with substantial autocorrelation.]
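A minimal sketch of the same diagnostic outside WinBUGS (the `autocorr` helper is my own, computing the sample autocorrelation out to 50 lags, mirroring what WinBUGS displays):

```python
import numpy as np
import matplotlib.pyplot as plt

def autocorr(chain, max_lag=50):
    """Sample autocorrelation of a 1-D chain out to max_lag lags."""
    x = chain - chain.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    return acf[:max_lag + 1] / acf[0]

# Synthetic stand-in for a monitored chain.
rng = np.random.default_rng(0)
chain = rng.normal(size=5000)

rho = autocorr(chain)
plt.bar(range(len(rho)), rho, width=0.8)
plt.xlabel("lag")
plt.ylabel("autocorrelation")
plt.show()
```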
- If the model has converged, additional samples from a parameter's posterior distribution should not influence the calculation of the mean. Running means of the samples will reveal whether the posterior mean has settled to a particular value.
- A lumpy posterior may indicate non-convergence. It may be necessary to let the sampler generate additional samples.

Geweke time-series diagnostic: if a model has converged, then the series of samples from the first half of the chain will have the same time-series properties as the series of samples from the second half of the chain. (A running-mean sketch follows below.)
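A minimal sketch of the running-mean check, with a crude half-versus-half comparison in the spirit of the Geweke diagnostic (the formal diagnostic compares means scaled by spectral-density estimates of the variance, which are omitted here):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for a monitored chain.
rng = np.random.default_rng(0)
chain = rng.normal(size=5000)

# Running mean: should settle to a stable value if the sampler has converged.
running_mean = np.cumsum(chain) / np.arange(1, len(chain) + 1)
plt.plot(running_mean)
plt.xlabel("sample number")
plt.ylabel("running mean")
plt.show()

# Crude check in the spirit of the Geweke diagnostic: if the chain has
# converged, the two halves should have similar summaries.
half = len(chain) // 2
print(chain[:half].mean(), chain[half:].mean())
```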
Gelman-Rubin Diagnostic

Another way to identify convergence is to simulate multiple sequences (chains) from different starting points. After convergence, the behavior of the chains should be the same; that is, the variance within the chains should be the same as the variance across the chains.
[Figure: Multiple chains overlaid, illustrating convergence (chains mixing) and non-convergence (chains remaining separated).]
- If more than one chain is initialized, the Gelman-Rubin statistic is reported in WinBUGS.
- The statistic is based on the following procedure:
  1) Estimate your model with a variety of different initial values, and iterate for an n-iteration burn-in and an n-iteration monitored period.
  2) Take the n monitored draws of the m parameters and calculate the statistics below (a standard construction of these quantities is sketched after this list).
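In the standard construction (given here as an assumed reconstruction, since the slide presents the formulas as an image), with m chains of n monitored draws each, examined one parameter at a time:

```latex
% Standard Gelman-Rubin quantities (assumed reconstruction):
% theta_{ij} is draw i from chain j.
\begin{align*}
W &= \frac{1}{m}\sum_{j=1}^{m} s_j^2,
  \qquad s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(\theta_{ij}-\bar{\theta}_{\cdot j}\bigr)^2
  && \text{(within-chain variance)}\\
B &= \frac{n}{m-1}\sum_{j=1}^{m}\bigl(\bar{\theta}_{\cdot j}-\bar{\theta}_{\cdot\cdot}\bigr)^2
  && \text{(between-chain variance)}\\
\widehat{V}(\theta) &= \frac{n-1}{n}\,W + \frac{1}{n}\,B,
  \qquad \widehat{R} = \sqrt{\frac{\widehat{V}(\theta)}{W}}
\end{align*}
```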
Before convergence, W underestimates the total posterior variance in θ because the chains have not fully explored the target distribution. V(θ), on the other hand, overestimates the variance in θ because the starting points are over-dispersed relative to the target.
Once convergence is reached, W and V(θ) should be almost equivalent, because the variation within the chains and the variation between the chains coincide, so R should approximately equal one.

In the WinBUGS display, the normalized width of the central 80% interval of the pooled runs is shown in green, the normalized average width of the 80% intervals within the individual runs is shown in blue, and R is shown in red. R would generally be expected to be greater than 1 if the starting values are suitably over-dispersed. Brooks and Gelman (1998) emphasize that one should be concerned both with convergence of R to 1 and with convergence of both the pooled and within-chain interval widths to stability.
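As an illustration (my own sketch, not WinBUGS code), R can be computed directly from a set of chains using the definitions above:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R statistic for one parameter.

    `chains` is an (m, n) array: m chains, n monitored draws each,
    matching the W, B, and V(theta) definitions above.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    B = n * chain_means.var(ddof=1)                # between-chain variance
    V = (n - 1) / n * W + B / n                    # pooled variance estimate
    return np.sqrt(V / W)

# Example: four synthetic chains; well-mixed chains give R close to 1.
rng = np.random.default_rng(0)
chains = rng.normal(0.0, 1.0, size=(4, 2000))
print(gelman_rubin(chains))
```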
Speeding up the sampling

Convergence sometimes requires generating tens or hundreds of thousands of samples. If this is the case, it is important to speed up the sampling.
- Standardize your input variables: subtract the mean and divide by the standard deviation. This decreases the correlation between model variables, which can decrease the autocorrelation between samples.
- Use the WinBUGS over-relaxation algorithm. This generates multiple samples at each iteration and then selects one that is negatively correlated with the current value. The time per iteration will increase, but the within-chain correlations should be reduced, and hence fewer iterations may be necessary. However, this method is not always effective and should be used with caution. The autocorrelation function should be used to check whether the mixing of the chain improves.
- Pick good initial values. If your initial values are near their posterior modes, then convergence should occur relatively quickly, especially if there is not a big problem with autocorrelation.
- Just wait. With models involving many parameters, it may take hundreds of thousands of samples. Using the thin option, you do not need to save all the sample values for all the parameters. (A standardization and thinning sketch follows this list.)
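A minimal sketch of the first and last suggestions, standardizing inputs and thinning a chain (the array names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Standardize input variables: subtract the mean, divide by the standard
# deviation, column by column. `X` is a made-up (n_obs, n_vars) array.
X = rng.normal(loc=10.0, scale=5.0, size=(100, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Thinning: keep every k-th monitored sample so that not every draw
# has to be stored. `chain` is a synthetic stand-in for a long chain.
chain = rng.normal(size=100_000)
k = 10
thinned = chain[::k]

print(X_std.mean(axis=0).round(3))   # ~0 in each column
print(X_std.std(axis=0).round(3))    # ~1 in each column
print(thinned.shape)                 # (10000,)
```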