Title: Bayes or Bust?
1. Bayes or Bust?
Keith Arnaud
"Thou shalt not answer questionnaires
Or quizzes upon World-Affairs,
Nor with compliance
Take any test. Thou shalt not sit
With statisticians nor commit
A social science."
-- W.H. Auden
(with thanks to Eric Feigelson and Bill Press)
2"As an undergraduate, I always found the subject
of statistics to be rather mysterious. I was
already familiar with the binomial, Poisson and
normal distributions. Most of this made sense,
but only seemed to relate to things like rolling
dice, flipping coins, shuffling cards and so on.
However, having aspirations of becoming a
scientist, what I really wanted to know was how
to analyse experimental data. Thus, I eagerly
looked forward to the lectures on statistics.
Sadly, they were a great disappointment.
Although many of the tests and procedures
expounded were intuitively reasonable, there was
something deeply unsatisfactory about the whole
affair there didn't seem to be any underlying
principles ! Hence, the course on 'probability
and statistics' had led to an unfortunate
dichotomy probability made sense, but was just a
game statistics was important, but it was a
bewildering collection of tests with little
obvious rhyme or reason."
- D.S. Sivia
(Data
Analysis A Bayesian Tutorial)
3. Most astronomers and physicists don't understand statistics
Suppose you derive a 90% confidence region on some parameter of interest. What does this mean? Does it tell you that there is a probability of 0.9 that the true value of the parameter lies in the range calculated?
4. Deductive logic ("Probability"): Cause -> Effects
Inference ("Statistics"): Effects -> Possible causes
5. James Bernoulli (1713) asked whether the rules of probability (then being worked out to calculate odds in gambling) could also be used in the problem of inference or inductive logic.
Rev. Thomas Bayes answered this in the affirmative in an essay published posthumously in 1763.
Pierre-Simon Laplace reformulated Bayes' theorem in 1812. He expressed it with far greater clarity than Bayes and used it to solve problems in celestial mechanics and medical statistics. Laplace combined the available astronomical data to provide an estimate (and uncertainty) of the mass of Saturn. He stated "...it is a bet of 11,000 to 1 that the error in this result is not 1/100th of its value". (The modern estimate differs from Laplace's by 0.63%.)
[Slides 6-7: figures; no transcript available]
8. Bayes and Laplace thought of probability as measuring a degree of belief in something. They used the language of odds and betting in their arguments.
However, with the push in the 19th and early 20th centuries towards more rigorous arguments in mathematics, this view was considered too subjective, and a new definition of probability was developed. Probability was defined solely in terms of the long-term frequency of events. Thus to say that the probability of rolling a 4 with a die is 1/6 is to mean that repeated rolling of the die will give a 4 one sixth of the time.
This definition works fine for rolling dice and tossing coins, but where does it leave Laplace's measurement of the mass of Saturn? Laplace derived a probability distribution for the mass of Saturn. But there is only one planet - it makes no sense to talk about its mass in terms of frequencies. One could perhaps consider an ensemble of Universes where the only thing that changes is the mass of Saturn, but this seems very contrived.
9. The solution arrived at was to relate the mass to the data through some function. Since the data have random noise, this function can be considered a random variable, so we can talk about its probability distribution. The function is called a statistic. This definition of statistics is often referred to as frequentist or orthodox.
A frequentist statistic is best thought of as a predictor of future events. If you talk to a good orthodox statistician they will refer to what they do as prediction, not as inference. The central problem of orthodox statistics is how to choose the function of the data so as to provide a good predictor. This was the work of Fisher, Neyman, Pearson, ... who gave us the statistics we know and love.
10. We can now return to the definition of a 90% confidence region. It does not express the probability of the true value of the parameter being within the confidence region, because orthodox statistics does not allow us to talk about the probability of a parameter. The parameter has a single true value (which we don't know) and is not a random variable. Instead, the confidence region is a predictor of future events. It tells us that if we make many more identical observations then 90% of the time the confidence region will contain the true value of the parameter.
Hypothesis testing must be thought of in a similar fashion. For instance, we are not testing whether the true value of some parameter is non-zero but are instead making a statement about future observations.
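To make the coverage interpretation concrete, here is a minimal simulation sketch (not from the original slides; the true mean, known sigma and sample size are illustrative assumptions): repeated identical experiments, each producing a 90% confidence interval, contain the true value about 90% of the time.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mu, sigma, n = 5.0, 2.0, 25    # illustrative values (assumed)
z = 1.645                           # two-sided 90% normal quantile

trials, covered = 100_000, 0
for _ in range(trials):
    sample = rng.normal(true_mu, sigma, n)
    half_width = z * sigma / np.sqrt(n)      # known-sigma interval
    if abs(sample.mean() - true_mu) <= half_width:
        covered += 1

print(f"Empirical coverage: {covered / trials:.3f}")   # ~ 0.90
```

No single interval in this loop carries a probability of containing the true value; only the long-run fraction over repeated experiments is 0.90, which is exactly the frequentist reading above.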
11. In addition to the problems of interpretation, orthodox statistics does present practical difficulties. As we all know from personal experience, there is a bewildering array of different statistics to choose from.
These problems have led to a revival in recent years of the Bayesian and Laplacian idea of probability. Richard Cox showed in the 1940s that if we assign numbers to express our relative belief in various propositions, then logical consistency can only be achieved if these numbers map to a set of real positive numbers that obey the usual rules of probability theory. Philosophers of science continue to argue about the details of Bayesian inference (see e.g. Bayes or Bust? by John Earman).
Practical Bayesian inference was pioneered by scientists such as Harold Jeffreys and Ed Jaynes but has now established bridgeheads inside statistics departments. The Bayesians are still in the minority but are gaining in numbers and influence.
Interestingly, recent work in cognitive science indicates that our minds may well operate by Bayesian inference (see e.g. How the Mind Works by Steven Pinker), a possibility advanced by Jaynes over 40 years ago.
12. Bayes' theorem:
p(H|D,I) = p(D|H,I) x p(H|I) / p(D|I)
where p(H|I) is the prior, p(D|H,I) is the likelihood, and p(H|D,I) is the posterior.
The posterior probability distribution contains all the information available after the data have been acquired. As such it is the starting point for any further inference.
13. Is This Coin Fair?
Suppose we have a coin and we wish to find out whether or not it is fair. Let H be a number between 0 and 1 such that if the coin has H=0 then it will always produce Tails when tossed, and if H=1 then Heads. We now toss the coin N times and produce n Heads. By Bayes' theorem the posterior probability distribution for H is proportional to
p(H|I) x H^n x (1-H)^(N-n)
where p(H|I) is the prior probability distribution for H.
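As an illustrative sketch (not from the original slides), this posterior can simply be evaluated on a grid of H values; with a uniform prior the posterior mean reproduces Laplace's rule of succession, (n+1)/(N+2). The toss counts below are example values.

```python
import numpy as np

H = np.linspace(0.0, 1.0, 501)       # grid of possible bias values
dH = H[1] - H[0]

def posterior(n, N, prior=None):
    """Normalized posterior for the coin bias after n Heads in N tosses."""
    prior = np.ones_like(H) if prior is None else prior   # uniform by default
    post = prior * H**n * (1.0 - H)**(N - n)
    return post / (post.sum() * dH)                       # normalize on the grid

post = posterior(n=7, N=10)          # example data (assumed)
mean = (H * post).sum() * dH
print(f"Posterior mean of H: {mean:.3f}")   # ~ (7+1)/(10+2) = 0.667
```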
14. [Figure: plot of the posterior for H, axis running from Tails (H=0) to Heads (H=1); no other text recovered]
[Slides 15-41: figures; no transcript available]
42. X- or gamma-ray spectroscopy
Suppose we have a model with M parameters {mj}. Then we are interested in the probability distribution
p({mj} | {Si}, {Bi}, I)
where {Si} and {Bi} are the source and background observations (over N channels, i = 1, 2, ..., N) and I is the prior information. However, we also have an unknown background, which we can model as a set of N parameters {bi}. Our probability distribution is thus the marginalization over these N nuisance parameters, so is the N-dimensional integral over
p({mj}, {bi} | {Si}, {Bi}, I)
Using Bayes' theorem and the independence of the model and background parameters, we can rewrite the integrand as proportional to
43. p({Si} | {mj}, {bi}, I) x p({Bi} | {bi}, I) x p({mj} | I) x p({bi} | I)
If the observed counts in each channel in the source and background are independent and distributed as Poisson variables, then the first two terms in this integrand can be expressed as the product of Poisson probabilities.
Assuming that we have no prior information on the background, we can represent it as a uniform distribution between 0 and some maximum. After considerable algebraic manipulation (originally due to Tom Loredo) we can show that the posterior probability distribution for the model parameters is proportional to
p({mj} | I) x f({mj}, {Si}, {Bi})
where f is a complicated (but analytic) function of the model parameters and the observables.
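As a sketch of the two Poisson likelihood terms above (the exposure times t_s and t_b, the per-channel rates and the toy counts are assumptions of this illustration, not values from the slides):

```python
import numpy as np
from scipy.stats import poisson

def log_likelihood(model_rate, b, S, B, t_s=1.0, t_b=1.0):
    """log[ p({Si}|{mj},{bi},I) x p({Bi}|{bi},I) ] for independent Poisson
    channels: the source region sees model + background counts, while the
    background region sees background counts only."""
    return (poisson.logpmf(S, t_s * (model_rate + b)).sum()
            + poisson.logpmf(B, t_b * b).sum())

# Toy example: 5 channels with made-up rates and observed counts
rate = np.array([4.0, 3.0, 2.0, 1.0, 0.5])   # predicted source rates from {mj}
b    = np.array([1.0, 1.0, 1.0, 1.0, 1.0])   # background rates {bi} (nuisance)
S    = np.array([6, 4, 3, 2, 1])             # observed source counts
B    = np.array([1, 2, 0, 1, 1])             # observed background counts
print(log_likelihood(rate, b, S, B))
```

Marginalizing this over the N background parameters {bi} is what yields the analytic function f referred to above.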
44. Computational Issues
The end result of a Bayesian analysis will probably be a multi-dimensional probability density function. We will need to characterize its shape and/or integrate (marginalize) over some dimensions. Both these processes are very computationally intensive.
The major practical issue in Bayesian inference is how to perform the multi-dimensional integrals required. There are a number of packages available and they do work well for a limited class of problems. However, a more general approach is necessary.
The most promising method at present seems to be Markov Chain Monte Carlo. A Markov Chain is a sequence of random variables X0, X1, X2, ... such that each state X(t+1) is sampled from a distribution p(X(t+1) | Xt) which depends only on the current state Xt. The fundamental theorem of Markov Chains shows that for large enough t the Xt are drawn from a stationary distribution which is independent of t and of the starting point of the chain.
45. The MCMC method
The MCMC method then consists of setting up a Markov Chain such that the stationary distribution is the posterior probability distribution of interest. The Markov Chain values then provide a sampling of the probability distribution which we can use for integration or characterization.
Constructing such Markov Chains turns out to be remarkably simple. The method was first developed in 1953 by Metropolis et al. (in the context of statistical physics) and generalized in 1970 by Hastings.
Suppose that our posterior distribution is p(X). We are at Xt in the chain. We sample a candidate point Y from a proposal distribution q(Y | Xt). We now accept Y with probability
min(1, [p(Y) q(Xt | Y)] / [p(Xt) q(Y | Xt)])
If the candidate is accepted we set X(t+1) = Y; otherwise X(t+1) = Xt.
Remarkably, q can be any distribution and the stationary distribution of the chain will still be p. However, it should be chosen so that the chain converges quickly (a short burn-in) and mixes well, i.e. it samples all parts of the distribution p. There are a number of canonical choices for q and this is an active area of research in the statistical community.
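A minimal random-walk Metropolis sketch of the recipe above (the standard-normal target, step size and chain length are arbitrary illustrative choices); with a symmetric Gaussian proposal the q terms in the acceptance ratio cancel:

```python
import numpy as np

def metropolis(log_p, x0, n_steps, step=0.5, seed=0):
    """Random-walk Metropolis: Gaussian proposal q(Y|X) = N(X, step^2).
    q is symmetric, so the acceptance ratio reduces to p(Y)/p(Xt)."""
    rng = np.random.default_rng(seed)
    x, chain = x0, np.empty(n_steps)
    for t in range(n_steps):
        y = x + step * rng.normal()                  # candidate Y ~ q(.|Xt)
        if np.log(rng.random()) < log_p(y) - log_p(x):
            x = y                                    # accept: X(t+1) = Y
        chain[t] = x                                 # reject: X(t+1) = Xt
    return chain

# Example: sample a standard normal via its log density
chain = metropolis(lambda x: -0.5 * x**2, x0=3.0, n_steps=20_000)
samples = chain[2_000:]                              # discard burn-in
print(samples.mean(), samples.std())                 # ~ 0.0 and ~ 1.0
```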
46. Why not to use the F-test
A common problem in high energy astrophysics is to determine whether an emission (or absorption) line is present in a spectrum, or whether a source is present in the field of view. A standard (frequentist) technique has been to use the F-test or, more generally, the likelihood ratio test (LRT).
In a recent preprint, Protassov et al. point out that this is incorrect. The LRT is not valid if the test hypothesis is at the boundary of the parameter space, e.g. if we are testing for a non-zero value of the normalization of a spectral line.
These authors go on to argue that the best approach to this problem is to derive the posterior distribution for the line normalization and focus on estimating its strength, rather than making a statement on whether or not it exists. However, if a test is required, one valid method is to use MCMC to generate N samples from the posterior distribution under the null hypothesis (i.e. no line). For each sample, simulate data and compute an LRT statistic. The fraction of these fake datasets with a value of the statistic greater than that observed for the actual data gives the false positive probability.
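A schematic of that calibration procedure (the simulate and lrt_stat callables are hypothetical placeholders for a problem-specific spectral model and fitting step, not anything defined in the slides):

```python
import numpy as np

def false_positive_probability(null_posterior_samples, simulate, lrt_stat,
                               observed_stat, seed=0):
    """For each posterior draw under the null (no line), simulate a fake
    dataset, compute its LRT statistic, and report the fraction of fakes
    whose statistic exceeds the one observed in the real data."""
    rng = np.random.default_rng(seed)
    fake_stats = np.array([lrt_stat(simulate(theta, rng))
                           for theta in null_posterior_samples])
    return np.mean(fake_stats >= observed_stat)
```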
47. [Figure from Protassov et al.; no transcript available]
48. Conclusions
Most astronomers are implicit Bayesians. However,
most statistical inference in the literature is
frequentist. This is problematic, especially when
the frequentist methods are used outside their
range of validity.
Bayesian methods are conceptually much simpler
and provide the sort of information that
astronomers expect.
Their drawbacks are practical. There is the
problem of assigning prior probability
distributions. They also typically require more
computation, particularly when marginalizing over
nuisance parameters.
Markov Chain Monte Carlo looks to be the most
promising numerical technique.
To get Bayesian methods widely used by
astronomers will require the availability of
fast, reliable, user-friendly software.
49. References
Sivia, D.S., 1996, "Data Analysis: A Bayesian Tutorial", Clarendon Press
O'Hagan, A., 1994, "Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference", Wiley
Gelman, A., et al., 1995, "Bayesian Data Analysis", Chapman & Hall
Gilks, W.R., et al., 1996, "Markov Chain Monte Carlo in Practice", Chapman & Hall
Protassov, R., et al., 2002, astro-ph/0201547
http://astrosun.tn.cornell.edu/staff/loredo/bayes