Title: Statistical inference for astrophysics
1. Statistical inference for astrophysics
- A short course for astronomers
- Cardiff University, 16-17 Feb 2005
Graham Woan, University of Glasgow
2. Lecture Plan
- Why statistical astronomy?
- What is probability?
- Estimating parameters values
- the Bayesian way
- the Frequentist way
- Testing hypotheses
- the Bayesian way
- the Frequentist way
- Assigning Bayesian probabilities
- Monte-Carlo methods
Lectures 1-2
Lectures 3-4
3. Why statistical astronomy?
Generally, statistics has got a bad reputation:
"There are three kinds of lies: lies, damned lies and statistics."
Benjamin Disraeli, as popularised by Mark Twain
Often for good reason:
"Jun 3rd 2004: two researchers at the University of Girona in Spain have found that 38% of a sample of papers in Nature contained one or more statistical errors."
The Economist
4. Why statistical astronomy?
Data analysis methods are often regarded as simple recipes:
http://www.numerical-recipes.com/
5. Why statistical astronomy?
Data analysis methods are often regarded as simple recipes... but in analysis, as in life, sometimes the recipes don't work as you expect:
- Low number counts
- Distant sources
- Correlated residuals
- Incorrect assumptions
- Systematic errors
...and/or misleading results.
6. Why statistical astronomy?
...and the tools can be clunky:
"The trouble is that what we statisticians call modern statistics was developed under strong pressure on the part of biologists. As a result, there is practically nothing done by us which is directly applicable to problems of astronomy."
Jerzy Neyman, founder of frequentist hypothesis testing
7. Why statistical astronomy?
For example, we can observe only the one Universe.
(Figure from Bennett et al. 2003.)
8. The Astrophysicist's Shopping List
We want tools capable of:
- dealing with very faint sources
- handling very large data sets
- correcting for selection effects
- diagnosing systematic errors
- avoiding unnecessary assumptions
- estimating parameters and testing models
9. Why statistical astronomy?
Key question:
How do we infer properties of the Universe from
incomplete and imprecise astronomical data?
Our goal:
To make the best inference, based on our observed
data and any prior knowledge, reserving the right
to revise our position if new information comes
to light.
Let's come to this problem afresh with an astrophysicist's eye, bypassing some of the jargon of orthodox statistics and going right back to plain probability.
10. Right-thinking gentlemen 1
"A decision was wise, even though it led to disastrous consequences, if the evidence at hand indicated it was the best one to make; and a decision was foolish, even though it led to the happiest possible consequences, if it was unreasonable to expect those consequences... We should do the best with what we have, not what we wished we had."
Herodotus, c. 500 BC
11. Right-thinking gentlemen 2
"Probability theory is nothing but common sense reduced to calculation."
Pierre-Simon Laplace (1749-1827)
12. Right-thinking gentlemen 3
Occam's Razor:
"Frustra fit per plura, quod fieri potest per pauciora." ("It is vain to do with more what can be done with less.")
Everything else being equal, we favour models which are simple.
William of Occam (1288-1348 AD)
13. The meaning of probability
- Definitions, algebra and useful distributions
14. But what is probability?
- There are three popular interpretations of the word, each with an interesting history:
  - Probability as a measure of our degree of belief in a statement
  - Probability as a measure of limiting relative frequency of outcome of a set of identical experiments
  - Probability as the fraction of favourable (equally likely) possibilities
- We will call these the Bayesian, Frequentist and Combinatorial interpretations.
- Note there are signs of trouble here:
  - How do you quantify degree of belief?
  - How do you define relative frequency without using ideas of probability?
  - What does "equally likely" mean?
- Thankfully, at some level, all three interpretations agree on the algebra of probability, which we will present in Bayesian terms.
15. Algebra of (Bayesian) probability
- The probability of a statement, such as "y = 3", "the mass of a neutron star is 1.4 solar masses" or "it will rain tomorrow", is a number between 0 and 1, such that:
- For some statement X, where the bar denotes the negation of the statement,
    p(X) + p(X̄) = 1  -- the Sum Rule.
- If there are two statements X and Y, then the joint probability is
    p(X,Y) = p(X|Y) p(Y),
  where the vertical line denotes the conditional statement "X given Y is true" -- the Product Rule.
16. Algebra of (Bayesian) probability
- From these we can deduce that
    p(X + Y|I) = p(X|I) + p(Y|I) - p(X,Y|I),
  where "+" denotes "or" and I represents common background information -- the Extended Sum Rule.
- And, because p(X,Y) = p(Y,X), we get
    p(X|Y,I) = p(Y|X,I) p(X|I) / p(Y|I),
  which is called Bayes' Theorem.
- Note that these results are also applicable in Frequentist probability theory, with a suitable change in meaning of p.
17. Algebra of (Bayesian) probability
Bayes' theorem is the appropriate rule for updating our degree of belief when we have new data:
    p(θ|d,I) = p(d|θ,I) p(θ|I) / p(d|I)
    posterior = likelihood × prior / evidence
Note that the word "evidence" is sometimes used for something else (the log odds). We will stick to the p(d|I) definition here.
18. Algebra of (Bayesian) probability
- We can also deduce the marginal probabilities. If X and Y are propositions that can take on values drawn from {x1, ..., xn} and {y1, ..., ym}, then
    p(X|I) = Σ_Y p(X,Y|I) = Σ_Y p(X|Y,I) p(Y|I).
  This gives us the probability of X when we don't care about Y. In these circumstances, Y is known as a nuisance parameter.
- All these relationships can be smoothly extended from discrete probabilities to probability densities, e.g.
    p(x|I) = ∫ p(x,y|I) dy,
  where p(y) dy is the probability that y lies in the range y to y + dy.
19. Algebra of (Bayesian) probability
These equations are the key to Bayesian inference, the methodology upon which (much) astronomical data analysis is now founded. There is a clear introduction by Devinder Sivia (OUP); a fuller bibliography follows tomorrow.
See also the free book by Prasenjit Saha (QMW, London): http://ankh-morpork.maths.qmw.ac.uk/~saha/book
20. Example
- A gravitational wave detector may have seen a type II supernova as a burst of gravitational radiation. Burst-like signals can also come from instrumental glitches, and only 1 in 10,000 bursts is really a supernova, so the data are checked for glitches by examining veto channels. The test is expected to confirm the burst is astronomical in 95% of cases in which it truly is, and in 1% when it truly is not. The burst passes the veto test! What is the probability we have seen a supernova?
Answer: Denote the alternatives "supernova" (S) and "glitch" (G), and let I represent the information that the burst seen is typical of those used to deduce the information in the question. Then we are told that:
    Priors for S and G:       p(S|I) = 0.0001,  p(G|I) = 0.9999
    Likelihoods for S and G:  p(pass|S,I) = 0.95,  p(pass|G,I) = 0.01
21. Example (cont.)
- But we want to know the probability that it's a supernova, given it passed the veto test. By Bayes' theorem,
    p(S|pass,I) = p(pass|S,I) p(S|I) / p(pass|I),
  and we are directly told everything on the rhs except p(pass|I), the probability that any burst candidate would pass the veto test. If the burst is either a supernova or a hardware glitch then we can marginalise over these alternatives:
    p(pass|I) = p(pass|S,I) p(S|I) + p(pass|G,I) p(G|I) = 0.95 × 0.0001 + 0.01 × 0.9999 ≈ 0.0101,
  so p(S|pass,I) ≈ 0.01.
22. Example (cont.)
- So, despite passing the test, there is only a 1% probability that the burst is a supernova! Veto tests have to be blisteringly good if supernovae are rare.
- Why? Because most bursts that pass the test are just instrumental glitches. It really is just common sense reduced to calculation.
- Note however that by passing the test, the probability that this burst is from a supernova has increased by a factor of 100 (from 0.0001 to 0.01).
- Moral: the probability that a supernova burst gets through the veto (0.95) is not the probability that it's a supernova burst if it gets through the veto (0.01).
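The arithmetic is easy to check numerically. A minimal sketch (the course gives no code; Python and the variable names are my choices), using the numbers quoted above:

```python
# Veto-test example: p(S) = 1e-4, p(G) = 0.9999,
# p(pass|S) = 0.95, p(pass|G) = 0.01.
p_S, p_G = 1e-4, 0.9999          # priors: supernova vs. glitch
p_pass_S, p_pass_G = 0.95, 0.01  # likelihoods of passing the veto

# Marginalise over the alternatives to get p(pass), then apply Bayes' theorem.
p_pass = p_pass_S * p_S + p_pass_G * p_G
p_S_given_pass = p_pass_S * p_S / p_pass
print(p_S_given_pass)            # ~0.0094, i.e. about 1%
```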
23. ...the basis of frequentist probability
24. Basis of frequentist probability
- Bayesian probability theory is simultaneously a very old and a very young field:
  - Old: the original interpretation of Bernoulli, Bayes and Laplace
  - Young: the state of the art in (astronomical) data analysis
- But BPT was rejected for several centuries, because probability as degree of belief was seen as too subjective
  → the Frequentist approach.
25. Basis of frequentist probability
Probability = long-run relative frequency of an event. It appears at first that this can be measured objectively, e.g. by rolling a die. What is p(6)? If the die is fair we expect p(1) = p(2) = ... = p(6) = 1/6. These probabilities are fixed (but unknown) numbers. We can imagine rolling the die M times. The number rolled is a random variable: a different outcome each time.
26. Basis of frequentist probability
- We define p(i) as the limiting relative frequency of face i as the number of rolls M → ∞. If the die is fair, p(i) = 1/6.
- But the objectivity is an illusion:
  - assumes each outcome is equally likely (i.e. equally probable)
  - also assumes an infinite series of identical trials.
- What can we say about the fairness of the die after (say) 5 rolls, or 10, or 100?
27. Basis of frequentist probability
- In the frequentist approach, a lot of mathematical machinery is specifically defined to let us address frequency questions:
  - We take the data as a random sample of size M, drawn from an assumed underlying pdf.
  - Sampling distribution: derived from the underlying pdf and M.
  - Estimator: a function of the sample data that is used to estimate properties of the pdf.
- But how do we decide what makes an acceptable estimator?
28. Basis of frequentist probability
- Example: measuring a galaxy redshift.
- Let the true redshift be z0 -- a fixed but unknown parameter. We use two telescopes to estimate z0, and compute sampling distributions for the estimators ẑ1 and ẑ2 by modelling the errors.
- Small telescope (low dispersion spectrometer):
  - Unbiased: repeat the observation a large number of times → the average estimate equals z0.
  - BUT the variance of ẑ1 is large, due to the low dispersion.
29. Basis of frequentist probability
- Example: measuring a galaxy redshift.
- Let the true redshift be z0 -- a fixed but unknown parameter. We use two telescopes to estimate z0, and compute sampling distributions for the estimators ẑ1 and ẑ2 by modelling the errors.
- Large telescope (high dispersion spectrometer, but a faulty astronomer! e.g. wrong calibration):
  - Biased, BUT the variance of ẑ2 is small.
- Which is the better estimator?
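One way to make the trade-off concrete is to simulate both telescopes and compare mean squared errors. A sketch with illustrative numbers (z0, the dispersions and the bias are assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
z0 = 0.3          # true redshift (illustrative value only)
M = 100_000       # number of imagined repeated observations

# Small telescope: unbiased but noisy (large sigma).
z_small = rng.normal(loc=z0, scale=0.05, size=M)
# Large telescope: precise (small sigma) but miscalibrated (biased).
z_large = rng.normal(loc=z0 + 0.02, scale=0.005, size=M)

for name, z in [("small", z_small), ("large", z_large)]:
    bias = z.mean() - z0
    mse = np.mean((z - z0) ** 2)   # mean squared error = bias^2 + variance
    print(f"{name}: bias={bias:+.4f}, std={z.std():.4f}, mse={mse:.6f}")
```

With these numbers the biased, high-dispersion telescope still has the smaller mean squared error, which is one answer to "which is better?".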
30. Basis of frequentist probability
What about the sample mean?
- Let {x1, ..., xM} be a random sample from a pdf with mean μ and variance σ². Then the sample mean is
    x̄ = (1/M) Σ_i x_i.
- Can show that E(x̄) = μ -- an unbiased estimator.
- But bias is defined formally in terms of an infinite set of randomly chosen samples, each of size M.
- What can we say with a finite number of samples, each of finite size? Before that...
31. Some important probability distributions: quick definitions
32. Some important pdfs: discrete case
- Poisson pdf:
    p(n|λ) = λⁿ e^(-λ) / n!
- e.g. number of photons/second counted by a CCD, number of galaxies/degree² counted by a survey.
n = number of detections. The Poisson pdf assumes detections are independent, and that there is a constant rate λ. Can show that E(n) = λ and var(n) = λ.
33. Some important pdfs: discrete case
(Figure: example Poisson distributions for the photon-counting and galaxy-survey examples above.)
34. Some important pdfs: discrete case
- Binomial pdf: the number of successes from N observations, for two mutually exclusive outcomes ("Heads" and "Tails"):
    p(r|θ,N) = [N! / (r!(N-r)!)] θʳ (1-θ)^(N-r)
- e.g. number of binary stars, Seyfert galaxies, supernovae...
r = number of successes; θ = probability of success for a single observation. Can show that E(r) = Nθ and var(r) = Nθ(1-θ).
35. Some important pdfs: continuous case
36. Some important pdfs: continuous case
- Central, or Normal pdf (also known as Gaussian):
    p(x) = (1/√(2πσ²)) exp[-(x-μ)²/(2σ²)]
(Figure: the bell-shaped p(x) versus x.)
37. Cumulative distribution function (CDF)
    CDF(a) = Prob(x < a) = ∫_{-∞}^{a} p(x) dx
38. Measures and moments of a pdf
The nth moment of a pdf is defined as:
    Discrete case:    ⟨xⁿ⟩ = Σ_i x_iⁿ p(x_i)
    Continuous case:  ⟨xⁿ⟩ = ∫ xⁿ p(x) dx
39. Measures and moments of a pdf
The 1st moment is called the mean or expectation value:
    Discrete case:    ⟨x⟩ = Σ_i x_i p(x_i)
    Continuous case:  ⟨x⟩ = ∫ x p(x) dx
40. Measures and moments of a pdf
The 2nd moment is called the mean square:
    Discrete case:    ⟨x²⟩ = Σ_i x_i² p(x_i)
    Continuous case:  ⟨x²⟩ = ∫ x² p(x) dx
41. Measures and moments of a pdf
The variance is defined as ⟨x²⟩ - ⟨x⟩², and is often written σ². σ is called the standard deviation.
    Discrete case:    var(x) = Σ_i (x_i - ⟨x⟩)² p(x_i)
    Continuous case:  var(x) = ∫ (x - ⟨x⟩)² p(x) dx
In general, var(x) = ⟨(x - ⟨x⟩)²⟩ = ⟨x²⟩ - ⟨x⟩².
42. Measures and moments of a pdf
    pdf                      mean       variance
    Poisson (discrete)       λ          λ
    Binomial (discrete)      Nθ         Nθ(1-θ)
    Uniform on (a,b)         (a+b)/2    (b-a)²/12
    Normal (continuous)      μ          σ²
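A quick numerical check of this table (a Python sketch; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, N, theta, a, b, mu, sigma = 4.0, 20, 0.3, 0.0, 2.0, 1.0, 0.5

# Draw large samples and compare sample moments with the table above.
samples = {
    "Poisson":  (rng.poisson(lam, 10**6),       lam,     lam),
    "Binomial": (rng.binomial(N, theta, 10**6), N*theta, N*theta*(1-theta)),
    "Uniform":  (rng.uniform(a, b, 10**6),      (a+b)/2, (b-a)**2/12),
    "Normal":   (rng.normal(mu, sigma, 10**6),  mu,      sigma**2),
}
for name, (x, mean, var) in samples.items():
    print(f"{name:8s} mean {x.mean():.3f} (expect {mean:.3f}), "
          f"var {x.var():.3f} (expect {var:.3f})")
```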
43. Measures and moments of a pdf
The median divides the CDF into two equal halves:
    Prob(x < x_med) = Prob(x > x_med) = 0.5
44. Measures and moments of a pdf
The mode is the value of x for which the pdf is a maximum.
(Figure: p(x) versus x, with the peak marked.)
For a Normal pdf, mean = median = mode.
45. Parameter estimation
46. Bayesian parameter estimation
In the Bayesian approach, we can test our model in the light of our data (i.e. rolling the die) and see how our degree of belief in its "fairness" evolves, for any sample size, considering only the data that we did actually observe:
    posterior ∝ likelihood × prior
    (what we know now) ∝ (influence of our observations) × (what we knew before)
47. Bayesian parameter estimation
Astronomical example 1: probability that a galaxy is a Seyfert 1.
We want to know the fraction of Seyfert galaxies that are type 1. How large a sample do we need to reliably measure this? Model as a binomial pdf, with θ = the global fraction of Seyfert 1s. Suppose we sample N Seyferts and observe r Seyfert 1s. Then
    p(r|θ,N) ∝ θʳ (1-θ)^(N-r)
is the probability of obtaining the observed data, given the model -- the likelihood of θ.
48. Bayesian parameter estimation
- But what do we choose as our prior? This has been the source of much argument between Bayesians and frequentists, though it is often not that important.
- We can sidestep this for a moment, realising that if our data are good enough, the choice of prior shouldn't matter: in
    posterior ∝ likelihood × prior,
  the likelihood dominates.
49. Bayesian parameter estimation
Consider a simulation of this problem using two different priors p(θ|I):
- a flat prior: all values of θ equally probable
- a normal prior, peaked at θ = 0.5.
50-62. The posterior p(θ|data, I), under each prior, as the sample grows:
- After observing 0 galaxies
- After observing 1 galaxy: S1
- After observing 2 galaxies: S1 S1
- After observing 3 galaxies: S1 S1 S2
- After observing 4 galaxies: S1 S1 S2 S2
- After observing 5 galaxies: S1 S1 S2 S2 S2
- After observing 10 galaxies: 5 S1, 5 S2
- After observing 20 galaxies: 7 S1, 13 S2
- After observing 50 galaxies: 17 S1, 33 S2
- After observing 100 galaxies: 32 S1, 68 S2
- After observing 200 galaxies: 59 S1, 141 S2
- After observing 500 galaxies: 126 S1, 374 S2
- After observing 1000 galaxies: 232 S1, 768 S2
(Each slide plots the two posteriors, which narrow and converge as the data accumulate.)
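The whole sequence can be reproduced with a simple grid calculation. A sketch, assuming a width of 0.1 for the normal prior (the slides do not state its width):

```python
import numpy as np

# Gridded posterior for theta after observing r Seyfert 1s out of N Seyferts.
theta = np.linspace(0, 1, 1001)
dtheta = theta[1] - theta[0]
priors = {
    "flat":   np.ones_like(theta),
    "normal": np.exp(-0.5 * ((theta - 0.5) / 0.1) ** 2),  # width assumed
}

def posterior(r, N, prior):
    post = theta**r * (1 - theta)**(N - r) * prior  # binomial likelihood x prior
    return post / (post.sum() * dtheta)             # normalise on the grid

# e.g. the slide with N = 100 galaxies, r = 32 Seyfert 1s:
for name, prior in priors.items():
    p = posterior(32, 100, prior)
    print(name, "posterior mean:", (theta * p).sum() * dtheta)
```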
63. Bayesian parameter estimation
What do we learn from all this?
- As our data improve (e.g. our sample increases), the posterior pdf narrows and becomes less sensitive to our choice of prior.
- The posterior conveys our (evolving) degree of belief in different values of θ, in the light of our data.
- If we want to express our result as a single number we could perhaps adopt the mean, median, or mode.
- We can use the variance of the posterior pdf to assign an uncertainty for θ.
- It is very straightforward to define confidence intervals.
64. Bayesian parameter estimation
Bayesian confidence intervals: choose θ1 and θ2 such that 95% of the area under the posterior pdf p(θ|data, I) lies between them. Then we are 95% sure that θ lies between θ1 and θ2. Note the confidence interval is not unique, but we can define the shortest C.I.
(Figure: the posterior, with the interval [θ1, θ2] enclosing 95% of the area.)
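The shortest interval is easy to extract from a gridded posterior such as the one in the previous sketch. A minimal helper, assuming a unimodal, normalised posterior on a uniform grid:

```python
import numpy as np

def shortest_interval(x, post, frac=0.95):
    """Shortest credible interval from a gridded, normalised posterior:
    rank grid points by density and keep the densest ones until the
    enclosed probability reaches `frac`."""
    dx = x[1] - x[0]
    order = np.argsort(post)[::-1]            # highest density first
    cum = np.cumsum(post[order]) * dx
    keep = order[: np.searchsorted(cum, frac) + 1]
    return x[keep].min(), x[keep].max()
```

With the θ grid and posterior from the Seyfert sketch, `shortest_interval(theta, p)` returns (θ1, θ2).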
65. Bayesian parameter estimation
Astronomical example 2: flux density of a GRB.
Take gamma ray bursts to be equally luminous events, distributed homogeneously in the Universe. We see three gamma ray photons from a GRB in an interval of 1 s. What is the flux of the source, F?
- The seat-of-the-pants answer is F = 3 photons/s, with an uncertainty of about √3, but we can do better than that by including our prior information on luminosity and homogeneity. Call this background information I.
Homogeneity implies that the probability the source is in any particular volume of space is proportional to the volume, so the prior probability that the source is in a thin shell of radius r is
    p(r|I) dr ∝ r² dr.
66. Bayesian parameter estimation
- But the sources have a fixed luminosity, L, so r and F are directly related by F = L/(4πr²).
- The prior on F is therefore
    p(F|I) ∝ F^(-5/2).
  Interpretation: low flux sources are intrinsically more probable, as there is more space for them to sit in.
- We now apply Bayes' theorem to determine the posterior for F after seeing n photons:
    p(F|n,I) ∝ p(n|F,I) p(F|I).
(Figure: the steeply falling prior p(F|I) versus F.)
67. Bayesian parameter estimation
- The likelihood for F comes from the Poisson nature of photon counting: in time T,
    p(n|F,I) = (FT)ⁿ e^(-FT) / n!,
  so finally
    p(F|n,I) ∝ F^(n-5/2) e^(-FT).
  For n = 3 we get a posterior with the most probable value of F equalling 0.5 photons/s.
- Clearly it is more probable this is a distant source from which we have seen an unusually high number of photons than it is an unusually nearby source from which we have seen an expected number of photons. (The most probable value of F is (n - 5/2)/T, approaching n/T for n ≫ 1.)
(Figure: p(F|n=3,I) versus F, with the most probable value well below F = n/T.)
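The same posterior on a grid (a sketch; the grid limits are my choice):

```python
import numpy as np

# Posterior p(F|n,I) ∝ F^(n-5/2) e^(-FT):
# Poisson likelihood times the F^(-5/2) homogeneity prior.
n, T = 3, 1.0                       # 3 photons in T = 1 s
F = np.linspace(1e-4, 15, 100_000)  # flux grid, photons/s
dF = F[1] - F[0]

post = F**(n - 2.5) * np.exp(-F * T)
post /= post.sum() * dF             # normalise on the grid

print("most probable F:", F[np.argmax(post)])  # ~(n - 5/2)/T = 0.5
print("posterior mean F:", (F * post).sum() * dF)
```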
68. Parameter estimation
69. Frequentist parameter estimation
- Recall that in frequentist (orthodox) statistics, probability is the limiting relative frequency of an outcome, so only random variables can have frequentist probabilities, as only these show variation with repeated measurement. So we can't talk about the probability of a model parameter, or of a hypothesis. E.g., a measurement of a mass is an RV, but the mass itself is not.
- So no orthodox probabilistic statement can be interpreted as directly referring to the parameter in question! For example, orthodox confidence intervals do not indicate the range in which we are confident the parameter value lies. That's what Bayesian intervals do.
- So what do they mean?
70. Frequentist parameter estimation
- Orthodox parameter estimation proceeds as follows. Imagine we have some data, {d_i}, that we want to use to estimate the value of a parameter, a, in a model. These data must depend on the true value of a in a known way, but as they are random variables all we can say is that we know, or can estimate, p(d_i|a) for some given a.
- 1. Use the data to construct a statistic (i.e. a function of the data) that can be used as an estimator for a, called â. A good estimator will have a pdf that depends heavily on a, and which is sharply peaked at, or very close to, a.
(Figure: the sampling pdf of â, with the measured value of â marked.)
71. Frequentist parameter estimation
- 2. One such estimator is the maximum likelihood (ML) estimator, constructed in the following way: given the distribution from which the data are drawn, p(d|a), construct the sampling distribution p({d_i}|a), which is just p(d1|a)·p(d2|a)·p(d3|a)... if the data are independent. Interpret this as a function of a, called the likelihood of a, and call the value of a that maximises it for the given data â. The corresponding sampling distribution, p({d_i}|â), is the one from which the data were most likely drawn.
72. Frequentist parameter estimation
- ...but remember that this â is just one value of the estimator, albeit drawn from a population that has an attractive average behaviour.
- And we still haven't said anything about a.
- 3. Define a confidence interval around a enclosing, say, 95% of the expected values of â from repeated experiments.
(Figure: the sampling pdf of â around a, with the measured value of â marked.)
73. Frequentist parameter estimation
- The half-width δ of this interval may be known from the sampling pdf, or it may have to be estimated too.
- 4. We can now say that p(a - δ < â < a + δ) = 0.95. Note this is a probabilistic statement about the estimator, not about a. However, this expression can be restated: 0.95 is the relative frequency with which the statement "a lies in the region [â - δ, â + δ]" is true, over many repeated experiments.
(Figure: many repeated experiments, each generating an interval; most intervals contain a ("Yes"), some do not ("No").)
The disastrous shorthand for this is p(a ∈ [â - δ, â + δ]) = 0.95. Note that this is not a statement about our degree of belief that a lies in the numerical interval generated in our particular experiment.
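The long-run reading of that statement can be demonstrated by simulation. A sketch with assumed ingredients (Gaussian data of known σ, the sample mean as estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, sigma, M, trials = 10.0, 2.0, 5, 100_000

# Simulate many repeated experiments, each yielding M data points.
d = rng.normal(a_true, sigma, size=(trials, M))
a_hat = d.mean(axis=1)              # the estimator: the sample mean
delta = 1.96 * sigma / np.sqrt(M)   # half-width of a 95% interval

# Frequency with which "a lies in [a_hat - delta, a_hat + delta]" is true:
covered = np.abs(a_hat - a_true) < delta
print(covered.mean())               # ~0.95: a statement about the procedure,
                                    # not about a in any one experiment
```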
74. Example: B and F compared
- Two gravitational wave (GW) detectors see 7 simultaneous burst-like signals in the data from one hour, consisting of GW signals and spurious signals. When the two sets of data are offset in time by 10 minutes there are 9 simultaneous signals. What is the true GW burst rate? (Note: no need to be an expert to realise there is not much GW activity here!)
- A frequentist solution: take the events as Poisson, characterised by the background rate, rb, and the GW rate, rg. We get
    r̂g = 7 - 9 = -2 per hour,  σ² = 7 + 9 = 16,
  where σ² is the variance of the sampling distribution.
75. Example: B and F compared
- So we would quote our result as rg = -2 ± 4 (1σ).
- But burst rates can't be negative! What does this mean? In fact it is quite consistent with our definition of a frequentist confidence interval. Our value of r̂g is drawn from its sampling distribution, which will look something like a broad peak of width σ = 4 about the true rate, with our particular sample falling in its lower tail.
(Figure: the sampling distribution of r̂g, with our particular sample marked.)
So, in the shorthand of the subject, this result is quite correct.
76. Example: B and F compared
- The Bayesian solution: we'll go through this carefully. Our job is to determine rg. In Bayesian terms that means we are after p(rg|data). We start by realising there are two experiments here: one to determine the background rate and one to determine the joint rate, so we will also determine p(rb|data). If the background count comes from a Poisson process of mean rb, then the probability of n events is
    p(n|rb) = rbⁿ e^(-rb) / n!,
  which is our Bayesian likelihood for rb. We will choose a prior probability for rb proportional to 1/rb. This is the scale invariant prior that encapsulates total ignorance of the non-zero rate (of course in reality we may have something that constrains rb more tightly a priori). See later.
77. Example: B and F compared
- So our posterior for rb, based on the background counts, is
    p(rb|n) ∝ (1/rb) × rbⁿ e^(-rb) = rb^(n-1) e^(-rb),
  which we normalise and evaluate for n = 9.
- The probability of seeing m coincident bursts, given the two rates, is
    p(m|rg,rb) = (rg + rb)ᵐ e^(-(rg+rb)) / m!.
(Figure: the background-rate posterior p(rb|n=9) versus rb, peaking near rb ≈ 9.)
78. Example: B and F compared
- And our joint posterior for the rates is, by Bayes' theorem,
    p(rg,rb|m,n) ∝ p(m|rg,rb) × p(rg,rb|n)  (likelihood × prior).
  The joint prior in the above can be split into a probability for rb, which we have just evaluated, and a prior for rg. This may be zero, so we will say that we are equally ignorant over whether to expect 0, 1, 2, ... counts from this source. This translates into a uniform prior on rg, so our joint prior is
    p(rg,rb|n) ∝ p(rb|n).
  Finally we get the posterior on rg by marginalising over rb:
    p(rg|m,n) ∝ ∫ p(m|rg,rb) p(rb|n) drb.
79. Example: B and F compared
...giving our final answer to the problem: the posterior p(rg|n=9, m=7).
(Figure: p(rg|n=9, m=7) versus rg.)
Compare with our frequentist result, rg = -2 ± 4 to 1σ. Note that p(rg < 0) = 0, due to the prior, and that the true value of rg could very easily be as high as 4 or 5.
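The marginalisation is a one-line sum on a grid. A sketch of the whole calculation for n = 9, m = 7 (the grids and ranges are my choices):

```python
import numpy as np
from math import factorial

n, m = 9, 7                              # background count, zero-lag count
rb = np.linspace(1e-3, 30, 2000)         # background-rate grid
rg = np.linspace(0, 15, 1501)            # GW-rate grid
drb, drg = rb[1] - rb[0], rg[1] - rg[0]

# p(rb|n) ∝ rb^(n-1) e^(-rb): the 1/rb prior times the Poisson likelihood.
p_rb = rb**(n - 1) * np.exp(-rb)
p_rb /= p_rb.sum() * drb

# Joint-count likelihood p(m|rg,rb) on the grid, with a uniform prior on rg.
R = rg[:, None] + rb[None, :]
like = R**m * np.exp(-R) / factorial(m)

# Marginalise over the nuisance parameter rb to get p(rg|n,m).
p_rg = (like * p_rb[None, :]).sum(axis=1) * drb
p_rg /= p_rg.sum() * drg

print("mode:", rg[np.argmax(p_rg)])
print("p(rg > 4):", p_rg[rg > 4].sum() * drg)
```

Rerunning with n = 3 reproduces the next slide's case.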
80. Example: B and F compared
Let's see how this result would change if the background count was 3, rather than 9 (joint count still 7).
(Figure: the posterior p(rg|n=3, m=7) versus rg.)
Again, this looks very reasonable.
81. Bayesian hypothesis testing
- And why we already know how to do it
82. Bayesian hypothesis testing
- In Bayesian analysis, hypothesis testing can be performed as a generalised application of Bayes' theorem. Generally a hypothesis differs from a parameter in that many values of the parameter(s) are consistent with one hypothesis. Hypotheses are models that depend on parameters.
- Note however that we cannot define the probability of one hypothesis given some data, d, without defining all the alternatives in its class, i.e. the complete set {H_i} for which
    Σ_i p(H_i|d) = 1,
  and often this is impossible. So questions like "do the data fit a gaussian?" are not well-formed until you list all the curves the data could fit. A well-formed question would be "do the data fit a gaussian better than these other curves?", or more usually "do the data fit a gaussian better than a lorentzian?".
- Simple comparisons can be expressed as an odds ratio,
    O = p(H1|d,I) / p(H2|d,I).
83. Bayesian hypothesis testing
- The odds ratio can be divided into the prior odds and the Bayes factor:
    O = [p(H1|I)/p(H2|I)] × [p(d|H1,I)/p(d|H2,I)].
- The prior odds simply express our prior preference for H1 over H2, and are set to 1 if you are indifferent.
- The Bayes factor is just the ratio of the evidences, as defined in the earlier lectures. Recall that for a model that depends on a parameter a,
    p(d|H,I) = ∫ p(d|a,H,I) p(a|H,I) da,
  so the evidence is simply the joint probability of the parameter(s) and the data, marginalised over all hypothesis parameter values.
84. Bayesian hypothesis testing
- Example: a spacecraft is sent to a moon of Saturn and, using a penetrating probe, detects a liquid sea deep under the surface at 1 atmosphere pressure and a temperature of -3 °C. However, the thermometer has a fault, so that the temperature reading may differ from the true temperature by as much as ±5 °C, with a uniform probability within this range. Determine the temperature of the liquid, assuming it is water (liquid within 0 < T < 100 °C) and then assuming it is ethanol (liquid within -80 < T < 80 °C). What are the odds of it being ethanol?
(based loosely on a problem by John Skilling)
(Figure: the measurement at -3 °C, against the liquid ranges of ethanol, -80 to 80 °C, and water, 0 to 100 °C.)
85. Bayesian hypothesis testing
- Call the water hypothesis H1 and the ethanol hypothesis H2.
For H1: the prior on the temperature is
    p(T|H1) = 1/100 for 0 < T < 100 °C, and zero otherwise.
The likelihood of the temperature is the probability of the data d, given the temperature:
    p(d|T) = 1/10 for |d - T| < 5 °C.
Thought of as a function of T, for d = -3, this is a box of height 1/10 spanning -8 < T < 2.
86. Bayesian hypothesis testing
- The posterior for T is the normalised product of the prior and the likelihood, giving
    p(T|d,H1) = 1/2 for 0 < T < 2 °C:
  H1 only allows the temperature to be between 0 and 2 °C.
- The evidence for water (as we defined it) is
    p(d|H1) = ∫ p(d|T) p(T|H1) dT = 2 × (1/10) × (1/100) = 0.002.
- For H2, by the same arguments,
    p(T|d,H2) = 1/10 for -8 < T < 2 °C,
  and the evidence for ethanol is
    p(d|H2) = 10 × (1/10) × (1/160) = 0.00625.
87. Bayesian hypothesis testing
- So under the water hypothesis we have a tighter possible range for the liquid's temperature, but it may not be water. In fact, the odds of it being water rather than ethanol are
    p(d|H1)/p(d|H2) = 0.002/0.00625 ≈ 1/3,
  which means about 3:1 in favour of ethanol. Of course this depends on our prior odds too, which we have set to 1. If the choice was between water and whisky under the surface of the moon the result would be very different, though the Bayes factor would be roughly the same!
- Why do we prefer the ethanol option? Because too much of the prior for temperature, assuming water, falls at values that are excluded by the data. In other words, the water hypothesis is unnecessarily complicated. Bayes factors naturally implement Occam's razor in a quantitative way.
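The two evidences are just overlap integrals, so the whole example fits in a few lines (a sketch; the grid is my choice):

```python
import numpy as np

T = np.linspace(-100, 150, 250_001)   # temperature grid, °C
dT = T[1] - T[0]
d = -3.0                              # measured temperature

# Faulty thermometer: uniform within ±5 °C, i.e. a box of height 1/10.
like = (np.abs(d - T) < 5).astype(float) / 10

prior_water   = ((T > 0)   & (T < 100)).astype(float) / 100  # 1/100 on (0, 100)
prior_ethanol = ((T > -80) & (T < 80)).astype(float) / 160   # 1/160 on (-80, 80)

Z_water   = (like * prior_water).sum() * dT    # evidence = overlap integral
Z_ethanol = (like * prior_ethanol).sum() * dT
print(Z_water, Z_ethanol, Z_water / Z_ethanol) # ~0.002, 0.00625, odds ~0.32
```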
88. Bayesian hypothesis testing
(Figure: the likelihood box of height 0.1 over -8 < T < 2, against the water prior of height 0.01 and the ethanol prior of height 0.00625.)
The overlap integral (evidence) is greater for H2 (ethanol) than for H1 (water).
89. Bayesian hypothesis testing
- To look at this a bit more generally, we can split the evidence into two approximate parts, the maximum of the likelihood and an "Occam factor":
    p(d|H) ≈ L_max × (likelihood range / prior range),
  i.e., evidence ≈ maximum likelihood × Occam factor. The Occam factor penalises models that include wasted parameter space, even if they show a good ML fit.
90. Bayesian hypothesis testing
- This is a very powerful property of the Bayesian method. Example: you're given a time series of 1000 data points comprising a number of sinusoids embedded in gaussian noise. Determine the number of sinusoids and the standard deviation of the noise.
- We could think of this as comparing hypotheses Hn that there are n sinusoids in the data, with n ranging from 0 to n_max. Equivalently, we could consider it as a parameter fitting problem, with n an unknown parameter within the model. The joint posterior for the n signals, with amplitudes A_n and frequencies ω_n, and noise variance σ², given the overall model and data D, is
    p({A_n, ω_n}, σ, n|D) ∝ p(D|{A_n, ω_n}, σ, n) p({A_n, ω_n}, σ, n).
  The likelihood term, based on gaussian noise, is
    p(D|{A_n, ω_n}, σ, n) ∝ σ^(-N) exp[-Σ_i (d_i - model_i)² / (2σ²)],
  and we can set the priors as independent and uniform over sensible ranges.
91. Bayesian hypothesis testing
- Our result for n is just its marginal probability,
    p(n|D) = ∫ p({A_n, ω_n}, σ, n|D) dA_n dω_n dσ,
  and similarly we could marginalise for σ. Recent work, in collaboration with Umstatter and Christensen, has explored this:
- 1000 data points with 50 embedded signals, σ = 2.6.
Result: around 33 signals can be recovered from the data; the rest are indistinguishable from noise, and the inferred σ is consequently higher.
92. Bayesian hypothesis testing
- This has implications for the analysis of LISA data, which is expected to contain many (perhaps 50,000) signals from white dwarf binaries. The data will contain resolvable binaries and binaries that just contribute to the overall noise (either because they are faint or because their frequencies are too close together). Bayes can sort these out without having to introduce ad hoc acceptance and rejection criteria, and without needing to know the "true noise level" (whatever that means!).
93. Frequentist hypothesis testing
And why we should tread carefully, if at all
94. Frequentist hypothesis testing: significance tests
- A note on why these should really be avoided!
- The method goes like this:
  - To test a hypothesis H1, consider another hypothesis, called the null hypothesis, H0, the truth of which would deny H1. Then argue against H0...
  - Use the data you have gathered to compute a test statistic T_obs (e.g. the chi-squared statistic) which has a calculable pdf if H0 is true. This can be calculated analytically or by Monte Carlo methods.
  - Look where your observed value of the statistic lies in the pdf, and reject H0 based on how far in the wings of the distribution you have fallen.
(Figure: the pdf p(T|H0); H0 is rejected if your result lies far out in the wing.)
95. Frequentist hypothesis testing: significance tests
- H0 is rejected at the x% level if x% of the probability lies to the right of the observed value of the statistic (or is "worse" in some other sense), and this makes no reference to how improbable the value is under any alternative hypothesis (not even H1!). So:
(Figure: p(T|H0), with x% of the area lying beyond T_obs.)
"An hypothesis H0 that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it."
Harold Jeffreys, Theory of Probability, 1939
96. Frequentist hypothesis testing: significance tests
- E.g., you are given a data point T_obs affected by gaussian noise, drawn either from N(μ = -1, σ = 0.5) or N(μ = +1, σ = 0.5).
    H0: the datum comes from μ = -1
    H1: the datum comes from μ = +1
  Test whether H1 is true.
- Here our statistic is simply the value of the observed data. We will choose a critical region of T > 0, so that if T_obs > 0 we reject H0 and therefore accept H1.
(Figure: the two pdfs, centred on T = -1 (H0) and T = +1 (H1).)
97. Frequentist hypothesis testing
(Figure: the Type I and Type II error areas in the overlapping tails of the two pdfs, either side of T = 0.)
- Formally, this can go wrong in two ways:
  - A Type I error occurs when we reject the null hypothesis when it is true.
  - A Type II error occurs when we accept the null hypothesis when it is false.
- Both of these we should strive to minimise.
- The probabilities (p-values) of these are shown as the coloured areas above (about 5% for this problem).
- If T_obs > 0 then we reject the null hypothesis at the 5% level and therefore accept H1.
98. Frequentist hypothesis testing
- But note that we do not consider the relative probabilities of the hypotheses (we can't! Orthodox statistics does not allow the idea of hypothesis probabilities), so the results can be misleading.
- For example, let T_obs = 0.01. This lies just on the boundary of the critical region, so we reject H0 in favour of H1 at the 5% level, despite the fact that we know a value of 0.01 is just as likely under both hypotheses (actually just as unlikely, but it has happened).
- Note that the Bayes factor for this same result is 1, reflecting the intuitive answer that you can't decide between the hypotheses based on such a datum.
Moral: the subtleties of p-values are rarely reflected in papers that quote them, and many errors of "Type III" (misuse!) occur. Always take them with a pinch of salt, and avoid them if possible; there are better tools available.
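Both numbers are easy to compute for this two-Gaussian example. A sketch using scipy's normal distribution (the library choice is mine):

```python
from scipy.stats import norm

T_obs = 0.01
H0 = norm(loc=-1, scale=0.5)   # N(mu = -1, sigma = 0.5)
H1 = norm(loc=+1, scale=0.5)   # N(mu = +1, sigma = 0.5)

# Frequentist: probability under H0 of a result at least this extreme.
p_value = H0.sf(T_obs)                       # survival function = 1 - CDF
# Bayesian: ratio of the likelihoods of the actual datum.
bayes_factor = H1.pdf(T_obs) / H0.pdf(T_obs)

print(f"p-value = {p_value:.3f}, Bayes factor = {bayes_factor:.2f}")
# The Bayes factor comes out close to 1, as the slide notes: the datum
# alone cannot decide between the hypotheses, whatever the p-value says.
```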
99. Assigning Bayesian probabilities
- (you have to start somewhere!)
100. Assigning Bayesian probabilities
- We have done much on the manipulation of probabilities, but not much on the initial assignments of likelihoods and priors. Here are a few words. The principle of insufficient reason (Bernoulli, 1713) helps us out: if we have N mutually exclusive possibilities for an outcome, and no reason to believe one more than another, then each should be assigned the probability 1/N.
- So the probability of throwing a 6 on a die is 1/6, if all you know is it has 6 faces.
- The key is to realise that this is one example of a more general principle: your state of knowledge is such that the probabilities are invariant under some transformation group (here, the exchange of numbers on faces).
- Using this idea we can extend the principle of insufficient reason to continuous parameters.
101. Assigning Bayesian probabilities
- So, a parameter that is not known to within a change of origin (a "location parameter") should have a uniform prior: p(a|I) = constant.
- A parameter for which you have no knowledge of scale (a "scale parameter") is a location parameter in its log, so
    p(s|I) ∝ 1/s
  -- the so-called Jeffreys prior.
- Note that both these priors are improper: you can't normalise them, so their unfettered use is restricted. However, you are rarely that ignorant about the value of a parameter.
102. Assigning Bayesian probabilities
- More generally, given some information, I, that we wish to use to assign probabilities p1...pn to n different possibilities, the most honest way of assigning the probabilities is the one that maximises
    H = -Σ_i p_i log p_i,
  subject to the constraints imposed by I. H is traditionally called the information entropy of the distribution, and measures the amount of uncertainty in the distribution in the sense of Shannon (though there are several routes to the same result). Honesty demands we maximise this, otherwise we are implicitly imposing further constraints not contained in I.
- For example, the maximum entropy solution for the probability of seeing k events, given only the information that they are characterised by a single rate constant λ, is the Poisson distribution. If you are told the first and second moments of a continuous distribution, the maxent solution is a gaussian, etc.
103. Monte Carlo methods
How to do those nasty marginalisations
104. Monte Carlo Methods
- 1. Uniform random numbers, U[0,1]:
- See Numerical Recipes! http://www.numerical-recipes.com/
105. Monte Carlo Methods
2. Transformed random variables: suppose we have x with pdf p(x), and let y = f(x). Then, since the probability of a number between y and y + dy must equal the probability of a number between x and x + dx,
    p(y)|dy| = p(x)|dx|,  so  p(y) = p(x) |dx/dy|
(the modulus because probability must be positive).
106. Monte Carlo Methods
- Transformed random variables:
- Suppose we have x ~ U[0,1].
- Let y = a + (b - a)x.
- Then p(y) = 1/(b - a) for a < y < b: y is uniform on (a,b).
Numerical Recipes uses the transformation method to provide deviates from other standard pdfs.
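A sketch of this particular transformation:

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 2.0, 5.0

# y = a + (b - a) x maps x ~ U[0,1] to y ~ U[a,b]:
# p(y) = p(x) |dx/dy| = 1/(b - a) on (a, b).
x = rng.uniform(0, 1, 100_000)
y = a + (b - a) * x
print(y.min(), y.max(), y.mean())   # mean ~ (a + b)/2 = 3.5
```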
107. Monte Carlo Methods
- 3. Probability integral transform:
- Suppose we can compute the CDF of some desired random variable, F(x) = Prob(x' < x).
- 1) Sample u ~ U[0,1].
- 2) Compute x = F⁻¹(u).
- Then x is drawn from the desired pdf p(x).
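A sketch using the exponential pdf, whose CDF inverts analytically (the choice of pdf is mine):

```python
import numpy as np

rng = np.random.default_rng(4)

# Exponential pdf p(x) = lam e^(-lam x) has CDF F(x) = 1 - e^(-lam x),
# so x = -ln(1 - u)/lam with u ~ U[0,1] has the desired pdf.
lam = 2.0
u = rng.uniform(0, 1, 100_000)
x = -np.log(1 - u) / lam
print(x.mean())   # ~ 1/lam = 0.5
```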
108. Monte Carlo Methods
4. Rejection sampling: suppose we want to sample from some pdf p(x), and we know that p(x) ≤ q(x), where q(x) is a simpler pdf called the proposal distribution:
- Sample x1 from q(x)
- Sample y ~ U[0, q(x1)]
- If y < p(x1): ACCEPT
- otherwise: REJECT
The set of accepted values {x_i} is a sample from p(x).
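A sketch with an assumed target: an unnormalised Gaussian bump on (0,1), enveloped by the constant q(x) = 1 (a uniform proposal):

```python
import numpy as np

rng = np.random.default_rng(5)

def p(x):
    """Unnormalised target pdf on (0,1); p(x) <= 1 = q(x) everywhere."""
    return np.exp(-0.5 * ((x - 0.3) / 0.2) ** 2)

x1 = rng.uniform(0, 1, 100_000)        # sample x1 from q (uniform proposal)
y = rng.uniform(0, 1, 100_000)         # y ~ U[0, q(x1)], with q(x1) = 1
accepted = x1[y < p(x1)]               # keep x1 where y < p(x1)
print(len(accepted), accepted.mean())  # accepted values sample p; mean near 0.3
```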
109. Monte Carlo Methods
4. Rejection sampling (cont.): the method can be very slow if the shaded region between q(x) and p(x) is too large, particularly in high-N problems such as the LISA problem considered earlier.
(Figure: a target pdf under a loose envelope, with a large rejected region.)
110. Monte Carlo Methods
- 5. Metropolis-Hastings algorithm
- Speed this up by letting the proposal distribution depend on the current sample:
- Sample an initial point x0.
- Sample a tentative new state x' from q(x'|x_n) (e.g. a Gaussian centred on x_n).
- Compute the ratio a = p(x')/p(x_n).
- If a ≥ 1: accept.
- Otherwise: accept with probability a.
The sequence {x_n} is a Markov chain. Note that rejected samples appear in the chain as repeated values of the current state.
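A minimal sampler along these lines (a sketch; the standard-normal target and step size are assumptions for illustration, and the symmetric Gaussian proposal makes the proposal ratio cancel):

```python
import numpy as np

rng = np.random.default_rng(6)

def log_p(x):
    """Unnormalised log-target; a standard normal for illustration."""
    return -0.5 * x**2

n_steps, step = 50_000, 1.0
chain = np.empty(n_steps)
x = 0.0                                   # initial point
for i in range(n_steps):
    x_new = x + rng.normal(0, step)       # symmetric Gaussian proposal
    a = np.exp(log_p(x_new) - log_p(x))   # acceptance ratio p(x')/p(x)
    if a >= 1 or rng.uniform() < a:       # accept, or accept with probability a
        x = x_new
    chain[i] = x                          # a rejection repeats the current state

print(chain.mean(), chain.std())          # ~0 and ~1 for the standard normal
```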
111. Monte Carlo Methods
A histogram of the contents of the chain for a parameter converges to the marginal pdf for that parameter. In this way, high-dimensional posteriors can be explored and marginalised to return parameter ranges and interrelations inferred from complex data sets (e.g., the WMAP results).
http://www.statslab.cam.ac.uk/mcmc/pages/links.html
113. The end... nearly
114. Final thoughts
- Things to keep in mind:
- Priors are fundamental, but not always influential. If you have good data, most reasonable priors return very similar posteriors.
- Degree of belief is subjective but not arbitrary: two people with the same degree of belief agree about the value of the corresponding probability.
- Write down the probability of everything, and marginalise over those parameters that are of no interest.
- Don't pre-filter if you can help it. Work with the data, not statistics of the data.
115. Bayesian bibliography for astronomy
- There is an increasing number of good books and articles on Bayesian methods. Here are just a few:
- E.T. Jaynes, Probability Theory: The Logic of Science, CUP, 2003
- D.J.C. MacKay, Information Theory, Inference and Learning Algorithms, CUP, 2004
- D.S. Sivia, Data Analysis: A Bayesian Tutorial, OUP, 1996
- T.J. Loredo, "From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics", 1990; copy at http://bayes.wustl.edu/gregory/articles.pdf
- G.L. Bretthorst, Bayesian Spectrum Analysis and Parameter Estimation, 1988; copy at http://bayes.wustl.edu/glb/book.pdf
- G. D'Agostini, "Bayesian reasoning in high-energy physics: principles and applications", http://preprints.cern.ch/cgi-bin/setlink?base=cernrep&categ=Yellow_Report&id=99-03
- Soon to appear: P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, CUP, 2005
- Review sites:
- http://bayes.wustl.edu/
- http://www.bayesian.org/