Title: Statistical inference for astrophysics
1. Statistical inference for astrophysics
- A short course for astronomers
- Cardiff University, 16-17 Feb 2005
Graham Woan, University of Glasgow
2. Lecture Plan
- Why statistical astronomy?
- What is probability?
- Estimating parameters values
- the Bayesian way
- the Frequentist way
- Testing hypotheses
- the Bayesian way
- the Frequentist way
- Assigning Bayesian probabilities
- Monte-Carlo methods
Lectures 1-2
Lectures 3-4
3. Why statistical astronomy?
Generally, statistics has got a bad reputation:
"There are three kinds of lies: lies, damned lies and statistics."
Benjamin Disraeli, as popularised by Mark Twain
Often for good reason:
"Jun 3rd 2004: two researchers at the University of Girona in Spain have found that 38% of a sample of papers in Nature contained one or more statistical errors."
The Economist
4. Why statistical astronomy?
Data analysis methods are often regarded as simple recipes:
http://www.numerical-recipes.com/
5. Why statistical astronomy?
Data analysis methods are often regarded as simple recipes... but in analysis, as in life, sometimes the recipes don't work as you expect:
- Low number counts
- Distant sources
- Correlated residuals
- Incorrect assumptions
- Systematic errors
...and/or misleading results.
6. Why statistical astronomy?
...and the tools can be clunky:
"The trouble is that what we statisticians call modern statistics was developed under strong pressure on the part of biologists. As a result, there is practically nothing done by us which is directly applicable to problems of astronomy."
Jerzy Neyman, founder of frequentist hypothesis testing
7. Why statistical astronomy?
For example, we can observe only the one Universe.
(Figure from Bennett et al. 2003.)
8. The Astrophysicist's Shopping List
We want tools capable of:
- dealing with very faint sources
- handling very large data sets
- correcting for selection effects
- diagnosing systematic errors
- avoiding unnecessary assumptions
- estimating parameters and testing models
9. Why statistical astronomy?
Key question:
How do we infer properties of the Universe from
incomplete and imprecise astronomical data?
Our goal:
To make the best inference, based on our observed
data and any prior knowledge, reserving the right
to revise our position if new information comes
to light.
Let's come to this problem afresh with an astrophysicist's eye, bypassing some of the jargon of orthodox statistics and going right back to plain probability.
10. Right-thinking gentlemen 1
"A decision was wise, even though it led to disastrous consequences, if the evidence at hand indicated it was the best one to make; and a decision was foolish, even though it led to the happiest possible consequences, if it was unreasonable to expect those consequences... We should do the best with what we have, not what we wished we had."
Herodotus, c. 500 BC
11. Right-thinking gentlemen 2
"Probability theory is nothing but common sense reduced to calculation."
Pierre-Simon Laplace (1749-1827)
12. Right-thinking gentlemen 3
Occam's Razor:
"Frustra fit per plura, quod fieri potest per pauciora." ("It is vain to do with more what can be done with less.")
Everything else being equal, we favour models which are simple.
William of Occam (1288-1348 AD)
13. The meaning of probability
- Definitions, algebra and useful distributions
14. But what is probability?
- There are three popular interpretations of the word, each with an interesting history:
  - Probability as a measure of our degree of belief in a statement
  - Probability as a measure of limiting relative frequency of outcome of a set of identical experiments
  - Probability as the fraction of favourable (equally likely) possibilities
- We will call these the Bayesian, Frequentist and Combinatorial interpretations.
- Note there are signs of trouble here:
  - How do you quantify degree of belief?
  - How do you define relative frequency without using ideas of probability?
  - What does "equally likely" mean?
- Thankfully, at some level, all three interpretations agree on the algebra of probability, which we will present in Bayesian terms.
15. Algebra of (Bayesian) probability
- The probability of a statement, such as "y = 3", "the mass of a neutron star is 1.4 solar masses" or "it will rain tomorrow", is a number between 0 and 1, such that:
- For some statement X, where the bar denotes the negation of the statement,
    p(X) + p(X̄) = 1  -- the Sum Rule.
- If there are two statements X and Y, then the joint probability is
    p(X,Y) = p(X|Y) p(Y),
  where the vertical line denotes the conditional statement "X given Y is true" -- the Product Rule.
16. Algebra of (Bayesian) probability
- From these we can deduce that
    p(X + Y|I) = p(X|I) + p(Y|I) - p(X,Y|I),
  where "+" denotes "or" and I represents common background information -- the Extended Sum Rule.
- And, because p(X,Y) = p(Y,X), we get
    p(X|Y,I) = p(Y|X,I) p(X|I) / p(Y|I),
  which is called Bayes' Theorem.
- Note that these results are also applicable in Frequentist probability theory, with a suitable change in meaning of p.
17. Algebra of (Bayesian) probability
Bayes' theorem is the appropriate rule for updating our degree of belief when we have new data:
    p(θ|d,I) = p(d|θ,I) p(θ|I) / p(d|I)
    posterior = likelihood × prior / evidence
Note that the word "evidence" is sometimes used for something else (the log odds). We will stick to the p(d|I) definition here.
18. Algebra of (Bayesian) probability
- We can also deduce the marginal probabilities. If X and Y are propositions that can take on values drawn from {x1, ..., xn} and {y1, ..., ym}, then
    p(X|I) = Σ_Y p(X,Y|I) = Σ_Y p(X|Y,I) p(Y|I).
  This gives us the probability of X when we don't care about Y. In these circumstances, Y is known as a nuisance parameter.
- All these relationships can be smoothly extended from discrete probabilities to probability densities, e.g.
    p(x|I) = ∫ p(x,y|I) dy,
  where p(y) dy is the probability that y lies in the range y to y + dy.
19. Algebra of (Bayesian) probability
These equations are the key to Bayesian inference, the methodology upon which (much) astronomical data analysis is now founded. There is a clear introduction by Devinder Sivia (OUP); a fuller bibliography follows tomorrow.
See also the free book by Prasenjit Saha (QMW, London): http://ankh-morpork.maths.qmw.ac.uk/~saha/book
20. Example
- A gravitational wave detector may have seen a type II supernova as a burst of gravitational radiation. Burst-like signals can also come from instrumental glitches, and only 1 in 10,000 bursts is really a supernova, so the data are checked for glitches by examining veto channels. The test is expected to confirm the burst is astronomical in 95% of cases in which it truly is, and in 1% when it truly is not. The burst passes the veto test! What is the probability we have seen a supernova?
Answer: Denote the alternatives "supernova" (S) and "glitch" (G), and let I represent the information that the burst seen is typical of those used to deduce the information in the question. Then we are told that:
    Priors for S and G:       p(S|I) = 0.0001,  p(G|I) = 0.9999
    Likelihoods for S and G:  p(pass|S,I) = 0.95,  p(pass|G,I) = 0.01
21. Example (cont.)
- But we want to know the probability that it's a supernova, given it passed the veto test. By Bayes' theorem,
    p(S|pass,I) = p(pass|S,I) p(S|I) / p(pass|I),
  and we are directly told everything on the rhs except p(pass|I), the probability that any burst candidate would pass the veto test. If the burst is either a supernova or a hardware glitch then we can marginalise over these alternatives:
    p(pass|I) = p(pass|S,I) p(S|I) + p(pass|G,I) p(G|I) = 0.95 × 0.0001 + 0.01 × 0.9999 ≈ 0.0101,
  so p(S|pass,I) ≈ 0.01.
22. Example (cont.)
- So, despite passing the test, there is only a 1% probability that the burst is a supernova! Veto tests have to be blisteringly good if supernovae are rare.
- Why? Because most bursts that pass the test are just instrumental glitches. It really is just common sense reduced to calculation.
- Note however that by passing the test, the probability that this burst is from a supernova has increased by a factor of 100 (from 0.0001 to 0.01).
- Moral: the probability that a supernova burst gets through the veto (0.95) is not the probability that it's a supernova burst if it gets through the veto (0.01).
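The arithmetic is easy to check numerically. A minimal sketch (the course gives no code; Python and the variable names are my choices), using the numbers quoted above:

```python
# Veto-test example: p(S) = 1e-4, p(G) = 0.9999,
# p(pass|S) = 0.95, p(pass|G) = 0.01.
p_S, p_G = 1e-4, 0.9999          # priors: supernova vs. glitch
p_pass_S, p_pass_G = 0.95, 0.01  # likelihoods of passing the veto

# Marginalise over the alternatives to get p(pass), then apply Bayes' theorem.
p_pass = p_pass_S * p_S + p_pass_G * p_G
p_S_given_pass = p_pass_S * p_S / p_pass
print(p_S_given_pass)            # ~0.0094, i.e. about 1%
```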
23. ...the basis of frequentist probability
24. Basis of frequentist probability
- Bayesian probability theory is simultaneously a very old and a very young field:
  - Old: the original interpretation of Bernoulli, Bayes and Laplace
  - Young: the state of the art in (astronomical) data analysis
- But BPT was rejected for several centuries, because probability as degree of belief was seen as too subjective
  → the Frequentist approach.
25. Basis of frequentist probability
Probability = long-run relative frequency of an event. It appears at first that this can be measured objectively, e.g. by rolling a die. What is p(6)? If the die is fair we expect p(1) = p(2) = ... = p(6) = 1/6. These probabilities are fixed (but unknown) numbers. We can imagine rolling the die M times. The number rolled is a random variable: a different outcome each time.
26. Basis of frequentist probability
- We define p(i) as the limiting relative frequency of face i as the number of rolls M → ∞. If the die is fair, p(i) = 1/6.
- But the objectivity is an illusion:
  - assumes each outcome is equally likely (i.e. equally probable)
  - also assumes an infinite series of identical trials.
- What can we say about the fairness of the die after (say) 5 rolls, or 10, or 100?
27. Basis of frequentist probability
- In the frequentist approach, a lot of mathematical machinery is specifically defined to let us address frequency questions:
  - We take the data as a random sample of size M, drawn from an assumed underlying pdf.
  - Sampling distribution: derived from the underlying pdf and M.
  - Estimator: a function of the sample data that is used to estimate properties of the pdf.
- But how do we decide what makes an acceptable estimator?
28. Basis of frequentist probability
- Example: measuring a galaxy redshift.
- Let the true redshift be z0 -- a fixed but unknown parameter. We use two telescopes to estimate z0, and compute sampling distributions for the estimators ẑ1 and ẑ2 by modelling the errors.
- Small telescope (low dispersion spectrometer):
  - Unbiased: repeat the observation a large number of times → the average estimate equals z0.
  - BUT the variance of ẑ1 is large, due to the low dispersion.
29. Basis of frequentist probability
- Example: measuring a galaxy redshift.
- Let the true redshift be z0 -- a fixed but unknown parameter. We use two telescopes to estimate z0, and compute sampling distributions for the estimators ẑ1 and ẑ2 by modelling the errors.
- Large telescope (high dispersion spectrometer, but a faulty astronomer! e.g. wrong calibration):
  - Biased, BUT the variance of ẑ2 is small.
- Which is the better estimator?
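One way to make the trade-off concrete is to simulate both telescopes and compare mean squared errors. A sketch with illustrative numbers (z0, the dispersions and the bias are assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
z0 = 0.3          # true redshift (illustrative value only)
M = 100_000       # number of imagined repeated observations

# Small telescope: unbiased but noisy (large sigma).
z_small = rng.normal(loc=z0, scale=0.05, size=M)
# Large telescope: precise (small sigma) but miscalibrated (biased).
z_large = rng.normal(loc=z0 + 0.02, scale=0.005, size=M)

for name, z in [("small", z_small), ("large", z_large)]:
    bias = z.mean() - z0
    mse = np.mean((z - z0) ** 2)   # mean squared error = bias^2 + variance
    print(f"{name}: bias={bias:+.4f}, std={z.std():.4f}, mse={mse:.6f}")
```

With these numbers the biased, high-dispersion telescope still has the smaller mean squared error, which is one answer to "which is better?".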
30. Basis of frequentist probability
What about the sample mean?
- Let {x1, ..., xM} be a random sample from a pdf with mean μ and variance σ². Then the sample mean is
    x̄ = (1/M) Σ_i x_i.
- Can show that E(x̄) = μ -- an unbiased estimator.
- But bias is defined formally in terms of an infinite set of randomly chosen samples, each of size M.
- What can we say with a finite number of samples, each of finite size? Before that...
31. Some important probability distributions: quick definitions
32. Some important pdfs: discrete case
- Poisson pdf:
    p(n|λ) = λⁿ e^(-λ) / n!
- e.g. number of photons/second counted by a CCD, number of galaxies/degree² counted by a survey.
n = number of detections. The Poisson pdf assumes detections are independent, and that there is a constant rate λ. Can show that E(n) = λ and var(n) = λ.
33. Some important pdfs: discrete case
(Figure: example Poisson distributions for the photon-counting and galaxy-survey examples above.)
34. Some important pdfs: discrete case
- Binomial pdf: the number of successes from N observations, for two mutually exclusive outcomes ("Heads" and "Tails"):
    p(r|θ,N) = [N! / (r!(N-r)!)] θʳ (1-θ)^(N-r)
- e.g. number of binary stars, Seyfert galaxies, supernovae...
r = number of successes; θ = probability of success for a single observation. Can show that E(r) = Nθ and var(r) = Nθ(1-θ).
35. Some important pdfs: continuous case
36. Some important pdfs: continuous case
- Central, or Normal pdf (also known as Gaussian):
    p(x) = (1/√(2πσ²)) exp[-(x-μ)²/(2σ²)]
(Figure: the bell-shaped p(x) versus x.)
37. Cumulative distribution function (CDF)
    CDF(a) = Prob(x < a) = ∫_{-∞}^{a} p(x) dx
38. Measures and moments of a pdf
The nth moment of a pdf is defined as:
    Discrete case:    ⟨xⁿ⟩ = Σ_i x_iⁿ p(x_i)
    Continuous case:  ⟨xⁿ⟩ = ∫ xⁿ p(x) dx
39. Measures and moments of a pdf
The 1st moment is called the mean or expectation value:
    Discrete case:    ⟨x⟩ = Σ_i x_i p(x_i)
    Continuous case:  ⟨x⟩ = ∫ x p(x) dx
40. Measures and moments of a pdf
The 2nd moment is called the mean square:
    Discrete case:    ⟨x²⟩ = Σ_i x_i² p(x_i)
    Continuous case:  ⟨x²⟩ = ∫ x² p(x) dx
41. Measures and moments of a pdf
The variance is defined as ⟨x²⟩ - ⟨x⟩², and is often written σ². σ is called the standard deviation.
    Discrete case:    var(x) = Σ_i (x_i - ⟨x⟩)² p(x_i)
    Continuous case:  var(x) = ∫ (x - ⟨x⟩)² p(x) dx
In general, var(x) = ⟨(x - ⟨x⟩)²⟩ = ⟨x²⟩ - ⟨x⟩².
42. Measures and moments of a pdf
    pdf                      mean       variance
    Poisson (discrete)       λ          λ
    Binomial (discrete)      Nθ         Nθ(1-θ)
    Uniform on (a,b)         (a+b)/2    (b-a)²/12
    Normal (continuous)      μ          σ²
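A quick numerical check of this table (a Python sketch; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, N, theta, a, b, mu, sigma = 4.0, 20, 0.3, 0.0, 2.0, 1.0, 0.5

# Draw large samples and compare sample moments with the table above.
samples = {
    "Poisson":  (rng.poisson(lam, 10**6),       lam,     lam),
    "Binomial": (rng.binomial(N, theta, 10**6), N*theta, N*theta*(1-theta)),
    "Uniform":  (rng.uniform(a, b, 10**6),      (a+b)/2, (b-a)**2/12),
    "Normal":   (rng.normal(mu, sigma, 10**6),  mu,      sigma**2),
}
for name, (x, mean, var) in samples.items():
    print(f"{name:8s} mean {x.mean():.3f} (expect {mean:.3f}), "
          f"var {x.var():.3f} (expect {var:.3f})")
```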
43. Measures and moments of a pdf
The median divides the CDF into two equal halves:
    Prob(x < x_med) = Prob(x > x_med) = 0.5
44. Measures and moments of a pdf
The mode is the value of x for which the pdf is a maximum.
(Figure: p(x) versus x, with the peak marked.)
For a Normal pdf, mean = median = mode.
45. Parameter estimation
46. Bayesian parameter estimation
In the Bayesian approach, we can test our model in the light of our data (i.e. rolling the die) and see how our degree of belief in its "fairness" evolves, for any sample size, considering only the data that we did actually observe:
    posterior ∝ likelihood × prior
    (what we know now) ∝ (influence of our observations) × (what we knew before)
47. Bayesian parameter estimation
Astronomical example 1: probability that a galaxy is a Seyfert 1.
We want to know the fraction of Seyfert galaxies that are type 1. How large a sample do we need to reliably measure this? Model as a binomial pdf, with θ = the global fraction of Seyfert 1s. Suppose we sample N Seyferts and observe r Seyfert 1s. Then
    p(r|θ,N) ∝ θʳ (1-θ)^(N-r)
is the probability of obtaining the observed data, given the model -- the likelihood of θ.
48. Bayesian parameter estimation
- But what do we choose as our prior? This has been the source of much argument between Bayesians and frequentists, though it is often not that important.
- We can sidestep this for a moment, realising that if our data are good enough, the choice of prior shouldn't matter: in
    posterior ∝ likelihood × prior,
  the likelihood dominates.
49. Bayesian parameter estimation
Consider a simulation of this problem using two different priors p(θ|I):
- a flat prior: all values of θ equally probable
- a normal prior, peaked at θ = 0.5.
50-62. The posterior p(θ|data, I), under each prior, as the sample grows:
- After observing 0 galaxies
- After observing 1 galaxy: S1
- After observing 2 galaxies: S1 S1
- After observing 3 galaxies: S1 S1 S2
- After observing 4 galaxies: S1 S1 S2 S2
- After observing 5 galaxies: S1 S1 S2 S2 S2
- After observing 10 galaxies: 5 S1, 5 S2
- After observing 20 galaxies: 7 S1, 13 S2
- After observing 50 galaxies: 17 S1, 33 S2
- After observing 100 galaxies: 32 S1, 68 S2
- After observing 200 galaxies: 59 S1, 141 S2
- After observing 500 galaxies: 126 S1, 374 S2
- After observing 1000 galaxies: 232 S1, 768 S2
(Each slide plots the two posteriors, which narrow and converge as the data accumulate.)
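The whole sequence can be reproduced with a simple grid calculation. A sketch, assuming a width of 0.1 for the normal prior (the slides do not state its width):

```python
import numpy as np

# Gridded posterior for theta after observing r Seyfert 1s out of N Seyferts.
theta = np.linspace(0, 1, 1001)
dtheta = theta[1] - theta[0]
priors = {
    "flat":   np.ones_like(theta),
    "normal": np.exp(-0.5 * ((theta - 0.5) / 0.1) ** 2),  # width assumed
}

def posterior(r, N, prior):
    post = theta**r * (1 - theta)**(N - r) * prior  # binomial likelihood x prior
    return post / (post.sum() * dtheta)             # normalise on the grid

# e.g. the slide with N = 100 galaxies, r = 32 Seyfert 1s:
for name, prior in priors.items():
    p = posterior(32, 100, prior)
    print(name, "posterior mean:", (theta * p).sum() * dtheta)
```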
63. Bayesian parameter estimation
What do we learn from all this?
- As our data improve (e.g. our sample increases), the posterior pdf narrows and becomes less sensitive to our choice of prior.
- The posterior conveys our (evolving) degree of belief in different values of θ, in the light of our data.
- If we want to express our result as a single number we could perhaps adopt the mean, median, or mode.
- We can use the variance of the posterior pdf to assign an uncertainty for θ.
- It is very straightforward to define confidence intervals.
64. Bayesian parameter estimation
Bayesian confidence intervals: choose θ1 and θ2 such that 95% of the area under the posterior pdf p(θ|data, I) lies between them. Then we are 95% sure that θ lies between θ1 and θ2. Note the confidence interval is not unique, but we can define the shortest C.I.
(Figure: the posterior, with the interval [θ1, θ2] enclosing 95% of the area.)
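The shortest interval is easy to extract from a gridded posterior such as the one in the previous sketch. A minimal helper, assuming a unimodal, normalised posterior on a uniform grid:

```python
import numpy as np

def shortest_interval(x, post, frac=0.95):
    """Shortest credible interval from a gridded, normalised posterior:
    rank grid points by density and keep the densest ones until the
    enclosed probability reaches `frac`."""
    dx = x[1] - x[0]
    order = np.argsort(post)[::-1]            # highest density first
    cum = np.cumsum(post[order]) * dx
    keep = order[: np.searchsorted(cum, frac) + 1]
    return x[keep].min(), x[keep].max()
```

With the θ grid and posterior from the Seyfert sketch, `shortest_interval(theta, p)` returns (θ1, θ2).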
65. Bayesian parameter estimation
Astronomical example 2: flux density of a GRB.
Take gamma ray bursts to be equally luminous events, distributed homogeneously in the Universe. We see three gamma ray photons from a GRB in an interval of 1 s. What is the flux of the source, F?
- The seat-of-the-pants answer is F = 3 photons/s, with an uncertainty of about √3, but we can do better than that by including our prior information on luminosity and homogeneity. Call this background information I.
Homogeneity implies that the probability the source is in any particular volume of space is proportional to the volume, so the prior probability that the source is in a thin shell of radius r is
    p(r|I) dr ∝ r² dr.
66. Bayesian parameter estimation
- But the sources have a fixed luminosity, L, so r and F are directly related by F = L/(4πr²).
- The prior on F is therefore
    p(F|I) ∝ F^(-5/2).
  Interpretation: low flux sources are intrinsically more probable, as there is more space for them to sit in.
- We now apply Bayes' theorem to determine the posterior for F after seeing n photons:
    p(F|n,I) ∝ p(n|F,I) p(F|I).
(Figure: the steeply falling prior p(F|I) versus F.)
67. Bayesian parameter estimation
- The likelihood for F comes from the Poisson nature of photon counting: in time T,
    p(n|F,I) = (FT)ⁿ e^(-FT) / n!,
  so finally
    p(F|n,I) ∝ F^(n-5/2) e^(-FT).
  For n = 3 we get a posterior with the most probable value of F equalling 0.5 photons/s.
- Clearly it is more probable this is a distant source from which we have seen an unusually high number of photons than it is an unusually nearby source from which we have seen an expected number of photons. (The most probable value of F is (n - 5/2)/T, approaching n/T for n ≫ 1.)
(Figure: p(F|n=3,I) versus F, with the most probable value well below F = n/T.)
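The same posterior on a grid (a sketch; the grid limits are my choice):

```python
import numpy as np

# Posterior p(F|n,I) ∝ F^(n-5/2) e^(-FT):
# Poisson likelihood times the F^(-5/2) homogeneity prior.
n, T = 3, 1.0                       # 3 photons in T = 1 s
F = np.linspace(1e-4, 15, 100_000)  # flux grid, photons/s
dF = F[1] - F[0]

post = F**(n - 2.5) * np.exp(-F * T)
post /= post.sum() * dF             # normalise on the grid

print("most probable F:", F[np.argmax(post)])  # ~(n - 5/2)/T = 0.5
print("posterior mean F:", (F * post).sum() * dF)
```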
68. Parameter estimation
69. Frequentist parameter estimation
- Recall that in frequentist (orthodox) statistics, probability is the limiting relative frequency of an outcome, so only random variables can have frequentist probabilities, as only these show variation with repeated measurement. So we can't talk about the probability of a model parameter, or of a hypothesis. E.g., a measurement of a mass is an RV, but the mass itself is not.
- So no orthodox probabilistic statement can be interpreted as directly referring to the parameter in question! For example, orthodox confidence intervals do not indicate the range in which we are confident the parameter value lies. That's what Bayesian intervals do.
- So what do they mean?
70. Frequentist parameter estimation
- Orthodox parameter estimation proceeds as follows. Imagine we have some data, {d_i}, that we want to use to estimate the value of a parameter, a, in a model. These data must depend on the true value of a in a known way, but as they are random variables all we can say is that we know, or can estimate, p(d_i|a) for some given a.
- 1. Use the data to construct a statistic (i.e. a function of the data) that can be used as an estimator for a, called â. A good estimator will have a pdf that depends heavily on a, and which is sharply peaked at, or very close to, a.
(Figure: the sampling pdf of â, with the measured value of â marked.)
71. Frequentist parameter estimation
- 2. One such estimator is the maximum likelihood (ML) estimator, constructed in the following way: given the distribution from which the data are drawn, p(d|a), construct the sampling distribution p({d_i}|a), which is just p(d1|a)·p(d2|a)·p(d3|a)... if the data are independent. Interpret this as a function of a, called the likelihood of a, and call the value of a that maximises it for the given data â. The corresponding sampling distribution, p({d_i}|â), is the one from which the data were most likely drawn.
72. Frequentist parameter estimation
- ...but remember that this â is just one value of the estimator, albeit drawn from a population that has an attractive average behaviour.
- And we still haven't said anything about a.
- 3. Define a confidence interval around a enclosing, say, 95% of the expected values of â from repeated experiments.
(Figure: the sampling pdf of â around a, with the measured value of â marked.)
73. Frequentist parameter estimation
- The half-width δ of this interval may be known from the sampling pdf, or it may have to be estimated too.
- 4. We can now say that p(a - δ < â < a + δ) = 0.95. Note this is a probabilistic statement about the estimator, not about a. However, this expression can be restated: 0.95 is the relative frequency with which the statement "a lies in the region [â - δ, â + δ]" is true, over many repeated experiments.
(Figure: many repeated experiments, each generating an interval; most intervals contain a ("Yes"), some do not ("No").)
The disastrous shorthand for this is p(a ∈ [â - δ, â + δ]) = 0.95. Note that this is not a statement about our degree of belief that a lies in the numerical interval generated in our particular experiment.
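The long-run reading of that statement can be demonstrated by simulation. A sketch with assumed ingredients (Gaussian data of known σ, the sample mean as estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, sigma, M, trials = 10.0, 2.0, 5, 100_000

# Simulate many repeated experiments, each yielding M data points.
d = rng.normal(a_true, sigma, size=(trials, M))
a_hat = d.mean(axis=1)              # the estimator: the sample mean
delta = 1.96 * sigma / np.sqrt(M)   # half-width of a 95% interval

# Frequency with which "a lies in [a_hat - delta, a_hat + delta]" is true:
covered = np.abs(a_hat - a_true) < delta
print(covered.mean())               # ~0.95: a statement about the procedure,
                                    # not about a in any one experiment
```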
74. Example: B and F compared
- Two gravitational wave (GW) detectors see 7 simultaneous burst-like signals in the data from one hour, consisting of GW signals and spurious signals. When the two sets of data are offset in time by 10 minutes there are 9 simultaneous signals. What is the true GW burst rate? (Note: no need to be an expert to realise there is not much GW activity here!)
- A frequentist solution: take the events as Poisson, characterised by the background rate, rb, and the GW rate, rg. We get
    r̂g = 7 - 9 = -2 per hour,  σ² = 7 + 9 = 16,
  where σ² is the variance of the sampling distribution.
75. Example: B and F compared
- So we would quote our result as rg = -2 ± 4 (1σ).
- But burst rates can't be negative! What does this mean? In fact it is quite consistent with our definition of a frequentist confidence interval. Our value of r̂g is drawn from its sampling distribution, which will look something like a broad peak of width σ = 4 about the true rate, with our particular sample falling in its lower tail.
(Figure: the sampling distribution of r̂g, with our particular sample marked.)
So, in the shorthand of the subject, this result is quite correct.
76. Example: B and F compared
- The Bayesian solution: we'll go through this carefully. Our job is to determine rg. In Bayesian terms that means we are after p(rg|data). We start by realising there are two experiments here: one to determine the background rate and one to determine the joint rate, so we will also determine p(rb|data). If the background count comes from a Poisson process of mean rb, then the probability of n events is
    p(n|rb) = rbⁿ e^(-rb) / n!,
  which is our Bayesian likelihood for rb. We will choose a prior probability for rb proportional to 1/rb. This is the scale invariant prior that encapsulates total ignorance of the non-zero rate (of course in reality we may have something that constrains rb more tightly a priori). See later.
77. Example: B and F compared
- So our posterior for rb, based on the background counts, is
    p(rb|n) ∝ (1/rb) × rbⁿ e^(-rb) = rb^(n-1) e^(-rb),
  which we normalise and evaluate for n = 9.
- The probability of seeing m coincident bursts, given the two rates, is
    p(m|rg,rb) = (rg + rb)ᵐ e^(-(rg+rb)) / m!.
(Figure: the background-rate posterior p(rb|n=9) versus rb, peaking near rb ≈ 9.)
78. Example: B and F compared
- And our joint posterior for the rates is, by Bayes' theorem,
    p(rg,rb|m,n) ∝ p(m|rg,rb) × p(rg,rb|n)  (likelihood × prior).
  The joint prior in the above can be split into a probability for rb, which we have just evaluated, and a prior for rg. This may be zero, so we will say that we are equally ignorant over whether to expect 0, 1, 2, ... counts from this source. This translates into a uniform prior on rg, so our joint prior is
    p(rg,rb|n) ∝ p(rb|n).
  Finally we get the posterior on rg by marginalising over rb:
    p(rg|m,n) ∝ ∫ p(m|rg,rb) p(rb|n) drb.
79. Example: B and F compared
...giving our final answer to the problem: the posterior p(rg|n=9, m=7).
(Figure: p(rg|n=9, m=7) versus rg.)
Compare with our frequentist result, rg = -2 ± 4 to 1σ. Note that p(rg < 0) = 0, due to the prior, and that the true value of rg could very easily be as high as 4 or 5.
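The marginalisation is a one-line sum on a grid. A sketch of the whole calculation for n = 9, m = 7 (the grids and ranges are my choices):

```python
import numpy as np
from math import factorial

n, m = 9, 7                              # background count, zero-lag count
rb = np.linspace(1e-3, 30, 2000)         # background-rate grid
rg = np.linspace(0, 15, 1501)            # GW-rate grid
drb, drg = rb[1] - rb[0], rg[1] - rg[0]

# p(rb|n) ∝ rb^(n-1) e^(-rb): the 1/rb prior times the Poisson likelihood.
p_rb = rb**(n - 1) * np.exp(-rb)
p_rb /= p_rb.sum() * drb

# Joint-count likelihood p(m|rg,rb) on the grid, with a uniform prior on rg.
R = rg[:, None] + rb[None, :]
like = R**m * np.exp(-R) / factorial(m)

# Marginalise over the nuisance parameter rb to get p(rg|n,m).
p_rg = (like * p_rb[None, :]).sum(axis=1) * drb
p_rg /= p_rg.sum() * drg

print("mode:", rg[np.argmax(p_rg)])
print("p(rg > 4):", p_rg[rg > 4].sum() * drg)
```

Rerunning with n = 3 reproduces the next slide's case.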
80. Example: B and F compared
Let's see how this result would change if the background count was 3, rather than 9 (joint count still 7).
(Figure: the posterior p(rg|n=3, m=7) versus rg.)
Again, this looks very reasonable.
81. Bayesian hypothesis testing
- And why we already know how to do it
82. Bayesian hypothesis testing
- In Bayesian analysis, hypothesis testing can be performed as a generalised application of Bayes' theorem. Generally a hypothesis differs from a parameter in that many values of the parameter(s) are consistent with one hypothesis. Hypotheses are models that depend on parameters.
- Note however that we cannot define the probability of one hypothesis given some data, d, without defining all the alternatives in its class, i.e. the complete set {H_i} for which
    Σ_i p(H_i|d) = 1,
  and often this is impossible. So questions like "do the data fit a gaussian?" are not well-formed until you list all the curves the data could fit. A well-formed question would be "do the data fit a gaussian better than these other curves?", or more usually "do the data fit a gaussian better than a lorentzian?".
- Simple comparisons can be expressed as an odds ratio,
    O = p(H1|d,I) / p(H2|d,I).
83. Bayesian hypothesis testing
- The odds ratio can be divided into the prior odds and the Bayes factor:
    O = [p(H1|I)/p(H2|I)] × [p(d|H1,I)/p(d|H2,I)].
- The prior odds simply express our prior preference for H1 over H2, and are set to 1 if you are indifferent.
- The Bayes factor is just the ratio of the evidences, as defined in the earlier lectures. Recall that for a model that depends on a parameter a,
    p(d|H,I) = ∫ p(d|a,H,I) p(a|H,I) da,
  so the evidence is simply the joint probability of the parameter(s) and the data, marginalised over all hypothesis parameter values.
84. Bayesian hypothesis testing
- Example: a spacecraft is sent to a moon of Saturn and, using a penetrating probe, detects a liquid sea deep under the surface at 1 atmosphere pressure and a temperature of -3 °C. However, the thermometer has a fault, so that the temperature reading may differ from the true temperature by as much as ±5 °C, with a uniform probability within this range. Determine the temperature of the liquid, assuming it is water (liquid within 0 < T < 100 °C) and then assuming it is ethanol (liquid within -80 < T < 80 °C). What are the odds of it being ethanol?
(based loosely on a problem by John Skilling)
(Figure: the measurement at -3 °C, against the liquid ranges of ethanol, -80 to 80 °C, and water, 0 to 100 °C.)
85. Bayesian hypothesis testing
- Call the water hypothesis H1 and the ethanol hypothesis H2.
For H1: the prior on the temperature is
    p(T|H1) = 1/100 for 0 < T < 100 °C, and zero otherwise.
The likelihood of the temperature is the probability of the data d, given the temperature:
    p(d|T) = 1/10 for |d - T| < 5 °C.
Thought of as a function of T, for d = -3, this is a box of height 1/10 spanning -8 < T < 2.
86. Bayesian hypothesis testing
- The posterior for T is the normalised product of the prior and the likelihood, giving
    p(T|d,H1) = 1/2 for 0 < T < 2 °C:
  H1 only allows the temperature to be between 0 and 2 °C.
- The evidence for water (as we defined it) is
    p(d|H1) = ∫ p(d|T) p(T|H1) dT = 2 × (1/10) × (1/100) = 0.002.
- For H2, by the same arguments,
    p(T|d,H2) = 1/10 for -8 < T < 2 °C,
  and the evidence for ethanol is
    p(d|H2) = 10 × (1/10) × (1/160) = 0.00625.
87. Bayesian hypothesis testing
- So under the water hypothesis we have a tighter possible range for the liquid's temperature, but it may not be water. In fact, the odds of it being water rather than ethanol are
    p(d|H1)/p(d|H2) = 0.002/0.00625 ≈ 1/3,
  which means about 3:1 in favour of ethanol. Of course this depends on our prior odds too, which we have set to 1. If the choice was between water and whisky under the surface of the moon the result would be very different, though the Bayes factor would be roughly the same!
- Why do we prefer the ethanol option? Because too much of the prior for temperature, assuming water, falls at values that are excluded by the data. In other words, the water hypothesis is unnecessarily complicated. Bayes factors naturally implement Occam's razor in a quantitative way.
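The two evidences are just overlap integrals, so the whole example fits in a few lines (a sketch; the grid is my choice):

```python
import numpy as np

T = np.linspace(-100, 150, 250_001)   # temperature grid, °C
dT = T[1] - T[0]
d = -3.0                              # measured temperature

# Faulty thermometer: uniform within ±5 °C, i.e. a box of height 1/10.
like = (np.abs(d - T) < 5).astype(float) / 10

prior_water   = ((T > 0)   & (T < 100)).astype(float) / 100  # 1/100 on (0, 100)
prior_ethanol = ((T > -80) & (T < 80)).astype(float) / 160   # 1/160 on (-80, 80)

Z_water   = (like * prior_water).sum() * dT    # evidence = overlap integral
Z_ethanol = (like * prior_ethanol).sum() * dT
print(Z_water, Z_ethanol, Z_water / Z_ethanol) # ~0.002, 0.00625, odds ~0.32
```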
88. Bayesian hypothesis testing
(Figure: the likelihood box of height 0.1 over -8 < T < 2, against the water prior of height 0.01 and the ethanol prior of height 0.00625.)
The overlap integral (evidence) is greater for H2 (ethanol) than for H1 (water).
89. Bayesian hypothesis testing
- To look at this a bit more generally, we can split the evidence into two approximate parts, the maximum of the likelihood and an "Occam factor":
    p(d|H) ≈ L_max × (likelihood range / prior range),
  i.e., evidence ≈ maximum likelihood × Occam factor. The Occam factor penalises models that include wasted parameter space, even if they show a good ML fit.
90. Bayesian hypothesis testing
- This is a very powerful property of the Bayesian method. Example: you're given a time series of 1000 data points comprising a number of sinusoids embedded in gaussian noise. Determine the number of sinusoids and the standard deviation of the noise.
- We could think of this as comparing hypotheses Hn that there are n sinusoids in the data, with n ranging from 0 to n_max. Equivalently, we could consider it as a parameter fitting problem, with n an unknown parameter within the model. The joint posterior for the n signals, with amplitudes A_n and frequencies ω_n, and noise variance σ², given the overall model and data D, is
    p({A_n, ω_n}, σ, n|D) ∝ p(D|{A_n, ω_n}, σ, n) p({A_n, ω_n}, σ, n).
  The likelihood term, based on gaussian noise, is
    p(D|{A_n, ω_n}, σ, n) ∝ σ^(-N) exp[-Σ_i (d_i - model_i)² / (2σ²)],
  and we can set the priors as independent and uniform over sensible ranges.
91. Bayesian hypothesis testing
- Our result for n is just its marginal probability,
    p(n|D) = ∫ p({A_n, ω_n}, σ, n|D) dA_n dω_n dσ,
  and similarly we could marginalise for σ. Recent work, in collaboration with Umstatter and Christensen, has explored this:
- 1000 data points with 50 embedded signals, σ = 2.6.
Result: around 33 signals can be recovered from the data; the rest are indistinguishable from noise, and the inferred σ is consequently higher.
92. Bayesian hypothesis testing
- This has implications for the analysis of LISA data, which is expected to contain many (perhaps 50,000) signals from white dwarf binaries. The data will contain resolvable binaries and binaries that just contribute to the overall noise (either because they are faint or because their frequencies are too close together). Bayes can sort these out without having to introduce ad hoc acceptance and rejection criteria, and without needing to know the "true noise level" (whatever that means!).
93. Frequentist hypothesis testing
And why we should tread carefully, if at all
94. Frequentist hypothesis testing: significance tests
- A note on why these should really be avoided!
- The method goes like this:
  - To test a hypothesis H1, consider another hypothesis, called the null hypothesis, H0, the truth of which would deny H1. Then argue against H0...
  - Use the data you have gathered to compute a test statistic T_obs (e.g. the chi-squared statistic) which has a calculable pdf if H0 is true. This can be calculated analytically or by Monte Carlo methods.
  - Look where your observed value of the statistic lies in the pdf, and reject H0 based on how far in the wings of the distribution you have fallen.
(Figure: the pdf p(T|H0); H0 is rejected if your result lies far out in the wing.)
95. Frequentist hypothesis testing: significance tests
- H0 is rejected at the x% level if x% of the probability lies to the right of the observed value of the statistic (or is "worse" in some other sense), and this makes no reference to how improbable the value is under any alternative hypothesis (not even H1!). So:
(Figure: p(T|H0), with x% of the area lying beyond T_obs.)
"An hypothesis H0 that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it."
Harold Jeffreys, Theory of Probability, 1939
96. Frequentist hypothesis testing: significance tests
- E.g., you are given a data point T_obs affected by gaussian noise, drawn either from N(μ = -1, σ = 0.5) or N(μ = +1, σ = 0.5).
    H0: the datum comes from μ = -1
    H1: the datum comes from μ = +1
  Test whether H1 is true.
- Here our statistic is simply the value of the observed data. We will choose a critical region of T > 0, so that if T_obs > 0 we reject H0 and therefore accept H1.
(Figure: the two pdfs, centred on T = -1 (H0) and T = +1 (H1).)
97. Frequentist hypothesis testing
(Figure: the Type I and Type II error areas in the overlapping tails of the two pdfs, either side of T = 0.)
- Formally, this can go wrong in two ways:
  - A Type I error occurs when we reject the null hypothesis when it is true.
  - A Type II error occurs when we accept the null hypothesis when it is false.
- Both of these we should strive to minimise.
- The probabilities (p-values) of these are shown as the coloured areas above (about 5% for this problem).
- If T_obs > 0 then we reject the null hypothesis at the 5% level and therefore accept H1.
98. Frequentist hypothesis testing
- But note that we do not consider the relative probabilities of the hypotheses (we can't! Orthodox statistics does not allow the idea of hypothesis probabilities), so the results can be misleading.
- For example, let T_obs = 0.01. This lies just on the boundary of the critical region, so we reject H0 in favour of H1 at the 5% level, despite the fact that we know a value of 0.01 is just as likely under both hypotheses (actually just as unlikely, but it has happened).
- Note that the Bayes factor for this same result is 1, reflecting the intuitive answer that you can't decide between the hypotheses based on such a datum.
Moral: the subtleties of p-values are rarely reflected in papers that quote them, and many errors of "Type III" (misuse!) occur. Always take them with a pinch of salt, and avoid them if possible; there are better tools available.
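Both numbers are easy to compute for this two-Gaussian example. A sketch using scipy's normal distribution (the library choice is mine):

```python
from scipy.stats import norm

T_obs = 0.01
H0 = norm(loc=-1, scale=0.5)   # N(mu = -1, sigma = 0.5)
H1 = norm(loc=+1, scale=0.5)   # N(mu = +1, sigma = 0.5)

# Frequentist: probability under H0 of a result at least this extreme.
p_value = H0.sf(T_obs)                       # survival function = 1 - CDF
# Bayesian: ratio of the likelihoods of the actual datum.
bayes_factor = H1.pdf(T_obs) / H0.pdf(T_obs)

print(f"p-value = {p_value:.3f}, Bayes factor = {bayes_factor:.2f}")
# The Bayes factor comes out close to 1, as the slide notes: the datum
# alone cannot decide between the hypotheses, whatever the p-value says.
```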
99. Assigning Bayesian probabilities
- (you have to start somewhere!)
100. Assigning Bayesian probabilities
- We have done much on the manipulation of probabilities, but not much on the initial assignments of likelihoods and priors. Here are a few words. The principle of insufficient reason (Bernoulli, 1713) helps us out: if we have N mutually exclusive possibilities for an outcome, and no reason to believe one more than another, then each should be assigned the probability 1/N.
- So the probability of throwing a 6 on a die is 1/6, if all you know is it has 6 faces.
- The key is to realise that this is one example of a more general principle: your state of knowledge is such that the probabilities are invariant under some transformation group (here, the exchange of numbers on faces).
- Using this idea we can extend the principle of insufficient reason to continuous parameters.
101. Assigning Bayesian probabilities
- So, a parameter that is not known to within a change of origin (a "location parameter") should have a uniform prior: p(a|I) = constant.
- A parameter for which you have no knowledge of scale (a "scale parameter") is a location parameter in its log, so
    p(s|I) ∝ 1/s
  -- the so-called Jeffreys prior.
- Note that both these priors are improper: you can't normalise them, so their unfettered use is restricted. However, you are rarely that ignorant about the value of a parameter.
102. Assigning Bayesian probabilities
- More generally, given some information, I, that we wish to use to assign probabilities p1...pn to n different possibilities, the most honest way of assigning the probabilities is the one that maximises
    H = -Σ_i p_i log p_i,
  subject to the constraints imposed by I. H is traditionally called the information entropy of the distribution, and measures the amount of uncertainty in the distribution in the sense of Shannon (though there are several routes to the same result). Honesty demands we maximise this, otherwise we are implicitly imposing further constraints not contained in I.
- For example, the maximum entropy solution for the probability of seeing k events, given only the information that they are characterised by a single rate constant λ, is the Poisson distribution. If you are told the first and second moments of a continuous distribution, the maxent solution is a gaussian, etc.
103. Monte Carlo methods
How to do those nasty marginalisations
104. Monte Carlo Methods
- 1. Uniform random numbers, U[0,1]:
- See Numerical Recipes! http://www.numerical-recipes.com/
105. Monte Carlo Methods
2. Transformed random variables: suppose we have x with pdf p(x), and let y = f(x). Then, since the probability of a number between y and y + dy must equal the probability of a number between x and x + dx,
    p(y)|dy| = p(x)|dx|,  so  p(y) = p(x) |dx/dy|
(the modulus because probability must be positive).
106. Monte Carlo Methods
- Transformed random variables:
- Suppose we have x ~ U[0,1].
- Let y = a + (b - a)x.
- Then p(y) = 1/(b - a) for a < y < b: y is uniform on (a,b).
Numerical Recipes uses the transformation method to provide deviates from other standard pdfs.
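A sketch of this particular transformation:

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 2.0, 5.0

# y = a + (b - a) x maps x ~ U[0,1] to y ~ U[a,b]:
# p(y) = p(x) |dx/dy| = 1/(b - a) on (a, b).
x = rng.uniform(0, 1, 100_000)
y = a + (b - a) * x
print(y.min(), y.max(), y.mean())   # mean ~ (a + b)/2 = 3.5
```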
107. Monte Carlo Methods
- 3. Probability integral transform:
- Suppose we can compute the CDF of some desired random variable, F(x) = Prob(x' < x).
- 1) Sample u ~ U[0,1].
- 2) Compute x = F⁻¹(u).
- Then x is drawn from the desired pdf p(x).
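A sketch using the exponential pdf, whose CDF inverts analytically (the choice of pdf is mine):

```python
import numpy as np

rng = np.random.default_rng(4)

# Exponential pdf p(x) = lam e^(-lam x) has CDF F(x) = 1 - e^(-lam x),
# so x = -ln(1 - u)/lam with u ~ U[0,1] has the desired pdf.
lam = 2.0
u = rng.uniform(0, 1, 100_000)
x = -np.log(1 - u) / lam
print(x.mean())   # ~ 1/lam = 0.5
```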
108. Monte Carlo Methods
4. Rejection sampling: suppose we want to sample from some pdf p(x), and we know that p(x) ≤ q(x), where q(x) is a simpler pdf called the proposal distribution:
- Sample x1 from q(x)
- Sample y ~ U[0, q(x1)]
- If y < p(x1): ACCEPT
- otherwise: REJECT
The set of accepted values {x_i} is a sample from p(x).
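A sketch with an assumed target: an unnormalised Gaussian bump on (0,1), enveloped by the constant q(x) = 1 (a uniform proposal):

```python
import numpy as np

rng = np.random.default_rng(5)

def p(x):
    """Unnormalised target pdf on (0,1); p(x) <= 1 = q(x) everywhere."""
    return np.exp(-0.5 * ((x - 0.3) / 0.2) ** 2)

x1 = rng.uniform(0, 1, 100_000)        # sample x1 from q (uniform proposal)
y = rng.uniform(0, 1, 100_000)         # y ~ U[0, q(x1)], with q(x1) = 1
accepted = x1[y < p(x1)]               # keep x1 where y < p(x1)
print(len(accepted), accepted.mean())  # accepted values sample p; mean near 0.3
```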
109. Monte Carlo Methods
4. Rejection sampling (cont.): the method can be very slow if the shaded region between q(x) and p(x) is too large, particularly in high-N problems such as the LISA problem considered earlier.
(Figure: a target pdf under a loose envelope, with a large rejected region.)
110. Monte Carlo Methods
- 5. Metropolis-Hastings algorithm
- Speed this up by letting the proposal distribution depend on the current sample:
- Sample an initial point x0.
- Sample a tentative new state x' from q(x'|x_n) (e.g. a Gaussian centred on x_n).
- Compute the ratio a = p(x')/p(x_n).
- If a ≥ 1: accept.
- Otherwise: accept with probability a.
The sequence {x_n} is a Markov chain. Note that rejected samples appear in the chain as repeated values of the current state.
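A minimal sampler along these lines (a sketch; the standard-normal target and step size are assumptions for illustration, and the symmetric Gaussian proposal makes the proposal ratio cancel):

```python
import numpy as np

rng = np.random.default_rng(6)

def log_p(x):
    """Unnormalised log-target; a standard normal for illustration."""
    return -0.5 * x**2

n_steps, step = 50_000, 1.0
chain = np.empty(n_steps)
x = 0.0                                   # initial point
for i in range(n_steps):
    x_new = x + rng.normal(0, step)       # symmetric Gaussian proposal
    a = np.exp(log_p(x_new) - log_p(x))   # acceptance ratio p(x')/p(x)
    if a >= 1 or rng.uniform() < a:       # accept, or accept with probability a
        x = x_new
    chain[i] = x                          # a rejection repeats the current state

print(chain.mean(), chain.std())          # ~0 and ~1 for the standard normal
```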
111. Monte Carlo Methods
A histogram of the contents of the chain for a parameter converges to the marginal pdf for that parameter. In this way, high-dimensional posteriors can be explored and marginalised to return parameter ranges and interrelations inferred from complex data sets (e.g., the WMAP results).
http://www.statslab.cam.ac.uk/mcmc/pages/links.html
113. The end... nearly
114. Final thoughts
- Things to keep in mind:
- Priors are fundamental, but not always influential. If you have good data, most reasonable priors return very similar posteriors.
- Degree of belief is subjective but not arbitrary: two people with the same degree of belief agree about the value of the corresponding probability.
- Write down the probability of everything, and marginalise over those parameters that are of no interest.
- Don't pre-filter if you can help it. Work with the data, not statistics of the data.
115. Bayesian bibliography for astronomy
- There is an increasing number of good books and articles on Bayesian methods. Here are just a few:
- E.T. Jaynes, Probability Theory: The Logic of Science, CUP, 2003
- D.J.C. MacKay, Information Theory, Inference and Learning Algorithms, CUP, 2004
- D.S. Sivia, Data Analysis: A Bayesian Tutorial, OUP, 1996
- T.J. Loredo, "From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics", 1990; copy at http://bayes.wustl.edu/gregory/articles.pdf
- G.L. Bretthorst, Bayesian Spectrum Analysis and Parameter Estimation, 1988; copy at http://bayes.wustl.edu/glb/book.pdf
- G. D'Agostini, "Bayesian reasoning in high-energy physics: principles and applications", http://preprints.cern.ch/cgi-bin/setlink?base=cernrep&categ=Yellow_Report&id=99-03
- Soon to appear: P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, CUP, 2005
- Review sites:
- http://bayes.wustl.edu/
- http://www.bayesian.org/