Title: Bayesian Statistics
1. Bayesian Statistics
- HSTAT1101, 10 November 2004
- Arnoldo Frigessi
- frigessi_at_medisin.uio.no
2. Reverend Thomas Bayes, 1702-1761
Mathematician who first used probability inductively and established a mathematical basis for probabilistic inference. He set down his findings on probability in "Essay Towards Solving a Problem in the Doctrine of Chances" (1763), published posthumously in the Philosophical Transactions of the Royal Society of London.
- Inventor of a Bayesian analysis for the binomial model.
- Laplace discovered Bayes' theorem independently at about the same time, together with a new analytic tool for approximating integrals.
- Bayesian statistics was the method of statistics until about 1910.
3. Statistics estimates unknown parameters (like the mean of a population). Parameters represent things that are unknown: they are properties of the population from which the data arise. Questions of interest are expressed as questions about such parameters: confidence intervals, hypothesis testing, etc. Classical (frequentist) statistics considers parameters as specific to the problem, so that they are not subject to random variability. Hence parameters are just unknown numbers; they are not random, and it is not possible to make probabilistic statements about parameters (like "the parameter has a 35% chance of being larger than 0.75"). Bayesian statistics considers parameters as unknown and random, and hence it is allowed to make probabilistic statements about them (like the one above). In Bayesian statistics parameters are uncertain either because they are random or because of our imperfect knowledge of them.
4. Example: "Treatment 2 is more cost-effective than treatment 1 for a certain hospital." Parameters involved in this statement:
- mean cost and mean efficacy for treatment 1
- mean cost and mean efficacy for treatment 2
across all patients in the population for which the hospital is responsible.
Bayesian point of view: we are uncertain about the statement, hence this uncertainty is described by a probability. We will calculate exactly the probability that treatment 2 is more cost-effective than treatment 1 for the hospital.
Classical point of view: either treatment 2 is more cost-effective or it is not. Since this experiment cannot be repeated (it happens only once), we cannot talk about its probability.
5. ... but, in classical statistics we can make a test!
Null hypothesis H0: treatment 2 is NOT more cost-effective than treatment 1
... and we can obtain a p-value! What is a p-value?
6. (Slide not transcribed: a multiple-choice question offering three interpretations, 1-3, of the p-value.)
7. Correct: 2 (but it is quite a complicated explanation, isn't it?). Answers 1 and 3 are ways significance is commonly interpreted, BUT they are not correct. Answer 3 makes a probabilistic statement about the hypothesis, which is not random but either true or false. Answer 1 is about individual patients, while the test is on cost-efficacy.
8. We cannot interpret a p-value as a probability, because in the classical setting it is irrelevant how probable the hypothesis was a priori, before the data were collected.
Example: Can a healer cure cancer? A healer treated 52 cancer patients and 33 of these were better after one session. Null hypothesis H0: the healer does not heal. The p-value (one-sided) is 3.5%. Hence reject at the 5% level. Should we believe that it is 96.5% sure that the healer heals? Most doctors would regard healers as highly unreliable, and in no way would they be persuaded by a single small experiment. After seeing the experiment, most doctors would continue to believe in H0: the experiment's result was due to chance.
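The quoted one-sided p-value can be checked with a short computation. As a sketch, assume that under H0 each patient has a 50% chance of being "better" by luck alone (the slide does not state the null model; this rate is inferred from the quoted 3.5%):

```python
from math import comb

# One-sided binomial test: probability of 33 or more improvements
# among 52 patients when each improves with chance 0.5 under H0.
n, k = 52, 33
p_value = sum(comb(n, x) * 0.5**n for x in range(k, n + 1))
print(f"one-sided p-value = {p_value:.3f}")
```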
9. In practice, classical statistics would recognise that much stronger evidence is needed to reject a very likely H0. So the p-value in reality does not mean the same thing in all situations. To interpret the p-value as the probability of the null hypothesis is not only wrong but dangerous when the hypothesis is a priori highly unlikely. Practical statisticians find it disturbing that a p-value cannot be seen as the probability that the null hypothesis is true. Similarly, it is disturbing that a 95% confidence interval for a treatment difference does NOT mean that the true difference has a 95% chance of lying in this interval.
10. Classical confidence interval: (3.5, 11.6) is a 95% confidence interval for the mean cost. Interpretation: "There is a 95% chance that the mean lies between 3.5 and 11.6." Correct? NO! It cannot mean this, since the mean cost is not random! In the Bayesian context, parameters are random, and when we compute a Bayesian interval for the mean, it means exactly the interpretation usually given to a confidence interval. In classical inference, the words "confidence" and "significance" are technical terms and should be interpreted as such!
11. One widely used way of presenting a cost-effectiveness analysis is the Cost-Effectiveness Acceptability Curve (CEAC), introduced by van Hout et al. (1994). For each value of the threshold willingness to pay λ, the CEAC plots the probability that one treatment is more cost-effective than another. This probability can only be meaningful in a Bayesian framework: it refers to the probability of a one-off event (the relative cost-effectiveness of these two particular treatments is one-off, and not repeatable).
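A CEAC can be sketched by Monte Carlo: draw mean cost and mean effect for each treatment from their posterior distributions and, for each willingness-to-pay value, record how often the net monetary benefit (willingness to pay times effect, minus cost) is higher for one treatment. The normal posteriors below are an assumption of this sketch, using the Standard and Dopexamine means and standard errors from the trial table (slide 13), with willingness to pay in cost units per survival day:

```python
import random

def ceac(posterior_draws, lambdas):
    """For each willingness-to-pay value lam, estimate the probability
    that treatment 2 has the higher net monetary benefit lam*effect - cost."""
    curve = []
    for lam in lambdas:
        wins = sum(
            lam * e2 - c2 > lam * e1 - c1
            for (c1, e1), (c2, e2) in posterior_draws
        )
        curve.append(wins / len(posterior_draws))
    return curve

random.seed(0)
# Illustrative posterior draws of (mean cost, mean survival):
# treatment 1 = Standard, treatment 2 = Dopexamine.
draws = [
    ((random.gauss(11885, 3477), random.gauss(541, 50.2)),
     (random.gauss(7976, 1407), random.gauss(657, 34.7)))
    for _ in range(5000)
]
curve = ceac(draws, [0, 20, 50])
for lam, p in zip([0, 20, 50], curve):
    print(f"lambda={lam}: P(treatment 2 cost-effective) = {p:.2f}")
```

Plotting the probability against a fine grid of λ values gives the acceptability curve itself.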
12. Example: randomised clinical trial evidence
- Studies: 1 RCT (n = 107)
- Comparators: dopexamine vs standard care
- Follow-up: 28 days
- Economics: single cost analysis
Boyd O, Grounds RM, Bennett ED. A randomised clinical trial of the effect of deliberate perioperative increase of oxygen delivery on mortality in high-risk surgical patients. JAMA 1993; 270: 2699-707.
13. Trial results

              Costs           Survival (days)
              mean     se     mean     se
Standard      11,885   3,477  541      50.2
Adrenaline    10,847   3,644  615      64.1
Dopexamine    7,976    1,407  657      34.7
14. Trial CEAC curves
[Figure: cost-effectiveness acceptability curves for Control, Dopexamine and Adrenaline; y-axis: probability that the strategy is cost-effective, from 0 to 1; x-axis: willingness to pay λ, from 0 to 100,000.]
15. The Bayesian method: learn from the data
The role of data is to add to our knowledge and so to update what we can say about hypotheses and parameters. If we want to learn from a new data set, we first have to say what we already know about the hypothesis, a priori, before we see the data. Bayesian statistics summarises what is known a priori about an unknown parameter (say, the mean cost of something) in a distribution for the unknown quantity, called the prior distribution. The prior distribution synthesises what is known or believed to be true before we analyse the new data. Then we analyse the new data and summarise again the total information about the unknown hypothesis (or parameter) in a distribution called the posterior distribution. Bayes' formula is the mathematical way to calculate the posterior distribution given the prior distribution and the data.
16. [Figure: the prior distribution, the data (likelihood) and the resulting posterior distribution.]
17. Bayes recognises the strength of each curve: the posterior is more influenced by the data than by the prior, since the data have a narrower distribution. Peaks: prior 0, data 1.60, posterior 1.30.
18. The data curve is called the likelihood, and it is also important in classical statistics. It describes the support that comes from the data for the various possible values of the unknown parameter. Classical statistics uses only the likelihood; Bayesian statistics uses all three curves. The classical estimate here would be the peak of the likelihood (1.6). The Bayes estimate is about 1.3, since this includes our prior belief that the parameter should have a value below 2 or so.
19. The Bayesian estimate is a compromise between data and prior knowledge. In this sense, Bayesian statistics is able to make use of more information than classical statistics and hence obtain stronger results.
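With normal distributions this compromise has a closed form: the posterior mean is a precision-weighted average of the prior mean and the data mean. The standard deviations below are illustrative assumptions chosen to reproduce the peaks quoted above (prior 0, data 1.60, posterior about 1.30):

```python
def normal_posterior(prior_mean, prior_sd, data_mean, data_sd):
    """Conjugate normal update: precision-weighted average of prior and data."""
    w_prior = 1 / prior_sd**2   # precision of the prior
    w_data = 1 / data_sd**2     # precision of the likelihood
    mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    sd = (w_prior + w_data) ** -0.5
    return mean, sd

# Prior peaked at 0 but wide; data peaked at 1.6 and narrower,
# so the posterior lands much closer to the data.
mean, sd = normal_posterior(0.0, 1.0, 1.6, 0.48)
print(f"posterior peak = {mean:.2f}")  # about 1.30
```

The narrower curve carries the larger precision, which is exactly why the posterior sits nearer the data than the prior.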
20. [Figure: posterior distribution; the shaded area above zero has probability 0.89.]
Bayesian statistics reads confidence intervals, estimates etc. from the posterior distribution. A point estimate for the parameter is the a posteriori most likely value (the peak of the posterior), or the expected value of the posterior. If we have a hypothesis (for example, that the parameter is positive), then we read from the posterior that the posterior probability for the parameter to be larger than zero is 0.89.
21. If we are less sure about the parameter a priori, then we use a flatter prior. The consequence is that the posterior looks more similar to the likelihood (the data).
22. posterior = (constant number) × prior × likelihood
∝ prior × likelihood
P(parameter | data) ∝ P(parameter) × P(data | parameter)
P(θ | data) ∝ P(θ) × P(data | θ)
P(θ | data) = P(θ) × P(data | θ) / P(data)   (Bayes' formula)
23. How do we choose a prior distribution? The prior is subjective. Two different experts can have different knowledge and beliefs, which would lead to two different priors. If you have no opinion, then it is possible to use a totally flat prior, which adds no information to what is in the data. If we want clear probabilistic interpretations of confidence and significance, then we need priors. This is considered a weakness by many who are trained to reject subjectivity whenever possible. BUT:
- Science is not objective in any case. Why should the binomial or the Gaussian distribution be the true one for a data set?
- Subjective evidence can be tuned down as much as one wishes.
- If there is no consensus, and different priors lead to different decisions, why hide it?
25. Example: Cancer at Slater School
- (Example taken from an article by Paul Brodeur in the New Yorker, Dec. 1992.)
- Slater School is an elementary school where the staff was concerned that their high cancer rate could be due to two nearby high-voltage transmission lines.
- Key facts: there were 8 cases of invasive cancer over a long time among 145 staff members whose average age was between 40 and 44. Based on the national cancer rate among women this age (approximately 3/100), the expected number of cancers is 4.2.
- Assumptions: 1) the 145 staff members developed cancer independently of each other; 2) the chance of cancer, θ, was the same for each staff person.
- Therefore, the number of cancers, X, follows a binomial distribution: X ~ bin(145, θ).
- How well does each of four simplified competing theories explain the data?
- Theory A: θ = .03 (the national rate, i.e. no effect of the lines)
- Theory B: θ = .04
- Theory C: θ = .05
- Theory D: θ = .06
26. The likelihood of Theories A-D
- To compare the theories, we see how well each explains the data.
- That is, for each hypothesised θ, we calculate the binomial probability of the observed count:
Theory A: Pr(X = 8 | θ = .03) ≈ .036
Theory B: Pr(X = 8 | θ = .04) ≈ .096
Theory C: Pr(X = 8 | θ = .05) ≈ .134
Theory D: Pr(X = 8 | θ = .06) ≈ .136
This is a ratio of approximately 1 : 3 : 4 : 4. So Theory B explains the data about 3 times as well as Theory A. There seems to be an effect of the lines!
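These likelihoods can be checked directly with the binomial formula; the values come out close to, though not exactly equal to, the slide's rounded figures:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Likelihood of 8 cancers among 145 staff under each theory.
for theory, theta in [("A", 0.03), ("B", 0.04), ("C", 0.05), ("D", 0.06)]:
    print(f"Theory {theory}: Pr(X=8 | theta={theta}) = {binom_pmf(8, 145, theta):.3f}")
```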
27. A Bayesian analysis
- There are other sources of information about whether cancer can be induced by proximity to high-voltage transmission lines.
- Some epidemiologists show positive correlations between cancer and proximity.
- Other epidemiologists don't show these correlations, and physicists and biologists maintain that the energy in magnetic fields associated with high-voltage power lines is too small to have an appreciable biological effect.
- Suppose we judge the opposing expert opinions equally reliable. Therefore, Theory A (no effect) is as likely as Theories B, C and D together, and we judge Theories B, C and D to be equally likely.
- So, Pr(A) = 0.5 = Pr(B) + Pr(C) + Pr(D)
- Also, Pr(B) = Pr(C) = Pr(D) = 0.5 / 3 = 1/6
- These quantities represent our prior distribution on the four possible hypotheses.
28. Bayes' theorem
P(A | X = 8) ≈ 0.23. Likewise,
Pr(B | X = 8) ≈ 0.21
Pr(C | X = 8) ≈ 0.28
Pr(D | X = 8) ≈ 0.28
Accordingly, we'd say that each of these four theories is almost equally likely. So the probability that there is an effect of the lines at Slater is about 0.21 + 0.28 + 0.28 = 0.77. The probability of an effect is pretty high, but not close enough to 1 to be a proof.
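Bayes' formula turns the priors and binomial likelihoods into these posteriors. Because of rounding in the quoted likelihoods, the values printed below differ from the slide's by a point or two:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

prior = {"A": 0.5, "B": 1 / 6, "C": 1 / 6, "D": 1 / 6}
theta = {"A": 0.03, "B": 0.04, "C": 0.05, "D": 0.06}

# Unnormalised posterior: prior times likelihood of the observed 8 cases;
# dividing by the total is the P(data) denominator in Bayes' formula.
unnorm = {t: prior[t] * binom_pmf(8, 145, theta[t]) for t in prior}
total = sum(unnorm.values())
posterior = {t: unnorm[t] / total for t in unnorm}

for t, p in posterior.items():
    print(f"Pr({t} | X=8) = {p:.2f}")
print(f"Pr(effect of the lines) = {1 - posterior['A']:.2f}")
```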
29. A non-Bayesian analysis
- Classical test of the hypothesis H0: θ = .03 (no effect) against the alternative hypothesis.
- Calculate the p-value; we find
- p-value = Pr(X = 8 | θ = 0.03) + Pr(X = 9 | θ = 0.03) + Pr(X = 10 | θ = 0.03) + ... + Pr(X = 145 | θ = 0.03) (138 terms to be added)
- ≈ .07
- Under a classical hypothesis test, we would not reject the null hypothesis. So there is no indication of an effect of the lines.
- By comparison, the Bayesian analysis revealed that Pr(θ > .03) ≈ 0.77.
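The 138-term tail sum is easy to evaluate directly:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# One-sided p-value: probability of 8 or more cancers among 145 staff
# when theta = 0.03, i.e. the 138 terms k = 8, 9, ..., 145.
p_value = sum(binom_pmf(k, 145, 0.03) for k in range(8, 146))
print(f"p-value = {p_value:.3f}")
```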
30. Today's posterior is the prior of tomorrow!
31. Example: Hospitalisation
A new drug seems to have good efficacy relative to a standard treatment. Is it cost-effective? Assume that it would be, if it also reduced hospitalisation. Data: 100 patients in each treatment group. Standard treatment group: 25 days in hospital in total (sample variance 1.2). New treatment group: 5 days in total (sample variance 0.248). A classical test (do it!) would show that the difference is significant at the 5% level. The pharmaceutical company would then say: "The mean number of days in hospital under the new treatment is 0.05 per patient (5/100), while it is 0.25 with the standard treatment. Cost-effective!!!!!"
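The classical test can be sketched as a two-sample z-test, assuming the quoted sample variances are per-patient variances and the test is one-sided (assumptions of this sketch; the slide does not spell out the test):

```python
from math import sqrt, erf

# Two-sample z-test for the difference in mean days per patient.
n = 100
mean_std, var_std = 0.25, 1.2
mean_new, var_new = 0.05, 0.248

se = sqrt(var_std / n + var_new / n)      # standard error of the difference
z = (mean_std - mean_new) / se
p_one_sided = 0.5 * (1 - erf(z / sqrt(2)))  # upper-tail normal probability
print(f"z = {z:.2f}, one-sided p = {p_one_sided:.3f}")
```

The z value just exceeds the one-sided 5% critical value of 1.645, so the difference is declared significant.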
32. Example: Hospitalisation, genuine prior information
BUT this was a rather small trial. Is there other evidence available? Suppose a much larger trial of a similar drug produced a mean number of days in hospital per patient of 0.21, with a standard error of only 0.03. This would suggest that the 0.05 of the new drug is optimistic, and casts doubt on the real difference between new and standard treatment cost. BUT the interpretation of how pertinent this evidence is, is subjective: it was a similar drug, not the same drug. It is however reasonable to suppose that the new drug and this similar one should behave rather similarly. Because the drug is not the same, we cannot simply pool the two data sets. Classical statistics does not know what to do, except to lower the required significance level.
33. Example: Hospitalisation
Bayesian statistics solves the problem by treating the early trial as giving prior information for the new trial. Assume that our prior says that the mean number of days in hospital per patient with the new treatment should be 0.21, but with a larger standard deviation, say 0.08, to mark that the two drugs are not the same. Now we compute the posterior estimate, given the new small trial, and obtain that the mean number of days in hospital per patient is 0.095. So this is still better than the standard treatment (0.25 days). In fact, we can compute the probability that the new drug reduces hospitalisation with respect to the standard one, and we get 0.90! Conclusion: the new treatment has a 90% chance (but not 95%) of reducing hospitalisation, and the mean number of days is about 0.1 (not 0.05).
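The 0.095 and the 0.90 can be reproduced with a conjugate normal update. This sketch assumes normal distributions throughout, a standard error of sqrt(0.248/100) for the new-trial mean, and independence between the two treatment means (assumptions; the slide does not give its exact model):

```python
from math import sqrt, erf

def normal_posterior(prior_mean, prior_sd, data_mean, data_se):
    """Conjugate normal update: precision-weighted average of prior and data."""
    w0, w1 = 1 / prior_sd**2, 1 / data_se**2
    return (w0 * prior_mean + w1 * data_mean) / (w0 + w1), (w0 + w1) ** -0.5

# Prior from the larger trial of the similar drug, widened to sd 0.08;
# new-trial data: mean 0.05 with assumed se = sqrt(0.248/100).
post_mean, post_sd = normal_posterior(0.21, 0.08, 0.05, sqrt(0.248 / 100))
print(f"posterior mean = {post_mean:.3f}")  # about 0.095

# Probability that the new drug beats the standard treatment
# (standard: mean 0.25, se sqrt(1.2/100)).
diff_sd = sqrt(post_sd**2 + 1.2 / 100)
z = (0.25 - post_mean) / diff_sd
prob = 0.5 * (1 + erf(z / sqrt(2)))
print(f"P(new < standard) = {prob:.2f}")  # close to the slide's 0.90
```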
34. http://www.bayesian-initiative.com/
35. I hope I have not confused you too much! BUT I also hope that you are a bit confused now, and that at later stages in your education and profession you will want to learn this better! For now, Bayesian statistics is NOT part of the syllabus.