Title: Why I am a Bayesian (and why you should become one, too) or Classical statistics considered harmful
1. Why I am a Bayesian (and why you should become one, too), or Classical statistics considered harmful
- Kevin Murphy
- UBC CS & Stats
- 9 February 2005
2. Where does the title come from?
- Why I am not a Bayesian, Glymour, 1981
- Why Glymour is a Bayesian, Rosenkrantz, 1983
- Why isn't everyone a Bayesian?, Efron, 1986
- Bayesianism and causality, or, why I am only a half-Bayesian, Pearl, 2001
Many other such philosophical essays exist.
3. Frequentist vs Bayesian
Frequentist:
- Prob. = objective relative frequencies
- Params are fixed unknown constants, so cannot write e.g. P(theta > 0.5 | D)
- Estimators should be good when averaged across many trials
Bayesian:
- Prob. = degrees of belief (uncertainty)
- Can write P(anything | D)
- Estimators should be good for the available data
Source: All of Statistics, Larry Wasserman
4. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
5. Coin flipping
HHTHT
HHHHH
What process produced these sequences?
The following slides are from Tenenbaum & Griffiths.
6. Hypotheses in coin flipping
Describe processes by which D could be generated.
D = HHTHT
- Fair coin, P(H) = 0.5
- Coin with P(H) = p
- Markov model
- Hidden Markov model
- ...
8. Representing generative models
- Graphical model notation (Pearl 1988; Jordan 1998)
- Variables are nodes; edges indicate dependency
- Directed edges show the causal process of data generation
9. Models with latent structure
- Not all nodes in a graphical model need to be observed
- Some variables reflect latent structure, used in generating D but unobserved
How do we select the best model?
10. Bayes rule
P(H|D) = P(D|H) P(H) / sum_H' P(D|H') P(H')
The denominator sums over the space of hypotheses.
11. The origin of Bayes rule
- A simple consequence of using probability to represent degrees of belief
- For any two random variables: P(h, d) = P(h|d) P(d) = P(d|h) P(h), hence P(h|d) = P(d|h) P(h) / P(d)
12. Why represent degrees of belief with probabilities?
- Good statistics
  - consistency, and worst-case error bounds
- Cox axioms
  - necessary to cohere with common sense
- Dutch Book / Survival of the Fittest
  - if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord
- Provides a theory of incremental learning
  - a common currency for combining prior knowledge and the lessons of experience
13. Hypotheses in Bayesian inference
- Hypotheses H refer to processes that could have generated the data D
- Bayesian inference provides a distribution over these hypotheses, given D
- P(D|H) is the probability of D being generated by the process identified by H
- Hypotheses H are mutually exclusive: only one process could have generated D
14. Coin flipping
- Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
- Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
16. Comparing two simple hypotheses
- Contrast simple hypotheses:
  - H1: fair coin, P(H) = 0.5
  - H2: always heads, P(H) = 1.0
- Bayes rule
- With two hypotheses, use the odds form
17. Bayes rule in odds form
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]
Posterior odds = Bayes factor (likelihood ratio) x prior odds
18. Data: D = HHTHT
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^5, P(H1) = 999/1000
- P(D|H2) = 0, P(H2) = 1/1000
- Posterior odds: P(H1|D) / P(H2|D) = infinity
19. Data: D = HHHHH
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^5, P(H1) = 999/1000
- P(D|H2) = 1, P(H2) = 1/1000
- Posterior odds: P(H1|D) / P(H2|D) ≈ 30
20. Data: D = HHHHHHHHHH
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^10, P(H1) = 999/1000
- P(D|H2) = 1, P(H2) = 1/1000
- Posterior odds: P(H1|D) / P(H2|D) ≈ 1
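The three comparisons on slides 18-20 can be checked with a short script. This is a sketch using exact rational arithmetic; the 999/1000 vs 1/1000 prior is the one used on the slides.

```python
from fractions import Fraction

# Posterior odds: P(H1|D)/P(H2|D) = [P(D|H1)/P(D|H2)] * [P(H1)/P(H2)]
# H1 = fair coin, H2 = always heads; priors 999/1000 and 1/1000 as on the slides.
PRIOR_ODDS = Fraction(999, 1000) / Fraction(1, 1000)  # = 999

def posterior_odds(data):
    n = len(data)
    lik_h1 = Fraction(1, 2**n)          # fair coin assigns (1/2)^N to any sequence
    lik_h2 = Fraction(1) if all(c == 'H' for c in data) else Fraction(0)
    if lik_h2 == 0:
        return float('inf')             # a single tail rules H2 out completely
    return float(lik_h1 / lik_h2 * PRIOR_ODDS)

print(posterior_odds('HHTHT'))        # inf
print(posterior_odds('HHHHH'))        # 999/32 ≈ 31.2, i.e. "about 30"
print(posterior_odds('HHHHHHHHHH'))   # 999/1024 ≈ 0.98, i.e. "about 1"
```

Five heads in a row only raise the odds to about 30 because H2 started with such a small prior; by ten heads the fair-coin hypothesis has lost its advantage.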
21. Coin flipping
- Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
- Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
22. Comparing simple and complex hypotheses
- Which provides a better account of the data: the simple hypothesis of a fair coin (P(H) = 0.5), or the complex hypothesis that P(H) = p?
23. Comparing simple and complex hypotheses
- P(H) = p is more complex than P(H) = 0.5 in two ways:
  - P(H) = 0.5 is a special case of P(H) = p
  - for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
24. Comparing simple and complex hypotheses
[Plot: probability of the sequence as a function of p]
25. Comparing simple and complex hypotheses
[Plot: for HHHHH, the likelihood is maximized at p = 1.0]
26. Comparing simple and complex hypotheses
[Plot: for HHTHT, the likelihood is maximized at p = 0.6]
27. Comparing simple and complex hypotheses
- P(H) = p is more complex than P(H) = 0.5 in two ways:
  - P(H) = 0.5 is a special case of P(H) = p
  - for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
- How can we deal with this?
  - frequentist: hypothesis testing
  - information theorist: minimum description length
  - Bayesian: just use probability theory!
28. Comparing simple and complex hypotheses
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]
- Computing P(D|H1) is easy: P(D|H1) = 1/2^N
- Compute P(D|H2) by averaging over p
29. Comparing simple and complex hypotheses
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]
- Computing P(D|H1) is easy: P(D|H1) = 1/2^N
- Compute P(D|H2) by averaging over p: the marginal likelihood P(D|H2) = ∫ P(D|p) P(p|H2) dp, i.e. the likelihood averaged under the prior
30. Likelihood and prior
- Likelihood: P(D|p) = p^NH (1-p)^NT
  - NH = number of heads
  - NT = number of tails
- Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1)
31. A simple method of specifying priors
- Imagine some fictitious trials, reflecting a set of previous experiences
  - strategy often used with neural networks
- e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
- In fact, this is a sensible statistical idea...
32. Likelihood and prior
- Likelihood: P(D|p) = p^NH (1-p)^NT
  - NH = number of heads
  - NT = number of tails
- Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. Beta(FH, FT)
  - FH = fictitious observations of heads (pseudo-counts)
  - FT = fictitious observations of tails (pseudo-counts)
33. Posterior ∝ prior x likelihood
- Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1)
- Likelihood: P(D|p) = p^NH (1-p)^NT
- Posterior: P(p|D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1)
Same form as the prior: Beta(NH+FH, NT+FT)!
34. Conjugate priors
- Exist for many standard distributions
  - formula for exponential family conjugacy
- Define prior in terms of fictitious observations
- Beta is conjugate to Bernoulli (coin-flipping)
[Plots: Beta priors with FH = FT = 1, FH = FT = 3, FH = FT = 1000]
35. Normalizing constants
- Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT)
- Normalizing constant for the Beta distribution: B(a, b) = Γ(a) Γ(b) / Γ(a+b)
- Posterior: P(p|D) = p^(NH+FH-1) (1-p)^(NT+FT-1) / B(NH+FH, NT+FT)
- Hence the marginal likelihood is P(D|H2) = B(NH+FH, NT+FT) / B(FH, FT)
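The ratio of Beta normalizing constants gives the marginal likelihood in closed form. A sketch in standard-library Python, working with log-gamma for numerical stability; the pseudo-counts FH, FT default to the uniform prior:

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function: log Γ(a) + log Γ(b) - log Γ(a+b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(NH, NT, FH=1, FT=1):
    # P(D|H2) = ∫ p^NH (1-p)^NT Beta(p; FH, FT) dp = B(NH+FH, NT+FT) / B(FH, FT)
    return exp(log_beta(NH + FH, NT + FT) - log_beta(FH, FT))

# D = HHTHT: NH = 3, NT = 2, uniform prior
print(marginal_likelihood(3, 2))   # 1/60 ≈ 0.0167
print(0.5**5)                      # P(D|H1) = 1/32 ≈ 0.031
```

For HHTHT the fair-coin likelihood (1/32) beats the averaged likelihood of the complex hypothesis (1/60): the Bayesian Occam's razor at work.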
36. Comparing simple and complex hypotheses
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]
- Computing P(D|H1), the likelihood for H1, is easy: P(D|H1) = 1/2^N
- Compute P(D|H2), the marginal likelihood (evidence) for H2, by averaging over p
37. Marginal likelihood for H1 and H2
[Plot: probability as a function of p]
The marginal likelihood is an average over all values of p.
38. Sensitivity to hyper-parameters
39. Bayesian model selection
- Simple and complex hypotheses can be compared directly using Bayes rule
  - requires summing over latent variables
- Complex hypotheses are penalized for their greater flexibility: the Bayesian Occam's razor
- Maximum likelihood cannot be used for model selection (it always prefers the hypothesis with the largest number of parameters)
40. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
41. Example: Belgian euro-coins
- A Belgian euro spun N = 250 times came up heads X = 140 times.
- "It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%." Barry Blight, LSE (reported in The Guardian, 2002)
Source: MacKay, exercise 3.15
42. Classical hypothesis testing
- Null hypothesis H0: e.g. theta = 0.5 (unbiased coin)
- For a classical analysis, we don't need to specify an alternative hypothesis, but later we will use H1: theta ≠ 0.5
- Need a decision rule that maps data D to accept/reject of H0
- Define a scalar measure of deviance d(D) from the null hypothesis, e.g., Nh or chi-squared
43. P-values
- Define the p-value of a threshold t as p(t) = P(d(D) ≥ t | H0), with rejection region R = {D : d(D) ≥ t}
- Intuitively, the p-value of the data is the probability of getting data at least that extreme, given H0
- Usually choose t so that the false rejection rate of H0 is below the significance level alpha = 0.05
- Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞
46. P-value for euro coins
- N = 250 trials, X = 140 heads
- P-value is less than 7%
- If N = 250 and X = 141, pval = 0.0497, so we can reject the null hypothesis at the significance level of 5%.
- This does not mean P(H0|D) = 0.07!
Matlab: pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5)
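The Matlab one-liner on this slide translates directly to Python. A sketch using only the standard library, computing the two-sided tail probability for 140 heads in 250 spins:

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    # P(X <= k) for X ~ Binomial(n, p), summed exactly
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n = 250
# Two-sided tail under H0: at least 140 heads or at most 110 heads
pval = (1 - binom_cdf(139, n)) + binom_cdf(110, n)
print(round(pval, 4))  # ≈ 0.066, i.e. "less than 7%"
```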
47. Bayesian analysis of euro-coin
- Assume P(H0) = P(H1) = 0.5
- Assume P(p) = Beta(alpha, alpha) under H1
- Setting alpha = 1 yields a uniform (non-informative) prior.
48. Bayesian analysis of euro-coin
- If alpha = 1, the Bayes factor B = P(D|H1)/P(D|H0) is about 0.5, so H0 (unbiased) is (slightly) more probable than H1 (biased).
- By varying alpha over a large range, the best we can do is make B = 1.9, which does not strongly support the biased-coin hypothesis.
- Other priors yield similar results.
- The Bayesian analysis contradicts the classical analysis.
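This Bayes factor has a closed form via Beta functions. A sketch of the computation, reusing the marginal-likelihood formula from earlier in the deck (p ~ Beta(alpha, alpha) under H1, p fixed at 0.5 under H0):

```python
from math import lgamma, log, exp

def log_beta(a, b):
    # log of the Beta function: log Γ(a) + log Γ(b) - log Γ(a+b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bayes_factor(heads, tails, alpha=1.0):
    # B = P(D|H1) / P(D|H0); H1: p ~ Beta(alpha, alpha), H0: p = 0.5
    log_pd_h1 = log_beta(heads + alpha, tails + alpha) - log_beta(alpha, alpha)
    log_pd_h0 = (heads + tails) * log(0.5)
    return exp(log_pd_h1 - log_pd_h0)

print(bayes_factor(140, 110))  # ≈ 0.5: H0 slightly more probable
```

So the same data that "rejects" H0 at the 5% level actually favor the unbiased coin once the alternative is averaged over its parameter.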
50. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
  - Violates likelihood principle
  - Violates stopping rule principle
  - Violates common sense
51. The likelihood principle
- In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not, such as tail probabilities P(d(D') ≥ d(D) | H0).
- This principle can be proved from two simpler principles called conditionality and sufficiency.
52. Frequentist statistics violates the likelihood principle
- "The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred." Jeffreys, 1961
53. Another example
- Suppose X ~ N(theta, sigma^2) with sigma = 1; we observe x = 3
- Compare H0: theta = 0 with H1: theta > 0
- P-value: P(X ≥ 3 | H0) ≈ 0.001, so reject H0
- Bayesian approach: update P(theta|X) using conjugate analysis; compute the Bayes factor to compare H0 and H1
54. When are P-values valid?
- Suppose X ~ N(theta, sigma^2); we observe X = x.
- One-sided hypothesis test: H0: theta ≤ theta0 vs H1: theta > theta0
- If P(theta) ∝ 1 (flat prior), then P(theta|x) = N(x, sigma^2), so P(theta ≤ theta0 | x) = Phi((theta0 - x)/sigma)
- The p-value, P(X ≥ x | theta0) = 1 - Phi((x - theta0)/sigma), is the same in this case, since the Gaussian is symmetric in its arguments
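The coincidence between p-value and posterior tail probability can be checked numerically. A sketch assuming the illustrative values from the previous slide (x = 3, theta0 = 0, sigma = 1):

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal CDF, via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

x, theta0, sigma = 3.0, 0.0, 1.0
p_value = 1 - Phi((x - theta0) / sigma)   # P(X >= x | theta = theta0)
posterior = Phi((theta0 - x) / sigma)     # P(theta <= theta0 | x), flat prior
print(p_value, posterior)                 # both ≈ 0.00135: they coincide exactly
```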
55. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
  - Violates likelihood principle
  - Violates stopping rule principle
  - Violates common sense
56. Stopping rule principle
- Inferences you make should depend only on the observed data, not on the reasons why the data was collected.
- If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.
- Follows from the likelihood principle.
57. Frequentist statistics violates the stopping rule principle
- Observe D = HHHTHHHHTHHT (9 heads, 3 tails). Is there evidence of bias (Ph > Pt)?
- Let X = 3 tails be the observed random variable and N = 12 trials be a fixed constant. Define H0: Ph = 0.5. Then P(X ≤ 3 | H0) ≈ 0.073, so at the 5% level there is no significant evidence of bias.
58. Frequentist statistics violates the stopping rule principle
- Suppose the data was generated by tossing coins until we got X = 3 tails.
- Now X = 3 tails is a fixed constant and N = 12 is a random variable: P(N = n) = C(n-1, X-1) (1/2)^n, since the first n-1 trials contain X-1 tails and the last trial is always a tail.
- Then P(N ≥ 12 | H0) ≈ 0.033, so now there is significant evidence of bias!
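Both p-values for this pair of slides can be computed exactly; a sketch:

```python
from math import comb

# Fixed design: N = 12 tosses, X = 3 tails observed; H0: fair coin.
# Binomial p-value: probability of 3 or fewer tails in 12 tosses.
p_binom = sum(comb(12, k) * 0.5**12 for k in range(4))
print(round(p_binom, 4))   # 0.0729 -> not significant at the 5% level

# Same data under a "toss until the 3rd tail" stopping rule: N is now random,
# P(N = n) = C(n-1, 2) * 0.5**n (first n-1 tosses hold 2 tails, last is a tail).
# Negative-binomial p-value: probability of needing 12 or more tosses.
p_negbin = 1 - sum(comb(n - 1, 2) * 0.5**n for n in range(3, 12))
print(round(p_negbin, 4))  # 0.0327 -> significant at the 5% level
```

Identical data, identical likelihood, opposite conclusions: the p-value depends on the experimenter's intentions.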
59. Ignoring the stopping criterion can mislead classical estimators
- Let Xi ~ Bernoulli(theta)
- Max. lik. estimator: theta-hat = (1/N) sum_i Xi
- With N fixed, the MLE is unbiased: E[theta-hat] = theta
- Toss a coin; if heads, stop, else toss a second coin. P(H) = theta, P(TH) = (1-theta) theta, P(TT) = (1-theta)^2
- Now the MLE is biased: E[theta-hat] = theta * 1 + (1-theta) theta * (1/2) + (1-theta)^2 * 0 > theta
- There are many classical rules for assessing significance when complex stopping rules are used.
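The bias of the MLE under this stopping rule can be checked exactly with rationals; a sketch:

```python
from fractions import Fraction

# Stopping rule: toss once; if heads, stop, else toss a second coin.
# Outcomes and MLE (fraction of heads among tosses actually made):
#   H  with prob t           -> MLE = 1
#   TH with prob (1-t)*t     -> MLE = 1/2
#   TT with prob (1-t)^2     -> MLE = 0
def expected_mle(t):
    return t * Fraction(1) + (1 - t) * t * Fraction(1, 2) + (1 - t)**2 * Fraction(0)

t = Fraction(1, 2)
print(expected_mle(t))  # 5/8, not 1/2: the MLE is biased upward
```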
60. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
  - Violates likelihood principle
  - Violates stopping rule principle
  - Violates common sense
61. Confidence intervals
- An interval (theta_min(D), theta_max(D)) is a 95% CI if theta lies inside this interval 95% of the time across repeated draws D ~ P(.|theta)
- This does not mean P(theta ∈ CI | D) = 0.95!
Source: MacKay, sec 37.3
62. Example
- Draw 2 integers x1, x2 independently, each equal to theta or theta + 1 with probability 1/2
- If theta = 39, we would expect (x1, x2) in {(39,39), (39,40), (40,39), (40,40)}, each with probability 1/4
63. Example
- If theta = 39, we would expect (x1, x2) in {(39,39), (39,40), (40,39), (40,40)}, each with probability 1/4
- Define the confidence interval as CI = (min(x1,x2), min(x1,x2))
- e.g. (x1, x2) = (40, 39) gives CI = (39, 39)
- 75% of the time, this will contain the true theta
64. CIs violate common sense
- If (x1, x2) = (39, 39), then CI = (39, 39) at level 75%. But clearly P(theta=39|D) = P(theta=38|D) = 0.5
- If (x1, x2) = (39, 40), then CI = (39, 39), but clearly P(theta=39|D) = 1.0
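The 75% coverage claim, and the gap between coverage and conditional probability, can be checked by simulation; a sketch:

```python
import random

# Model: x1, x2 drawn independently, each equal to theta or theta + 1 w.p. 1/2.
# "Confidence interval": the single point min(x1, x2).
random.seed(0)
theta = 39
trials = 100_000
hits = sum(min(theta + random.randint(0, 1), theta + random.randint(0, 1)) == theta
           for _ in range(trials))
print(hits / trials)  # ≈ 0.75: correct 75% coverage across repeated draws

# Yet conditional on the observed data the story differs:
#   (39, 39) -> CI = {39}, but theta = 38 and theta = 39 are equally likely
#   (39, 40) -> CI = {39}, and theta = 39 with certainty
```

The frequentist guarantee averages over datasets we did not see; given the particular data in hand, it can be obviously too confident or not confident enough.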
65. What's wrong with the classical approach?
- Violates likelihood principle
- Violates stopping rule principle
- Violates common sense
66. What's right about the Bayesian approach?
- Simple and natural
- Optimal mechanism for reasoning under uncertainty
- Generalization of Aristotelian logic that reduces to deductive logic if our hypotheses are either true or false
- Supports interesting (human-like) kinds of learning
68. Bayesian humor
- "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."