Title: Why I am a Bayesian (and why you should become one, too) or Classical statistics considered harmful
1. Why I am a Bayesian (and why you should become one, too), or Classical statistics considered harmful
- Kevin Murphy
- UBC CS & Stats
- 9 February 2005
2. Where does the title come from?
- Why I am not a Bayesian, Glymour, 1981
- Why Glymour is a Bayesian, Rosenkrantz, 1983
- Why isn't everyone a Bayesian?, Efron, 1986
- Bayesianism and causality, or, why I am only a half-Bayesian, Pearl, 2001
Many other such philosophical essays exist.
3. Frequentist vs Bayesian
Frequentist:
- Prob. = objective relative frequencies
- Params are fixed unknown constants, so cannot write e.g. P(theta > 0.5 | D)
- Estimators should be good when averaged across many trials
Bayesian:
- Prob. = degrees of belief (uncertainty)
- Can write P(anything | D)
- Estimators should be good for the available data
Source: All of Statistics, Larry Wasserman
4. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
5. Coin flipping
HHTHT
HHHHH
What process produced these sequences?
The following slides are from Tenenbaum & Griffiths.
6. Hypotheses in coin flipping
Describe processes by which D could be generated.
D = HHTHT
- Fair coin, P(H) = 0.5
- Coin with P(H) = p
- Markov model
- Hidden Markov model
- ...
8. Representing generative models
- Graphical model notation (Pearl 1988; Jordan 1998)
- Variables are nodes; edges indicate dependency
- Directed edges show the causal process of data generation
9. Models with latent structure
- Not all nodes in a graphical model need to be observed
- Some variables reflect latent structure, used in generating D but unobserved
How do we select the best model?
10. Bayes rule
P(H|D) = P(D|H) P(H) / sum_H' P(D|H') P(H')
The denominator sums over the space of hypotheses.
11. The origin of Bayes rule
- A simple consequence of using probability to represent degrees of belief
- For any two random variables: P(h, d) = P(h|d) P(d) = P(d|h) P(h), hence P(h|d) = P(d|h) P(h) / P(d)
12. Why represent degrees of belief with probabilities?
- Good statistics
  - consistency, and worst-case error bounds
- Cox axioms
  - necessary to cohere with common sense
- Dutch Book / Survival of the Fittest
  - if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord
- Provides a theory of incremental learning
  - a common currency for combining prior knowledge and the lessons of experience
13. Hypotheses in Bayesian inference
- Hypotheses H refer to processes that could have generated the data D
- Bayesian inference provides a distribution over these hypotheses, given D
- P(D|H) is the probability of D being generated by the process identified by H
- Hypotheses H are mutually exclusive: only one process could have generated D
14. Coin flipping
- Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
- Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
16. Comparing two simple hypotheses
- Contrast simple hypotheses:
  - H1: fair coin, P(H) = 0.5
  - H2: always heads, P(H) = 1.0
- Bayes rule
- With two hypotheses, use the odds form
17. Bayes rule in odds form
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]
Posterior odds = Bayes factor (likelihood ratio) x prior odds
18. Data: D = HHTHT
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^5, P(H1) = 999/1000
- P(D|H2) = 0, P(H2) = 1/1000
- Posterior odds: P(H1|D) / P(H2|D) = infinity
19. Data: D = HHHHH
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^5, P(H1) = 999/1000
- P(D|H2) = 1, P(H2) = 1/1000
- Posterior odds: P(H1|D) / P(H2|D) ≈ 30
20. Data: D = HHHHHHHHHH
- H1: fair coin; H2: always heads
- P(D|H1) = 1/2^10, P(H1) = 999/1000
- P(D|H2) = 1, P(H2) = 1/1000
- Posterior odds: P(H1|D) / P(H2|D) ≈ 1
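The three comparisons on slides 18-20 can be checked with a short script. This is a sketch using exact rational arithmetic; the 999/1000 vs 1/1000 prior is the one used on the slides.

```python
from fractions import Fraction

# Posterior odds: P(H1|D)/P(H2|D) = [P(D|H1)/P(D|H2)] * [P(H1)/P(H2)]
# H1 = fair coin, H2 = always heads; priors 999/1000 and 1/1000 as on the slides.
PRIOR_ODDS = Fraction(999, 1000) / Fraction(1, 1000)  # = 999

def posterior_odds(data):
    n = len(data)
    lik_h1 = Fraction(1, 2**n)          # fair coin assigns (1/2)^N to any sequence
    lik_h2 = Fraction(1) if all(c == 'H' for c in data) else Fraction(0)
    if lik_h2 == 0:
        return float('inf')             # a single tail rules H2 out completely
    return float(lik_h1 / lik_h2 * PRIOR_ODDS)

print(posterior_odds('HHTHT'))        # inf
print(posterior_odds('HHHHH'))        # 999/32 ≈ 31.2, i.e. "about 30"
print(posterior_odds('HHHHHHHHHH'))   # 999/1024 ≈ 0.98, i.e. "about 1"
```

Five heads in a row only raise the odds to about 30 because H2 started with such a small prior; by ten heads the fair-coin hypothesis has lost its advantage.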
21. Coin flipping
- Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
- Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
22. Comparing simple and complex hypotheses
- Which provides a better account of the data: the simple hypothesis of a fair coin (P(H) = 0.5), or the complex hypothesis that P(H) = p?
23. Comparing simple and complex hypotheses
- P(H) = p is more complex than P(H) = 0.5 in two ways:
  - P(H) = 0.5 is a special case of P(H) = p
  - for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
24. Comparing simple and complex hypotheses
[Plot: probability of the sequence as a function of p]
25. Comparing simple and complex hypotheses
[Plot: for HHHHH, the likelihood is maximized at p = 1.0]
26. Comparing simple and complex hypotheses
[Plot: for HHTHT, the likelihood is maximized at p = 0.6]
27. Comparing simple and complex hypotheses
- P(H) = p is more complex than P(H) = 0.5 in two ways:
  - P(H) = 0.5 is a special case of P(H) = p
  - for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
- How can we deal with this?
  - frequentist: hypothesis testing
  - information theorist: minimum description length
  - Bayesian: just use probability theory!
28. Comparing simple and complex hypotheses
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]
- Computing P(D|H1) is easy: P(D|H1) = 1/2^N
- Compute P(D|H2) by averaging over p
29. Comparing simple and complex hypotheses
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]
- Computing P(D|H1) is easy: P(D|H1) = 1/2^N
- Compute P(D|H2) by averaging over p: the marginal likelihood P(D|H2) = ∫ P(D|p) P(p|H2) dp, i.e. the likelihood averaged under the prior
30. Likelihood and prior
- Likelihood: P(D|p) = p^NH (1-p)^NT
  - NH = number of heads
  - NT = number of tails
- Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1)
31. A simple method of specifying priors
- Imagine some fictitious trials, reflecting a set of previous experiences
  - strategy often used with neural networks
- e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
- In fact, this is a sensible statistical idea...
32. Likelihood and prior
- Likelihood: P(D|p) = p^NH (1-p)^NT
  - NH = number of heads
  - NT = number of tails
- Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. Beta(FH, FT)
  - FH = fictitious observations of heads (pseudo-counts)
  - FT = fictitious observations of tails (pseudo-counts)
33. Posterior ∝ prior x likelihood
- Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1)
- Likelihood: P(D|p) = p^NH (1-p)^NT
- Posterior: P(p|D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1)
Same form as the prior: Beta(NH+FH, NT+FT)!
34. Conjugate priors
- Exist for many standard distributions
  - formula for exponential family conjugacy
- Define prior in terms of fictitious observations
- Beta is conjugate to Bernoulli (coin-flipping)
[Plots: Beta priors with FH = FT = 1, FH = FT = 3, FH = FT = 1000]
35. Normalizing constants
- Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT)
- Normalizing constant for the Beta distribution: B(a, b) = Γ(a) Γ(b) / Γ(a+b)
- Posterior: P(p|D) = p^(NH+FH-1) (1-p)^(NT+FT-1) / B(NH+FH, NT+FT)
- Hence the marginal likelihood is P(D|H2) = B(NH+FH, NT+FT) / B(FH, FT)
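The ratio of Beta normalizing constants gives the marginal likelihood in closed form. A sketch in standard-library Python, working with log-gamma for numerical stability; the pseudo-counts FH, FT default to the uniform prior:

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function: log Γ(a) + log Γ(b) - log Γ(a+b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(NH, NT, FH=1, FT=1):
    # P(D|H2) = ∫ p^NH (1-p)^NT Beta(p; FH, FT) dp = B(NH+FH, NT+FT) / B(FH, FT)
    return exp(log_beta(NH + FH, NT + FT) - log_beta(FH, FT))

# D = HHTHT: NH = 3, NT = 2, uniform prior
print(marginal_likelihood(3, 2))   # 1/60 ≈ 0.0167
print(0.5**5)                      # P(D|H1) = 1/32 ≈ 0.031
```

For HHTHT the fair-coin likelihood (1/32) beats the averaged likelihood of the complex hypothesis (1/60): the Bayesian Occam's razor at work.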
36. Comparing simple and complex hypotheses
- P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] x [P(H1) / P(H2)]
- Computing P(D|H1), the likelihood for H1, is easy: P(D|H1) = 1/2^N
- Compute P(D|H2), the marginal likelihood (evidence) for H2, by averaging over p
37. Marginal likelihood for H1 and H2
[Plot: probability as a function of p]
The marginal likelihood is an average over all values of p.
38. Sensitivity to hyper-parameters
39. Bayesian model selection
- Simple and complex hypotheses can be compared directly using Bayes rule
  - requires summing over latent variables
- Complex hypotheses are penalized for their greater flexibility: the Bayesian Occam's razor
- Maximum likelihood cannot be used for model selection (it always prefers the hypothesis with the largest number of parameters)
40. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
41. Example: Belgian euro-coins
- A Belgian euro spun N = 250 times came up heads X = 140 times.
- "It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%." Barry Blight, LSE (reported in The Guardian, 2002)
Source: MacKay, exercise 3.15
42. Classical hypothesis testing
- Null hypothesis H0: e.g. theta = 0.5 (unbiased coin)
- For a classical analysis, we don't need to specify an alternative hypothesis, but later we will use H1: theta ≠ 0.5
- Need a decision rule that maps data D to accept/reject of H0
- Define a scalar measure of deviance d(D) from the null hypothesis, e.g., Nh or chi-squared
43. P-values
- Define the p-value of a threshold t as p(t) = P(d(D) ≥ t | H0), with rejection region R = {D : d(D) ≥ t}
- Intuitively, the p-value of the data is the probability of getting data at least that extreme, given H0
- Usually choose t so that the false rejection rate of H0 is below the significance level alpha = 0.05
- Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞
46. P-value for euro coins
- N = 250 trials, X = 140 heads
- P-value is less than 7%
- If N = 250 and X = 141, pval = 0.0497, so we can reject the null hypothesis at the significance level of 5%.
- This does not mean P(H0|D) = 0.07!
Matlab: pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5)
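The Matlab one-liner on this slide translates directly to Python. A sketch using only the standard library, computing the two-sided tail probability for 140 heads in 250 spins:

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    # P(X <= k) for X ~ Binomial(n, p), summed exactly
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n = 250
# Two-sided tail under H0: at least 140 heads or at most 110 heads
pval = (1 - binom_cdf(139, n)) + binom_cdf(110, n)
print(round(pval, 4))  # ≈ 0.066, i.e. "less than 7%"
```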
47. Bayesian analysis of euro-coin
- Assume P(H0) = P(H1) = 0.5
- Assume P(p) = Beta(alpha, alpha) under H1
- Setting alpha = 1 yields a uniform (non-informative) prior.
48. Bayesian analysis of euro-coin
- If alpha = 1, the Bayes factor B = P(D|H1)/P(D|H0) is about 0.5, so H0 (unbiased) is (slightly) more probable than H1 (biased).
- By varying alpha over a large range, the best we can do is make B = 1.9, which does not strongly support the biased-coin hypothesis.
- Other priors yield similar results.
- The Bayesian analysis contradicts the classical analysis.
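This Bayes factor has a closed form via Beta functions. A sketch of the computation, reusing the marginal-likelihood formula from earlier in the deck (p ~ Beta(alpha, alpha) under H1, p fixed at 0.5 under H0):

```python
from math import lgamma, log, exp

def log_beta(a, b):
    # log of the Beta function: log Γ(a) + log Γ(b) - log Γ(a+b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bayes_factor(heads, tails, alpha=1.0):
    # B = P(D|H1) / P(D|H0); H1: p ~ Beta(alpha, alpha), H0: p = 0.5
    log_pd_h1 = log_beta(heads + alpha, tails + alpha) - log_beta(alpha, alpha)
    log_pd_h0 = (heads + tails) * log(0.5)
    return exp(log_pd_h1 - log_pd_h0)

print(bayes_factor(140, 110))  # ≈ 0.5: H0 slightly more probable
```

So the same data that "rejects" H0 at the 5% level actually favor the unbiased coin once the alternative is averaged over its parameter.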
50. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
  - Violates likelihood principle
  - Violates stopping rule principle
  - Violates common sense
51. The likelihood principle
- In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not, such as tail probabilities P(d(D') ≥ d(D) | H0).
- This principle can be proved from two simpler principles called conditionality and sufficiency.
52. Frequentist statistics violates the likelihood principle
- "The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred." Jeffreys, 1961
53. Another example
- Suppose X ~ N(theta, sigma^2) with sigma = 1; we observe x = 3
- Compare H0: theta = 0 with H1: theta > 0
- P-value: P(X ≥ 3 | H0) ≈ 0.001, so reject H0
- Bayesian approach: update P(theta|X) using conjugate analysis; compute the Bayes factor to compare H0 and H1
54. When are P-values valid?
- Suppose X ~ N(theta, sigma^2); we observe X = x.
- One-sided hypothesis test: H0: theta ≤ theta0 vs H1: theta > theta0
- If P(theta) ∝ 1 (flat prior), then P(theta|x) = N(x, sigma^2), so P(theta ≤ theta0 | x) = Phi((theta0 - x)/sigma)
- The p-value, P(X ≥ x | theta0) = 1 - Phi((x - theta0)/sigma), is the same in this case, since the Gaussian is symmetric in its arguments
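The coincidence between p-value and posterior tail probability can be checked numerically. A sketch assuming the illustrative values from the previous slide (x = 3, theta0 = 0, sigma = 1):

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal CDF, via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

x, theta0, sigma = 3.0, 0.0, 1.0
p_value = 1 - Phi((x - theta0) / sigma)   # P(X >= x | theta = theta0)
posterior = Phi((theta0 - x) / sigma)     # P(theta <= theta0 | x), flat prior
print(p_value, posterior)                 # both ≈ 0.00135: they coincide exactly
```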
55. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
  - Violates likelihood principle
  - Violates stopping rule principle
  - Violates common sense
56. Stopping rule principle
- Inferences you make should depend only on the observed data, not on the reasons why the data was collected.
- If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.
- Follows from the likelihood principle.
57. Frequentist statistics violates the stopping rule principle
- Observe D = HHHTHHHHTHHT (9 heads, 3 tails). Is there evidence of bias (Ph > Pt)?
- Let X = 3 tails be the observed random variable and N = 12 trials be a fixed constant. Define H0: Ph = 0.5. Then P(X ≤ 3 | H0) ≈ 0.073, so at the 5% level there is no significant evidence of bias.
58. Frequentist statistics violates the stopping rule principle
- Suppose the data was generated by tossing coins until we got X = 3 tails.
- Now X = 3 tails is a fixed constant and N = 12 is a random variable: P(N = n) = C(n-1, X-1) (1/2)^n, since the first n-1 trials contain X-1 tails and the last trial is always a tail.
- Then P(N ≥ 12 | H0) ≈ 0.033, so now there is significant evidence of bias!
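Both p-values for this pair of slides can be computed exactly; a sketch:

```python
from math import comb

# Fixed design: N = 12 tosses, X = 3 tails observed; H0: fair coin.
# Binomial p-value: probability of 3 or fewer tails in 12 tosses.
p_binom = sum(comb(12, k) * 0.5**12 for k in range(4))
print(round(p_binom, 4))   # 0.0729 -> not significant at the 5% level

# Same data under a "toss until the 3rd tail" stopping rule: N is now random,
# P(N = n) = C(n-1, 2) * 0.5**n (first n-1 tosses hold 2 tails, last is a tail).
# Negative-binomial p-value: probability of needing 12 or more tosses.
p_negbin = 1 - sum(comb(n - 1, 2) * 0.5**n for n in range(3, 12))
print(round(p_negbin, 4))  # 0.0327 -> significant at the 5% level
```

Identical data, identical likelihood, opposite conclusions: the p-value depends on the experimenter's intentions.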
59. Ignoring the stopping criterion can mislead classical estimators
- Let Xi ~ Bernoulli(theta)
- Max. lik. estimator: theta-hat = (1/N) sum_i Xi
- With N fixed, the MLE is unbiased: E[theta-hat] = theta
- Toss a coin; if heads, stop, else toss a second coin. P(H) = theta, P(TH) = (1-theta) theta, P(TT) = (1-theta)^2
- Now the MLE is biased: E[theta-hat] = theta * 1 + (1-theta) theta * (1/2) + (1-theta)^2 * 0 > theta
- There are many classical rules for assessing significance when complex stopping rules are used.
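The bias of the MLE under this stopping rule can be checked exactly with rationals; a sketch:

```python
from fractions import Fraction

# Stopping rule: toss once; if heads, stop, else toss a second coin.
# Outcomes and MLE (fraction of heads among tosses actually made):
#   H  with prob t           -> MLE = 1
#   TH with prob (1-t)*t     -> MLE = 1/2
#   TT with prob (1-t)^2     -> MLE = 0
def expected_mle(t):
    return t * Fraction(1) + (1 - t) * t * Fraction(1, 2) + (1 - t)**2 * Fraction(0)

t = Fraction(1, 2)
print(expected_mle(t))  # 5/8, not 1/2: the MLE is biased upward
```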
60. Outline
- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
  - Violates likelihood principle
  - Violates stopping rule principle
  - Violates common sense
61. Confidence intervals
- An interval (theta_min(D), theta_max(D)) is a 95% CI if theta lies inside this interval 95% of the time across repeated draws D ~ P(.|theta)
- This does not mean P(theta ∈ CI | D) = 0.95!
Source: MacKay, sec 37.3
62. Example
- Draw 2 integers x1, x2 independently, each equal to theta or theta + 1 with probability 1/2
- If theta = 39, we would expect (x1, x2) in {(39,39), (39,40), (40,39), (40,40)}, each with probability 1/4
63. Example
- If theta = 39, we would expect (x1, x2) in {(39,39), (39,40), (40,39), (40,40)}, each with probability 1/4
- Define the confidence interval as CI = (min(x1,x2), min(x1,x2))
- e.g. (x1, x2) = (40, 39) gives CI = (39, 39)
- 75% of the time, this will contain the true theta
64. CIs violate common sense
- If (x1, x2) = (39, 39), then CI = (39, 39) at level 75%. But clearly P(theta=39|D) = P(theta=38|D) = 0.5
- If (x1, x2) = (39, 40), then CI = (39, 39), but clearly P(theta=39|D) = 1.0
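The 75% coverage claim, and the gap between coverage and conditional probability, can be checked by simulation; a sketch:

```python
import random

# Model: x1, x2 drawn independently, each equal to theta or theta + 1 w.p. 1/2.
# "Confidence interval": the single point min(x1, x2).
random.seed(0)
theta = 39
trials = 100_000
hits = sum(min(theta + random.randint(0, 1), theta + random.randint(0, 1)) == theta
           for _ in range(trials))
print(hits / trials)  # ≈ 0.75: correct 75% coverage across repeated draws

# Yet conditional on the observed data the story differs:
#   (39, 39) -> CI = {39}, but theta = 38 and theta = 39 are equally likely
#   (39, 40) -> CI = {39}, and theta = 39 with certainty
```

The frequentist guarantee averages over datasets we did not see; given the particular data in hand, it can be obviously too confident or not confident enough.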
65. What's wrong with the classical approach?
- Violates likelihood principle
- Violates stopping rule principle
- Violates common sense
66. What's right about the Bayesian approach?
- Simple and natural
- Optimal mechanism for reasoning under uncertainty
- Generalization of Aristotelian logic that reduces to deductive logic if our hypotheses are either true or false
- Supports interesting (human-like) kinds of learning
68. Bayesian humor
- "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."