1
Why I am a Bayesian (and why you should become one, too), or: Classical statistics considered harmful
  • Kevin Murphy
  • UBC CS & Stats
  • 9 February 2005

2
Where does the title come from?
  • Why I am not a Bayesian, Glymour, 1981
  • Why Glymour is a Bayesian, Rosenkrantz, 1983
  • Why isn't everyone a Bayesian?, Efron, 1986
  • Bayesianism and causality, or, why I am only a
    half-Bayesian, Pearl, 2001

Many other such philosophical essays
3
Frequentist vs Bayesian
  • Frequentist: prob. = objective relative frequencies
  • Params are fixed unknown constants, so cannot write e.g. P(θ = 0.5 | D)
  • Estimators should be good when averaged across many trials
  • Bayesian: prob. = degrees of belief (uncertainty)
  • Can write P(anything | D)
  • Estimators should be good for the available data

Source: All of Statistics, Larry Wasserman
4
Outline
  • Hypothesis testing: Bayesian approach
  • Hypothesis testing: classical approach
  • What's wrong with the classical approach?

5
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
The following slides are from Tenenbaum & Griffiths.
6
Hypotheses in coin flipping
Describe processes by which D could be generated
D = HHTHT
  • Fair coin, P(H) = 0.5
  • Coin with P(H) = p
  • Markov model
  • Hidden Markov model
  • ...

8
Representing generative models
  • Graphical model notation
  • Pearl (1988), Jordan (1998)
  • Variables are nodes; edges indicate dependency
  • Directed edges show the causal process of data generation

9
Models with latent structure
  • Not all nodes in a graphical model need to be
    observed
  • Some variables reflect latent structure, used in
    generating D but unobserved

How do we select the best model?
10
Bayes' rule

P(H|D) = P(D|H) P(H) / Σ_H' P(D|H') P(H')

The denominator sums over the space of hypotheses.
11
The origin of Bayes' rule
  • A simple consequence of using probability to represent degrees of belief
  • For any two random variables:
  • P(A, B) = P(A|B) P(B) = P(B|A) P(A), hence P(A|B) = P(B|A) P(A) / P(B)
12
Why represent degrees of belief with
probabilities?
  • Good statistics
  • consistency, and worst-case error bounds.
  • Cox Axioms
  • necessary to cohere with common sense
  • Dutch Book / Survival of the Fittest
  • if your beliefs do not accord with the laws of
    probability, then you can always be out-gambled
    by someone whose beliefs do so accord.
  • Provides a theory of incremental learning
  • a common currency for combining prior knowledge
    and the lessons of experience.

13
Hypotheses in Bayesian inference
  • Hypotheses H refer to processes that could have
    generated the data D
  • Bayesian inference provides a distribution over
    these hypotheses, given D
  • P(D|H) is the probability of D being generated by the process identified by H
  • Hypotheses H are mutually exclusive: only one process could have generated D

14
Coin flipping
  • Comparing two simple hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
  • Comparing simple and complex hypotheses
  • P(H) = 0.5 vs. P(H) = p

16
Comparing two simple hypotheses
  • Contrast simple hypotheses:
  • H1: fair coin, P(H) = 0.5
  • H2: always heads, P(H) = 1.0
  • Bayes' rule
  • With two hypotheses, use the odds form

17
Bayes' rule in odds form

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

Posterior odds = Bayes factor (likelihood ratio) × prior odds
18
Data: D = HHTHT
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
  • H1, H2: fair coin, always heads
  • P(D|H1) = 1/2^5, P(H1) = 999/1000
  • P(D|H2) = 0, P(H2) = 1/1000
  • P(H1|D) / P(H2|D) = infinity
19
Data: D = HHHHH
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
  • H1, H2: fair coin, always heads
  • P(D|H1) = 1/2^5, P(H1) = 999/1000
  • P(D|H2) = 1, P(H2) = 1/1000
  • P(H1|D) / P(H2|D) ≈ 30
20
Data: D = HHHHHHHHHH
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
  • H1, H2: fair coin, always heads
  • P(D|H1) = 1/2^10, P(H1) = 999/1000
  • P(D|H2) = 1, P(H2) = 1/1000
  • P(H1|D) / P(H2|D) ≈ 1
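These three calculations are easy to reproduce; a minimal MATLAB sketch (my addition, not part of the original slides):

% Posterior odds P(H1|D)/P(H2|D), H1 = fair coin, H2 = always heads,
% with the priors used above: P(H1) = 999/1000, P(H2) = 1/1000.
priorOdds = (999/1000) / (1/1000);
datasets = {'HHTHT', 'HHHHH', 'HHHHHHHHHH'};
for i = 1:numel(datasets)
    D = datasets{i};
    pD_H1 = (1/2)^length(D);           % fair coin: 1/2 per flip
    pD_H2 = double(all(D == 'H'));     % always heads: 1 if no tails, else 0
    odds = (pD_H1 / pD_H2) * priorOdds;   % Inf when D contains a tail
    fprintf('%-12s posterior odds = %.3g\n', D, odds);
end

This prints Inf, about 31, and about 0.98, matching the three slides above.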
21
Coin flipping
  • Comparing two simple hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
  • Comparing simple and complex hypotheses
  • P(H) = 0.5 vs. P(H) = p

22
Comparing simple and complex hypotheses
P(H) = 0.5 vs. P(H) = p
  • Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?

23
Comparing simple and complex hypotheses
  • P(H) = p is more complex than P(H) = 0.5 in two ways:
  • P(H) = 0.5 is a special case of P(H) = p
  • for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5

24
Comparing simple and complex hypotheses
[Plot: probability of D as a function of p]
25
Comparing simple and complex hypotheses
[Plot: probability of D as a function of p; D = HHHHH is most probable when p = 1.0]
26
Comparing simple and complex hypotheses
[Plot: probability of D as a function of p; D = HHTHT is most probable when p = 0.6]
27
Comparing simple and complex hypotheses
  • P(H) = p is more complex than P(H) = 0.5 in two ways:
  • P(H) = 0.5 is a special case of P(H) = p
  • for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
  • How can we deal with this?
  • frequentist: hypothesis testing
  • information theorist: minimum description length
  • Bayesian: just use probability theory!

28
Comparing simple and complex hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

  • Computing P(D|H1) is easy:
  • P(D|H1) = 1/2^N
  • Compute P(D|H2) by averaging over p; this is the marginal likelihood:

P(D|H2) = ∫ P(D|p) P(p) dp   (marginal likelihood = likelihood averaged under the prior)
30
Likelihood and prior
  • Likelihood:
  • P(D|p) = p^NH (1-p)^NT
  • NH = number of heads
  • NT = number of tails
  • Prior:
  • P(p) ∝ p^(FH-1) (1-p)^(FT-1)
31
A simple method of specifying priors
  • Imagine some fictitious trials, reflecting a set
    of previous experiences
  • strategy often used with neural networks
  • e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
  • In fact, this is a sensible statistical idea...

32
Likelihood and prior
  • Likelihood:
  • P(D|p) = p^NH (1-p)^NT
  • NH = number of heads
  • NT = number of tails
  • Prior:
  • P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. p ~ Beta(FH, FT)
  • FH = fictitious observations of heads (pseudo-counts)
  • FT = fictitious observations of tails
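A two-line illustration (my addition) of how the pseudo-counts combine with the data, anticipating the update rule derived on the next slide:

% Prior Beta(FH, FT) plus NH heads, NT tails gives posterior Beta(NH+FH, NT+FT).
FH = 1000; FT = 1000;      % strong prior belief that the coin is fair
NH = 3;    NT = 2;         % observed counts for D = HHTHT
postMean = (NH + FH) / (NH + FH + NT + FT)   % about 0.5002: the prior dominates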
33
Posterior ∝ prior × likelihood
  • Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1)
  • Likelihood: P(D|p) = p^NH (1-p)^NT
  • Posterior: P(p|D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1)

Same form as the prior: Beta(NH+FH, NT+FT)!
34
Conjugate priors
  • Exist for many standard distributions
  • formula for exponential family conjugacy
  • Define prior in terms of fictitious observations
  • Beta is conjugate to Bernoulli (coin-flipping)

[Plots: Beta(FH, FT) densities for FH = FT = 1, FH = FT = 3, FH = FT = 1000]
35
Normalizing constants
  • Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT)
  • Normalizing constant for the Beta distribution: B(a, b) = Γ(a) Γ(b) / Γ(a+b)
  • Posterior: P(p|D) = p^(NH+FH-1) (1-p)^(NT+FT-1) / B(NH+FH, NT+FT)
  • Hence the marginal likelihood is P(D) = B(NH+FH, NT+FT) / B(FH, FT)
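A quick numerical check of this formula (my addition), using MATLAB's betaln for stability; for D = HHTHT and a uniform Beta(1,1) prior, the fair coin is actually favored:

% Marginal likelihood P(D|H2) = B(NH+FH, NT+FT) / B(FH, FT).
NH = 3; NT = 2;                                      % D = HHTHT
FH = 1; FT = 1;                                      % uniform prior
ML_H2 = exp(betaln(NH+FH, NT+FT) - betaln(FH, FT))   % 1/60, about 0.0167
ML_H1 = (1/2)^(NH+NT)                                % 1/32, about 0.0313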

36
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

  • The likelihood for H1 is easy: P(D|H1) = 1/2^N
  • The marginal likelihood (evidence) for H2 averages over p:
    P(D|H2) = ∫ P(D|p) P(p) dp = B(NH+FH, NT+FT) / B(FH, FT)
37
Marginal likelihood for H1 and H2
[Plot: probability of D as a function of p for H1 and H2. The marginal likelihood is an average over all values of p.]
38
Sensitivity to hyper-parameters
[Plots: how the comparison depends on the prior hyper-parameters FH, FT]
39
Bayesian model selection
  • Simple and complex hypotheses can be compared directly using Bayes' rule
  • requires summing over latent variables
  • Complex hypotheses are penalized for their greater flexibility: the Bayesian Occam's razor
  • Maximum likelihood cannot be used for model selection (it always prefers the hypothesis with the largest number of parameters)

40
Outline
  • Hypothesis testing: Bayesian approach
  • Hypothesis testing: classical approach
  • What's wrong with the classical approach?

41
Example: Belgian euro-coins
  • A Belgian euro spun N = 250 times came up heads X = 140 times.
  • "It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%." – Barry Blight, LSE (reported in The Guardian, 2002)

Source: MacKay, exercise 3.15
42
Classical hypothesis testing
  • Null hypothesis H0, e.g. θ = 0.5 (unbiased coin)
  • For classical analysis, we don't need to specify an alternative hypothesis, but later we will use H1: θ ≠ 0.5
  • Need a decision rule that maps data D to accept/reject of H0
  • Define a scalar measure of deviance d(D) from the null hypothesis, e.g., Nh or χ²

43
P-values
  • Define the rejection region R = {D : d(D) > t} for some threshold t on the deviance
  • Intuitively, the p-value of the data is the probability of getting data at least that extreme given H0: p-value(D) = P(d(D') ≥ d(D) | H0)
  • Usually choose t so that the false rejection rate of H0 is below the significance level α = 0.05
  • Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞
46
P-value for euro coins
  • N = 250 trials, X = 140 heads
  • The two-sided p-value is less than 7%
  • If N = 250 and X = 141, pval = 0.0497, so we can reject the null hypothesis at the significance level of 5%.
  • This does not mean P(H0|D) = 0.07!

% MATLAB: probability of a result at least as extreme as 140 heads under H0
n = 250;
pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5)   % about 0.066
47
Bayesian analysis of euro-coin
  • Assume P(H0) = P(H1) = 0.5
  • Assume P(p) = Beta(α, α)
  • Setting α = 1 yields a uniform (non-informative) prior.

48
Bayesian analysis of euro-coin
  • If α = 1, the Bayes factor is B = P(D|H1) / P(D|H0) ≈ 0.48, so H0 (unbiased) is (slightly) more probable than H1 (biased).
  • By varying α over a large range, the best we can do is make B ≈ 1.9, which does not strongly support the biased-coin hypothesis.
  • Other priors yield similar results.
  • The Bayesian analysis contradicts the classical analysis.
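The quoted Bayes factor is reproduced by a few lines of MATLAB (my sketch, following MacKay's exercise 3.15):

% Bayes factor B = P(D|H1)/P(D|H0) for 140 heads in 250 spins, alpha = 1.
n = 250; x = 140;
logpD_H0 = n * log(0.5);          % P(D|H0) = (1/2)^250
logpD_H1 = betaln(x+1, n-x+1);    % P(D|H1) = B(141, 111) under a Beta(1,1) prior
B = exp(logpD_H1 - logpD_H0)      % about 0.48: H0 slightly favored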

49
Outline
  • Hypothesis testing: Bayesian approach
  • Hypothesis testing: classical approach
  • What's wrong with the classical approach?

50
Outline
  • Hypothesis testing: Bayesian approach
  • Hypothesis testing: classical approach
  • What's wrong with the classical approach?
  • Violates the likelihood principle
  • Violates the stopping rule principle
  • Violates common sense

51
The likelihood principle
  • In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not, such as the probability of results more extreme than the ones seen.
  • This principle can be proved from two simpler principles, called conditionality and sufficiency.

52
Frequentist statistics violates the likelihood
principle
  • "The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred." – Jeffreys, 1961

53
Another example
  • Suppose X ~ N(μ, σ²) with σ = 1; we observe x = 3
  • Compare H0: μ = 0 with H1: μ > 0
  • P-value: P(X ≥ 3 | H0) ≈ 0.001, so reject H0
  • Bayesian approach: update P(μ | X) using conjugate analysis; compute the Bayes factor to compare H0 and H1

54
When are P-values valid?
  • Suppose X ~ N(μ, σ²); we observe X = x.
  • One-sided hypothesis test: H0: μ ≤ μ0 vs H1: μ > μ0
  • If P(μ) ∝ 1, then P(μ | x) = N(x, σ²), so P(μ ≤ μ0 | x) = Φ((μ0 - x)/σ) = P(X ≥ x | μ = μ0)
  • The p-value equals the posterior probability of H0 in this case, since the Gaussian is symmetric in its arguments
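A numerical check (my addition) that the two quantities coincide:

% One-sided Gaussian test: classical p-value vs posterior P(mu <= mu0 | x).
mu0 = 0; sigma = 1; x = 3;
pval = 1 - normcdf(x, mu0, sigma)     % P(X >= x | mu = mu0), about 0.00135
postH0 = normcdf(mu0, x, sigma)       % P(mu <= mu0 | x) under the flat prior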

55
Outline
  • Hypothesis testing: Bayesian approach
  • Hypothesis testing: classical approach
  • What's wrong with the classical approach?
  • Violates the likelihood principle
  • Violates the stopping rule principle
  • Violates common sense

56
Stopping rule principle
  • Inferences you make should depend only on the observed data, not on the reasons why the data were collected.
  • If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.
  • Follows from the likelihood principle.

57
Frequentist statistics violates stopping rule
principle
  • Observe D = HHHTHHHHTHHT. Is there evidence of bias (Ph > Pt)?
  • Let X = 3 tails be the observed random variable and N = 12 trials a fixed constant. Define H0: Ph = 0.5. Then, at the 5% level, there is no significant evidence of bias.

58
Frequentist statistics violates stopping rule
principle
  • Suppose instead that the data were generated by tossing the coin until we got X = 3 tails.
  • Now X = 3 tails is a fixed constant and N = 12 is a random variable. Now there is significant evidence of bias!

The first n-1 trials contain x-1 tails; the last trial is always a tail.
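Both p-values can be checked in one line each; a minimal MATLAB sketch (my addition; the data enter only through the counts 9 heads and 3 tails):

% Same data (9 heads, 3 tails), two stopping rules, two p-values.
p_binom = 1 - binocdf(8, 12, 0.5)      % fixed N = 12: P(9 or more heads), about 0.073
p_negbinom = 1 - nbincdf(8, 3, 0.5)    % toss until 3rd tail: P(9 or more heads
                                       % before the 3rd tail), about 0.033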
59
Ignoring stopping criterion can mislead classical
estimators
  • Let Xi ~ Bernoulli(θ)
  • Max. lik. estimator: θ_MLE = (number of heads) / (number of tosses)
  • With a fixed number of tosses, the MLE is unbiased
  • Now toss a coin; if heads, stop, else toss a second coin. P(H) = θ, P(TH) = θ(1-θ), P(TT) = (1-θ)².
  • Now the MLE is biased!
  • Many classical rules exist for assessing significance when complex stopping rules are used.
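The bias can be verified by direct expectation (my sketch): the three outcomes H, TH, TT yield estimates 1, 1/2 and 0 respectively:

% E[theta_MLE] under the "stop after the first head, else two tosses" rule.
theta = 0.5;
E_mle = theta*1 + (1-theta)*theta*(1/2) + (1-theta)^2*0   % 0.625, not 0.5: biased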

60
Outline
  • Hypothesis testing: Bayesian approach
  • Hypothesis testing: classical approach
  • What's wrong with the classical approach?
  • Violates the likelihood principle
  • Violates the stopping rule principle
  • Violates common sense

61
Confidence intervals
  • An interval (θ_min(D), θ_max(D)) is a 95% CI if θ lies inside this interval 95% of the time across repeated draws D ~ P(·|θ)
  • This does not mean P(θ ∈ CI | D) = 0.95!

Source: MacKay, sec. 37.3
62
Example
  • Draw 2 integers x1, x2 independently from {θ, θ+1}, each value with probability 1/2
  • If θ = 39, we would expect (x1, x2) to be (39,39), (39,40), (40,39), (40,40), each with probability 1/4

63
Example
  • If θ = 39, we would expect (x1, x2) to be (39,39), (39,40), (40,39), (40,40), each with probability 1/4
  • Define the confidence interval as CI(D) = (s, s), where s = min(x1, x2)
  • e.g. (x1, x2) = (40, 39) gives CI = (39, 39)
  • 75% of the time, this interval will contain the true θ

64
CIs violate common sense
  • If θ = 39, we would expect (x1, x2) to be (39,39), (39,40), (40,39), (40,40), each with probability 1/4
  • If (x1, x2) = (39, 39), then CI = (39, 39) at level 75%. But clearly P(θ = 39 | D) = P(θ = 38 | D) = 0.5
  • If (x1, x2) = (39, 40), then CI = (39, 39), but clearly P(θ = 39 | D) = 1.0
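A small simulation (my addition) confirming the 75% coverage, while the two cases above show why coverage says nothing about P(θ|D):

% Coverage of CI(D) = (s, s) with s = min(x1, x2), xi uniform on {theta, theta+1}.
theta = 39; T = 1e5;
x = theta + (rand(T, 2) < 0.5);           % each entry is theta or theta+1
s = min(x, [], 2);                        % CI endpoint for each simulated draw
coverage = mean(s == theta)               % about 0.75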

65
What's wrong with the classical approach?
  • Violates likelihood principle
  • Violates stopping rule principle
  • Violates common sense

66
What's right about the Bayesian approach?
  • Simple and natural
  • Optimal mechanism for reasoning under uncertainty
  • Generalization of Aristotelian logic that reduces
    to deductive logic if our hypotheses are either
    true or false
  • Supports interesting (human-like) kinds of
    learning

67
(No Transcript)
68
Bayesian humor
  • "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."