Bayesian Model Comparison and Occam - PowerPoint PPT Presentation

1
Bayesian Model Comparison and Occam's Razor
  • Lecture 2

2
A Picture of Occam's Razor
3
Occam's razor
  • "All things being equal, the simplest solution
    tends to be the best one," or alternately, "the
    simplest explanation tends to be the right one."
    In other words, when multiple competing theories
    are equal in other respects, the principle
    recommends selecting the theory that introduces
    the fewest assumptions and postulates the fewest
    hypothetical entities. It is in this sense that
    Occam's razor is usually understood.
  • Wikipedia

4
Copernican versus Ptolemaic View of the Universe
  • Copernicus proposed a model of the solar system
    in which the earth revolved around the sun.
    Ptolemy (about 1,400 years earlier) had proposed
    a theory of the universe in which planetary
    bodies revolved around the earth; he used
    epicycles to explain planetary motion.
  • Copernicus's theory won because it provided a
    simpler framework for explaining astronomical
    motion. Epicycles also accounted for the observed
    motion, but through an unnecessarily complex
    framework with little predictive power.

5
Occam's Razor: Choose Simple Models When Possible
6
Boxes behind a tree
  • In the figure, are there one or two boxes behind
    the tree? A one-box theory does not assume an
    unlikely coincidence (such as two boxes happening
    to be exactly the same height), yet it explains
    the data as we see it. A two-box theory also
    explains the data, but only by assuming such an
    unlikely coincidence.

7
Statistical Models
  • Statistical models are designed to describe data
    by postulating that the data X = (x1, ..., xn)
    follow a density f(X|θ) in a class
    {f(X|θ) : θ ∈ Θ} (possibly nonparametric).
  • For a given parameter θ0, we can compare the
    likelihood of data values X01 vs. X02 via the
    ratio f(X01|θ0) / f(X02|θ0). If this is > 1, the
    first datum is more likely; if < 1, the second
    is more likely.
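A minimal numeric sketch of this likelihood-ratio comparison (the N(0,1) model and the data values x01 = 0.5, x02 = 2.0 are made up for illustration):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Density of N(mu, sigma^2) at x.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Fixed parameter theta0 = (mu=0, sigma=1); two candidate data values.
x01, x02 = 0.5, 2.0
ratio = normal_pdf(x01) / normal_pdf(x02)   # f(x01|theta0) / f(x02|theta0)
print(ratio)   # > 1, so x01 is the more likely datum under theta0
```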

8
Bayesian Model Comparison
  • We evaluate statistical models via Bayes'
    theorem: P(M|X) = P(X|M) P(M) / P(X).
  • The term P(X|M) is the likelihood; the term
    P(M) is the prior; the term P(X) is the
    marginal density of the data.
  • When comparing two models M1 and M2, we need only
    look at the ratio, since P(X) cancels:
    P(M1|X) / P(M2|X) = [P(X|M1) P(M1)] / [P(X|M2) P(M2)].
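This ratio can be sketched numerically; the likelihood and prior values below are made-up numbers for the box example, where both models fit equally well but one is far more probable a priori:

```python
# Posterior odds = Bayes factor x prior odds; P(X) cancels in the ratio.
def posterior_odds(lik1, lik2, prior1, prior2):
    return (lik1 / lik2) * (prior1 / prior2)

# Hypothetical numbers: M1 and M2 explain the data equally well,
# but M1 (one box) is far more probable a priori than M2 (two boxes).
odds = posterior_odds(lik1=0.9, lik2=0.9, prior1=0.99, prior2=0.01)
print(odds)   # about 99: M1 is strongly preferred
```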

9
Bayesian Model Comparison (continued)
  • In comparing the two models, the term P(X|M)
    measures how well model M explains the data. In
    our tree example, both the one-box and two-box
    theories explain the data well, so the
    likelihoods don't help us decide between the
    one- and two-box models. But the prior
    probability P(M1) of the one-box theory is much
    larger than the probability P(M2). So we prefer
    the one-box to the two-box theory. Note that
    likelihood-based quantities such as the MLE
    express no preference between the one- and
    two-box theories.

10
Model Comparison when parameters are present
  • If parameters θ are present, we want to use the
    marginal density of the data,
    P(X|M) = ∫ f(X|θ, M) π(θ|M) dθ.
  • This is the average score of the data: the
    likelihood averaged over the prior on θ.
  • Calculus shows (see the appendix) that this
    integral is approximately the likelihood at the
    MLE, times the prior at the MLE, times an Occam
    factor reflecting the posterior uncertainty.
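A small self-contained check of this approximation (not from the slides): for a binomial likelihood with k = 7 successes in n = 10 trials and a uniform prior, the average score has the closed form k!(n-k)!/(n+1)!, so the Laplace approximation can be compared against the exact value:

```python
import math

k, n = 7, 10   # hypothetical data: 7 successes in 10 trials, uniform prior

# Exact marginal: the integral of theta^k (1-theta)^(n-k) over [0,1]
# is the Beta function, k! (n-k)! / (n+1)!.
exact = math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)

# Laplace approximation: expand the log integrand around its mode k/n;
# marginal ~ g(theta_hat) * sqrt(2*pi / |(log g)''(theta_hat)|).
theta_hat = k / n
log_g = k * math.log(theta_hat) + (n - k) * math.log(1 - theta_hat)
curvature = n**3 / (k * (n - k))   # |second derivative of log g at the mode|
laplace = math.exp(log_g) * math.sqrt(2 * math.pi / curvature)

print(exact, laplace)   # close, but not identical
```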

11
The Occam factor
  • Now, if we had two models M1, M2 which explained
    the data equally well, but the first provided
    more certain (posterior) information than the
    second, we prefer the first model to the second.
    The likelihood scores are similar for the two
    models, but the Occam factor |Σ(θ|M)|^(1/2), the
    posterior uncertainty, is smaller for the first
    model than for the second.

12
Example of Model Comparison when Parameters are
present
  • Say we want to choose between two regression
    models for a set of bivariate data. The first is
    a linear model and the second is a polynomial
    model involving terms up to the fourth power.
    The second always does a better job of fitting
    the data than the first. But the posterior
    uncertainty of the second tends to be larger
    than that of the first, because the presence of
    additional parameters adds posterior uncertainty.
    Note that classical statistics, judging by fit
    alone, always views the second as better than
    the first.
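This effect can be simulated. The sketch below uses BIC (a large-sample simplification of the Laplace-approximated score) as a stand-in for the full Occam-factor calculation, on made-up linear data: the quartic always attains the smaller residual sum of squares, yet the penalized score usually prefers the linear model.

```python
import numpy as np

def bic(x, y, deg):
    # BIC for a degree-`deg` polynomial fit: n*log(RSS/n) + k*log(n),
    # dropping additive constants shared by all models being compared.
    coefs = np.polyfit(x, y, deg)
    rss = float(np.sum((y - np.polyval(coefs, x)) ** 2))
    return x.size * np.log(rss / x.size) + (deg + 1) * np.log(x.size)

x = np.linspace(-3, 3, 30)
wins = 0
for seed in range(50):
    rng = np.random.default_rng(seed)
    y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=x.size)  # truly linear data
    wins += bic(x, y, 1) < bic(x, y, 4)   # does the linear model score better?
print(wins, "of 50 replications favor the linear model")
```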

13
An example
  • The data: (-8,8), (-2,10), (6,11) (see next slide)
  • The models:
    H0: y = β0 + e    H1: y = β0 + β1 x + e
  • Parameters have simple gaussian priors and σe = 1.
  • Score0 = f(√3 sY) f(Ȳ) (1/√3) ≈ 1.5 × 10^-23
  • Score1 = f(√3 sY √(1-ρ²)) f(b0) f(b1) (1/(3 sX))
    ≈ 7.1 × 10^-24
  • Score1 / Score0 ≈ 0.71/15 ≈ 0.05

14
Example Explained: H0
  • Score0 = f(√3 sY) f(Ȳ) (1/√3) ≈ 1.5 × 10^-23
  • Ȳ is the average of the Y's; f is the gaussian
    density.
  • f(√3 sY) is the likelihood under the null model
    (with the MLE plugged in).
  • f(Ȳ) is the prior under the null model (evaluated
    at the MLE).
  • (1/√3) is the inverse of the square root of the
    information.

15
Example Explained: H1
  • Score1 = f(√3 sY √(1-ρ²)) f(b0) f(b1) (1/(3 sX))
  • f(√3 sY √(1-ρ²)) is the likelihood under the
    alternative model (with the MLEs plugged in).
  • f(b0) f(b1) is the prior under the alternative
    model (evaluated at the MLEs).
  • b0, b1 are the usual least-squares estimates.
  • (1/(3 sX)) is the inverse of the square root of
    the information.

16
Regression Example
17
Classical Statistics falls short
  • Comparing the likelihoods (at the MLEs) without
    regard to the Occam factor gives:
  • Classical null score: f(√3 sY) ≈ .012
  • Classical alternative score: f(√3 sY √(1-ρ²))
    ≈ .3146
  • By this criterion the alternative model is to be
    preferred. But we can see from the picture that
    it isn't much good, and it adds complexity which
    doesn't serve a good purpose.

18
Stats for the linear model
  • sx = 7.02, sy = 1.53, mean(y) = 9.66,
    mean(x) = -1.33
  • b0 = 9.9459
  • b1 = 0.2095
  • 95% confidence intervals (BINT):
    b0: [5.5683, 14.3236]
    b1: [-0.5340, 0.9530]
  • Residuals (R): -0.2703, 0.4730, -0.2027

19
Dice Example
  • We roll a die 30 times, getting face counts
    4, 4, 3, 3, 7, 9. Is it a fair die? Would you be
    willing to gamble using it?
  • H0: p1 = ... = p6 = 1/6    H1: p ~ Dirichlet(1, ..., 1)
  • What does chi-squared goodness of fit say? The
    chi-squared p-value is .31 -- we would never
    reject the null in this case.
  • What does Bayes theory say?
  • The score under H0 is the multinomial likelihood,
    (30 choose 4,4,3,3,7,9) (1/6)^30 ≈ 3.2 × 10^-5.

20
Dice Example (continued)
  • Under the alternative, the score is the
    Dirichlet-multinomial marginal likelihood. In
    this case, the Laplace approximation is slightly
    off; the exact answer is about 3 × 10^-6.
  • So, roughly, the null is about 10 times as
    likely as the alternative. This agrees with the
    chi-squared test, which also fails to reject
    the null.
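Both scores have exact closed forms here, so the Bayes factor can be computed without the Laplace approximation (a sketch using log-gamma arithmetic):

```python
from math import lgamma, log, exp

counts = [4, 4, 3, 3, 7, 9]   # face counts from 30 rolls of the die
n = sum(counts)               # 30

# Log multinomial coefficient: n! / (n1! ... n6!).
log_coef = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

# H0 (fair die): multinomial likelihood with all p_i = 1/6.
log_m0 = log_coef + n * log(1 / 6)

# H1 (p ~ Dirichlet(1,...,1)): the Dirichlet-multinomial marginal,
# coef * Gamma(6)/Gamma(n+6) * prod_i Gamma(c_i + 1).
log_m1 = (log_coef + lgamma(6) - lgamma(n + 6)
          + sum(lgamma(c + 1) for c in counts))

bf = exp(log_m0 - log_m1)   # Bayes factor favoring the fair-die hypothesis
print(bf)   # about 10: the null is roughly ten times as likely
```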

21
Possible Project
  • Construct or otherwise obtain bivariate data
    which are essentially linearly related with
    noise. Assume the linear and higher-power models
    have equal prior probability. Calculate the
    average score for the linear and higher-order
    models, and show that the average score for the
    linear model is best.

22
Another Possible Project
  • Generate multinomial data from a distribution
    with equal cell probabilities. For the generated
    data, determine the chi-squared p-value and
    compare it to the Bayes factor favoring the null
    (true) hypothesis; determine how the chi-squared
    values differ from their Bayes-factor
    counterparts over many simulations.

23
Appendix: Laplace approximation
  • In the usual setting, expanding the log of the
    integrand around its mode θ̂ gives
    ∫ f(X|θ) π(θ) dθ ≈ f(X|θ̂) π(θ̂) (2π)^(d/2) |H|^(-1/2),
    where d is the number of parameters and H is the
    negative Hessian of log[f(X|θ) π(θ)] at θ̂.

24
Possible Project
  • Fill in the mathematical steps in the
    calculation of the marginal distribution of the
    data, and compare it to the Laplace
    approximation.