Transcript and Presenter's Notes

Title: CSC321: Lecture 8: The Bayesian way to fit models


1
CSC321 Lecture 8: The Bayesian way to fit models
  • Geoffrey Hinton

2
The Bayesian framework
  • The Bayesian framework assumes that we always
    have a prior distribution for everything.
  • The prior may be very vague.
  • When we see some data, we combine our prior
    distribution with a likelihood term to get a
    posterior distribution.
  • The likelihood term takes into account how
    probable the observed data is given the
    parameters of the model.
  • It favors parameter settings that make the data
    likely.
  • It fights the prior.
  • With enough data, the likelihood terms always win.

3
A coin tossing example
  • Suppose you know nothing about coins except that
    each tossing event produces a head with some
    unknown probability p and a tail with probability
    1-p.
  • Suppose we observe 100 tosses and there are 53
    heads. What is p?
  • The frequentist answer: pick the value of p that
    makes the observation of 53 heads and 47 tails
    most probable (a maximum-likelihood sketch is
    below).
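A minimal sketch of that frequentist (maximum-likelihood) answer, assuming a Bernoulli model for each toss; the counts come from the slide, while the function name and the grid search are illustrative choices:

    import numpy as np

    heads, tails = 53, 47

    def log_likelihood(p):
        # log probability of 53 heads and 47 tails for a coin
        # whose probability of heads is p
        return heads * np.log(p) + tails * np.log(1 - p)

    # evaluate a fine grid of candidate values and keep the best one
    grid = np.linspace(0.001, 0.999, 999)
    p_ml = grid[np.argmax(log_likelihood(grid))]
    print(p_ml)   # ~0.53, i.e. heads / (heads + tails)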

4
Some problems with picking the parameters that
are most likely to generate the data
  • What if we only tossed the coin once and we got 1
    head?
  • Is p = 1 a sensible answer?
  • Surely p = 0.5 is a much better answer.
  • Is it reasonable to give a single answer?
  • If we don't have much data, we are unsure about
    p.
  • Our computations will work much better if we
    take this uncertainty into account.

5
Using a distribution over parameters
  • Start with a prior distribution over p.
  • Multiply the prior probability of each parameter
    value by the probability of observing a head
    given that value.
  • Then renormalize to get the posterior
    distribution (a numerical sketch follows the
    figure).

[Figure: a uniform prior density over p (area 1) is multiplied by the likelihood of a head and renormalized, giving a posterior density that also has area 1.]
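A minimal numerical sketch of this single update on a grid of candidate values for p (the grid and variable names are illustrative assumptions, and the densities are treated as a discrete distribution over the grid):

    import numpy as np

    p = np.linspace(0.001, 0.999, 999)        # candidate values for p
    prior = np.ones_like(p)                   # uniform prior
    prior = prior / prior.sum()               # normalize over the grid

    likelihood_head = p                       # P(head | p) = p
    posterior = prior * likelihood_head       # multiply prior by likelihood
    posterior = posterior / posterior.sum()   # renormalize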
6
Let's do it again
  • Start with a prior distribution over p.
  • Multiply the prior probability of each parameter
    value by the probability of observing a tail
    given that value.
  • Then renormalize to get the posterior
    distribution.

[Figure: the previous posterior (area 1) serves as the new prior; multiplying by the likelihood of a tail and renormalizing gives a new posterior density over p, again with area 1.]
7
Let's do it another 98 times
  • After 53 heads and 47 tails we get a very
    sensible posterior distribution that has its peak
    at 0.53 (assuming a uniform prior); a sketch
    follows the figure.

[Figure: the posterior density over p after 53 heads and 47 tails, sharply peaked near 0.53, with area 1.]
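Continuing the grid sketch for all 100 tosses (the closed-form Beta remark is standard background, not from the slide):

    import numpy as np

    p = np.linspace(0.001, 0.999, 999)
    posterior = np.ones_like(p)                     # uniform prior
    posterior = posterior * p**53 * (1 - p)**47     # likelihood of 53 heads, 47 tails
    posterior = posterior / posterior.sum()         # renormalize
    print(p[np.argmax(posterior)])                  # peak at ~0.53

With a uniform prior this posterior is a Beta(54, 48) distribution, which is one reason the biased-coin case stays easy to handle exactly.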
8
Bayes' Theorem
The joint probability of a weight vector W and the observed data D can be factored through either conditional probability, p(W, D) = p(W | D) p(D) = p(D | W) p(W), which gives

    p(W | D) = p(D | W) p(W) / p(D)

where p(W | D) is the posterior probability of W, p(W) is the prior probability of W, and p(D | W) is the probability of the observed data given W.
9
Why we maximize sums of log probs
  • We want to maximize products of probabilities of
    a set of independent events.
  • Assume the output errors on different training
    cases are independent.
  • Assume the priors on weights are independent.
  • Because the log function is monotonic, we can
    equivalently maximize sums of log probabilities
    (see the identity below).
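The identity being used, stated for completeness in LaTeX notation (n and p_c are generic symbols, not from the slide):

    \log \prod_{c=1}^{n} p_c = \sum_{c=1}^{n} \log p_c

Because \log is monotonically increasing, the same parameter setting maximizes both sides; working with the sum also avoids the numerical underflow that multiplying many small probabilities would cause.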

10
The Bayesian interpretation of weight decay
  • The derivation (sketched below) assumes a
    Gaussian prior for the weights and assumes that
    the model makes a Gaussian prediction, i.e.
    Gaussian noise on its output.
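A sketch of the standard derivation under those two Gaussian assumptions (the symbols sigma_D, sigma_W, y_c, d_c are my notation, not necessarily the slide's). Minimizing the negative log posterior over weights W given data D gives

    -\log p(W \mid D) = \frac{1}{2\sigma_D^2} \sum_c (y_c - d_c)^2
                      + \frac{1}{2\sigma_W^2} \sum_i w_i^2 + \text{const},

i.e. the squared error plus a squared-weight penalty; this is exactly weight decay, with coefficient \sigma_D^2 / \sigma_W^2.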
11
Maximum Likelihood Learning
  • Minimizing the squared residuals is equivalent to
    maximizing the log probability of the correct
    answers under a Gaussian centered at the model's
    guess. This is Maximum Likelihood.

(y = the model's prediction, d = the correct answer)
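In symbols (a standard identity; sigma is the assumed standard deviation of the Gaussian), for one training case

    -\log p(d \mid y) = \frac{(d - y)^2}{2\sigma^2} + \log\left(\sigma\sqrt{2\pi}\right),

so after summing over cases and dropping the terms that do not depend on the weights, maximizing the log probability of the correct answers is the same as minimizing the sum of squared residuals.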
12
Maximum A Posteriori Learning
  • This trades off the prior probabilities of the
    parameters against the probability of the data
    given the parameters. It looks for the parameters
    that have the greatest product of the prior term
    and the likelihood term.
  • Minimizing the squared weights is equivalent to
    maximizing the log probability of the weights
    under a zero-mean Gaussian prior.

[Figure: a zero-mean Gaussian prior p(w) over a single weight w.]
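A minimal sketch of MAP learning in practice, assuming a linear model trained by gradient descent (the synthetic data, learning rate, and weight-decay coefficient lam are illustrative choices, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 5))          # 20 training cases, 5 inputs
    d = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=20)

    w = np.zeros(5)
    lam = 0.1      # plays the role of sigma_D^2 / sigma_W^2
    lr = 0.01
    for _ in range(2000):
        y = X @ w                         # model's predictions
        grad = X.T @ (y - d) + lam * w    # gradient of 0.5*sum((y-d)^2) + 0.5*lam*sum(w^2)
        w -= lr * grad                    # one step toward the MAP weights
    print(w)

The lam * w term is the weight-decay part of the update; it comes from the zero-mean Gaussian prior on the weights.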
13
Full Bayesian Learning
  • Instead of trying to find the best single setting
    of the parameters (as in ML or MAP), compute the
    full posterior distribution over parameter
    settings.
  • This is extremely computationally intensive for
    all but the simplest models (it's feasible for a
    biased coin; see the sketch after this list).
  • To make predictions, let each different setting
    of the parameters make its own prediction and
    then combine all these predictions by weighting
    each of them by the posterior probability of that
    setting of the parameters.
  • This is also computationally intensive.
  • The full Bayesian approach allows us to use
    complicated models even when we do not have much
    data.
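A minimal sketch of full Bayesian prediction for the biased coin, reusing the grid posterior from earlier (the quantity being averaged is each parameter setting's own prediction, i.e. its probability of heads on the next toss; the grid and variable names are illustrative):

    import numpy as np

    p = np.linspace(0.001, 0.999, 999)
    posterior = p**53 * (1 - p)**47          # uniform prior times likelihood
    posterior = posterior / posterior.sum()

    # each setting of p predicts heads with probability p;
    # weight every prediction by that setting's posterior probability
    p_next_head = np.sum(posterior * p)
    print(p_next_head)       # ~0.529 (54/102 for the exact Beta(54, 48) posterior)

With only the single head from the earlier slide, the same computation gives 2/3 rather than the maximum-likelihood answer of 1, which is the sense in which averaging over the posterior behaves more sensibly with very little data.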

14
Overfitting: a frequentist illusion?
  • If you do not have much data, you should use a
    simple model, because a complex one will overfit.
  • This is true. But only if you assume that fitting
    a model means choosing a single best setting of
    the parameters.
  • If you use the full posterior over parameter
    settings, overfitting disappears!
  • With little data, you get very vague predictions
    because many different parameter settings have
    significant posterior probability.

15
A classic example of overfitting
  • Which model do you believe?
  • The complicated model fits the data better.
  • But it is not economical and it makes silly
    predictions.
  • But what if we start with a reasonable prior over
    all fifth-order polynomials and use the full
    posterior distribution (sketched below)?
  • Now we get vague and sensible predictions.
  • There is no reason why the amount of data should
    influence our prior beliefs about the complexity
    of the model.
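A minimal sketch of that idea, assuming Bayesian linear regression on fifth-order polynomial features with a zero-mean Gaussian prior on the coefficients and Gaussian noise (the data, prior scale, and noise level are all illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, size=6)             # only six data points
    d = 0.5 * x + 0.1 * rng.normal(size=6)     # the data actually come from a line

    def features(x):
        # fifth-order polynomial features: 1, x, x^2, ..., x^5
        return np.vander(x, 6, increasing=True)

    Phi = features(x)
    sigma2 = 0.01                              # assumed noise variance
    S = np.linalg.inv(np.eye(6) + Phi.T @ Phi / sigma2)   # posterior covariance (prior N(0, I))
    m = S @ Phi.T @ d / sigma2                             # posterior mean

    # predictions average over all coefficient settings, weighted by the posterior
    x_test = np.linspace(-1.5, 1.5, 7)
    Phi_test = features(x_test)
    mean = Phi_test @ m
    var = np.einsum('ij,jk,ik->i', Phi_test, S, Phi_test) + sigma2
    for xt, mu, v in zip(x_test, mean, var):
        print(f"x={xt:+.2f}  prediction {mu:+.2f} +/- {np.sqrt(v):.2f}")

Near the data the predictive mean stays sensible, and away from it the predictive standard deviation grows: the vague but sensible predictions the slide describes.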