Title: CSC321: Lecture 8: The Bayesian way to fit models
1 CSC321 Lecture 8: The Bayesian way to fit models
2 The Bayesian framework
- The Bayesian framework assumes that we always have a prior distribution for everything.
- The prior may be very vague.
- When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution.
- The likelihood term takes into account how probable the observed data is given the parameters of the model.
- It favors parameter settings that make the data likely.
- It fights the prior.
- With enough data, the likelihood terms always win.
3 A coin tossing example
- Suppose you know nothing about coins except that each tossing event produces a head with some unknown probability p and a tail with probability 1-p.
- Suppose we observe 100 tosses and there are 53 heads. What is p?
- The frequentist answer: pick the value of p that makes the observation of 53 heads and 47 tails most probable (worked out below).
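For concreteness, this maximum-likelihood calculation can be written out as follows (a standard derivation added here, not part of the original slide):

```latex
% Likelihood of 53 heads and 47 tails in 100 independent tosses
L(p) = \binom{100}{53}\, p^{53} (1-p)^{47}
% Setting the derivative of the log likelihood to zero:
\frac{d}{dp} \log L(p) = \frac{53}{p} - \frac{47}{1-p} = 0
\quad\Longrightarrow\quad \hat{p} = \frac{53}{100} = 0.53
```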
4 Some problems with picking the parameters that are most likely to generate the data
- What if we only tossed the coin once and we got 1 head?
- Is p = 1 a sensible answer?
- Surely p = 0.5 is a much better answer.
- Is it reasonable to give a single answer?
- If we don't have much data, we are unsure about p.
- Our computations will work much better if we take this uncertainty into account.
5 Using a distribution over parameters
- Start with a prior distribution over p.
- Multiply the prior probability of each parameter value by the probability of observing a head given that value.
- Then renormalize to get the posterior distribution (a worked instance follows the figure).
[Figure: probability density over p on [0, 1]; the prior and the resulting posterior each have area 1.]
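As a worked instance of this update (assuming the uniform prior over [0, 1] shown in the figure and a single observed head):

```latex
\text{posterior}(p)
  = \frac{\text{prior}(p)\; P(\text{head} \mid p)}{\int_0^1 \text{prior}(p')\; P(\text{head} \mid p')\, dp'}
  = \frac{1 \cdot p}{\int_0^1 p'\, dp'}
  = 2p
```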
6 Let's do it again
- Start with a prior distribution over p.
- Multiply the prior probability of each parameter value by the probability of observing a tail given that value.
- Then renormalize to get the posterior distribution (a worked instance follows the figure).
[Figure: probability density over p on [0, 1]; the prior and the resulting posterior each have area 1.]
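Continuing the same worked example (treating the posterior from the previous slide as the new prior, as the sequence of figures suggests), multiplying by the probability 1 - p of the observed tail and renormalizing gives

```latex
\text{posterior}(p)
  = \frac{2p\,(1-p)}{\int_0^1 2p'(1-p')\, dp'}
  = 6\,p\,(1-p)
```

which peaks at p = 0.5.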
7 Let's do it another 98 times
- After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior); a numerical sketch follows the figure.
[Figure: posterior probability density over p on [0, 1] after 53 heads and 47 tails; area 1.]
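A numerical sketch of the full sequence of updates (hypothetical code, not from the lecture; a uniform prior on a grid over p is assumed):

```python
import numpy as np

# Grid of candidate values for p, with a uniform prior.
p = np.linspace(0.001, 0.999, 999)
posterior = np.ones_like(p)

# Multiply in the likelihood of each toss and renormalize after each one,
# exactly as the previous slides describe (53 heads, then 47 tails).
for outcome in [1] * 53 + [0] * 47:
    posterior *= p if outcome == 1 else (1.0 - p)
    posterior /= posterior.sum()          # renormalize so the values sum to 1

print("posterior peaks at p =", p[np.argmax(posterior)])   # ~0.53
```

The printed peak sits at p = 0.53, matching the slide's claim.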
8 Bayes' Theorem
- The joint probability of the data and the weights can be written as a conditional probability times a marginal in either order:
  $p(D)\,p(W \mid D) \;=\; p(D, W) \;=\; p(W)\,p(D \mid W)$
- Dividing by $p(D)$ gives Bayes' theorem:
  $p(W \mid D) \;=\; \dfrac{p(W)\,p(D \mid W)}{p(D)}$
  where $p(W \mid D)$ is the posterior probability of weight vector W, $p(W)$ is the prior probability of weight vector W, and $p(D \mid W)$ is the probability of the observed data given W.
9 Why we maximize sums of log probs
- We want to maximize the product of the probabilities of a set of independent events.
- Assume the output errors on different training cases are independent.
- Assume the priors on the weights are independent.
- Because the log function is monotonic, we can instead maximize the sum of the log probabilities (written out below).
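In symbols (with W the weight vector and d_c the correct answer for training case c, as on the following slides), the monotonicity of the log is what lets us replace the product by a sum:

```latex
\arg\max_W \; p(W)\prod_c p(d_c \mid W)
  \;=\; \arg\max_W \Big( \log p(W) + \sum_c \log p(d_c \mid W) \Big)
```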
10 The Bayesian interpretation of weight decay
- Assume a Gaussian prior for the weights.
- Assume that the model makes a Gaussian prediction. (The cost these assumptions imply is written out below.)
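Under those two Gaussian assumptions, the cost that weight decay corresponds to minimizing takes the standard form below (a reconstruction; σ_D and σ_W are names chosen here for the standard deviations of the prediction noise and the weight prior):

```latex
E \;=\; -\log p(D \mid W) \;-\; \log p(W) \;+\; \text{const}
  \;=\; \frac{1}{2\sigma_D^2} \sum_c \left(y_c - d_c\right)^2
     \;+\; \frac{1}{2\sigma_W^2} \sum_i w_i^2
```

Multiplying through by 2σ_D² shows that the weight-decay coefficient plays the role of the ratio σ_D²/σ_W².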
11 Maximum Likelihood Learning
- Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answers under a Gaussian centered at the model's guess. This is Maximum Likelihood.
  (y = the model's prediction, d = the correct answer)
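Written out (using the y and d defined above, and σ for the standard deviation of the assumed Gaussian, a symbol introduced here):

```latex
-\log p(d \mid y) \;=\; \frac{(d - y)^2}{2\sigma^2} \;+\; \tfrac{1}{2}\log\!\big(2\pi\sigma^2\big)
```

so, summed over training cases, minimizing the squared residuals and maximizing the log probability of the correct answers are the same thing.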
12 Maximum A Posteriori Learning
- This trades off the prior probabilities of the parameters against the probability of the data given the parameters. It looks for the parameters that have the greatest product of the prior term and the likelihood term.
- Minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior (written out after the figure).
[Figure: a zero-mean Gaussian prior density p(w) over a weight w.]
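The corresponding prior term, assuming the zero-mean Gaussian in the figure has standard deviation σ_W (a symbol introduced here):

```latex
-\log p(w) \;=\; \frac{w^2}{2\sigma_W^2} \;+\; \tfrac{1}{2}\log\!\big(2\pi\sigma_W^2\big)
```

so the squared-weight penalty is just the negative log of this prior, up to a constant.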
13 Full Bayesian Learning
- Instead of trying to find the best single setting of the parameters (as in ML or MAP), compute the full posterior distribution over parameter settings.
- This is extremely computationally intensive for all but the simplest models (it is feasible for a biased coin; see the sketch after this list).
- To make predictions, let each different setting of the parameters make its own prediction and then combine all these predictions by weighting each of them by the posterior probability of that setting of the parameters.
- This is also computationally intensive.
- The full Bayesian approach allows us to use complicated models even when we do not have much data.
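A minimal sketch of full Bayesian learning for the biased coin (hypothetical code, not from the lecture; a uniform prior on a grid over p is assumed):

```python
import numpy as np

# Grid of candidate settings for the coin's head probability p.
p_grid = np.linspace(0.001, 0.999, 999)

def posterior(num_heads, num_tails):
    """Posterior over p after the given counts, starting from a uniform prior."""
    post = p_grid**num_heads * (1.0 - p_grid)**num_tails
    return post / post.sum()

# Predict the next toss by letting every setting of p make its own prediction
# and weighting each prediction by its posterior probability.
post = posterior(num_heads=1, num_tails=0)      # we have seen a single head
p_next_head = np.sum(post * p_grid)
print("P(next toss is a head) =", p_next_head)  # ~0.667, not 1.0 as ML would give
```

With one observed head the prediction is about 2/3 rather than 1, which is the sensible behaviour the earlier coin example asked for.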
14 Overfitting: A frequentist illusion?
- If you do not have much data, you should use a simple model, because a complex one will overfit.
- This is true. But only if you assume that fitting a model means choosing a single best setting of the parameters.
- If you use the full posterior distribution over parameter settings, overfitting disappears!
- With little data, you get very vague predictions because many different parameter settings have significant posterior probability.
15 A classic example of overfitting
- Which model do you believe?
- The complicated model fits the data better.
- But it is not economical and it makes silly predictions.
- But what if we start with a reasonable prior over all fifth-order polynomials and use the full posterior distribution?
- Now we get vague and sensible predictions (see the sketch below).
- There is no reason why the amount of data should influence our prior beliefs about the complexity of the model.
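To make the last point concrete, here is a hypothetical sketch (not from the lecture) of a fifth-order polynomial fit with a Gaussian prior over its coefficients. Because the posterior over coefficients is Gaussian, the posterior-weighted average over all coefficient settings can be done in closed form, and with only a few data points the predictive standard deviations come out large, i.e. the predictions are vague rather than silly. The prior precision alpha, noise precision beta, and the training points are made-up illustration values.

```python
import numpy as np

# Bayesian linear regression with fifth-order polynomial features.
# Assumed model: coefficients ~ N(0, (1/alpha) I), observation noise ~ N(0, 1/beta).
alpha, beta, degree = 1.0, 25.0, 5

def features(x):
    # Columns are [1, x, x^2, ..., x^5].
    return np.vander(x, degree + 1, increasing=True)

# A handful of made-up training points.
x_train = np.array([0.1, 0.4, 0.7])
t_train = np.sin(2 * np.pi * x_train)

Phi = features(x_train)
S_inv = alpha * np.eye(degree + 1) + beta * Phi.T @ Phi   # posterior precision
S = np.linalg.inv(S_inv)                                  # posterior covariance
m = beta * S @ Phi.T @ t_train                            # posterior mean

# Predictive mean and variance at new inputs: an average over every setting of
# the coefficients, weighted by its posterior probability (closed form here).
x_test = np.linspace(0.0, 1.0, 5)
phi = features(x_test)
pred_mean = phi @ m
pred_var = 1.0 / beta + np.sum((phi @ S) * phi, axis=1)
for x, mu, var in zip(x_test, pred_mean, pred_var):
    print(f"x={x:.2f}  mean={mu:+.2f}  std={np.sqrt(var):.2f}")
```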