Title: CSC321: Lecture 8: The Bayesian way to fit models
1 CSC321 Lecture 8: The Bayesian way to fit models
2 The Bayesian framework
- The Bayesian framework assumes that we always have a prior distribution for everything.
- The prior may be very vague.
- When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution.
- The likelihood term takes into account how probable the observed data is given the parameters of the model.
- It favors parameter settings that make the data likely.
- It fights the prior.
- With enough data, the likelihood terms always win.
3 A coin tossing example
- Suppose you know nothing about coins except that each tossing event produces a head with some unknown probability p and a tail with probability 1-p.
- Suppose we observe 100 tosses and there are 53 heads. What is p?
- The frequentist answer: pick the value of p that makes the observation of 53 heads and 47 tails most probable (worked out below).
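For concreteness, this maximum-likelihood calculation can be written out as follows (a standard derivation added here, not part of the original slide):

```latex
% Likelihood of 53 heads and 47 tails in 100 independent tosses
L(p) = \binom{100}{53}\, p^{53} (1-p)^{47}
% Setting the derivative of the log likelihood to zero:
\frac{d}{dp} \log L(p) = \frac{53}{p} - \frac{47}{1-p} = 0
\quad\Longrightarrow\quad \hat{p} = \frac{53}{100} = 0.53
```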
4 Some problems with picking the parameters that are most likely to generate the data
- What if we only tossed the coin once and we got 1 head?
- Is p = 1 a sensible answer?
- Surely p = 0.5 is a much better answer.
- Is it reasonable to give a single answer?
- If we don't have much data, we are unsure about p.
- Our computations will work much better if we take this uncertainty into account.
5 Using a distribution over parameters
- Start with a prior distribution over p.
- Multiply the prior probability of each parameter value by the probability of observing a head given that value.
- Then renormalize to get the posterior distribution (a worked instance follows the figure).
[Figure: probability density over p on [0, 1]; the prior and the resulting posterior each have area 1.]
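As a worked instance of this update (assuming the uniform prior over [0, 1] shown in the figure and a single observed head):

```latex
\text{posterior}(p)
  = \frac{\text{prior}(p)\; P(\text{head} \mid p)}{\int_0^1 \text{prior}(p')\; P(\text{head} \mid p')\, dp'}
  = \frac{1 \cdot p}{\int_0^1 p'\, dp'}
  = 2p
```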
6 Let's do it again
- Start with a prior distribution over p.
- Multiply the prior probability of each parameter value by the probability of observing a tail given that value.
- Then renormalize to get the posterior distribution (a worked instance follows the figure).
[Figure: probability density over p on [0, 1]; the prior and the resulting posterior each have area 1.]
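Continuing the same worked example (treating the posterior from the previous slide as the new prior, as the sequence of figures suggests), multiplying by the probability 1 - p of the observed tail and renormalizing gives

```latex
\text{posterior}(p)
  = \frac{2p\,(1-p)}{\int_0^1 2p'(1-p')\, dp'}
  = 6\,p\,(1-p)
```

which peaks at p = 0.5.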
7 Let's do it another 98 times
- After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior); a numerical sketch follows the figure.
[Figure: posterior probability density over p on [0, 1] after 53 heads and 47 tails; area 1.]
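A numerical sketch of the full sequence of updates (hypothetical code, not from the lecture; a uniform prior on a grid over p is assumed):

```python
import numpy as np

# Grid of candidate values for p, with a uniform prior.
p = np.linspace(0.001, 0.999, 999)
posterior = np.ones_like(p)

# Multiply in the likelihood of each toss and renormalize after each one,
# exactly as the previous slides describe (53 heads, then 47 tails).
for outcome in [1] * 53 + [0] * 47:
    posterior *= p if outcome == 1 else (1.0 - p)
    posterior /= posterior.sum()          # renormalize so the values sum to 1

print("posterior peaks at p =", p[np.argmax(posterior)])   # ~0.53
```

The printed peak sits at p = 0.53, matching the slide's claim.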
8 Bayes' Theorem
- The joint probability of the data and the weights can be written as a conditional probability times a marginal in either order:
  $p(D)\,p(W \mid D) \;=\; p(D, W) \;=\; p(W)\,p(D \mid W)$
- Dividing by $p(D)$ gives Bayes' theorem:
  $p(W \mid D) \;=\; \dfrac{p(W)\,p(D \mid W)}{p(D)}$
  where $p(W \mid D)$ is the posterior probability of weight vector W, $p(W)$ is the prior probability of weight vector W, and $p(D \mid W)$ is the probability of the observed data given W.
9 Why we maximize sums of log probs
- We want to maximize the product of the probabilities of a set of independent events.
- Assume the output errors on different training cases are independent.
- Assume the priors on the weights are independent.
- Because the log function is monotonic, we can instead maximize the sum of the log probabilities (written out below).
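In symbols (with W the weight vector and d_c the correct answer for training case c, as on the following slides), the monotonicity of the log is what lets us replace the product by a sum:

```latex
\arg\max_W \; p(W)\prod_c p(d_c \mid W)
  \;=\; \arg\max_W \Big( \log p(W) + \sum_c \log p(d_c \mid W) \Big)
```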
10 The Bayesian interpretation of weight decay
- Assume a Gaussian prior for the weights.
- Assume that the model makes a Gaussian prediction. (The cost these assumptions imply is written out below.)
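Under those two Gaussian assumptions, the cost that weight decay corresponds to minimizing takes the standard form below (a reconstruction; σ_D and σ_W are names chosen here for the standard deviations of the prediction noise and the weight prior):

```latex
E \;=\; -\log p(D \mid W) \;-\; \log p(W) \;+\; \text{const}
  \;=\; \frac{1}{2\sigma_D^2} \sum_c \left(y_c - d_c\right)^2
     \;+\; \frac{1}{2\sigma_W^2} \sum_i w_i^2
```

Multiplying through by 2σ_D² shows that the weight-decay coefficient plays the role of the ratio σ_D²/σ_W².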
11 Maximum Likelihood Learning
- Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answers under a Gaussian centered at the model's guess. This is Maximum Likelihood.
  (y = the model's prediction, d = the correct answer)
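Written out (using the y and d defined above, and σ for the standard deviation of the assumed Gaussian, a symbol introduced here):

```latex
-\log p(d \mid y) \;=\; \frac{(d - y)^2}{2\sigma^2} \;+\; \tfrac{1}{2}\log\!\big(2\pi\sigma^2\big)
```

so, summed over training cases, minimizing the squared residuals and maximizing the log probability of the correct answers are the same thing.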
12 Maximum A Posteriori Learning
- This trades off the prior probabilities of the parameters against the probability of the data given the parameters. It looks for the parameters that have the greatest product of the prior term and the likelihood term.
- Minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior (written out after the figure).
[Figure: a zero-mean Gaussian prior density p(w) over a weight w.]
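The corresponding prior term, assuming the zero-mean Gaussian in the figure has standard deviation σ_W (a symbol introduced here):

```latex
-\log p(w) \;=\; \frac{w^2}{2\sigma_W^2} \;+\; \tfrac{1}{2}\log\!\big(2\pi\sigma_W^2\big)
```

so the squared-weight penalty is just the negative log of this prior, up to a constant.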
13 Full Bayesian Learning
- Instead of trying to find the best single setting of the parameters (as in ML or MAP), compute the full posterior distribution over parameter settings.
- This is extremely computationally intensive for all but the simplest models (it is feasible for a biased coin; see the sketch after this list).
- To make predictions, let each different setting of the parameters make its own prediction and then combine all these predictions by weighting each of them by the posterior probability of that setting of the parameters.
- This is also computationally intensive.
- The full Bayesian approach allows us to use complicated models even when we do not have much data.
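A minimal sketch of full Bayesian learning for the biased coin (hypothetical code, not from the lecture; a uniform prior on a grid over p is assumed):

```python
import numpy as np

# Grid of candidate settings for the coin's head probability p.
p_grid = np.linspace(0.001, 0.999, 999)

def posterior(num_heads, num_tails):
    """Posterior over p after the given counts, starting from a uniform prior."""
    post = p_grid**num_heads * (1.0 - p_grid)**num_tails
    return post / post.sum()

# Predict the next toss by letting every setting of p make its own prediction
# and weighting each prediction by its posterior probability.
post = posterior(num_heads=1, num_tails=0)      # we have seen a single head
p_next_head = np.sum(post * p_grid)
print("P(next toss is a head) =", p_next_head)  # ~0.667, not 1.0 as ML would give
```

With one observed head the prediction is about 2/3 rather than 1, which is the sensible behaviour the earlier coin example asked for.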
14 Overfitting: A frequentist illusion?
- If you do not have much data, you should use a simple model, because a complex one will overfit.
- This is true. But only if you assume that fitting a model means choosing a single best setting of the parameters.
- If you use the full posterior distribution over parameter settings, overfitting disappears!
- With little data, you get very vague predictions because many different parameter settings have significant posterior probability.
15 A classic example of overfitting
- Which model do you believe?
- The complicated model fits the data better.
- But it is not economical and it makes silly predictions.
- But what if we start with a reasonable prior over all fifth-order polynomials and use the full posterior distribution?
- Now we get vague and sensible predictions (see the sketch below).
- There is no reason why the amount of data should influence our prior beliefs about the complexity of the model.
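To make the last point concrete, here is a hypothetical sketch (not from the lecture) of a fifth-order polynomial fit with a Gaussian prior over its coefficients. Because the posterior over coefficients is Gaussian, the posterior-weighted average over all coefficient settings can be done in closed form, and with only a few data points the predictive standard deviations come out large, i.e. the predictions are vague rather than silly. The prior precision alpha, noise precision beta, and the training points are made-up illustration values.

```python
import numpy as np

# Bayesian linear regression with fifth-order polynomial features.
# Assumed model: coefficients ~ N(0, (1/alpha) I), observation noise ~ N(0, 1/beta).
alpha, beta, degree = 1.0, 25.0, 5

def features(x):
    # Columns are [1, x, x^2, ..., x^5].
    return np.vander(x, degree + 1, increasing=True)

# A handful of made-up training points.
x_train = np.array([0.1, 0.4, 0.7])
t_train = np.sin(2 * np.pi * x_train)

Phi = features(x_train)
S_inv = alpha * np.eye(degree + 1) + beta * Phi.T @ Phi   # posterior precision
S = np.linalg.inv(S_inv)                                  # posterior covariance
m = beta * S @ Phi.T @ t_train                            # posterior mean

# Predictive mean and variance at new inputs: an average over every setting of
# the coefficients, weighted by its posterior probability (closed form here).
x_test = np.linspace(0.0, 1.0, 5)
phi = features(x_test)
pred_mean = phi @ m
pred_var = 1.0 / beta + np.sum((phi @ S) * phi, axis=1)
for x, mu, var in zip(x_test, pred_mean, pred_var):
    print(f"x={x:.2f}  mean={mu:+.2f}  std={np.sqrt(var):.2f}")
```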