Title: Pattern Recognition and Machine Learning
1. Pattern Recognition and Machine Learning
Chapter 1: Introduction
2. Expectations
3. Expectations
Conditional Expectation (discrete)
Approximate Expectation (discrete and continuous)
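These two slides presumably show the standard definitions from PRML §1.2.2; for reference:
\[ \mathbb{E}[f] = \sum_x p(x)\,f(x), \qquad \mathbb{E}[f] = \int p(x)\,f(x)\,dx \]
\[ \mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\,f(x), \qquad \mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n) \]
The last expression approximates the expectation from N samples drawn from p(x) and applies in both the discrete and continuous cases.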
4. Variances and Covariances
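For reference, the corresponding definitions from PRML §1.2.2 (presumably what this slide shows):
\[ \operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2 \]
\[ \operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y] \]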
5. Intrinsic Error
6. Curve Fitting Re-visited
7. Maximum Likelihood
Determine $\mathbf{w}_{\mathrm{ML}}$ by minimizing the sum-of-squares error, $E(\mathbf{w})$.
8. Minimizing the loss function for regression
- Using the squared error as the loss function
- We want to choose y(x) to minimize the expected loss (written out below)
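With the squared loss L(t, y(x)) = {y(x) − t}^2, the expected loss referred to above is (PRML §1.5.5):
\[ \mathbb{E}[L] = \iint \{y(x) - t\}^2\, p(x, t)\,dx\,dt \]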
10. The Squared Loss Function
11.
- The first term is minimized when we select $y(x) = \mathbb{E}_t[t \mid x]$, the conditional mean of t given x (the decomposition is written out below)
- The second term is independent of y(x) and represents the intrinsic variability of the target; it is called the intrinsic error
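The decomposition these bullets refer to is (PRML §1.5.5):
\[ \mathbb{E}[L] = \int \{y(x) - \mathbb{E}[t \mid x]\}^2\, p(x)\,dx + \iint \{\mathbb{E}[t \mid x] - t\}^2\, p(x, t)\,dx\,dt \]
The first term vanishes when y(x) = E[t | x]; the second term is the intrinsic error.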
13. Inverse Problems
14. Bias, Variance
15. The Bias-Variance Decomposition (1)
- Recall the expected squared loss,
  \[ \mathbb{E}[L] = \int \{y(x) - h(x)\}^2\, p(x)\,dx + \iint \{h(x) - t\}^2\, p(x, t)\,dx\,dt \]
  where
  \[ h(x) = \mathbb{E}[t \mid x] = \int t\, p(t \mid x)\,dt \]
- The second term corresponds to the noise inherent in the random variable t.
- What about the first term?
16. The Bias-Variance Decomposition (2)
- Suppose we were given multiple data sets, each of size N. Any particular data set, D, will give a particular function y(x; D). We then have
  \[ \{y(x; \mathcal{D}) - h(x)\}^2 = \{\mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})] - h(x)\}^2 + \{y(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})]\}^2 + 2\{y(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})]\}\{\mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})] - h(x)\} \]
17. The Bias-Variance Decomposition (3)
- Taking the expectation over D yields
  \[ \mathbb{E}_{\mathcal{D}}\big[\{y(x; \mathcal{D}) - h(x)\}^2\big] = \underbrace{\{\mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})] - h(x)\}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\big[\{y(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})]\}^2\big]}_{\text{variance}} \]
18. The Bias-Variance Decomposition (4)
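This slide presumably presents the full decomposition from PRML §3.2, expected loss = (bias)^2 + variance + noise, where
\[ (\text{bias})^2 = \int \{\mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})] - h(x)\}^2\, p(x)\,dx \]
\[ \text{variance} = \int \mathbb{E}_{\mathcal{D}}\big[\{y(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})]\}^2\big]\, p(x)\,dx \]
\[ \text{noise} = \iint \{h(x) - t\}^2\, p(x, t)\,dx\,dt \]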
19.
- Bias measures how much the prediction (averaged over all data sets) differs from the desired regression function.
- Variance measures how much the predictions for individual data sets vary around their average.
- There is a trade-off between bias and variance: as we increase model complexity, bias decreases (a better fit to data) and variance increases (the fit varies more with the data).
20. [Figure: bias and variance, illustrated with the target function f, the individual fits g_i, and their average g.]
21. Reminder: Introduction to Overfitting (PRML 1.1)
- Concepts: polynomial curve fitting, overfitting, regularization, training set size vs. model complexity
22. Polynomial Curve Fitting
23. Sum-of-Squares Error Function
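For the polynomial model y(x, w) = Σ_{j=0}^{M} w_j x^j, the sum-of-squares error is (PRML §1.1):
\[ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 \]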
24. 0th Order Polynomial
25. 1st Order Polynomial
26. 3rd Order Polynomial
27. 9th Order Polynomial
28. Over-fitting
Root-Mean-Square (RMS) Error
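The RMS error used here is (PRML §1.1):
\[ E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^{\star}) / N} \]
where w* is the fitted coefficient vector; dividing by N allows data sets of different sizes to be compared on the same scale.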
29. Polynomial Coefficients
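The following is a small sketch of my own (not from the slides) that reproduces the flavour of this experiment: N = 10 noisy samples of sin(2πx), fitted with polynomials of order M = 0, 1, 3, 9 using numpy; the seed, noise level, and test grid are arbitrary choices.

    # Sketch of the PRML polynomial curve-fitting experiment (assumptions:
    # N = 10 training points, noise std 0.3, numpy's least-squares polyfit).
    import numpy as np

    rng = np.random.default_rng(0)
    N = 10
    x_train = np.linspace(0.0, 1.0, N)
    t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=N)
    x_test = np.linspace(0.0, 1.0, 100)
    t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

    def rms_error(w, x, t):
        """Root-mean-square error, sqrt(2 E(w) / N)."""
        return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

    for M in (0, 1, 3, 9):
        w = np.polyfit(x_train, t_train, M)   # least-squares fit of order M
        print(f"M={M}: train RMS={rms_error(w, x_train, t_train):.3f}, "
              f"test RMS={rms_error(w, x_test, t_test):.3f}, "
              f"max |w|={np.abs(w).max():.1f}")
    # For M = 9 the test RMS and the coefficient magnitudes typically blow up,
    # which is the over-fitting behaviour summarized in the coefficient table.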
30. Regularization
- Penalize large coefficient values (the penalized error function is written out below)
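The penalized (regularized) error function referred to above is (PRML §1.1):
\[ \widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2 \]
where the regularization coefficient λ controls the relative importance of the penalty term.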
31. Regularization
32. Regularization
33. Regularization: $E_{\mathrm{RMS}}$ vs. $\ln \lambda$
34. Polynomial Coefficients
35. Back to Bias/Variance
36. The Bias-Variance Decomposition (5)
- Example: 100 data sets, each with 25 data points from the sinusoid h(x) = sin(2πx), varying the degree of regularization, λ.
37. The Bias-Variance Decomposition (6)
- Example: 100 data sets, each with 25 data points from the sinusoid h(x) = sin(2πx), varying the degree of regularization, λ.
38. The Bias-Variance Decomposition (7)
- Example: 100 data sets, each with 25 data points from the sinusoid h(x) = sin(2πx), varying the degree of regularization, λ.
39. The Bias-Variance Trade-off
- From these plots, we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance.
- The minimum value of (bias)^2 + variance occurs around ln λ = −0.31, which is close to the value that gives the minimum error on the test data. (A small simulation sketch follows below.)
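Below is a rough simulation sketch of my own (not from the slides) of this experiment. It uses ridge-regularized polynomial features rather than the Gaussian basis functions of the PRML figures, and the seed and noise level are arbitrary, but it shows the same qualitative trade-off.

    # Bias-variance experiment: L = 100 data sets of N = 25 points drawn from
    # h(x) = sin(2*pi*x) plus noise, each fitted by L2-regularized least squares.
    import numpy as np

    rng = np.random.default_rng(1)
    L, N, degree = 100, 25, 9
    x_grid = np.linspace(0.0, 1.0, 200)
    h = np.sin(2 * np.pi * x_grid)

    def design(x):
        # Polynomial features 1, x, ..., x^degree (a stand-in for the
        # Gaussian basis functions used in the PRML figures).
        return np.vander(x, degree + 1, increasing=True)

    for ln_lam in (2.0, -0.31, -3.0):
        lam = np.exp(ln_lam)
        preds = np.empty((L, x_grid.size))
        for i in range(L):
            x = rng.uniform(0.0, 1.0, N)
            t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
            Phi = design(x)
            # Regularized least squares: w = (lam I + Phi^T Phi)^-1 Phi^T t
            w = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)
            preds[i] = design(x_grid) @ w
        avg = preds.mean(axis=0)
        bias2 = np.mean((avg - h) ** 2)
        variance = np.mean(preds.var(axis=0))
        print(f"ln lambda = {ln_lam:+.2f}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
    # Large lambda: high bias, low variance. Small lambda: low bias, high variance.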
40. [Figure: bias and variance, illustrated with the target function f, the individual fits g_i, and their average g.]
41. Model Selection Procedures
- Regularization (Breiman 1998): penalize the augmented error = error on data + λ · (model complexity)
  - If λ is too large, we risk introducing bias
  - Use cross-validation to optimize for λ
- Structural Risk Minimization (Vapnik 1995)
  - Use a set of models ordered in terms of their complexities: number of free parameters, VC dimension
  - Find the best model w.r.t. empirical error and model complexity
- Minimum Description Length Principle
- Bayesian Model Selection: if we have some prior knowledge about the approximating function, it can be incorporated into the Bayesian approach in the form of p(model)
42. Bayesian Model Selection
- Prior on models, p(model)
- When the prior favors simpler models, the Bayesian, regularization, SRM, and MDL approaches are equivalent (see Bayes' rule below)
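For reference, the model posterior referred to here follows from Bayes' rule:
\[ p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})} \]
so a prior p(model) that favors simpler models plays the same role as a complexity penalty.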
43. Model Selection Procedures
- Cross-validation: measure the total error, rather than bias/variance, on a validation set (a code sketch follows below)
  - Train/validation sets
  - K-fold cross-validation
  - Leave-one-out
- No prior assumption about the models
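A minimal sketch (assumptions mine: ridge-regularized 9th-order polynomial, 5 folds, arbitrary seed and noise level) of using K-fold cross-validation to choose λ:

    # K-fold cross-validation for the regularization parameter lambda.
    import numpy as np

    def cv_error(x, t, lam, degree=9, K=5):
        """Average validation RMS error over K folds for a given lambda."""
        idx = np.arange(len(x))
        folds = np.array_split(idx, K)
        errs = []
        for k in range(K):
            val = folds[k]
            train = np.setdiff1d(idx, val)
            Phi = np.vander(x[train], degree + 1, increasing=True)
            w = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi,
                                Phi.T @ t[train])
            Phi_val = np.vander(x[val], degree + 1, increasing=True)
            errs.append(np.sqrt(np.mean((Phi_val @ w - t[val]) ** 2)))
        return np.mean(errs)

    rng = np.random.default_rng(2)
    x = rng.uniform(0.0, 1.0, 25)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=25)
    candidates = np.exp(np.linspace(-8.0, 2.0, 21))   # grid of lambda values
    best = min(candidates, key=lambda lam: cv_error(x, t, lam))
    print("lambda chosen by 5-fold CV:", best)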
44. Polynomial Regression
Best fit: minimum error
45. Best fit: "elbow"
46.
- Averaging multiple solutions which themselves have low bias
  - Lower variance due to averaging
  - Ensembles, mixtures of experts, committee machines
  - Bayesian approach: weighted average
- Increasing the number of data points N
  - Constrains the solutions, reducing variance
47. Data Set Size
9th Order Polynomial
48. Data Set Size
9th Order Polynomial
49. End of Lecture (Skip the Rest)
50. Bayesian Linear Regression (1)
- Define a conjugate prior over w:
  \[ p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) \]
- Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior
  \[ p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \]
  where
  \[ \mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi} \]
51. Bayesian Linear Regression (2)
- For simplicity, let's assume that the prior is a zero-mean isotropic Gaussian,
  \[ p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) \]
- for which
  \[ \mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi} \]
- Next we consider an example
52. Bayesian Linear Regression (3)
[Figure: 0 data points observed; prior and data space.]
53. Bayesian Linear Regression (4)
[Figure: 1 data point observed; likelihood, posterior, and data space.]
54. Bayesian Linear Regression (5)
[Figure: 2 data points observed; likelihood, posterior, and data space.]
55. Bayesian Linear Regression (6)
[Figure: 20 data points observed; likelihood, posterior, and data space.]
56. Predictive Distribution (1)
- Predict t for new values of x by integrating over w:
  \[ p(t \mid x, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\,d\mathbf{w} = \mathcal{N}\big(t \mid \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(x),\, \sigma_N^2(x)\big) \]
  where
  \[ \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S}_N \boldsymbol{\phi}(x) \]
57. Predictive Distribution (2)
- Example: sinusoidal data, 9 Gaussian basis functions, 1 data point
58. Predictive Distribution (3)
- Example: sinusoidal data, 9 Gaussian basis functions, 2 data points
59. Predictive Distribution (4)
- Example: sinusoidal data, 9 Gaussian basis functions, 4 data points
60. Predictive Distribution (5)
- Example: sinusoidal data, 9 Gaussian basis functions, 25 data points
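To make the preceding example concrete, here is a small sketch of my own (the values of α, β, the basis-function width, and the noise level are assumptions in the spirit of PRML §3.3, not taken from the slides) that computes the predictive mean and variance as more sinusoidal data points are observed:

    # Bayesian linear regression with 9 Gaussian basis functions on
    # sinusoidal data; prints the predictive distribution at x* = 0.5.
    import numpy as np

    alpha, beta = 2.0, 25.0                 # prior precision, noise precision
    centres = np.linspace(0.0, 1.0, 9)      # centres of the 9 Gaussian bases
    s = 0.1                                 # basis-function width

    def phi(x):
        """Design matrix of Gaussian basis functions, shape (len(x), 9)."""
        x = np.atleast_1d(x)
        return np.exp(-0.5 * ((x[:, None] - centres[None, :]) / s) ** 2)

    rng = np.random.default_rng(3)
    for n in (1, 2, 4, 25):                 # number of observed data points
        x = rng.uniform(0.0, 1.0, n)
        t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
        Phi = phi(x)
        S_N_inv = alpha * np.eye(9) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)
        # Predictive mean and variance at a new input x*.
        phi_star = phi(0.5)[0]
        mean = phi_star @ m_N
        var = 1.0 / beta + phi_star @ np.linalg.solve(S_N_inv, phi_star)
        print(f"N={n:2d}: predictive mean = {mean:+.3f}, std = {np.sqrt(var):.3f}")
    # The predictive uncertainty generally shrinks as more data points are observed.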