Title: Pattern Recognition and Machine Learning
1. Pattern Recognition and Machine Learning
Chapter 1: Introduction
2. Expectations
3. Expectations
Conditional Expectation (discrete)
Approximate Expectation (discrete and continuous)
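These two slides presumably show the standard definitions from PRML §1.2.2; for reference:
\[ \mathbb{E}[f] = \sum_x p(x)\,f(x), \qquad \mathbb{E}[f] = \int p(x)\,f(x)\,dx \]
\[ \mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\,f(x), \qquad \mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n) \]
The last expression approximates the expectation from N samples drawn from p(x) and applies in both the discrete and continuous cases.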
4. Variances and Covariances
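For reference, the corresponding definitions from PRML §1.2.2 (presumably what this slide shows):
\[ \operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2 \]
\[ \operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y] \]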
5. Intrinsic Error
6. Curve Fitting Re-visited
7. Maximum Likelihood
Determine $\mathbf{w}_{\mathrm{ML}}$ by minimizing the sum-of-squares error, $E(\mathbf{w})$.
8. Minimizing the loss function for regression
- Using the squared error as the loss function
- We want to choose y(x) to minimize the expected loss (written out below)
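With the squared loss L(t, y(x)) = {y(x) − t}^2, the expected loss referred to above is (PRML §1.5.5):
\[ \mathbb{E}[L] = \iint \{y(x) - t\}^2\, p(x, t)\,dx\,dt \]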
10. The Squared Loss Function
11.
- The first term is minimized when we select $y(x) = \mathbb{E}_t[t \mid x]$, the conditional mean of t given x (the decomposition is written out below)
- The second term is independent of y(x) and represents the intrinsic variability of the target; it is called the intrinsic error
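The decomposition these bullets refer to is (PRML §1.5.5):
\[ \mathbb{E}[L] = \int \{y(x) - \mathbb{E}[t \mid x]\}^2\, p(x)\,dx + \iint \{\mathbb{E}[t \mid x] - t\}^2\, p(x, t)\,dx\,dt \]
The first term vanishes when y(x) = E[t | x]; the second term is the intrinsic error.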
13. Inverse Problems
14. Bias, Variance
15. The Bias-Variance Decomposition (1)
- Recall the expected squared loss,
  \[ \mathbb{E}[L] = \int \{y(x) - h(x)\}^2\, p(x)\,dx + \iint \{h(x) - t\}^2\, p(x, t)\,dx\,dt \]
  where
  \[ h(x) = \mathbb{E}[t \mid x] = \int t\, p(t \mid x)\,dt \]
- The second term corresponds to the noise inherent in the random variable t.
- What about the first term?
16. The Bias-Variance Decomposition (2)
- Suppose we were given multiple data sets, each of size N. Any particular data set, D, will give a particular function y(x; D). We then have
  \[ \{y(x; \mathcal{D}) - h(x)\}^2 = \{\mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})] - h(x)\}^2 + \{y(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})]\}^2 + 2\{y(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})]\}\{\mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})] - h(x)\} \]
17. The Bias-Variance Decomposition (3)
- Taking the expectation over D yields
  \[ \mathbb{E}_{\mathcal{D}}\big[\{y(x; \mathcal{D}) - h(x)\}^2\big] = \underbrace{\{\mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})] - h(x)\}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\big[\{y(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})]\}^2\big]}_{\text{variance}} \]
18. The Bias-Variance Decomposition (4)
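This slide presumably presents the full decomposition from PRML §3.2, expected loss = (bias)^2 + variance + noise, where
\[ (\text{bias})^2 = \int \{\mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})] - h(x)\}^2\, p(x)\,dx \]
\[ \text{variance} = \int \mathbb{E}_{\mathcal{D}}\big[\{y(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x; \mathcal{D})]\}^2\big]\, p(x)\,dx \]
\[ \text{noise} = \iint \{h(x) - t\}^2\, p(x, t)\,dx\,dt \]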
19.
- Bias measures how much the prediction (averaged over all data sets) differs from the desired regression function.
- Variance measures how much the predictions for individual data sets vary around their average.
- There is a trade-off between bias and variance: as we increase model complexity, bias decreases (a better fit to data) and variance increases (the fit varies more with the data).
20. [Figure: bias and variance, illustrated with the target function f, the individual fits g_i, and their average g.]
21. Reminder: Introduction to Overfitting (PRML 1.1)
- Concepts: polynomial curve fitting, overfitting, regularization, training set size vs. model complexity
22. Polynomial Curve Fitting
23. Sum-of-Squares Error Function
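For the polynomial model y(x, w) = Σ_{j=0}^{M} w_j x^j, the sum-of-squares error is (PRML §1.1):
\[ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 \]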
24. 0th Order Polynomial
25. 1st Order Polynomial
26. 3rd Order Polynomial
27. 9th Order Polynomial
28. Over-fitting
Root-Mean-Square (RMS) Error
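The RMS error used here is (PRML §1.1):
\[ E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^{\star}) / N} \]
where w* is the fitted coefficient vector; dividing by N allows data sets of different sizes to be compared on the same scale.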
29. Polynomial Coefficients
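The following is a small sketch of my own (not from the slides) that reproduces the flavour of this experiment: N = 10 noisy samples of sin(2πx), fitted with polynomials of order M = 0, 1, 3, 9 using numpy; the seed, noise level, and test grid are arbitrary choices.

    # Sketch of the PRML polynomial curve-fitting experiment (assumptions:
    # N = 10 training points, noise std 0.3, numpy's least-squares polyfit).
    import numpy as np

    rng = np.random.default_rng(0)
    N = 10
    x_train = np.linspace(0.0, 1.0, N)
    t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=N)
    x_test = np.linspace(0.0, 1.0, 100)
    t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

    def rms_error(w, x, t):
        """Root-mean-square error, sqrt(2 E(w) / N)."""
        return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

    for M in (0, 1, 3, 9):
        w = np.polyfit(x_train, t_train, M)   # least-squares fit of order M
        print(f"M={M}: train RMS={rms_error(w, x_train, t_train):.3f}, "
              f"test RMS={rms_error(w, x_test, t_test):.3f}, "
              f"max |w|={np.abs(w).max():.1f}")
    # For M = 9 the test RMS and the coefficient magnitudes typically blow up,
    # which is the over-fitting behaviour summarized in the coefficient table.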
30. Regularization
- Penalize large coefficient values (the penalized error function is written out below)
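The penalized (regularized) error function referred to above is (PRML §1.1):
\[ \widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2 \]
where the regularization coefficient λ controls the relative importance of the penalty term.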
31. Regularization
32. Regularization
33. Regularization: $E_{\mathrm{RMS}}$ vs. $\ln \lambda$
34. Polynomial Coefficients
35. Back to Bias/Variance
36. The Bias-Variance Decomposition (5)
- Example: 100 data sets, each with 25 data points from the sinusoid h(x) = sin(2πx), varying the degree of regularization, λ.
37. The Bias-Variance Decomposition (6)
- Example: 100 data sets, each with 25 data points from the sinusoid h(x) = sin(2πx), varying the degree of regularization, λ.
38. The Bias-Variance Decomposition (7)
- Example: 100 data sets, each with 25 data points from the sinusoid h(x) = sin(2πx), varying the degree of regularization, λ.
39. The Bias-Variance Trade-off
- From these plots, we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance.
- The minimum value of (bias)^2 + variance occurs around ln λ = −0.31, which is close to the value that gives the minimum error on the test data. (A small simulation sketch follows below.)
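Below is a rough simulation sketch of my own (not from the slides) of this experiment. It uses ridge-regularized polynomial features rather than the Gaussian basis functions of the PRML figures, and the seed and noise level are arbitrary, but it shows the same qualitative trade-off.

    # Bias-variance experiment: L = 100 data sets of N = 25 points drawn from
    # h(x) = sin(2*pi*x) plus noise, each fitted by L2-regularized least squares.
    import numpy as np

    rng = np.random.default_rng(1)
    L, N, degree = 100, 25, 9
    x_grid = np.linspace(0.0, 1.0, 200)
    h = np.sin(2 * np.pi * x_grid)

    def design(x):
        # Polynomial features 1, x, ..., x^degree (a stand-in for the
        # Gaussian basis functions used in the PRML figures).
        return np.vander(x, degree + 1, increasing=True)

    for ln_lam in (2.0, -0.31, -3.0):
        lam = np.exp(ln_lam)
        preds = np.empty((L, x_grid.size))
        for i in range(L):
            x = rng.uniform(0.0, 1.0, N)
            t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
            Phi = design(x)
            # Regularized least squares: w = (lam I + Phi^T Phi)^-1 Phi^T t
            w = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)
            preds[i] = design(x_grid) @ w
        avg = preds.mean(axis=0)
        bias2 = np.mean((avg - h) ** 2)
        variance = np.mean(preds.var(axis=0))
        print(f"ln lambda = {ln_lam:+.2f}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
    # Large lambda: high bias, low variance. Small lambda: low bias, high variance.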
40. [Figure: bias and variance, illustrated with the target function f, the individual fits g_i, and their average g.]
41. Model Selection Procedures
- Regularization (Breiman 1998): penalize the augmented error = error on data + λ · (model complexity)
  - If λ is too large, we risk introducing bias
  - Use cross-validation to optimize for λ
- Structural Risk Minimization (Vapnik 1995)
  - Use a set of models ordered in terms of their complexities: number of free parameters, VC dimension
  - Find the best model w.r.t. empirical error and model complexity
- Minimum Description Length Principle
- Bayesian Model Selection: if we have some prior knowledge about the approximating function, it can be incorporated into the Bayesian approach in the form of p(model)
42. Bayesian Model Selection
- Prior on models, p(model)
- When the prior favors simpler models, the Bayesian, regularization, SRM, and MDL approaches are equivalent (see Bayes' rule below)
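For reference, the model posterior referred to here follows from Bayes' rule:
\[ p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})} \]
so a prior p(model) that favors simpler models plays the same role as a complexity penalty.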
43. Model Selection Procedures
- Cross-validation: measure the total error, rather than bias/variance, on a validation set (a code sketch follows below)
  - Train/validation sets
  - K-fold cross-validation
  - Leave-one-out
- No prior assumption about the models
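A minimal sketch (assumptions mine: ridge-regularized 9th-order polynomial, 5 folds, arbitrary seed and noise level) of using K-fold cross-validation to choose λ:

    # K-fold cross-validation for the regularization parameter lambda.
    import numpy as np

    def cv_error(x, t, lam, degree=9, K=5):
        """Average validation RMS error over K folds for a given lambda."""
        idx = np.arange(len(x))
        folds = np.array_split(idx, K)
        errs = []
        for k in range(K):
            val = folds[k]
            train = np.setdiff1d(idx, val)
            Phi = np.vander(x[train], degree + 1, increasing=True)
            w = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi,
                                Phi.T @ t[train])
            Phi_val = np.vander(x[val], degree + 1, increasing=True)
            errs.append(np.sqrt(np.mean((Phi_val @ w - t[val]) ** 2)))
        return np.mean(errs)

    rng = np.random.default_rng(2)
    x = rng.uniform(0.0, 1.0, 25)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=25)
    candidates = np.exp(np.linspace(-8.0, 2.0, 21))   # grid of lambda values
    best = min(candidates, key=lambda lam: cv_error(x, t, lam))
    print("lambda chosen by 5-fold CV:", best)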
44. Polynomial Regression
Best fit: minimum error
45. Best fit: "elbow"
46.
- Averaging multiple solutions which themselves have low bias
  - Lower variance due to averaging
  - Ensembles, mixtures of experts, committee machines
  - Bayesian approach: weighted average
- Increasing the number of data points N
  - Constrains the solutions, reducing variance
47. Data Set Size
9th Order Polynomial
48. Data Set Size
9th Order Polynomial
49. End of Lecture (Skip the Rest)
50. Bayesian Linear Regression (1)
- Define a conjugate prior over w:
  \[ p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) \]
- Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior
  \[ p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \]
  where
  \[ \mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi} \]
51. Bayesian Linear Regression (2)
- For simplicity, let's assume that the prior is a zero-mean isotropic Gaussian,
  \[ p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) \]
- for which
  \[ \mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi} \]
- Next we consider an example
52. Bayesian Linear Regression (3)
[Figure: 0 data points observed; prior and data space.]
53. Bayesian Linear Regression (4)
[Figure: 1 data point observed; likelihood, posterior, and data space.]
54. Bayesian Linear Regression (5)
[Figure: 2 data points observed; likelihood, posterior, and data space.]
55. Bayesian Linear Regression (6)
[Figure: 20 data points observed; likelihood, posterior, and data space.]
56. Predictive Distribution (1)
- Predict t for new values of x by integrating over w:
  \[ p(t \mid x, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\,d\mathbf{w} = \mathcal{N}\big(t \mid \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(x),\, \sigma_N^2(x)\big) \]
  where
  \[ \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S}_N \boldsymbol{\phi}(x) \]
57. Predictive Distribution (2)
- Example: sinusoidal data, 9 Gaussian basis functions, 1 data point
58. Predictive Distribution (3)
- Example: sinusoidal data, 9 Gaussian basis functions, 2 data points
59. Predictive Distribution (4)
- Example: sinusoidal data, 9 Gaussian basis functions, 4 data points
60. Predictive Distribution (5)
- Example: sinusoidal data, 9 Gaussian basis functions, 25 data points
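To make the preceding example concrete, here is a small sketch of my own (the values of α, β, the basis-function width, and the noise level are assumptions in the spirit of PRML §3.3, not taken from the slides) that computes the predictive mean and variance as more sinusoidal data points are observed:

    # Bayesian linear regression with 9 Gaussian basis functions on
    # sinusoidal data; prints the predictive distribution at x* = 0.5.
    import numpy as np

    alpha, beta = 2.0, 25.0                 # prior precision, noise precision
    centres = np.linspace(0.0, 1.0, 9)      # centres of the 9 Gaussian bases
    s = 0.1                                 # basis-function width

    def phi(x):
        """Design matrix of Gaussian basis functions, shape (len(x), 9)."""
        x = np.atleast_1d(x)
        return np.exp(-0.5 * ((x[:, None] - centres[None, :]) / s) ** 2)

    rng = np.random.default_rng(3)
    for n in (1, 2, 4, 25):                 # number of observed data points
        x = rng.uniform(0.0, 1.0, n)
        t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
        Phi = phi(x)
        S_N_inv = alpha * np.eye(9) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)
        # Predictive mean and variance at a new input x*.
        phi_star = phi(0.5)[0]
        mean = phi_star @ m_N
        var = 1.0 / beta + phi_star @ np.linalg.solve(S_N_inv, phi_star)
        print(f"N={n:2d}: predictive mean = {mean:+.3f}, std = {np.sqrt(var):.3f}")
    # The predictive uncertainty generally shrinks as more data points are observed.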