Slide 1: Lecture 5/6 - Using Optimization Theory for Parameter Estimation
- Dr Martin Brown
- Room E1k
- Email: martin.brown@manchester.ac.uk
- Telephone: 0161 306 4672
- http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/
Slide 2: Lecture 5/6 Outline
- Parameter estimation using optimization
- Review of parameter estimation
- MSE (mean squared error) performance function
- Gradient descent parameter estimation
- Convergence/stability analysis
- Non-linear models, Newton's method and Quadratic Programming
- The goal of this lecture is to show how gradient descent algorithms can be used to learn/estimate a model's parameters, and to consider their limitations and extensions.
Slide 3: Lecture 5/6 Resources
- These slides are largely self-contained, but extra background material can be found in:
- Chapter 2, Machine Learning (MIT OpenCourseWare), T. Jaakkola, http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-867Machine-LearningFall2002/CourseHome/index.htm
- An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, J. S. Shewchuk, 1994, Technical Report, http://www.cs.cmu.edu/quake-papers/painless-conjugate-gradient.ps
- Adaptive Signal Processing, Widrow & Stearns, Prentice Hall, 1985
- Advanced:
- Practical Methods of Optimization, R. Fletcher, 2nd Edition, Wiley
Slide 4: Parameter Estimation Framework
- The basic goal of machine learning is:
- Given a set of data that describes the problem, estimate the model's parameters in some "best" sense.
- There exists a data set of the form (regression problem)
- There exists a model of the form
- ŷ(x) = m(θ, x), where θ is the parameter vector (generally including a bias term)
- There exists a measure of performance
- Note: there exist many variations on this basic form
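In standard regression notation (assumed here, since the slide's own equations are not reproduced in this text), the data set and model can be written as
\[
D = \{(x_i, y_i)\}_{i=1}^{l}, \qquad x_i \in \mathbb{R}^{n+1}, \quad y_i \in \mathbb{R}, \qquad \hat{y}(x) = m(\theta, x),
\]
with the bias absorbed as an extra input component (consistent with the (n+1)-dimensional parameter vector on Slide 8), and with the performance measure f(θ) defined on Slide 7.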
Slide 5: Parameter Estimation Goal
- For a fixed model and data set, calculate the optimal value of the parameters that minimises the performance function
- Open questions:
- How does f() depend on θ?
- How do we refine the current estimate, θk, of θ?
[Figure: one-parameter view of parameter optimisation, plotting f(θ) against θ with the current estimate θk marked.]
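In symbols (a standard statement of the goal; the slide's own equation is an image), the problem is to find
\[
\theta^{*} = \arg\min_{\theta} f(\theta).
\]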
Slide 6: Direct Solutions and Iterative Estimation
- Direct solutions, such as the generalized inverse, are only possible for linear models with quadratic performance functions
- In general, an iterative approach is needed
- Desire: each update should reduce the performance function, f(θk+1) < f(θk)
- Onto gradient descent learning
[Figure: one-parameter sketch of an iterative descent step, moving from θk to θk+1 down the curve f(θ).]
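Written out under the least-squares conventions used later (an assumption, since the slide's equation is an image), the direct generalized-inverse solution for a linear model ŷ = xᵀθ is
\[
\hat{\theta} = (X^{T}X)^{-1}X^{T}y,
\]
whereas non-linear models or non-quadratic costs must be approached iteratively, one descent step Δθk at a time.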
Slide 7: Review: Mean Squared Error Performance
- The quadratic (mean) squared error performance function is the most widely used
- This is because:
- There is a closed-form solution for linear models ŷ = xᵀθ
- Simply invert the matrix
- It gives a local Taylor series approximation for non-linear models
- It gives an analytic representation of the gradient and of second-order methods
- It has a Gaussian interpretation (log likelihood), as it represents a scaled (l/2) estimate of the measurement noise variance
- Exercise: show that mean(y − ŷ) = 0 for an optimal linear model
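Under the ½ sum-of-squared-errors convention (an assumption, but the one consistent with the (l/2) noise-variance remark above and with the stability bound on Slide 14), the performance function is
\[
f(\theta) = \tfrac{1}{2}\sum_{i=1}^{l}\bigl(y_i - \hat{y}(x_i)\bigr)^{2}
          = \tfrac{1}{2}\,\lVert y - X\theta \rVert^{2}
\]
for the linear model ŷ = xᵀθ, so that at the optimum f(θ*) is roughly (l/2) times the measurement noise variance.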
Slide 8: Review: Quadratic MSE for Linear Models
- For a linear (in the parameters) model of the form ŷ = xᵀθ, the quadratic function is determined by:
- the Hessian, the (scaled) variance/covariance matrix, of size (n+1)×(n+1)
- the correlation vector, of size (n+1)×1
[Figure: quadratic performance surface f plotted over the two parameters θ1 and θ2.]
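Expanding the cost above for the linear model (a sketch under the same conventions) makes the quadratic structure explicit:
\[
f(\theta) = \tfrac{1}{2}\theta^{T}(X^{T}X)\theta - (X^{T}y)^{T}\theta + \tfrac{1}{2}y^{T}y,
\]
with Hessian H = XᵀX, the (n+1)×(n+1) (scaled) variance/covariance matrix, and correlation vector g = Xᵀy of size (n+1)×1.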
Slide 9: Review: Normal Equations
- When the parameter vector is optimal, the gradient of the performance function is zero
- For a quadratic MSE with a linear model, setting the gradient to zero at optimality gives the normal equations for least squares modelling
- Using this, the performance function can be expressed as a quadratic in (θ − θ*) plus some constant c
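Filling in the missing equations (assuming the ½ sum-of-squares cost), optimality and the resulting quadratic form read
\[
\nabla f(\theta^{*}) = X^{T}X\theta^{*} - X^{T}y = 0
\;\Longrightarrow\;
X^{T}X\theta^{*} = X^{T}y,
\qquad
f(\theta) = \tfrac{1}{2}(\theta - \theta^{*})^{T}X^{T}X(\theta - \theta^{*}) + c.
\]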
Slide 10: Gradient Descent Learning
- The basic aim of gradient descent learning:
- Given the current parameter estimate and the gradient vector of the performance function with respect to the parameters, update the parameters along the negative gradient
- For a linear model with an MSE performance function, batch gradient descent learning gives (similar to LMS):
- η > 0 is the learning rate
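As a sketch in the notation above, the batch update is
\[
\Delta\theta_{k} = -\eta\,\nabla f(\theta_{k}), \qquad
\nabla f(\theta_{k}) = X^{T}(X\theta_{k} - y), \qquad
\theta_{k+1} = \theta_{k} - \eta\,X^{T}(X\theta_{k} - y),
\]
i.e. the batch analogue of the per-pattern LMS rule.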
Slide 11: Gradients and Contours
[Figure: contours of f(θ), with the negative gradient at the current point drawn perpendicular to the contour's tangent.]
- The gradient/negative gradient is perpendicular to the tangent to the contour at the current point.
- Exercise: Prove this using a Taylor series
Slide 12: 2D Visualisation of Gradient Descent
- 20 iterations of gradient descent learning with θ0 = [−0.5, 1.5]ᵀ, θ* = [1, 1]ᵀ, H = [0.333 0.166; 0.166 0.333] and η = 1
- Exercise: Obtain a gradient expression using only H, θ* and θk (see Slide 9)
- Exercise: Implement this scenario as a Matlab script (a sketch follows below)
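A minimal Matlab sketch of this scenario (not the official lab solution; it assumes the gradient expression ∇f(θ) = H(θ − θ*) asked for in the first exercise):

% Gradient descent on the Slide 12 quadratic (sketch only).
H    = [0.333 0.166; 0.166 0.333];   % Hessian of the quadratic performance function
qopt = [1; 1];                        % optimal parameter vector theta*
q    = [-0.5; 1.5];                   % initial estimate theta_0
eta  = 1;                             % learning rate
traj = q;                             % record the parameter trajectory
for k = 1:20
    g    = H*(q - qopt);              % gradient: grad f = H*(theta - theta*)
    q    = q - eta*g;                 % steepest-descent update
    traj = [traj q];
end
plot(traj(1,:), traj(2,:), 'o-'); xlabel('\theta_1'); ylabel('\theta_2');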
Slide 13: Pictorial Stability Investigation
[Figure: four panels of gradient-descent trajectories on the Slide 12 problem, for learning rates η = 1, η = 3, η = 4 and η = 5.]
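As a check (this eigenvalue calculation is mine, not from the slides), the eigenvalues of the Slide 12 Hessian are
\[
\lambda = 0.333 \pm 0.166, \qquad \lambda_{1} \approx 0.50, \quad \lambda_{2} \approx 0.17,
\]
so the bound on Slide 14 predicts stability only for η < 2/0.50 ≈ 4: η = 1 and η = 3 lie inside the stable range, η = 4 is on the margin and η = 5 is outside it.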
Slide 14: Stability of Linear Gradient Descent
- What values of η give a stable learning process?
- What values of η reduce the parameter errors?
- We need all the eigenvalues of (I − ηXᵀX) to have magnitude < 1
- Let σ1, σ2, …, σn be the positive eigenvalues of XᵀX; the eigenvalues of (I − ηXᵀX) are then 1 − ησi
- These are all < 1 in magnitude when 0 < η < 2/σmax
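The step that produces this matrix (a standard derivation, not copied from the slide): subtract θ* from both sides of the batch update on Slide 10 and use the normal equations Xᵀy = XᵀXθ*:
\[
\theta_{k+1} - \theta^{*}
 = (\theta_{k} - \theta^{*}) - \eta X^{T}X(\theta_{k} - \theta^{*})
 = (I - \eta X^{T}X)(\theta_{k} - \theta^{*}),
\]
so the parameter error is multiplied by (I − ηXᵀX) at every iteration, and it shrinks only if every eigenvalue 1 − ησi lies strictly inside (−1, 1).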
Slide 15: Non-linear Gradient Descent
- When a non-linear model is used, the MSE is no longer a quadratic function of the parameters
- Taylor series estimate
- Non-linear models can be locally approximated by a linear model and gradient descent applied to that local approximation; the difficulties are:
- local minima
- plateau regions
- a non-constant Hessian matrix
[Figure: sketch of a non-quadratic f(θ) with local minima and plateau regions.]
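The Taylor series estimate referred to above is presumably the usual second-order expansion about the current estimate θk:
\[
f(\theta) \approx f(\theta_{k}) + \nabla f(\theta_{k})^{T}(\theta - \theta_{k})
          + \tfrac{1}{2}(\theta - \theta_{k})^{T} H(\theta_{k})(\theta - \theta_{k}),
\]
which is only quadratic, and hence only amenable to the analysis above, in a neighbourhood of θk.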
Slide 16: 2nd Order Methods: Newton's Algorithm
- Gradient descent is known as a 1st order algorithm because it only uses knowledge about the 1st derivative (the gradient)
- Newton's method:
- Δθk = ηs
- is 2nd order
- requires a matrix inversion
- Gradient descent moves perpendicular to the local contours
- It is slow for badly conditioned systems (steep-sided, flat-bottomed valleys)
- Can we not move directly to the optimum?
[Figure: contours of f(θ), with the Newton direction s pointing from θk towards the optimum.]
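Spelled out (a standard form, assumed to match the slide), minimising the local quadratic model from Slide 15 gives the Newton direction and update
\[
s = -H(\theta_{k})^{-1}\,\nabla f(\theta_{k}), \qquad \Delta\theta_{k} = \eta\, s,
\]
so for a truly quadratic f with η = 1 a single step lands exactly on the optimum; the price is forming and inverting the Hessian.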
Slide 17: Lecture 5/6 Summary
- Often, the central problem of machine learning is parameter estimation
- With an MSE performance function and a linear model this is a quadratic optimisation problem, for which there exists a closed-form solution
- Gradient descent can be used to recursively estimate parameters for both quadratic and non-quadratic performance functions, and for linear and non-linear models
- Learning is stable when 0 < η < mini 2/σi (= 2/σmax)
- Learning is slow when σmax >> σmin (the ratio σmax/σmin is known as the condition number)
- Higher-order learning methods (Newton, etc.) are possible; although they are more computationally costly, they converge much more quickly.
Slide 18: Lecture 5/6 Laboratory
- Matlab
- Obtain the gradient expression and implement the gradient descent learning scenario described on Slide 12. Make sure you obtain similar stability results to Slide 13.
- Use the contour command to superimpose the contours of the quadratic performance function on the learning history from (1).
- Modify the LMS update algorithm from Lab 2 so that the scenario performs a single batch gradient descent update at the end of each complete pass through the data set (see Slide 10), rather than after every pattern is presented to the model. How do the learning trajectories compare (LMS vs. batch gradient descent)?
- Theory
- Show that H = XᵀX is positive definite (non-singular)
- Show that the instantaneous Hessian Hk = xkxkᵀ has a single non-zero eigenvalue and calculate its corresponding eigenvector. How does this value compare with the stability constraint for the LMS algorithm?
- Do the exercises on Slides 7, 11 and 12.