Slide 1: Lecture 5/6 - Using Optimization Theory for Parameter Estimation
- Dr Martin Brown
- Room E1k
- Email: martin.brown@manchester.ac.uk
- Telephone: 0161 306 4672
- http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/
Slide 2: Lecture 5/6 Outline
- Parameter estimation using optimization
- Review of parameter estimation
- MSE (mean squared error) performance function
- Gradient descent parameter estimation
- Convergence/stability analysis
- Non-linear models, Newton's method and Quadratic Programming
- The goal of this lecture is to show how gradient descent algorithms can be used to learn/estimate a model's parameters, and to consider their limitations and extensions.
Slide 3: Lecture 5/6 Resources
- These slides are largely self-contained, but extra background material can be found in:
- Chapter 2, Machine Learning (MIT OpenCourseWare), T. Jaakkola, http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-867Machine-LearningFall2002/CourseHome/index.htm
- An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, J. S. Shewchuk, 1994, Technical Report, http://www.cs.cmu.edu/quake-papers/painless-conjugate-gradient.ps
- Adaptive Signal Processing, Widrow & Stearns, Prentice Hall, 1985
- Advanced:
- Practical Methods of Optimization, R. Fletcher, 2nd Edition, Wiley
Slide 4: Parameter Estimation Framework
- The basic goal of machine learning is:
- Given a set of data that describes the problem, estimate the model's parameters in some "best" sense.
- There exists a data set of the form (regression problem)
- There exists a model of the form
- ŷ(x) = m(θ, x), where θ is the parameter vector (generally including a bias term)
- There exists a measure of performance
- Note: there exist many variations on this basic form
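In standard regression notation (assumed here, since the slide's own equations are not reproduced in this text), the data set and model can be written as
\[
D = \{(x_i, y_i)\}_{i=1}^{l}, \qquad x_i \in \mathbb{R}^{n+1}, \quad y_i \in \mathbb{R}, \qquad \hat{y}(x) = m(\theta, x),
\]
with the bias absorbed as an extra input component (consistent with the (n+1)-dimensional parameter vector on Slide 8), and with the performance measure f(θ) defined on Slide 7.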
Slide 5: Parameter Estimation Goal
- For a fixed model and data set, calculate the optimal value of the parameters that minimises the performance function
- Open questions:
- How does f() depend on θ?
- How do we refine the current estimate, θk, of θ?
[Figure: one-parameter view of parameter optimisation, plotting f(θ) against θ with the current estimate θk marked.]
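In symbols (a standard statement of the goal; the slide's own equation is an image), the problem is to find
\[
\theta^{*} = \arg\min_{\theta} f(\theta).
\]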
Slide 6: Direct Solutions and Iterative Estimation
- Direct solutions, such as the generalized inverse, are only possible for linear models with quadratic performance functions
- In general, an iterative approach is needed
- Desire: each update should reduce the performance function, f(θk+1) < f(θk)
- Onto gradient descent learning
[Figure: one-parameter sketch of an iterative descent step, moving from θk to θk+1 down the curve f(θ).]
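Written out under the least-squares conventions used later (an assumption, since the slide's equation is an image), the direct generalized-inverse solution for a linear model ŷ = xᵀθ is
\[
\hat{\theta} = (X^{T}X)^{-1}X^{T}y,
\]
whereas non-linear models or non-quadratic costs must be approached iteratively, one descent step Δθk at a time.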
Slide 7: Review: Mean Squared Error Performance
- The quadratic (mean) squared error performance function is the most widely used
- This is because:
- There is a closed-form solution for linear models ŷ = xᵀθ
- Simply invert the matrix
- It gives a local Taylor series approximation for non-linear models
- It gives an analytic representation of the gradient and of second-order methods
- It has a Gaussian interpretation (log likelihood), as it represents a scaled (l/2) estimate of the measurement noise variance
- Exercise: show that mean(y − ŷ) = 0 for an optimal linear model
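Under the ½ sum-of-squared-errors convention (an assumption, but the one consistent with the (l/2) noise-variance remark above and with the stability bound on Slide 14), the performance function is
\[
f(\theta) = \tfrac{1}{2}\sum_{i=1}^{l}\bigl(y_i - \hat{y}(x_i)\bigr)^{2}
          = \tfrac{1}{2}\,\lVert y - X\theta \rVert^{2}
\]
for the linear model ŷ = xᵀθ, so that at the optimum f(θ*) is roughly (l/2) times the measurement noise variance.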
Slide 8: Review: Quadratic MSE for Linear Models
- For a linear (in the parameters) model of the form ŷ = xᵀθ, the quadratic function is determined by:
- the Hessian, the (scaled) variance/covariance matrix, of size (n+1)×(n+1)
- the correlation vector, of size (n+1)×1
[Figure: quadratic performance surface f plotted over the two parameters θ1 and θ2.]
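Expanding the cost above for the linear model (a sketch under the same conventions) makes the quadratic structure explicit:
\[
f(\theta) = \tfrac{1}{2}\theta^{T}(X^{T}X)\theta - (X^{T}y)^{T}\theta + \tfrac{1}{2}y^{T}y,
\]
with Hessian H = XᵀX, the (n+1)×(n+1) (scaled) variance/covariance matrix, and correlation vector g = Xᵀy of size (n+1)×1.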
Slide 9: Review: Normal Equations
- When the parameter vector is optimal, the gradient of the performance function is zero
- For a quadratic MSE with a linear model, setting the gradient to zero at optimality gives the normal equations for least squares modelling
- Using this, the performance function can be expressed as a quadratic in (θ − θ*) plus some constant c
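Filling in the missing equations (assuming the ½ sum-of-squares cost), optimality and the resulting quadratic form read
\[
\nabla f(\theta^{*}) = X^{T}X\theta^{*} - X^{T}y = 0
\;\Longrightarrow\;
X^{T}X\theta^{*} = X^{T}y,
\qquad
f(\theta) = \tfrac{1}{2}(\theta - \theta^{*})^{T}X^{T}X(\theta - \theta^{*}) + c.
\]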
Slide 10: Gradient Descent Learning
- The basic aim of gradient descent learning:
- Given the current parameter estimate and the gradient vector of the performance function with respect to the parameters, update the parameters along the negative gradient
- For a linear model with an MSE performance function, batch gradient descent learning gives (similar to LMS):
- η > 0 is the learning rate
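As a sketch in the notation above, the batch update is
\[
\Delta\theta_{k} = -\eta\,\nabla f(\theta_{k}), \qquad
\nabla f(\theta_{k}) = X^{T}(X\theta_{k} - y), \qquad
\theta_{k+1} = \theta_{k} - \eta\,X^{T}(X\theta_{k} - y),
\]
i.e. the batch analogue of the per-pattern LMS rule.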
Slide 11: Gradients and Contours
[Figure: contours of f(θ), with the negative gradient at the current point drawn perpendicular to the contour's tangent.]
- The gradient/negative gradient is perpendicular to the tangent to the contour at the current point.
- Exercise: Prove this using a Taylor series
Slide 12: 2D Visualisation of Gradient Descent
- 20 iterations of gradient descent learning with θ0 = [−0.5, 1.5]ᵀ, θ* = [1, 1]ᵀ, H = [0.333 0.166; 0.166 0.333] and η = 1
- Exercise: Obtain a gradient expression using only H, θ* and θk (see Slide 9)
- Exercise: Implement this scenario as a Matlab script (a sketch follows below)
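A minimal Matlab sketch of this scenario (not the official lab solution; it assumes the gradient expression ∇f(θ) = H(θ − θ*) asked for in the first exercise):

% Gradient descent on the Slide 12 quadratic (sketch only).
H    = [0.333 0.166; 0.166 0.333];   % Hessian of the quadratic performance function
qopt = [1; 1];                        % optimal parameter vector theta*
q    = [-0.5; 1.5];                   % initial estimate theta_0
eta  = 1;                             % learning rate
traj = q;                             % record the parameter trajectory
for k = 1:20
    g    = H*(q - qopt);              % gradient: grad f = H*(theta - theta*)
    q    = q - eta*g;                 % steepest-descent update
    traj = [traj q];
end
plot(traj(1,:), traj(2,:), 'o-'); xlabel('\theta_1'); ylabel('\theta_2');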
Slide 13: Pictorial Stability Investigation
[Figure: four panels of gradient-descent trajectories on the Slide 12 problem, for learning rates η = 1, η = 3, η = 4 and η = 5.]
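As a check (this eigenvalue calculation is mine, not from the slides), the eigenvalues of the Slide 12 Hessian are
\[
\lambda = 0.333 \pm 0.166, \qquad \lambda_{1} \approx 0.50, \quad \lambda_{2} \approx 0.17,
\]
so the bound on Slide 14 predicts stability only for η < 2/0.50 ≈ 4: η = 1 and η = 3 lie inside the stable range, η = 4 is on the margin and η = 5 is outside it.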
Slide 14: Stability of Linear Gradient Descent
- What values of η give a stable learning process?
- What values of η reduce the parameter errors?
- We need all the eigenvalues of (I − ηXᵀX) to have magnitude < 1
- Let σ1, σ2, …, σn be the positive eigenvalues of XᵀX; the eigenvalues of (I − ηXᵀX) are then 1 − ησi
- These are all < 1 in magnitude when 0 < η < 2/σmax
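The step that produces this matrix (a standard derivation, not copied from the slide): subtract θ* from both sides of the batch update on Slide 10 and use the normal equations Xᵀy = XᵀXθ*:
\[
\theta_{k+1} - \theta^{*}
 = (\theta_{k} - \theta^{*}) - \eta X^{T}X(\theta_{k} - \theta^{*})
 = (I - \eta X^{T}X)(\theta_{k} - \theta^{*}),
\]
so the parameter error is multiplied by (I − ηXᵀX) at every iteration, and it shrinks only if every eigenvalue 1 − ησi lies strictly inside (−1, 1).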
Slide 15: Non-linear Gradient Descent
- When a non-linear model is used, the MSE is no longer a quadratic function of the parameters
- Taylor series estimate
- Non-linear models can be locally approximated by a linear model and gradient descent applied to that local approximation; the difficulties are:
- local minima
- plateau regions
- a non-constant Hessian matrix
[Figure: sketch of a non-quadratic f(θ) with local minima and plateau regions.]
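The Taylor series estimate referred to above is presumably the usual second-order expansion about the current estimate θk:
\[
f(\theta) \approx f(\theta_{k}) + \nabla f(\theta_{k})^{T}(\theta - \theta_{k})
          + \tfrac{1}{2}(\theta - \theta_{k})^{T} H(\theta_{k})(\theta - \theta_{k}),
\]
which is only quadratic, and hence only amenable to the analysis above, in a neighbourhood of θk.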
Slide 16: 2nd Order Methods: Newton's Algorithm
- Gradient descent is known as a 1st order algorithm because it only uses knowledge about the 1st derivative (the gradient)
- Newton's method:
- Δθk = ηs
- is 2nd order
- requires a matrix inversion
- Gradient descent moves perpendicular to the local contours
- It is slow for badly conditioned systems (steep-sided, flat-bottomed valleys)
- Can we not move directly to the optimum?
[Figure: contours of f(θ), with the Newton direction s pointing from θk towards the optimum.]
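Spelled out (a standard form, assumed to match the slide), minimising the local quadratic model from Slide 15 gives the Newton direction and update
\[
s = -H(\theta_{k})^{-1}\,\nabla f(\theta_{k}), \qquad \Delta\theta_{k} = \eta\, s,
\]
so for a truly quadratic f with η = 1 a single step lands exactly on the optimum; the price is forming and inverting the Hessian.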
Slide 17: Lecture 5/6 Summary
- Often, the central problem of machine learning is parameter estimation
- With an MSE performance function and a linear model this is a quadratic optimisation problem, for which there exists a closed-form solution
- Gradient descent can be used to recursively estimate parameters for both quadratic and non-quadratic performance functions, and for linear and non-linear models
- Learning is stable when 0 < η < mini 2/σi (= 2/σmax)
- Learning is slow when σmax >> σmin (the ratio σmax/σmin is known as the condition number)
- Higher-order learning methods (Newton, etc.) are possible; although they are more computationally costly, they converge much more quickly.
Slide 18: Lecture 5/6 Laboratory
- Matlab
- Obtain the gradient expression and implement the gradient descent learning scenario described on Slide 12. Make sure you obtain similar stability results to Slide 13.
- Use the contour command to superimpose the contours of the quadratic performance function on the learning history from (1).
- Modify the LMS update algorithm from Lab 2 so that the scenario performs a single batch gradient descent update at the end of each complete pass through the data set (see Slide 10), rather than after every pattern is presented to the model. How do the learning trajectories compare (LMS vs. batch gradient descent)?
- Theory
- Show that H = XᵀX is positive definite (non-singular)
- Show that the instantaneous Hessian Hk = xkxkᵀ has a single non-zero eigenvalue and calculate its corresponding eigenvector. How does this value compare with the stability constraint for the LMS algorithm?
- Do the exercises on Slides 7, 11 and 12.