Title: 1/24, v3.0
1. Lectures 7&8: Non-linear Classification and Regression using Layered Perceptrons
- Dr Martin Brown
- Room E1k
- Email: martin.brown_at_manchester.ac.uk
- Telephone: 0161 306 4672
- http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/
2. Lectures 7&8: Outline
- What approaches are possible for non-linear classification and regression problems?
- Non-linear polynomial networks
- Potential and problems of using flexible models
- Sigmoidal-type non-linear transformations
- Modelling capabilities
- Regression and classification interpretation
- Parameter optimization using gradient descent
- Non-linear logical functions and layered Perceptron nets
- Leads on to Multi-Layer Perceptron (MLP) models next week
3. Lectures 7&8: Resources
- These slides are largely self-contained, but extra background material can be found in:
- Machine Learning, T. Mitchell, McGraw Hill, 1997
- Machine Learning, Neural and Statistical Classification, D. Michie, D.J. Spiegelhalter and C.C. Taylor, 1994, http://www.amsta.leeds.ac.uk/charles/statlog/
- In addition, there are many on-line sources for multi-layer perceptrons (MLPs) and error back propagation (EBP); just search on Google
- Advanced text:
- Information Theory, Inference and Learning Algorithms, D. MacKay, Cambridge University Press, 2003
4. Non-Linear Regression and Classification
- Most real-world modelling problems are not linear
- A task is non-linear if it cannot be represented using a linear model
- Classification: the number of classification errors is too large
- Regression: the noise variance is too large
- Using non-linear models/relationships may help to approximate f()
5. Non-Linear Classification
- Consider the following 2-class classification problem
- Always compare to the prior error rate
- Exercise: What are the error rates for the prior, optimal linear and non-linear models?
- The type of non-linear function is important
- The data is generated by a non-linear rule (with classification errors)
[Figure: 2-class data plotted in the (x1, x2) plane]
6. Non-linear Regression
- Need to balance model complexity against data accuracy
- How much of the signal is reproducible?
7. Polynomial Non-Linear Models
- A simple and convenient way to extend linear models is to consider polynomial expansions, such as a quadratic (see the expansion below)
- Expansion to any order is possible: cubic, quadratic, or a subset of terms
- A linear model is produced when the parameters of the higher-order terms are zero
- A polynomial model is linear in its parameters
- Polynomials can approximate any continuous function arbitrarily closely if a high enough order expansion is used (Taylor series)
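As an illustration (the slide's own equation was not preserved, so the notation here is assumed), a quadratic expansion of a two-input linear model can be written as
\[
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2
        = \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}),
\qquad
\boldsymbol{\phi}(\mathbf{x}) = (1,\; x_1,\; x_2,\; x_1^2,\; x_2^2,\; x_1 x_2)^T ,
\]
which is non-linear in the inputs x but still linear in the parameters θ, so least-squares estimation still applies.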
8. Example: Quadratic Decision Boundary
- A quadratic 2-class classifier is given below
- Its decision boundary is a 2-dimensional ellipse
- Example of a quadratic classification boundary for the Iris Setosa data. Could the Perceptron simulation be modified to work on this?
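The slide's equations were not preserved; a standard form consistent with the text (assumed notation, matching the quadratic expansion above) is
\[
\hat{y} = \operatorname{sign}\!\big(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2\big),
\]
with the decision boundary given by setting the quadratic form to zero; for suitable parameter values this is an ellipse in the (x1, x2) plane.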
9. Polynomial Regression: Overfitting
- The optimal, least-squares parameter estimator is given by θ̂ = (X^T X)^(-1) X^T y
- where X is the data matrix: each row represents a data point and each column is one polynomial basis term
- Which polynomial terms should be used? Polynomials are flexible but can be quite oscillatory (high-frequency components), which is usually not appropriate
- Example: 20 data points, x randomly drawn from a unit-variance normal distribution, y = exp(-x^2), fitted by a fifth order polynomial
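A minimal MATLAB sketch of this example (random draw and plotting details are assumed, not taken from the slide):

% 20 points, x ~ N(0,1), y = exp(-x^2), fitted by a 5th-order polynomial
N = 20;
x = randn(N, 1);                          % inputs drawn from a unit-variance normal
y = exp(-x.^2);                           % noise-free target function
p = polyfit(x, y, 5);                     % least-squares 5th-order polynomial fit
xg = linspace(min(x)-1, max(x)+1, 200)';  % evaluation grid, slightly wider than the data
plot(xg, exp(-xg.^2), 'b-', xg, polyval(p, xg), 'r--', x, y, 'ko');
legend('true function', '5th-order polynomial fit', 'training data');

The fitted polynomial typically oscillates between the data points and diverges outside the data range, illustrating why high-order polynomial terms are often not appropriate.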
10. Sigmoidal Non-Linear Transformations
- Let's consider another way to introduce non-linearities into a basic linear model, by producing a continuous, non-linear transformation of a weighted sum
- What sort of single input, single output functions, f(), are possible?
- To estimate parameters using gradient descent, it should be differentiable
- To use it for classification and regression, it should be able to represent linear and step functions, as appropriate
[Figure: a single sigmoidal node with inputs x0 = 1, x1, ..., xn, weights θ0, θ1, ..., θn, and output y]
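In the notation of the figure (with x0 = 1 acting as the bias input), the node computes a sigmoidal transformation of a weighted sum:
\[
y = f(u), \qquad u = \sum_{i=0}^{n} \theta_i x_i = \boldsymbol{\theta}^T \mathbf{x}, \qquad x_0 = 1 .
\]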
11. The Tanh() Function
- Consider the tanh() function, whose output lies in (-1, 1)
- When there is a single input, u = θ0 + θ1x
- When θ1 is large (≈ 4), tanh() is almost a step function
- When θ1 is small (≈ 0.25), it is almost a linear relationship
- θ0 shifts tanh() horizontally
[Figure: y = tanh(θ0 + θ1x) for a large θ1 (step-like) and a small θ1 (near-linear)]
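A short MATLAB sketch of the two regimes (θ0 = 0 is assumed; the values 4 and 0.25 are those quoted on the slide):

x = linspace(-5, 5, 200);
y_step = tanh(4*x);        % theta1 = 4: close to a step function
y_lin  = tanh(0.25*x);     % theta1 = 0.25: close to a straight line
plot(x, y_step, 'r-', x, y_lin, 'b--');
legend('tanh(4x)', 'tanh(0.25x)');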
12. The Tanh Function in 2D X-Space
- Such functions are often known as ridge functions, because they are constant along lines in input space on which u = x^Tθ = c
13. The (0,1) Sigmoid
- Many books/notes use the following sigmoid function, which has an output lying in the range (0, 1) (the standard form is given below)
- In these notes, we'll refer to both transformation functions as sigmoidal functions, because of their "lazy S" shape
- In fact, they're just transformations of each other
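The slide's formula was not preserved; the standard (0,1) sigmoid it refers to is most likely the logistic function, which is an affine transformation of tanh:
\[
\sigma(u) = \frac{1}{1 + e^{-u}}, \qquad
\sigma(u) = \tfrac{1}{2}\big(1 + \tanh(u/2)\big), \qquad
\tanh(v) = 2\sigma(2v) - 1 .
\]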
14. Sigmoidal Parameter Estimation
- Gradient descent update for a single training datum (see the derivation below)
- For the i-th training pattern
- Using the chain rule
- Giving an update rule
- Similar to the LMS rule, apart from the extra sigmoidal derivative term, f'(·)
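The slide's equations did not survive extraction; a reconstruction consistent with the surrounding text (a squared-error loss and learning rate η are assumed) is
\[
u_i = \boldsymbol{\theta}^T \mathbf{x}_i, \qquad
\hat{y}_i = f(u_i), \qquad
E_i = \tfrac{1}{2}\,(y_i - \hat{y}_i)^2 ,
\]
\[
\frac{\partial E_i}{\partial \boldsymbol{\theta}}
  = \frac{\partial E_i}{\partial \hat{y}_i}\,
    \frac{\partial \hat{y}_i}{\partial u_i}\,
    \frac{\partial u_i}{\partial \boldsymbol{\theta}}
  = -(y_i - \hat{y}_i)\, f'(u_i)\, \mathbf{x}_i ,
\]
\[
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta\,(y_i - \hat{y}_i)\, f'(u_i)\, \mathbf{x}_i ,
\]
which reduces to the LMS rule when f() is the identity, since then f'(u) = 1.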
15. Sigmoidal Parameter Estimation (ii)
- The sigmoidal function's derivative (tanh) is given below
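The derivative itself was not reproduced in this extraction; for f(u) = tanh(u) it is
\[
f'(u) = 1 - \tanh^2(u) = 1 - y^2 ,
\]
so the update rule on the previous slide only needs the node's output y, not the weighted sum u, which is convenient for implementation.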
16. Layered Perceptron Networks
- In this section, we're going to consider how these sigmoidal nodes can be connected together into layers to give greater/more flexible non-linear modelling behaviour
- Two central questions:
- What are the non-linear modelling capabilities?
- How can the non-linear parameters be estimated?
[Figure: a two-layer network with inputs x0 (bias), x1, x2, hidden nodes h0 (bias), h1, h2, and a single output y]
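As a concrete sketch of the structure in the figure (the tanh activation and the parameter values below are placeholders for illustration, not the slide's), the forward pass can be written in MATLAB as:

% Forward pass of a 2-input, 2-hidden-node, 1-output layered perceptron.
Theta1 = [ 0.1  0.2 -0.3;    % hidden node h1 parameters: [bias, x1, x2]
          -0.2  0.4  0.1];   % hidden node h2 parameters
theta2 = [0.05; 0.7; -0.6];  % output node parameters: [bias, h1, h2]

x = [1; 0];                  % an example input (x1, x2)
h = tanh(Theta1 * [1; x]);   % hidden-layer outputs (x0 = 1 is the bias input)
y = tanh(theta2' * [1; h])   % network output (h0 = 1 is the hidden-layer bias)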
17. Linearly Separable 2D Logical Functions
- Note: class output values of 0 and 1 are used in the next few slides
- AND
- OR
- NOT
[Figures: the AND, OR and NOT functions plotted in the (x1, x2) plane, each separable by a linear decision boundary; the parameter values are left as an exercise]
18. Non-linearly Separable 2D XOR
- eXclusive OR (XOR): n-bit parity with 2 inputs
- Data generated by y = (NOT x2 AND x1) OR (NOT x1 AND x2)
- A non-linear, polynomial input transformation, x3 = x1x2, makes the problem separable
- How can multi-layer networks solve it? (See the sketch below.)
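A minimal MATLAB sketch using the product feature x3 = x1x2 from the slide (the weight values are illustrative, found by inspection, not taken from the slides):

% XOR with an added product feature x3 = x1*x2 (0/1 input coding assumed).
X = [0 0; 0 1; 1 0; 1 1];         % the four XOR input patterns
y = [0; 1; 1; 0];                 % target outputs
X3 = [X, X(:,1).*X(:,2)];         % augment each pattern with x3 = x1*x2

theta = [-0.5; 1; 1; -2];         % illustrative weights [bias; x1; x2; x3]
y_hat = ([ones(4,1), X3] * theta) > 0   % reproduces [0; 1; 1; 0]

A single linear threshold unit on (x1, x2) alone cannot reproduce XOR, but with the extra product term the same unit separates the classes.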
19. Multi-Layer Network for 2D XOR
- XOR can be implemented as a two-layer network (two layers of adjustable parameters) with two hidden nodes in the hidden layer
- Empty circles represent linear Perceptron nodes
- Solid circles represent real signals
- Arrows represent model parameters θ
- y = (NOT x2 AND x1) OR (NOT x1 AND x2) is represented in a 2-layer network as
- h1 = (NOT x2 AND x1)
- h2 = (NOT x1 AND x2), y = h1 OR h2
[Figure: two-layer network with inputs x0 = 1, x1, x2, a hidden layer with bias h0 = 1 and nodes h1, h2, and an output layer producing y]
20. Exercise: Determine the 9 Parameters
- Write down the parameter vectors for the 3 Perceptron nodes:
- h1 = (NOT x2 AND x1)
- h2 = (NOT x1 AND x2)
- y = h1 OR h2
21. Logical Functions and DNF
- Any logical function can be expressed as a union (disjunction) of conjunction terms over possibly negated inputs
- It can therefore be realized with a 2-layer Perceptron network
- Each hidden layer unit responds to exactly one positive example
- The output layer is formed from the union of the hidden layer outputs: f = h1 OR h2 OR ... OR hP
- Each data point/positive example is given its own hidden unit, which responds to only that point
- Essentially, the network memorizes the positive training samples
22. Lectures 7&8: Conclusions
- There are many ways to build and use non-linear models for classification and regression purposes
- Potentially more accurate predictions/fewer errors if the data is generated by a non-linear relationship
- Parameter estimation is sometimes more complex
- No direct, optimal parameter calculation
- Gradient-based estimation suffers from local minima and differing curvatures
- Need to select an appropriate non-linear framework
- Multi-layer (sigmoidal) Perceptrons are one such framework
- Non-linearity is controlled by the nodes in the hidden layer
- Parameters are estimated using gradient descent
- Several factors need to be considered
23. Lectures 7&8: Laboratory (i)
- MATLAB
- Extend the basic Perceptron MATLAB script so that it trains a quadratic classifier (note that the plotting routines will no longer be appropriate)
- Implement the sigmoidal perceptron learning algorithm, where the model consists of a single layer with a tanh activation function and the parameters are updated after each presentation of a datum (see Slides 10-14)
- Test the algorithm on the logical AND and logical OR data, as you did for the normal Perceptron algorithm in the laboratory in IS2.ppt
- What are the similarities/differences of this model compared to the normal Perceptron algorithm described in IS2.ppt?
24. Lectures 7&8: Laboratory (ii)
- Theory
- Prove the relationship on Slide 13 between the two types of sigmoids
- Verify the derivative of the tanh function on Slide 15, and prove that the derivative of the (0,1) sigmoid on Slide 13 can be expressed as y(1-y)
- Calculate the optimal parameter values missing on Slides 17 and 20
- Derive a generic rule for setting the parameter values on Slide 21 for an arbitrary logical function. You may assume that you know the number of positive examples, the number of features and the logical structure of each positive example