Title: 1/24, v3.0
1. Lectures 7&8: Non-linear Classification and Regression using Layered Perceptrons
- Dr Martin Brown
- Room E1k
- Email: martin.brown_at_manchester.ac.uk
- Telephone: 0161 306 4672
- http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/
2. Lectures 7&8: Outline
- What approaches are possible for non-linear classification and regression problems?
- Non-linear polynomial networks
- Potential and problems of using flexible models
- Sigmoidal-type non-linear transformations
- Modelling capabilities
- Regression and classification interpretation
- Parameter optimization using gradient descent
- Non-linear logical functions and layered Perceptron nets
- Leads on to Multi-Layer Perceptron (MLP) models next week
3. Lectures 7&8: Resources
- These slides are largely self-contained, but extra background material can be found in:
- Machine Learning, T. Mitchell, McGraw Hill, 1997
- Machine Learning, Neural and Statistical Classification, D. Michie, D.J. Spiegelhalter and C.C. Taylor, 1994, http://www.amsta.leeds.ac.uk/charles/statlog/
- In addition, there are many on-line sources for multi-layer perceptrons (MLPs) and error back propagation (EBP); just search on Google
- Advanced text:
- Information Theory, Inference and Learning Algorithms, D. MacKay, Cambridge University Press, 2003
4. Non-Linear Regression and Classification
- Most real-world modelling problems are not linear
- A task is non-linear if it cannot be represented using a linear model
- Classification: the number of classification errors is too large
- Regression: the noise variance is too large
- Using non-linear models/relationships may help to approximate f()
5. Non-Linear Classification
- Consider the following 2-class classification problem
- Always compare to the prior error rate
- Exercise: What are the error rates for the prior, optimal linear and non-linear models?
- The type of non-linear function is important
- The data is generated by a non-linear rule (with classification errors)
[Figure: 2-class data plotted in the (x1, x2) plane]
6. Non-linear Regression
- Need to balance model complexity against data accuracy
- How much of the signal is reproducible?
7. Polynomial Non-Linear Models
- A simple and convenient way to extend linear models is to consider polynomial expansions, such as a quadratic (see the expansion below)
- Expansion to any order is possible: cubic, quadratic, or a subset of terms
- A linear model is produced when the parameters of the higher-order terms are zero
- A polynomial model is linear in its parameters
- Polynomials can approximate any continuous function arbitrarily closely if a high enough order expansion is used (Taylor series)
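As an illustration (the slide's own equation was not preserved, so the notation here is assumed), a quadratic expansion of a two-input linear model can be written as
\[
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2
        = \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}),
\qquad
\boldsymbol{\phi}(\mathbf{x}) = (1,\; x_1,\; x_2,\; x_1^2,\; x_2^2,\; x_1 x_2)^T ,
\]
which is non-linear in the inputs x but still linear in the parameters θ, so least-squares estimation still applies.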
8. Example: Quadratic Decision Boundary
- A quadratic 2-class classifier is given below
- Its decision boundary is a 2-dimensional ellipse
- Example of a quadratic classification boundary for the Iris Setosa data. Could the Perceptron simulation be modified to work on this?
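The slide's equations were not preserved; a standard form consistent with the text (assumed notation, matching the quadratic expansion above) is
\[
\hat{y} = \operatorname{sign}\!\big(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2\big),
\]
with the decision boundary given by setting the quadratic form to zero; for suitable parameter values this is an ellipse in the (x1, x2) plane.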
9. Polynomial Regression: Overfitting
- The optimal, least-squares parameter estimator is given by θ̂ = (X^T X)^(-1) X^T y
- where X is the data matrix: each row represents a data point and each column is one polynomial basis term
- Which polynomial terms should be used? Polynomials are flexible but can be quite oscillatory (high-frequency components), which is usually not appropriate
- Example: 20 data points, x randomly drawn from a unit-variance normal distribution, y = exp(-x^2), fitted by a fifth order polynomial
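A minimal MATLAB sketch of this example (random draw and plotting details are assumed, not taken from the slide):

% 20 points, x ~ N(0,1), y = exp(-x^2), fitted by a 5th-order polynomial
N = 20;
x = randn(N, 1);                          % inputs drawn from a unit-variance normal
y = exp(-x.^2);                           % noise-free target function
p = polyfit(x, y, 5);                     % least-squares 5th-order polynomial fit
xg = linspace(min(x)-1, max(x)+1, 200)';  % evaluation grid, slightly wider than the data
plot(xg, exp(-xg.^2), 'b-', xg, polyval(p, xg), 'r--', x, y, 'ko');
legend('true function', '5th-order polynomial fit', 'training data');

The fitted polynomial typically oscillates between the data points and diverges outside the data range, illustrating why high-order polynomial terms are often not appropriate.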
10. Sigmoidal Non-Linear Transformations
- Let's consider another way to introduce non-linearities into a basic linear model, by producing a continuous, non-linear transformation of a weighted sum
- What sort of single input, single output functions, f(), are possible?
- To estimate parameters using gradient descent, it should be differentiable
- To use it for classification and regression, it should be able to represent linear and step functions, as appropriate
[Figure: a single sigmoidal node with inputs x0 = 1, x1, ..., xn, weights θ0, θ1, ..., θn, and output y]
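In the notation of the figure (with x0 = 1 acting as the bias input), the node computes a sigmoidal transformation of a weighted sum:
\[
y = f(u), \qquad u = \sum_{i=0}^{n} \theta_i x_i = \boldsymbol{\theta}^T \mathbf{x}, \qquad x_0 = 1 .
\]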
11. The Tanh() Function
- Consider the tanh() function, whose output lies in (-1, 1)
- When there is a single input, u = θ0 + θ1x
- When θ1 is large (≈ 4), tanh() is almost a step function
- When θ1 is small (≈ 0.25), it is almost a linear relationship
- θ0 shifts tanh() horizontally
[Figure: y = tanh(θ0 + θ1x) for a large θ1 (step-like) and a small θ1 (near-linear)]
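A short MATLAB sketch of the two regimes (θ0 = 0 is assumed; the values 4 and 0.25 are those quoted on the slide):

x = linspace(-5, 5, 200);
y_step = tanh(4*x);        % theta1 = 4: close to a step function
y_lin  = tanh(0.25*x);     % theta1 = 0.25: close to a straight line
plot(x, y_step, 'r-', x, y_lin, 'b--');
legend('tanh(4x)', 'tanh(0.25x)');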
12. The Tanh Function in 2D X-Space
- Such functions are often known as ridge functions, because they are constant along lines in input space on which u = x^Tθ = c
13. The (0,1) Sigmoid
- Many books/notes use the following sigmoid function, which has an output lying in the range (0, 1) (the standard form is given below)
- In these notes, we'll refer to both transformation functions as sigmoidal functions, because of their "lazy S" shape
- In fact, they're just transformations of each other
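The slide's formula was not preserved; the standard (0,1) sigmoid it refers to is most likely the logistic function, which is an affine transformation of tanh:
\[
\sigma(u) = \frac{1}{1 + e^{-u}}, \qquad
\sigma(u) = \tfrac{1}{2}\big(1 + \tanh(u/2)\big), \qquad
\tanh(v) = 2\sigma(2v) - 1 .
\]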
14. Sigmoidal Parameter Estimation
- Gradient descent update for a single training datum (see the derivation below)
- For the i-th training pattern
- Using the chain rule
- Giving an update rule
- Similar to the LMS rule, apart from the extra sigmoidal derivative term, f'(·)
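The slide's equations did not survive extraction; a reconstruction consistent with the surrounding text (a squared-error loss and learning rate η are assumed) is
\[
u_i = \boldsymbol{\theta}^T \mathbf{x}_i, \qquad
\hat{y}_i = f(u_i), \qquad
E_i = \tfrac{1}{2}\,(y_i - \hat{y}_i)^2 ,
\]
\[
\frac{\partial E_i}{\partial \boldsymbol{\theta}}
  = \frac{\partial E_i}{\partial \hat{y}_i}\,
    \frac{\partial \hat{y}_i}{\partial u_i}\,
    \frac{\partial u_i}{\partial \boldsymbol{\theta}}
  = -(y_i - \hat{y}_i)\, f'(u_i)\, \mathbf{x}_i ,
\]
\[
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta\,(y_i - \hat{y}_i)\, f'(u_i)\, \mathbf{x}_i ,
\]
which reduces to the LMS rule when f() is the identity, since then f'(u) = 1.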
15. Sigmoidal Parameter Estimation (ii)
- The sigmoidal function's derivative (tanh) is given below
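The derivative itself was not reproduced in this extraction; for f(u) = tanh(u) it is
\[
f'(u) = 1 - \tanh^2(u) = 1 - y^2 ,
\]
so the update rule on the previous slide only needs the node's output y, not the weighted sum u, which is convenient for implementation.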
16. Layered Perceptron Networks
- In this section, we're going to consider how these sigmoidal nodes can be connected together into layers to give greater/more flexible non-linear modelling behaviour
- Two central questions:
- What are the non-linear modelling capabilities?
- How can the non-linear parameters be estimated?
[Figure: a two-layer network with inputs x0 (bias), x1, x2, hidden nodes h0 (bias), h1, h2, and a single output y]
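As a concrete sketch of the structure in the figure (the tanh activation and the parameter values below are placeholders for illustration, not the slide's), the forward pass can be written in MATLAB as:

% Forward pass of a 2-input, 2-hidden-node, 1-output layered perceptron.
Theta1 = [ 0.1  0.2 -0.3;    % hidden node h1 parameters: [bias, x1, x2]
          -0.2  0.4  0.1];   % hidden node h2 parameters
theta2 = [0.05; 0.7; -0.6];  % output node parameters: [bias, h1, h2]

x = [1; 0];                  % an example input (x1, x2)
h = tanh(Theta1 * [1; x]);   % hidden-layer outputs (x0 = 1 is the bias input)
y = tanh(theta2' * [1; h])   % network output (h0 = 1 is the hidden-layer bias)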
17. Linearly Separable 2D Logical Functions
- Note: class output values of 0 and 1 are used in the next few slides
- AND
- OR
- NOT
[Figures: the AND, OR and NOT functions plotted in the (x1, x2) plane, each separable by a linear decision boundary; the parameter values are left as an exercise]
18. Non-linearly Separable 2D XOR
- eXclusive OR (XOR): n-bit parity with 2 inputs
- Data generated by y = (NOT x2 AND x1) OR (NOT x1 AND x2)
- A non-linear, polynomial input transformation, x3 = x1x2, makes the problem separable
- How can multi-layer networks solve it? (See the sketch below.)
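A minimal MATLAB sketch using the product feature x3 = x1x2 from the slide (the weight values are illustrative, found by inspection, not taken from the slides):

% XOR with an added product feature x3 = x1*x2 (0/1 input coding assumed).
X = [0 0; 0 1; 1 0; 1 1];         % the four XOR input patterns
y = [0; 1; 1; 0];                 % target outputs
X3 = [X, X(:,1).*X(:,2)];         % augment each pattern with x3 = x1*x2

theta = [-0.5; 1; 1; -2];         % illustrative weights [bias; x1; x2; x3]
y_hat = ([ones(4,1), X3] * theta) > 0   % reproduces [0; 1; 1; 0]

A single linear threshold unit on (x1, x2) alone cannot reproduce XOR, but with the extra product term the same unit separates the classes.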
19. Multi-Layer Network for 2D XOR
- XOR can be implemented as a two-layer network (two layers of adjustable parameters) with two hidden nodes in the hidden layer
- Empty circles represent linear Perceptron nodes
- Solid circles represent real signals
- Arrows represent model parameters θ
- y = (NOT x2 AND x1) OR (NOT x1 AND x2) is represented in a 2-layer network as
- h1 = (NOT x2 AND x1)
- h2 = (NOT x1 AND x2), y = h1 OR h2
[Figure: two-layer network with inputs x0 = 1, x1, x2, a hidden layer with bias h0 = 1 and nodes h1, h2, and an output layer producing y]
20. Exercise: Determine the 9 Parameters
- Write down the parameter vectors for the 3 Perceptron nodes:
- h1 = (NOT x2 AND x1)
- h2 = (NOT x1 AND x2)
- y = h1 OR h2
21. Logical Functions and DNF
- Any logical function can be expressed as a union (disjunction) of conjunction terms over possibly negated inputs
- It can therefore be realized with a 2-layer Perceptron network
- Each hidden layer unit responds to exactly one positive example
- The output layer is formed from the union of the hidden layer outputs: f = h1 OR h2 OR ... OR hP
- Each data point/positive example is given its own hidden unit, which responds to only that point
- Essentially, the network memorizes the positive training samples
22. Lectures 7&8: Conclusions
- There are many ways to build and use non-linear models for classification and regression purposes
- Potentially more accurate predictions/fewer errors if the data is generated by a non-linear relationship
- Parameter estimation is sometimes more complex
- No direct, optimal parameter calculation
- Gradient-based estimation suffers from local minima and differing curvatures
- Need to select an appropriate non-linear framework
- Multi-layer (sigmoidal) Perceptrons are one such framework
- Non-linearity is controlled by the nodes in the hidden layer
- Parameters are estimated using gradient descent
- Several factors need to be considered
23. Lectures 7&8: Laboratory (i)
- MATLAB
- Extend the basic Perceptron MATLAB script so that it trains a quadratic classifier (note that the plotting routines will no longer be appropriate)
- Implement the sigmoidal perceptron learning algorithm, where the model consists of a single layer with a tanh activation function and the parameters are updated after each presentation of a datum (see Slides 10-14)
- Test the algorithm on the logical AND and logical OR data, as you did for the normal Perceptron algorithm in the laboratory in IS2.ppt
- What are the similarities/differences of this model compared to the normal Perceptron algorithm described in IS2.ppt?
24. Lectures 7&8: Laboratory (ii)
- Theory
- Prove the relationship on Slide 13 between the two types of sigmoids
- Verify the derivative of the tanh function on Slide 15, and prove that the derivative of the (0,1) sigmoid on Slide 13 can be expressed as y(1-y)
- Calculate the optimal parameter values missing on Slides 17 and 20
- Derive a generic rule for setting the parameter values on Slide 21 for an arbitrary logical function. You may assume that you know the number of positive examples, the number of features and the logical structure of each positive example