Title: 132 v2.0
1. Lectures 3 & 4: Linear Machine Learning Algorithms
- Dr Martin Brown
- Room E1k
- Email: martin.brown_at_manchester.ac.uk
- Telephone: 0161 306 4672
- http://www.csc.umist.ac.uk/msc/intranet/EE-M016
2. Lectures 3 & 4: Outline
- Linear classification using the Perceptron
- Classification problem
- Linear classifier and decision boundary
- Perceptron learning rule
- Proof of convergence
- Recursive linear regression using LMS
- Modelling and recursive parameter estimation
- Linear models and quadratic performance function
- LMS and NLMS learning rules
- Proof of convergence
3. Lectures 3 & 4: Learning Objectives
- Understand what classification and regression machine learning techniques are and their differences
- Describe how linear models can be used for both classification and regression problems
- Prove convergence of the learning algorithms for linear relationships, subject to restrictive conditions
- Understand the restrictions of these basic proofs
- Develop the basic framework that will be expanded on in subsequent lectures
4. Lectures 3 & 4: Resources
- Classification / Perceptron
- An Introduction to Support Vector Machines and other kernel-based learning methods, N. Cristianini and J. Shawe-Taylor, CUP, 2000
- Regression / LMS
- Adaptive Signal Processing, Widrow & Stearns, Prentice Hall, 1985
- Many other sources are available (on-line).
5. What is Classification?
- Classification is also known as (statistical) pattern recognition
- The aim is to build a machine/algorithm that can assign appropriate qualitative labels to new, previously unseen quantitative data using a priori knowledge and/or information contained in a training set. The patterns to be classified are usually groups of measurements/observations that are believed to be informative for the classification task.
- Example: face recognition
[Diagram: training data D = {X, y} and prior knowledge are used to design/learn a classifier m(θ, x); given a new pattern x, the classifier predicts a class label ŷ.]
6. Classification: Training Data
- To supply training data for a classifier, examples must be collected that contain both positive (examples of the class) and negative (examples of other classes) instances. These qualitative target class values are stored as +1 and -1 for the positive and negative instances respectively, and are generated by an expert or by observation.
- The quantitative input features should be informative
- The training set should contain enough examples to be able to build statistically significant decisions
How to encode the qualitative target and input features?
7. Bayes Class Priors
- Classification is all about decision making using the concept of minimum risk
- Imagine that the training data contains 100 examples: 70 of them are class 1 (c1), 30 are class 2 (c2)
- If I have to decide which class an unknown example belongs to, which decision is optimal?
- Errors if the decision is class 1: p(c2) = 0.3
- Errors if the decision is class 2: p(c1) = 0.7
- The minimum risk decision is therefore class 1
- p(c1) and p(c2) are known as the Bayes priors; they represent the baseline performance for any classifier. They are derived from the training data as simple percentages, as sketched below.
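A minimal sketch of this baseline calculation (assuming the +1/-1 label encoding from slide 6; the variable names are illustrative only):

```matlab
% Estimate the Bayes priors from a vector of class labels y in {+1, -1}
y = [ones(70,1); -ones(30,1)];   % 70 class-1 and 30 class-2 examples
p_c1 = mean(y == +1);            % p(c1) = 0.70
p_c2 = mean(y == -1);            % p(c2) = 0.30
% With no features, the minimum-risk rule always predicts the class with the
% larger prior (class 1 here), giving a baseline error rate of p_c2 = 0.30.
```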
8. Structure of a Linear Classifier
- Given a set of quantitative features x, a linear classifier has the form ŷ = sgn(x^T θ)
- The sgn() function is used to produce the qualitative class label (+1/-1)
- The class/decision boundary is determined when x^T θ = 0
- This is an (n-1)-dimensional hyperplane in feature space
- In a 2-dimensional feature space, the boundary is the line θ0 + θ1 x1 + θ2 x2 = 0 (rearranged into gradient/intercept form in the sketch below)
- How do the sign and magnitude of θ affect the decision boundary?
[Figure: a linear decision boundary in the (x1, x2) feature plane.]
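A small illustration of the 2-D case (a sketch only; the parameter values are the slide-16 ones and the plotting range is arbitrary):

```matlab
% Decision boundary theta0 + theta1*x1 + theta2*x2 = 0, rearranged as
% x2 = -(theta0 + theta1*x1)/theta2   (valid when theta2 ~= 0)
theta = [-2.98; 2.11; 1.01];              % example parameters [theta0; theta1; theta2]
x1 = linspace(0, 1, 100);                 % range of the first feature
x2 = -(theta(1) + theta(2)*x1)/theta(3);  % boundary line in the (x1, x2) plane
plot(x1, x2); xlabel('x_1'); ylabel('x_2');
% Points with theta0 + theta1*x1 + theta2*x2 > 0 are labelled +1, the rest -1
```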
9. Simple Example: Fisher's Iris Data
- A famous example of building classifiers for a problem with 3 types of Iris flowers and 4 measurements about each flower:
- Sepal length and width
- Petal length and width
- 150 examples were collected, 50 from each class
- Build 3 separate classifiers, one for recognizing examples of each class
- The data is shown plotted against the last two features, together with two linear classifiers for the Setosa and Virginica classes
Calculate θ in Lab 3 & 4
10. Perceptron Linear Classifier
- The Perceptron linear classifier was devised by Rosenblatt in 1956
- It comprises a linear classifier (as just discussed) and a simple parameter update rule of the following form:
- Cyclically present each training pattern (x_k, y_k) to the linear classifier
- When an error (misclassification) is made, update the parameters: θ_{k+1} = θ_k + η y_k x_k
- where η > 0 is the learning rate
- The bias term can be included as θ0 with an extra feature x0 = 1
- Continue until there are no prediction errors (a Matlab sketch of this loop is given below)
- Perceptron convergence theorem: if the data set is linearly separable, the perceptron learning algorithm will converge to an optimal separator in a finite time
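A minimal Matlab sketch of this procedure (illustrative only, not the lab's perceptron.m; shown here on the logical AND data and initial parameters from slide 16):

```matlab
% Perceptron learning on the logical AND data (targets encoded as +/-1)
X = [1 0 0; 1 0 1; 1 1 0; 1 1 1];     % bias feature x0 = 1 plus inputs x1, x2
y = [-1; -1; -1; +1];                  % AND is true only for (1,1)
theta = [0.01; 0.1; 0.006];            % initial parameters (slide 16 values)
eta = 1;                               % learning rate, eta > 0
for cycle = 1:100                      % repeatedly cycle through the data set
    nErrors = 0;
    for k = 1:size(X,1)
        yhat = sign(X(k,:)*theta);     % linear classifier output
        if yhat ~= y(k)                % misclassification: error-driven update
            theta = theta + eta*y(k)*X(k,:)';
            nErrors = nErrors + 1;
        end
    end
    if nErrors == 0, break; end        % no prediction errors: converged
end
```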
11. Instantaneous Parameter Update
- What does this update look like geometrically?
- The parameters are updated to make them more like the (signed) misclassified feature vector
- After updating, the parameters are closer to a correct decision boundary
[Figure: error-driven update shown in the (x1, x2) / (θ1, θ2) plane, with the sgn() output y, ŷ taking the values +1 and -1.]
12. Perceptron Convergence Proof: Preamble
- The basic aim is to minimise the number of misclassifications
- This is generally an NP-complete problem
- We've assumed that there is an optimal solution with 0 errors
- This is similar to least squares recursive estimation
- Performance: Σ_i (y_i - ŷ_i)² = 4 × (number of errors), since each misclassification contributes (±2)² = 4
- Except that the sgn() makes it a non-quadratic optimization problem
- Updating only when there are errors, θ_{k+1} = θ_k + η y_k x_k, is the same as applying θ_{k+1} = θ_k + (η/2)(y_k - ŷ_k) x_k with or without errors, since y_k - ŷ_k is either 0 or ±2
- Sometimes drawn as a network performing error-driven parameter estimation
[Diagram: repeatedly cycle through the data set D, drawing out each sample (x_k, y_k); the prediction ŷ_k is subtracted from the target y_k to drive the parameter update.]
13. Convergence Analysis of the Perceptron (i)
- If a linearly separable data set D is repeatedly presented to a Perceptron, then the learning procedure is guaranteed to converge (no errors) in a finite time
- If the data set is linearly separable, there exist optimal parameters θ* such that y_i x_i^T θ* > 0 for all i = 1, ..., l
- Note that positively scaled versions of θ* are also optimal parameter vectors
- Consider the positive quantity γ defined by γ = min_i y_i x_i^T θ*, with the normalisation ||θ*|| = 1
- This is a concept known as the classification margin
- Assume also that the feature vectors are bounded, ||x_i|| ≤ R
14. Convergence Analysis of the Perceptron (ii)
- To show convergence, we need to establish that, at the kth iteration when an error has occurred, the updated parameters move closer to an optimal parameter vector
- Using the update formula θ_{k+1} = θ_k + η y_k x_k
[Figure: geometric view in (θ1, θ2) parameter space of the update moving θ_k to θ_{k+1}, towards θ*.]
To finish the proof, select the scaling of the optimal parameter vector appropriately
15. Convergence Analysis of the Perceptron (iii)
- To show this terminates in a finite number of iterations, simply note that the guaranteed decrease a is independent of the current training sample, so the parameter error must decrease by at least this amount at each update iteration. As the initial error is finite (θ_0 = 0, say), there must exist a finite number of steps before the parameter error is reduced to zero.
- Note also that a is proportional to the size of the feature vectors (R²) and inversely proportional to the size of the margin (γ). Both of these will influence the number of update iterations when the Perceptron is learning (a standard explicit bound is sketched below).
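For reference, a standard way of making the bound explicit (Novikoff's argument, stated under the slide-13 assumptions θ_0 = 0, ||θ*|| = 1, y_i x_i^T θ* ≥ γ and ||x_i|| ≤ R; a sketch, not necessarily the exact algebra used on the slides):

```latex
% Novikoff-style bound on the number of error-driven updates (sketch)
\begin{align*}
\theta_t^{\top}\theta^{*} &\ge t\,\eta\,\gamma
  && \text{each error adds } \eta\,y_k x_k^{\top}\theta^{*} \ge \eta\gamma \\
\|\theta_t\|^{2} &\le t\,\eta^{2}R^{2}
  && \text{on an error } y_k x_k^{\top}\theta_k \le 0 \text{, so the cross term is non-positive} \\
t\,\eta\,\gamma \;\le\; \theta_t^{\top}\theta^{*} &\le \|\theta_t\| \;\le\; \eta R\sqrt{t}
  && \Longrightarrow \quad t \;\le\; R^{2}/\gamma^{2}
\end{align*}
```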
16. Example of the Perceptron (i)
- Consider modelling the logical AND data using a Perceptron
- Is the data linearly separable?
- k = 0: θ = (0.01, 0.1, 0.006)
- k = 5: θ = (-0.98, 1.11, 1.01)
- k = 18: θ = (-2.98, 2.11, 1.01)
[Figures: the decision boundary in the (x1, x2) plane at k = 0, k = 5 and k = 18.]
17. Example: Parameter Trajectory (ii)
Lab exercise: calculate by hand the first 4 iterations of the learning scheme.
18. Classification Margin
- In this proof, we assumed that there exists a single, optimal parameter vector
- In practice, when the data is linearly separable, there are an infinite number of them; simply requiring correct classification results in an ill-posed problem
- The classification margin can be defined as the minimum distance from the decision boundary to a point in that class (a point-to-boundary distance calculation is sketched below)
- It is used in deriving Support Vector Machines
[Figure: two linearly separable classes in the (x1, x2) plane, the decision boundary and the margin to the closest points; the sgn() output takes the values +1 and -1.]
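A small illustration of the underlying distance calculation (a sketch; the parameter values and the test point are arbitrary):

```matlab
% Distance of a point x = (x1, x2) from the boundary theta0 + theta1*x1 + theta2*x2 = 0
theta = [-2.98; 2.11; 1.01];               % boundary parameters [theta0; theta1; theta2]
x     = [1; 1];                            % a point in feature space
d     = abs(theta(1) + theta(2:3)'*x) / norm(theta(2:3));
% The classification margin of a class is the minimum such distance over its points
```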
19. Classification Summary
- Classification is the task of assigning an object, described by a feature vector, to one of a set of mutually exclusive groups
- A linear classifier has a linear decision boundary
- The perceptron training algorithm is guaranteed to converge in a finite time when the data set is linearly separable
- The final boundary is determined by the initial values and the order of presentation of the data
20. Definition of Regression
- Regression is a (statistical) methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others
- Examples:
- Sales of a product can be predicted by using the relationship between sales volume and amount of advertising
- The performance of an employee can be predicted by using the relationship between performance and aptitude tests
- The size of a child's vocabulary can be predicted by using the relationship between the vocabulary size, the child's age and the parents' educational input
21. Regression Problem Visualisation
[Figure: noisy data points y plotted against x, together with the fitted model output ŷ.]
- Data is generated by an underlying signal corrupted by noise
- Estimate the model parameters
- Predict a real value (fit a curve to the data)
- Predictive performance: average error
22. Probabilistic Prediction Output
- An output of 12 with an rmse/standard deviation of 1.5 means: within a small region close to the query point, the average target value was 12 and the standard deviation within that region was 1.5 (variance 2.25)
- m(y|x) = 12, 2σ(e) = 3
- 95% of the data lies in the range m ± 2σ = 12 ± 2×1.5 = [9, 15]
23. Structure of a Linear Regression Model
- Given a set of features x, a linear predictor has the form ŷ = x^T θ
- The output is a real-valued, quantitative variable
- The bias term can be included as an extra feature x0 = 1; this renames the bias parameter as θ0
- Most linear control system models do not explicitly include a bias term; why is this?
- Similar to the Toluca example in week 1
24. Least Mean Squares Learning
- Least Mean Squares (LMS) was proposed by Widrow in 1962
- It is a (non-optimal) sequential parameter estimation procedure for a linear model, with an update of the form θ_{k+1} = θ_k + η x_k (y_k - ŷ_k)
- NB: compared to classification, both y_k and ŷ_k are quantitative variables, so the error/noise signal (y_k - ŷ_k) is generally non-zero. The rule is similar to the Perceptron, but with no threshold on x^T θ; η is again the positive learning rate.
- Widely used in filtering/signal processing and adaptive control applications
- A cheap version of sequential/recursive parameter estimation
- The normalised version (NLMS) was developed by Kaczmarz in 1937
25. Proof of LMS Convergence (i)
- If a noise-free data set containing a linear relationship x → y is repeatedly presented to a linear model, then the LMS algorithm is guaranteed to update the parameters so that they converge to their optimal values, assuming the learning rate is sufficiently small
- Notes:
- Assume there is no measurement noise in the target data
- Assume the data is generated from a linear relationship
- Parameter estimation will take an infinite time to converge to the optimal values
- The rate of convergence and stability depend on the learning rate
26. Proof of Convergence (ii)
- To show convergence, we need to establish that, at the kth iteration when an error has occurred, the parameter error decreases: ||θ* - θ_{k+1}|| < ||θ* - θ_k||
- Using the update formula θ_{k+1} = θ_k + η x_k (y_k - ŷ_k), and noting that for noise-free linear data y_k = x_k^T θ*, the parameter error contracts along x_k when 0 < η < 2/(x_k^T x_k) (expanded below)
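Expanding the algebra (a sketch under the noise-free assumption y_k = x_k^T θ* stated on slide 25):

```latex
% Parameter-error recursion for noise-free data, y_k = x_k^T \theta^*
\begin{align*}
\theta^{*}-\theta_{k+1}
  &= \theta^{*}-\theta_{k}-\eta\,x_k\bigl(x_k^{\top}\theta^{*}-x_k^{\top}\theta_{k}\bigr)
   = \bigl(I-\eta\,x_k x_k^{\top}\bigr)\bigl(\theta^{*}-\theta_{k}\bigr)
\end{align*}
The matrix $I-\eta\,x_k x_k^{\top}$ has eigenvalue $1-\eta\,x_k^{\top}x_k$ along $x_k$ and $1$
in all orthogonal directions, so the error component along $x_k$ shrinks whenever
$\lvert 1-\eta\,x_k^{\top}x_k\rvert<1$, i.e. $0<\eta<2/(x_k^{\top}x_k)$.
```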
27. Example: LMS Learning
- Consider the target linear model y = 1 - 2x, where the inputs are drawn from a normal distribution with zero mean and unit variance
- The data set consisted of 25 data points, and learning involved 10 cycles through the data set with η = 0.1 (a Matlab sketch is given below)
[Figures: the data and the fitted line at k = 0, k = 5 and k = 100, and the parameter trajectories θ0(k), θ1(k) over the iterations k.]
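A minimal Matlab sketch of this experiment (illustrative only; the random draws will not reproduce the slide's figures exactly):

```matlab
% LMS learning of y = 1 - 2x from 25 noise-free samples, 10 cycles, eta = 0.1
x = randn(25,1);                       % inputs ~ N(0,1)
y = 1 - 2*x;                           % noise-free targets
X = [ones(25,1) x];                    % bias feature x0 = 1 plus input x
theta = [0; 0];                        % initial parameter estimates
eta = 0.1;                             % learning rate
for cycle = 1:10                       % 10 cycles through the data set
    for k = 1:25
        e = y(k) - X(k,:)*theta;       % prediction error on sample k
        theta = theta + eta*e*X(k,:)'; % LMS update
    end
end
disp(theta')                           % approaches the true values [1, -2]
```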
28. Stability and NLMS
- To normalise the LMS algorithm and remove the dependency of η on the input vector size, consider θ_{k+1} = θ_k + η x_k (y_k - ŷ_k) / (x_k^T x_k)
- This learning algorithm is stable for 0 < η < 2 (exercise)
- When η = 1, the NLMS algorithm has the property that the error, on that datum, after adaptation is zero. Exercise: prove this (a numerical check is sketched below).
- Is this desirable when the target contains (measurement) noise?
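A quick numerical check of the η = 1 property (the feature vector, parameters and target below are arbitrary illustrative values):

```matlab
% NLMS update with eta = 1 drives the error on the current datum to zero
x     = [1; 0.5; -2];                 % arbitrary feature vector (bias included)
theta = [0.3; -0.1; 0.7];             % current parameter estimate
y     = 2.0;                          % target for this datum
e     = y - x'*theta;                 % error before adaptation
theta = theta + (1/(x'*x))*e*x;       % NLMS update, eta = 1
y - x'*theta                          % = 0, since x'*(e*x/(x'*x)) = e
```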
29. Regression Summary
- Regression is a (statistical) technique for predicting real-valued outputs, given a quantitative feature vector
- Typically, it is assumed that the dependent, target variable is corrupted by Gaussian noise, and this is unpredictable
- The aim is then to fit the underlying linear/non-linear signal
- The LMS algorithm is a simple, cheap gradient descent technique for updating the linear parameter estimates
- The parameters will converge to their correct values when the target does not contain any noise; otherwise they will oscillate in a zone around the optimum
- Stability of the algorithm depends on the learning rate
30. Lectures 3 & 4: Summary
- This lecture has looked at basic (linear) classification and regression techniques
- Investigated the basic linear model structure
- Proposed simple, on-line learning rules
- Proved convergence for simple environments
- Discussed the practicality of the machine learning algorithms
- While these algorithms are rarely used in this form, their structure has strongly influenced the development of more advanced techniques
- Support vector machines
- Multi-layer perceptrons
- which will be studied in the coming weeks
31. Laboratory 3 & 4: Perceptron/LMS
- Download the irisClassifier.m and iris.mat Matlab files that contain a simple GUI for displaying the Iris data and entering decision boundaries
- Enter parameters that create suitable decision boundaries for both the Setosa and Virginica classes
- Which of the three classes are linearly separable?
- Make sure you can translate between the classifier's parameters, θ, and the gradient/intercept coordinate systems. Also ensure that the output is +1 (rather than -1) in the appropriate region
- Download the irisPerceptron.m and perceptron.m Matlab files that contain the Perceptron algorithm for the Iris data
- Run the algorithm and note how the decision boundary changes when a point is correctly/incorrectly classified
- Modify the learning rate and note the effect it has on the convergence rate and final values
32. Laboratory 3 & 4: Perceptron/LMS (ii)
- Copy and modify the irisPerceptron.m Matlab file so that it runs on the logical AND and OR classification functions (see slides 16 & 17). Each should contain 2 features and four training patterns. Make sure you can calculate the updates by hand, as required on slide 17.
- Create a Matlab implementation of the example given on slide 27 for the LMS algorithm with a simple, single-input linear model
- What values of η cause the LMS algorithm to become unstable?
- Can this ever happen with the Perceptron algorithm?
- Modify this implementation to use the NLMS training rule
- Verify that learning is always stable for 0 < η < 2
- Complete the two (pen and paper) exercises on slide 28
- How might this insight be used with the Perceptron algorithm to implement a dynamic learning rate?