Title: Neural Networks and Pattern Recognition
2 Unit 4
Neural Networks and Pattern Recognition
Giansalvo EXIN Cirrincione
3 Single-layer networks
They compute linear discriminant functions directly from the training set (TS), without the need to determine probability densities.
4 Linear discriminant functions
Two classes
The decision boundary is a (d-1)-dimensional hyperplane.
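The equation on this slide is not in the transcript; in standard notation (assumed here) the two-class linear discriminant is

y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0

with x assigned to C1 if y(x) > 0 and to C2 otherwise. The decision boundary y(x) = 0 is then a (d-1)-dimensional hyperplane: w determines its orientation and the bias w_0 its position.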
5 Linear discriminant functions
Several classes
6 Linear discriminant functions
Several classes
The decision regions are always simply connected and convex.
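The multi-class discriminants and the convexity argument are not transcribed; the standard construction (notation assumed here) uses one linear discriminant per class,

y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0},

assigning x to class C_k when y_k(\mathbf{x}) > y_j(\mathbf{x}) for all j \neq k. Convexity follows from linearity: if \mathbf{x}_A and \mathbf{x}_B both lie in region R_k, then any point \hat{\mathbf{x}} = \lambda \mathbf{x}_A + (1-\lambda) \mathbf{x}_B with 0 \le \lambda \le 1 satisfies y_k(\hat{\mathbf{x}}) = \lambda y_k(\mathbf{x}_A) + (1-\lambda) y_k(\mathbf{x}_B) > y_j(\hat{\mathbf{x}}), so \hat{\mathbf{x}} also lies in R_k.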
7 Logistic discrimination
The decision boundary is still linear:
- two classes
- Gaussians with equal covariance matrices (Σ1 = Σ2 = Σ)
8 Logistic discrimination
9 Logistic discrimination
logistic sigmoid
10 Logistic discrimination
The use of the logistic sigmoid activation
function allows the outputs of the discriminant
to be interpreted as posterior probabilities.
logistic sigmoid
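The equations for this slide are not in the transcript; in the usual form (notation assumed) the logistic discriminant is

P(C_1 \mid \mathbf{x}) = g(a), \qquad a = \mathbf{w}^T \mathbf{x} + w_0, \qquad g(a) = \frac{1}{1 + \exp(-a)},

where g(\cdot) is the logistic sigmoid. Its output lies in (0, 1) and, e.g. for Gaussian class-conditional densities with equal covariance matrices, equals the posterior probability P(C_1 | x).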
11 Binary input vectors
Let Pki denote the probability that the input xi takes the value 1 when the input vector is drawn from the class Ck. The corresponding probability that xi = 0 is then given by 1 - Pki. Assuming the input variables are statistically independent, the probability for the complete input vector is given by
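The formula itself is not transcribed; the standard expression, under this independence assumption, is

P(\mathbf{x} \mid C_k) = \prod_{i=1}^{d} P_{ki}^{\,x_i} (1 - P_{ki})^{1 - x_i}.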
12 Binary input vectors
Linear discriminant functions arise when we
consider input patterns in which the variables
are binary.
13 Binary input vectors
Consider a set of independent binary variables having Bernoulli class-conditional densities. For the two-class problem:
For both normally distributed and Bernoulli-distributed class-conditional densities, the posterior probabilities are obtained from a logistic single-layer network.
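The derivation on this slide is not transcribed; substituting the Bernoulli densities above into the two-class posterior gives the standard result (notation assumed)

P(C_1 \mid \mathbf{x}) = g(\mathbf{w}^T \mathbf{x} + w_0), \qquad w_i = \ln \frac{P_{1i}(1 - P_{2i})}{P_{2i}(1 - P_{1i})}, \qquad w_0 = \sum_{i=1}^{d} \ln \frac{1 - P_{1i}}{1 - P_{2i}} + \ln \frac{P(C_1)}{P(C_2)},

so the posterior is again produced by a logistic single-layer network, just as in the Gaussian case.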
14 Homework
17 Generalized discriminant functions
With a sufficiently large set of basis functions, such a discriminant can approximate any CONTINUOUS functional transformation to arbitrary accuracy.
18 Sum-of-squares error function
target
quadratic in the weights
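The error function itself is not in the transcript; the standard sum-of-squares form (notation assumed), with t_k^n the target for output k and pattern n, is

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} \bigl( y_k(\mathbf{x}^n; \mathbf{w}) - t_k^n \bigr)^2.

For a single-layer network with linear outputs this is quadratic in the weights, so its minimum can be found in closed form.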
19 Geometrical interpretation of least squares
20 Pseudo-inverse solution
21 Pseudo-inverse solution
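A minimal numerical sketch of the pseudo-inverse solution, assuming a linear single-layer network with the biases absorbed into an extra input fixed at 1 (function and variable names are mine):

import numpy as np

def pseudo_inverse_weights(X, T):
    """Least-squares weights for a linear single-layer network.
    X: (N, d) input patterns; T: (N, c) targets (e.g. 1-of-c coding)."""
    N = X.shape[0]
    Phi = np.hstack([np.ones((N, 1)), X])   # extra column of ones absorbs the biases
    W = np.linalg.pinv(Phi) @ T             # pseudo-inverse minimises the sum-of-squares error
    return W                                # shape (d + 1, c); network outputs are Phi @ W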
22 Bias
The role of the biases is to compensate for the difference between the averages (over the data set) of the target values and the averages of the output vectors.
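The explicit formula is not transcribed; minimizing the sum-of-squares error with respect to the biases alone gives the standard result

w_{k0} = \bar{t}_k - \sum_{j} w_{kj} \bar{x}_j, \qquad \bar{t}_k = \frac{1}{N} \sum_{n} t_k^n, \qquad \bar{x}_j = \frac{1}{N} \sum_{n} x_j^n.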
23 Gradient descent
Group all of the parameters (weights and biases) together to form a single weight vector w.
batch
sequential
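The two update rules are not in the transcript; in the usual notation, with learning rate \eta and E = \sum_n E^n, they are

batch: \Delta \mathbf{w} = -\eta \nabla_{\mathbf{w}} E  (one update from the gradient summed over the whole TS)
sequential: \Delta \mathbf{w} = -\eta \nabla_{\mathbf{w}} E^n  (one update per pattern n, cycling through the TS).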
24 Gradient descent
Differentiable non-linear activation functions
25 Gradient descent
Generate and plot a set of data points in two
dimensions, drawn from two classes each of which
is described by a Gaussian class-conditional
density function. Implement the gradient descent
algorithm for training a logistic discriminant,
and plot the decision boundary at regular
intervals during the training procedure on the
same graph as the data. Explore the effect of
choosing different values for the learning rate.
Compare the behaviour of the sequential and batch
weight update procedures.
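One possible implementation of this exercise (the data parameters, learning rate and the cross-entropy gradient (y - t)x used below are choices of this sketch, not taken from the slides):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Two Gaussian classes in two dimensions.
N = 100
cov = [[1.0, 0.3], [0.3, 1.0]]
x1 = rng.multivariate_normal([0.0, 0.0], cov, N)    # class C1, target t = 1
x2 = rng.multivariate_normal([3.0, 2.5], cov, N)    # class C2, target t = 0
X = np.vstack([x1, x2])
t = np.concatenate([np.ones(N), np.zeros(N)])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def plot_boundary(w, b, style):
    # Decision boundary: w[0]*x + w[1]*y + b = 0
    xs = np.array([X[:, 0].min(), X[:, 0].max()])
    plt.plot(xs, -(w[0] * xs + b) / w[1], style)

def train(eta=0.01, epochs=200, sequential=False, plot_every=50):
    w, b = np.zeros(2), 0.0
    for epoch in range(epochs):
        if sequential:                              # pattern-by-pattern updates
            for n in rng.permutation(len(X)):
                y = sigmoid(X[n] @ w + b)
                w -= eta * (y - t[n]) * X[n]
                b -= eta * (y - t[n])
        else:                                       # batch update over the whole TS
            y = sigmoid(X @ w + b)
            w -= eta * X.T @ (y - t)
            b -= eta * np.sum(y - t)
        if epoch % plot_every == 0:
            plot_boundary(w, b, 'k--')              # boundary at regular intervals
    plot_boundary(w, b, 'k-')                       # final boundary
    return w, b

plt.scatter(*x1.T, marker='o')
plt.scatter(*x2.T, marker='x')
train(eta=0.01, sequential=False)                   # try different eta and sequential=True
plt.show()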
26 The perceptron
Applied to classification problems in which the inputs are usually binary images of characters or simple shapes.
27 The perceptron
One could define the error function as the total number of misclassifications over the TS (or, more generally, in terms of a loss matrix). However, such an error function is piecewise constant w.r.t. the weights, so gradient descent cannot be applied.
Instead, minimize the perceptron criterion, which is continuous and piecewise linear.
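The criterion itself is not transcribed; in the usual notation, with targets t^n = +1 for class C1, t^n = -1 for class C2, and \mathcal{M} the set of currently misclassified patterns, it is

E^{\mathrm{perc}}(\mathbf{w}) = - \sum_{\boldsymbol{\phi}^n \in \mathcal{M}} \mathbf{w}^T \boldsymbol{\phi}^n t^n,

which is zero when every pattern is correctly classified and positive (and piecewise linear in w) otherwise.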
28 The perceptron
Apply the sequential gradient descent rule to the
perceptron criterion
Cycle through all of the patterns in the TS and
test each pattern in turn using the current set
of weight values. If the pattern is correctly
classified do nothing, otherwise add the pattern
vector to the weight vector if the pattern is
labelled class C1 or subtract the pattern vector
from the weight vector if the pattern is labelled
class C2.
The value of the learning rate η is unimportant, since changing it is equivalent to re-scaling the weights and biases.
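A direct transcription of this update rule into code (the learning rate is fixed to 1, as the slide notes its value is unimportant; the function name and the choice of targets in {+1, -1} are mine):

import numpy as np

def perceptron_train(X, t, max_epochs=100):
    """Sequential perceptron rule.
    X: (N, d) patterns (e.g. outputs of the fixed functions phi);
    t: targets, +1 for class C1 and -1 for class C2."""
    N, d = X.shape
    Phi = np.hstack([np.ones((N, 1)), X])      # bias absorbed as phi_0 = 1
    w = np.zeros(d + 1)                        # null initial conditions
    for _ in range(max_epochs):
        errors = 0
        for n in range(N):                     # cycle through the TS
            if t[n] * (w @ Phi[n]) <= 0:       # misclassified (or on the boundary)
                w += t[n] * Phi[n]             # add (C1) or subtract (C2) the pattern vector
                errors += 1
        if errors == 0:                        # all patterns correct: stop
            break
    return w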
29 The perceptron
30 The perceptron convergence theorem
For any data set which is linearly separable, the
perceptron learning rule is guaranteed to find a
solution in a finite number of steps.
proof
null initial conditions
31 The perceptron convergence theorem
For any data set which is linearly separable, the
perceptron learning rule is guaranteed to find a
solution in a finite number of steps.
proof
end proof
32 The perceptron convergence theorem
33 If the data set happens not to be linearly separable, then the learning algorithm will never terminate. If we arbitrarily stop the learning process, there is no guarantee that the weight vector found will generalize well for new data. Possible remedies:
- decrease η during the training process
- the pocket algorithm
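A sketch of the pocket idea in a simplified form that keeps the weight vector with the fewest misclassifications seen so far (the true pocket algorithm tracks runs of consecutive correct classifications; same conventions as perceptron_train above):

import numpy as np

def pocket_train(X, t, max_epochs=100):
    N, d = X.shape
    Phi = np.hstack([np.ones((N, 1)), X])
    w = np.zeros(d + 1)
    best_w, best_errors = w.copy(), N + 1
    for _ in range(max_epochs):
        for n in range(N):
            if t[n] * (w @ Phi[n]) <= 0:
                w += t[n] * Phi[n]                  # ordinary perceptron update
        errors = int(np.sum(t * (Phi @ w) <= 0))    # misclassifications of the current w
        if errors < best_errors:                    # keep the best weights "in the pocket"
            best_w, best_errors = w.copy(), errors
    return best_w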
34 Limitations of the perceptron
Even though the data set of input patterns may not be linearly separable in the input space, it can become linearly separable in the φ-space. However, this requires the number and complexity of the φj to grow very rapidly (typically exponentially).
35 Fisher's linear discriminant
optimal linear dimensionality reduction
no bias
36 Fisher's linear discriminant
Constrained optimization: w ∝ (m2 - m1)
The separation of the projected class means can be made arbitrarily large simply by increasing the magnitude of w.
Maximize instead a function which represents the difference between the projected class means, normalized by a measure of the within-class scatter along the direction of w.
37 Fisher's linear discriminant
The within-class scatter of the transformed data
from class Ck is described by the within-class
covariance given by
Fisher criterion
between-class covariance matrix
within-class covariance matrix
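The equations on this slide are not transcribed; with y^n = \mathbf{w}^T \mathbf{x}^n and m_k = \mathbf{w}^T \mathbf{m}_k the standard definitions (notation assumed) are

s_k^2 = \sum_{n \in C_k} (y^n - m_k)^2, \qquad J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}},

\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T, \qquad \mathbf{S}_W = \sum_{k} \sum_{n \in C_k} (\mathbf{x}^n - \mathbf{m}_k)(\mathbf{x}^n - \mathbf{m}_k)^T.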
38 Fisher's linear discriminant
Generalized eigenvector problem
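Setting the derivative of J(w) to zero gives (\mathbf{w}^T \mathbf{S}_B \mathbf{w}) \mathbf{S}_W \mathbf{w} = (\mathbf{w}^T \mathbf{S}_W \mathbf{w}) \mathbf{S}_B \mathbf{w}, a generalized eigenvector problem; since \mathbf{S}_B \mathbf{w} always points along (\mathbf{m}_2 - \mathbf{m}_1), the solution is \mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1). A minimal numpy sketch of this solution (function name is mine):

import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction for two classes.
    X1, X2: (N_k, d) arrays holding the patterns of each class."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance matrix
    w = np.linalg.solve(S_W, m2 - m1)                          # w proportional to S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)                               # only the direction matters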
39 Fisher's linear discriminant
EXAMPLE
40 Fisher's linear discriminant
The projected data can subsequently be used to construct a discriminant, by choosing a threshold y0 so that we classify a new point as belonging to C1 if y(x) ≥ y0 and classify it as belonging to C2 otherwise. Note that y = wᵀx is the sum of a set of random variables, so we may invoke the central limit theorem and model the class-conditional density functions p(y | Ck) using normal distributions.
Once we have obtained a suitable weight vector and a threshold, the procedure for deciding the class of a new vector is identical to that of the perceptron network. So, the Fisher criterion can be viewed as a learning law for the single-layer network.
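One way to implement the procedure just described, reusing fisher_direction and numpy from the sketch above (equal priors and a maximum-posterior decision, rather than an explicit y0, are choices made here):

def fisher_classifier(X1, X2):
    """Project with w, model p(y | Ck) as Gaussians (central limit theorem),
    and classify a new x by comparing the resulting class posteriors."""
    w = fisher_direction(X1, X2)
    y1, y2 = X1 @ w, X2 @ w
    mu = np.array([y1.mean(), y2.mean()])
    var = np.array([y1.var(), y2.var()])
    def classify(x):
        y = x @ w
        # log Gaussian class-conditional densities (equal priors assumed)
        log_p = -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mu) ** 2 / var
        return 1 if log_p[0] >= log_p[1] else 2
    return classify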
41 Fisher's linear discriminant
relation to the least-squares approach
42 Fisher's linear discriminant
relation to the least-squares approach
Bias threshold
43 Fisher's linear discriminant
relation to the least-squares approach
44 Fisher's linear discriminant
Several classes
d linear features
45 Fisher's linear discriminant
Several classes
46 Fisher's linear discriminant
Several classes
In the projected d-dimensional y-space
47 Fisher's linear discriminant
Several classes
One possible criterion ...
This criterion is unable to find more than (c - 1) linear features.
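The criterion itself is not transcribed; one standard choice, in terms of the within- and between-class covariance matrices of the projected data, is

J(\mathbf{W}) = \mathrm{Tr}\{ \mathbf{s}_W^{-1} \mathbf{s}_B \}.

Because the between-class covariance matrix is a sum of c rank-one terms whose class means are constrained by the overall mean, its rank is at most (c - 1), which is why no more than (c - 1) linear features can be extracted.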
48 END