Title: Linear Discriminant Functions
1. Linear Discriminant Functions
2. Contents
- Introduction
- Linear Discriminant Functions and Decision Surface
- Linear Separability
- Learning
- Gradient Descent Algorithm
- Newton's Method
3. Linear Discriminant Functions
4. Decision-Making Approaches
- Probabilistic approaches
  - Based on the underlying probability densities of the training sets.
  - For example, the Bayesian decision rule assumes that the underlying probability densities are available.
- Discriminating approaches
  - Assume the proper forms of the discriminant functions are known.
  - Use the samples to estimate the values of the parameters of the classifier.
5. Linear Discriminant Functions
- Easy to compute, analyze, and learn.
- Linear classifiers are attractive candidates for an initial, trial classifier.
- Learning proceeds by minimizing a criterion function, e.g., the training error.
- Difficulty: a small training error does not guarantee a small test error.
6. Linear Discriminant Functions
- Linear Discriminant Functions and Decision Surface
7. Linear Discriminant Functions
The two-category classification: decide ω1 if g(x) > 0 and ω2 if g(x) < 0, where g(x) = wᵀx + w0.
8. Implementation
9. Implementation
Figure: a linear discriminant implemented as a weighted sum, g(x) = w0 + w1 x1 + ... + wd xd, with inputs x1, ..., xd, weights w1, ..., wd, and the constant input x0 = 1.
10. Decision Surface
Figure: the decision surface g(x) = 0 in feature space, drawn for the discriminant with weights w1, ..., wd and constant input x0 = 1.
11. Decision Surface
12. Decision Surface
1. A linear discriminant function divides the feature space by a hyperplane.
2. The orientation of the surface is determined by the normal vector w.
3. The location of the surface is determined by the bias w0.
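To make these three points concrete, here is a minimal sketch (not from the slides; the weight values are arbitrary) that evaluates g(x) = wᵀx + w0 and the signed distance of a point from the hyperplane g(x) = 0.

```python
import numpy as np

# Hypothetical 2-D linear discriminant g(x) = w^T x + w0.
w = np.array([2.0, -1.0])   # normal vector: sets the orientation of the hyperplane
w0 = 0.5                    # bias: sets the location of the hyperplane

def g(x):
    """Linear discriminant function."""
    return w @ x + w0

x = np.array([1.0, 3.0])
r = g(x) / np.linalg.norm(w)   # signed distance from x to the hyperplane g(x) = 0
print(f"g(x) = {g(x):+.2f}, signed distance = {r:+.2f}")
```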
13. Augmented Space
From the d-dimensional feature space to the (d+1)-dimensional augmented space.
Let the decision surface be g(x) = w1 x1 + w2 x2 + w0 = 0.
14. Augmented Space
Augmented feature vector: y = (1, x1, ..., xd)ᵀ.
Augmented weight vector: a = (w0, w1, ..., wd)ᵀ.
From the d-dimensional feature space to the (d+1)-dimensional augmented space.
The decision surface g(x) = w1 x1 + w2 x2 + w0 = 0 becomes aᵀy = 0, which passes through the origin of the augmented space.
15. Augmented Space
In the augmented space, the surface aᵀy = 0 corresponds to g(x) = w1 x1 + w2 x2 + w0 = 0 in the feature space, with the constant component y0 = 1.
16. Augmented Space
By using this mapping, the problem of finding the weight vector w and the threshold w0 is reduced to finding a single weight vector a.
Figures: the decision surface in feature space and the decision surface in augmented space.
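A small numeric check of the mapping (a sketch with assumed values, not from the slides): the augmented vectors y = (1, x) and a = (w0, w) reproduce g(x) as a single inner product aᵀy.

```python
import numpy as np

x = np.array([1.5, -0.7])     # hypothetical feature vector
w = np.array([2.0, -1.0])     # hypothetical weight vector
w0 = 0.5                      # hypothetical threshold (bias)

# Augmented feature vector y = (1, x1, ..., xd) and augmented weight vector a = (w0, w1, ..., wd).
y = np.concatenate(([1.0], x))
a = np.concatenate(([w0], w))

# In augmented space g(x) = a^T y, and the surface a^T y = 0 passes through the origin.
assert np.isclose(w @ x + w0, a @ y)
print("g(x) = a^T y =", a @ y)
```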
17. Linear Discriminant Functions
18. The Two-Category Case
Figure: examples of a linearly separable and a not linearly separable set of samples.
19. The Two-Category Case
How to find a?
Given a set of samples y1, y2, ..., yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that
  aᵀyi > 0 if yi is labeled ω1,
  aᵀyi < 0 if yi is labeled ω2,
then the samples are said to be linearly separable.
20. Normalization
How to find a?
Given a set of samples y1, y2, ..., yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that
  aᵀyi > 0 if yi is labeled ω1,
  aᵀyi < 0 if yi is labeled ω2,
then the samples are said to be linearly separable.
Discarding the labels and replacing the samples labeled ω2 by their negatives, this is equivalent to finding a vector a such that aᵀyi > 0 for all of the (normalized) samples.
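A sketch of the normalization step (the toy data and the candidate vector a below are made up): class-ω2 samples are negated after augmentation, and a then separates the set exactly when aᵀyi > 0 for every row.

```python
import numpy as np

# Toy 2-D samples (assumed): rows of X1 are labeled omega_1, rows of X2 are labeled omega_2.
X1 = np.array([[2.0, 1.0], [1.5, 2.0]])
X2 = np.array([[-1.0, -0.5], [-2.0, 0.2]])

def augment(X):
    """Prepend the constant component y0 = 1 to every sample."""
    return np.hstack([np.ones((len(X), 1)), X])

# Normalization: keep the omega_1 samples, replace the omega_2 samples by their negatives.
Y = np.vstack([augment(X1), -augment(X2)])

def separates(a, Y):
    """After normalization, a is a solution iff a^T y_i > 0 for all samples."""
    return bool(np.all(Y @ a > 0))

a = np.array([0.0, 1.0, 0.5])   # a candidate augmented weight vector (assumed)
print(separates(a, Y))          # True for this toy data
```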
21. Solution Region in Feature Space
Figure: the separating plane and the solution region in feature space, shown for the normalized case.
22. Solution Region in Weight Space
Figure: the separating planes aᵀyi = 0 bound the solution region in weight space; the solution region can be shrunk by requiring a margin, aᵀyi ≥ b > 0.
23. Linear Discriminant Functions
24. Criterion Function
- To facilitate learning, we usually define a scalar criterion function.
- It usually represents the penalty or cost of a solution.
- Our goal is to minimize its value.
- Learning then becomes a problem of function optimization.
25. Learning Algorithms
- To design a learning algorithm, we face the following problems:
  - Whether to stop? (Is the criterion satisfactory?)
  - In what direction to proceed? (e.g., steepest descent)
  - How long a step to take? (the learning rate η)
26. Criterion Functions: The Two-Category Case
J(a) = the number of misclassified patterns.
Figure: J(a) over weight space; it is piecewise constant, so away from the solution state its gradient gives no hint of where to go.
Is this criterion appropriate?
27. Criterion Functions: The Two-Category Case
Let Y be the set of misclassified patterns.
Perceptron criterion function: Jp(a) = Σ (−aᵀy), summed over y in Y.
Figure: Jp(a) is zero on the solution region, and its gradient points where to go.
Is this criterion better? What problem does it have?
28. Criterion Functions: The Two-Category Case
Y: the set of misclassified patterns.
A relative of the perceptron criterion function.
Figure: the criterion is zero on the solution region; its gradient indicates where to go.
Is this criterion much better? What problem does it have?
29. Criterion Functions: The Two-Category Case
Y: the set of misclassified patterns.
What is the difference from the previous one?
Figure: the criterion is zero on the solution region; its gradient indicates where to go.
Is this criterion good enough? Are there others?
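For concreteness, here is a minimal sketch (toy data, not from the slides) of the perceptron criterion Jp(a) = Σ (−aᵀy) over the misclassified set Y and its gradient, which drive the descent procedures discussed next.

```python
import numpy as np

# Normalized, augmented samples (assumed toy data): we want a^T y_i > 0 for every row.
Y_all = np.array([[ 1.0, 2.0,  1.0],
                  [ 1.0, 1.5,  2.0],
                  [-1.0, 1.0,  0.5],
                  [-1.0, 2.0, -0.2]])

def perceptron_criterion(a, Y_all):
    """J_p(a) = sum of (-a^T y) over the misclassified samples (those with a^T y <= 0)."""
    margins = Y_all @ a
    mis = margins <= 0
    return -margins[mis].sum(), mis

def perceptron_gradient(a, Y_all):
    """grad J_p(a) = sum of (-y) over the misclassified samples."""
    mis = (Y_all @ a) <= 0
    return -Y_all[mis].sum(axis=0)

a = np.array([0.0, 1.0, -1.0])
Jp, mis = perceptron_criterion(a, Y_all)
print("J_p(a) =", Jp, " misclassified:", int(mis.sum()), " gradient:", perceptron_gradient(a, Y_all))
```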
30. Learning
- Gradient Descent Algorithm
31. Gradient Descent Algorithm
Our goal is to go downhill.
Figure: a contour map of J(a) over the weight space (a1, a2).
32. Gradient Descent Algorithm
Our goal is to go downhill.
Define the gradient ∇aJ, the vector of partial derivatives ∂J/∂ai; ∇aJ is a vector that points in the direction of steepest increase of J.
33. Gradient Descent Algorithm
Our goal is to go downhill.
34. Gradient Descent Algorithm
Our goal is to go downhill. How long a step shall we take?
Figure: the negative gradient −∇aJ marks the "go this way" direction at the current weight vector.
35. Gradient Descent Algorithm
η(k): the learning rate.
Initial setting: a, threshold θ, k ← 0
do
  k ← k + 1
  a ← a − η(k) ∇a J(a)
until ‖∇a J(a)‖ < θ
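A runnable rendering of this pseudocode, as a minimal sketch (the objective below is a made-up quadratic, and the fixed learning rate is an assumption; the slides' η(k) may vary with k).

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, theta=1e-6, max_iter=10_000):
    """Basic gradient descent: repeat a <- a - eta * grad J(a) until ||grad J(a)|| < theta."""
    a, k = a0.astype(float), 0
    while k < max_iter:
        k += 1
        g = grad_J(a)
        if np.linalg.norm(g) < theta:   # stopping criterion ||grad J(a)|| < theta
            break
        a = a - eta * g
    return a, k

# Toy objective (assumed): J(a) = ||a - c||^2, whose gradient is 2(a - c) and minimum is at c.
c = np.array([1.0, -2.0])
a_min, steps = gradient_descent(lambda a: 2.0 * (a - c), np.zeros(2))
print("a =", a_min, " found in", steps, "iterations")
```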
36. Trajectory of Steepest Descent
If an improper learning rate η(k) is used, the convergence rate may be poor.
- Too small: slow convergence.
- Too large: overshooting, possibly leaving the basin of attraction.
Figure: a descent trajectory a0, a1, a2, ... in weight space, with the negative gradient steps at a0 and a1 drawn along the way.
Furthermore, the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.
37. Learning Rate
Paraboloid: f(x) = (1/2) xᵀQ x − bᵀx (up to an additive constant), where Q is symmetric and positive definite.
38. Learning Rate
All smooth functions can be approximated by
paraboloids in a sufficiently small neighborhood
of any point.
39. Learning Rate
We will discuss the convergence properties using paraboloids.
Paraboloid: f(x) = (1/2) xᵀQ x − bᵀx.
Global minimum x*: setting the gradient Qx − b to zero gives x* = Q⁻¹b.
40. Learning Rate
For the same paraboloid, define the error E(x) = (1/2) (x − x*)ᵀ Q (x − x*).
41. Learning Rate
Paraboloid f(x) = (1/2) xᵀQ x − bᵀx, with error E(x) = (1/2) (x − x*)ᵀ Q (x − x*).
We want to minimize both f(x) and E(x); clearly, the two differ only by a constant, so minimizing one minimizes the other.
42. Learning Rate
Suppose that we are at yk.
Let gk = ∇f(yk) = Q yk − b.
The steepest descent direction will be pk = −gk.
Let the learning rate be ηk; that is, yk+1 = yk + ηk pk.
43. Learning Rate
We'll choose the most favorable ηk. That is, setting d f(yk + η pk)/dη = 0, we get ηk = (gkᵀgk) / (gkᵀQ gk).
44. Learning Rate
If Q = I, then ηk = 1 and a single step takes yk straight to the minimum x*.
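A sketch of this computation under the quadratic form used above (the matrix and starting points are made up): one exact-line-search step with ηk = (gkᵀgk)/(gkᵀQgk), plus the Q = I special case that reaches the minimum in one step.

```python
import numpy as np

def descent_step(y, Q, b):
    """One steepest-descent step on f(y) = 0.5 y^T Q y - b^T y with the most
    favorable learning rate eta_k = (g^T g) / (g^T Q g), where g = Qy - b."""
    g = Q @ y - b
    eta = (g @ g) / (g @ Q @ g)
    return y - eta * g

Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric positive definite (assumed values)
b = np.array([1.0, 0.0])

y = np.array([2.0, 2.0])
for _ in range(10):
    y = descent_step(y, Q, b)
print("after 10 steps:", y, "  true minimum:", np.linalg.solve(Q, b))

# If Q = I, eta_k = 1 and the minimum is reached in a single step.
print("Q = I, 1 step :", descent_step(np.array([5.0, -3.0]), np.eye(2), np.zeros(2)))
```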
45. Convergence Rate
46. Convergence Rate
Kantorovich Inequality: Let Q be a positive definite, symmetric, n×n matrix. For any nonzero vector x there holds
  (xᵀx)² / ((xᵀQx)(xᵀQ⁻¹x)) ≥ 4 λ1 λn / (λ1 + λn)²,
where λ1 ≤ λ2 ≤ ... ≤ λn are the eigenvalues of Q.
Applied to the error, this gives E(yk+1) ≤ ((λn − λ1)/(λn + λ1))² E(yk): the smaller this factor, the better.
47. Convergence Rate
Condition number: κ = λn / λ1.
Figure: nearly circular contours (κ close to 1) give faster convergence; elongated contours (large κ) give slower convergence.
48. Convergence Rate
Paraboloid f(x) = (1/2) xᵀQ x − bᵀx; suppose we are now at xk.
Updating rule: xk+1 = xk − ηk ∇f(xk), with the most favorable ηk from before.
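To illustrate the effect of the condition number, this sketch (assumed diagonal matrices and starting point) runs the same exact-line-search descent on a well-conditioned and an ill-conditioned paraboloid and compares the remaining error E(xk) = (1/2)(xk − x*)ᵀQ(xk − x*).

```python
import numpy as np

def steepest_descent_errors(Q, b, x0, n_steps):
    """Run steepest descent with exact line search on f(x) = 0.5 x^T Q x - b^T x
    and record the error E(x) = 0.5 (x - x*)^T Q (x - x*) after each step."""
    x_star = np.linalg.solve(Q, b)
    E = lambda x: 0.5 * (x - x_star) @ Q @ (x - x_star)
    x, errors = x0.astype(float), []
    for _ in range(n_steps):
        g = Q @ x - b
        if np.allclose(g, 0.0):
            break
        x = x - (g @ g) / (g @ Q @ g) * g
        errors.append(E(x))
    return errors

x0 = np.array([10.0, 10.0])
for name, Q in [("condition number  2", np.diag([1.0, 2.0])),
                ("condition number 50", np.diag([1.0, 50.0]))]:
    errs = steepest_descent_errors(Q, np.zeros(2), x0, 10)
    print(name, "-> error after 10 steps:", errs[-1])
```

The ill-conditioned case retains a much larger error after the same number of steps, which is the zig-zag behavior shown in the trajectory slide below.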
49. Trajectory of Steepest Descent
In this case, the condition number of Q is moderately large.
One then sees that the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.
50. Learning
- Newton's Method
51. Global Minimum of a Paraboloid
Paraboloid: f(x) = (1/2) xᵀQ x − bᵀx.
We can find the global minimum of a paraboloid by setting its gradient to zero: Qx − b = 0, so x* = Q⁻¹b.
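Under the same assumed quadratic form, the closed-form minimum is one linear solve away; a minimal check:

```python
import numpy as np

# Paraboloid f(x) = 0.5 x^T Q x - b^T x (assumed values): grad f = Qx - b = 0 gives x* = Q^{-1} b.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0])

x_star = np.linalg.solve(Q, b)        # solve Q x = b rather than forming Q^{-1} explicitly
print("global minimum :", x_star)
print("gradient there :", Q @ x_star - b)   # ~ [0, 0]
```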
52. Function Approximation
Taylor series expansion: f(x) ≈ f(x0) + ∇f(x0)ᵀ(x − x0) + (1/2)(x − x0)ᵀ H(x0) (x − x0).
All smooth functions can be approximated by paraboloids in a sufficiently small neighborhood of any point.
53. Function Approximation
Hessian matrix: H(x) = ∇²f(x), the matrix of second partial derivatives, Hij = ∂²f / ∂xi ∂xj.
54. Newton's Method
Setting the gradient of the second-order approximation to zero gives the Newton update x ← x − H⁻¹ ∇f(x).
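A sketch of one Newton step, assuming the quadratic model used above (the function, matrix, and starting point are made up); for an exact paraboloid a single step lands on the minimum.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton step: x <- x - H^{-1} grad f(x), implemented with a linear solve."""
    return x - np.linalg.solve(hess(x), grad(x))

# Assumed quadratic: f(x) = 0.5 x^T Q x - b^T x, so grad f(x) = Qx - b and the Hessian is Q.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0])

x = newton_step(lambda x: Q @ x - b, lambda x: Q, np.array([10.0, -5.0]))
print("after one Newton step:", x)
print("true minimum         :", np.linalg.solve(Q, b))
```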
55. Comparison
Gradient descent: a ← a − η(k) ∇J(a).
Newton's method: a ← a − H⁻¹ ∇J(a).
56. Comparison
- Newton's method will usually give a greater improvement per step than the simple gradient descent algorithm, even with the optimal value of η(k).
- However, Newton's method is not applicable if the Hessian matrix Q is singular.
- Even when Q is nonsingular, computing Q⁻¹ is time consuming: O(d³).
- It often takes less time to set η(k) to a constant (smaller than necessary) than to compute the optimal η(k) at each step.
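A small side-by-side sketch of the trade-off in these bullets (assumed quadratic, constant learning rate chosen below 2/λmax): many cheap gradient steps versus one O(d³) Newton step.

```python
import numpy as np

Q = np.diag([1.0, 10.0])          # assumed Hessian; condition number 10
b = np.zeros(2)
x0 = np.array([5.0, 5.0])
grad = lambda x: Q @ x - b

# Gradient descent with a constant learning rate (must stay below 2/lambda_max = 0.2 here).
x = x0.copy()
for _ in range(50):
    x = x - 0.1 * grad(x)
print("gradient descent, 50 cheap steps:", x)

# Newton's method: one O(d^3) linear solve lands on the minimum of a paraboloid.
print("Newton's method, 1 step         :", x0 - np.linalg.solve(Q, grad(x0)))
```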