Title: Linear Discriminant Functions
1. Linear Discriminant Functions
2. Contents
- Introduction
- Linear Discriminant Functions and Decision Surface
- Linear Separability
- Learning
- Gradient Descent Algorithm
- Newton's Method
3. Linear Discriminant Functions
4. Decision-Making Approaches
- Probabilistic approaches
  - Based on the underlying probability densities of the training sets.
  - For example, the Bayesian decision rule assumes that the underlying probability densities are available.
- Discriminating approaches
  - Assume the proper forms of the discriminant functions are known.
  - Use the samples to estimate the values of the parameters of the classifier.
5. Linear Discriminant Functions
- Easy to compute, analyze, and learn.
- Linear classifiers are attractive candidates for an initial, trial classifier.
- Learning proceeds by minimizing a criterion function, e.g., the training error.
- Difficulty: a small training error does not guarantee a small test error.
6. Linear Discriminant Functions
- Linear Discriminant Functions and Decision Surface
7. Linear Discriminant Functions
The two-category classification: decide ω1 if g(x) > 0 and ω2 if g(x) < 0, where g(x) = wᵀx + w0.
8. Implementation
9. Implementation
Figure: a linear discriminant implemented as a weighted sum, g(x) = w0 + w1 x1 + ... + wd xd, with inputs x1, ..., xd, weights w1, ..., wd, and the constant input x0 = 1.
10. Decision Surface
Figure: the decision surface g(x) = 0 in feature space, drawn for the discriminant with weights w1, ..., wd and constant input x0 = 1.
11. Decision Surface
12. Decision Surface
1. A linear discriminant function divides the feature space by a hyperplane.
2. The orientation of the surface is determined by the normal vector w.
3. The location of the surface is determined by the bias w0.
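To make these three points concrete, here is a minimal sketch (not from the slides; the weight values are arbitrary) that evaluates g(x) = wᵀx + w0 and the signed distance of a point from the hyperplane g(x) = 0.

```python
import numpy as np

# Hypothetical 2-D linear discriminant g(x) = w^T x + w0.
w = np.array([2.0, -1.0])   # normal vector: sets the orientation of the hyperplane
w0 = 0.5                    # bias: sets the location of the hyperplane

def g(x):
    """Linear discriminant function."""
    return w @ x + w0

x = np.array([1.0, 3.0])
r = g(x) / np.linalg.norm(w)   # signed distance from x to the hyperplane g(x) = 0
print(f"g(x) = {g(x):+.2f}, signed distance = {r:+.2f}")
```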
13. Augmented Space
From the d-dimensional feature space to the (d+1)-dimensional augmented space.
Let the decision surface be g(x) = w1 x1 + w2 x2 + w0 = 0.
14. Augmented Space
Augmented feature vector: y = (1, x1, ..., xd)ᵀ.
Augmented weight vector: a = (w0, w1, ..., wd)ᵀ.
From the d-dimensional feature space to the (d+1)-dimensional augmented space.
The decision surface g(x) = w1 x1 + w2 x2 + w0 = 0 becomes aᵀy = 0, which passes through the origin of the augmented space.
15. Augmented Space
In the augmented space, the surface aᵀy = 0 corresponds to g(x) = w1 x1 + w2 x2 + w0 = 0 in the feature space, with the constant component y0 = 1.
16. Augmented Space
By using this mapping, the problem of finding the weight vector w and the threshold w0 is reduced to finding a single weight vector a.
Figures: the decision surface in feature space and the decision surface in augmented space.
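A small numeric check of the mapping (a sketch with assumed values, not from the slides): the augmented vectors y = (1, x) and a = (w0, w) reproduce g(x) as a single inner product aᵀy.

```python
import numpy as np

x = np.array([1.5, -0.7])     # hypothetical feature vector
w = np.array([2.0, -1.0])     # hypothetical weight vector
w0 = 0.5                      # hypothetical threshold (bias)

# Augmented feature vector y = (1, x1, ..., xd) and augmented weight vector a = (w0, w1, ..., wd).
y = np.concatenate(([1.0], x))
a = np.concatenate(([w0], w))

# In augmented space g(x) = a^T y, and the surface a^T y = 0 passes through the origin.
assert np.isclose(w @ x + w0, a @ y)
print("g(x) = a^T y =", a @ y)
```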
17. Linear Discriminant Functions
18. The Two-Category Case
Figure: examples of a linearly separable and a not linearly separable set of samples.
19. The Two-Category Case
How to find a?
Given a set of samples y1, y2, ..., yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that
  aᵀyi > 0 if yi is labeled ω1,
  aᵀyi < 0 if yi is labeled ω2,
then the samples are said to be linearly separable.
20. Normalization
How to find a?
Given a set of samples y1, y2, ..., yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that
  aᵀyi > 0 if yi is labeled ω1,
  aᵀyi < 0 if yi is labeled ω2,
then the samples are said to be linearly separable.
Discarding the labels and replacing the samples labeled ω2 by their negatives, this is equivalent to finding a vector a such that aᵀyi > 0 for all of the (normalized) samples.
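A sketch of the normalization step (the toy data and the candidate vector a below are made up): class-ω2 samples are negated after augmentation, and a then separates the set exactly when aᵀyi > 0 for every row.

```python
import numpy as np

# Toy 2-D samples (assumed): rows of X1 are labeled omega_1, rows of X2 are labeled omega_2.
X1 = np.array([[2.0, 1.0], [1.5, 2.0]])
X2 = np.array([[-1.0, -0.5], [-2.0, 0.2]])

def augment(X):
    """Prepend the constant component y0 = 1 to every sample."""
    return np.hstack([np.ones((len(X), 1)), X])

# Normalization: keep the omega_1 samples, replace the omega_2 samples by their negatives.
Y = np.vstack([augment(X1), -augment(X2)])

def separates(a, Y):
    """After normalization, a is a solution iff a^T y_i > 0 for all samples."""
    return bool(np.all(Y @ a > 0))

a = np.array([0.0, 1.0, 0.5])   # a candidate augmented weight vector (assumed)
print(separates(a, Y))          # True for this toy data
```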
21. Solution Region in Feature Space
Figure: the separating plane and the solution region in feature space, shown for the normalized case.
22. Solution Region in Weight Space
Figure: the separating planes aᵀyi = 0 bound the solution region in weight space; the solution region can be shrunk by requiring a margin, aᵀyi ≥ b > 0.
23. Linear Discriminant Functions
24. Criterion Function
- To facilitate learning, we usually define a scalar criterion function.
- It usually represents the penalty or cost of a solution.
- Our goal is to minimize its value.
- Learning then becomes a problem of function optimization.
25. Learning Algorithms
- To design a learning algorithm, we face the following problems:
  - Whether to stop? (Is the criterion satisfactory?)
  - In what direction to proceed? (e.g., steepest descent)
  - How long a step to take? (the learning rate η)
26. Criterion Functions: The Two-Category Case
J(a) = the number of misclassified patterns.
Figure: J(a) over weight space; it is piecewise constant, so away from the solution state its gradient gives no hint of where to go.
Is this criterion appropriate?
27. Criterion Functions: The Two-Category Case
Let Y be the set of misclassified patterns.
Perceptron criterion function: Jp(a) = Σ (−aᵀy), summed over y in Y.
Figure: Jp(a) is zero on the solution region, and its gradient points where to go.
Is this criterion better? What problem does it have?
28. Criterion Functions: The Two-Category Case
Y: the set of misclassified patterns.
A relative of the perceptron criterion function.
Figure: the criterion is zero on the solution region; its gradient indicates where to go.
Is this criterion much better? What problem does it have?
29. Criterion Functions: The Two-Category Case
Y: the set of misclassified patterns.
What is the difference from the previous one?
Figure: the criterion is zero on the solution region; its gradient indicates where to go.
Is this criterion good enough? Are there others?
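For concreteness, here is a minimal sketch (toy data, not from the slides) of the perceptron criterion Jp(a) = Σ (−aᵀy) over the misclassified set Y and its gradient, which drive the descent procedures discussed next.

```python
import numpy as np

# Normalized, augmented samples (assumed toy data): we want a^T y_i > 0 for every row.
Y_all = np.array([[ 1.0, 2.0,  1.0],
                  [ 1.0, 1.5,  2.0],
                  [-1.0, 1.0,  0.5],
                  [-1.0, 2.0, -0.2]])

def perceptron_criterion(a, Y_all):
    """J_p(a) = sum of (-a^T y) over the misclassified samples (those with a^T y <= 0)."""
    margins = Y_all @ a
    mis = margins <= 0
    return -margins[mis].sum(), mis

def perceptron_gradient(a, Y_all):
    """grad J_p(a) = sum of (-y) over the misclassified samples."""
    mis = (Y_all @ a) <= 0
    return -Y_all[mis].sum(axis=0)

a = np.array([0.0, 1.0, -1.0])
Jp, mis = perceptron_criterion(a, Y_all)
print("J_p(a) =", Jp, " misclassified:", int(mis.sum()), " gradient:", perceptron_gradient(a, Y_all))
```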
30. Learning
- Gradient Descent Algorithm
31. Gradient Descent Algorithm
Our goal is to go downhill.
Figure: a contour map of J(a) over the weight space (a1, a2).
32. Gradient Descent Algorithm
Our goal is to go downhill.
Define the gradient ∇aJ, the vector of partial derivatives ∂J/∂ai; ∇aJ is a vector that points in the direction of steepest increase of J.
33. Gradient Descent Algorithm
Our goal is to go downhill.
34. Gradient Descent Algorithm
Our goal is to go downhill. How long a step shall we take?
Figure: the negative gradient −∇aJ marks the "go this way" direction at the current weight vector.
35. Gradient Descent Algorithm
η(k): the learning rate.
Initial setting: a, threshold θ, k ← 0
do
  k ← k + 1
  a ← a − η(k) ∇a J(a)
until ‖∇a J(a)‖ < θ
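A runnable rendering of this pseudocode, as a minimal sketch (the objective below is a made-up quadratic, and the fixed learning rate is an assumption; the slides' η(k) may vary with k).

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, theta=1e-6, max_iter=10_000):
    """Basic gradient descent: repeat a <- a - eta * grad J(a) until ||grad J(a)|| < theta."""
    a, k = a0.astype(float), 0
    while k < max_iter:
        k += 1
        g = grad_J(a)
        if np.linalg.norm(g) < theta:   # stopping criterion ||grad J(a)|| < theta
            break
        a = a - eta * g
    return a, k

# Toy objective (assumed): J(a) = ||a - c||^2, whose gradient is 2(a - c) and minimum is at c.
c = np.array([1.0, -2.0])
a_min, steps = gradient_descent(lambda a: 2.0 * (a - c), np.zeros(2))
print("a =", a_min, " found in", steps, "iterations")
```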
36. Trajectory of Steepest Descent
If an improper learning rate η(k) is used, the convergence rate may be poor.
- Too small: slow convergence.
- Too large: overshooting, possibly leaving the basin of attraction.
Figure: a descent trajectory a0, a1, a2, ... in weight space, with the negative gradient steps at a0 and a1 drawn along the way.
Furthermore, the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.
37. Learning Rate
Paraboloid: f(x) = (1/2) xᵀQ x − bᵀx (up to an additive constant), where Q is symmetric and positive definite.
38. Learning Rate
All smooth functions can be approximated by
paraboloids in a sufficiently small neighborhood
of any point.
39. Learning Rate
We will discuss the convergence properties using paraboloids.
Paraboloid: f(x) = (1/2) xᵀQ x − bᵀx.
Global minimum x*: setting the gradient Qx − b to zero gives x* = Q⁻¹b.
40. Learning Rate
For the same paraboloid, define the error E(x) = (1/2) (x − x*)ᵀ Q (x − x*).
41. Learning Rate
Paraboloid f(x) = (1/2) xᵀQ x − bᵀx, with error E(x) = (1/2) (x − x*)ᵀ Q (x − x*).
We want to minimize both f(x) and E(x); clearly, the two differ only by a constant, so minimizing one minimizes the other.
42. Learning Rate
Suppose that we are at yk.
Let gk = ∇f(yk) = Q yk − b.
The steepest descent direction will be pk = −gk.
Let the learning rate be ηk; that is, yk+1 = yk + ηk pk.
43. Learning Rate
We'll choose the most favorable ηk. That is, setting d f(yk + η pk)/dη = 0, we get ηk = (gkᵀgk) / (gkᵀQ gk).
44. Learning Rate
If Q = I, then ηk = 1 and a single step takes yk straight to the minimum x*.
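A sketch of this computation under the quadratic form used above (the matrix and starting points are made up): one exact-line-search step with ηk = (gkᵀgk)/(gkᵀQgk), plus the Q = I special case that reaches the minimum in one step.

```python
import numpy as np

def descent_step(y, Q, b):
    """One steepest-descent step on f(y) = 0.5 y^T Q y - b^T y with the most
    favorable learning rate eta_k = (g^T g) / (g^T Q g), where g = Qy - b."""
    g = Q @ y - b
    eta = (g @ g) / (g @ Q @ g)
    return y - eta * g

Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric positive definite (assumed values)
b = np.array([1.0, 0.0])

y = np.array([2.0, 2.0])
for _ in range(10):
    y = descent_step(y, Q, b)
print("after 10 steps:", y, "  true minimum:", np.linalg.solve(Q, b))

# If Q = I, eta_k = 1 and the minimum is reached in a single step.
print("Q = I, 1 step :", descent_step(np.array([5.0, -3.0]), np.eye(2), np.zeros(2)))
```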
45. Convergence Rate
46. Convergence Rate
Kantorovich Inequality: Let Q be a positive definite, symmetric, n×n matrix. For any nonzero vector x there holds
  (xᵀx)² / ((xᵀQx)(xᵀQ⁻¹x)) ≥ 4 λ1 λn / (λ1 + λn)²,
where λ1 ≤ λ2 ≤ ... ≤ λn are the eigenvalues of Q.
Applied to the error, this gives E(yk+1) ≤ ((λn − λ1)/(λn + λ1))² E(yk): the smaller this factor, the better.
47. Convergence Rate
Condition number: κ = λn / λ1.
Figure: nearly circular contours (κ close to 1) give faster convergence; elongated contours (large κ) give slower convergence.
48. Convergence Rate
Paraboloid f(x) = (1/2) xᵀQ x − bᵀx; suppose we are now at xk.
Updating rule: xk+1 = xk − ηk ∇f(xk), with the most favorable ηk from before.
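To illustrate the effect of the condition number, this sketch (assumed diagonal matrices and starting point) runs the same exact-line-search descent on a well-conditioned and an ill-conditioned paraboloid and compares the remaining error E(xk) = (1/2)(xk − x*)ᵀQ(xk − x*).

```python
import numpy as np

def steepest_descent_errors(Q, b, x0, n_steps):
    """Run steepest descent with exact line search on f(x) = 0.5 x^T Q x - b^T x
    and record the error E(x) = 0.5 (x - x*)^T Q (x - x*) after each step."""
    x_star = np.linalg.solve(Q, b)
    E = lambda x: 0.5 * (x - x_star) @ Q @ (x - x_star)
    x, errors = x0.astype(float), []
    for _ in range(n_steps):
        g = Q @ x - b
        if np.allclose(g, 0.0):
            break
        x = x - (g @ g) / (g @ Q @ g) * g
        errors.append(E(x))
    return errors

x0 = np.array([10.0, 10.0])
for name, Q in [("condition number  2", np.diag([1.0, 2.0])),
                ("condition number 50", np.diag([1.0, 50.0]))]:
    errs = steepest_descent_errors(Q, np.zeros(2), x0, 10)
    print(name, "-> error after 10 steps:", errs[-1])
```

The ill-conditioned case retains a much larger error after the same number of steps, which is the zig-zag behavior shown in the trajectory slide below.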
49. Trajectory of Steepest Descent
In this case, the condition number of Q is moderately large.
One then sees that the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.
50. Learning
- Newton's Method
51. Global Minimum of a Paraboloid
Paraboloid: f(x) = (1/2) xᵀQ x − bᵀx.
We can find the global minimum of a paraboloid by setting its gradient to zero: Qx − b = 0, so x* = Q⁻¹b.
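Under the same assumed quadratic form, the closed-form minimum is one linear solve away; a minimal check:

```python
import numpy as np

# Paraboloid f(x) = 0.5 x^T Q x - b^T x (assumed values): grad f = Qx - b = 0 gives x* = Q^{-1} b.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0])

x_star = np.linalg.solve(Q, b)        # solve Q x = b rather than forming Q^{-1} explicitly
print("global minimum :", x_star)
print("gradient there :", Q @ x_star - b)   # ~ [0, 0]
```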
52. Function Approximation
Taylor series expansion: f(x) ≈ f(x0) + ∇f(x0)ᵀ(x − x0) + (1/2)(x − x0)ᵀ H(x0) (x − x0).
All smooth functions can be approximated by paraboloids in a sufficiently small neighborhood of any point.
53. Function Approximation
Hessian matrix: H(x) = ∇²f(x), the matrix of second partial derivatives, Hij = ∂²f / ∂xi ∂xj.
54. Newton's Method
Setting the gradient of the second-order approximation to zero gives the Newton update x ← x − H⁻¹ ∇f(x).
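A sketch of one Newton step, assuming the quadratic model used above (the function, matrix, and starting point are made up); for an exact paraboloid a single step lands on the minimum.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton step: x <- x - H^{-1} grad f(x), implemented with a linear solve."""
    return x - np.linalg.solve(hess(x), grad(x))

# Assumed quadratic: f(x) = 0.5 x^T Q x - b^T x, so grad f(x) = Qx - b and the Hessian is Q.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0])

x = newton_step(lambda x: Q @ x - b, lambda x: Q, np.array([10.0, -5.0]))
print("after one Newton step:", x)
print("true minimum         :", np.linalg.solve(Q, b))
```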
55. Comparison
Gradient descent: a ← a − η(k) ∇J(a).
Newton's method: a ← a − H⁻¹ ∇J(a).
56. Comparison
- Newton's method will usually give a greater improvement per step than the simple gradient descent algorithm, even with the optimal value of η(k).
- However, Newton's method is not applicable if the Hessian matrix Q is singular.
- Even when Q is nonsingular, computing Q⁻¹ is time consuming: O(d³).
- It often takes less time to set η(k) to a constant (smaller than necessary) than to compute the optimal η(k) at each step.
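A small side-by-side sketch of the trade-off in these bullets (assumed quadratic, constant learning rate chosen below 2/λmax): many cheap gradient steps versus one O(d³) Newton step.

```python
import numpy as np

Q = np.diag([1.0, 10.0])          # assumed Hessian; condition number 10
b = np.zeros(2)
x0 = np.array([5.0, 5.0])
grad = lambda x: Q @ x - b

# Gradient descent with a constant learning rate (must stay below 2/lambda_max = 0.2 here).
x = x0.copy()
for _ in range(50):
    x = x - 0.1 * grad(x)
print("gradient descent, 50 cheap steps:", x)

# Newton's method: one O(d^3) linear solve lands on the minimum of a paraboloid.
print("Newton's method, 1 step         :", x0 - np.linalg.solve(Q, grad(x0)))
```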