Transcript and Presenter's Notes: Linear Discriminant Functions


1
Linear Discriminant Functions

2
Contents
  • Introduction
  • Linear Discriminant Functions and Decision
    Surface
  • Linear Separability
  • Learning
  • Gradient Descent Algorithm
  • Newton's Method

3
Linear Discriminant Functions
  • Introduction

4
Decision-Making Approaches
  • Probabilistic Approaches
  • Based on the underlying probability densities of
    the training sets.
  • For example, the Bayesian decision rule assumes
    that the underlying probability densities are
    available.
  • Discriminative Approaches
  • Assume that the proper forms of the discriminant
    functions are known.
  • Use the samples to estimate the values of the
    parameters of the classifier.

5
Linear Discriminating Functions
  • Easy to compute, analyze, and learn.
  • Linear classifiers are attractive candidates as
    initial, trial classifiers.
  • Learning proceeds by minimizing a criterion
    function, e.g., the training error.
  • Difficulty: a small training error does not
    guarantee a small test error.

6
Linear Discriminant Functions
  • Linear Discriminant Functions and Decision Surface

7
Linear Discriminant Functions
The two-category classification
8
Implementation
9
Implementation
(Figure: the discriminant implemented as a network unit
with inputs x1, …, xn, weights w1, w2, …, wn, and a
constant input x0 = 1 feeding the bias weight w0.)
10
Decision Surface
(Figure: the same network unit, whose output g(x)
defines the decision surface g(x) = 0.)
11
Decision Surface
12
Decision Surface
  • 1. A linear discriminant function divides the
    feature space by a hyperplane.
  • 2. The orientation of the surface is determined
    by the normal vector w.
  • 3. The location of the surface is determined by
    the bias w0 (see the sketch below).
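A minimal sketch of this decision rule in Python; only the sign test is from the slides, while the weight values and sample points are made up for illustration:

import numpy as np

def g(x, w, w0):
    # Linear discriminant g(x) = w^T x + w0.
    return np.dot(w, x) + w0

def classify(x, w, w0):
    # Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0.
    return 1 if g(x, w, w0) > 0 else 2

# Illustrative 2-D example: w fixes the orientation of the
# hyperplane, w0 fixes its location.
w = np.array([1.0, 2.0])    # assumed normal vector
w0 = -1.0                   # assumed bias
print(classify(np.array([1.0, 1.0]), w, w0))    # g = 2  -> class 1
print(classify(np.array([-1.0, 0.0]), w, w0))   # g = -2 -> class 2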

13
Augmented Space
From d dimensions to (d+1) dimensions.
Let g(x) = w1x1 + w2x2 + w0; the decision surface is
g(x) = 0.
14
Augmented Space
Augmented feature vector: y = (1, x1, x2)ᵀ.
Augmented weight vector: a = (w0, w1, w2)ᵀ.
From d dimensions to (d+1) dimensions.
Let g(x) = w1x1 + w2x2 + w0; the decision surface
g(x) = 0 becomes aᵀy = 0, which passes through the
origin of the augmented space.
15
Augmented Space
aᵀy = 0 in the augmented space corresponds to
g(x) = w1x1 + w2x2 + w0 = 0 in the feature space; the
extra coordinate of y is fixed at 1.
16
Augmented Space
By using this mapping, the problem of finding the
weight vector w and the threshold w0 is reduced to
finding a single vector a.
Decision surface in feature space: g(x) = 0.
Decision surface in augmented space: aᵀy = 0.
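A small sketch of this mapping in Python, assuming the usual convention y = (1, x1, …, xd)ᵀ and a = (w0, w1, …, wd)ᵀ; the numeric values are illustrative:

import numpy as np

def augment(x):
    # Map a d-dimensional feature vector x to the
    # (d+1)-dimensional augmented vector y = (1, x).
    return np.concatenate(([1.0], x))

# With a = (w0, w), the discriminant g(x) = w^T x + w0
# collapses into the single inner product a^T y.
w = np.array([1.0, 2.0])         # assumed weights
w0 = -1.0                        # assumed bias
a = np.concatenate(([w0], w))    # augmented weight vector

x = np.array([1.0, 1.0])
y = augment(x)
assert np.isclose(a @ y, w @ x + w0)
# The surface a^T y = 0 is a hyperplane through the origin
# of the augmented space, since a^T 0 = 0.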
17
Linear Discriminant Functions
  • Linear Separability

18
The Two-Category Case
(Figures: one sample set that is not linearly separable
and one that is linearly separable.)
19
The Two-Category Case
How to find a?
Given a set of samples y1, y2, …, yn, some labeled ω1
and some labeled ω2, if there exists a vector a such
that
  aᵀyi > 0 if yi is labeled ω1,
  aᵀyi < 0 if yi is labeled ω2,
then the samples are said to be linearly separable.
20
Normalization
How to find a?
Given a set of samples y1, y2, …, yn, some labeled ω1
and some labeled ω2, withdraw all the labels and
replace the samples labeled ω2 by their negatives.
Finding a is then equivalent to finding a vector a
such that
  aᵀyi > 0 for all i.
If such a vector exists, the samples are linearly
separable (see the sketch below).
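A sketch of this normalization step in Python; the +1/−1 label encoding for ω1/ω2 is my own convention, not from the slides:

import numpy as np

def normalize(samples, labels):
    # Replace every (augmented) sample labeled omega_2 by its
    # negative; separability then reads: a^T y_i > 0 for all i.
    return np.array([y if lbl == 1 else -y
                     for y, lbl in zip(samples, labels)])

def separates(a, normalized_samples):
    # True if the weight vector a puts every normalized sample
    # strictly on the positive side.
    return bool(np.all(normalized_samples @ a > 0))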
21
Solution Region in Feature Space
(Figure: the separating plane and the solution region
in feature space, normalized case.)
22
Solution Region in Weight Space
(Figure: the separating plane and the solution region
in weight space; the solution region can be shrunk by
requiring a margin.)
23
Linear Discriminant Functions
  • Learning

24
Criterion Function
  • To facilitate learning, we usually define a
    scalar criterion function.
  • It usually represents the penalty or cost of a
    solution.
  • Our goal is to minimize its value.
  • Learning thus becomes a function-optimization
    problem.

25
Learning Algorithms
  • To design a learning algorithm, we face the
    following questions:
  • When to stop? (Is the criterion satisfactory?)
  • In what direction to proceed? (e.g., steepest
    descent)
  • How long a step to take? (η, the learning rate)
26
Criterion Functions: The Two-Category Case
J(a) = the number of misclassified patterns.
(Plot: J(a) is piecewise constant, so away from the
solution state it gives no indication of where to go.)
Is this criterion appropriate?
27
Criterion Functions: The Two-Category Case
Y: the set of misclassified patterns.
Perceptron criterion function:
  Jp(a) = Σ_{y∈Y} (−aᵀy).
(Plot: Jp is piecewise linear and decreases toward the
solution state.)
Is this criterion better? What problem does it have?
(See the sketch below.)
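A sketch of the perceptron criterion and its gradient in Python, assuming the samples have already been augmented and normalized as above; the function names are mine:

import numpy as np

def perceptron_criterion(a, Y):
    # Jp(a) = sum over misclassified samples y of (-a^T y),
    # where a sample counts as misclassified when a^T y <= 0.
    margins = Y @ a
    return float(-np.sum(margins[margins <= 0]))

def perceptron_gradient(a, Y):
    # Gradient of Jp: minus the sum of the misclassified samples.
    misclassified = Y[Y @ a <= 0]
    if len(misclassified) == 0:
        return np.zeros_like(a)
    return -np.sum(misclassified, axis=0)

A gradient step a ← a + η Σ_{y∈Y} y therefore moves the hyperplane toward the misclassified samples.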
28
Criterion Functions: The Two-Category Case
Y: the set of misclassified patterns.
A relative of the perceptron criterion function.
(Plot: the criterion surface with the solution state
marked.)
Is this criterion much better? What problem does it
have?
29
Criterion Functions: The Two-Category Case
Y: the set of misclassified patterns.
What is the difference from the previous one?
(Plot: the criterion surface with the solution state
marked.)
Is this criterion good enough? Are there others?
30
Learning
  • Gradient Descent Algorithm

31
Gradient Descent Algorithm
Our goal is to go downhill.
(Figure: contour map of J(a) over the weight space
(a1, a2).)
32
Gradient Descent Algorithm
Our goal is to go downhill.
Define the gradient
  ∇aJ = (∂J/∂a0, ∂J/∂a1, …, ∂J/∂ad)ᵀ.
∇aJ is a vector pointing in the direction in which J
increases most rapidly.
33
Gradient Descent Algorithm
Our goal is to go downhill.
34
Gradient Descent Algorithm
Our goal is to go downhill.
How long a step shall we take?
(Figure: at the current point, −∇aJ marks the
direction to go.)
35
Gradient Descent Algorithm
η(k): learning rate
Initial setting: a, θ, k ← 0
do
  k ← k + 1
  a ← a − η(k) ∇aJ(a)
until ‖∇aJ(a)‖ < θ
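A direct Python rendering of this loop, with a constant learning rate standing in for the schedule η(k); the stopping threshold theta and the iteration cap are arbitrary choices:

import numpy as np

def gradient_descent(grad, a0, eta=0.1, theta=1e-6, max_iter=1000):
    # Repeat a <- a - eta(k) * grad_J(a) until ||grad_J(a)|| < theta.
    a = np.array(a0, dtype=float)
    for k in range(1, max_iter + 1):
        g = grad(a)
        if np.linalg.norm(g) < theta:
            break
        a = a - eta * g
    return a

With grad set to the perceptron_gradient from the sketch above, this loop minimizes the perceptron criterion Jp(a).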
36
Trajectory of Steepest Descent
If an improper learning rate η(k) is used, the
convergence rate may be poor.
  • Too small: slow convergence.
  • Too large: overshooting.

(Figure: a zig-zag trajectory a0, a1, a2, … inside the
basin of attraction, with the negative gradients
−∇J(a0), −∇J(a1) drawn at each step.)
Furthermore, the best descent direction is not
necessarily, and in fact is quite rarely, the
direction of steepest descent.
37
Learning Rate
Paraboloid criterion (a standard quadratic form):
  f(x) = (1/2) xᵀQx − bᵀx + c,
with Q symmetric and positive definite.
38
Learning Rate
Paraboloid
All smooth functions can be approximated by
paraboloids in a sufficiently small neighborhood
of any point.
39
Learning Rate
We will discuss the convergence properties using
paraboloids.
Paraboloid: setting the gradient to zero,
  ∇f(x) = Qx − b = 0,
gives the global minimum x* = Q⁻¹b.
40
Learning Rate
Paraboloid: define the error
  E(x) = f(x) − f(x*) = (1/2)(x − x*)ᵀQ(x − x*) ≥ 0.
41
Learning Rate
Paraboloid error: define y = x − x*, so that
  E = (1/2) yᵀQy.
We want to minimize E, and clearly E = 0 exactly when
y = 0, i.e., at the global minimum.
42
Learning Rate
Suppose that we are at yk.
Let pk be the search direction; the steepest descent
direction is pk = −∇E(yk) = −Qyk.
Let the learning rate be ηk; that is,
  yk+1 = yk + ηk pk.
43
Learning Rate
We'll choose the most favorable ηk.
That is, setting dE(yk + ηpk)/dη = 0, we get
  ηk = (pkᵀpk) / (pkᵀQpk).
44
Learning Rate
If Q = I, then ηk = 1 and yk+1 = yk − yk = 0: the
minimum is reached in a single step.
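A numerical sketch of these steps under the assumed quadratic error E(y) = (1/2) yᵀQy; Q and the starting point are arbitrary:

import numpy as np

def steepest_descent_quadratic(Q, y0, steps=10):
    # Steepest descent on E(y) = 0.5 * y^T Q y with the optimal
    # learning rate eta_k = (p_k^T p_k)/(p_k^T Q p_k), p_k = -Q y_k.
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        p = -Q @ y                      # steepest descent direction
        if np.allclose(p, 0.0):
            break
        eta = (p @ p) / (p @ Q @ p)     # optimal step length
        y = y + eta * p
    return y

# If Q = I, then eta_0 = 1 and the minimum y = 0 is reached in one step.
print(steepest_descent_quadratic(np.eye(2), [3.0, -2.0], steps=1))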
45
Convergence Rate
46
Convergence Rate
Kantorovich Inequality: Let Q be a positive definite,
symmetric, n×n matrix. For any vector x ≠ 0 there
holds
  (xᵀx)² / ((xᵀQx)(xᵀQ⁻¹x)) ≥ 4λ1λn / (λ1 + λn)²,
where λ1 ≥ λ2 ≥ … ≥ λn are the eigenvalues of Q.
The smaller the spread between λ1 and λn, the better.
47
Convergence Rate
Condition number: κ = λ1 / λn.
  κ close to 1: faster convergence.
  κ large: slower convergence.
48
Convergence Rate
Paraboloid: suppose we are now at xk.
Updating rule (steepest descent with the optimal
learning rate):
  xk+1 = xk − ηk gk,  gk = ∇f(xk) = Qxk − b,
  ηk = (gkᵀgk) / (gkᵀQgk).
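The slide's own formula does not survive the export; the standard per-step bound obtained by combining this updating rule with the Kantorovich inequality (presumably what the slide showed) is sketched below in LaTeX:

% Per-step error reduction for steepest descent with the
% optimal learning rate on the paraboloid (via Kantorovich):
\[
  E(x_{k+1}) \;\le\;
  \left( \frac{\lambda_1 - \lambda_n}{\lambda_1 + \lambda_n} \right)^{2} E(x_k)
  \;=\;
  \left( \frac{\kappa - 1}{\kappa + 1} \right)^{2} E(x_k),
  \qquad \kappa = \frac{\lambda_1}{\lambda_n}.
\]
% A condition number near 1 therefore gives fast convergence;
% a large condition number gives slow convergence.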
49
Trajectory of Steepest Descent
In this case, the condition number of Q is moderately
large.
One then sees that the best descent direction is not
necessarily, and in fact is quite rarely, the
direction of steepest descent.
50
Learning
  • Newton's Method

51
Global minimum of a Paraboloid
Paraboloid: we can find the global minimum of a
paraboloid by setting its gradient to zero:
  ∇f(x) = Qx − b = 0  ⇒  x* = Q⁻¹b.
52
Function Approximation
Taylor series expansion (to second order):
  J(a) ≈ J(ak) + ∇J(ak)ᵀ(a − ak)
         + (1/2)(a − ak)ᵀH(a − ak).
All smooth functions can be approximated by
paraboloids in a sufficiently small neighborhood of
any point.
53
Function Approximation
Hessian matrix: H = ∇²J(a), with entries
  Hij = ∂²J / ∂ai∂aj.
54
Newton's Method
Setting the gradient of the second-order approximation
to zero gives the update
  ak+1 = ak − H⁻¹∇J(ak).
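A sketch of the Newton update in Python; the quadratic test function is illustrative, and np.linalg.solve is used instead of forming H⁻¹ explicitly:

import numpy as np

def newton_method(grad, hess, a0, theta=1e-8, max_iter=50):
    # Newton's method: a <- a - H^{-1} grad_J(a), iterated until
    # the gradient is small.
    a = np.array(a0, dtype=float)
    for _ in range(max_iter):
        g = grad(a)
        if np.linalg.norm(g) < theta:
            break
        a = a - np.linalg.solve(hess(a), g)   # solve H d = g
    return a

# On a paraboloid f(a) = 0.5 a^T Q a - b^T a, one Newton step
# lands exactly on the minimum Q^{-1} b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
a_star = newton_method(lambda a: Q @ a - b, lambda a: Q, [5.0, -5.0])
print(np.allclose(a_star, np.linalg.solve(Q, b)))   # True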
55
Comparison
Gradient descent: a ← a − η(k) ∇J(a)
Newton's method: a ← a − H⁻¹∇J(a)
56
Comparison
  • Newton's method will usually give a greater
    improvement per step than the simple gradient
    descent algorithm, even with the optimal value
    of ηk.
  • However, Newton's method is not applicable if
    the Hessian matrix Q is singular.
  • Even when Q is nonsingular, computing Q⁻¹ is
    time consuming: O(d³).
  • It often takes less time to set ηk to a constant
    (smaller than necessary) than to compute the
    optimal ηk at each step.