Title: Linear Discriminant Functions Chapter 5 (Duda et al.)
1. Linear Discriminant Functions
Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition, Dr. George Bebis
2. Generative vs. Discriminative Approach
- Generative approaches estimate the discriminant function by first estimating the probability distribution of the data belonging to each class.
- Discriminative approaches estimate the discriminant function explicitly, without assuming a probability distribution.
3. Linear Discriminants (case of two categories)
- A linear discriminant has the following form: g(x) = w^t x + w0.
- Decide w1 if g(x) > 0 and w2 if g(x) < 0 (a short sketch follows below).
- If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
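For concreteness, a minimal sketch of this two-category rule in Python; the weights w and w0 below are arbitrary illustrative values, not taken from the slides:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

# Hand-picked example weights (illustrative only).
w = np.array([1.0, 2.0])
w0 = -3.0

for x in [np.array([3.0, 1.0]), np.array([0.5, 0.5])]:
    label = "w1" if g(x, w, w0) > 0 else "w2"   # g(x) = 0 -> on the boundary, either class
    print(x, "->", label, "g(x) =", g(x, w, w0))
```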
4. Decision Boundary
- The decision boundary g(x) = 0 is a hyperplane.
- The orientation of the hyperplane is determined by w and its location by w0.
- w is the normal to the hyperplane.
- If w0 = 0, the hyperplane passes through the origin.
5. Decision Boundary Estimation
- Use learning algorithms to estimate w and w0 from training data x_k.
- Suppose we are given n labeled training samples x_1, ..., x_n, where z_k denotes the true class label of x_k and g(x_k) the predicted class label.
- The solution can be found by minimizing an error function, e.g., the training error (empirical risk), which compares predicted and true class labels, for instance J(w, w0) = (1/n) Σ_k [g(x_k) - z_k]^2.
6. Geometric Interpretation
- Let's look at g(x) from a geometric point of view.
- Using vector algebra, x can be expressed as follows: x = x_p + r (w / ||w||), where x_p is the projection of x onto the hyperplane and r is the signed distance of x from it.
- Substituting x into g(x) = w^t x + w0, and using g(x_p) = 0, gives g(x) = r ||w||.
7. Geometric Interpretation (cont'd)
8. Geometric Interpretation (cont'd)
- The distance of x from the hyperplane is given by r = g(x) / ||w||.
- In other words, g(x) provides an algebraic measure of the distance of x from the hyperplane.
- Setting x = 0 gives the distance of the hyperplane from the origin: |w0| / ||w|| (a short sketch follows below).
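A small sketch of these distance formulas, reusing the same illustrative w and w0 as before:

```python
import numpy as np

w = np.array([1.0, 2.0])   # illustrative weights, as before
w0 = -3.0

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| of x from the hyperplane g(x) = 0."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

x = np.array([3.0, 1.0])
print("r =", signed_distance(x, w, w0))                 # positive: x lies on the w1 side
print("origin distance =", abs(w0) / np.linalg.norm(w)) # distance of the hyperplane from the origin
```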
9. Linear Discriminant Functions (case of c categories)
- There are several ways to devise multi-category classifiers using linear discriminant functions:
- (1) One against the rest.
- How many decision boundaries are there? c boundaries.
- But there is a problem: ambiguous regions.
10. Linear Discriminant Functions (case of c categories, cont'd)
- (2) One against another (one boundary per pair of classes).
- How many decision boundaries are there? c(c-1)/2 boundaries.
- But there is a problem again: ambiguous regions.
11. Linear Discriminant Functions (case of c categories, cont'd)
- To avoid the problem with ambiguous regions:
- Define c linear functions g_i(x), i = 1, 2, ..., c.
- Assign x to w_i if g_i(x) > g_j(x) for all j ≠ i.
- The resulting classifier is called a linear machine (see Chapter 2); a minimal sketch follows below.
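A minimal sketch of a linear machine, assuming the c weight vectors and offsets are already given; the values below are arbitrary illustrative choices:

```python
import numpy as np

# c = 3 classes; row i of W is the weight vector w_i, b[i] the corresponding offset w_i0.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])

def linear_machine(x):
    """Assign x to the class with the largest g_i(x) = w_i^t x + w_i0."""
    g = W @ x + b
    return int(np.argmax(g))

print(linear_machine(np.array([2.0, 0.5])))   # -> 0 for this illustrative x
```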
12. Linear Discriminant Functions (case of c categories, cont'd)
- A linear machine divides the feature space into c convex decision regions.
- If x is in region R_i, then g_i(x) is the largest.
- Note: although there are c(c-1)/2 region pairs, there are typically fewer decision boundaries (e.g., 8 instead of 10 in the five-class example above).
13. Geometric Interpretation
- The decision boundary between adjacent regions R_i and R_j is a portion of the hyperplane H_ij defined by g_i(x) = g_j(x).
- (w_i - w_j) is normal to H_ij.
- The distance from x to H_ij is (g_i(x) - g_j(x)) / ||w_i - w_j||.
14. Higher Order Discriminant Functions
- Higher order discriminants yield more complex decision boundaries than linear discriminant functions.
- Quadratic discriminant: add terms corresponding to products of pairs of components of x (see the sketch below).
- Polynomial discriminant: add even higher order products, e.g., products of three or more components such as x_i x_j x_k.
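As an illustration, one way to form the extra quadratic terms (products of pairs of components) that a quadratic discriminant adds on top of the linear ones; the helper name is an assumption, not from the text:

```python
import numpy as np

def quadratic_features(x):
    """Augment x with products of pairs of its components: x_i * x_j, i <= j."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j] for i in range(len(x)) for j in range(i, len(x))]
    return np.concatenate(([1.0], x, pairs))   # bias term, linear terms, quadratic terms

print(quadratic_features([2.0, 3.0]))  # -> [1, 2, 3, 4, 6, 9]
```

A quadratic discriminant is then simply a linear function of these expanded features.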
15. Linear Discriminants Revisited: A More General Definition
- It is more convenient when the decision boundary passes through the origin, so augment the feature space!
- Augmented feature vector: y = [1, x_1, ..., x_d]^t (d+1 features instead of d).
- Augmented weight vector: a = [w_0, w_1, ..., w_d]^t (d+1 parameters instead of d).
16. Linear Discriminants Revisited: A More General Definition (cont'd)
- Discriminant: g(x) = a^t y.
- Classification rule: decide w1 if a^t y > 0 and w2 if a^t y < 0.
- The hyperplane a^t y = 0 separates points in the (d+1)-dimensional space.
- The decision boundary passes through the origin of the augmented space (a short sketch follows below).
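A small sketch of the augmented notation, reusing the illustrative w and w0 from before, showing that a^t y reproduces w^t x + w0:

```python
import numpy as np

w, w0 = np.array([1.0, 2.0]), -3.0     # illustrative, as before
a = np.concatenate(([w0], w))          # a = [w0, w1, ..., wd]^t  (d+1 parameters)

def augment(x):
    """y = [1, x1, ..., xd]^t  (d+1 features)."""
    return np.concatenate(([1.0], x))

x = np.array([3.0, 1.0])
y = augment(x)
print("a^t y =", a @ y)                # identical to w^t x + w0
```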
17. Generalized Discriminants
- The main idea is to map the data to a space of higher dimensionality.
- This can be done by transforming the data through properly chosen functions y_i(x), i = 1, 2, ..., d̂ (called φ functions).
18. Generalized Discriminants (cont'd)
- A generalized discriminant is defined as a linear discriminant in the d̂-dimensional space: g(x) = Σ_i a_i y_i(x) = a^t y, where y = [y_1(x), y_2(x), ..., y_d̂(x)]^t.
19. Generalized Discriminants (cont'd)
- Why are generalized discriminants attractive?
- By properly choosing the φ functions, a problem which is not linearly separable in the d-dimensional space might become linearly separable in the d̂-dimensional space!
20. Example
- The corresponding decision regions R1, R2 in the 1-D space are not simply connected (i.e., not linearly separable).
- Consider the following mapping and generalized discriminant: y = [1, x, x^2]^t, g(x) = a_1 + a_2 x + a_3 x^2.
21. Example (cont'd)
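An illustrative sketch of this example; the specific samples and weight vector below are assumptions, not taken from the slides. The 1-D data are not linearly separable in x, but become separable under the mapping y = [1, x, x^2]^t:

```python
import numpy as np

# Illustrative 1-D data: class w1 occupies two disjoint intervals, class w2 the middle.
x1 = np.array([-3.0, -2.5, 2.5, 3.0])   # w1
x2 = np.array([-0.5,  0.0, 0.5])        # w2

def phi(x):
    """Map a scalar x to y = [1, x, x^2]^t."""
    return np.array([1.0, x, x * x])

# Hand-picked weight vector in the mapped space (an assumption for illustration):
# g(x) = a^t y = -4 + 0*x + 1*x^2 > 0 exactly when |x| > 2.
a = np.array([-4.0, 0.0, 1.0])

print([a @ phi(x) > 0 for x in x1])   # all True  (assigned to w1)
print([a @ phi(x) > 0 for x in x2])   # all False (assigned to w2)
```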
22. Learning: Linearly Separable Categories
- Given a linear discriminant function g(x) = a^t y, the goal is to learn the parameters (weights) a from a set of n labeled samples y_i, where each y_i has a class label w1 or w2.
23. Learning: Effect of Training Examples
- Every training sample y_i places a constraint on the weight vector a.
- Visualize the solution in feature space: a^t y = 0 defines a hyperplane in the feature space, with a being the normal vector.
- Given n examples, the solution a must lie within a certain region (the shaded region in the example).
24. Learning: Effect of Training Examples (cont'd)
- Visualize the solution in parameter space (the coordinates are the components of a): a^t y = 0 defines a hyperplane in the parameter space, with y being the normal vector.
- Given n examples, the solution a must lie in the intersection of n half-spaces (shown by the red lines in the example).
25. Uniqueness of Solution
- The solution vector a is usually not unique; we can impose additional constraints to enforce uniqueness, e.g., find the unit-length weight vector a that maximizes the minimum distance from the training examples to the separating plane.
26. Learning Using Iterative Optimization
- Minimize some error function J(a) with respect to a.
- Minimize iteratively: a(k+1) = a(k) + η(k) p_k, where a(k) is the current estimate, p_k is the search direction, and η(k) is the learning rate (step size).
- How should we choose p_k?
27. Choosing p_k Using the Gradient
- Choose p_k = -∇J(a(k)): the negative gradient points in the direction of steepest decrease of J(a).
28. Gradient Descent
- Update rule: a(k+1) = a(k) - η(k) ∇J(a(k)); a sketch follows below.
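A minimal sketch of gradient descent with a fixed learning rate on an assumed quadratic error function J(a) = (1/2) a^t H a - b^t a, chosen only for illustration:

```python
import numpy as np

# Assumed quadratic error: J(a) = 0.5 * a^t H a - b^t a, so grad J(a) = H a - b.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad_J(a):
    return H @ a - b

a = np.zeros(2)          # a(0)
eta = 0.1                # fixed learning rate eta(k)
for k in range(100):
    a = a - eta * grad_J(a)          # a(k+1) = a(k) - eta(k) * grad J(a(k))

print("a after 100 steps:", a)
print("true minimizer   :", np.linalg.solve(H, b))
```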
29. Gradient Descent (cont'd)
- Gradient descent is very popular due to its simplicity, but it can get stuck in local minima.
30. Gradient Descent (cont'd)
- What is the effect of the learning rate η(k)?
- If it is too small, it takes too many iterations.
- If it is too big, it might overshoot the solution (and never find it), possibly leading to oscillations (no convergence).
- Ideally, take bigger steps far from the solution to converge faster, and smaller steps close to it to avoid overshooting.
31. Gradient Descent (cont'd)
- Even a small change in the learning rate can make the difference between converging to the solution and overshooting it.
32. Gradient Descent (cont'd)
- Could we choose η(k) adaptively?
- Yes; let's review the Taylor series expansion first.
- The Taylor series expands f(x) around x_0 using derivatives: f(x) ≈ f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(x_0)(x - x_0)^2 + ...
33. Gradient Descent (cont'd)
- Expand J(a) around a_0 = a(k) using the Taylor series (up to the second derivative): J(a) ≈ J(a(k)) + ∇J^t (a - a(k)) + (1/2)(a - a(k))^t H (a - a(k)), where H is the Hessian matrix (expensive to compute in practice!).
- Evaluate J(a) at a = a(k+1) = a(k) - η(k) ∇J and minimize with respect to η(k); this yields the adaptive learning rate η(k) = ||∇J||^2 / (∇J^t H ∇J).
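A short sketch of this adaptive choice on the same assumed quadratic J(a); each step sets η(k) from the second-order expansion:

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])      # Hessian of the assumed quadratic J(a)
b = np.array([1.0, 1.0])
grad_J = lambda a: H @ a - b

a = np.zeros(2)
for k in range(10):
    g = grad_J(a)
    if np.linalg.norm(g) < 1e-12:           # already at the minimizer
        break
    eta = (g @ g) / (g @ H @ g)             # adaptive eta(k) from the Taylor expansion
    a = a - eta * g

print(a)    # close to the true minimizer [0.2, 0.4] after a few steps
```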
34. Choosing p_k Using the Hessian
- Newton's method: choose p_k = -H^{-1} ∇J(a(k)), which requires inverting H (expensive in practice!).
- Gradient descent can be seen as a special case of Newton's method, assuming H = I.
35. Newton's Method
- Update rule: a(k+1) = a(k) - H^{-1} ∇J(a(k)), i.e., η(k) = 1.
- If J(a) is quadratic, Newton's method converges in one iteration!
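A sketch of a single Newton step on the same assumed quadratic J(a), confirming one-step convergence; the explicit Hessian inverse below is exactly what makes the method expensive in practice:

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])      # Hessian of the assumed quadratic J(a)
b = np.array([1.0, 1.0])
grad_J = lambda a: H @ a - b

a = np.array([10.0, -7.0])                  # arbitrary starting point
a = a - np.linalg.inv(H) @ grad_J(a)        # a(k+1) = a(k) - H^{-1} grad J(a(k))
print(a)                                    # exactly the minimizer [0.2, 0.4]
```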
36. Gradient Descent vs. Newton's Method
(figure: side-by-side trajectories of gradient descent and Newton's method)
- Typically, Newton's method converges faster than gradient descent.
37. Dual Classification Problem
- Original problem: if a^t y_i > 0 assign y_i to w1; else if a^t y_i < 0 assign y_i to w2. This seeks a hyperplane that separates patterns from different categories.
- Normalized (dual) problem: if y_i is in w2, replace y_i by -y_i; then find a such that a^t y_i > 0 for all i. This seeks a hyperplane that puts the normalized patterns on the same (positive) side.
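A small sketch of this normalization trick on a few assumed augmented samples:

```python
import numpy as np

# Assumed augmented samples y_i = [1, x_i] with labels +1 (w1) and -1 (w2).
Y = np.array([[1.0,  2.0], [1.0,  1.5],     # w1
              [1.0, -1.0], [1.0, -2.0]])    # w2
labels = np.array([1, 1, -1, -1])

Y_norm = Y * labels[:, None]   # replace y_i by -y_i for the w2 samples

a = np.array([0.0, 1.0])       # an illustrative separating vector
print(np.all(Y_norm @ a > 0))  # True: all normalized samples are on the positive side
```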
38. Perceptron Rule
- Goal: find a such that a^t y_i > 0 for all i.
- The perceptron rule minimizes the following error function: Jp(a) = Σ_{y ∈ Y(a)} (-a^t y), where Y(a) is the set of samples misclassified by a.
- If Y(a) is empty, Jp(a) = 0; otherwise, Jp(a) > 0.
39. Perceptron Rule (cont'd)
- Apply gradient descent using Jp(a).
- Compute the gradient of Jp(a): ∇Jp(a) = Σ_{y ∈ Y(a)} (-y).
40. Perceptron Rule (cont'd)
- Batch update: a(k+1) = a(k) + η(k) Σ_{y ∈ Y(a(k))} y, i.e., add a multiple of the sum of the misclassified examples (a sketch follows below).
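A minimal sketch of the batch perceptron update on assumed, already-normalized samples (w2 samples negated):

```python
import numpy as np

def batch_perceptron(Y_norm, eta=1.0, max_iter=1000):
    """Batch perceptron: a <- a + eta * (sum of currently misclassified samples).

    Y_norm holds normalized augmented samples (w2 samples already negated),
    so a sample y counts as misclassified when a^t y <= 0.
    """
    a = np.zeros(Y_norm.shape[1])
    for _ in range(max_iter):
        misclassified = Y_norm[Y_norm @ a <= 0]
        if len(misclassified) == 0:
            break                              # Jp(a) = 0: all samples on the positive side
        a = a + eta * misclassified.sum(axis=0)
    return a

Y_norm = np.array([[1.0, 2.0], [1.0, 1.5], [-1.0, 1.0], [-1.0, 2.0]])  # assumed data
a = batch_perceptron(Y_norm)
print(a, np.all(Y_norm @ a > 0))
```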
41. Perceptron Rule (cont'd)
- Keep updating the orientation of the hyperplane until all training samples are on its positive side.
(figure: example)
42. Perceptron Rule (cont'd)
- Single-sample variant: the update is done using one misclassified example at a time (a sketch follows below).
- Perceptron Convergence Theorem: if the training samples are linearly separable, then the perceptron algorithm will terminate at a solution vector in a finite number of steps.
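A sketch of the single-sample (fixed-increment) variant on the same assumed samples; by the convergence theorem it terminates here because the data are linearly separable:

```python
import numpy as np

def perceptron_single_sample(Y_norm, eta=1.0, max_epochs=100):
    """Cycle through the samples; update a on each misclassified one."""
    a = np.zeros(Y_norm.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y_norm:
            if a @ y <= 0:            # misclassified (or on the boundary)
                a = a + eta * y       # a(k+1) = a(k) + eta * y_k
                errors += 1
        if errors == 0:               # converged: all samples on the positive side
            break
    return a

Y_norm = np.array([[1.0, 2.0], [1.0, 1.5], [-1.0, 1.0], [-1.0, 2.0]])  # assumed data
print(perceptron_single_sample(Y_norm))
```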
43. Perceptron Rule (cont'd)
(figure: single-sample updates with example order y2, y3, y1, y3)
- The batch algorithm leads to a smoother trajectory in solution space.
44. Quiz
- When: April 21st
- What: Linear Discriminants