1
Linear Discriminant Functions
Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition
Dr. George Bebis
2
Generative vs Discriminative Approach
  • Generative approaches estimate the discriminant
    function by first estimating the probability
    distribution of the data belonging to each class.
  • Discriminative approaches estimate the
    discriminant function explicitly, without
    assuming a probability distribution.

3
Linear Discriminants (case of two categories)
  • A linear discriminant has the following form:
    g(x) = w^t x + w0
  • Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.

If g(x) = 0, then x lies on the decision boundary
and can be assigned to either class.
4
Decision Boundary
  • The decision boundary g(x) = 0 is a hyperplane.
  • The orientation of the hyperplane is determined
    by w and its location by w0.
  • w is the normal to the hyperplane.
  • If w0 = 0, the hyperplane passes through the origin.

5
Decision Boundary Estimation
  • Use learning algorithms to estimate w and w0
    from training data xk.
  • Let us suppose that the true class label zk of each
    training example xk is known.
  • The solution can be found by minimizing an error
    function, e.g., the training error (empirical risk):
    J = (1/n) Σk (zk - ŷk)^2
    where zk is the true class label and ŷk is the
    predicted class label.
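Not part of the original slides: a minimal NumPy sketch of the training error for a linear discriminant, assuming class labels zk in {+1, -1} and predicted labels sign(g(xk)); the data and variable names are illustrative only.

import numpy as np

def g(X, w, w0):
    """Linear discriminant g(x) = w^t x + w0, applied row-wise to X (n x d)."""
    return X @ w + w0

def empirical_risk(X, z, w, w0):
    """Mean squared error between the true labels zk (+1/-1)
    and the predicted labels sign(g(xk))."""
    y_hat = np.sign(g(X, w, w0))
    return np.mean((z - y_hat) ** 2)

# Toy example: 2D points labeled +1 (class w1) and -1 (class w2).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
z = np.array([1, 1, -1, -1])
w, w0 = np.array([1.0, 1.0]), 0.0
print(empirical_risk(X, z, w, w0))   # 0.0: every sample is on the correct side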
6
Geometric Interpretation
  • Let's look at g(x) from a geometrical point of
    view.

Using vector algebra, x can be expressed as
x = xp + r (w / ||w||),
where xp is the projection of x onto the hyperplane
and r is the signed distance of x from it.
Substituting x into g(x) gives g(x) = r ||w||.
7
Geometric Interpretation (contd)
8
Geometric Interpretation (contd)
  • The distance of x from the hyperplane is given
    by r = g(x) / ||w||.
  • g(x) provides an algebraic measure of the
    distance of x from the hyperplane.

Setting x = 0 gives g(0)/||w|| = w0/||w||
(i.e., the distance of the hyperplane from the origin).
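A short illustrative sketch (not from the slides) showing that g(x)/||w|| is the signed distance of x from the hyperplane, and that x = 0 recovers the distance of the plane from the origin; the numbers are arbitrary.

import numpy as np

w = np.array([3.0, 4.0])   # normal to the hyperplane
w0 = -5.0                  # offset

def signed_distance(x, w, w0):
    """r = g(x) / ||w||: positive on the w1 side, negative on the w2 side."""
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([2.0, 3.0])
print(signed_distance(x, w, w0))            # distance of x from the hyperplane
print(signed_distance(np.zeros(2), w, w0))  # w0/||w||: signed distance of the plane from the origin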
9
Linear Discriminant Functions: case of c categories
  • There are several ways to devise multi-category
    classifiers using linear discriminant functions
  • (1) One against the rest

How many decision boundaries are there?
c boundaries
But there is a problem: ambiguous regions.
10
Linear Discriminant Functions: case of c categories (contd)
  • (2) One against another

How many decision boundaries are there?
c(c-1)/2 boundaries
But there is a problem again: ambiguous regions.
11
Linear Discriminant Functions: case of c categories (contd)
  • To avoid the problem with ambiguous regions:
  • Define c linear functions gi(x), i = 1, 2, ..., c.
  • Assign x to ωi if gi(x) > gj(x) for all j ≠ i.
  • The resulting classifier is called a linear
    machine.

(see Chapter 2)
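As a sketch (mine, not from the slides) of a linear machine: the argmax over gi(x) = wi^t x + wi0 implements the assignment rule above. The weight matrix W and offsets w0 are assumed to be given (e.g., learned beforehand).

import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class i with the largest gi(x) = wi^t x + wi0.
    W is a (c x d) matrix of weight vectors, w0 a length-c vector."""
    g = W @ x + w0
    return np.argmax(g)        # index of the winning class

# Toy example with c = 3 classes in d = 2 dimensions.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.0])
print(linear_machine(np.array([2.0, 0.5]), W, w0))  # -> 0 (class w1)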
12
Linear Discriminant Functions: case of c categories (contd)
  • A linear machine divides the feature space into c
    convex decision regions.

If x is in region Ri, then gi(x) is the largest.
Note: although there are c(c-1)/2 region
pairs, there are typically fewer decision boundaries
(e.g., 8 instead of 10 in the five-class example
above).
13
Geometric Interpretation
  • The decision boundary between adjacent regions Ri
    and Rj is a portion of the hyperplane Hij defined
    by gi(x) = gj(x).
  • (wi - wj) is normal to Hij.
  • The distance from x to Hij is
    (gi(x) - gj(x)) / ||wi - wj||.

14
Higher Order Discriminant Functions
  • Higher order discriminants yield more complex
    decision boundaries than linear discriminant
    functions.
  • Quadratic discriminant: add terms corresponding
    to products of pairs of components of x.
  • Polynomial discriminant: add even higher-order
    products, such as xi xj xk.

15
Linear Discriminants Revisited: A More General Definition
  • More convenient form, in which the decision boundary
    passes through the origin: augment the feature space!

Augmented feature vector: y = [1, x1, ..., xd]^t
(d+1 features instead of d features).
Augmented weight vector: a = [w0, w1, ..., wd]^t
(d+1 parameters instead of d parameters).
16
Linear Discriminants Revisited: A More General Definition (contd)
  • Discriminant: g(x) = a^t y
  • Classification rule: decide ω1 if a^t y > 0 and ω2
    if a^t y < 0.
  • Separates points in (d+1)-space by a hyperplane.
  • Decision boundary passes through the origin.
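An illustrative sketch (not from the slides) of the augmented notation y = [1, x]^t and a = [w0, w]^t, confirming that a^t y equals w^t x + w0; the numbers are arbitrary.

import numpy as np

def augment(x):
    """Map x (d features) to y = [1, x1, ..., xd] (d+1 features)."""
    return np.concatenate(([1.0], x))

w = np.array([2.0, -1.0])      # original weight vector (d parameters)
w0 = 0.5                       # original bias
a = np.concatenate(([w0], w))  # augmented weights (d+1 parameters)

x = np.array([1.0, 3.0])
y = augment(x)
print(a @ y)         # equals w @ x + w0
print(w @ x + w0)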

17
Generalized Discriminants
  • The main idea is mapping the data to a space of
    higher dimensionality.
  • This can be done by transforming the data through
    properly chosen functions yi(x), i = 1, 2, ..., d̂
    (called φ functions).

18
Generalized Discriminants (contd)
  • A generalized discriminant is defined as a linear
    discriminant in the d̂-dimensional space:
    g(x) = a^t y, where y = [y1(x), ..., yd̂(x)]^t
    is the vector of φ functions.
19
Generalized Discriminants (contd)
  • Why are generalized discriminants attractive?
  • By properly choosing the φ functions, a problem
    which is not linearly separable in the
    d-dimensional space might become linearly
    separable in the d̂-dimensional space!

20
Example
  • The corresponding decision regions R1, R2 in the
    1D space are not simply connected (i.e., not
    linearly separable).
  • Consider the following mapping and generalized
    discriminant: y = (1, x, x^2)^t,
    g(x) = a1 + a2 x + a3 x^2.
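To make the example concrete, here is a small sketch under the assumed mapping y = (1, x, x^2)^t: the positive region consists of two disjoint intervals in 1D, yet the classes are separated by a single linear boundary a^t y = 0 in the mapped space. The coefficients are illustrative, not taken from the slides.

import numpy as np

def phi(x):
    """Map a scalar x to the 3D vector y = (1, x, x^2)."""
    return np.array([1.0, x, x * x])

# Choose a = (a1, a2, a3) so that g(x) = a1 + a2*x + a3*x^2 has roots at -1 and 2:
# g(x) = (x + 1)(x - 2) is positive for x < -1 or x > 2, negative in between.
a = np.array([-2.0, -1.0, 1.0])

for x in [-2.0, 0.0, 1.0, 3.0]:
    g = a @ phi(x)
    label = "w1" if g > 0 else "w2"
    print(f"x = {x:4.1f}  g(x) = {g:5.1f}  -> {label}")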

21
Example (contd)
22
Learning Linearly Separable Categories
  • Given a linear discriminant function g(x) = a^t y,
    the goal is to learn the parameters
    (weights) a from a set of n labeled samples yi,
    where each yi has a class label ω1 or ω2.

23
Learning: effect of training examples
  • Every training sample yi places a constraint on
    the weight vector a
  • Visualize solution in feature space
  • a^t y = 0 defines a hyperplane in the feature space
    with a being the normal vector.
  • Given n examples, the solution a must lie within
    a certain region (shaded region in the example).

24
Learning: effect of training examples (contd)
  • Visualize solution in parameter space
  • a^t y = 0 defines a hyperplane in the parameter space
    with y being the normal vector.
  • Given n examples, the solution a must lie in the
    intersection of n half-spaces
    (shown by the red lines in the example).

parameter space (a1, a2)
25
Uniqueness of Solution
  • Solution vector a is usually not unique; we can
    impose additional constraints to enforce
    uniqueness, e.g.:
  • Find the unit-length weight vector a that
    maximizes the minimum distance from the training
    examples to the separating plane.

26
Learning Using Iterative Optimization
  • Minimize some error function J(a) with respect to
    a.
  • Minimize iteratively:
    a(k+1) = a(k) + η(k) pk
    where pk is the search direction and η(k) is the
    learning rate (step size), moving from a(k)
    to a(k+1) at each iteration.
How should we choose pk?
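A generic sketch (mine) of the iterative scheme a(k+1) = a(k) + η(k) pk, using pk = -∇J(a(k)) in anticipation of the gradient-based choice on the next slides; the toy error function is illustrative.

import numpy as np

def minimize(grad_J, a0, eta=0.1, max_iter=100, tol=1e-6):
    """Iteratively update a(k+1) = a(k) + eta * p_k with p_k = -grad J(a(k))."""
    a = np.asarray(a0, dtype=float)
    for k in range(max_iter):
        p = -grad_J(a)                  # search direction: steepest descent
        if np.linalg.norm(p) < tol:     # stop when the gradient (almost) vanishes
            break
        a = a + eta * p
    return a

# Toy error function J(a) = (a1 - 3)^2 + (a2 + 1)^2 with gradient below.
grad_J = lambda a: 2.0 * (a - np.array([3.0, -1.0]))
print(minimize(grad_J, a0=[0.0, 0.0]))   # converges near (3, -1)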
27
Choosing pk using the Gradient
Warning: notation is reversed in the figures.
pk = -∇J(a(k)) points in the direction of steepest
decrease of J(a)!
28
Gradient Descent
Update rule: a(k+1) = a(k) - η(k) ∇J(a(k))
Warning: notation is reversed in the figure.
29
Gradient Descent (contd)
  • Gradient descent is very popular due to its
    simplicity but can get stuck in local minima.

Warning: notation is reversed in the figure.
30
Gradient Descent
  • What is the effect of the learning rate η(k)?
  • If it is too small, it takes too many iterations.
  • If it is too big, it might overshoot the
    solution (and never find it), possibly leading to
    oscillations (no convergence).

Warning: notation is reversed in the figure.
Take bigger steps to converge faster.
Take smaller steps to avoid overshooting.
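A tiny illustration (not from the slides) of the learning-rate trade-off on the 1D quadratic J(a) = a^2 (gradient 2a): a small η converges slowly, a moderate η converges quickly, and a too-large η overshoots and diverges.

def run_gd(eta, steps=20, a=1.0):
    """Gradient descent on J(a) = a^2, whose gradient is 2a."""
    for _ in range(steps):
        a = a - eta * 2.0 * a
    return a

print(run_gd(eta=0.01))   # too small: still far from the minimum at 0
print(run_gd(eta=0.5))    # good: reaches the minimum immediately
print(run_gd(eta=1.1))    # too big: overshoots and diverges (|a| grows)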
31
Gradient Descent (contd)
  • Even a small change in the learning rate might
    lead to overshooting the solution.

Converges to the solution!
Overshoots the solution!
32
Gradient Descent (contd)
  • Could we choose η(k) adaptively?
  • Yes; let's review the Taylor series expansion first.

The Taylor series expands f(x) around x0 using derivatives:
f(x) = f(x0) + f'(x0)(x - x0) + (1/2) f''(x0)(x - x0)^2 + ...
33
Gradient Descent (contd)
  • Expand J(a) around a0 = a(k) using the Taylor series
    (up to the second derivative):
    J(a) ≈ J(a(k)) + ∇J^t (a - a(k))
           + (1/2)(a - a(k))^t H (a - a(k))
    where H is the Hessian matrix (matrix of second
    derivatives), evaluated at a(k).
  • Evaluate J(a) at a = a(k+1) and minimize with
    respect to η(k); this gives the optimal learning
    rate η(k) = ||∇J||^2 / (∇J^t H ∇J).

The Hessian is expensive to compute in practice!
34
Choosing pk using the Hessian

Newton's method chooses pk = -H^-1 ∇J(a(k)), i.e.,
a(k+1) = a(k) - η(k) H^-1 ∇J(a(k)).
This requires inverting H: expensive in practice!
Gradient descent can be seen as a special case
of Newton's method assuming H = I.
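An illustrative sketch (assuming the gradient and Hessian are available) comparing one Newton step pk = -H^-1 ∇J with plain gradient descent on a quadratic J(a); as the next slide notes, Newton's method reaches the minimum of a quadratic in a single iteration.

import numpy as np

# Quadratic J(a) = 1/2 a^t H a - b^t a, with gradient H a - b and constant Hessian H.
H = np.array([[4.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda a: H @ a - b

a = np.zeros(2)
a_newton = a - np.linalg.solve(H, grad(a))   # one Newton step (eta = 1)
print(a_newton, grad(a_newton))              # gradient is ~0: this is the minimum

a_gd = np.zeros(2)
for _ in range(50):                          # gradient descent needs many steps
    a_gd = a_gd - 0.1 * grad(a_gd)
print(a_gd)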
35
Newton's Method
Set η(k) = 1.
If J(a) is quadratic, Newton's method converges
in one iteration!
36
Gradient descent vs Newton's method
Gradient Descent

Newton
Typically, Newton's method converges faster than
Gradient Descent.
37
Dual Classification Problem
If a^t yi > 0, assign yi to ω1; else if a^t yi < 0,
assign yi to ω2.
  • If yi is in ω2, replace yi by -yi.
  • Find a such that a^t yi > 0 for all i.

Original problem: seeks a hyperplane that separates
patterns from different categories.
Normalized problem: seeks a hyperplane that puts the
normalized patterns on the same (positive) side.
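A small sketch of the normalization trick described above, on illustrative data: after augmenting the samples and negating those from ω2, a correct weight vector a satisfies a^t yi > 0 for every sample.

import numpy as np

# Augmented samples y = [1, x] with labels +1 (w1) and -1 (w2).
Y = np.array([[1.0,  2.0,  1.0],
              [1.0,  1.0,  2.0],
              [1.0, -1.0, -1.0],
              [1.0, -2.0, -1.5]])
labels = np.array([1, 1, -1, -1])

# "Normalize": replace y_i by -y_i for every sample in class w2.
Y_norm = Y * labels[:, None]

a = np.array([0.0, 1.0, 1.0])        # a candidate weight vector
print(Y_norm @ a)                    # all entries > 0  =>  a separates the data
print(np.all(Y_norm @ a > 0))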
38
Perceptron rule
  • The perceptron rule minimizes the following error
    function:
    Jp(a) = Σ (-a^t y), summed over y in Y(a),
    where Y(a) is the set of samples misclassified by
    a.
  • If Y(a) is empty, Jp(a) = 0; otherwise, Jp(a) > 0.

Goal: find a such that a^t yi > 0 for all i.
39
Perceptron rule (contd)
  • Apply gradient descent using Jp(a).
  • Compute the gradient of Jp(a):
    ∇Jp(a) = Σ (-y), summed over y in Y(a).

40
Perceptron rule (contd)

Batch update rule: a(k+1) = a(k) + η(k) Σ y, where
the sum runs over the misclassified examples y in Y(a(k)).
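A minimal batch-perceptron sketch following the update rule above (fixed learning rate, samples already normalized as in the previous slides); the toy data and names are illustrative.

import numpy as np

def batch_perceptron(Y_norm, eta=1.0, max_iter=1000):
    """Batch perceptron: a(k+1) = a(k) + eta * sum of misclassified y."""
    a = np.zeros(Y_norm.shape[1])
    for _ in range(max_iter):
        misclassified = Y_norm[Y_norm @ a <= 0]   # samples with a^t y <= 0
        if len(misclassified) == 0:               # Jp(a) = 0: done
            break
        a = a + eta * misclassified.sum(axis=0)
    return a

# Normalized augmented samples (class w2 samples already negated).
Y_norm = np.array([[1.0, 2.0, 1.0], [1.0, 1.0, 2.0],
                   [-1.0, 1.0, 1.0], [-1.0, 2.0, 1.5]])
a = batch_perceptron(Y_norm)
print(a, Y_norm @ a)   # all a^t y > 0 once training stops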
41
Perceptron rule (contd)
  • Keep updating the orientation of the hyperplane
    until all training samples are on its positive
    side.

Example
42
Perceptron rule (contd)
The update is done using one misclassified example
at a time:
a(k+1) = a(k) + η(k) y^k,
where y^k is an example misclassified by a(k).
Perceptron Convergence Theorem: if the training
samples are linearly separable, then the
perceptron algorithm will terminate at a solution
vector in a finite number of steps.
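A sketch of the single-sample (fixed-increment) perceptron: it cycles through the normalized samples and updates on one misclassified example at a time, terminating once a full pass produces no mistakes, which the convergence theorem guarantees for linearly separable data. Data and names are illustrative.

import numpy as np

def single_sample_perceptron(Y_norm, eta=1.0, max_passes=1000):
    """Fixed-increment rule: a <- a + eta * y_k whenever a^t y_k <= 0."""
    a = np.zeros(Y_norm.shape[1])
    for _ in range(max_passes):
        updated = False
        for y in Y_norm:              # present the samples one at a time
            if a @ y <= 0:            # y is misclassified by the current a
                a = a + eta * y
                updated = True
        if not updated:               # a full pass with no mistakes: converged
            return a
    return a

Y_norm = np.array([[1.0, 2.0, 1.0], [1.0, 1.0, 2.0],
                   [-1.0, 1.0, 1.0], [-1.0, 2.0, 1.5]])
a = single_sample_perceptron(Y_norm)
print(a, np.all(Y_norm @ a > 0))     # True once the algorithm terminates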
43
Perceptron rule (contd)

Order of examples: y2, y3, y1, y3
Batch algorithm leads to a smoother trajectory
in solution space.
44
Quiz
  • When: April 21st
  • What: Linear Discriminants