1
CS479/679 Pattern Recognition, Spring 2006, Prof. Bebis
  • Linear Discriminant Functions
  • Chapter 5 (Duda et al.)

2
Statistical vs Discriminant Approach
  • Parametric/non-parametric density estimation
    techniques find the decision boundaries by first
    estimating the probability distribution of the
    patterns belonging to each class.
  • In the discriminant-based approach, the decision
    boundary is constructed explicitly.
  • Knowledge of the form of the probability
    distribution is not required.

3
Discriminant Approach
  • Classification is viewed as learning good
    decision boundaries that separate the examples
    belonging to different classes in a data set.

4
Discriminant function estimation
  • Specify a parametric form of the decision
    boundary (e.g., linear or quadratic).
  • Find the best decision boundary of the
    specified form using a set of training examples.
  • This is done by minimizing a criterion function,
    e.g., the training error (or sample risk).

5
Linear Discriminant Functions
  • A linear discriminant function is a linear
    combination of its components: g(x) = w^t x + w0
  • where w is the weight vector and w0 is the bias
    (or threshold weight).

6
Linear Discriminant Functions: two-category case
  • Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
  • If g(x) = 0, then x is on the decision boundary and
    can be assigned to either class.
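A minimal sketch (Python/NumPy) of this two-category rule; the weight vector w and bias w0 below are made-up illustrative values, not anything from the slides:

    import numpy as np

    # Hypothetical weight vector and bias for a 2-D two-category problem.
    w = np.array([1.0, 2.0])   # weight vector (normal to the hyperplane)
    w0 = -1.0                  # bias / threshold weight

    def g(x):
        # Linear discriminant g(x) = w^t x + w0
        return np.dot(w, x) + w0

    def classify(x):
        # Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0
        value = g(x)
        if value > 0:
            return "omega_1"
        if value < 0:
            return "omega_2"
        return "on the boundary"  # g(x) = 0: either class

    print(classify(np.array([2.0, 1.0])))   # g = 2 + 2 - 1 = 3 > 0 -> omega_1
    print(classify(np.array([0.0, 0.0])))   # g = -1 < 0            -> omega_2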

7
Linear Discriminant Functions: two-category case
(contd)
  • If g(x) is linear, the decision boundary is a
    hyperplane.
  • The orientation of the hyperplane is determined
    by w and its location by w0.
  • w is normal to the hyperplane.
  • If w0 = 0, the hyperplane passes through the origin.

8
Interpretation of g(x)
  • g(x) provides an algebraic measure of the
    distance of x from the hyperplane.

(Figure: x written as x = xp + r w/||w||, where xp is the projection of x onto the hyperplane and the unit vector w/||w|| specifies the direction of r.)
9
Interpretation of g(x) (contd)
  • Substituting x = xp + r w/||w|| into g(x) gives the
    distance of x from the hyperplane: r = g(x)/||w||.
  • w0 determines the distance of the hyperplane from
    the origin: that (signed) distance is w0/||w||.
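A short sketch of the distance formula r = g(x)/||w||, reusing the same hypothetical w and w0 as above:

    import numpy as np

    # Hypothetical hyperplane parameters (same toy values as before).
    w = np.array([1.0, 2.0])
    w0 = -1.0

    def signed_distance(x):
        # Signed distance of x from the hyperplane g(x) = 0: r = g(x)/||w||
        return (np.dot(w, x) + w0) / np.linalg.norm(w)

    x = np.array([2.0, 1.0])
    print(signed_distance(x))           # distance of x from the hyperplane
    print(w0 / np.linalg.norm(w))       # signed distance of the hyperplane from the origin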

10
Linear Discriminant Functions: multi-category case
  • There are several ways to devise multi-category
    classifiers using linear discriminant functions
  • One against the rest (i.e., c-1 two-class
    problems)

11
Linear Discriminant Functions: multi-category case (contd)
  • One against another (i.e., c(c-1)/2 pairs of
    classes)

12
Linear Discriminant Functions: multi-category case (contd)
  • To avoid the problem of ambiguous regions:
  • Define c linear discriminant functions
    gi(x) = wi^t x + wi0, i = 1, ..., c.
  • Assign x to ωi if gi(x) > gj(x) for all j ≠ i.
  • The resulting classifier is called a linear
    machine.
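A minimal linear-machine sketch; the weight matrix W and biases b below are hypothetical values for c = 3 classes in 2-D, and x is assigned to the class with the largest gi(x):

    import numpy as np

    # Hypothetical linear machine: row i holds w_i, and b[i] holds w_i0.
    W = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])
    b = np.array([0.0, 0.0, 0.5])

    def linear_machine(x):
        # Assign x to the class i with the largest g_i(x) = w_i^t x + w_i0
        g = W @ x + b              # all c discriminant values at once
        return int(np.argmax(g))

    print(linear_machine(np.array([2.0, 0.5])))   # class with the largest g_i wins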

13
Linear Discriminant Functions: multi-category case (contd)

14
Linear Discriminant Functions: multi-category case (contd)
  • The boundary between two regions Ri and Rj is a
    portion of the hyperplane given by gi(x) = gj(x),
    i.e., (wi - wj)^t x + (wi0 - wj0) = 0.
  • The decision regions for a linear machine are
    convex.

15
Higher order discriminant functions
  • Can produce more complicated decision boundaries
    than linear discriminant functions.

(Figure: hyperquadric decision boundaries.)
16
Higher order discriminant functions (contd)
  • Generalized discriminant: g(x) = Σi ai yi(x) = a^t y
  • a is a d̂-dimensional weight vector.
  • The functions yi(x) are called φ functions.
  • The functions yi(x) map points from the
    d-dimensional x-space to the d̂-dimensional
    y-space (usually d̂ >> d).

17
Generalized discriminant functions
  • The resulting discriminant function is not linear
    in x but it is linear in y.
  • The generalized discriminant separates points in
    the transformed space by a hyperplane passing
    through the origin.

18
Generalized discriminant functions (contd)
  • Example: g(x) = a1 + a2x + a3x^2, i.e., y = (1, x, x^2)^t.
  • This maps a line in x-space to a parabola in y-space.
  • The plane a^t y = 0 divides the y-space into two
    decision regions.
  • The corresponding decision regions R1, R2 in the
    x-space are not simply connected!
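A small sketch of this example with hypothetical coefficients a: g(x) = a^t y is linear in y but quadratic in x, and the positive region in x-space splits into two disjoint intervals:

    import numpy as np

    # Phi functions from the example: y = (1, x, x^2)^t maps the 1-D x-space
    # onto a parabola in 3-D y-space.
    def phi(x):
        return np.array([1.0, x, x * x])

    # Hypothetical weight vector a; g(x) = a1 + a2*x + a3*x^2.
    a = np.array([-1.0, 1.0, 2.0])

    def g(x):
        return np.dot(a, phi(x))        # g(x) = a^t y

    for x in (-1.5, 0.0, 1.0):
        label = "R1" if g(x) > 0 else "R2"
        print(f"x = {x:+.1f}  g(x) = {g(x):+.2f}  ->  {label}")
    # With these weights the positive region in x-space is x < -1 or x > 0.5,
    # i.e., it is not simply connected.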

19
Generalized discriminant functions (contd)

20
Generalized discriminant functions (contd)
  • Practical issues:
  • Computationally intensive.
  • Lots of training examples are required to
    determine a if d̂ is very large (i.e., the curse of
    dimensionality).

21
Notation: Augmented feature/weight vectors
  • Augmented feature vector: y = (1, x1, ..., xd)^t  (d+1 dimensions)
  • Augmented weight vector: a = (w0, w1, ..., wd)^t, so that g(x) = a^t y.
  • The decision hyperplane then passes through the origin in y-space.
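A one-line sketch of the augmentation, reusing the earlier hypothetical w and w0, to check that a^t y equals w^t x + w0:

    import numpy as np

    def augment(x):
        # Augmented feature vector y = (1, x1, ..., xd)^t  (d+1 dimensions)
        return np.concatenate(([1.0], x))

    # Hypothetical original parameters w, w0 and a sample x.
    w, w0 = np.array([1.0, 2.0]), -1.0
    a = np.concatenate(([w0], w))       # augmented weight vector a = (w0, w^t)^t
    x = np.array([2.0, 1.0])

    print(np.dot(a, augment(x)))        # a^t y ...
    print(np.dot(w, x) + w0)            # ... equals w^t x + w0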

22
Two-Category, Linearly Separable Case
  • Given a linear discriminant function g(x) = a^t y,
    the goal is to learn the weights using a set of n
    labeled samples (i.e., examples and their
    associated classes).
  • Classification rule:
  • If a^t yi > 0, assign yi to ω1
  • else if a^t yi < 0, assign yi to ω2

23
Two-Category, Linearly Separable Case (contd)
  • Every training sample yi places a constraint on
    the weight vector a.
  • Given n examples, the solution must lie on the
    intersection of n half-spaces.

(Figure: solution region in weight space, shown with axes a1 and a2; each sample's constraint on g(x) = a^t y corresponds to a half-space.)
24
Two-Category, Linearly Separable Case (contd)
  • The solution vector is usually not unique!
  • Impose constraints to enforce uniqueness...
  • Normalized version: if yi is in ω2, replace yi by -yi.
  • Then find a such that a^t yi > 0 for all i.

25
Two-Category, Linearly Separable Case (contd)
  • Constrain the margin: find the minimum-length a
    such that a^t yi ≥ b > 0 for all i.
  • This moves the solution toward the center of the
    feasible region.

26
Iterative Optimization
  • Define a criterion function J(a) that is
    minimized if a is a solution vector.
  • Minimize J(a) iteratively ...

a(k+1) = a(k) + η(k) p(k),  where p(k) is the search direction and η(k) is the learning rate.
27
Gradient Descent
  • Gradient descent rule: a(k+1) = a(k) - η(k) ∇J(a(k)),
    where η(k) is the learning rate.
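A generic gradient-descent sketch; the criterion J, its gradient, the starting point, and the fixed learning rate below are all illustrative choices:

    import numpy as np

    def gradient_descent(grad, a0, eta=0.1, tol=1e-6, max_iter=1000):
        # Generic gradient descent: a(k+1) = a(k) - eta * grad J(a(k))
        a = np.asarray(a0, dtype=float)
        for _ in range(max_iter):
            step = eta * grad(a)
            a = a - step
            if np.linalg.norm(step) < tol:   # stop when the update becomes tiny
                break
        return a

    # Toy quadratic criterion J(a) = ||a - c||^2 with known minimum at c.
    c = np.array([1.0, -2.0])
    grad_J = lambda a: 2.0 * (a - c)
    print(gradient_descent(grad_J, a0=[0.0, 0.0]))   # converges near c = [1, -2]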
28
Gradient Descent (contd)

29
Gradient Descent (contd)

(Figure: behavior of gradient descent when the learning rate is too large.)
30
Gradient Descent (contd)
  • How to choose the learning rate η(k)?
  • Note: if J(a) is quadratic, the learning rate is
    constant!

  • Taylor series expansion:
    J(a) ≈ J(a(k)) + ∇J^t (a - a(k)) + (1/2)(a - a(k))^t H (a - a(k)),
    where H is the Hessian matrix.
  • Optimum learning rate: η(k) = ||∇J||^2 / (∇J^t H ∇J)
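A tiny numeric sketch of the optimum learning-rate formula on a toy quadratic criterion; H, b, and the starting point a are made up for illustration:

    import numpy as np

    # Toy quadratic criterion J(a) = 1/2 a^t H a - b^t a, so grad J = H a - b.
    H = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    a = np.array([0.0, 0.0])

    grad = H @ a - b
    eta_opt = grad @ grad / (grad @ H @ grad)   # ||grad J||^2 / (grad J^t H grad J)
    a_next = a - eta_opt * grad
    print(eta_opt, a_next)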
31
Newton's method
  • Update rule: a(k+1) = a(k) - H^(-1) ∇J(a(k))
  • Requires inverting H!
32
Newton's method (contd)

If the error function is quadratic,
Newton's method converges in one step!
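A sketch of a single Newton step on the same toy quadratic used above (illustrative H and b); since J is quadratic, one step lands on the minimum:

    import numpy as np

    # Newton's method on J(a) = 1/2 a^t H a - b^t a.
    H = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
    b = np.array([1.0, 2.0])

    a = np.array([0.0, 0.0])
    grad = H @ a - b
    a = a - np.linalg.solve(H, grad)   # a(k+1) = a(k) - H^{-1} grad J
    print(np.allclose(H @ a, b))       # True: the quadratic is minimized in one step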
33
Comparison: Gradient descent vs Newton's method

34
Perceptron rule
  • Criterion function (normalized version):
    Jp(a) = Σ_{y in Y(a)} (-a^t y)
  • where Y(a) is the set of samples misclassified
    by a.
  • If Y(a) is empty, Jp(a) = 0; otherwise, Jp(a) > 0.
35
Perceptron rule (contd)
  • The gradient of Jp(a) is ∇Jp = Σ_{y in Y(a)} (-y)
  • The perceptron update rule is obtained using
    gradient descent: a(k+1) = a(k) + η(k) Σ_{y in Y(a(k))} y
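A batch-perceptron sketch on made-up toy samples; the rows of Y are already augmented and "normalized" (class-2 samples multiplied by -1), so a solution satisfies a^t y > 0 for every row:

    import numpy as np

    def batch_perceptron(Y, eta=1.0, max_iter=1000):
        # Batch perceptron on normalized augmented samples (one row per sample).
        a = np.zeros(Y.shape[1])
        for _ in range(max_iter):
            misclassified = Y[Y @ a <= 0]             # Y(a): samples with a^t y <= 0
            if len(misclassified) == 0:
                break                                  # all samples correctly classified
            a = a + eta * misclassified.sum(axis=0)    # a <- a + eta * sum of misclassified y
        return a

    # Hypothetical linearly separable toy data (augmented and normalized).
    Y = np.array([[ 1.0, 1.0,  2.0],    # class 1 samples:  (1, x^t)
                  [ 1.0, 2.0,  1.0],
                  [-1.0, 1.0,  1.0],    # class 2 samples: -(1, x^t)
                  [-1.0, 0.5, -2.0]])
    a = batch_perceptron(Y)
    print(a, (Y @ a > 0).all())         # solution vector; check a^t y > 0 for all samples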

36
Perceptron rule (contd)

(Batch version: all misclassified examples are considered at each update.)
37
Perceptron rule (contd)
  • Move the hyperplane so that training samples are
    on its positive side.

(Figures: the update shown in weight space, axes a1 and a2.)
38
Perceptron rule (contd)
  • Fixed increment, single-sample version: η(k) = 1, and
    one misclassified example is considered at a time.
  • Perceptron Convergence Theorem: if the training
    samples are linearly separable, then the sequence
    of weight vectors generated by the above algorithm will
    terminate at a solution vector in a finite number
    of steps.
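A sketch of the fixed-increment, single-sample variant (η(k) = 1), reusing the same hypothetical normalized toy samples:

    import numpy as np

    def single_sample_perceptron(Y, max_passes=100):
        # Fixed-increment single-sample perceptron (eta = 1) on normalized samples.
        a = np.zeros(Y.shape[1])
        for _ in range(max_passes):
            errors = 0
            for y in Y:              # consider one example at a time, cycling through the set
                if a @ y <= 0:       # y is misclassified
                    a = a + y        # a(k+1) = a(k) + y^k
                    errors += 1
            if errors == 0:          # terminated at a solution vector
                return a
        return a

    Y = np.array([[ 1.0, 1.0,  2.0],   # same hypothetical normalized toy samples as before
                  [ 1.0, 2.0,  1.0],
                  [-1.0, 1.0,  1.0],
                  [-1.0, 0.5, -2.0]])
    print(single_sample_perceptron(Y))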
39
Perceptron rule (contd)

(Worked example; order in which the examples are considered: y2, y3, y1, y3.)
40
Perceptron rule (contd)

41
Perceptron rule (contd)
  • Some Direct Generalizations
  • Variable increment and a margin

42
Perceptron rule (contd)

43
Perceptron rule (contd)

44
Relaxation Procedures
  • Note that different criterion functions exist.
  • One possible choice is Jq(a) = Σ_{y in Y} (a^t y)^2
  • where Y is again the set of the training samples
    that are misclassified by a.
  • However, there are two problems with this
    criterion:
  • The function is too smooth and can converge to
    a = 0.
  • Jq is dominated by training samples with large
    magnitude.

45
Relaxation Procedures (contd)
  • A modified version that avoids the above two
    problems is Jr(a) = (1/2) Σ_{y in Y} (a^t y - b)^2 / ||y||^2
  • Here Y is the set of samples for which a^t y ≤ b.
  • Its gradient is given by
    ∇Jr = Σ_{y in Y} ((a^t y - b) / ||y||^2) y
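A sketch of a single-sample relaxation-with-margin procedure (the variant that updates on one margin-violating sample at a time); the margin b, the relaxation factor η, and the toy samples are all illustrative choices:

    import numpy as np

    def single_sample_relaxation(Y, b=1.0, eta=1.8, passes=50):
        # Single-sample relaxation with margin b (0 < eta < 2).
        a = np.zeros(Y.shape[1])
        for _ in range(passes):
            updated = False
            for y in Y:
                if a @ y <= b:                              # y violates the margin
                    a = a + eta * (b - a @ y) / (y @ y) * y # move a toward the margin hyperplane
                    updated = True
            if not updated:
                break                                       # all samples satisfy a^t y > b
        return a

    Y = np.array([[ 1.0, 1.0,  2.0],   # same hypothetical normalized toy samples
                  [ 1.0, 2.0,  1.0],
                  [-1.0, 1.0,  1.0],
                  [-1.0, 0.5, -2.0]])
    a = single_sample_relaxation(Y)
    print(a, (Y @ a > 0).all())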

46
Relaxation Procedures (contd)

47
Relaxation Procedures (contd)

48
Relaxation Procedures (contd)

49
Relaxation Procedures (contd)

50
Minimum Squared Error Procedures
  • Minimum squared error and pseudoinverse.
  • The problem is to find a weight vector a
    satisfying Ya = b.
  • If we have more equations than unknowns, the
    system is over-determined and an exact solution
    generally does not exist.
  • We want to choose the a that minimizes the
    sum-of-squared-error criterion function Js(a) = ||Ya - b||^2

51
Minimum Squared Error Procedures (contd)
  • Pseudoinverse solution: a = (Y^t Y)^(-1) Y^t b = Y† b,
    where Y† = (Y^t Y)^(-1) Y^t is the pseudoinverse of Y.
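A minimal MSE sketch using the same hypothetical normalized samples; the margin vector b is set to all ones here, an arbitrary illustrative choice:

    import numpy as np

    # MSE solution via the pseudoinverse.
    Y = np.array([[ 1.0, 1.0,  2.0],
                  [ 1.0, 2.0,  1.0],
                  [-1.0, 1.0,  1.0],
                  [-1.0, 0.5, -2.0]])
    b = np.ones(len(Y))                   # margin vector (all ones, arbitrary choice)

    a = np.linalg.pinv(Y) @ b             # a = Y† b minimizes ||Ya - b||^2
    print(a)
    print(np.linalg.norm(Y @ a - b)**2)   # value of the criterion Js(a)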

52
Minimum Squared Error Procedures (contd)