Transcript and Presenter's Notes

Title: 240-650 Principles of Pattern Recognition


1
240-650 Principles of Pattern Recognition
Montri Karnjanadecha
montri@coe.psu.ac.th
http://fivedots.coe.psu.ac.th/montri
2
Chapter 2
  • Bayesian Decision Theory

3
Statistical Approach to Pattern Recognition
4
A Simple Example
  • Suppose that we are given two classes w1 and w2
  • P(w1) = 0.7
  • P(w2) = 0.3
  • No measurement is given
  • Guessing
  • What shall we do to recognize a given input?
  • What is the best we can do statistically? Why?
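With no measurement available, the best statistical strategy is to always decide the class with the larger prior. Using the priors above, a quick worked check:

\[
\text{Always decide } \omega_1 \;\Rightarrow\; P(\text{error}) = P(\omega_2) = 0.3
\]

No fixed guess can do better, since always deciding w2 instead would err with probability 0.7.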

5
A More Complicated Example
  • Suppose that we are given two classes
  • A single measurement x
  • P(w1|x) and P(w2|x) are given graphically

6
A Bayesian Example
  • Suppose that we are given two classes
  • A single measurement x
  • We are given p(x|w1) and p(x|w2) this time

7
A Bayesian Example cont.
8
Bayesian Decision Theory
  • Bayes formula
  • In case of two categories
  • In English, it can be expressed as
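The formulas on this slide are images and do not survive in the transcript; the standard statement of Bayes formula, with the two-category evidence, is:

\[
P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}, \qquad
p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\,P(\omega_j)
\]

In English: posterior = (likelihood × prior) / evidence.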

9
Bayesian Decision Theory cont.
  • Posterior probability
  • P(wj|x) is the probability of the state of nature
    being wj given that feature value x has been measured
  • Likelihood
  • p(x|wj) is the likelihood of wj with respect to x
  • Evidence
  • The evidence factor p(x) can be viewed as a scaling
    factor that guarantees that the posterior
    probabilities sum to one

10
Bayesian Decision Theory cont.
  • Whenever we observe a particular x, the prob. of
    error is
  • The average prob. of error is given by
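In standard form (the slide's own equations are not reproduced in the transcript):

\[
P(\text{error} \mid x) = \min\left[P(\omega_1 \mid x),\, P(\omega_2 \mid x)\right], \qquad
P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error} \mid x)\, p(x)\, dx
\]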

11
Bayesian Decision Theory cont.
  • Bayes decision rule
  • Decide w1 if P(w1|x) > P(w2|x); otherwise decide
    w2
  • Prob. of error
  • P(error|x) = min[P(w1|x), P(w2|x)]
  • If we ignore the evidence, the decision rule
    becomes
  • Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2)
  • Otherwise decide w2
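A minimal Python sketch of this rule, assuming (purely for illustration, not from the slides) Gaussian class-conditional densities and the priors from the earlier example:

from scipy.stats import norm

# Priors from the earlier example; densities are illustrative placeholders
P = {"w1": 0.7, "w2": 0.3}
likelihood = {"w1": norm(loc=0.0, scale=1.0).pdf,   # assumed p(x|w1)
              "w2": norm(loc=2.0, scale=1.0).pdf}   # assumed p(x|w2)

def bayes_decide(x):
    # Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); otherwise decide w2
    if likelihood["w1"](x) * P["w1"] > likelihood["w2"](x) * P["w2"]:
        return "w1"
    return "w2"

print(bayes_decide(0.5))   # -> w1 (x is near the w1 mean and w1 has the larger prior)
print(bayes_decide(2.5))   # -> w2 (the likelihood ratio now outweighs the prior)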

12
Bayesian Decision Theory--continuous features
  • Feature space
  • In general, an input can be represented by a
    vector, a point in a d-dimensional Euclidean
    space Rd
  • Loss function
  • The loss function states exactly how costly each
    action is and is used to convert a probability
    determination into a decision
  • Written as l(ai|wj)

13
Loss Function
  • Describes the loss incurred for taking action ai
    when the state of nature is wj

14
Conditional Risk
  • Suppose we observe a particular x
  • We take action ai
  • If the true state of nature is wj
  • By definition we will incur the loss l(ai|wj)
  • We can minimize our expected loss by selecting
    the action that minimizes the conditional risk,
    R(ai|x)
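The conditional risk referred to here has the standard form:

\[
R(a_i \mid x) = \sum_{j=1}^{c} l(a_i \mid \omega_j)\, P(\omega_j \mid x)
\]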

15
Bayesian Decision Theory
  • Suppose that there are c categories
  • w1, w2, ..., wc
  • Conditional risk
  • Risk is the average expected loss
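The overall risk, written in the usual integral form (the slide's equation is not in the transcript):

\[
R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}
\]

where α(x) denotes the decision rule mapping each observation to an action.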

16
Bayesian Decision Theory
  • Bayes decision rule
  • For a given x, select the action ai for which the
    conditional risk is minimum
  • The resulting minimum overall risk is called the
    Bayes risk, denoted as R, which is the best
    performance that can be achieved

17
Two-Category Classification
  • Let lij = l(ai|wj)
  • Conditional risk
  • Fundamental decision rule
  • Decide w1 if R(a1|x) < R(a2|x)

18
Two-Category Classification cont.
  • The decision rule can be written in several ways
  • Decide w1 if one of the following is true

These rules are equivalent
Likelihood Ratio
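The equivalent forms of the rule (shown as images on the slide) are, with lij = l(ai|wj):

\[
\text{Decide } \omega_1 \text{ if } (l_{21}-l_{11})\, p(x \mid \omega_1)\, P(\omega_1)
> (l_{12}-l_{22})\, p(x \mid \omega_2)\, P(\omega_2)
\]

or, equivalently, in likelihood-ratio form:

\[
\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
> \frac{(l_{12}-l_{22})\, P(\omega_2)}{(l_{21}-l_{11})\, P(\omega_1)}
\]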
19
Minimum-Error-Rate Classification
  • A special case of the Bayes decision rule with
    the following zero-one loss function
  • Assigns no loss to correct decision
  • Assigns unit loss to any error
  • All errors are equally costly
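The zero-one loss function described above can be written as:

\[
l(a_i \mid \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}
\qquad i, j = 1, \dots, c
\]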

20
Minimum-Error-Rate Classification
  • Conditional risk
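Under the zero-one loss, the conditional risk reduces to:

\[
R(a_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)
\]

so minimizing the risk is the same as maximizing the posterior probability, as the next slide states.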

21
Minimum-Error-Rate Classification
  • We should select i that maximizes the posterior
    probability
  • For minimum error rate
  • Decide wi if P(wi|x) > P(wj|x) for all j ≠ i

22
Minimum-Error-Rate Classification
23
Classifiers, Discriminant Functions, and Decision
Surfaces
  • There are many ways to represent pattern
    classifiers
  • One of the most useful is in terms of a set of
    discriminant functions gi(x), i = 1, ..., c
  • The classifier assigns a feature vector x to
    class wi if gi(x) > gj(x) for all j ≠ i

24
The Multicategory Classifier
25
Classifiers, Discriminant Functions, and Decision
Surfaces
  • There are many equivalent discriminant functions
  • i.e., the classification results will be the same
    even though they are different functions
  • For example, if f is a monotonically increasing
    function, then f(gi(x)) gives the same
    classification as gi(x)

26
Classifiers, Discriminant Functions, and Decision
Surfaces
  • Some discriminant functions are easier to
    understand or to compute
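Common equivalent choices (not reproduced in the transcript), all yielding the same decisions for minimum error rate, are:

\[
g_i(x) = P(\omega_i \mid x), \qquad
g_i(x) = p(x \mid \omega_i)\, P(\omega_i), \qquad
g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)
\]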

27
Decision Regions
  • The effect of any decision rule is to divide the
    feature space into c decision regions, R1, ...,
    Rc
  • The regions are separated with decision
    boundaries, where ties occur among the largest
    discriminant functions

28
Decision Regions cont.
29
Two-Category Case (Dichotomizer)
  • Two-category case is a special case
  • Instead of two discriminant functions, a single
    one can be used
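The single discriminant function and decision rule are, in the usual form:

\[
g(x) = g_1(x) - g_2(x); \qquad
\text{decide } \omega_1 \text{ if } g(x) > 0, \text{ otherwise decide } \omega_2
\]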

30
The Normal Density
  • Univariate Gaussian Density
  • Mean
  • Variance
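The univariate Gaussian density and its two parameters, written out (the slide's equations are images):

\[
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right],
\qquad \mu = E[x], \qquad \sigma^{2} = E\!\left[(x-\mu)^{2}\right]
\]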

31
The Normal Density
32
The Normal Density
  • Central Limit Theorem
  • The aggregate effect of the sum of a large number
    of small, independent random disturbances will
    lead to a Gaussian distribution
  • Gaussian is often a good model for the actual
    probability distribution

33
The Multivariate Normal Density
  • Multivariate Density (in d dimension)
  • Abbreviation
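In d dimensions the density and the usual abbreviation are:

\[
p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}
\exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right],
\qquad p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})
\]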

34
The Multivariate Normal Density
  • Mean
  • Covariance matrix
  • The ijth component of Σ is σij
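The mean vector, covariance matrix, and its ijth component are defined as:

\[
\boldsymbol{\mu} = E[\mathbf{x}], \qquad
\boldsymbol{\Sigma} = E\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{T}\right], \qquad
\sigma_{ij} = E\!\left[(x_i-\mu_i)(x_j-\mu_j)\right]
\]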

35
Statistical Independence
  • If xi and xj are statistically independent, then
    σij = 0
  • The covariance matrix becomes a diagonal matrix
    in which all off-diagonal elements are zero

36
Whitening Transform
Λ: diagonal matrix of the corresponding eigenvalues of Σ
Φ: matrix whose columns are the orthonormal
eigenvectors of Σ
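In this notation the whitening transform itself, which maps the data so that the transformed covariance becomes the identity, is presumably:

\[
\mathbf{A}_w = \boldsymbol{\Phi}\,\boldsymbol{\Lambda}^{-1/2}, \qquad
\mathbf{A}_w^{T}\,\boldsymbol{\Sigma}\,\mathbf{A}_w = \mathbf{I}
\]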
37
Whitening Transform
38
Squared Mahalanobis Distance from x to μ
Contours of constant density are hyperellipsoids of
constant Mahalanobis distance
Principal axes of the hyperellipsoids are given by
the eigenvectors of Σ; lengths of the axes are
determined by the eigenvalues of Σ
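The squared Mahalanobis distance referred to in the slide title is:

\[
r^{2} = (\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})
\]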
39
Discriminant Functions for the Normal Density
  • Minimum distance classifier
  • If the densities are multivariate normal,
    i.e., if p(x|wi) ~ N(μi, Σi)
  • Then we have
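The resulting discriminant (the standard log form, since the slide's equation is an image) is:

\[
g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)
- \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)
\]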

40
Discriminant Functions for the Normal Density
  • Case 1
  • Features are statistically independent and each
    feature has the same variance σ²
  • where ||·|| below denotes the Euclidean norm
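For this case the discriminant reduces to a scaled Euclidean distance plus the log prior:

\[
g_i(\mathbf{x}) = -\frac{\lVert \mathbf{x}-\boldsymbol{\mu}_i \rVert^{2}}{2\sigma^{2}} + \ln P(\omega_i)
\]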

41
Case 1: Σi = σ²I
42
Linear Discriminant Function
  • It is not necessary to compute distances
  • Expanding the form yields
  • The term x^T x is the same for all i
  • We have the following linear discriminant
    function

43
Linear Discriminant Function
  • where
  • and

Threshold or bias for the ith category
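The missing expressions (shown as images on the slides) are, in standard form:

\[
g_i(\mathbf{x}) = \mathbf{w}_i^{T}\mathbf{x} + w_{i0}, \qquad
\mathbf{w}_i = \frac{\boldsymbol{\mu}_i}{\sigma^{2}}, \qquad
w_{i0} = -\frac{\boldsymbol{\mu}_i^{T}\boldsymbol{\mu}_i}{2\sigma^{2}} + \ln P(\omega_i)
\]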
44
Linear Machine
  • A classifier that uses linear discriminant
    functions is called a linear machine
  • Its decision surfaces are pieces of hyperplanes
    defined by the linear equations gi(x) = gj(x) for
    the two categories with the highest posterior
    probabilities. For our case this equation can be
    written as

45
Linear Machine
  • Where
  • And
  • If P(wi) = P(wj), then the second term
    vanishes
  • It is called a minimum-distance classifier
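For Case 1 the hyperplane parameters take the standard form:

\[
\mathbf{w}^{T}(\mathbf{x}-\mathbf{x}_0) = 0, \qquad
\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j
\]
\[
\mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j)
- \frac{\sigma^{2}}{\lVert \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \rVert^{2}}
\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)
\]

When P(wi) = P(wj) the second term of x0 vanishes and the boundary passes through the midpoint of the two means, giving the minimum-distance classifier mentioned above.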

46
Priors change -> decision boundaries shift
47
Priors change -> decision boundaries shift
48
Priors change -> decision boundaries shift
49
Case 2: Σi = Σ
  • Covariance matrices for all of the classes are
    identical but otherwise arbitrary
  • The cluster for the ith class is centered about
    mi
  • Discriminant function

The ln P(wi) term can be ignored if prior probabilities
are the same for all classes
50
Case 2 Discriminant function
  • Where
  • and
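Written out, the Case 2 discriminant is:

\[
g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)
\]

Dropping the quadratic term x^T Σ^{-1} x, which is identical for all classes, leaves the linear form:

\[
g_i(\mathbf{x}) = \mathbf{w}_i^{T}\mathbf{x} + w_{i0}, \qquad
\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i, \qquad
w_{i0} = -\tfrac{1}{2}\boldsymbol{\mu}_i^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)
\]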

51
For the 2-category case
  • If Ri and Rj are contiguous, the boundary between
    them has the equation
  • where
  • and
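In standard form the boundary and its parameters are:

\[
\mathbf{w}^{T}(\mathbf{x}-\mathbf{x}_0) = 0, \qquad
\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)
\]
\[
\mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j)
- \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}
{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}
\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)
\]

Unlike Case 1, w is generally not parallel to μi − μj, so the hyperplane is generally not orthogonal to the line between the means.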

52
(No Transcript)
53
(No Transcript)
54
Case 3: Σi arbitrary
  • In general, the covariance matrices are different
    for each category
  • The only term that can be dropped is the
    (d/2) ln 2π term

55
Case 3: Σi arbitrary
  • The discriminant functions are
  • Where
  • and
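The quadratic discriminant and its parameters, in the standard form the slide presumably shows, are:

\[
g_i(\mathbf{x}) = \mathbf{x}^{T}\mathbf{W}_i\,\mathbf{x} + \mathbf{w}_i^{T}\mathbf{x} + w_{i0}
\]
\[
\mathbf{W}_i = -\tfrac{1}{2}\boldsymbol{\Sigma}_i^{-1}, \qquad
\mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i, \qquad
w_{i0} = -\tfrac{1}{2}\boldsymbol{\mu}_i^{T}\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i
- \tfrac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)
\]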

56
Two-category case
  • The decision surfaces are hyperquadrics
    (hyperplanes, hyperspheres, hyperellipsoids,
    hyperparaboloids, ...)

57
(No Transcript)
58
(No Transcript)
59
(No Transcript)
60
(No Transcript)
61
Example