Title: COMPE 467 - Pattern Recognition
1  Bayesian Decision Theory
COMPE 467 - Pattern Recognition
2  Bayesian Decision Theory
- Bayesian Decision Theory is a fundamental statistical approach that quantifies the trade-offs between various decisions using the probabilities and costs that accompany those decisions.
- First, we will assume that all probabilities are known.
- Then, we will study the cases where the probabilistic structure is not completely known.
3  Fish Sorting Example
- The state of nature is a random variable.
- Define ω as the type of fish we observe (state of nature, class), where
  - ω = ω1 for sea bass,
  - ω = ω2 for salmon.
- P(ω1) is the a priori probability that the next fish is a sea bass.
- P(ω2) is the a priori probability that the next fish is a salmon.
4  Prior Probabilities
- Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it.
- How can we choose P(ω1) and P(ω2)?
  - Set P(ω1) = P(ω2) if the classes are equiprobable (uniform priors).
  - We may use different values depending on the fishing area, the time of year, etc.
- Assuming there are no other types of fish, P(ω1) + P(ω2) = 1.
5  Making a Decision
- How can we make a decision with only the prior information?
- What is the probability of error for this decision?
  - P(error) = min{P(ω1), P(ω2)}
6  Making a Decision
- Decision rule with only the prior information:
  - Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2 (a small sketch of this rule follows below).
- To improve the decision, we make a further measurement and compute the class-conditional densities.
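Putting the prior-only rule and its error probability together, a minimal Python sketch (the prior values here are assumed for illustration, not taken from the slides):

    # Decision using only the prior probabilities (no measurement of x).
    priors = {"sea bass": 0.7, "salmon": 0.3}   # example priors, assumed for illustration

    decision = max(priors, key=priors.get)      # decide the class with the larger prior
    p_error = min(priors.values())              # P(error) = min{P(w1), P(w2)}

    print(decision, p_error)                    # -> sea bass 0.3 (every fish is called sea bass)

With only priors, the same class is chosen for every fish, which is why a measurement is needed.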
7  Class-Conditional Probabilities
- Let's try to improve the decision using the lightness measurement x.
- Let x be a continuous random variable.
- Define p(x | ωj) as the class-conditional probability density (the density of x given that the state of nature is ωj, for j = 1, 2).
- p(x | ω1) and p(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.
8  Class-Conditional Probabilities
9  Posterior Probabilities
- Suppose we know P(ωj) and p(x | ωj) for j = 1, 2, and we measure the lightness of a fish as the value x.
- Define P(ωj | x) as the a posteriori probability (the probability of the state of nature being ωj given the measurement of the feature value x).
- We can use the Bayes formula to convert the prior probability into the posterior probability:
  - P(ωj | x) = p(x | ωj) P(ωj) / p(x)
  - where p(x) = Σj p(x | ωj) P(ωj) is the evidence.
10  Posterior Probabilities (remember)
11  Posterior Probabilities (remember)
Posterior = (Likelihood × Prior) / Evidence
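A small numeric sketch of this conversion in Python (the likelihood values at the observed x are assumed for illustration):

    # Bayes formula: P(wj | x) = p(x | wj) * P(wj) / p(x)
    priors      = [2/3, 1/3]          # P(w1), P(w2)
    likelihoods = [0.4, 0.9]          # p(x | w1), p(x | w2) at the observed x (assumed values)

    evidence   = sum(l * p for l, p in zip(likelihoods, priors))          # p(x)
    posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]  # P(wj | x)

    print(posteriors, sum(posteriors))   # the posteriors sum to 1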
12  Making a Decision
- p(x | ωj) is called the likelihood and p(x) is called the evidence.
- How can we make a decision after observing the value of x?
13  Making a Decision
- Decision strategy: given the posterior probabilities for each class and an observation x,
  - if P(ω1 | x) > P(ω2 | x), decide that the true state of nature is ω1;
  - if P(ω1 | x) < P(ω2 | x), decide that the true state of nature is ω2.
14  Making a Decision
- p(x | ωj) is called the likelihood and p(x) is called the evidence.
- How can we make a decision after observing the value of x?
- Rewriting the rule gives: decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2. The evidence p(x) cancels, since it is the same for both classes (see the short sketch after this slide).
- Note that, at every x, P(ω1 | x) + P(ω2 | x) = 1.
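The same rule written directly on likelihoods and priors, as a short sketch (the numeric values are assumed):

    # Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); the evidence p(x) cancels out.
    def decide(px_w1, px_w2, p_w1, p_w2):
        return "w1" if px_w1 * p_w1 > px_w2 * p_w2 else "w2"

    print(decide(0.4, 0.9, 2/3, 1/3))   # 0.267 vs 0.300 -> decide w2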
15  Probability of Error
- What is the probability of error for this decision?
16  Probability of Error
- Decision strategy for minimizing the probability of error:
  - Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2.
- Therefore
  - P(error | x) = min{P(ω1 | x), P(ω2 | x)}   (Bayes decision)
17  Probability of Error
- What is the probability of error for this decision?
- The Bayes decision rule minimizes this error because the overall error P(error) = ∫ P(error | x) p(x) dx is smallest when P(error | x) is made as small as possible at every x, and the Bayes rule achieves P(error | x) = min{P(ω1 | x), P(ω2 | x)}.
18  Example
19  Example (cont.)
20  Example (cont.)
Assign colours to objects.
21  Example (cont.)
22  Example (cont.)
23  Example (cont.)
24  Example (cont.)
Assign colours to the pen objects.
25  Example (cont.)
26  Example (cont.)
Assign colours to the paper objects.
27  Example (cont.)
28  Example (cont.)
29  Bayesian Decision Theory
- How can we generalize to
  - more than one feature? Replace the scalar x by the feature vector x.
  - more than two states of nature? Just a difference in notation.
  - allowing actions other than just decisions? Allow the possibility of rejection.
  - different risks in the decision? Define how costly each action is.
30  Bayesian Decision Theory
- Let {ω1, . . . , ωc} be the finite set of c states of nature (classes, categories).
- Let {α1, . . . , αa} be the finite set of a possible actions.
- Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.
- Let x be the d-component vector-valued random variable called the feature vector.
31  Bayesian Decision Theory
- p(x | ωj) is the class-conditional probability density function.
- P(ωj) is the prior probability that nature is in state ωj.
- The posterior probability can be computed as
  - P(ωj | x) = p(x | ωj) P(ωj) / p(x)
  - where p(x) = Σj p(x | ωj) P(ωj).
32  Loss Function
- We allow general actions rather than only deciding on the state of nature. How costly is each action?
- Introduce a loss function, which is more general than the probability of error.
- The loss function states how costly each action taken is.
- Allowing actions other than classification primarily allows the possibility of rejection, i.e., refusing to make a decision in close or ambiguous cases.
33  Loss Function
- Let {ω1, ω2, ..., ωc} be the set of c states of nature (or categories).
- Let α(x) be a decision rule that maps a pattern x into one of the actions in {α1, α2, ..., αa}, the set of possible actions.
- Let λ(αi | ωj) be the loss incurred for taking action αi when the category is ωj.
34  Conditional Risk
- Suppose we observe x and take action αi.
- If the true state of nature is ωj, we incur the loss λ(αi | ωj).
- The expected loss of taking action αi is
  - R(αi | x) = Σj λ(αi | ωj) P(ωj | x),
  - which is also called the conditional risk (a numeric sketch follows after the next slide).
35  Example: Target Detection
                  True class ω1 (target present)   True class ω2 (target absent)
  Choose α1       λ(α1 | ω1): hit                  λ(α1 | ω2): false alarm
  Choose α2       λ(α2 | ω1): miss                 λ(α2 | ω2): do nothing (correct rejection)
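A sketch of the conditional risk R(αi | x) = Σj λ(αi | ωj) P(ωj | x) in Python, using a loss matrix in the spirit of the target-detection table above (the cost values and posteriors are assumed for illustration):

    import numpy as np

    # Rows: actions a1 (declare target present), a2 (declare target absent).
    # Columns: true classes w1 (target present), w2 (target absent).
    loss = np.array([[0.0,  1.0],    # lambda(a1|w1) = 0 (hit), lambda(a1|w2) = 1 (false alarm)
                     [10.0, 0.0]])   # lambda(a2|w1) = 10 (miss), lambda(a2|w2) = 0 -- assumed costs

    posteriors = np.array([0.2, 0.8])          # P(w1|x), P(w2|x) at some x (assumed)
    cond_risk = loss @ posteriors              # R(ai|x) = sum_j lambda(ai|wj) P(wj|x)
    best_action = int(np.argmin(cond_risk))    # Bayes rule: take the minimum-risk action

    print(cond_risk, "-> action a%d" % (best_action + 1))   # [0.8 2.0] -> action a1

Because a miss is costed much more heavily than a false alarm here, the target is declared present even though its posterior is only 0.2.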
36  Minimum-Risk Classification
- The general decision rule α(x) tells us which action to take for observation x.
- We want to find the decision rule that minimizes the overall risk R = ∫ R(α(x) | x) p(x) dx.
- The Bayes decision rule minimizes the overall risk by selecting, for every x, the action αi for which R(αi | x) is minimum.
- The resulting minimum overall risk is called the Bayes risk and is the best performance that can be achieved.
37  Two-Category Classification
- Define λij = λ(αi | ωj), the loss incurred for deciding ωi when the true state of nature is ωj.
- The conditional risks can be written as
  - R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  - R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
38  Two-Category Classification
- The minimum-risk decision rule becomes: decide ω1 if (λ21 - λ11) P(ω1 | x) > (λ12 - λ22) P(ω2 | x).
- This corresponds to deciding ω1 if
  - p(x | ω1) / p(x | ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)],
  - i.e., comparing the likelihood ratio to a threshold that is independent of the observation x.
39  Optimal Decision Property
If the likelihood ratio exceeds a threshold value T that is independent of the input pattern x, we take the optimal action.
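A sketch of this likelihood-ratio test in Python, with the threshold T built from assumed losses and priors:

    # Decide w1 if p(x|w1)/p(x|w2) > T, where
    # T = ((lam12 - lam22) / (lam21 - lam11)) * (P(w2) / P(w1)) does not depend on x.
    def likelihood_ratio_decision(px_w1, px_w2, p1, p2, lam11, lam12, lam21, lam22):
        threshold = ((lam12 - lam22) / (lam21 - lam11)) * (p2 / p1)
        return "w1" if px_w1 / px_w2 > threshold else "w2"

    # With zero-one losses the threshold reduces to P(w2)/P(w1):
    print(likelihood_ratio_decision(0.9, 0.4, 1/3, 2/3, 0, 1, 1, 0))   # ratio 2.25 > 2.0 -> w1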
40  Minimum-Error-Rate Classification
- Actions are decisions on classes (αi is deciding ωi).
- If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j.
- We want to find a decision rule that minimizes the probability of error.
41  Minimum-Error-Rate Classification
- Define the zero-one loss function
  - λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j
  - (all errors are equally costly).
- The conditional risk becomes
  - R(αi | x) = Σ(j ≠ i) P(ωj | x) = 1 - P(ωi | x).
42  Minimum-Error-Rate Classification
- Minimizing the risk requires maximizing P(ωi | x), which gives the minimum-error decision rule:
  - Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i (a small sketch follows below).
- The resulting error is called the Bayes error and is the best performance that can be achieved.
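Under the zero-one loss, minimizing R(αi | x) = 1 - P(ωi | x) is the same as picking the largest posterior; a minimal sketch (the posterior values are assumed):

    import numpy as np

    posteriors = np.array([0.2, 0.5, 0.3])   # P(wi|x) for c = 3 classes (assumed values)

    risks = 1.0 - posteriors                 # conditional risk under the zero-one loss
    assert np.argmin(risks) == np.argmax(posteriors)

    print("decide w%d" % (np.argmax(posteriors) + 1), "with P(error|x) =", risks.min())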
43  Minimum-Error-Rate Classification
In terms of the decision regions Ri and under the zero-one loss function, the probability of error is P(error) = Σi ∫(Ri) [1 - P(ωi | x)] p(x) dx; therefore, the Bayes rule minimizes it by assigning each x to the region of the class with the largest posterior P(ωi | x).
44  Minimum-Error-Rate Classification
45  Discriminant Functions
- A useful way of representing classifiers is through discriminant functions gi(x), i = 1, . . . , c, where the classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.
- For the classifier that minimizes the conditional risk: gi(x) = -R(αi | x).
- For the classifier that minimizes the error: gi(x) = P(ωi | x).
46  Discriminant Functions
47  Discriminant Functions
- These functions divide the feature space into c decision regions (R1, . . . , Rc), separated by decision boundaries.
48  Discriminant Functions
gi(x) can be replaced by any monotonically increasing function of P(ωi | x):
- gi(x) = f(P(ωi | x)), e.g., gi(x) = p(x | ωi) P(ωi),
- or the natural logarithm: gi(x) = ln p(x | ωi) + ln P(ωi).
49  Discriminant Functions
- The two-category case:
  - The classifier is a dichotomizer that has two discriminant functions g1 and g2.
  - Let g(x) ≡ g1(x) - g2(x).
  - Decide ω1 if g(x) > 0; otherwise decide ω2.
50  Discriminant Functions
- The two-category case:
  - The computation of g(x): g(x) = P(ω1 | x) - P(ω2 | x), or equivalently g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)] (a sketch follows below).
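A sketch of the dichotomizer in the logarithmic form g(x) = ln p(x | ω1) - ln p(x | ω2) + ln P(ω1) - ln P(ω2) (the density values at x are assumed for illustration):

    import numpy as np

    def g(px_w1, px_w2, p1, p2):
        # g(x) = g1(x) - g2(x) with gi(x) = ln p(x|wi) + ln P(wi)
        return (np.log(px_w1) + np.log(p1)) - (np.log(px_w2) + np.log(p2))

    score = g(0.4, 0.9, 2/3, 1/3)                       # assumed densities at some x
    print("decide w1" if score > 0 else "decide w2")    # here g(x) < 0 -> decide w2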
51  Example
52  Exercise
53  Example
54  Exercise
- Select the optimal decision for the setting below (a numeric sketch follows after this slide):
  - classes ω1, ω2
  - p(x | ω1) ~ N(2, 0.5) (normal distribution)
  - p(x | ω2) ~ N(1.5, 0.2)
  - P(ω1) = 2/3
  - P(ω2) = 1/3
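A numeric sketch for this exercise; it assumes the second parameter of N(·, ·) is the variance and simply compares p(x | ωi) P(ωi) over a grid of x values:

    import numpy as np

    def normal_pdf(x, mean, var):
        return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    xs = np.linspace(0.0, 4.0, 4001)
    score1 = normal_pdf(xs, 2.0, 0.5) * (2 / 3)    # p(x|w1) P(w1)
    score2 = normal_pdf(xs, 1.5, 0.2) * (1 / 3)    # p(x|w2) P(w2)

    decisions = np.where(score1 > score2, 1, 2)    # Bayes decision at each grid point
    changes = xs[1:][np.diff(decisions) != 0]      # approximate decision boundaries
    print("decision switches near x =", changes)

The grid comparison shows a band of x values where ω2 wins, with ω1 chosen elsewhere; the exact boundaries can also be found analytically by equating the two scores.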
55  The Gaussian Density
- The Gaussian can be considered a model where the feature vectors for a given class are continuous-valued, randomly corrupted versions of a single typical or prototype vector.
- Some properties of the Gaussian:
  - Analytically tractable.
  - Completely specified by the 1st and 2nd moments (see the sketch below).
  - Has the maximum entropy of all distributions with a given mean and variance.
  - Many processes are asymptotically Gaussian (Central Limit Theorem).
  - Linear transformations of a Gaussian are also Gaussian.
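A small sketch of the "completely specified by the 1st and 2nd moments" property: samples drawn from a univariate Gaussian are summarized by their sample mean and variance (the parameter values here are assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 0.7                     # assumed mean and standard deviation
    samples = rng.normal(mu, sigma, size=100_000)

    print(samples.mean(), samples.var())     # close to mu = 2.0 and sigma^2 = 0.49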
56  Univariate Gaussian
57  Univariate Gaussian
58  Multivariate Gaussian
59  Linear Transformations
60  Linear Transformations
61  Mahalanobis Distance
The Mahalanobis distance takes into account the covariance among the variables when calculating distance.
62  Mahalanobis Distance
- The squared Mahalanobis distance from x to the mean μ is r² = (x - μ)^T Σ⁻¹ (x - μ); it takes the covariance among the variables into account when calculating distance (see the sketch below).
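A sketch of the squared Mahalanobis distance r² = (x - μ)^T Σ⁻¹ (x - μ) in Python (the mean and covariance are assumed example values):

    import numpy as np

    mu = np.array([1.0, 2.0])                 # assumed class mean
    cov = np.array([[2.0, 0.5],               # assumed covariance matrix
                    [0.5, 1.0]])

    def mahalanobis_sq(x, mu, cov):
        d = x - mu
        return d @ np.linalg.inv(cov) @ d     # (x - mu)^T Sigma^{-1} (x - mu)

    x = np.array([2.0, 3.0])
    print(mahalanobis_sq(x, mu, cov), np.sum((x - mu) ** 2))   # compare with squared Euclidean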
63  Discriminant Functions for the Gaussian Density
Assume that the class-conditional density p(x | ωi) is multivariate normal, i.e., p(x | ωi) ~ N(μi, Σi).
64  Discriminant Functions for the Gaussian Density
65  The simplest case: Σi = σ²I
- The features are statistically independent, and
- each feature has the same variance σ².
66  The determinant of Σ
- |Σ| = σ^(2d), because Σ = σ²I is diagonal with d equal diagonal entries.
67  The inverse of Σ
- Σ⁻¹ = (1/σ²) I, because Σ = σ²I.
68  Combining these results
- |Σ| = σ^(2d) and Σ⁻¹ = (1/σ²) I are substituted into the discriminant function.
69  (No transcript)
70  The quadratic term x^T x is the same for all the discriminant functions, so we can omit it and obtain a discriminant that is linear in x.
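A sketch of the resulting linear discriminant for the Σi = σ²I case, gi(x) = (μi^T x)/σ² - (μi^T μi)/(2σ²) + ln P(ωi); the means, variance, and priors below are assumed example values:

    import numpy as np

    sigma2 = 1.5                                       # common variance sigma^2 (assumed)
    means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
    priors = [0.6, 0.4]

    def g(x, mu, prior):
        # Linear discriminant after dropping the x^T x term common to all classes.
        return (mu @ x) / sigma2 - (mu @ mu) / (2 * sigma2) + np.log(prior)

    x = np.array([1.0, 0.5])
    scores = [g(x, m, p) for m, p in zip(means, priors)]
    print("decide w%d" % (int(np.argmax(scores)) + 1))   # here -> w1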
71  (No transcript)
72  (No transcript)
73  (No transcript)
74  (No transcript)
75  References
- R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd ed., New York: John Wiley & Sons, 2001.
- Selim Aksoy, Pattern Recognition Course Materials, 2011.
- M. Narasimha Murty and V. Susheela Devi, Pattern Recognition: An Algorithmic Approach, Springer, 2011.