Title: COMPE 467 - Pattern Recognition
1  Bayesian Decision Theory
COMPE 467 - Pattern Recognition
2  Bayesian Decision Theory
- Bayesian Decision Theory is a fundamental statistical approach that quantifies the trade-offs between various decisions using the probabilities and costs that accompany those decisions.
- First, we will assume that all probabilities are known.
- Then, we will study the cases where the probabilistic structure is not completely known.
3  Fish Sorting Example
- The state of nature is a random variable.
- Define ω as the type of fish we observe (state of nature, class), where
  - ω = ω1 for sea bass,
  - ω = ω2 for salmon.
- P(ω1) is the a priori probability that the next fish is a sea bass.
- P(ω2) is the a priori probability that the next fish is a salmon.
4  Prior Probabilities
- Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it.
- How can we choose P(ω1) and P(ω2)?
  - Set P(ω1) = P(ω2) if the classes are equiprobable (uniform priors).
  - We may use different values depending on the fishing area, the time of year, etc.
- Assuming there are no other types of fish, P(ω1) + P(ω2) = 1.
5  Making a Decision
- How can we make a decision with only the prior information?
- What is the probability of error for this decision?
  - P(error) = min{P(ω1), P(ω2)}
6  Making a Decision
- Decision rule with only the prior information:
  - Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2 (a small sketch of this rule follows below).
- To improve the decision, we make a further measurement and compute the class-conditional densities.
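Putting the prior-only rule and its error probability together, a minimal Python sketch (the prior values here are assumed for illustration, not taken from the slides):

    # Decision using only the prior probabilities (no measurement of x).
    priors = {"sea bass": 0.7, "salmon": 0.3}   # example priors, assumed for illustration

    decision = max(priors, key=priors.get)      # decide the class with the larger prior
    p_error = min(priors.values())              # P(error) = min{P(w1), P(w2)}

    print(decision, p_error)                    # -> sea bass 0.3 (every fish is called sea bass)

With only priors, the same class is chosen for every fish, which is why a measurement is needed.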
7  Class-Conditional Probabilities
- Let's try to improve the decision using the lightness measurement x.
- Let x be a continuous random variable.
- Define p(x | ωj) as the class-conditional probability density (the density of x given that the state of nature is ωj, for j = 1, 2).
- p(x | ω1) and p(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.
8  Class-Conditional Probabilities
9  Posterior Probabilities
- Suppose we know P(ωj) and p(x | ωj) for j = 1, 2, and we measure the lightness of a fish as the value x.
- Define P(ωj | x) as the a posteriori probability (the probability of the state of nature being ωj given the measurement of the feature value x).
- We can use the Bayes formula to convert the prior probability into the posterior probability:
  - P(ωj | x) = p(x | ωj) P(ωj) / p(x)
  - where p(x) = Σj p(x | ωj) P(ωj) is the evidence.
10  Posterior Probabilities (remember)
11  Posterior Probabilities (remember)
Posterior = (Likelihood × Prior) / Evidence
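A small numeric sketch of this conversion in Python (the likelihood values at the observed x are assumed for illustration):

    # Bayes formula: P(wj | x) = p(x | wj) * P(wj) / p(x)
    priors      = [2/3, 1/3]          # P(w1), P(w2)
    likelihoods = [0.4, 0.9]          # p(x | w1), p(x | w2) at the observed x (assumed values)

    evidence   = sum(l * p for l, p in zip(likelihoods, priors))          # p(x)
    posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]  # P(wj | x)

    print(posteriors, sum(posteriors))   # the posteriors sum to 1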
12  Making a Decision
- p(x | ωj) is called the likelihood and p(x) is called the evidence.
- How can we make a decision after observing the value of x?
13  Making a Decision
- Decision strategy: given the posterior probabilities for each class and an observation x,
  - if P(ω1 | x) > P(ω2 | x), decide that the true state of nature is ω1;
  - if P(ω1 | x) < P(ω2 | x), decide that the true state of nature is ω2.
14  Making a Decision
- p(x | ωj) is called the likelihood and p(x) is called the evidence.
- How can we make a decision after observing the value of x?
- Rewriting the rule gives: decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2. The evidence p(x) cancels, since it is the same for both classes (see the short sketch after this slide).
- Note that, at every x, P(ω1 | x) + P(ω2 | x) = 1.
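The same rule written directly on likelihoods and priors, as a short sketch (the numeric values are assumed):

    # Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); the evidence p(x) cancels out.
    def decide(px_w1, px_w2, p_w1, p_w2):
        return "w1" if px_w1 * p_w1 > px_w2 * p_w2 else "w2"

    print(decide(0.4, 0.9, 2/3, 1/3))   # 0.267 vs 0.300 -> decide w2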
15  Probability of Error
- What is the probability of error for this decision?
16  Probability of Error
- Decision strategy for minimizing the probability of error:
  - Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2.
- Therefore
  - P(error | x) = min{P(ω1 | x), P(ω2 | x)}   (Bayes decision)
17  Probability of Error
- What is the probability of error for this decision?
- The Bayes decision rule minimizes this error because the overall error P(error) = ∫ P(error | x) p(x) dx is smallest when P(error | x) is made as small as possible at every x, and the Bayes rule achieves P(error | x) = min{P(ω1 | x), P(ω2 | x)}.
18  Example
19  Example (cont.)
20  Example (cont.)
Assign colours to objects.
21  Example (cont.)
22  Example (cont.)
23  Example (cont.)
24  Example (cont.)
Assign colours to the pen objects.
25  Example (cont.)
26  Example (cont.)
Assign colours to the paper objects.
27  Example (cont.)
28  Example (cont.)
29  Bayesian Decision Theory
- How can we generalize to
  - more than one feature? Replace the scalar x by the feature vector x.
  - more than two states of nature? Just a difference in notation.
  - allowing actions other than just decisions? Allow the possibility of rejection.
  - different risks in the decision? Define how costly each action is.
30  Bayesian Decision Theory
- Let {ω1, . . . , ωc} be the finite set of c states of nature (classes, categories).
- Let {α1, . . . , αa} be the finite set of a possible actions.
- Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.
- Let x be the d-component vector-valued random variable called the feature vector.
31  Bayesian Decision Theory
- p(x | ωj) is the class-conditional probability density function.
- P(ωj) is the prior probability that nature is in state ωj.
- The posterior probability can be computed as
  - P(ωj | x) = p(x | ωj) P(ωj) / p(x)
  - where p(x) = Σj p(x | ωj) P(ωj).
32  Loss Function
- We allow general actions rather than only deciding on the state of nature. How costly is each action?
- Introduce a loss function, which is more general than the probability of error.
- The loss function states how costly each action taken is.
- Allowing actions other than classification primarily allows the possibility of rejection, i.e., refusing to make a decision in close or ambiguous cases.
33  Loss Function
- Let {ω1, ω2, ..., ωc} be the set of c states of nature (or categories).
- Let α(x) be a decision rule that maps a pattern x into one of the actions in {α1, α2, ..., αa}, the set of possible actions.
- Let λ(αi | ωj) be the loss incurred for taking action αi when the category is ωj.
34  Conditional Risk
- Suppose we observe x and take action αi.
- If the true state of nature is ωj, we incur the loss λ(αi | ωj).
- The expected loss of taking action αi is
  - R(αi | x) = Σj λ(αi | ωj) P(ωj | x),
  - which is also called the conditional risk (a numeric sketch follows after the next slide).
35  Example: Target Detection
                  True class ω1 (target present)   True class ω2 (target absent)
  Choose α1       λ(α1 | ω1): hit                  λ(α1 | ω2): false alarm
  Choose α2       λ(α2 | ω1): miss                 λ(α2 | ω2): do nothing (correct rejection)
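A sketch of the conditional risk R(αi | x) = Σj λ(αi | ωj) P(ωj | x) in Python, using a loss matrix in the spirit of the target-detection table above (the cost values and posteriors are assumed for illustration):

    import numpy as np

    # Rows: actions a1 (declare target present), a2 (declare target absent).
    # Columns: true classes w1 (target present), w2 (target absent).
    loss = np.array([[0.0,  1.0],    # lambda(a1|w1) = 0 (hit), lambda(a1|w2) = 1 (false alarm)
                     [10.0, 0.0]])   # lambda(a2|w1) = 10 (miss), lambda(a2|w2) = 0 -- assumed costs

    posteriors = np.array([0.2, 0.8])          # P(w1|x), P(w2|x) at some x (assumed)
    cond_risk = loss @ posteriors              # R(ai|x) = sum_j lambda(ai|wj) P(wj|x)
    best_action = int(np.argmin(cond_risk))    # Bayes rule: take the minimum-risk action

    print(cond_risk, "-> action a%d" % (best_action + 1))   # [0.8 2.0] -> action a1

Because a miss is costed much more heavily than a false alarm here, the target is declared present even though its posterior is only 0.2.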
36  Minimum-Risk Classification
- The general decision rule α(x) tells us which action to take for observation x.
- We want to find the decision rule that minimizes the overall risk R = ∫ R(α(x) | x) p(x) dx.
- The Bayes decision rule minimizes the overall risk by selecting, for every x, the action αi for which R(αi | x) is minimum.
- The resulting minimum overall risk is called the Bayes risk and is the best performance that can be achieved.
37  Two-Category Classification
- Define λij = λ(αi | ωj), the loss incurred for deciding ωi when the true state of nature is ωj.
- The conditional risks can be written as
  - R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  - R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
38  Two-Category Classification
- The minimum-risk decision rule becomes: decide ω1 if (λ21 - λ11) P(ω1 | x) > (λ12 - λ22) P(ω2 | x).
- This corresponds to deciding ω1 if
  - p(x | ω1) / p(x | ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)],
  - i.e., comparing the likelihood ratio to a threshold that is independent of the observation x.
39  Optimal Decision Property
If the likelihood ratio exceeds a threshold value T that is independent of the input pattern x, we take the optimal action.
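A sketch of this likelihood-ratio test in Python, with the threshold T built from assumed losses and priors:

    # Decide w1 if p(x|w1)/p(x|w2) > T, where
    # T = ((lam12 - lam22) / (lam21 - lam11)) * (P(w2) / P(w1)) does not depend on x.
    def likelihood_ratio_decision(px_w1, px_w2, p1, p2, lam11, lam12, lam21, lam22):
        threshold = ((lam12 - lam22) / (lam21 - lam11)) * (p2 / p1)
        return "w1" if px_w1 / px_w2 > threshold else "w2"

    # With zero-one losses the threshold reduces to P(w2)/P(w1):
    print(likelihood_ratio_decision(0.9, 0.4, 1/3, 2/3, 0, 1, 1, 0))   # ratio 2.25 > 2.0 -> w1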
40  Minimum-Error-Rate Classification
- Actions are decisions on classes (αi is deciding ωi).
- If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j.
- We want to find a decision rule that minimizes the probability of error.
41  Minimum-Error-Rate Classification
- Define the zero-one loss function
  - λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j
  - (all errors are equally costly).
- The conditional risk becomes
  - R(αi | x) = Σ(j ≠ i) P(ωj | x) = 1 - P(ωi | x).
42  Minimum-Error-Rate Classification
- Minimizing the risk requires maximizing P(ωi | x), which gives the minimum-error decision rule:
  - Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i (a small sketch follows below).
- The resulting error is called the Bayes error and is the best performance that can be achieved.
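Under the zero-one loss, minimizing R(αi | x) = 1 - P(ωi | x) is the same as picking the largest posterior; a minimal sketch (the posterior values are assumed):

    import numpy as np

    posteriors = np.array([0.2, 0.5, 0.3])   # P(wi|x) for c = 3 classes (assumed values)

    risks = 1.0 - posteriors                 # conditional risk under the zero-one loss
    assert np.argmin(risks) == np.argmax(posteriors)

    print("decide w%d" % (np.argmax(posteriors) + 1), "with P(error|x) =", risks.min())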
43  Minimum-Error-Rate Classification
In terms of the decision regions Ri and under the zero-one loss function, the probability of error is P(error) = Σi ∫(Ri) [1 - P(ωi | x)] p(x) dx; therefore, the Bayes rule minimizes it by assigning each x to the region of the class with the largest posterior P(ωi | x).
44  Minimum-Error-Rate Classification
45  Discriminant Functions
- A useful way of representing classifiers is through discriminant functions gi(x), i = 1, . . . , c, where the classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.
- For the classifier that minimizes the conditional risk: gi(x) = -R(αi | x).
- For the classifier that minimizes the error: gi(x) = P(ωi | x).
46  Discriminant Functions
47  Discriminant Functions
- These functions divide the feature space into c decision regions (R1, . . . , Rc), separated by decision boundaries.
48  Discriminant Functions
gi(x) can be replaced by any monotonically increasing function of P(ωi | x):
- gi(x) = f(P(ωi | x)), e.g., gi(x) = p(x | ωi) P(ωi),
- or the natural logarithm: gi(x) = ln p(x | ωi) + ln P(ωi).
49  Discriminant Functions
- The two-category case:
  - The classifier is a dichotomizer that has two discriminant functions g1 and g2.
  - Let g(x) ≡ g1(x) - g2(x).
  - Decide ω1 if g(x) > 0; otherwise decide ω2.
50  Discriminant Functions
- The two-category case:
  - The computation of g(x): g(x) = P(ω1 | x) - P(ω2 | x), or equivalently g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)] (a sketch follows below).
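A sketch of the dichotomizer in the logarithmic form g(x) = ln p(x | ω1) - ln p(x | ω2) + ln P(ω1) - ln P(ω2) (the density values at x are assumed for illustration):

    import numpy as np

    def g(px_w1, px_w2, p1, p2):
        # g(x) = g1(x) - g2(x) with gi(x) = ln p(x|wi) + ln P(wi)
        return (np.log(px_w1) + np.log(p1)) - (np.log(px_w2) + np.log(p2))

    score = g(0.4, 0.9, 2/3, 1/3)                       # assumed densities at some x
    print("decide w1" if score > 0 else "decide w2")    # here g(x) < 0 -> decide w2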
51  Example
52  Exercise
53  Example
54  Exercise
- Select the optimal decision for the setting below (a numeric sketch follows after this slide):
  - classes ω1, ω2
  - p(x | ω1) ~ N(2, 0.5) (normal distribution)
  - p(x | ω2) ~ N(1.5, 0.2)
  - P(ω1) = 2/3
  - P(ω2) = 1/3
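A numeric sketch for this exercise; it assumes the second parameter of N(·, ·) is the variance and simply compares p(x | ωi) P(ωi) over a grid of x values:

    import numpy as np

    def normal_pdf(x, mean, var):
        return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    xs = np.linspace(0.0, 4.0, 4001)
    score1 = normal_pdf(xs, 2.0, 0.5) * (2 / 3)    # p(x|w1) P(w1)
    score2 = normal_pdf(xs, 1.5, 0.2) * (1 / 3)    # p(x|w2) P(w2)

    decisions = np.where(score1 > score2, 1, 2)    # Bayes decision at each grid point
    changes = xs[1:][np.diff(decisions) != 0]      # approximate decision boundaries
    print("decision switches near x =", changes)

The grid comparison shows a band of x values where ω2 wins, with ω1 chosen elsewhere; the exact boundaries can also be found analytically by equating the two scores.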
55  The Gaussian Density
- The Gaussian can be considered a model where the feature vectors for a given class are continuous-valued, randomly corrupted versions of a single typical or prototype vector.
- Some properties of the Gaussian:
  - Analytically tractable.
  - Completely specified by the 1st and 2nd moments (see the sketch below).
  - Has the maximum entropy of all distributions with a given mean and variance.
  - Many processes are asymptotically Gaussian (Central Limit Theorem).
  - Linear transformations of a Gaussian are also Gaussian.
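A small sketch of the "completely specified by the 1st and 2nd moments" property: samples drawn from a univariate Gaussian are summarized by their sample mean and variance (the parameter values here are assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 0.7                     # assumed mean and standard deviation
    samples = rng.normal(mu, sigma, size=100_000)

    print(samples.mean(), samples.var())     # close to mu = 2.0 and sigma^2 = 0.49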
56  Univariate Gaussian
57  Univariate Gaussian
58  Multivariate Gaussian
59  Linear Transformations
60  Linear Transformations
61  Mahalanobis Distance
The Mahalanobis distance takes into account the covariance among the variables when calculating distance.
62  Mahalanobis Distance
- The squared Mahalanobis distance from x to the mean μ is r² = (x - μ)^T Σ⁻¹ (x - μ); it takes the covariance among the variables into account when calculating distance (see the sketch below).
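A sketch of the squared Mahalanobis distance r² = (x - μ)^T Σ⁻¹ (x - μ) in Python (the mean and covariance are assumed example values):

    import numpy as np

    mu = np.array([1.0, 2.0])                 # assumed class mean
    cov = np.array([[2.0, 0.5],               # assumed covariance matrix
                    [0.5, 1.0]])

    def mahalanobis_sq(x, mu, cov):
        d = x - mu
        return d @ np.linalg.inv(cov) @ d     # (x - mu)^T Sigma^{-1} (x - mu)

    x = np.array([2.0, 3.0])
    print(mahalanobis_sq(x, mu, cov), np.sum((x - mu) ** 2))   # compare with squared Euclidean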
63  Discriminant Functions for the Gaussian Density
Assume that the class-conditional density p(x | ωi) is multivariate normal, i.e., p(x | ωi) ~ N(μi, Σi).
64  Discriminant Functions for the Gaussian Density
65  The simplest case: Σi = σ²I
- The features are statistically independent, and
- each feature has the same variance σ².
66  The determinant of Σ
- |Σ| = σ^(2d), because Σ = σ²I is diagonal with d equal diagonal entries.
67  The inverse of Σ
- Σ⁻¹ = (1/σ²) I, because Σ = σ²I.
68  Combining these results
- |Σ| = σ^(2d) and Σ⁻¹ = (1/σ²) I are substituted into the discriminant function.
69  (No transcript)
70  The quadratic term x^T x is the same for all the discriminant functions, so we can omit it and obtain a discriminant that is linear in x.
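A sketch of the resulting linear discriminant for the Σi = σ²I case, gi(x) = (μi^T x)/σ² - (μi^T μi)/(2σ²) + ln P(ωi); the means, variance, and priors below are assumed example values:

    import numpy as np

    sigma2 = 1.5                                       # common variance sigma^2 (assumed)
    means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
    priors = [0.6, 0.4]

    def g(x, mu, prior):
        # Linear discriminant after dropping the x^T x term common to all classes.
        return (mu @ x) / sigma2 - (mu @ mu) / (2 * sigma2) + np.log(prior)

    x = np.array([1.0, 0.5])
    scores = [g(x, m, p) for m, p in zip(means, priors)]
    print("decide w%d" % (int(np.argmax(scores)) + 1))   # here -> w1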
71  (No transcript)
72  (No transcript)
73  (No transcript)
74  (No transcript)
75  References
- R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd ed., New York: John Wiley & Sons, 2001.
- Selim Aksoy, Pattern Recognition Course Materials, 2011.
- M. Narasimha Murty and V. Susheela Devi, Pattern Recognition: An Algorithmic Approach, Springer, 2011.