Title: Theory & Applications of Pattern Recognition
Dept. of Electrical and Computer Engineering, 0909.402.02 / 0909.504.04
Lecture 2

Probability Theory
- Random variables
- Probability mass function
- Joint probability
- Expected values
- Mean / variance / covariance
- Statistical independence
- Correlation
- Conditional probability
- Bayes rule
- Vector random variables
- Normal distribution
- Central limit theorem
- Gaussian derivatives
- Multivariate densities
Probability Theory
- Discrete random variables
- Probability mass function
- Cumulative mass function
- Expected value (average)
- Variance and standard deviation
- Pairs of random variables
- Joint probability
- Joint distribution
- Statistical independence
- Expectation for two variables
- Covariance / covariance matrix
- Correlation / correlation coefficient
The Law of Total Probability
- Conditional Probability
- Bayes Rule
- Prior probability
- Likelihood
- Evidence
- Posterior probability
Vector Random Variables / Continuous Random Variables
- Joint distribution for vector random variables
- Bayes rule for vector r.v.
- Expectation, mean vector, covariance matrix for vector r.v.
- Probability density function (pdf)
- Cumulative distribution function (cdf): the probability that the random variable X takes a value no greater than x, F_X(x) = P(X ≤ x)
- Distribution for a sum of random variables
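In symbols, the cumulative distribution function referred to above, and the density of a sum of two independent random variables (the convolution result the last bullet points to), are:

    F_X(x) = P(X \le x) = \int_{-\infty}^{x} p_X(t)\,dt,
    \qquad
    Z = X + Y \ (X, Y \text{ independent}) \;\Rightarrow\;
    p_Z(z) = \int_{-\infty}^{\infty} p_X(t)\, p_Y(z - t)\,dt .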
Gaussian Distribution
- Central Limit Theorem
- Gaussian (Normal) Distribution
- Mean and variance
- Standardizing Gaussian dist.
- Gaussian derivatives and integrals
- Error function
- Using Gaussian Distribution
- Using tables of Gaussian distribution
- Using MATLAB
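As a quick illustration of standardizing and evaluating a Gaussian probability, here is a minimal Python/SciPy sketch standing in for the table or MATLAB lookup mentioned above (the mean, standard deviation, and interval are made-up example values):

    import numpy as np
    from scipy.stats import norm
    from scipy.special import erf

    mu, sigma = 5.0, 2.0          # example parameters (assumed values)
    a, b = 4.0, 7.0               # want P(a < X < b) for X ~ N(mu, sigma^2)

    # Standardize: Z = (X - mu) / sigma ~ N(0, 1)
    za, zb = (a - mu) / sigma, (b - mu) / sigma

    # Using the standard normal cdf (what a z-table provides)
    p_table = norm.cdf(zb) - norm.cdf(za)

    # Equivalent expression through the error function:
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    p_erf = 0.5 * (erf(zb / np.sqrt(2)) - erf(za / np.sqrt(2)))

    print(p_table, p_erf)   # both ~ 0.5328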
Other Important Distributions
- Chi-square
- Poisson
- Binomial
- Beta
- Gamma
- Student's t-distribution
- F-distribution
Multivariate Gaussian Distribution
- Multivariate normal density function
- Mahalanobis distance
- Whitening Transform
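A minimal numerical sketch (Python/NumPy; the mean vector, covariance matrix, and test point are made-up example values) of the Mahalanobis distance and of a whitening transform A_w = Φ Λ^(-1/2) that maps the data to identity covariance:

    import numpy as np

    mu = np.array([1.0, 2.0])                    # example mean (assumed)
    Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # example covariance (assumed)
    x = np.array([2.0, 0.0])                     # example test point

    # Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
    d = x - mu
    r2 = d @ np.linalg.inv(Sigma) @ d

    # Whitening transform: columns of Phi are eigenvectors, lam the eigenvalues
    lam, Phi = np.linalg.eigh(Sigma)
    Aw = Phi @ np.diag(lam ** -0.5)

    # A whitened point y = Aw^T (x - mu) has identity covariance; its squared
    # Euclidean norm equals the squared Mahalanobis distance computed above.
    y = Aw.T @ d
    print(r2, y @ y)   # the two numbers agree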
Bayes Decision Theory
- The statistically best classifier
- Based on quantifying the trade-offs between various classification decisions using a probabilistic approach
- The theory assumes:
  - The decision problem can be posed in probabilistic terms
  - All relevant probability values are known (in practice this is not true)
Class Conditional Probabilities
P(x | ω2): the class-conditional probability (likelihood) for salmon. Given that a salmon has been observed, what is the probability that this salmon's lightness is between 11 and 12?
[Figure: class-conditional densities over the lightness feature; ω1 = sea bass, ω2 = salmon.]
Definitions: Bayes Decision Rule
- State of nature
- A priori probability (prior)
- A posteriori probability (posterior)
- Likelihood
- Evidence
Posterior Probabilities
- Bayes rule allows us to compute the posterior probability (difficult to determine directly) from the prior probabilities, the likelihood, and the evidence (all easier to determine).
Posterior probabilities for priors P(ω1) = 2/3 and P(ω2) = 1/3. For example, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is roughly 0.92. At every x, the posteriors sum to 1.0.
Bayes Decision Rule
- Choose the class that has the larger posterior probability!

Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i, i, j = 1, 2, …, c
P(error | x) = 1 − max_i P(ωi | x)  (for two classes this is min[P(ω1 | x), P(ω2 | x)])

If there are multiple features, x = [x1, x2, …, xd]^T, the rule is unchanged:
Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
P(error | x) = 1 − max_i P(ωi | x)
2.2 The Loss Function
- A mathematical description of how costly each action (making a class decision) is. Are certain mistakes more costly than others?

{ω1, ω2, …, ωc}: the set of states of nature (classes)
{α1, α2, …, αa}: the set of possible actions. Note that a need not equal c, because we may allow more (or fewer) actions than there are classes; for example, not making a decision is also an action.
{λ1, λ2, …, λa}: the losses associated with each action
λ(αi | ωj): the loss function, the loss incurred for taking action αi when the true state of nature is in fact ωj
R(αi | x): the conditional risk, the expected loss for taking action αi

The Bayes decision rule takes the action that minimizes the conditional risk!
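A minimal sketch of the conditional-risk computation R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x) (Python/NumPy; the loss matrix and posterior values are made-up examples, and the third "reject" action is added purely for illustration):

    import numpy as np

    # lam[i, j] = loss for taking action alpha_i when the true class is w_j
    lam = np.array([[ 0.0, 10.0],   # action 1: decide w1
                    [ 5.0,  0.0],   # action 2: decide w2
                    [ 1.0,  1.0]])  # action 3: reject (illustrative)

    posteriors = np.array([0.8, 0.2])   # P(w1|x), P(w2|x)  (example values)

    R = lam @ posteriors                # conditional risk R(alpha_i | x) for each action
    best = np.argmin(R)                 # Bayes decision: take the minimum-risk action
    print(R, best)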
Bayes Decision Rule Using Conditional Risk
- Compute the conditional risk R(αi | x) for each possible action
- Select the action that has the minimum conditional risk; let this be action αk
- The overall risk is then R = ∫ R(α(x) | x) p(x) dx
- This is the Bayes risk, the minimum possible risk that can be achieved by any classifier!
2.3 Minimum Error Rate Classification
- If we associate taking action αi with selecting class ωi, and if all errors are equally costly, we obtain the zero-one loss.
- This loss function assigns no loss to a correct classification and unit loss to a misclassification. The risk corresponding to this loss function is then R(αi | x) = 1 − P(ωi | x),
- which is precisely the average probability of error. Clearly, to minimize this risk we choose the class that maximizes the posterior probability, hence the Bayes rule!
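Written out, the zero-one loss and the resulting risk are:

    \lambda(\alpha_i \mid \omega_j) =
    \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}
    \qquad\Rightarrow\qquad
    R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x),

so minimizing the conditional risk is the same as maximizing the posterior P(ω_i | x).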
2.4 Discriminant Based Classification
- A discriminant is a function g_i(x) that discriminates between classes. The function assigns the input vector to a class according to its definition: choose class ωi if g_i(x) > g_j(x) for all j ≠ i.
- Bayes rule can be implemented in terms of discriminant functions, as written out below.

The discriminant functions divide the feature space into c decision regions, R1, …, Rc, separated by decision boundaries. Decision regions need NOT be contiguous. The decision boundary between regions Ri and Rj satisfies g_i(x) = g_j(x).
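For the minimum-error-rate case, any of the following (monotonically equivalent) choices serves as the discriminant:

    g_i(x) = P(\omega_i \mid x)
           = \frac{p(x \mid \omega_i)\,P(\omega_i)}{\sum_{j} p(x \mid \omega_j)\,P(\omega_j)},
    \qquad
    g_i(x) = p(x \mid \omega_i)\,P(\omega_i),
    \qquad
    g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i).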
The Normal Density
Discriminant Functions for the Normal Density
- If the likelihoods (class-conditional densities) are normally distributed, then a number of simplifications can be made. In particular, the discriminant function can be written in a greatly simplified form(!), given below.
- (cf. Section 2.4.1)
- There are three distinct cases that can occur.
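For p(x | ω_i) ~ N(μ_i, Σ_i), the log-form discriminant g_i(x) = ln p(x | ω_i) + ln P(ω_i) becomes:

    g_i(x) = -\tfrac{1}{2}(x - \mu_i)^{T}\Sigma_i^{-1}(x - \mu_i)
             - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).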
Case 1: Σi = σ²I
- Features are statistically independent, and all features have the same variance σ². The distributions are spherical in d dimensions, the boundary is a generalized hyperplane of d−1 dimensions, and the samples form equal-sized hyperspherical clusters. Examples of such hyperspherical clusters are shown in the figure.
Case 1: Σi = σ²I (continued)
- This case results in linear discriminants of the form given below.

Note how the priors shift the decision boundary away from the more likely mean!
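Dropping the terms that are constant across classes when Σi = σ²I, the discriminant reduces to the linear form

    g_i(x) = w_i^{T}x + w_{i0},
    \qquad
    w_i = \frac{\mu_i}{\sigma^{2}},
    \qquad
    w_{i0} = -\frac{\mu_i^{T}\mu_i}{2\sigma^{2}} + \ln P(\omega_i),

and the boundary between ωi and ωj is a hyperplane orthogonal to the line joining the means, passing through

    x_0 = \tfrac{1}{2}(\mu_i + \mu_j)
          - \frac{\sigma^{2}}{\lVert \mu_i - \mu_j \rVert^{2}}
            \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j);

with equal priors, x_0 lies halfway between the two means.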
Case 2: Σi = Σ (equal covariance matrices)
Covariance matrices are arbitrary but equal to each other for all classes. The samples then form hyperellipsoidal clusters of equal size and shape. This also results in linear discriminant functions whose decision boundaries are again hyperplanes, as written out below.
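With a shared covariance Σ, the discriminant is again linear:

    g_i(x) = w_i^{T}x + w_{i0},
    \qquad
    w_i = \Sigma^{-1}\mu_i,
    \qquad
    w_{i0} = -\tfrac{1}{2}\mu_i^{T}\Sigma^{-1}\mu_i + \ln P(\omega_i);

the hyperplane separating Ri and Rj passes through

    x_0 = \tfrac{1}{2}(\mu_i + \mu_j)
          - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}
                 {(\mu_i - \mu_j)^{T}\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j),

but it is generally not orthogonal to the line between the means.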
Case 3: Σi arbitrary
All bets are off! In the two-class case the decision boundaries are hyperquadrics. The discriminant functions are now, in general, quadratic (not linear), as written out below.
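With arbitrary Σi, the quadratic term no longer cancels and the discriminant keeps its full quadratic form:

    g_i(x) = x^{T}W_i\,x + w_i^{T}x + w_{i0},
    \qquad
    W_i = -\tfrac{1}{2}\Sigma_i^{-1},
    \qquad
    w_i = \Sigma_i^{-1}\mu_i,

    w_{i0} = -\tfrac{1}{2}\mu_i^{T}\Sigma_i^{-1}\mu_i
             - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).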
Case 3: Σi arbitrary (continued)
For the multi-class case the boundaries can look even more complicated.
[Figure: example decision boundaries.]
Case 3: Σi arbitrary (continued)
[Figure: decision boundaries in 3-D.]
Error Probabilities
In a two-class case there are two sources of error: x falls in R1 while the true state of nature is ω2, or vice versa.
[Figure: xB marks the optimal Bayes decision point; a non-optimal decision point is also shown, and the shaded area is P(error).]
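For two classes the total error splits over the two decision regions:

    P(\text{error}) = P(x \in R_2,\, \omega_1) + P(x \in R_1,\, \omega_2)
                    = \int_{R_2} p(x \mid \omega_1)\,P(\omega_1)\,dx
                    + \int_{R_1} p(x \mid \omega_2)\,P(\omega_2)\,dx,

which is minimized by the Bayes decision boundary x_B.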
Bayes Error
- Since the decision rule determines P(error | x) at every x, minimizing P(error | x) for every x minimizes the overall P(error); the resulting minimum is the Bayes error.
- (cf. pp. 54, Eq. (71): P(error) is obtained by averaging P(error | x) over all x.)
Error Bounds
It is difficult, if possible at all, to compute the error probabilities analytically, particularly when the decision regions are not contiguous. However, upper bounds for this error can be obtained: the Chernoff bound and its approximation, the Bhattacharyya bound, are two such bounds that are often used. If the distributions are Gaussian, these expressions are relatively easy to compute, so even non-Gaussian cases are often treated as Gaussian.
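The bounds referred to here have the form (Chernoff bound, tightened by minimizing over β; the Bhattacharyya bound is the β = 1/2 special case):

    P(\text{error}) \le P(\omega_1)^{\beta} P(\omega_2)^{1-\beta}
      \int p(x \mid \omega_1)^{\beta}\, p(x \mid \omega_2)^{1-\beta}\,dx,
      \qquad 0 \le \beta \le 1,

and for Gaussian densities the β = 1/2 exponent is

    k(1/2) = \tfrac{1}{8}(\mu_2 - \mu_1)^{T}
             \Big[\tfrac{\Sigma_1 + \Sigma_2}{2}\Big]^{-1}(\mu_2 - \mu_1)
           + \tfrac{1}{2}\ln\frac{\big|\tfrac{\Sigma_1+\Sigma_2}{2}\big|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}},
    \qquad
    P(\text{error}) \le \sqrt{P(\omega_1)P(\omega_2)}\; e^{-k(1/2)}.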
2.7 Error Probabilities and Integrals
2.8 Error Bounds for Normal Densities
2.8.1 Chernoff Bound
2.8.2 Bhattacharyya Bound
2.8.3 Signal Detection Theory and Operating Characteristics
Receiver operating characteristic (ROC)
Can be used to calculate the Bayes error rate
2.9 Bayes Decision Theory: Discrete Features
Bayes formula
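With discrete-valued features, the integrals over x become sums and the Bayes formula reads:

    P(\omega_j \mid x) = \frac{P(x \mid \omega_j)\,P(\omega_j)}
                              {\sum_{k=1}^{c} P(x \mid \omega_k)\,P(\omega_k)} .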
2.9.1 Independent Binary Features
Relevance of a yes answer for xi in determining
the classification
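In the standard two-class formulation, with p_i = P(x_i = 1 | ω_1) and q_i = P(x_i = 1 | ω_2), the discriminant is linear in the x_i, and the weight w_i is exactly the relevance of a "yes" answer for x_i mentioned above:

    g(x) = \sum_{i=1}^{d} w_i x_i + w_0, \qquad
    w_i = \ln\frac{p_i(1 - q_i)}{q_i(1 - p_i)}, \qquad
    w_0 = \sum_{i=1}^{d} \ln\frac{1 - p_i}{1 - q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)},

with the decision: choose ω_1 if g(x) > 0, otherwise ω_2.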
2.10.1 Missing Features
Marginal distribution
→ Integrate the posterior probability over the bad (missing) features, as written out below
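Writing the feature vector as x = (x_g, x_b) with good and bad (missing) parts, the marginalization reads:

    P(\omega_i \mid x_g)
      = \frac{\int P(\omega_i \mid x_g, x_b)\, p(x_g, x_b)\, dx_b}
             {\int p(x_g, x_b)\, dx_b}
      = \frac{\int g_i(x)\, p(x)\, dx_b}{\int p(x)\, dx_b},

where g_i(x) = P(ω_i | x) is the usual discriminant and the integral runs over the missing features only.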
2.10.2 Noisy Features
Assumption: if the true feature value were known, the observed noisy measurement would be independent of the class and of the other (clean) features.
2.11 Bayesian Belief Networks
Useful in the case where we seek to determine the probability of some particular configuration of variables, given the evidence (the values of the other, observed variables).
2.12 Compound Bayesian Decision Theory and Context
Exploit statistical dependence among the patterns to gain improved performance → by using context
- Compound decision problem
- Sequential compound decision problem
2.12 Compound Bayesian Decision Theory and Context (continued)
The posterior probability of the compound state ω = (ω(1), …, ω(n)) given the observation sequence X = (x1, …, xn):
P(ω | X) = p(X | ω) P(ω) / p(X)
→ The optimal procedure is to minimize the compound conditional risk.
If there is no loss for being correct and all errors are equally costly, the procedure is to compute P(ω | X) for all ω and select the ω for which the posterior probability is maximum.
In practice this is an enormous task (there are c^n possible values of ω); the prior P(ω) captures the dependence between the states, i.e. the context.