Bayesian Decision Theory (Sections 2.1-2.2) - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Bayesian Decision Theory (Sections 2.1-2.2)


1
Bayesian Decision Theory (Sections 2.1-2.2)
  • Decision problem posed in probabilistic terms
  • Bayesian Decision Theory: Continuous Features
  • All the relevant probability values are known

2
(No Transcript)
3
(No Transcript)
4
(No Transcript)
5
Probability Density
6
Course Outline
  • MODEL INFORMATION: COMPLETE
    • Bayes Decision Theory
      • Optimal Rules
  • MODEL INFORMATION: INCOMPLETE
    • Supervised Learning
      • Parametric Approach
        • Plug-in Rules
      • Nonparametric Approach
        • Density Estimation, Geometric Rules (K-NN, MLP)
    • Unsupervised Learning
      • Parametric Approach
        • Mixture Resolving
      • Nonparametric Approach
        • Cluster Analysis (Hard, Fuzzy)
7
Introduction
  • From the sea bass vs. salmon example to the
    abstract decision-making problem
  • State of nature and a priori (prior) probability
  • State of nature (which type of fish will be
    observed next) is unpredictable, so it is a
    random variable
  • The catch of salmon and sea bass is equiprobable
  • P(ω1) = P(ω2) (uniform priors)
  • P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
  • Prior prob. reflects our prior knowledge about
    how likely we are to observe a sea bass or
    salmon; these probabilities may depend on time of
    the year or the fishing area!

8
  • Bayes decision rule with only the prior
    information
  • Decide ω1 if P(ω1) > P(ω2), otherwise decide ω2
  • Error rate = min{P(ω1), P(ω2)}
  • Suppose now we have a measurement or feature on
    the state of nature - say the fish lightness
    value
  • Use of the class-conditional probability density
  • p(x | ω1) and p(x | ω2) describe the difference
    in the lightness feature between populations of
    sea bass and salmon

9
The amount of overlap between the densities
determines how good the feature is
10
  • Maximum likelihood decision rule
  • Assign input pattern x to class ω1 if
  • p(x | ω1) > p(x | ω2), otherwise ω2
  • How does the feature x influence our attitude
    (prior) concerning the true state of nature?
  • Bayes decision rule

11
  • A posteriori probability, likelihood, evidence
  • P(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj)
  • Bayes formula
  • P(ωj | x) = p(x | ωj) P(ωj) / p(x)
  • where
  • Posterior = (Likelihood × Prior) / Evidence
  • Evidence p(x) can be viewed as a scale factor
    that guarantees that the posterior probabilities
    sum to 1
  • p(x | ωj) is called the likelihood of ωj with
    respect to x; the category ωj for which p(x | ωj)
    is large is more likely to be the true
    category (a small numerical sketch follows)
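A minimal Python sketch of Bayes formula for a two-class case; the likelihood and prior values below are assumed for illustration, not taken from the slides.

```python
import numpy as np

# Hypothetical class-conditional densities evaluated at an observed lightness x
likelihoods = np.array([0.6, 0.2])   # p(x | omega_1), p(x | omega_2)  (assumed values)
priors      = np.array([2/3, 1/3])   # P(omega_1), P(omega_2)          (assumed values)

evidence   = np.sum(likelihoods * priors)        # p(x), the scale factor
posteriors = likelihoods * priors / evidence     # P(omega_j | x), sums to 1

print(posteriors, posteriors.sum())              # e.g. [0.857 0.143] 1.0
decision = np.argmax(posteriors) + 1             # Bayes rule: pick the larger posterior
print(f"decide omega_{decision}")
```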

12
(No Transcript)
13
  • P(ω1 | x) is the probability of the state of
    nature being ω1 given that feature value x has
    been observed
  • Decision based on the posterior probabilities is
    called the optimal Bayes decision rule
  • For a given observation (feature value) x:
  • if P(ω1 | x) > P(ω2 | x), decide ω1
  • if P(ω1 | x) < P(ω2 | x), decide ω2
  • To justify the above rule, calculate the
    probability of error:
  • P(error | x) = P(ω1 | x) if we decide ω2
  • P(error | x) = P(ω2 | x) if we decide ω1

14
  • So, for a given x, we can minimize the prob. of
    error: decide ω1 if
  • P(ω1 | x) > P(ω2 | x), otherwise decide ω2
  • Therefore
  • P(error | x) = min{P(ω1 | x), P(ω2 | x)}
  • Thus, for each observation x, the Bayes decision
    rule minimizes the probability of error
  • The unconditional error P(error) is obtained by
    integrating P(error | x) over all x w.r.t. p(x)

15
  • Optimal Bayes decision rule
  • Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise
    decide ω2
  • Special cases
  • (i) P(ω1) = P(ω2): decide ω1 if
  • p(x | ω1) > p(x | ω2), otherwise ω2
  • (ii) p(x | ω1) = p(x | ω2): decide ω1 if
  • P(ω1) > P(ω2), otherwise ω2

16
Bayesian Decision Theory: Continuous Features
  • Generalization of the preceding formulation
  • Use of more than one feature (d features)
  • Use of more than two states of nature (c classes)
  • Allowing other actions besides deciding on the
    state of nature
  • Introduce a loss function which is more general
    than the probability of error

17
  • Allowing actions other than classification
    primarily allows the possibility of rejection
  • Refusing to make a decision when it is difficult
    to decide between two classes or in noisy cases!
  • The loss function specifies the cost of each
    action

18
  • Let {ω1, ω2, …, ωc} be the set of c states of
    nature (or categories)
  • Let {α1, α2, …, αa} be the set of a possible
    actions
  • Let λ(αi | ωj) be the loss incurred for taking
    action αi when the true state of nature is ωj
  • General decision rule α(x) specifies which action
    to take for every possible observation x

19
  • Conditional risk: R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x),
    summing over j = 1, …, c
  • Overall risk
  • R = expected value of R(α(x) | x) w.r.t. p(x)
  • Minimizing R: minimize R(αi | x) for i =
    1, …, a


For a given x, suppose we take the action αi;
if the true state is ωj, we will incur the loss
λ(αi | ωj). P(ωj | x) is the prob. that the true
state is ωj. But any one of the c states is
possible for the given x, so we average the loss
over the states (see the sketch below).
Conditional risk: R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x)
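A minimal sketch of the conditional-risk computation; the loss matrix and posteriors below are made-up numbers, not from the slides.

```python
import numpy as np

# Hypothetical loss matrix: loss[i, j] = lambda(alpha_i | omega_j)
# rows = actions, columns = true states (values assumed for illustration)
loss = np.array([[0.0, 2.0],    # action alpha_1
                 [1.0, 0.0]])   # action alpha_2

posteriors = np.array([0.3, 0.7])     # P(omega_j | x), assumed

cond_risk = loss @ posteriors          # R(alpha_i | x) = sum_j lambda(alpha_i | omega_j) P(omega_j | x)
best_action = np.argmin(cond_risk)     # Bayes: take the action with minimum conditional risk
print(cond_risk, "-> take action", best_action + 1)
```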
20
  • Select the action αi for which R(αi | x) is
    minimum
  • The overall risk R is then minimized,
    and the resulting risk is called the Bayes risk;
    it is the best performance that can be achieved!

21
  • Two-category classification
  • α1: deciding ω1
  • α2: deciding ω2
  • λij = λ(αi | ωj)
  • loss incurred for deciding ωi when the true state
    of nature is ωj
  • Conditional risk
  • R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  • R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

22
  • The Bayes decision rule is stated as
  • if R(α1 | x) < R(α2 | x),
  • take action α1: decide ω1
  • This results in the equivalent rule
  • decide ω1 if
  • (λ21 - λ11) p(x | ω1) P(ω1) >
  • (λ12 - λ22) p(x | ω2) P(ω2)
  • and decide ω2 otherwise

23
  • Likelihood ratio
  • The preceding rule is equivalent to the following
    rule: if p(x | ω1) / p(x | ω2) >
    [(λ12 - λ22) P(ω2)] / [(λ21 - λ11) P(ω1)],
  • then take action α1 (decide ω1); otherwise take
    action α2 (decide ω2) (see the sketch below)
  • Note that the posterior probabilities are scaled
    by the loss differences.
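A minimal sketch of the likelihood-ratio rule; the loss matrix, priors and likelihood values are assumed for illustration.

```python
import numpy as np

def decide(px_w1, px_w2, prior1, prior2, lam):
    """Two-category Bayes rule via the likelihood ratio.
    lam[i][j] = loss for deciding omega_(i+1) when the truth is omega_(j+1)."""
    threshold = ((lam[0][1] - lam[1][1]) * prior2) / ((lam[1][0] - lam[0][0]) * prior1)
    return 1 if px_w1 / px_w2 > threshold else 2

# Assumed numbers: 0-1 loss and equal priors -> threshold = 1 (maximum likelihood rule)
lam01 = [[0.0, 1.0], [1.0, 0.0]]
print(decide(px_w1=0.6, px_w2=0.2, prior1=0.5, prior2=0.5, lam=lam01))   # -> 1
```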

24
  • Interpretation of the Bayes decision rule
  • If the likelihood ratio of class ω1 and class ω2
    exceeds a threshold value (that is independent of
    the input pattern x), the optimal action is to
    decide ω1
  • Maximum likelihood decision rule: the threshold
    value is 1 under a 0-1 loss function and equal
    class prior probabilities

25
Bayesian Decision Theory (Sections 2.3-2.5)
  • Minimum Error Rate Classification
  • Classifiers, Discriminant Functions and Decision
    Surfaces
  • The Normal Density

26
Minimum Error Rate Classification
  • Actions are decisions on classes
  • If action αi is taken and the true state of
    nature is ωj, then
  • the decision is correct if i = j and in error if
    i ≠ j
  • Seek a decision rule that minimizes the
    probability of error or the error rate

27
  • Zero-one (0-1) loss function: no loss for a
    correct decision and a unit loss for any error
  • The conditional risk can now be simplified as
  • R(αi | x) = Σ_{j≠i} P(ωj | x) = 1 - P(ωi | x)
  • The risk corresponding to the 0-1 loss function
    is the average probability of error

28
  • Minimizing the risk requires maximizing the
    posterior probability P(ωi | x) since
  • R(αi | x) = 1 - P(ωi | x)
  • For minimum error rate:
  • Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i

29
  • Decision boundaries and decision regions
  • If λ is the 0-1 loss function, then the threshold
    involves only the priors

30
(No Transcript)
31
Classifiers, Discriminant Functions and Decision
Surfaces
  • Many different ways to represent pattern
    classifiers; one of the most useful is in terms
    of discriminant functions
  • The multi-category case
  • Set of discriminant functions gi(x), i = 1, …, c
  • Classifier assigns a feature vector x to class ωi
    if
  • gi(x) > gj(x) for all j ≠ i

32
Network Representation of a Classifier
33
  • The Bayes classifier can be represented in this
    way, but the choice of discriminant function is
    not unique
  • gi(x) = -R(αi | x)
  • (max. discriminant corresponds to min. risk!)
  • For the minimum error rate, we take
  • gi(x) = P(ωi | x)
  • (max. discriminant corresponds to max.
    posterior!)
  • gi(x) = p(x | ωi) P(ωi)
  • gi(x) = ln p(x | ωi) + ln P(ωi)
  • (ln: natural logarithm!) A small sketch follows.
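A minimal sketch of the log-form discriminant gi(x) = ln p(x | ωi) + ln P(ωi); the 1-D Gaussian class models and priors below are assumed, not from the slides.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """ln of the univariate normal density N(mu, sigma^2) at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

means, sigmas, priors = [2.0, 4.0], [1.0, 1.0], [0.6, 0.4]   # assumed parameters

def g(x, i):
    # g_i(x) = ln p(x | omega_i) + ln P(omega_i)
    return log_gaussian(x, means[i], sigmas[i]) + np.log(priors[i])

x = 3.2
print("assign x to omega_%d" % (1 + np.argmax([g(x, i) for i in range(2)])))
```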

34
  • The effect of any decision rule is to divide the
    feature space into c decision regions
  • if gi(x) > gj(x) for all j ≠ i, then x is in Ri
  • (region Ri means: assign x to ωi)
  • The two-category case
  • Here a classifier is a dichotomizer that has
    two discriminant functions g1 and g2
  • Let g(x) ≡ g1(x) - g2(x)
  • Decide ω1 if g(x) > 0; otherwise decide ω2

35
  • So, a dichotomizer computes a single
    discriminant function g(x) and classifies x
    according to whether g(x) is positive or not.
  • Computation of g(x) = g1(x) - g2(x)

36
(No Transcript)
37
The Normal Density
  • Univariate density N(μ, σ²)
  • The normal density is analytically tractable
  • Continuous density
  • A number of processes are asymptotically Gaussian
  • Patterns (e.g., handwritten characters, speech
    signals) can be viewed as randomly corrupted
    versions of a single typical or prototype pattern
    (Central Limit Theorem)
  • p(x) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))
  • where
  • μ = mean (or expected value) of x
  • σ² = variance (or expected squared
    deviation) of x

38
(No Transcript)
39
  • Multivariate density N(μ, Σ)
  • Multivariate normal density in d dimensions:
  • p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2)))
    exp(-(1/2) (x - μ)^t Σ^-1 (x - μ))
  • where
  • x = (x1, x2, …, xd)^t (t stands for
    the transpose of a vector)
  • μ = (μ1, μ2, …, μd)^t = mean vector
  • Σ = d×d covariance matrix
  • |Σ| and Σ^-1 are the determinant and
    inverse of Σ, respectively
  • The covariance matrix is always symmetric and
    positive semidefinite; we assume Σ is positive
    definite so the determinant of Σ is strictly
    positive
  • The multivariate normal density is completely
    specified by d + d(d+1)/2 parameters
  • If variables x1 and x2 are statistically
    independent then the covariance of x1 and x2
    is zero. (A small sketch of evaluating this
    density follows.)
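A minimal sketch of evaluating the multivariate normal density; the mean, covariance and test point are assumed values.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density N(mu, Sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / (np.power(2 * np.pi, d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Assumed 2-D example
mu    = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, Sigma))
```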

40
Multivariate Normal density
Samples drawn from a normal population tend to
fall in a single cloud or cluster; the cluster
center is determined by the mean vector and the
shape by the covariance matrix. The loci of points
of constant density are hyperellipsoids whose
principal axes are the eigenvectors of Σ.
41
Transformation of Normal Variables
Linear combinations of jointly normally
distributed random variables are normally
distributed. A coordinate transformation can
convert an arbitrary multivariate normal
distribution into a spherical one (see the sketch
below).
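A minimal sketch of such a transformation: the whitening map built from the eigenvectors and eigenvalues of Σ turns samples from N(μ, Σ) into samples with identity covariance. The covariance matrix below is an assumed example.

```python
import numpy as np

# Whitening: A_w = Phi Lambda^(-1/2) turns N(mu, Sigma) into a spherical
# (identity-covariance) Gaussian.  Sigma below is an assumed example.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

eigvals, Phi = np.linalg.eigh(Sigma)          # Sigma = Phi diag(eigvals) Phi^t
A_w = Phi @ np.diag(eigvals ** -0.5)          # whitening transform

rng = np.random.default_rng(0)
x = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=5000)
y = x @ A_w                                   # transformed samples
print(np.cov(y.T))                            # approximately the identity matrix
```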
42
Bayesian Decision Theory (Sections 2.6-2.9)
  • Discriminant Functions for the Normal Density
  • Bayes Decision Theory: Discrete Features

43
Discriminant Functions for the Normal Density
  • The minimum error-rate classification can be
    achieved by the discriminant function
  • gi(x) = ln p(x | ωi) + ln P(ωi)
  • In the case of multivariate normal densities
  • gi(x) = -(1/2)(x - μi)^t Σi^-1 (x - μi)
    - (d/2) ln 2π - (1/2) ln |Σi| + ln P(ωi)

44
  • Case 1: Σi = σ²I (I is the identity matrix)
  • Features are statistically independent and each
    feature has the same variance (see the sketch
    below)
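A minimal sketch of the resulting linear discriminant for this case, gi(x) = wi·x + wi0 with wi = μi/σ² and wi0 = -μi·μi/(2σ²) + ln P(ωi); the means, variance and priors below are assumed values.

```python
import numpy as np

# Case Sigma_i = sigma^2 I: the discriminant reduces to a linear machine
#   g_i(x) = w_i . x + w_i0,  w_i = mu_i / sigma^2,
#   w_i0 = -mu_i . mu_i / (2 sigma^2) + ln P(omega_i)
sigma2 = 1.0
means  = np.array([[0.0, 0.0], [3.0, 3.0]])   # assumed class means
priors = np.array([0.5, 0.5])                 # assumed priors

W  = means / sigma2
w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)

x = np.array([2.5, 2.0])
scores = W @ x + w0
print("decide omega_%d" % (np.argmax(scores) + 1))
```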

45
  • A classifier that uses linear discriminant
    functions is called a linear machine
  • The decision surfaces for a linear machine are
    pieces of hyperplanes defined by the linear
    equations
  • gi(x) = gj(x)

46
  • The hyperplane separating Ri and Rj
    is orthogonal to the line linking the means!

47
(No Transcript)
48
(No Transcript)
49
(No Transcript)
50
  • Case 2: Σi = Σ (covariance matrices of all
    classes are identical but otherwise arbitrary!)
  • Hyperplane separating Ri and Rj
  • The hyperplane separating Ri and Rj is generally
    not orthogonal to the line between the means!
  • To classify a feature vector x, measure the
    squared Mahalanobis distance from x to each of
    the c means; assign x to the category of the
    nearest mean (see the sketch below)
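A minimal sketch of this nearest-mean rule under squared Mahalanobis distance (equal priors assumed); the shared covariance and means are made-up values.

```python
import numpy as np

# Case Sigma_i = Sigma: classify x to the class whose mean is nearest in
# squared Mahalanobis distance (equal priors assumed; parameters are made up).
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
means = np.array([[0.0, 0.0], [2.0, 1.0], [0.0, 3.0]])   # c = 3 assumed means

Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x, mu):
    d = x - mu
    return d @ Sigma_inv @ d

x = np.array([1.0, 1.5])
dists = [mahalanobis_sq(x, mu) for mu in means]
print("assign x to omega_%d" % (1 + int(np.argmin(dists))))
```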

51
(No Transcript)
52
(No Transcript)
53
Discriminant Functions for 1D Gaussian
54
  • Case 3: Σi arbitrary
  • The covariance matrices are different for each
    category
  • In the 2-category case, the decision surfaces
    are hyperquadrics that can assume any of the
    general forms: hyperplanes, pairs of hyperplanes,
    hyperspheres, hyperellipsoids, hyperparaboloids,
    hyperhyperboloids

55
Discriminant Functions for the Normal Density
56
(No Transcript)
57
Discriminant Functions for the Normal Density
58
Discriminant Functions for the Normal Density
59
Decision Regions for Two-Dimensional Gaussian Data
60
Error Probabilities and Integrals
  • 2-class problem
  • There are two types of errors
  • Multi-class problem
  • Simpler to compute the prob. of being correct
    (more ways to be wrong than to be right); see the
    sketch below
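A minimal 1-D sketch of computing the 2-class error probability by numerical integration; under the Bayes rule the error contributed at each x is the smaller of the two joint densities. All parameters are assumed.

```python
import numpy as np

# 1-D, two-class sketch: estimate P(error) by numerically integrating the
# smaller of the two joint densities p(x | omega_i) P(omega_i).
mu, sigma, priors = [0.0, 2.5], [1.0, 1.0], [0.5, 0.5]

def joint(x, i):   # p(x | omega_i) * P(omega_i)
    return priors[i] * np.exp(-0.5 * ((x - mu[i]) / sigma[i])**2) / (sigma[i] * np.sqrt(2 * np.pi))

xs = np.linspace(-10, 12, 20001)
p_error = np.trapz(np.minimum(joint(xs, 0), joint(xs, 1)), xs)
print(p_error)     # for these numbers, about 0.1056
```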

61
Error Probabilities and Integrals
Bayes optimal decision boundary in 1-D case
62
Error Bounds for Normal Densities
  • The exact calculation of the error for the
    general Gaussian case (case 3) is extremely
    difficult
  • However, in the 2-category case the general error
    can be approximated analytically to give us an
    upper bound on the error

63
Error Rate of Linear Discriminant Function (LDF)
  • Assume a 2-class problem
  • Due to the symmetry of the problem (identical Σ),
    the two types of errors are identical

64
Error Rate of LDF
  • Let g(x) be the linear discriminant
    (log-likelihood ratio)
  • Compute the expected values and variances of g(x)
    when x is drawn from ω1 and from ω2,
  • where r² is the
    squared Mahalanobis distance between the two
    class means (see the sketch below)
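A minimal sketch of the resulting error rate under the usual assumptions (equal priors, shared covariance): the optimal LDF has error Φ(-r/2), where r is the Mahalanobis distance between the class means. The parameters below are made up.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):                      # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Assumed parameters: shared covariance, equal priors
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.5]])

diff = mu2 - mu1
r = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)   # Mahalanobis distance between means
print("P(error) =", Phi(-r / 2))                  # error rate of the optimal LDF
```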

65
Error Rate of LDF
  • Similarly

66
Error Rate of LDF
67
Error Rate of LDF
68
Error Rate of LDF
69
Chernoff Bound
  • To derive a bound for the error, we need the
    following inequality: min[a, b] ≤ a^β b^(1-β)
    for a, b ≥ 0 and 0 ≤ β ≤ 1

Assume the class-conditional densities are normal;
the bound can then be evaluated analytically in
terms of a quantity k(β) that depends on the means
and covariance matrices.
70
Chernoff Bound
The Chernoff bound for P(error) is found by
determining the value of β that minimizes
exp(-k(β))
71
Error Bounds for Normal Densities
  • Bhattacharyya Bound
  • Assume β = 1/2
  • computationally simpler
  • slightly less tight bound
  • Now, Eq. (73) has the form

When the two covariance matrices are equal,
k(1/2) is proportional to the squared Mahalanobis
distance between the two means (see the sketch
below)
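A minimal sketch of the Bhattacharyya bound for two Gaussian classes, using the standard form k(1/2) = (1/8)(μ2-μ1)^t[(Σ1+Σ2)/2]^-1(μ2-μ1) + (1/2) ln(|(Σ1+Σ2)/2| / √(|Σ1||Σ2|)) and P(error) ≤ √(P(ω1)P(ω2)) exp(-k(1/2)); the parameters below are assumed, not the 2-D data from the next slide.

```python
import numpy as np

def bhattacharyya_bound(mu1, Sigma1, mu2, Sigma2, p1, p2):
    """Bhattacharyya upper bound on the Bayes error for two Gaussian classes."""
    Sigma_avg = 0.5 * (Sigma1 + Sigma2)
    diff = mu2 - mu1
    k_half = (diff @ np.linalg.inv(Sigma_avg) @ diff) / 8.0 \
             + 0.5 * np.log(np.linalg.det(Sigma_avg)
                            / np.sqrt(np.linalg.det(Sigma1) * np.linalg.det(Sigma2)))
    return np.sqrt(p1 * p2) * np.exp(-k_half)

# Assumed 2-D parameters for illustration
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1 = np.array([[2.0, 0.5], [0.5, 2.0]])
S2 = np.array([[1.0, 0.0], [0.0, 1.0]])
print(bhattacharyya_bound(mu1, S1, mu2, S2, 0.5, 0.5))
```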
72
Error Bounds for Gaussian Distributions
(2-category, 2-D data)
Best Chernoff error bound: 0.008190
Bhattacharyya bound (β = 1/2): 0.008191
True error using numerical integration: 0.0021
73
Neyman-Pearson Rule
(From Classification, Estimation and Pattern
Recognition by Young and Calvert)
74
Neyman-Pearson Rule
75
Neyman-Pearson Rule
76
Neyman-Pearson Rule
77
Neyman-Pearson Rule
78
Neyman-Pearson Rule
79
Signal Detection Theory
  • We are interested in detecting a single weak
    pulse, e.g., a radar reflection; the internal
    signal x in the detector has mean m1 (m2) when
    the pulse is absent (present)

The detector uses a threshold x to determine the
presence of the pulse
Discriminability: ease of determining whether the
pulse is present or not
For a given threshold, define hit, false alarm,
miss and correct rejection
80
Receiver Operating Characteristic (ROC)
  • Experimentally compute hit and false alarm rates
    for fixed x
  • Changing x will change the hit and false alarm
    rates
  • A plot of hit and false alarm rates is called the
    ROC curve

Performance shown at different operating points
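A minimal sketch of building an empirical ROC by sweeping the detector threshold over samples drawn under "pulse absent" and "pulse present"; the means, standard deviation and sample sizes are assumed for illustration.

```python
import numpy as np

# Sweep the threshold over internal-signal samples under the two conditions.
rng = np.random.default_rng(0)
m1, m2, sigma = 0.0, 2.0, 1.0                    # assumed means and spread
absent  = rng.normal(m1, sigma, 10000)           # pulse absent
present = rng.normal(m2, sigma, 10000)           # pulse present

thresholds = np.linspace(-4, 6, 101)
hit_rate         = [(present > t).mean() for t in thresholds]   # P(x > t | present)
false_alarm_rate = [(absent  > t).mean() for t in thresholds]   # P(x > t | absent)

for t, h, fa in zip(thresholds[::25], hit_rate[::25], false_alarm_rate[::25]):
    print(f"threshold {t:5.2f}:  hit {h:.2f}  false alarm {fa:.2f}")
```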
81
Operating Characteristic
  • In practice, distributions may not be Gaussian
    and will be multidimensional; the ROC curve can
    still be plotted
  • Vary a single control parameter for the decision
    rule and plot the resulting hit and false alarm
    rates

82
Bayes Decision Theory: Discrete Features
  • Components of x are binary or integer valued; x
    can take only one of m discrete values
  • v1, v2, …, vm
  • Case of independent binary features for the
    2-category problem
  • Let x = [x1, x2, …, xd]^t where each xi is
    either 0 or 1, with probabilities
  • pi = P(xi = 1 | ω1)
  • qi = P(xi = 1 | ω2)

83
  • The discriminant function in this case is
    g(x) = Σ_i wi xi + w0, where
    wi = ln[pi(1 - qi) / (qi(1 - pi))] and
    w0 = Σ_i ln[(1 - pi) / (1 - qi)] + ln[P(ω1) / P(ω2)];
    decide ω1 if g(x) > 0 (see the sketch below)
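A minimal sketch of this discriminant, using the same pi = 0.8, qi = 0.5 and equal priors as the 3-D example on the next slide; with these numbers each weight wi comes out to ln 4 ≈ 1.3863.

```python
import numpy as np

# Discriminant for d independent binary features (two categories):
#   g(x) = sum_i w_i x_i + w_0,  decide omega_1 if g(x) > 0.
p = np.array([0.8, 0.8, 0.8])   # p_i = P(x_i = 1 | omega_1)
q = np.array([0.5, 0.5, 0.5])   # q_i = P(x_i = 1 | omega_2)
P1 = P2 = 0.5                   # equal priors

w  = np.log(p * (1 - q) / (q * (1 - p)))                  # each w_i = ln 4 ~ 1.3863
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)  # bias term from the formula

def g(x):
    return w @ np.asarray(x) + w0

print(w, w0)
print("x = [1,1,0] ->", "omega_1" if g([1, 1, 0]) > 0 else "omega_2")
```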

84
Bayesian Decision for Three-dimensional Binary
Data
  • Consider a 2-class problem with three
    independent binary features; class priors are
    equal, pi = 0.8 and qi = 0.5, i = 1, 2, 3
  • wi = 1.3863
  • w0 = 1.2
  • The decision surface g(x) = 0 is shown below

Decision boundary for 3-D binary features. The left
figure shows the case when pi = 0.8 and qi = 0.5.
The right figure shows the case when p3 = q3
(feature 3 is not providing any discriminatory
information), so the decision surface is parallel
to the x3 axis.
85
Handling Missing Features
  • Suppose it is not possible to measure a certain
    feature for a given pattern
  • Possible solutions
  • Reject the pattern
  • Approximate the missing feature
  • Mean of all the available values for the missing
    feature
  • Marginalize over the distribution of the missing
    feature (see the sketch below)
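A minimal sketch of classifying with a missing feature by marginalizing it out; the binary-feature model and all numbers below are assumed for illustration.

```python
import numpy as np

# Independent binary features as on the earlier slides; values are assumed.
p = np.array([0.8, 0.7, 0.6])   # P(x_i = 1 | omega_1)
q = np.array([0.5, 0.4, 0.3])   # P(x_i = 1 | omega_2)
priors = np.array([0.5, 0.5])

def likelihood(x, probs):
    """P(x | class) for a fully observed binary vector x."""
    return np.prod(np.where(np.array(x) == 1, probs, 1 - probs))

# Observed x1 = 1, x2 = 0; feature x3 is missing -> sum over x3 in {0, 1}
post = np.zeros(2)
for x3 in (0, 1):
    x = [1, 0, x3]
    post += priors * np.array([likelihood(x, p), likelihood(x, q)])
post /= post.sum()              # P(omega_i | x1, x2), with x3 marginalized away
print(post)
```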

 
 
86
Handling Missing Features
 
87
Other Topics
  • Compound Bayes Decision Theory and Context
  • Consecutive states of nature might not be
    statistically independent; in sorting two types
    of fish, the arrival of the next fish may not be
    independent of the previous fish
  • Can we exploit such statistical dependence to
    gain improved performance (use of context)?
  • Compound decision vs. sequential compound
    decision problems
  • Markov dependence
  • Sequential Decision Making
  • Feature measurement process is sequential (as in
    medical diagnosis)
  • Feature measurement cost
  • Minimize the no. of features to be measured while
    achieving sufficient accuracy; minimize a
    combination of feature measurement cost and
    classification accuracy

88
Context in Text Recognition