Bayesian Decision Theory - PowerPoint PPT Presentation

1
Bayesian Decision Theory
  • Z. Ghassabi

2
Outline
  • What is pattern recognition?
  • What is classification?
  • Need for Probabilistic Reasoning
  • Probabilistic Classification Theory
  • What is Bayesian Decision Theory?
  • HISTORY
  • PRIOR PROBABILITIES
  • CLASS-CONDITIONAL PROBABILITIES
  • BAYES FORMULA
  • A Casual Formulation
  • Decision

3
Outline
  • What is Bayesian Decision Theory?
  • Decision for Two Categories

4
Outline
  • What is classification?
  • Classification by Bayesian Classification
  • Basic Concepts
  • Bayes Rule
  • More General Forms of Bayes Rule
  • Discriminant Functions
  • Bayesian Belief Networks

5
What is pattern recognition?
TYPICAL APPLICATIONS OF PR
IMAGE PROCESSING EXAMPLE
6
Pattern Classification System
  • Preprocessing
  • Segment (isolate) fishes from one another and
    from the background
  • Feature Extraction
  • Reduce the data by measuring certain features
  • Classification
  • Divide the feature space into decision regions

8
Classification
  • Initially use the length of the fish as a
    possible feature for discrimination

9
TYPICAL APPLICATIONS
LENGTH AS A DISCRIMINATOR
  • Length is a poor discriminator

10
Feature Selection
  • The length is a poor feature alone!
  • Select the lightness as a possible feature

11
TYPICAL APPLICATIONS
ADD ANOTHER FEATURE
  • Lightness is a better feature than length because
    it reduces the misclassification error.
  • Can we combine features in such a way that we
    improve performance? (Hint: correlation)

12
Threshold decision boundary and cost relationship
  • Move the decision boundary toward smaller values of
    lightness in order to minimize the cost (reduce
    the number of sea bass that are classified as
    salmon!)
  • Task of decision theory

13
Feature Vector
  • Adopt the lightness and add the width of the fish
    to the feature vector
  • Fish: x^T = [x1, x2], with lightness and width as
    the two features

14
TYPICAL APPLICATIONS
WIDTH AND LIGHTNESS
Straight line decision boundary
  • Treat features as an N-tuple (here, a two-dimensional
    vector)
  • Create a scatter plot
  • Draw a line (regression) separating the two
    classes

15
Features
  • We might add other features that are not highly
    correlated with the ones we already have. Be sure
    not to reduce the performance by adding noisy
    features
  • Ideally, you might think the best decision
    boundary is the one that provides optimal
    performance on the training data (see the
    following figure)

16
TYPICAL APPLICATIONS
DECISION THEORY
  • Can we do better than a linear classifier?

Is this a good decision boundary?
  • What is wrong with this decision surface? (Hint:
    generalization)

17
Decision Boundary Choice
  • Our satisfaction is premature because the central
    aim of designing a classifier is to correctly
    classify new (test) input
  • Issue of generalization!

18
TYPICAL APPLICATIONS
GENERALIZATION AND RISK
Better decision boundary
  • Why might a smoother decision surface be a better
    choice? (Hint: Occam's razor)
  • PR investigates how to find such optimal
    decision surfaces and how to provide system
    designers with the tools to make intelligent
    trade-offs.

19
Need for Probabilistic Reasoning
  • Most everyday reasoning is based on uncertain
    evidence and inferences.
  • Classical logic, which only allows conclusions to
    be strictly true or strictly false, does not
    account for this uncertainty or the need to weigh
    and combine conflicting evidence.
  • Today's expert systems employ fairly ad hoc
    methods for reasoning under uncertainty and for
    combining evidence.

20
Probabilistic Classification Theory
  • ?? classification ?????? ?? ?? ???????? ??????
    ???????? ?? ???? ?? ?? ??????? ?? ?? ???? ?????
    ???.
  • In practice, some overlap between classes and
    random variation within classes occur; hence
    perfect separation between classes cannot be
    achieved, and misclassification may occur.
  • ????? Bayesian decision ????? ?? ??? ?? ????
    ????? ?? ????? ??? ???? ????? ?? ???? ??? ?? ??
    ????? ????? ?? ???? ?? ???? A ??? ??????? ?? ????
    B ???? ???? ???. (misclassify)

21
HISTORY
What is Bayesian Decision Theory?
  • Bayesian probability was named after the Reverend
    Thomas Bayes (1702-1761).
  • He proved a special case of what is currently
    known as Bayes' theorem.
  • The term Bayesian came into use around the
    1950s.

http://en.wikipedia.org/wiki/Bayesian_probability
22
HISTORY (Cont.)
  • Pierre-Simon, Marquis de Laplace (1749-1827)
    independently proved a generalized version of
    Bayes' theorem.
  • 1970: Bayesian Belief Network at Stanford
    University (Judea Pearl 1988)
  • The ideas proposed above were not fully
    developed until later. BBNs became popular in the
    1990s.

23
HISTORY (Cont.)
  • Current uses of Bayesian networks
  • Microsoft's printer troubleshooter
  • Diagnosing diseases (Mycin)
  • Predicting oil and stock prices
  • Controlling the space shuttle
  • Risk analysis: schedule and cost overruns

24
BAYESIAN DECISION THEORY
PROBABILISTIC DECISION THEORY
  • Bayesian decision theory is a fundamental
    statistical approach to the problem of pattern
    classification.
  • It uses a probabilistic approach to help make
    decisions (e.g., classification) so as to minimize
    the risk (cost).
  • Assume all relevant probability distributions are
    known (later we will learn how to estimate these
    from data).

25
BAYESIAN DECISION THEORY
PRIOR PROBABILITIES
  • State of nature is prior information
  • Let ω denote the state of nature
  • Model it as a random variable, ω
  • ω = ω1: the event that the next fish is a sea
    bass
  • Category 1: sea bass; category 2: salmon
  • A priori probabilities
  • P(ω1): probability of category 1
  • P(ω2): probability of category 2
  • P(ω1) + P(ω2) = 1 (either ω1 or ω2 must occur)
  • Decision rule
  • Decide ω1 if P(ω1) > P(ω2); otherwise, decide
    ω2

But we know there will be many mistakes.
http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm
26
BAYESIAN DECISION THEORY
CLASS-CONDITIONAL PROBABILITIES
  • A decision rule with only prior information
    always produces the same result and ignores
    measurements.
  • If P(ω1) >> P(ω2), we will be correct most of
    the time.
  • Given a feature x (lightness), which is a
    continuous random variable, p(x | ω2) is the
    class-conditional probability density function
  • p(x | ω1) and p(x | ω2) describe the difference in
    lightness between populations of sea bass and
    salmon.

27
Let x be a continuous random variable. p(x | ω)
is the probability density for x given the state
of nature ω.
p(lightness | salmon) = ?
p(lightness | sea bass) = ?
28
BAYESIAN DECISION THEORY
BAYES FORMULA
How do we combine a priori and class-conditional
probabilities to know the probability of a state
of nature?
  • Suppose we know both P(ωj) and p(x | ωj), and we
    can measure x. How does this influence our
    decision?
  • The joint probability of finding a pattern
    that is in category j and that this pattern has a
    feature value of x is
    p(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj)
  • Rearranging terms, we arrive at Bayes formula:
    P(ωj | x) = p(x | ωj) P(ωj) / p(x),
    where p(x) = Σ_j p(x | ωj) P(ωj)

29
  • A Casual Formulation
  • The prior probability reflects knowledge of the
    relative frequency of instances of a class
  • The likelihood is a measure of the probability
    that a measurement value occurs in a class
  • The evidence is a scaling term

BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
  • Bayes formula
    P(ωj | x) = p(x | ωj) P(ωj) / p(x)
  • can be expressed in words as
    posterior = likelihood × prior / evidence
  • By measuring x, we can convert the prior
    probability, P(ωj), into a posterior probability,
    P(ωj | x).
  • Evidence can be viewed as a scale factor and is
    often ignored in optimization applications (e.g.,
    speech recognition).

For two categories
Bayes decision: choose ω1 if P(ω1 | x) > P(ω2 | x);
otherwise choose ω2.
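
The two-category Bayes decision can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the Gaussian class-conditional densities, their parameters, and the helper names `gaussian_pdf` and `posteriors` are assumptions chosen for the example.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, priors, likelihoods):
    """Return P(w_j | x) for every class j using Bayes formula."""
    joint = [lik(x) * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x), the scale factor
    return [j / evidence for j in joint]

# Hypothetical class-conditional densities (illustrative values only).
likelihoods = [
    lambda x: gaussian_pdf(x, 4.0, 1.0),   # p(x | w1)
    lambda x: gaussian_pdf(x, 10.0, 1.0),  # p(x | w2)
]
priors = [2.0 / 3.0, 1.0 / 3.0]

post = posteriors(6.0, priors, likelihoods)
decision = 1 if post[0] > post[1] else 2  # choose w1 if P(w1 | x) > P(w2 | x)
print(post, decision)
```

Because the evidence is shared by both posteriors, comparing the unnormalized products p(x | ωj) P(ωj) would give the same decision.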
30
BAYESIAN THEOREM
  • A special case of Bayes' theorem
  • P(A∩B) = P(B) × P(A | B)
  • P(B∩A) = P(A) × P(B | A)
  • Since P(A∩B) = P(B∩A),
  • P(B) × P(A | B) = P(A) × P(B | A)
  • ⇒ P(A | B) = P(A) × P(B | A) / P(B)

31
Preliminaries and Notations
ωi: a state of nature
P(ωi): prior probability
x: feature vector
p(x | ωi): class-conditional density
P(ωi | x): posterior probability
32
Decision
Decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i
The evidence, p(x), is a scale factor that
assures conditional probabilities sum to 1:
P(ω1 | x) + P(ω2 | x) = 1
We can eliminate the scale factor (which appears
on both sides of the inequality)
Decide ωi if p(x | ωi) P(ωi) > p(x | ωj) P(ωj) ∀ j ≠ i
  • Special cases
  • P(ω1) = P(ω2) = ... = P(ωc)
  • p(x | ω1) = p(x | ω2) = ... = p(x | ωc)

33
Two Categories
Decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i
Decide ωi if p(x | ωi) P(ωi) > p(x | ωj) P(ωj) ∀ j ≠ i
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise
decide ω2
Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2);
otherwise decide ω2
  • Special cases
  • 1. P(ω1) = P(ω2)
  • Decide ω1 if p(x | ω1) > p(x | ω2); otherwise
    decide ω2
  • 2. p(x | ω1) = p(x | ω2)
  • Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

34
Example
Decision regions R1 and R2 for P(ω1) = P(ω2)
35
Example
P(ω1) = 2/3, P(ω2) = 1/3
Bayes Decision Rule
Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2);
otherwise decide ω2
36
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
  • For every value of x, the posteriors sum to 1.0.
  • At x = 14, the probability it is in category ω2 is
    0.08, and for category ω1 is 0.92.

37
BAYESIAN DECISION THEORY
BAYES DECISION RULE
Classification Error
  • Decision rule
  • For an observation x, decide ω1 if P(ω1 | x) >
    P(ω2 | x); otherwise, decide ω2
  • Probability of error
    P(error | x) = P(ω1 | x) if we decide ω2,
    and P(ω2 | x) if we decide ω1
  • The average probability of error is given by
    P(error) = ∫ P(error | x) p(x) dx
  • If for every x we ensure that P(error | x) is as
    small as possible, then the integral is as small
    as possible. Thus, the Bayes decision rule
    minimizes P(error).

38
CONTINUOUS FEATURES
GENERALIZATION OF TWO-CLASS PROBLEM
  • Generalization of the preceding ideas
  • Use of more than one feature (e.g., length and
    lightness)
  • Use of more than two states of nature (e.g., N-way
    classification)
  • Allowing actions other than a decision
    on the state of nature (e.g., rejection: refusing
    to take an action when alternatives are close or
    confidence is low)
  • Introduce a loss function, which is more
    general than the probability of error (e.g.,
    errors are not equally costly)
  • Let us replace the scalar x by the vector x in a
    d-dimensional Euclidean space, R^d, called the
    feature space.

39
The Generalization
Ω = {ω1, ..., ωc}: a set of c states of nature or c categories
A = {α1, ..., αa}: a set of a possible actions
LOSS FUNCTION
λ(αi | ωj): the loss incurred for taking action αi when the
true state of nature is ωj (can be zero).
Risk
We want to minimize the expected loss in making
a decision.
40
Examples
  • Ex 1: Fish classification
  • X is the image of the fish
  • x = (brightness, length, fin, etc.)
  • ω is our belief about what the fish type is:
    sea bass, salmon, trout, etc.
  • α is a decision for the fish type, in this
    case sea bass, salmon, trout, manual
    inspection needed, etc.
  • Ex 2: Medical diagnosis
  • X is all the available medical tests and imaging
    scans that a doctor can order for a patient
  • x = (blood pressure, glucose level, cough, x-ray,
    etc.)
  • ω is an illness type:
    flu, cold, TB, pneumonia, lung cancer, etc.
  • α is a decision for treatment:
    Tylenol, hospitalize, more tests needed, etc.

41
Conditional Risk
Given x, the expected loss (risk) associated with
taking action αi:
R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x)
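
As a rough sketch of how the conditional risk is used, the snippet below weights each row of a loss matrix by the posteriors and picks the action with the smallest expected loss. The loss values, the posteriors, and the function names are hypothetical, chosen only to show that the minimum-risk action can differ from the most probable class.

```python
def conditional_risks(loss, posteriors):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x), one value per action a_i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]

def bayes_action(loss, posteriors):
    """Index of the action with minimum conditional risk."""
    risks = conditional_risks(loss, posteriors)
    return min(range(len(risks)), key=risks.__getitem__)

# loss[i][j]: cost of taking action a_(i+1) when the true state is w_(j+1).
loss = [[0.0, 1.0],   # a1: free if w1 is true, costs 1 if w2 is true
        [2.0, 0.0]]   # a2: costs 2 if w1 is true, free if w2 is true
post = [0.4, 0.6]     # assumed posteriors P(w1 | x), P(w2 | x)

risks = conditional_risks(loss, post)
best = bayes_action(loss, post)
print(risks, best)
```

Here ω2 is the more probable state, yet action α1 has the lower risk because mistaking ω1 is assumed to be twice as costly.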
42
0/1 Loss Function
43
Decision
A general decision rule is a function α(x) that
tells us which action to take for every possible
observation.
Bayesian Decision Rule: α(x) = the action αi that minimizes R(αi | x)
44
Overall Risk
The overall risk is given by
R = ∫ R(α(x) | x) p(x) dx
  • Compute the conditional risk for every α and
    select the action that minimizes R(αi | x). The
    resulting minimum overall risk is denoted R*, and
    is referred to as the Bayes risk
  • The Bayes risk is the best performance that can
    be achieved.

Decision function
If we choose α(x) so that R(α(x) | x) is as small as
possible for every x, the overall risk will be
minimized.
The Bayesian decision rule is the optimal one to
minimize the overall risk. Its resulting overall
risk is called the Bayesian risk.
45
Two-Category Classification
  • Let α1 correspond to ω1, α2 to ω2, and λij =
    λ(αi | ωj)
  • The conditional risk is given by
    R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
    R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

46
Two-Category Classification
Our decision rule is: choose α1 if R(α1 | x) <
R(α2 | x); otherwise choose α2
Perform α1 if R(α2 | x) > R(α1 | x); otherwise
perform α2
47
Two-Category Classification
Perform α1 if R(α2 | x) > R(α1 | x); otherwise
perform α2
Equivalently, perform α1 if
(λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x)
Posterior probabilities are scaled before
comparison.
  • If the loss incurred for making an error is
    greater than that incurred for being correct, the
    factors (λ21 − λ11) and (λ12 − λ22) are positive,
    and the ratio of these factors simply scales the
    posteriors.

48
Two-Category Classification
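Perform α1 if R(α2 | x) > R(α1 | x); otherwise
perform α2
By employing Bayes formula, we can replace the
posteriors by the prior probabilities and
conditional densities:
(λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)
The evidence p(x) appears on both sides and is irrelevant.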
49
Two-Category Classification
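This slide will be recalled later.
Perform α1 if
p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] × [P(ω2) / P(ω1)]
The left side is the likelihood ratio; the right side is the threshold.
Stated in words: choose α1 if the likelihood ratio
exceeds a threshold value independent of the
observation x.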
51
Loss Factors
  • If the loss factors are identical, and the prior
    probabilities are equal, this reduces to a
    standard likelihood ratio

52
MINIMUM ERROR RATE
Error rate (the probability of error) is to be
minimized
  • Consider a symmetrical or zero-one loss function:
    λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j
  • The conditional risk is
    R(αi | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)
  • The conditional risk is the average probability
    of error.
  • To minimize error, maximize P(ωi | x), also known
    as maximum a posteriori (MAP) decoding.

53
Minimum Error Rate
LIKELIHOOD RATIO
  • Minimum error rate classification
  • Choose ωi if P(ωi | x) > P(ωj | x) for all j ≠ i

54
Example
  • For the sea bass population, the lightness x is a
    normal random variable distributed according to
    N(4,1)
  • For the salmon population, x is distributed
    according to N(10,1)
  • Select the optimal decision where
  • The two fish are equiprobable
  • P(sea bass) = 2 × P(salmon)
  • The cost of classifying a fish as a salmon when
    it truly is a sea bass is 2, and the cost of
    classifying a fish as a sea bass when it is truly
    a salmon is 1.

58
  • End of Section 1

59
The Multicategory Classification
How to define discriminant functions?
How do we represent pattern classifiers?
The most common way is through discriminant
functions. Remember we use ω1, ω2, ..., ωc to be
the possible states of nature.
For each class we create a discriminant function
gi(x).
The gi(x) are called the discriminant functions.
Our classifier is a network or machine that
computes c discriminant functions g1(x), g2(x), ..., gc(x)
and outputs the action α(x).
The classifier: assign x to ωi if
gi(x) > gj(x) for all j ≠ i.
60
Simple Discriminant Functions
If f(.) is a monotonically increasing function,
then the f(gi(.)) are also discriminant
functions.
Notice the decision is the same if we replace
every gi(x) by f(gi(x)), assuming f(.) is a
monotonically increasing function.
Minimum Risk case: gi(x) = −R(αi | x)
Minimum Error-Rate case: gi(x) = P(ωi | x), or equivalently
gi(x) = p(x | ωi) P(ωi), or gi(x) = ln p(x | ωi) + ln P(ωi)
61
Figure 2.5
62
Decision Regions
The net effect is to divide the feature space
into c regions (one for each class). We then have
c decision regions separated by decision
boundaries.
Two-category example
Decision regions are separated by decision
boundaries.
63
Figure 2.6
64
Bayesian Decision Theory(Classification)
  • The Normal Distribution

65
Basics of Probability
Discrete random variable (X), assumed integer-valued
Probability mass function (pmf): p(x) = P(X = x)
Cumulative distribution function (cdf): F(x) = P(X ≤ x) = Σ_{t ≤ x} p(t)
Continuous random variable (X)
Probability density function (pdf): p(x), which is not a probability
Cumulative distribution function (cdf): F(x) = P(X ≤ x) = ∫_{−∞}^{x} p(t) dt
66
Probability mass function
  • The graph of a probability mass function.
  • All the values of this function must be
    non-negative and sum up to 1.

67
Probability density function
  • A probability can be calculated from the pdf by
    integrating the function f(x) over an interval
    of the input variable x.
  • For example, the probability of the variable X
    being within the interval [4.3, 7.8] would be
    P(4.3 ≤ X ≤ 7.8) = ∫_{4.3}^{7.8} f(x) dx

68
Expectations
Let g be a function of random variable X.
E[g(X)] = Σ_x g(x) p(x) (discrete) or ∫ g(x) p(x) dx (continuous)
The kth moment: E[X^k]
The 1st moment: µ = E[X]
The kth central moment: E[(X − µ)^k]
69
Important Expectations
Mean: µ = E[X]
Variance: σ² = E[(X − µ)²]
Fact: Var[X] = E[X²] − (E[X])²
70
Entropy
The entropy measures the fundamental uncertainty
in the value of points selected randomly from a
distribution.
71
Univariate Gaussian Distribution
  • Properties
  • Maximizes the entropy (for a given mean and variance)
  • Central limit theorem

X ~ N(µ, σ²)
E[X] = µ
Var[X] = σ²
72
Illustration of the central limit theorem
Let x1, x2, ..., xn be a sequence of n independent
and identically distributed random variables,
each having finite expectation µ and
variance σ² > 0. The central limit theorem states
that as the sample size n increases, the
distribution of the sample average of these
random variables approaches the normal
distribution with mean µ and variance σ² / n,
irrespective of the shape of the original
distribution.
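
A small simulation can make the statement concrete. This sketch, which is not from the slides, averages uniform random numbers (a deliberately non-normal source, chosen here as an assumption) and compares the mean and variance of the sample averages with µ and σ² / n. It checks only those two moments, not the full shape of the distribution.

```python
import random
import statistics

random.seed(0)           # fixed seed so the run is repeatable
n, trials = 50, 4000     # sample size and number of repeated samples

# Uniform(0, 1) has mean 1/2 and variance 1/12.
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]

sample_mean = statistics.fmean(means)
sample_var = statistics.pvariance(means)
expected_var = (1.0 / 12.0) / n      # sigma^2 / n

print(sample_mean, sample_var, expected_var)
```

Plotting a histogram of `means` would show the familiar bell shape even though each individual draw is uniform.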
73
Random Vectors
A d-dimensional random vector: X = (X1, X2, ..., Xd)^T
Vector Mean: µ = E[X] = (µ1, µ2, ..., µd)^T
Covariance Matrix: Σ = E[(X − µ)(X − µ)^T]
74
Multivariate Gaussian Distribution
X ~ N(µ, Σ)
A d-dimensional random vector
E[X] = µ
E[(X − µ)(X − µ)^T] = Σ
75
Properties of N(µ, Σ)
X ~ N(µ, Σ)
A d-dimensional random vector
Let Y = A^T X, where A is a d × k matrix.
Y ~ N(A^T µ, A^T Σ A)
77
On Parameters of N(µ, Σ)
X ~ N(µ, Σ)
78
More On Covariance Matrix
Σ is symmetric and positive semidefinite.
Σ = Φ Λ Φ^T
Φ: orthonormal matrix, whose columns are
eigenvectors of Σ.
Λ: diagonal matrix (eigenvalues).
79
Whitening Transform
X ~ N(µ, Σ)
Y = A^T X
Y ~ N(A^T µ, A^T Σ A)
Let A_w = Φ Λ^(-1/2). Then A_w^T Σ A_w = I.
80
Whitening Transform
Whitening
X ~ N(µ, Σ)
Linear Transform
Y = A^T X
Y ~ N(A^T µ, A^T Σ A)
Let A_w = Φ Λ^(-1/2)
Projection
81
Whitening Transform
  • The whitening transformation is a decorrelation
    method that converts the covariance matrix Σ of a
    set of samples into the identity matrix I.
  • This effectively creates new random variables
    that are uncorrelated and have unit variance.
  • The method is called the whitening transform
    because it transforms the input closer towards
    white noise.

This can be expressed as A_w = Φ Λ^(-1/2),
where Φ is the matrix with the eigenvectors of
Σ as its columns and Λ is the diagonal matrix
of non-increasing eigenvalues.
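
The whitening matrix can be checked numerically. The sketch below uses NumPy and an illustrative covariance matrix that is not taken from the slides; the function name `whitening_matrix` is hypothetical. `numpy.linalg.eigh` returns eigenvalues in ascending rather than non-increasing order, which does not affect the result because each eigenvector stays paired with its own eigenvalue.

```python
import numpy as np

def whitening_matrix(cov):
    """Return A_w = Phi @ Lambda^(-1/2), so that A_w.T @ cov @ A_w = I.

    cov must be symmetric positive definite.
    """
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals))

cov = np.array([[4.0, 1.5],
                [1.5, 1.0]])   # illustrative covariance (assumed values)

A_w = whitening_matrix(cov)
whitened_cov = A_w.T @ cov @ A_w
print(np.round(whitened_cov, 6))
```

Applying Y = A_w^T X to samples with covariance `cov` would therefore give decorrelated features with unit variance.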
82
White noise
  • White noise is a random signal (or process) with
    a flat power spectral density.
  • In other words, the signal contains equal power
    within a fixed bandwidth at any center frequency.

Energy spectral density
83
Mahalanobis Distance
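X ~ N(µ, Σ)
p(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(−r²/2), with r² = (x − µ)^T Σ^(-1) (x − µ)
The factor (2π)^(-d/2) |Σ|^(-1/2) is constant; the exponential
depends on the value of r².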
84
Mahalanobis distance
In statistics, Mahalanobis distance is a distance
measure introduced by P. C. Mahalanobis in
1936. It is based on correlations between
variables by which different patterns can be
identified and analyzed. It is a useful way of
determining similarity of an unknown sample set
to a known one. It differs from Euclidean
distance in that it takes into account the
correlations of the data set and is
scale-invariant, i.e., not dependent on the scale
of measurements.
85
Mahalanobis distance
  • Formally, the Mahalanobis distance of a multivariate
    vector x from a group of values with mean µ and
    covariance matrix Σ is defined as
    D(x) = sqrt((x − µ)^T Σ^(-1) (x − µ))
  • Mahalanobis distance can also be defined as a
    dissimilarity measure between two random vectors
    x and y of the same distribution with the
    covariance matrix Σ:
    d(x, y) = sqrt((x − y)^T Σ^(-1) (x − y))
  • If the covariance matrix is the identity matrix,
    the Mahalanobis distance reduces to the Euclidean
    distance. If the covariance matrix is diagonal,
    then the resulting distance measure is called the
    normalized Euclidean distance:
    d(x, y) = sqrt(sum over i of (xi − yi)² / σi²)
  • where σi is the standard deviation of the xi over
    the sample set.
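
A minimal NumPy sketch of the distance, with made-up vectors and covariance matrices, illustrates the two special cases mentioned above. The function name `mahalanobis` is an assumption; solving the linear system avoids forming the inverse explicitly.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """sqrt((x - y)^T cov^(-1) (x - y)) for a positive definite cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

identity = np.eye(2)
diagonal = np.diag([4.0, 1.0])   # variances 4 and 1, i.e. sigma = 2 and 1

d_euclid = mahalanobis([3.0, 4.0], [0.0, 0.0], identity)   # plain Euclidean
d_scaled = mahalanobis([2.0, 0.0], [0.0, 0.0], diagonal)   # normalized Euclidean
print(d_euclid, d_scaled)
```

In the diagonal case a displacement of 2 along the first axis counts as one standard deviation, which is the scale invariance described in the text.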

87
Bayesian Decision Theory(Classification)
  • Discriminant Functions for the Normal Populations

88
Normal Density
If features are statistically independent and the
variance is the same for all features, the
discriminant function is simple and is linear in
nature. A classifier that uses linear
discriminant functions is called a linear
machine. The decision surfaces are pieces of
hyperplanes defined by linear equations.
89
Minimum-Error-Rate Classification
  • Assuming the measurements are normally
    distributed, we have

Xi ~ N(µi, Σi)
90
Some Algebra to Simplify the Discriminants
  • Since
  • We take the natural logarithm to re-write the
    first term

91
Some Algebra to Simplify the Discriminants
(continued)
92
Minimum-Error-Rate Classification
Three Cases
Case 1: Σi = σ²I
Classes are centered at different means, and their
feature components are pairwise independent and
have the same variance.
Case 2: Σi = Σ
Classes are centered at different means, but have
the same covariance.
Case 3: Σi ≠ Σj
Arbitrary.
93
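Case 1. Σi = σ²I
gi(x) = −‖x − µi‖² / (2σ²) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
The terms −(d/2) ln 2π and −(1/2) ln |Σi| are the same for
every class and are irrelevant.
94
Case 1. Σi = σ²I
95
Case 1. Σi = σ²I
Boundary between ωi and ωj
96
Case 1. Σi = σ²I
The decision boundary is a hyperplane
perpendicular to the line between the means.
Boundary between ωi and ωj: w^T (x − x0) = 0, with w = µi − µj and
x0 = (µi + µj)/2 − [σ² / ‖µi − µj‖²] ln[P(ωi) / P(ωj)] (µi − µj)
The first term of x0 is the midpoint of the means; the second
term is 0 if P(ωi) = P(ωj).
97
Case 1. Σi = σ²I
  • The decision boundary when the priors are equal and
    the support regions are spherical is simply
    halfway between the means (Euclidean distance).

Minimum distance classifier (template matching)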
99
Case 1. Σi = σ²I
Note how priors shift the boundary away from the
more likely mean!
100
Case 1. Σi = σ²I
101
Case 1. Σi = σ²I
102
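Case 2. Σi = Σ
  • Covariance matrices are arbitrary, but equal to
    each other for all classes.
  • Features then form hyper-ellipsoidal clusters of
    equal size and shape.

gi(x) = −(1/2) (x − µi)^T Σ^(-1) (x − µi) + ln P(ωi)
The quadratic form is the squared Mahalanobis distance.
The term ln P(ωi) is irrelevant if P(ωi) = P(ωj) for all i, j;
class-independent terms are irrelevant.
103
Case 2. Σi = Σ
Expanding the quadratic form, the term x^T Σ^(-1) x is the same
for every class and is irrelevant, so gi(x) is linear in x.
104
Case 2. Σi = Σ
  • The discriminant hyperplanes are often not
    orthogonal to the segments joining the class means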

105
Case 2. Σi = Σ
106
Case 2. Σi = Σ
107
Case 3. Σi ≠ Σj
The covariance matrices are different for each
category. In the two-class case, the decision
boundaries form hyperquadrics.
  • Decision surfaces are hyperquadrics, e.g.,
  • hyperplanes
  • hyperspheres
  • hyperellipsoids
  • hyperhyperboloids

gi(x) = −(1/2) (x − µi)^T Σi^(-1) (x − µi) − (1/2) ln |Σi| + ln P(ωi)
The term −(1/2) ln |Σi| is absent in Cases 1 and 2; the
constant −(d/2) ln 2π is irrelevant.
108
Case 3. Σi ≠ Σj
Non-simply connected decision regions can arise
in one dimension for Gaussians having unequal
variance.
109
Case 3. Σi ≠ Σj
110
Case 3. Σi ≠ Σj
111
Case 3. Σi ≠ Σj
112
Multi-Category Classification
113
Example: A Problem
  • Exemplars (transposed)
  • For ω1: (2, 6), (3, 4), (3, 8), (4, 6)
  • For ω2: (1, -2), (3, 0), (3, -4), (5, -2)
  • Calculated means (transposed)
  • m1 = (3, 6)
  • m2 = (3, -2)

114
Example: Covariance Matrices
115
Example: Covariance Matrices
116
Example: Inverse and Determinant for Each of the
Covariance Matrices
117
Example: A Discriminant Function for Class 1
118
Example
119
Example: A Discriminant Function for Class 2
120
Example
121
Example: The Class Boundary
122
Example: A Quadratic Separator
123
Example: Plot of the Discriminant
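
The worked matrices on the slides above were images and are missing from the transcript. The sketch below recomputes the example from the listed exemplars under two stated assumptions: covariances are normalized by n (the maximum-likelihood estimate) and the priors are equal at 0.5. The helper names `mean_and_cov` and `g` are hypothetical, and the results may differ from the original slides if they used other conventions.

```python
import numpy as np

w1 = np.array([[2, 6], [3, 4], [3, 8], [4, 6]], dtype=float)
w2 = np.array([[1, -2], [3, 0], [3, -4], [5, -2]], dtype=float)

def mean_and_cov(samples):
    """Sample mean and covariance, dividing by n rather than n - 1."""
    m = samples.mean(axis=0)
    centered = samples - m
    return m, centered.T @ centered / len(samples)

m1, S1 = mean_and_cov(w1)
m2, S2 = mean_and_cov(w2)

def g(x, m, S, prior):
    """Gaussian discriminant with the class-independent constant dropped."""
    d = np.asarray(x, dtype=float) - m
    return float(-0.5 * d @ np.linalg.solve(S, d)
                 - 0.5 * np.log(np.linalg.det(S)) + np.log(prior))

# Setting g1 = g2 with these estimates gives the quadratic boundary
# x2 = 0.1875 * x1**2 - 1.125 * x1 + 3.6875 - ln(2) / 4.
def boundary_x2(x1):
    return 0.1875 * x1 ** 2 - 1.125 * x1 + 3.6875 - np.log(2.0) / 4.0

print(m1, m2)
print(S1)
print(S2)
```

Because the two covariance matrices differ, the boundary is a parabola rather than a straight line, which is the quadratic separator the example refers to.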
124
Summary: Steps for Building a Bayesian Classifier
  • Collect class exemplars
  • Estimate class a priori probabilities
  • Estimate class means
  • Form covariance matrices, find the inverse and
    determinant for each
  • Form the discriminant function for each class

125
Using the Classifier
  • Obtain a measurement vector x
  • Evaluate the discriminant function gi(x) for each
    class i = 1, ..., c
  • Decide x is in class j if gj(x) > gi(x) for
    all i ≠ j
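
The building and usage steps can be combined into one small class. This is a sketch under assumptions rather than the presenter's implementation: priors are estimated from class frequencies, covariances use the n-normalized estimate, and the class name `GaussianBayesClassifier` and its methods are hypothetical.

```python
import numpy as np

class GaussianBayesClassifier:
    """Per-class Gaussian discriminants g_i(x); predicts the argmax."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.params_ = []
        for c in self.classes_:
            Xc = X[y == c]                              # class exemplars
            prior = len(Xc) / len(X)                    # a priori probability
            mean = Xc.mean(axis=0)                      # class mean
            cov = np.cov(Xc, rowvar=False, bias=True)   # covariance (divide by n)
            self.params_.append(
                (prior, mean, np.linalg.inv(cov), np.linalg.det(cov)))
        return self

    def discriminants(self, x):
        x = np.asarray(x, dtype=float)
        values = []
        for prior, mean, inv_cov, det_cov in self.params_:
            d = x - mean
            values.append(-0.5 * d @ inv_cov @ d
                          - 0.5 * np.log(det_cov) + np.log(prior))
        return np.array(values)

    def predict(self, x):
        return self.classes_[int(np.argmax(self.discriminants(x)))]

# Usage with the exemplars from the example slides (labels 1 and 2).
X = [[2, 6], [3, 4], [3, 8], [4, 6], [1, -2], [3, 0], [3, -4], [5, -2]]
y = [1, 1, 1, 1, 2, 2, 2, 2]
clf = GaussianBayesClassifier().fit(X, y)
print(clf.predict([3, 5]), clf.predict([3, -1]))
```

With so few exemplars per class the covariance estimates are fragile; a singular estimate would make the inverse and the log determinant undefined.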

126
Bayesian Decision Theory (Classification)
  • Criteria

127
Bayesian Decision Theory (Classification)
  • Minimum Error Rate Criterion

128
Minimum-Error-Rate Classification
  • If action αi is taken and the true state is
    ωj, then the decision is correct if i = j and in
    error if i ≠ j
  • Error rate (the probability of error) is to be
    minimized
  • Symmetrical or zero-one loss function
  • Conditional risk

129
Minimum-Error-Rate Classification
130
Bayesian Decision Theory(Classification)
  • Minimax Criterion

131
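Bayesian Decision Rule: Two-Category
Classification
Decide ω1 if
p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] × [P(ω2) / P(ω1)]
(likelihood ratio on the left, threshold on the right)
The minimax criterion deals with the case that the
prior probabilities are unknown.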
132
Basic Concept on Minimax
Choose the worst-case prior probabilities (those giving
the maximum loss) and then pick the decision rule
that minimizes the overall risk.
That is, minimize the maximum possible overall risk,
so that the worst risk for any value of the
priors is as small as possible.
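
For the zero-one loss, one way to sketch the minimax idea is to look for the boundary at which the two conditional error probabilities are equal, so the overall error no longer depends on the priors. The code below assumes two univariate Gaussians with illustrative parameters and restricts the decision rule to a single threshold (decide ω1 for x below it); the function name `minimax_threshold` and the use of bisection are choices made for this example.

```python
from statistics import NormalDist

def minimax_threshold(class1, class2, lo, hi, iters=60):
    """Bisection for x* with P(x > x* | w1) == P(x < x* | w2)."""
    def gap(t):
        # Error given w1 minus error given w2; decreasing in t.
        return (1.0 - class1.cdf(t)) - class2.cdf(t)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

w1 = NormalDist(4.0, 1.0)    # illustrative p(x | w1)
w2 = NormalDist(10.0, 2.0)   # illustrative p(x | w2)

t = minimax_threshold(w1, w2, 4.0, 10.0)
err1 = 1.0 - w1.cdf(t)       # P(decide w2 | w1)
err2 = w2.cdf(t)             # P(decide w1 | w2)
print(t, err1, err2)
```

Since the two error probabilities coincide at this boundary, the overall error P(ω1) err1 + P(ω2) err2 equals the same value for every choice of priors.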
133
Overall Risk
134
Overall Risk
135
Overall Risk
136
Overall Risk
137
Overall Risk
R(x) = ax + b, with x = P(ω1)
The values of a and b depend on the setting of the decision
boundary.
The overall risk for a particular P(ω1).
138
Overall Risk
R(x) = ax + b
a = 0 for the minimax solution
b = Rmm, the minimax risk
Independent of the value of P(ωi).
139
Minimax Risk
140
Error Probability
Use 0/1 loss function
141
Minimax Error-Probability
Use 0/1 loss function
P(α1 | ω2)
P(α2 | ω1)
142
Minimax Error-Probability
P(α1 | ω2)
P(α2 | ω1)
143
Minimax Error-Probability
144
Bayesian Decision Theory(Classification)
  • Neyman-Pearson Criterion

145
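Bayesian Decision Rule: Two-Category
Classification
Decide ω1 if
p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] × [P(ω2) / P(ω1)]
(likelihood ratio on the left, threshold on the right)
The Neyman-Pearson criterion deals with the case that
both the loss functions and the prior probabilities
are unknown.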
146
Signal Detection Theory
  • Signal detection theory evolved
    from the development of communications and radar
    equipment in the first half of the last century.
  • It migrated to psychology, initially as part of
    sensation and perception, in the 1950s and 1960s as
    an attempt to understand some of the features of
    human behavior when detecting very faint stimuli
    that were not being explained by traditional
    theories of thresholds.

147
The situation of interest
  • A person is faced with a stimulus (signal) that
    is very faint or confusing.
  • The person must make a decision: is the signal
    there or not?
  • What makes this situation confusing and difficult
    is the presence of other mess that is similar to
    the signal. Let us call this mess noise.

148
Example
Noise is present both in the environment and in
the sensory system of the observer. The observer
reacts to the momentary total activation of the
sensory system, which fluctuates from moment to
moment, as well as responding to environmental
stimuli, which may include a signal.
149
Signal Detection Theory
Suppose we want to detect a single pulse from a
signal. We assume the signal has some random
noise. When the signal is present we observe a
normal distribution with mean µ2. When the signal
is not present we observe a normal distribution
with mean µ1. We assume the same standard deviation σ.
Can we measure the discriminability of the
problem? Can we do this independent of the
threshold x*?
Discriminability: d' = |µ2 − µ1| / σ
150
Example
  • A radiologist is examining a CT scan, looking for
    evidence of a tumor.
  • A hard job, because there is always some
    uncertainty.
  • There are four possible outcomes
  • hit (tumor present and doctor says "yes")
  • miss (tumor present and doctor says "no")
  • false alarm (tumor absent and doctor says "yes")
  • correct rejection (tumor absent and doctor says
    "no").

Two types of Error
151
The Four Cases
Signal detection theory was developed to help us
understand how a continuous and ambiguous signal
can lead to a binary yes/no decision.
Correct Rejection: P(α1 | ω1)
Miss: P(α1 | ω2)
False Alarm: P(α2 | ω1)
Hit: P(α2 | ω2)
152
Decision Making
Discriminability
Criterion: based on expectancy (decision bias)
Hit: P(α2 | ω2)
False Alarm: P(α2 | ω1)
154
Signal Detection Theory
  • How do we find d' if we do not know µ1, µ2, or
    x*?
  • From the data we can compute
  • P(x > x* | ω2): a hit.
  • P(x > x* | ω1): a false alarm.
  • P(x < x* | ω2): a miss.
  • P(x < x* | ω1): a correct rejection.
  • If we plot points in a space representing hit
    and false alarm rates as x* varies,
    then we have a ROC (receiver operating
    characteristic) curve.
  • With it we can distinguish between
    discriminability and bias.
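
Under the equal-variance Gaussian model described earlier, the discriminability can be recovered from a hit rate and a false alarm rate alone as d' = z(hit) − z(false alarm), where z is the inverse standard normal cdf. The sketch below checks this on an illustrative pair of distributions; the parameters and function names are assumptions, not values from the slides.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """d' = z(hit) - z(false alarm), assuming equal-variance Gaussians."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

def roc_point(threshold, noise, signal):
    """(false alarm rate, hit rate) when 'signal' is reported for x > threshold."""
    return 1.0 - noise.cdf(threshold), 1.0 - signal.cdf(threshold)

noise = NormalDist(0.0, 1.0)    # illustrative w1: signal absent
signal = NormalDist(1.5, 1.0)   # illustrative w2: signal present, true d' = 1.5

# Sweeping the threshold traces out points on one ROC curve.
curve = [roc_point(t, noise, signal) for t in (-1.0, 0.0, 0.75, 1.5, 2.5)]
estimates = [d_prime(hit, fa) for fa, hit in curve]
print(curve)
print(estimates)
```

Every threshold gives a different point on the curve (a different bias) but the same d', which is how the ROC curve separates discriminability from bias.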

155
ROC Curve (Receiver Operating Characteristic)
Hit
PH = P(α2 | ω2)
False Alarm
PFA = P(α2 | ω1)
156
Neyman-Pearson Criterion
Hit
PH = P(α2 | ω2)
NP
max. PH subject to PFA ≤ a
False Alarm
PFA = P(α2 | ω1)
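
As a hedged illustration of the Neyman-Pearson criterion, the sketch below fixes the allowed false alarm rate a and picks the threshold that uses all of it, which maximizes the hit rate when the signal mean lies above the noise mean and the variances are equal (the likelihood ratio is then increasing in x). The distributions, the value of a, and the function name are assumptions made for the example.

```python
from statistics import NormalDist

def neyman_pearson_threshold(noise, alpha):
    """Smallest threshold x* with P(x > x* | noise) <= alpha."""
    return noise.inv_cdf(1.0 - alpha)

noise = NormalDist(0.0, 1.0)    # illustrative w1: signal absent
signal = NormalDist(2.0, 1.0)   # illustrative w2: signal present
alpha = 0.05                    # allowed false alarm rate a

x_star = neyman_pearson_threshold(noise, alpha)
p_fa = 1.0 - noise.cdf(x_star)      # PFA = P(a2 | w1)
p_hit = 1.0 - signal.cdf(x_star)    # PH  = P(a2 | w2)
print(x_star, p_fa, p_hit)
```

No priors or losses appear anywhere in the calculation; the only design input is the false alarm budget.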