Title: Bayesian Decision Theory
1Bayesian Decision Theory
2Outline
- What is pattern recognition?
- What is classification?
- Need for Probabilistic Reasoning
- Probabilistic Classification Theory
- What is Bayesian Decision Theory?
- HISTORY
- PRIOR PROBABILITIES
- CLASS-CONDITIONAL PROBABILITIES
- BAYES FORMULA
- A Casual Formulation
- Decision
3Outline
- What is Bayesian Decision Theory?
- Decision for Two Categories
4Outline
- What is classification?
- Classification by Bayesian Classification
- Basic Concepts
- Bayes Rule
- More General Forms of Bayes Rule
- Discriminant Functions
- Bayesian Belief Networks
5What is pattern recognition?
TYPICAL APPLICATIONS OF PR
IMAGE PROCESSING EXAMPLE
6Pattern Classification System
- Preprocessing
- Segment (isolate) fishes from one another and
from the background
- Feature Extraction
- Reduce the data by measuring certain features
- Classification
- Divide the feature space into decision regions
7(No Transcript)
8Classification
- Initially use the length of the fish as a
possible feature for discrimination
9TYPICAL APPLICATIONS
LENGTH AS A DISCRIMINATOR
- Length is a poor discriminator
10Feature Selection
- The length is a poor feature alone!
- Select the lightness as a possible feature
11TYPICAL APPLICATIONS
ADD ANOTHER FEATURE
- Lightness is a better feature than length because
it reduces the misclassification error.
- Can we combine features in such a way that we
improve performance? (Hint: correlation)
12Threshold decision boundary and cost relationship
- Move decision boundary toward smaller values of
lightness in order to minimize the cost (reduce
the number of sea bass that are classified as
salmon!)
- This is the task of decision theory.
13Feature Vector
- Adopt the lightness and add the width of the fish
to the feature vector
- Fish: x^T = [x1, x2]
Width
Lightness
14TYPICAL APPLICATIONS
WIDTH AND LIGHTNESS
Straight line decision boundary
- Treat features as an N-tuple (two-dimensional
vector)
- Create a scatter plot
- Draw a line (regression) separating the two
classes
15Features
- We might add other features that are not highly
correlated with the ones we already have. Be sure
not to reduce the performance by adding noisy
features
- Ideally, you might think the best decision
boundary is the one that provides optimal
performance on the training data (see the
following figure)
16TYPICAL APPLICATIONS
DECISION THEORY
- Can we do better than a linear classifier?
Is this a good decision boundary?
- What is wrong with this decision surface? (hint:
generalization)
17Decision Boundary Choice
- Our satisfaction is premature because the central
aim of designing a classifier is to correctly
classify new (test) input
- Issue of generalization!
18TYPICAL APPLICATIONS
GENERALIZATION AND RISK
Better decision boundary
- Why might a smoother decision surface be a better
choice? (hint: Occam's Razor).
- PR investigates how to find such optimal
decision surfaces and how to provide system
designers with the tools to make intelligent
trade-offs.
19Need for Probabilistic Reasoning
- Most everyday reasoning is based on uncertain
evidence and inferences.
- Classical logic, which only allows conclusions to
be strictly true or strictly false, does not
account for this uncertainty or the need to weigh
and combine conflicting evidence.
- Today's expert systems employ fairly ad hoc
methods for reasoning under uncertainty and for
combining evidence.
20Probabilistic Classification Theory
- In classification we want to assign each sample to
one of a set of known classes.
- In practice, some overlap between classes and
random variation within classes occur, hence
perfect separation between classes cannot be
achieved and misclassification may occur.
- Bayesian decision theory lets us reason about the
probability that a sample that belongs to class A is
misclassified as class B.
21HISTORY
What is Bayesian Decision Theory?
- Bayesian probability was named after Reverend
Thomas Bayes (1702-1761).
- He proved a special case of what is currently
known as Bayes' theorem.
- The term "Bayesian" came into use around the
1950s.
http://en.wikipedia.org/wiki/Bayesian_probability
22HISTORY (Cont.)
- Pierre-Simon, Marquis de Laplace (1749-1827)
independently proved a generalized version of
Bayes' theorem.
- 1970: Bayesian Belief Networks at Stanford
University (Judea Pearl, 1988).
- The ideas proposed above were not fully
developed until later. BBNs became popular in the
1990s.
23HISTORY (Cont.)
- Current uses of Bayesian Networks
- Microsoft's printer troubleshooter.
- Diagnose diseases (Mycin).
- Used to predict oil and stock prices
- Control the space shuttle
- Risk analysis: schedule and cost overruns.
24BAYESIAN DECISION THEORY
PROBABILISTIC DECISION THEORY
- Bayesian decision theory is a fundamental
statistical approach to the problem of pattern
classification. - Using probabilistic approach to help making
decision (e.g., classification) so as to minimize
the risk (cost). - Assume all relevant probability distributions are
known (later we will learn how to estimate these
from data).
25BAYESIAN DECISION THEORY
PRIOR PROBABILITIES
- State of nature is prior information
- Let ω denote the state of nature
- Model it as a random variable, ω
- ω = ω1: the event that the next fish is a sea
bass
- category 1: sea bass; category 2: salmon
- A priori probabilities
- P(ω1): probability of category 1
- P(ω2): probability of category 2
- P(ω1) + P(ω2) = 1 (either ω1 or ω2 must occur)
- Decision rule
- Decide ω1 if P(ω1) > P(ω2); otherwise, decide
ω2
But we know there will be many mistakes.
http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm
26BAYESIAN DECISION THEORY
CLASS-CONDITIONAL PROBABILITIES
- A decision rule with only prior information
always produces the same result and ignores
measurements.
- If P(ω1) >> P(ω2), we will be correct most of
the time.
- Given a feature, x (lightness), which is a
continuous random variable, p(x|ω2) is the
class-conditional probability density function
- p(x|ω1) and p(x|ω2) describe the difference in
lightness between the sea bass and salmon
populations.
27Let x be a continuous random variable. p(x|ω)
is the probability density for x given the state
of nature ω.
p(lightness | salmon) = ?
p(lightness | sea bass) = ?
28BAYESIAN DECISION THEORY
BAYES FORMULA
How do we combine a priori and class-conditional
probabilities to know the probability of a state
of nature?
- Suppose we know both P(?j) and p(x?j), and we
can measure x. How does this influence our
decision? - The joint probability that of finding a pattern
that is in category j and that this pattern has a
feature value of x is
- Rearranging terms, we arrive at Bayes formula.
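Written out, with P(ωj) the prior, p(x|ωj) the class-conditional density, and p(x) the evidence, the two relations are:

p(\omega_j, x) = P(\omega_j \mid x)\, p(x) = p(x \mid \omega_j)\, P(\omega_j)

P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)},
\qquad p(x) = \sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)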
29- A Casual Formulation
- The prior probability reflects knowledge of the
relative frequency of instances of a class
- The likelihood is a measure of the probability
that a measurement value occurs in a class
- The evidence is a scaling term
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
- Bayes formula
- can be expressed in words as
posterior = (likelihood x prior) / evidence
- By measuring x, we can convert the prior
probability, P(ωj), into a posterior probability,
P(ωj|x).
- Evidence can be viewed as a scale factor and is
often ignored in optimization applications (e.g.,
speech recognition).
For two categories:
Bayes Decision: Choose ω1 if P(ω1|x) > P(ω2|x);
otherwise choose ω2.
30BAYESIAN THEOREM
- A special case of Bayes' theorem
- P(A∩B) = P(B) x P(A|B)
- P(B∩A) = P(A) x P(B|A)
- Since P(A∩B) = P(B∩A),
- P(B) x P(A|B) = P(A) x P(B|A)
- => P(A|B) = P(A) x P(B|A) / P(B)
31Preliminaries and Notations
ω ∈ {ω1, ..., ωc}: a state of nature
P(ωi): prior probability
x: feature vector
p(x|ωi): class-conditional density
P(ωi|x): posterior probability
32Decision
Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
The evidence, p(x), is a scale factor that
assures the conditional probabilities sum to 1:
P(ω1|x) + P(ω2|x) = 1
We can eliminate the scale factor (which appears
on both sides of the equation):
Decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) for all j ≠ i
- Special cases
- 1. P(ω1) = P(ω2) = ... = P(ωc)
- 2. p(x|ω1) = p(x|ω2) = ... = p(x|ωc)
33Two Categories
Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
Decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) for all j ≠ i
Decide ω1 if P(ω1|x) > P(ω2|x); otherwise
decide ω2
Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2);
otherwise decide ω2
- Special cases
- 1. P(ω1) = P(ω2)
- Decide ω1 if p(x|ω1) > p(x|ω2); otherwise
decide ω2
- 2. p(x|ω1) = p(x|ω2)
- Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
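A minimal sketch of the two-category rule above, using the Gaussian class-conditional densities N(4,1) and N(10,1) that appear in the lightness example later in the deck; the equal priors chosen here are just one of the special cases:

from scipy.stats import norm

# Class-conditional densities p(x|omega_i) for the fish lightness example:
# omega_1 = sea bass ~ N(4,1), omega_2 = salmon ~ N(10,1).
densities = {"sea bass": norm(loc=4.0, scale=1.0),
             "salmon":   norm(loc=10.0, scale=1.0)}
priors = {"sea bass": 0.5, "salmon": 0.5}          # P(omega_i), assumed equal here

def decide(x):
    """Decide omega_i if p(x|omega_i) P(omega_i) > p(x|omega_j) P(omega_j)."""
    scores = {c: densities[c].pdf(x) * priors[c] for c in priors}
    return max(scores, key=scores.get)

print(decide(5.0))   # lightness near the sea bass mean -> 'sea bass'
print(decide(9.0))   # lightness near the salmon mean   -> 'salmon'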
34Example
R2
R1
P(ω1) = P(ω2)
35Example
P(ω1) = 2/3, P(ω2) = 1/3
Bayes Decision Rule
Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2);
otherwise decide ω2
36BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
- For every value of x, the posteriors sum to 1.0.
- At x = 14, the probability it is in category ω2 is
0.08, and for category ω1 it is 0.92.
37BAYESIAN DECISION THEORY
BAYES DECISION RULE
Classification Error
- Decision rule
- For an observation x, decide ω1 if P(ω1|x) >
P(ω2|x); otherwise, decide ω2
- Probability of error
- The average probability of error is given by
- If for every x we ensure that P(error|x) is as
small as possible, then the integral is as small
as possible. Thus, the Bayes decision rule
minimizes P(error).
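In symbols, the two quantities referred to above are:

P(\mathrm{error} \mid x) = \min\big[\, P(\omega_1 \mid x),\; P(\omega_2 \mid x) \,\big]

P(\mathrm{error}) = \int_{-\infty}^{\infty} P(\mathrm{error} \mid x)\, p(x)\, dx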
38CONTINUOUS FEATURES
GENERALIZATION OF TWO-CLASS PROBLEM
- Generalization of the preceding ideas
- Use of more than one feature (e.g., length and
lightness)
- Use of more than two states of nature (e.g., N-way
classification)
- Allowing actions other than a decision to decide
on the state of nature (e.g., rejection: refusing
to take an action when alternatives are close or
confidence is low)
- Introduce a loss function which is more
general than the probability of error (e.g.,
errors are not equally costly)
- Let us replace the scalar x by the vector x in a
d-dimensional Euclidean space, Rd, called the
feature space.
39The Generalization
Ω = {ω1, ..., ωc}: a set of c states of nature or c categories
A = {α1, ..., αa}: a set of a possible actions
LOSS FUNCTION
λ(αi|ωj): the loss incurred for taking action αi when the
true state of nature is ωj.
Risk
The loss λ(αi|ωj) can be zero (e.g., for a correct decision).
We want to minimize the expected loss in making a
decision.
40Examples
- Ex 1: Fish classification
- X is the image of the fish
- x = (brightness, length, fin, etc.)
- ω is our belief of what the fish type is
- ω ∈ {sea bass, salmon, trout, etc.}
- α is a decision for the fish type, in this
case
- α ∈ {sea bass, salmon, trout, manual
inspection needed, etc.}
- Ex 2: Medical diagnosis
- X: all the available medical tests, imaging scans
that a doctor can order for a patient
- x = (blood pressure, glucose level, cough, x-ray,
etc.)
- ω is an illness type
- ω ∈ {flu, cold, TB, pneumonia, lung
cancer, etc.}
- α is a decision for treatment,
- α ∈ {Tylenol, hospitalize, more tests
needed, etc.}
41Conditional Risk
Given x, the expected loss (risk) associated with
taking action αi.
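In symbols, the conditional risk of action αi given x is:

R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)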
420/1 Loss Function
43Decision
A general decision rule is a function α(x) that
tells us which action to take for every possible
observation.
Bayesian Decision Rule
44Overall Risk
The overall risk is given by
R = ∫ R(α(x)|x) p(x) dx
- Compute the conditional risk for every α and
select the action that minimizes R(αi|x). This is
denoted R*, and is referred to as the Bayes risk.
- The Bayes risk is the best performance that can
be achieved.
Decision function
If we choose α(x) so that R(α(x)|x) is as small as
possible for every x, the overall risk will be
minimized.
Bayesian decision rule: the optimal rule for
minimizing the overall risk. Its resulting overall
risk is called the Bayes risk.
45Two-Category Classification
- Let α1 correspond to ω1, α2 to ω2, and λij =
λ(αi|ωj)
- The conditional risk is given by
R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
46Two-Category Classification
Our decision rule is: choose ω1 if R(α1|x) <
R(α2|x); otherwise decide ω2
Perform α1 if R(α2|x) > R(α1|x); otherwise
perform α2
47Two-Category Classification
Perform α1 if R(α2|x) > R(α1|x); otherwise
perform α2
Equivalently: perform α1 if
(λ21 - λ11) P(ω1|x) > (λ12 - λ22) P(ω2|x)
Posterior probabilities are scaled before
comparison.
- If the loss incurred for making an error is
greater than that incurred for being correct, the
factors (λ21 - λ11) and (λ12 - λ22) are positive,
and the ratio of these factors simply scales the
posteriors.
48Two-Category Classification
The evidence p(x) appears on both sides and is
irrelevant to the decision.
Perform α1 if R(α2|x) > R(α1|x); otherwise
perform α2
By employing Bayes formula, we can replace the
posteriors by the prior probabilities and
conditional densities:
(λ21 - λ11) p(x|ω1) P(ω1) > (λ12 - λ22) p(x|ω2) P(ω2)
49Two-Category Classification
This slide will be recalled later.
Stated as: choose α1 if the likelihood ratio
exceeds a threshold value that is independent of the
observation x.
Threshold
Likelihood Ratio
Perform α1 if
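In symbols, with λij = λ(αi|ωj), the likelihood ratio is the left-hand side and the threshold is the right-hand side; perform α1 if

\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;>\; \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}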
50(No Transcript)
51Loss Factors
- If the loss factors are identical, and the prior
probabilities are equal, this reduces to a
standard likelihood ratio
52MINIMUM ERROR RATE
Error rate (the probability of error) is to be
minimized
- Consider a symmetrical or zero-one loss function:
λ(αi|ωj) = 0 if i = j, and 1 if i ≠ j
- The conditional risk is then the average probability
of error.
- To minimize error, maximize P(ωi|x); this is also known
as maximum a posteriori (MAP) decoding.
53Minimum Error Rate
LIKELIHOOD RATIO
- Minimum error rate classification
- choose ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
54Example
- For the sea bass population, the lightness x is a
normal random variable distributed according to
N(4,1)
- For the salmon population, x is distributed
according to N(10,1)
- Select the optimal decision when
- The two fish are equiprobable
- P(sea bass) = 2 x P(salmon)
- The cost of classifying a fish as a salmon when
it truly is a sea bass is 2, and the cost of
classifying a fish as a sea bass when it is truly
a salmon is 1.
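A small numeric sketch of this example, assuming zero loss for correct decisions (λ11 = λ22 = 0) and treating the three bullets as separate scenarios; for these unit-variance Gaussians the threshold has the closed form used below, obtained by equating λ21·p(x|sea bass)·P(sea bass) with λ12·p(x|salmon)·P(salmon):

import math

def threshold(p_bass, p_salmon, cost_bass_as_salmon=1.0, cost_salmon_as_bass=1.0):
    """Lightness x* below which we decide 'sea bass' (N(4,1) vs N(10,1)).

    cost_bass_as_salmon = lambda_21: loss for calling a true sea bass a salmon.
    cost_salmon_as_bass = lambda_12: loss for calling a true salmon a sea bass.
    """
    return 7.0 + math.log((cost_bass_as_salmon * p_bass) /
                          (cost_salmon_as_bass * p_salmon)) / 6.0

print(threshold(0.5, 0.5))                            # equiprobable           -> 7.0
print(threshold(2/3, 1/3))                            # P(bass) = 2 P(salmon)  -> ~7.12
print(threshold(0.5, 0.5, cost_bass_as_salmon=2.0))   # costs 2 and 1          -> ~7.12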
55(No Transcript)
56(No Transcript)
57(No Transcript)
58
59The Multicategory Classification
How do we define discriminant functions?
How do we represent pattern classifiers?
The most common way is through discriminant
functions. Remember we use ω1, ω2, ..., ωc for
the possible states of nature.
For each class we create a discriminant function
gi(x).
The gi(x) are called the discriminant functions.
Our classifier is a network or machine that
computes c discriminant functions g1(x), ..., gc(x).
The classifier: assign x to ωi if
gi(x) > gj(x) for all j ≠ i.
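A minimal sketch of this "network of discriminant functions" view; the g_i used in the usage example are hypothetical log-discriminants for two 1-D Gaussian classes with equal priors:

import numpy as np

def classify(x, discriminants):
    """Assign x to the class omega_i whose discriminant g_i(x) is largest."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))

# Usage: two hypothetical discriminants g_1, g_2 (any monotone transform of
# the posteriors would do just as well).
g = [lambda x, m=m: -0.5 * (x - m) ** 2 for m in (4.0, 10.0)]
print(classify(5.0, g))   # -> 0 (class omega_1)
print(classify(9.0, g))   # -> 1 (class omega_2)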
60Simple Discriminant Functions
If f(.) is a monotonically increasing function,
then the f(gi(.)) are also discriminant
functions.
Notice the decision is the same if we replace
every gi(x) by f(gi(x)), assuming f(.) is a
monotonically increasing function.
Minimum Risk case
Minimum Error-Rate case
61Figure 2.5
62Decision Regions
The net effect is to divide the feature space
into c regions (one for each class). We then have
c decision regions separated by decision
boundaries.
Two-category example
Decision regions are separated by decision
boundaries.
63Figure 2.6
64Bayesian Decision Theory(Classification)
65Basics of Probability
Discrete random variable (X) - assume integer values
Probability mass function (pmf)
Cumulative distribution function (cdf)
Continuous random variable (X)
Probability density function (pdf) - its value is
not a probability
Cumulative distribution function (cdf)
66Probability mass function
- The graph of a probability mass function.
- All the values of this function must be
non-negative and sum up to 1.
67Probability density function
- The probability that X lies in an interval is
obtained by integrating the pdf f(x) over that
interval.
- For example, the probability of the variable X
being within the interval [4.3, 7.8] would be
P(4.3 ≤ X ≤ 7.8) = ∫ from 4.3 to 7.8 of f(x) dx.
Let g be a function of random variable X.
The kth moment
The 1st moment
The kth central moments
69Important Expectations
Fact
Mean
Variance
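In symbols (continuous case; replace the integrals by sums for a discrete random variable):

\mathbb{E}[g(X)] = \int g(x)\, f(x)\, dx, \qquad k\text{th moment: } \mathbb{E}[X^k], \qquad \text{1st moment: } \mu = \mathbb{E}[X]

k\text{th central moment: } \mathbb{E}\big[(X - \mu)^k\big], \qquad \mathrm{Var}[X] = \mathbb{E}\big[(X - \mu)^2\big] = \mathbb{E}[X^2] - \mu^2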
70Entropy
The entropy measures the fundamental uncertainty
in the value of points selected randomly from a
distribution.
71Univariate Gaussian Distribution
- Properties
- Maximizes the entropy (among densities with a
given mean and variance)
- Central limit theorem
X ~ N(µ, σ²)
E[X] = µ
Var[X] = σ²
72Illustration of the central limit theorem
Let x1, x2, ..., xn be a sequence of n independent
and identically distributed random variables,
each having finite expectation µ and
variance σ² > 0. The central limit theorem states
that as the sample size n increases, the
distribution of the sample average of these
random variables approaches the normal
distribution with mean µ and variance σ²/n,
irrespective of the shape of the original
distribution.
73Random Vectors
A d-dimensional random vector
Vector mean: µ = E[X]
Covariance matrix: Σ = E[(X - µ)(X - µ)^T]
74Multivariate Gaussian Distribution
X ~ N(µ, Σ)
A d-dimensional random vector
E[X] = µ
E[(X - µ)(X - µ)^T] = Σ
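A short numpy sketch of these definitions, estimating the vector mean and covariance matrix from samples; the µ and Σ values are illustrative:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=5000)      # rows are d-dimensional samples

mean_hat = X.mean(axis=0)                               # estimate of E[X] = mu
centered = X - mean_hat
Sigma_hat = centered.T @ centered / (len(X) - 1)        # estimate of E[(X-mu)(X-mu)^T]

print(mean_hat)                                          # close to mu
print(Sigma_hat)                                         # close to Sigma
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))   # matches numpy's estimator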
75Properties of N(µ,Σ)
X ~ N(µ, Σ)
A d-dimensional random vector
Let Y = A^T X, where A is a d x k matrix.
Y ~ N(A^T µ, A^T Σ A)
76Properties of N(µ,Σ)
X ~ N(µ, Σ)
A d-dimensional random vector
Let Y = A^T X, where A is a d x k matrix.
Y ~ N(A^T µ, A^T Σ A)
77On Parameters of N(µ,Σ)
X ~ N(µ, Σ)
78More On Covariance Matrix
Σ = Φ Λ Φ^T; Σ is symmetric and positive semidefinite.
Φ: orthonormal matrix whose columns are the
eigenvectors of Σ.
Λ: diagonal matrix of eigenvalues.
79Whitening Transform
X ~ N(µ, Σ)
Y = A^T X
Y ~ N(A^T µ, A^T Σ A)
Let A_w = Φ Λ^(-1/2)
80Whitening Transform
Whitening
X ~ N(µ, Σ)
Linear Transform
Y = A^T X
Y ~ N(A^T µ, A^T Σ A)
Let A_w = Φ Λ^(-1/2); then Y = A_w^T X has covariance I.
Projection
81Whitening Transform
- The whitening transformation is a decorrelation
method that converts the covariance matrix Σ of a
set of samples into the identity matrix I.
- This effectively creates new random variables
that are uncorrelated and all have unit variance.
- The method is called the whitening transform
because it transforms the input closer towards
white noise.
This can be expressed as A_w = Φ Λ^(-1/2),
where Φ is the matrix with the eigenvectors of
Σ as its columns and Λ is the diagonal matrix of
its (non-increasing) eigenvalues.
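A minimal numpy sketch of this eigenvector-based whitening; the covariance values are illustrative:

import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=10000)

# Eigendecomposition of the sample covariance: Sigma_hat = Phi Lambda Phi^T
Sigma_hat = np.cov(X, rowvar=False)
eigvals, Phi = np.linalg.eigh(Sigma_hat)
A_w = Phi @ np.diag(eigvals ** -0.5)       # whitening matrix A_w = Phi Lambda^(-1/2)

Y = (X - X.mean(axis=0)) @ A_w             # y = A_w^T x for each centered sample
print(np.cov(Y, rowvar=False))             # approximately the identity matrix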
82White noise
- White noise is a random signal (or process) with
a flat power spectral density. - In other words, the signal contains equal power
within a fixed bandwidth - at any center frequency.
Energy spectral density
83Mahalanobis Distance
X ~ N(µ, Σ)
The loci of points of constant density are
hyperellipsoids of constant Mahalanobis distance r²;
their size depends on the value of r².
84Mahalanobis distance
In statistics, Mahalanobis distance is a distance
measure introduced by P. C. Mahalanobis in
1936. It is based on correlations between
variables by which different patterns can be
identified and analyzed. It is a useful way of
determining similarity of an unknown sample set
to a known one. It differs from Euclidean
distance in that it takes into account the
correlations of the data set and is
scale-invariant, i.e. not dependent on the scale
of measurements .
85Mahalanobis distance
- Formally, the Mahalanobis distance of a
multivariate vector x from a group of values with
mean µ and covariance matrix Σ is defined as
D_M(x) = sqrt( (x - µ)^T Σ^(-1) (x - µ) )
- Mahalanobis distance can also be defined as a
dissimilarity measure between two random vectors x
and y of the same distribution with covariance
matrix Σ:
d(x, y) = sqrt( (x - y)^T Σ^(-1) (x - y) )
- If the covariance matrix is the identity matrix,
the Mahalanobis distance reduces to the Euclidean
distance. If the covariance matrix is diagonal,
then the resulting distance measure is called the
normalized Euclidean distance,
where si is the standard deviation of the xi over
the sample set.
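A short numpy/scipy sketch of the definition above; the µ, Σ, and x values are illustrative:

import numpy as np
from scipy.spatial.distance import mahalanobis

mu = np.array([3.0, 6.0])
Sigma = np.array([[2.0, 0.0],
                  [0.0, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([5.0, 5.0])
d2 = (x - mu) @ Sigma_inv @ (x - mu)       # squared Mahalanobis distance r^2
print(np.sqrt(d2))                         # D_M(x)
print(mahalanobis(x, mu, Sigma_inv))       # same value via scipy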
86Mahalanobis Distance
X ~ N(µ, Σ)
The loci of points of constant density are
hyperellipsoids of constant Mahalanobis distance r²;
their size depends on the value of r².
87Bayesian Decision Theory(Classification)
- Discriminant Functions for the Normal Populations
88Normal Density
If features are statistically independent and the
variance is the same for all features, the
discriminant function is simple and is linear in
nature. A classifier that uses linear
discriminant functions is called a linear
machine. The decision surfaces are pieces of
hyperplanes defined by linear equations.
89Minimum-Error-Rate Classification
- Assuming the measurements are normally
distributed, we have
Xi ~ N(µi, Σi)
90Some Algebra to Simplify the Discriminants
- Since
- We take the natural logarithm to re-write the
first term
91Some Algebra to Simplify the Discriminants
(continued)
92Minimum-Error-Rate Classification
Three Cases
Case 1
Classes are centered at different means, and their
feature components are pairwise independent and
have the same variance.
Case 2
Classes are centered at different means, but have
the same covariance matrix.
Case 3
Arbitrary covariance matrices.
93Case 1. Σi = σ²I
irrelevant
irrelevant
94Case 1. Σi = σ²I
95Case 1. Σi = σ²I
Boundary btw. ωi and ωj
96Case 1. Σi = σ²I
The decision boundary will be a hyperplane
perpendicular to the line between the means,
crossing it at some point.
0 if P(ωi) = P(ωj)
midpoint
Boundary btw. ωi and ωj
w^T
97Case 1. Σi = σ²I
- The decision boundary when the priors are equal and
the support regions are spherical is simply
halfway between the means (Euclidean distance).
Minimum distance classifier (template matching)
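For reference, in Case 1 the discriminant functions take the standard form (equivalent up to terms that do not depend on i):

g_i(x) = -\frac{\lVert x - \mu_i \rVert^2}{2\sigma^2} + \ln P(\omega_i)
= w_i^T x + w_{i0}, \qquad w_i = \frac{\mu_i}{\sigma^2}, \qquad w_{i0} = -\frac{\mu_i^T \mu_i}{2\sigma^2} + \ln P(\omega_i)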
98(No Transcript)
99Case 1. Σi = σ²I
Note how priors shift the boundary away from the
more likely mean!
100Case 1. Σi = σ²I
101Case 1. Σi = σ²I
102Case 2. Σi = Σ
- Covariance matrices are arbitrary, but equal to
each other for all classes.
- Features then form hyper-ellipsoidal clusters of
equal size and shape.
Mahalanobis Distance
Irrelevant if P(ωi) = P(ωj) for all i, j
irrelevant
103Case 2. Σi = Σ
Irrelevant
104Case 2. Σi = Σ
- The discriminant hyperplanes are often not
orthogonal to the segments joining the class means
105Case 2. Σi = Σ
106Case 2. Σi = Σ
107Case 3. Σi ≠ Σj
The covariance matrices are different for each
category. In the two-class case, the decision
boundaries form hyperquadrics.
- Decision surfaces are hyperquadrics, e.g.,
- hyperplanes
- hyperspheres
- hyperellipsoids
- hyperhyperboloids
This quadratic term is absent in Cases 1 and 2
irrelevant
108Case 3. Σi ≠ Σj
Non-simply connected decision regions can arise
in one dimension for Gaussians having unequal
variance.
109Case 3. Σi ≠ Σj
110Case 3. Σi ≠ Σj
111Case 3. Σi ≠ Σj
112Multi-Category Classification
113Example A Problem
- Exemplars (transposed)
- For ω1: (2, 6), (3, 4), (3, 8), (4, 6)
- For ω2: (1, -2), (3, 0), (3, -4), (5, -2)
- Calculated means (transposed)
- m1 = (3, 6)
- m2 = (3, -2)
114Example Covariance Matrices
115Example Covariance Matrices
116Example Inverse and Determinant for Each of the
Covariance Matrices
117Example A Discriminant Function for Class 1
118Example
119Example A Discriminant Function for Class 2
120Example
121Example The Class Boundary
122Example A Quadratic Separator
123Example Plot of the Discriminant
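A sketch of the computation these example slides walk through, using the exemplars from slide 113; it assumes equal priors, the sample-covariance normalization 1/(n-1) (the slides may use 1/n), and the general Gaussian discriminant gi(x) = -1/2 (x-mi)^T Σi^(-1) (x-mi) - 1/2 ln|Σi| + ln P(ωi):

import numpy as np

w1 = np.array([[2, 6], [3, 4], [3, 8], [4, 6]], dtype=float)
w2 = np.array([[1, -2], [3, 0], [3, -4], [5, -2]], dtype=float)

def class_stats(samples):
    m = samples.mean(axis=0)
    c = samples - m
    S = c.T @ c / (len(samples) - 1)          # sample covariance matrix
    return m, S

def make_discriminant(samples, prior):
    m, S = class_stats(samples)
    S_inv, log_det = np.linalg.inv(S), np.log(np.linalg.det(S))
    def g(x):
        d = x - m
        return -0.5 * d @ S_inv @ d - 0.5 * log_det + np.log(prior)
    return g

g1 = make_discriminant(w1, prior=0.5)
g2 = make_discriminant(w2, prior=0.5)

print(class_stats(w1))                         # mean (3, 6) and its covariance matrix
x = np.array([3.0, 1.0])                       # a hypothetical measurement
print("class 1" if g1(x) > g2(x) else "class 2")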
124Summary Steps for Building a Bayesian Classifier
- Collect class exemplars
- Estimate class a priori probabilities
- Estimate class means
- Form covariance matrices, find the inverse and
determinant for each
- Form the discriminant function for each class
125Using the Classifier
- Obtain a measurement vector x
- Evaluate the discriminant function gi(x) for each
class i = 1, ..., c
- Decide x is in class j if gj(x) > gi(x) for
all i ≠ j
126Bayesian Decision Theory(Classification)
127Bayesian Decision Theory(Classification)
- Minimum error rate criterion
128Minimum-Error-Rate Classification
- If action αi is taken and the true state is
ωj, then the decision is correct if i = j and in
error if i ≠ j
- Error rate (the probability of error) is to be
minimized
- Symmetrical or zero-one loss function
- Conditional risk
129Minimum-Error-Rate Classification
130Bayesian Decision Theory(Classification)
131Bayesian Decision RuleTwo-Category
Classification
Threshold
Likelihood Ratio
Decide ω1 if
The minimax criterion deals with the case where the
prior probabilities are unknown.
132Basic Concept on Minimax
Choose the worst-case prior probabilities (the
maximum loss) and then pick the decision rule
that will minimize the overall risk.
Minimize the maximum possible overall risk,
so that the worst risk for any value of the
priors is as small as possible.
133Overall Risk
134Overall Risk
135Overall Risk
136Overall Risk
137Overall Risk
R(x) = ax + b
The value depends on the setting of the decision
boundary.
The value depends on the setting of the decision
boundary.
The overall risk for a particular P(ω1).
138Overall Risk
R(x) = ax + b
a = 0 for the minimax solution
Rmm, the minimax risk
Independent of the value of P(ωi).
139Minimax Risk
140Error Probability
Use 0/1 loss function
141Minimax Error-Probability
Use 0/1 loss function
P(ω1|ω2)
P(ω2|ω1)
142Minimax Error-Probability
P(ω1|ω2)
P(ω2|ω1)
143Minimax Error-Probability
144Bayesian Decision Theory(Classification)
145Bayesian Decision RuleTwo-Category
Classification
Threshold
Likelihood Ratio
Decide ω1 if
The Neyman-Pearson criterion deals with the case where
both the loss functions and the prior probabilities
are unknown.
146Signal Detection Theory
- Signal detection theory evolved from the
development of communications and radar
equipment in the first half of the last century.
- It migrated to psychology, initially as part of
sensation and perception, in the 50's and 60's, as
an attempt to understand some of the features of
human behavior when detecting very faint stimuli
that were not being explained by traditional
theories of thresholds.
147The situation of interest
- A person is faced with a stimulus (signal) that
is very faint or confusing.
- The person must make a decision: is the signal
there or not?
- What makes this situation confusing and difficult
is the presence of other "mess" that is similar to
the signal. Let us call this mess noise.
148Example
Noise is present both in the environment and in
the sensory system of the observer. The observer
reacts to the momentary total activation of the
sensory system, which fluctuates from moment to
moment, as well as responding to environmental
stimuli, which may include a signal.
149Signal Detection Theory
Suppose we want to detect a single pulse in a
signal. We assume the signal has some random
noise. When the signal is present we observe a
normal distribution with mean µ2. When the signal
is not present we observe a normal distribution
with mean µ1. We assume the same standard deviation σ.
Can we measure the discriminability of the
problem? Can we do this independently of the
threshold x*?
Discriminability: d' = |µ2 - µ1| / σ
150Example
- A radiologist is examining a CT scan, looking for
evidence of a tumor.
- A hard job, because there is always some
uncertainty.
- There are four possible outcomes:
- hit (tumor present and doctor says "yes'')
- miss (tumor present and doctor says "no'')
- false alarm (tumor absent and doctor says "yes")
- correct rejection (tumor absent and doctor says
"no").
Two types of Error
151The Four Cases
Signal detection theory was developed to help us
understand how a continuous and ambiguous signal
can lead to a binary yes/no decision.
Correct Rejection: P(ω1|ω1)
Miss: P(ω1|ω2)
False Alarm: P(ω2|ω1)
Hit: P(ω2|ω2)
152Decision Making
Discriminability
Criterion: based on expectancy (decision bias)
Hit: P(ω2|ω2)
False Alarm: P(ω2|ω1)
153(No Transcript)
154Signal Detection Theory
- How do we find d' if we do not know µ1, µ2, or
x*?
- From the data we can compute
- P(x > x* | ω2): a hit.
- P(x > x* | ω1): a false alarm.
- P(x < x* | ω2): a miss.
- P(x < x* | ω1): a correct rejection.
- If we plot a point in a space representing hit
and false alarm rates, then we have an ROC
(receiver operating characteristic) curve.
- With it we can distinguish between
discriminability and bias.
155ROC Curve(Receiver Operating Characteristic)
Hit: PH = P(ω2|ω2)
False Alarm: PFA = P(ω2|ω1)
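A small sketch of how an ROC curve can be traced for the two-Gaussian signal-detection model above by sweeping the threshold x*; the means and σ are illustrative:

import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 1.5, 1.0            # noise-only vs. signal-plus-noise
thresholds = np.linspace(-4.0, 6.0, 201)

p_hit = 1.0 - norm.cdf(thresholds, loc=mu2, scale=sigma)   # PH  = P(x > x* | omega_2)
p_fa  = 1.0 - norm.cdf(thresholds, loc=mu1, scale=sigma)   # PFA = P(x > x* | omega_1)

print((mu2 - mu1) / sigma)                 # d', independent of the threshold
# Each (PFA, PH) pair is one point on the ROC curve; sweeping x* traces the curve.
for fa, hit in zip(p_fa[::50], p_hit[::50]):
    print(round(fa, 3), round(hit, 3))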
156Neyman-Pearson Criterion
Hit: PH = P(ω2|ω2)
NP: maximize PH subject to PFA ≤ α
False Alarm: PFA = P(ω2|ω1)