Title: Discriminant Analysis Concepts
1. Discriminant Analysis Concepts
- Used to predict group membership from a set of continuous predictors.
- Think of it as MANOVA in reverse: in MANOVA we asked whether groups are significantly different on a set of linearly combined responses.
- The same responses can be used to predict group membership.
2. Discriminant Analysis Concepts
- Determine how continuous variables can be linearly combined to best classify a subject into a group. A better term may be separation.
- Slightly different is classification, where we seek rules that allocate new subjects into established classes.
- Logistic regression is a competitor.
3. Classification
- Two populations, π1 and π2.
- We have measurements x' = (x_1, x_2, \ldots, x_p) on each of the individuals concerned.
- Given a new value of x for an unknown individual, our problem is how best to classify this individual.
4. Illustration
[Figure: two overlapping density curves f1(x) and f2(x) plotted over the classification regions R1 and R2. The area of f2 falling in R1 is the probability of misclassifying a Population 2 member into Population 1; the area of f1 falling in R2 is the probability of misclassifying a Population 1 member into Population 2.]
5. Misclassification
The probability that an individual from π1 is wrongly classified is

P(2|1) = \int_{R_2} f_1(x)\,dx

and the probability that an individual from π2 is wrongly classified is

P(1|2) = \int_{R_1} f_2(x)\,dx.
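As a concrete illustration (not from the slides), here is a minimal Python sketch of these two error probabilities for two hypothetical univariate normal populations separated by a single cutpoint t, so that R1 = {x < t} and R2 = {x >= t}; the means, standard deviations, and cutpoint are all assumed values.

```python
# A minimal sketch, assuming two univariate normal populations and a
# single cutpoint t: R1 = {x < t}, R2 = {x >= t}.
from scipy.stats import norm

mu1, sd1 = 0.0, 1.0   # population pi_1 (hypothetical values)
mu2, sd2 = 2.0, 1.0   # population pi_2 (hypothetical values)
t = 1.0               # boundary between R1 and R2

# P(2|1): an individual from pi_1 falls in R2 = [t, inf)
p_2_given_1 = 1 - norm.cdf(t, mu1, sd1)
# P(1|2): an individual from pi_2 falls in R1 = (-inf, t)
p_1_given_2 = norm.cdf(t, mu2, sd2)

print(p_2_given_1, p_1_given_2)   # both ~0.159 for this symmetric case
```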
6. Four Possibilities
- Assume p1 and p2 are the prior probabilities of π1 and π2, respectively.

             Classified π1         Classified π2
Actual π1    correct: P(1|1) p1    error: P(2|1) p1
Actual π2    error: P(1|2) p2      correct: P(2|2) p2
7. Costs
- In general there is a cost associated with misclassification.
- Assume the cost is zero for correct classification.
- C(2|1) is the cost of misclassifying a π1 individual as a π2 individual.
- C(1|2) is the cost of misclassifying a π2 individual as a π1 individual.

             Classified π1    Classified π2
Actual π1    0                C(2|1)
Actual π2    C(1|2)           0
8. Expected Cost of Misclassification (ECM)

ECM = C(2|1) P(2|1) p_1 + C(1|2) P(1|2) p_2

Goal: minimize the ECM.
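Continuing the same hypothetical univariate setup, here is a minimal sketch that evaluates the ECM over a grid of candidate cutpoints; the priors and costs are assumed values.

```python
# A minimal sketch of the ECM for a univariate-normal two-population
# setup, scanned over candidate cutpoints t (all values hypothetical).
import numpy as np
from scipy.stats import norm

mu1, sd1, mu2, sd2 = 0.0, 1.0, 2.0, 1.0
p1, p2 = 0.5, 0.5            # priors
c21, c12 = 1.0, 1.0          # C(2|1), C(1|2)

def ecm(t):
    P21 = 1 - norm.cdf(t, mu1, sd1)   # pi_1 member lands in R2
    P12 = norm.cdf(t, mu2, sd2)       # pi_2 member lands in R1
    return c21 * P21 * p1 + c12 * P12 * p2

grid = np.linspace(-2, 4, 601)
best = grid[np.argmin([ecm(t) for t in grid])]
print(best)   # ~1.0, the midpoint, as expected with equal priors/costs
```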
9. It can be shown that the ECM is minimized if R1 contains those values of x for which

C(1|2) p_2 f_2(x) - C(2|1) p_1 f_1(x) \le 0

and excludes those x for which the above is > 0. In other words, R1 is the set of points x for which

\frac{f_1(x)}{f_2(x)} \ge \frac{C(1|2)}{C(2|1)} \cdot \frac{p_2}{p_1}.

So when x satisfies this inequality, we would classify the corresponding individual into π1.
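A minimal sketch of this density-ratio rule for the same hypothetical univariate normals; with the assumed equal priors and costs, the threshold (C(1|2)/C(2|1))(p2/p1) reduces to 1.

```python
# A minimal sketch of the minimum-ECM rule as a density-ratio test:
# classify x into pi_1 when f1(x)/f2(x) >= (C(1|2)/C(2|1)) * (p2/p1).
from scipy.stats import norm

mu1, sd1, mu2, sd2 = 0.0, 1.0, 2.0, 1.0   # hypothetical populations
p1, p2, c21, c12 = 0.5, 0.5, 1.0, 1.0

def classify(x):
    ratio = norm.pdf(x, mu1, sd1) / norm.pdf(x, mu2, sd2)
    threshold = (c12 / c21) * (p2 / p1)
    return 1 if ratio >= threshold else 2

print(classify(0.3), classify(1.7))   # -> 1, 2
```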
10. Conversely, since R2 is the complement of R1, R2 is the set such that

\frac{f_1(x)}{f_2(x)} < \frac{C(1|2)}{C(2|1)} \cdot \frac{p_2}{p_1},

and an individual whose x vector satisfies this inequality would be allocated to π2.
11. - Assuming x has a multivariate normal distribution, i.e.
- x \sim N_p(\mu_i, \Sigma) in population i (i = 1, 2)
- (note that this implies the same covariance matrix applies to each population), we have
- f_1(x) \propto \exp\{-\tfrac{1}{2}(x - \mu_1)' \Sigma^{-1} (x - \mu_1)\}
- f_2(x) \propto \exp\{-\tfrac{1}{2}(x - \mu_2)' \Sigma^{-1} (x - \mu_2)\}
12. - The general rule (1), after taking natural logs and some rearrangement, can be shown to be equivalent to:

\ell' x - \ell' \frac{(\mu_1 + \mu_2)}{2} \ge c

where

\ell = \Sigma^{-1}(\mu_1 - \mu_2) = (d_1, d_2, \ldots, d_p)', say

(correspondingly, d' = (d_1, d_2, \ldots, d_p) = (\mu_1 - \mu_2)' \Sigma^{-1}), and

c = \ln\left[\frac{C(1|2)}{C(2|1)} \cdot \frac{p_2}{p_1}\right].
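A minimal numerical sketch of this population rule, using assumed values for µ1, µ2, and Σ: compute ℓ = Σ⁻¹(µ1 − µ2) and allocate by comparing the score against c.

```python
# A minimal sketch of the population rule (hypothetical parameters):
# l = Sigma^{-1}(mu1 - mu2); classify x into pi_1 when
# l'x - l'(mu1 + mu2)/2 >= c = ln[(C(1|2)/C(2|1)) (p2/p1)].
import numpy as np

mu1 = np.array([1.0, 2.0])
mu2 = np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
p1 = p2 = 0.5
c12 = c21 = 1.0

l = np.linalg.solve(Sigma, mu1 - mu2)   # discriminant coefficients
c = np.log((c12 / c21) * (p2 / p1))     # = 0 here

def allocate(x):
    score = l @ x - l @ (mu1 + mu2) / 2
    return 1 if score >= c else 2

print(allocate(np.array([1.0, 2.0])))   # mu1 itself -> 1
```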
13. Priors
- Typically, information is not available on the prior probabilities p1 and p2.
- They are typically taken to be equal, making c a function of only the ratio of the two costs.
- In addition, if the misclassification costs C(1|2) and C(2|1) are the same, then c = 0.
14. - Ordinarily \Sigma, \mu_1, \mu_2 are not known and need to be estimated from the data by S, \bar{x}_1, and \bar{x}_2 respectively, and we therefore use S^{-1}(\bar{x}_1 - \bar{x}_2) for \ell, etc.,
- where S^{-1} is taken as the inverse of

S_{pooled} = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}

- where S_1 and S_2 are the sample covariance matrices for each of the two groups (populations), respectively.
15. Minimum ECM for Two Normals
- Allocate x_0 to π1 when

\hat{\ell}' x_0 - \frac{1}{2} \hat{\ell}' (\bar{x}_1 + \bar{x}_2) \ge c
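A minimal sketch of this sample-based version, combining the pooled covariance of the previous slide with this allocation rule; the training data are simulated under assumed parameters, and equal priors and costs give c = 0.

```python
# A minimal sketch of the sample-based rule on simulated training
# data (all numbers hypothetical).
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([1, 2], [[2, .5], [.5, 1]], size=50)
X2 = rng.multivariate_normal([3, 1], [[2, .5], [.5, 1]], size=40)

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)             # (n-1)-divisor covariances
S2 = np.cov(X2, rowvar=False)
n1, n2 = len(X1), len(X2)
S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

l_hat = np.linalg.solve(S_pooled, xbar1 - xbar2)
cut = l_hat @ (xbar1 + xbar2) / 2         # c = 0: equal priors/costs

def allocate(x0):
    return 1 if l_hat @ x0 >= cut else 2

print(allocate(np.array([1.0, 2.0])), allocate(np.array([3.0, 1.0])))
```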
16. Linear Discriminant Function
- \ell' x = (\mu_1 - \mu_2)' \Sigma^{-1} x is called the linear discriminant function of x.
- This linear combination of x summarizes all of the possible information in x that is available for discriminating between the two populations.
17. Unequal Covariance Matrices
Allocate x_0 to π1 when

-\frac{1}{2} x_0' (S_1^{-1} - S_2^{-1}) x_0 + (\bar{x}_1' S_1^{-1} - \bar{x}_2' S_2^{-1}) x_0 - k \ge c

where

k = \frac{1}{2} \ln\left(\frac{|S_1|}{|S_2|}\right) + \frac{1}{2}\left(\bar{x}_1' S_1^{-1} \bar{x}_1 - \bar{x}_2' S_2^{-1} \bar{x}_2\right)

(Quadratic Classification Rule)
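A minimal sketch of this quadratic score on simulated data with deliberately unequal covariance matrices (all parameter values assumed); k follows the formula above.

```python
# A minimal sketch of the quadratic classification rule on simulated
# training data with unequal covariances (hypothetical values).
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([1, 2], [[2.0, 0.5], [0.5, 1.0]], size=60)
X2 = rng.multivariate_normal([3, 1], [[1.0, 0.0], [0.0, 3.0]], size=60)

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)

# k = 0.5 ln(|S1|/|S2|) + 0.5 (xbar1'S1^{-1}xbar1 - xbar2'S2^{-1}xbar2)
k = 0.5 * (np.log(np.linalg.det(S1) / np.linalg.det(S2))
           + xbar1 @ S1inv @ xbar1 - xbar2 @ S2inv @ xbar2)

def quad_score(x0):
    return (-0.5 * x0 @ (S1inv - S2inv) @ x0
            + (xbar1 @ S1inv - xbar2 @ S2inv) @ x0 - k)

# allocate to pi_1 when the score >= c (here c = 0: equal priors/costs)
print(quad_score(np.array([1.0, 2.0])) >= 0)   # near mu1 -> True
```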
18. Fisher's Discriminant Function
Allocate x_0 to π1 when

(\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1} x_0 \ge \frac{1}{2} (\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1} (\bar{x}_1 + \bar{x}_2)

Note: the p-variate standard distance between the two mean vectors is defined as

D = \left[(\bar{x}_1 - \bar{x}_2)' S_{pooled}^{-1} (\bar{x}_1 - \bar{x}_2)\right]^{1/2}.

For this problem, the squared standard distance of the linear combination a'x,

\frac{[a'(\bar{x}_1 - \bar{x}_2)]^2}{a' S_{pooled}\, a},

is maximized at a = S_{pooled}^{-1} (\bar{x}_1 - \bar{x}_2).
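A small numerical check (simulated data, assumed values) that a = S_pooled⁻¹(x̄1 − x̄2) does maximize the squared standard distance of a'x, by comparing it against many random directions.

```python
# A minimal numerical check that a = S_pooled^{-1}(xbar1 - xbar2)
# maximizes (a'(xbar1 - xbar2))^2 / (a' S_pooled a) over directions a.
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([1, 2], [[2, .5], [.5, 1]], size=50)
X2 = rng.multivariate_normal([3, 1], [[2, .5], [.5, 1]], size=50)
d = X1.mean(axis=0) - X2.mean(axis=0)
Sp = ((len(X1) - 1) * np.cov(X1, rowvar=False)
      + (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X1) + len(X2) - 2)

def ratio(a):
    return (a @ d) ** 2 / (a @ Sp @ a)

a_star = np.linalg.solve(Sp, d)
rand_ratios = [ratio(rng.standard_normal(2)) for _ in range(1000)]
print(ratio(a_star) >= max(rand_ratios))   # True: Fisher's a wins
```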
19. Linear Discriminant Function, Alternative View
The linear combination of x, say y = a'x, is called a linear discriminant function if a maximizes the ratio

\frac{[a'(\bar{x}_1 - \bar{x}_2)]^2}{a' S_{pooled}\, a}.
20. Example

21. Example
Example of Linear Discriminant Function
[Figure: scatterplot of two groups with the unscaled discriminant direction b; both axes run from -4 to 4.]
22. Example
Example With Correlation
[Figure: scatterplot of two groups with within-group correlation of 0.6 and the discriminant direction b; both axes run from -4 to 4.]
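A minimal sketch that reproduces the flavor of this figure: two simulated groups with within-group correlation 0.6 and the unscaled discriminant direction b; all parameter values are assumed for illustration.

```python
# A minimal sketch: two correlated groups and the unscaled
# discriminant direction b (hypothetical parameter values).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])   # correlation 0.6
X1 = rng.multivariate_normal([-1, -1], Sigma, size=100)
X2 = rng.multivariate_normal([1, 1], Sigma, size=100)

# pooled sample covariance (equal group sizes, so a simple average)
Sp = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))
b = np.linalg.solve(Sp, X1.mean(0) - X2.mean(0))   # unscaled direction

plt.scatter(*X1.T, label="group 1")
plt.scatter(*X2.T, label="group 2")
plt.axline((0, 0), (b[0], b[1]), color="k", label="b")
plt.xlim(-4, 4); plt.ylim(-4, 4); plt.legend(); plt.show()
```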
23. More Than Two Groups
The among-groups sums-of-squares and cross-products matrix is

B = \sum_{i=1}^{g} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})'.
24. More Than Two Groups
- Note: this is the same decomposition we used with MANOVA,

T = \sum_{i=1}^{g}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})(x_{ij} - \bar{x})' = B + W.
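A minimal sketch computing the among-groups (B) and within-groups (W) SSCP matrices for g = 3 simulated groups (assumed values) and verifying the decomposition T = B + W numerically.

```python
# A minimal sketch of the MANOVA decomposition T = B + W on
# simulated data (hypothetical group means).
import numpy as np

rng = np.random.default_rng(4)
groups = [rng.normal(m, 1, size=(30, 2)) for m in ([0, 0], [1, 2], [3, 1])]

X = np.vstack(groups)
grand = X.mean(axis=0)

B = sum(len(g) * np.outer(g.mean(0) - grand, g.mean(0) - grand)
        for g in groups)                       # among-groups SSCP
W = sum((g - g.mean(0)).T @ (g - g.mean(0)) for g in groups)  # within
T = (X - grand).T @ (X - grand)                # total SSCP

print(np.allclose(T, B + W))   # True
```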
25. More Than Two Groups
Referred to as canonical discriminant analysis.
26. Canonical Correlation Analysis
- A statistical technique to identify and measure the association between two sets of variables.
- Multiple regression can be interpreted as a special case of such an analysis.
- The multiple correlation coefficient, R, can be thought of as the maximum correlation attainable between the dependent variable and a linear combination of the independent variables.
27. Canonical Correlation Analysis
- CCA is an extension of the multiple R in multiple regression.
- In CCA, there can be multiple response variables.
- Canonical correlations are the maximum correlations between a linear combination of the responses and a linear combination of the predictor variables.
28. Canonical Correlations
Suppose x is partitioned as x = (x_1', x_2')', where x_1 = (x_{11}, \ldots, x_{1q})' and x_2 = (x_{21}, \ldots, x_{2,p-q})'. Note that Var(x_1) = \Sigma_{11} is q \times q, Var(x_2) = \Sigma_{22} is (p-q) \times (p-q), Cov(x_1, x_2) = \Sigma_{12} is q \times (p-q), Cov(x_2, x_1) = \Sigma_{21} is (p-q) \times q, and \Sigma_{21} = \Sigma_{12}'.
29. The First Canonical Correlation
- Find a_1 and b_1 (vectors of constants) such that Corr(a_1' x_1, b_1' x_2) is as large as possible.
- Let U_1 = a_1' x_1 and V_1 = b_1' x_2, and call them canonical variables.
- Then Var(U_1) = a_1' \Sigma_{11} a_1,
- Var(V_1) = b_1' \Sigma_{22} b_1,
- and Cov(U_1, V_1) = a_1' \Sigma_{12} b_1.
30. The First Canonical Correlation
The correlation between U_1 and V_1 is

\rho = \frac{a_1' \Sigma_{12} b_1}{\sqrt{(a_1' \Sigma_{11} a_1)(b_1' \Sigma_{22} b_1)}}.
31. Finding the Correlation
Let \rho_1 = \max_{a_1, b_1} Corr(U_1, V_1). It can be shown that \rho_1^2 is the largest eigenvalue of

\Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21};

a_1 is the eigenvector corresponding to \rho_1^2, and b_1 is the eigenvector corresponding to the largest eigenvalue of

\Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12};

this largest eigenvalue also is \rho_1^2.

Note that 0 \le \rho_1 \le 1.
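A minimal sketch of this eigenvalue computation on a sample covariance matrix from simulated data (assumed values): ρ1² is taken as the largest eigenvalue of S11⁻¹S12S22⁻¹S21, and the resulting canonical variables are checked against their sample correlation.

```python
# A minimal sketch of the first canonical correlation from a sample
# covariance matrix (simulated, hypothetical data).
import numpy as np

rng = np.random.default_rng(5)
Z = rng.standard_normal((500, 1))              # shared latent driver
X1 = Z + 0.5 * rng.standard_normal((500, 2))   # q = 2 variables
X2 = Z + 0.5 * rng.standard_normal((500, 3))   # p - q = 3 variables

S = np.cov(np.hstack([X1, X2]), rowvar=False)
S11, S12 = S[:2, :2], S[:2, 2:]
S21, S22 = S[2:, :2], S[2:, 2:]

# rho1^2 is the largest eigenvalue of S11^{-1} S12 S22^{-1} S21
M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
eigvals, eigvecs = np.linalg.eig(M)
i = np.argmax(eigvals.real)
rho1 = np.sqrt(eigvals.real[i])
a1 = eigvecs.real[:, i]

# b1 is proportional to S22^{-1} S21 a1; compare rho1 with the
# sample correlation of U1 = a1'x1 and V1 = b1'x2
b1 = np.linalg.solve(S22, S21 @ a1)
U1, V1 = X1 @ a1, X2 @ b1
print(rho1, np.corrcoef(U1, V1)[0, 1])        # should agree
```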