Discriminant Analysis
Chapter 8

8.1 Introduction
- Classification is an important problem in multivariate analysis and data mining.
- Classification constructs a model from a training set, using the values (class labels) of a classifying attribute, and then uses the model to classify new data, i.e., to predict unknown or missing class labels.
Classification: A Two-Step Process

- Model construction: describe a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classify future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (see the sketch after this list)
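To make the two-step process concrete, here is a minimal sketch in Python; the use of scikit-learn and its bundled iris data is an assumption for illustration, since the slides themselves do not prescribe a tool.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction -- learn a classifier from the training set,
# whose class labels play the role of the classifying attribute.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: estimate accuracy on an independent test set, then classify new data.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test-set accuracy: {accuracy:.2f}")
if accuracy >= 0.9:  # if the accuracy is acceptable...
    print("predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))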
Classification Process: Model Construction

[Figure: the training set is fed to a classification algorithm, which produces a classifier, e.g. the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes']
Classification Process: Use the Model in Prediction

[Figure: the classifier is applied to unseen data, e.g. the tuple (Jeff, Professor, 4), to answer the question "Tenured?"]
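The learned rule can be applied directly to the new tuple; a tiny sketch in Python (the function name is hypothetical):

def tenured(rank: str, years: int) -> bool:
    # classification rule learned during model construction
    return rank == "professor" or years > 6

# classify the new tuple (Jeff, Professor, 4)
print(tenured("professor", 4))  # True: the rank condition alone fires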
Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data (both settings are contrasted in the sketch below)
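A short sketch of the contrast, again assuming scikit-learn for illustration: the classifier is trained on the labels y, while the clustering algorithm sees only X.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])  # class labels, available only in the supervised case

clf = LogisticRegression().fit(X, y)           # supervised: trained on (X, y)
km = KMeans(n_clusters=2, n_init=10).fit(X)    # unsupervised: discovers clusters in X
print(clf.predict([[5.5, 8.5]]), km.labels_)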
Discrimination: Introduction

- Discrimination is a technique concerned with allocating new observations to previously defined groups.
- There are k samples from k distinct populations $G_1, \ldots, G_k$.
- One wants to find the so-called discriminant function and a related rule to classify new observations.
Example 11.3: Bivariate Case

[Figure for Example 11.3: the two groups in the plane of the two variables]

Discriminant Function and Rule

A discriminant function assigns a score $y = W(x)$ to an observation $x$; the associated rule allocates $x$ to one of the populations according to the value of the score.
Example 11.1: Riding Mowers

Consider two groups in a city: riding-mower owners and those without riding mowers. In order to identify the best sales prospects for an intensive sales campaign, a riding-mower manufacturer is interested in classifying families as prospective owners or non-owners on the basis of income and lot size.
Example 11.1: Riding Mowers (continued)

[Data table and scatter plot of income versus lot size for the two groups]
8.2 Discriminant by Distance

- Consider the Mahalanobis distance of an observation $x$ from population $G_i$ with mean $\mu_i$ and covariance matrix $\Sigma_i$:
  $$d^2(x, G_i) = (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i).$$
- Distance rule: allocate $x$ to $G_i$ if $d^2(x, G_i) = \min_{1 \le j \le k} d^2(x, G_j)$.
8.2 Discriminant by Distance (continued)

Let $k = 2$ and $\Sigma_1 = \Sigma_2 = \Sigma$, and define
$$W(x) = \left(x - \frac{\mu_1 + \mu_2}{2}\right)^\top \Sigma^{-1} (\mu_1 - \mu_2).$$
Then $d^2(x, G_2) - d^2(x, G_1) = 2\,W(x)$, so the distance rule becomes: allocate $x$ to $G_1$ if $W(x) \ge 0$ and to $G_2$ otherwise. In practice the unknown $\mu_i$ and $\Sigma$ are replaced by the sample means $\bar{x}_i$ and the pooled sample covariance matrix.
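A sketch of the distance rule in Python, assuming known (or previously estimated) means and covariance matrices; all names and numbers are illustrative.

import numpy as np

def mahalanobis_sq(x, mean, cov):
    # squared Mahalanobis distance d^2(x, G) = (x - mu)' Sigma^{-1} (x - mu)
    d = x - mean
    return float(d @ np.linalg.solve(cov, d))

def distance_rule(x, means, covs):
    # allocate x to the population with the smallest Mahalanobis distance
    return int(np.argmin([mahalanobis_sq(x, m, c) for m, c in zip(means, covs)]))

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(distance_rule(np.array([2.5, 2.0]), [mu1, mu2], [sigma, sigma]))  # -> 1, i.e. G2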
Example: Univariate Case with Equal Variance

With $p = 1$ and common variance $\sigma^2$, $d^2(x, G_i) = (x - \mu_i)^2 / \sigma^2$, so the distance rule reduces to a midpoint cutoff: assuming $\mu_1 > \mu_2$, allocate $x$ to $G_1$ if $x \ge (\mu_1 + \mu_2)/2$ and to $G_2$ otherwise.
8.3 Fisher's Discriminant Function

Training samples: $x_{ij}$, $j = 1, \ldots, n_i$, drawn from population $G_i$, $i = 1, \ldots, k$, with group sample means $\bar{x}_i$, overall mean $\bar{x}$, and $n = n_1 + \cdots + n_k$.

Projecting the data onto a direction $a$ gives $y_{ij} = a^\top x_{ij}$. The F-statistic for comparing the $k$ projected samples is
$$F(a) = \frac{a^\top B a / (k - 1)}{a^\top W a / (n - k)},$$
where
$$B = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^\top, \qquad W = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^\top$$
are the between-group and within-group sums of squares and cross products matrices.
8.3 Fisher's Discriminant Function (continued)

To find $a$ such that $F(a)$ is maximized, it suffices to maximize the ratio $a^\top B a / a^\top W a$. The solution $a$ is the eigenvector associated with the largest eigenvalue $\lambda_1$ of $W^{-1} B$.

Discriminant function: $y = a^\top x$.
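A sketch computing Fisher's direction as the leading eigenvector of $W^{-1}B$, with B and W as defined above; the data here are simulated purely for illustration.

import numpy as np

def fisher_direction(groups):
    # groups: list of (n_i, p) arrays, one per population
    means = [g.mean(axis=0) for g in groups]
    grand = np.vstack(groups).mean(axis=0)
    B = sum(len(g) * np.outer(m - grand, m - grand) for g, m in zip(groups, means))
    W = sum((g - m).T @ (g - m) for g, m in zip(groups, means))
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))  # eigen-system of W^{-1} B
    a = np.real(eigvecs[:, np.argmax(np.real(eigvals))])     # leading eigenvector
    return a / np.linalg.norm(a)

rng = np.random.default_rng(0)
g1 = rng.normal([0.0, 0.0], 1.0, size=(30, 2))
g2 = rng.normal([3.0, 1.0], 1.0, size=(30, 2))
print(fisher_direction([g1, g2]))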
(B) Two Populations

Note: for $k = 2$ we have $\bar{x}_1 - \bar{x} = \frac{n_2}{n}(\bar{x}_1 - \bar{x}_2)$ and $\bar{x}_2 - \bar{x} = -\frac{n_1}{n}(\bar{x}_1 - \bar{x}_2)$, so
$$B = \frac{n_1 n_2}{n} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^\top.$$
There is only one non-zero eigenvalue of $W^{-1} B$, as $B$ has rank one:
$$\lambda_1 = \frac{n_1 n_2}{n} (\bar{x}_1 - \bar{x}_2)^\top W^{-1} (\bar{x}_1 - \bar{x}_2).$$
(B) Two Populations (continued)

The associated eigenvector is
$$a = W^{-1} (\bar{x}_1 - \bar{x}_2),$$
so Fisher's discriminant function is $y = (\bar{x}_1 - \bar{x}_2)^\top W^{-1} x$.

When $W$ is replaced by the pooled sample covariance matrix $S = W/(n - 2)$, the direction changes only by a scalar factor, so the resulting classification is the same.
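For $k = 2$ the eigen-decomposition is unnecessary; a sketch using the closed form $a = W^{-1}(\bar{x}_1 - \bar{x}_2)$ and the midpoint cutting point (simulated data, illustrative names):

import numpy as np

def fisher_two_group(g1, g2):
    m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
    W = (g1 - m1).T @ (g1 - m1) + (g2 - m2).T @ (g2 - m2)
    a = np.linalg.solve(W, m1 - m2)      # a = W^{-1}(xbar1 - xbar2)
    cut = 0.5 * a @ (m1 + m2)            # midpoint of the projected group means
    return a, cut

rng = np.random.default_rng(1)
g1 = rng.normal([0.0, 0.0], 1.0, size=(25, 2))
g2 = rng.normal([2.0, 2.0], 1.0, size=(25, 2))
a, cut = fisher_two_group(g1, g2)
x_new = np.array([1.8, 1.5])
print("G1" if a @ x_new >= cut else "G2")  # -> G2: x_new lies nearer the G2 mean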
Example: Insect Classification

Note: the data $x_1$ and $x_2$ are two measured characteristics of insects (Hoel, 1947); n.g. means the natural group (species), c.g. the classified group, and y the value of the discriminant function.

[Table of the insect data with columns $x_1$, $x_2$, n.g., c.g., y]
Example: Insect Classification (continued)

The largest eigenvalue of $W^{-1} B$ is 1.9187, and the associated eigenvector $a$ gives the discriminant function $y = a^\top x$.
Example: Insect Classification (continued)

The discriminant function is $y = a^\top x$, and its value for each observation is given in the table. The cutting point is the midpoint of the projected group means, $\bar{y} = \frac{1}{2} a^\top (\bar{x}_1 + \bar{x}_2)$. Classification: allocate an observation to $G_1$ if $y \ge \bar{y}$ and to $G_2$ otherwise. If we use the pooled sample covariance matrix in place of $W$, we obtain the same classification.
8.4 Bayes Discriminant Analysis

- Idea:
  - There are k populations $G_1, \ldots, G_k$ in $\mathbb{R}^p$.
  - A partition of $\mathbb{R}^p$ into regions $R_1, \ldots, R_k$ is determined based on a training sample.
  - Rule: classify $x$ into $G_i$ if $x$ falls into $R_i$.
  - Suppose an observation comes from $G_i$ but falls into $R_j$, incurring a loss $c(j \mid i)$. The probability of this misclassification is
    $$P(j \mid i) = \int_{R_j} f_i(x)\, dx,$$
    where $f_i$ is the density of $G_i$.
8.4 Bayes Discriminant Analysis (continued)

The expected cost of misclassification is
$$\mathrm{ECM}(R_1, \ldots, R_k) = \sum_{i=1}^{k} q_i \sum_{j \ne i} c(j \mid i)\, P(j \mid i),$$
where $q_1, \ldots, q_k$ are prior probabilities. We want to minimize $\mathrm{ECM}(R_1, \ldots, R_k)$ with respect to $R_1, \ldots, R_k$.
B. Method

Theorem 6.4.1: Let
$$h_t(x) = \sum_{i \ne t} q_i\, c(t \mid i)\, f_i(x), \qquad t = 1, \ldots, k.$$
Then the optimal $R_t$'s are
$$R_t = \{\, x : h_t(x) = \min_{1 \le j \le k} h_j(x) \,\}, \qquad t = 1, \ldots, k.$$
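A sketch of the optimal rule of Theorem 6.4.1: classify $x$ into the group $t$ that minimizes $h_t(x)$. The densities, priors, and cost matrix below are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(x, priors, densities, cost):
    # h_t(x) = sum_{i != t} q_i * c(t|i) * f_i(x); pick the minimizing t
    k = len(priors)
    f = [d.pdf(x) for d in densities]
    h = [sum(priors[i] * cost[t][i] * f[i] for i in range(k) if i != t)
         for t in range(k)]
    return int(np.argmin(h))

dens = [multivariate_normal(mean=m) for m in ([0, 0], [3, 0], [0, 3])]
cost = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # c(t|i), zero on the diagonal
print(bayes_classify([2.5, 0.5], [1/3, 1/3, 1/3], dens, cost))  # -> 1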
Corollary 1

Take $c(j \mid i) = 1$ if $j \ne i$ and $0$ if $j = i$. Then
$$R_t = \{\, x : q_t f_t(x) = \max_{1 \le j \le k} q_j f_j(x) \,\}, \qquad t = 1, \ldots, k.$$

Proof: with 0-1 costs, $h_t(x) = \sum_{i \ne t} q_i f_i(x) = \sum_{i=1}^{k} q_i f_i(x) - q_t f_t(x)$, which is minimized exactly when $q_t f_t(x)$ is maximized.
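With 0-1 costs the rule reduces to maximizing $q_t f_t(x)$; a univariate sketch with hypothetical densities and priors:

import numpy as np
from scipy.stats import norm

def max_posterior_group(x, priors, densities):
    # Corollary 1: pick t maximizing q_t * f_t(x)
    return int(np.argmax([q * d.pdf(x) for q, d in zip(priors, densities)]))

dens = [norm(loc=0.0), norm(loc=2.0), norm(loc=4.0)]
print(max_posterior_group(1.2, [0.5, 0.3, 0.2], dens))  # -> 0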
Corollary 2

In the case of $k = 2$, we have
$$R_1 = \left\{\, x : \frac{f_1(x)}{f_2(x)} \ge \frac{c(1 \mid 2)\, q_2}{c(2 \mid 1)\, q_1} \,\right\}, \qquad R_2 = R_1^c.$$
Corollary 3

In the case of $k = 2$ and $G_i = N_p(\mu_i, \Sigma)$, $i = 1, 2$, the rule of Corollary 2 becomes: allocate $x$ to $G_1$ if
$$(\mu_1 - \mu_2)^\top \Sigma^{-1} x - \frac{1}{2}(\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 + \mu_2) \ge \ln\!\left[\frac{c(1 \mid 2)\, q_2}{c(2 \mid 1)\, q_1}\right],$$
and to $G_2$ otherwise.
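Corollary 3 as a sketch; parameter names are illustrative, and equal priors and costs give the threshold $\ln 1 = 0$.

import numpy as np

def normal_bayes_rule(x, mu1, mu2, sigma, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    # linear rule for two normal populations with common covariance
    w = np.linalg.solve(sigma, mu1 - mu2)            # Sigma^{-1}(mu1 - mu2)
    lhs = w @ x - 0.5 * w @ (mu1 + mu2)
    return 1 if lhs >= np.log((c12 * q2) / (c21 * q1)) else 2

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
print(normal_bayes_rule(np.array([0.3, 0.4]), mu1, mu2, np.eye(2)))  # -> 1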
C. Example 11.3: Detection of Hemophilia A Carriers

To construct a procedure for detecting potential hemophilia A carriers, blood samples were assayed for two groups of women, and measurements on two variables were recorded. The first group of 30 women was selected from a population of women who did not carry the hemophilia gene; this group was called the normal group. The second group of 45 women was selected from known hemophilia A carriers; this group was called the obligatory carriers.
C. Example 11.3: Detection of Hemophilia A Carriers (continued)

Variables:
- $x_1 = \log_{10}$(AHF activity)
- $x_2 = \log_{10}$(AHF-like antigen)

Populations:
- $G_1$: women who do not carry the hemophilia gene ($n_1 = 30$)
- $G_2$: women who are known hemophilia A carriers ($n_2 = 45$)
C. Example 11.3: Detection of Hemophilia A Carriers (continued)

[Scatter plot of the two groups in the $(x_1, x_2)$ plane]

Data set:
Group 1 (normal, $n_1 = 30$), values of $x_1$ and $x_2$:
-0.0056 -0.1698 -0.3469 -0.0894 -0.1679
-0.0836 -0.1979 -0.0762 -0.1913 -0.1092
-0.5268 -0.0842 -0.0225 0.0084 -0.1827
0.1237 -0.4702 -0.1519 0.0006 -0.2015
-0.1932 0.1507 -0.1259 -0.1551 -0.1952
0.0291 -0.228 -0.0997 -0.1972
-0.0867 -0.1657 -0.1585 -0.1879 0.0064
0.0713 0.0106 -0.0005 0.0392 -0.2123 -0.119
-0.4773 0.0248 -0.058 0.0782 -0.1138 0.214
-0.3099 -0.0686 -0.1153 -0.0498 -0.2293
0.0933 -0.0669 -0.1232 -0.1007 0.0442
-0.171 -0.0733 -0.0607 -0.056

Group 2 (obligatory carriers, $n_2 = 45$), values of $x_1$ and $x_2$:
-0.3478 -0.3618 -0.4986 -0.5015 -0.1326
-0.6911 -0.3608 -0.4535 -0.3479 -0.3539
-0.4719 -0.361 -0.3226 -0.4319 -0.2734
-0.5573 -0.3755 -0.495 -0.5107 -0.1652
-0.2447 -0.4232 -0.2375 -0.2205 -0.2154
-0.3447 -0.254 -0.3778 -0.4046 -0.0639
-0.3351 -0.0149 -0.0312 -0.174 -0.1416
-0.1508 -0.0964 -0.2642 -0.0234 -0.3352
-0.1878 -0.1744 -0.4055 -0.2444
-0.4784 0.1151 -0.2008 -0.086 -0.2984
0.0097 -0.339 0.1237 -0.1682 -0.1721 0.0722
-0.1079 -0.0399 0.167 -0.0687 -0.002 0.0548
-0.1865 -0.0153 -0.2483 0.2132 -0.0407
-0.0998 0.2876 0.0046 -0.0219 0.0097 -0.0573
-0.2682 -0.1162 0.1569 -0.1368 0.1539 0.14
-0.0776 0.1642 0.1137 0.0531 0.0867 0.0804
0.0875 0.251 0.1892 -0.2418 0.1614 0.0282
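The analysis could be reproduced outside SAS; a hedged sketch with scikit-learn, using simulated stand-ins for the two groups (the real values are in the listing above; the location and scale parameters below are rough assumptions, not the actual sample statistics).

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
normal = rng.multivariate_normal([-0.13, -0.08], 0.02 * np.eye(2), size=30)
carriers = rng.multivariate_normal([-0.31, -0.01], 0.02 * np.eye(2), size=45)

X = np.vstack([normal, carriers])
y = np.array([1] * 30 + [2] * 45)

lda = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)
print(lda.predict([[-0.21, -0.04]]))  # classify a new woman
print(lda.score(X, y))                # apparent (resubstitution) accuracy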
C. Example 11.3: Detection of Hemophilia A Carriers (continued)

[SAS output of the discriminant analysis and the resulting classification]