Discriminant Analysis
Chapter 8

8.1 Introduction
- Classification is an important problem in multivariate analysis and data mining.
- Classification constructs a model from a training set, using the values (class labels) of a classifying attribute, and then uses the model to classify new data, i.e., to predict unknown or missing class labels.
Classification: A Two-Step Process

- Model construction: describe a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classify future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (see the sketch after this list)
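To make the two-step process concrete, here is a minimal sketch in Python; the use of scikit-learn and its bundled iris data is an assumption for illustration, since the slides themselves do not prescribe a tool.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction -- learn a classifier from the training set,
# whose class labels play the role of the classifying attribute.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: estimate accuracy on an independent test set, then classify new data.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test-set accuracy: {accuracy:.2f}")
if accuracy >= 0.9:  # if the accuracy is acceptable...
    print("predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))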
Classification Process: Model Construction

[Figure: the training set is fed to a classification algorithm, which produces a classifier, e.g. the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes']
Classification Process: Use the Model in Prediction

[Figure: the classifier is applied to unseen data, e.g. the tuple (Jeff, Professor, 4), to answer the question "Tenured?"]
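The learned rule can be applied directly to the new tuple; a tiny sketch in Python (the function name is hypothetical):

def tenured(rank: str, years: int) -> bool:
    # classification rule learned during model construction
    return rank == "professor" or years > 6

# classify the new tuple (Jeff, Professor, 4)
print(tenured("professor", 4))  # True: the rank condition alone fires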
Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data (both settings are contrasted in the sketch below)
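A short sketch of the contrast, again assuming scikit-learn for illustration: the classifier is trained on the labels y, while the clustering algorithm sees only X.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])  # class labels, available only in the supervised case

clf = LogisticRegression().fit(X, y)           # supervised: trained on (X, y)
km = KMeans(n_clusters=2, n_init=10).fit(X)    # unsupervised: discovers clusters in X
print(clf.predict([[5.5, 8.5]]), km.labels_)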
Discrimination: Introduction

- Discrimination is a technique concerned with allocating new observations to previously defined groups.
- There are k samples from k distinct populations $G_1, \ldots, G_k$.
- One wants to find the so-called discriminant function and a related rule to classify new observations.
Example 11.3: Bivariate Case

[Figure for Example 11.3: the two groups in the plane of the two variables]

Discriminant Function and Rule

A discriminant function assigns a score $y = W(x)$ to an observation $x$; the associated rule allocates $x$ to one of the populations according to the value of the score.
Example 11.1: Riding Mowers

Consider two groups in a city: riding-mower owners and those without riding mowers. In order to identify the best sales prospects for an intensive sales campaign, a riding-mower manufacturer is interested in classifying families as prospective owners or non-owners on the basis of income and lot size.
Example 11.1: Riding Mowers (continued)

[Data table and scatter plot of income versus lot size for the two groups]
8.2 Discriminant by Distance

- Consider the Mahalanobis distance of an observation $x$ from population $G_i$ with mean $\mu_i$ and covariance matrix $\Sigma_i$:
  $$d^2(x, G_i) = (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i).$$
- Distance rule: allocate $x$ to $G_i$ if $d^2(x, G_i) = \min_{1 \le j \le k} d^2(x, G_j)$.
8.2 Discriminant by Distance (continued)

Let $k = 2$ and $\Sigma_1 = \Sigma_2 = \Sigma$, and define
$$W(x) = \left(x - \frac{\mu_1 + \mu_2}{2}\right)^\top \Sigma^{-1} (\mu_1 - \mu_2).$$
Then $d^2(x, G_2) - d^2(x, G_1) = 2\,W(x)$, so the distance rule becomes: allocate $x$ to $G_1$ if $W(x) \ge 0$ and to $G_2$ otherwise. In practice the unknown $\mu_i$ and $\Sigma$ are replaced by the sample means $\bar{x}_i$ and the pooled sample covariance matrix.
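A sketch of the distance rule in Python, assuming known (or previously estimated) means and covariance matrices; all names and numbers are illustrative.

import numpy as np

def mahalanobis_sq(x, mean, cov):
    # squared Mahalanobis distance d^2(x, G) = (x - mu)' Sigma^{-1} (x - mu)
    d = x - mean
    return float(d @ np.linalg.solve(cov, d))

def distance_rule(x, means, covs):
    # allocate x to the population with the smallest Mahalanobis distance
    return int(np.argmin([mahalanobis_sq(x, m, c) for m, c in zip(means, covs)]))

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(distance_rule(np.array([2.5, 2.0]), [mu1, mu2], [sigma, sigma]))  # -> 1, i.e. G2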
Example: Univariate Case with Equal Variance

With $p = 1$ and common variance $\sigma^2$, $d^2(x, G_i) = (x - \mu_i)^2 / \sigma^2$, so the distance rule reduces to a midpoint cutoff: assuming $\mu_1 > \mu_2$, allocate $x$ to $G_1$ if $x \ge (\mu_1 + \mu_2)/2$ and to $G_2$ otherwise.
8.3 Fisher's Discriminant Function

Training samples: $x_{ij}$, $j = 1, \ldots, n_i$, drawn from population $G_i$, $i = 1, \ldots, k$, with group sample means $\bar{x}_i$, overall mean $\bar{x}$, and $n = n_1 + \cdots + n_k$.

Projecting the data onto a direction $a$ gives $y_{ij} = a^\top x_{ij}$. The F-statistic for comparing the $k$ projected samples is
$$F(a) = \frac{a^\top B a / (k - 1)}{a^\top W a / (n - k)},$$
where
$$B = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^\top, \qquad W = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^\top$$
are the between-group and within-group sums of squares and cross products matrices.
8.3 Fisher's Discriminant Function (continued)

To find $a$ such that $F(a)$ is maximized, it suffices to maximize the ratio $a^\top B a / a^\top W a$. The solution $a$ is the eigenvector associated with the largest eigenvalue $\lambda_1$ of $W^{-1} B$.

Discriminant function: $y = a^\top x$.
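A sketch computing Fisher's direction as the leading eigenvector of $W^{-1}B$, with B and W as defined above; the data here are simulated purely for illustration.

import numpy as np

def fisher_direction(groups):
    # groups: list of (n_i, p) arrays, one per population
    means = [g.mean(axis=0) for g in groups]
    grand = np.vstack(groups).mean(axis=0)
    B = sum(len(g) * np.outer(m - grand, m - grand) for g, m in zip(groups, means))
    W = sum((g - m).T @ (g - m) for g, m in zip(groups, means))
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))  # eigen-system of W^{-1} B
    a = np.real(eigvecs[:, np.argmax(np.real(eigvals))])     # leading eigenvector
    return a / np.linalg.norm(a)

rng = np.random.default_rng(0)
g1 = rng.normal([0.0, 0.0], 1.0, size=(30, 2))
g2 = rng.normal([3.0, 1.0], 1.0, size=(30, 2))
print(fisher_direction([g1, g2]))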
(B) Two Populations

Note: for $k = 2$ we have $\bar{x}_1 - \bar{x} = \frac{n_2}{n}(\bar{x}_1 - \bar{x}_2)$ and $\bar{x}_2 - \bar{x} = -\frac{n_1}{n}(\bar{x}_1 - \bar{x}_2)$, so
$$B = \frac{n_1 n_2}{n} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^\top.$$
There is only one non-zero eigenvalue of $W^{-1} B$, as $B$ has rank one:
$$\lambda_1 = \frac{n_1 n_2}{n} (\bar{x}_1 - \bar{x}_2)^\top W^{-1} (\bar{x}_1 - \bar{x}_2).$$
(B) Two Populations (continued)

The associated eigenvector is
$$a = W^{-1} (\bar{x}_1 - \bar{x}_2),$$
so Fisher's discriminant function is $y = (\bar{x}_1 - \bar{x}_2)^\top W^{-1} x$.

When $W$ is replaced by the pooled sample covariance matrix $S = W/(n - 2)$, the direction changes only by a scalar factor, so the resulting classification is the same.
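For $k = 2$ the eigen-decomposition is unnecessary; a sketch using the closed form $a = W^{-1}(\bar{x}_1 - \bar{x}_2)$ and the midpoint cutting point (simulated data, illustrative names):

import numpy as np

def fisher_two_group(g1, g2):
    m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
    W = (g1 - m1).T @ (g1 - m1) + (g2 - m2).T @ (g2 - m2)
    a = np.linalg.solve(W, m1 - m2)      # a = W^{-1}(xbar1 - xbar2)
    cut = 0.5 * a @ (m1 + m2)            # midpoint of the projected group means
    return a, cut

rng = np.random.default_rng(1)
g1 = rng.normal([0.0, 0.0], 1.0, size=(25, 2))
g2 = rng.normal([2.0, 2.0], 1.0, size=(25, 2))
a, cut = fisher_two_group(g1, g2)
x_new = np.array([1.8, 1.5])
print("G1" if a @ x_new >= cut else "G2")  # -> G2: x_new lies nearer the G2 mean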
Example: Insect Classification

Note: the data $x_1$ and $x_2$ are two measured characteristics of insects (Hoel, 1947); n.g. means the natural group (species), c.g. the classified group, and y the value of the discriminant function.

[Table of the insect data with columns $x_1$, $x_2$, n.g., c.g., y]
Example: Insect Classification (continued)

The largest eigenvalue of $W^{-1} B$ is 1.9187, and the associated eigenvector $a$ gives the discriminant function $y = a^\top x$.
Example: Insect Classification (continued)

The discriminant function is $y = a^\top x$, and its value for each observation is given in the table. The cutting point is the midpoint of the projected group means, $\bar{y} = \frac{1}{2} a^\top (\bar{x}_1 + \bar{x}_2)$. Classification: allocate an observation to $G_1$ if $y \ge \bar{y}$ and to $G_2$ otherwise. If we use the pooled sample covariance matrix in place of $W$, we obtain the same classification.
8.4 Bayes Discriminant Analysis

- Idea:
  - There are k populations $G_1, \ldots, G_k$ in $\mathbb{R}^p$.
  - A partition of $\mathbb{R}^p$ into regions $R_1, \ldots, R_k$ is determined based on a training sample.
  - Rule: classify $x$ into $G_i$ if $x$ falls into $R_i$.
  - Suppose an observation comes from $G_i$ but falls into $R_j$, incurring a loss $c(j \mid i)$. The probability of this misclassification is
    $$P(j \mid i) = \int_{R_j} f_i(x)\, dx,$$
    where $f_i$ is the density of $G_i$.
8.4 Bayes Discriminant Analysis (continued)

The expected cost of misclassification is
$$\mathrm{ECM}(R_1, \ldots, R_k) = \sum_{i=1}^{k} q_i \sum_{j \ne i} c(j \mid i)\, P(j \mid i),$$
where $q_1, \ldots, q_k$ are prior probabilities. We want to minimize $\mathrm{ECM}(R_1, \ldots, R_k)$ with respect to $R_1, \ldots, R_k$.
B. Method

Theorem 6.4.1: Let
$$h_t(x) = \sum_{i \ne t} q_i\, c(t \mid i)\, f_i(x), \qquad t = 1, \ldots, k.$$
Then the optimal $R_t$'s are
$$R_t = \{\, x : h_t(x) = \min_{1 \le j \le k} h_j(x) \,\}, \qquad t = 1, \ldots, k.$$
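A sketch of the optimal rule of Theorem 6.4.1: classify $x$ into the group $t$ that minimizes $h_t(x)$. The densities, priors, and cost matrix below are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(x, priors, densities, cost):
    # h_t(x) = sum_{i != t} q_i * c(t|i) * f_i(x); pick the minimizing t
    k = len(priors)
    f = [d.pdf(x) for d in densities]
    h = [sum(priors[i] * cost[t][i] * f[i] for i in range(k) if i != t)
         for t in range(k)]
    return int(np.argmin(h))

dens = [multivariate_normal(mean=m) for m in ([0, 0], [3, 0], [0, 3])]
cost = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # c(t|i), zero on the diagonal
print(bayes_classify([2.5, 0.5], [1/3, 1/3, 1/3], dens, cost))  # -> 1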
Corollary 1

Take $c(j \mid i) = 1$ if $j \ne i$ and $0$ if $j = i$. Then
$$R_t = \{\, x : q_t f_t(x) = \max_{1 \le j \le k} q_j f_j(x) \,\}, \qquad t = 1, \ldots, k.$$

Proof: with 0-1 costs, $h_t(x) = \sum_{i \ne t} q_i f_i(x) = \sum_{i=1}^{k} q_i f_i(x) - q_t f_t(x)$, which is minimized exactly when $q_t f_t(x)$ is maximized.
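With 0-1 costs the rule reduces to maximizing $q_t f_t(x)$; a univariate sketch with hypothetical densities and priors:

import numpy as np
from scipy.stats import norm

def max_posterior_group(x, priors, densities):
    # Corollary 1: pick t maximizing q_t * f_t(x)
    return int(np.argmax([q * d.pdf(x) for q, d in zip(priors, densities)]))

dens = [norm(loc=0.0), norm(loc=2.0), norm(loc=4.0)]
print(max_posterior_group(1.2, [0.5, 0.3, 0.2], dens))  # -> 0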
Corollary 2

In the case of $k = 2$, we have
$$R_1 = \left\{\, x : \frac{f_1(x)}{f_2(x)} \ge \frac{c(1 \mid 2)\, q_2}{c(2 \mid 1)\, q_1} \,\right\}, \qquad R_2 = R_1^c.$$
Corollary 3

In the case of $k = 2$ and $G_i = N_p(\mu_i, \Sigma)$, $i = 1, 2$, the rule of Corollary 2 becomes: allocate $x$ to $G_1$ if
$$(\mu_1 - \mu_2)^\top \Sigma^{-1} x - \frac{1}{2}(\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 + \mu_2) \ge \ln\!\left[\frac{c(1 \mid 2)\, q_2}{c(2 \mid 1)\, q_1}\right],$$
and to $G_2$ otherwise.
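Corollary 3 as a sketch; parameter names are illustrative, and equal priors and costs give the threshold $\ln 1 = 0$.

import numpy as np

def normal_bayes_rule(x, mu1, mu2, sigma, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    # linear rule for two normal populations with common covariance
    w = np.linalg.solve(sigma, mu1 - mu2)            # Sigma^{-1}(mu1 - mu2)
    lhs = w @ x - 0.5 * w @ (mu1 + mu2)
    return 1 if lhs >= np.log((c12 * q2) / (c21 * q1)) else 2

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
print(normal_bayes_rule(np.array([0.3, 0.4]), mu1, mu2, np.eye(2)))  # -> 1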
C. Example 11.3: Detection of Hemophilia A Carriers

To construct a procedure for detecting potential hemophilia A carriers, blood samples were assayed for two groups of women, and measurements on two variables were recorded. The first group of 30 women was selected from a population of women who did not carry the hemophilia gene; this group was called the normal group. The second group of 45 women was selected from known hemophilia A carriers; this group was called the obligatory carriers.
C. Example 11.3: Detection of Hemophilia A Carriers (continued)

Variables:
- $x_1 = \log_{10}$(AHF activity)
- $x_2 = \log_{10}$(AHF-like antigen)

Populations:
- $G_1$: women who do not carry the hemophilia gene ($n_1 = 30$)
- $G_2$: women who are known hemophilia A carriers ($n_2 = 45$)
C. Example 11.3: Detection of Hemophilia A Carriers (continued)

[Scatter plot of the two groups in the $(x_1, x_2)$ plane]

Data set:
Group 1 (normal, $n_1 = 30$), values of $x_1$ and $x_2$:
-0.0056 -0.1698 -0.3469 -0.0894 -0.1679
-0.0836 -0.1979 -0.0762 -0.1913 -0.1092
-0.5268 -0.0842 -0.0225 0.0084 -0.1827
0.1237 -0.4702 -0.1519 0.0006 -0.2015
-0.1932 0.1507 -0.1259 -0.1551 -0.1952
0.0291 -0.228 -0.0997 -0.1972
-0.0867 -0.1657 -0.1585 -0.1879 0.0064
0.0713 0.0106 -0.0005 0.0392 -0.2123 -0.119
-0.4773 0.0248 -0.058 0.0782 -0.1138 0.214
-0.3099 -0.0686 -0.1153 -0.0498 -0.2293
0.0933 -0.0669 -0.1232 -0.1007 0.0442
-0.171 -0.0733 -0.0607 -0.056

Group 2 (obligatory carriers, $n_2 = 45$), values of $x_1$ and $x_2$:
-0.3478 -0.3618 -0.4986 -0.5015 -0.1326
-0.6911 -0.3608 -0.4535 -0.3479 -0.3539
-0.4719 -0.361 -0.3226 -0.4319 -0.2734
-0.5573 -0.3755 -0.495 -0.5107 -0.1652
-0.2447 -0.4232 -0.2375 -0.2205 -0.2154
-0.3447 -0.254 -0.3778 -0.4046 -0.0639
-0.3351 -0.0149 -0.0312 -0.174 -0.1416
-0.1508 -0.0964 -0.2642 -0.0234 -0.3352
-0.1878 -0.1744 -0.4055 -0.2444
-0.4784 0.1151 -0.2008 -0.086 -0.2984
0.0097 -0.339 0.1237 -0.1682 -0.1721 0.0722
-0.1079 -0.0399 0.167 -0.0687 -0.002 0.0548
-0.1865 -0.0153 -0.2483 0.2132 -0.0407
-0.0998 0.2876 0.0046 -0.0219 0.0097 -0.0573
-0.2682 -0.1162 0.1569 -0.1368 0.1539 0.14
-0.0776 0.1642 0.1137 0.0531 0.0867 0.0804
0.0875 0.251 0.1892 -0.2418 0.1614 0.0282
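The analysis could be reproduced outside SAS; a hedged sketch with scikit-learn, using simulated stand-ins for the two groups (the real values are in the listing above; the location and scale parameters below are rough assumptions, not the actual sample statistics).

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
normal = rng.multivariate_normal([-0.13, -0.08], 0.02 * np.eye(2), size=30)
carriers = rng.multivariate_normal([-0.31, -0.01], 0.02 * np.eye(2), size=45)

X = np.vstack([normal, carriers])
y = np.array([1] * 30 + [2] * 45)

lda = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)
print(lda.predict([[-0.21, -0.04]]))  # classify a new woman
print(lda.score(X, y))                # apparent (resubstitution) accuracy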
C. Example 11.3: Detection of Hemophilia A Carriers (continued)

[SAS output of the discriminant analysis and the resulting classification]