Discriminant Analysis

Transcript and Presenter's Notes

1
Chapter 8
  • Discriminant Analysis

2
8.1 Introduction
  • Classification is an important issue in
    multivariate analysis and data mining.
  • Classification constructs a model from a training
    set whose class labels (the values of a classifying
    attribute) are known, and uses the model to classify
    new data, i.e., to predict unknown or missing class
    labels

3
Classification: A Two-Step Process
  • Step 1: model construction, describing a set of
    predetermined classes
  • Each tuple/sample is assumed to belong to a
    predefined class, as determined by the class
    label attribute
  • The set of tuples used for model construction is
    the training set
  • The model is represented as classification rules,
    decision trees, or mathematical formulae
  • Step 2: model usage, for classifying future or
    unknown objects
  • Estimate the accuracy of the model
  • The known label of each test sample is compared
    with the classified result from the model
  • The accuracy rate is the percentage of test-set
    samples that are correctly classified by the
    model
  • The test set must be independent of the training
    set, otherwise over-fitting will occur
  • If the accuracy is acceptable, use the model to
    classify data tuples whose class labels are not
    known (a minimal sketch of this process follows)
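The following is a minimal, illustrative sketch of this two-step process in Python; the dataset, the classifier choice (a decision tree via scikit-learn), and the split ratio are assumptions for illustration, not taken from the slides.

    # Step 1: construct a model on a training set; Step 2: estimate its
    # accuracy on an independent test set, then classify new tuples.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Synthetic labeled data standing in for (attributes, class label)
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Step 1: model construction from the training set
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Step 2: accuracy on the independent test set; if acceptable,
    # use the model to classify tuples whose labels are unknown
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"test accuracy: {accuracy:.2f}")
    print("predicted class for a new tuple:", model.predict(X_test[:1])[0])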

4
Classification Process: Model Construction
Classification Algorithms
Example rule learned from the training data:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
5
Classification Process: Use the Model in Prediction
(Jeff, Professor, 4)
Tenured?
6
Supervised vs. Unsupervised Learning
  • Supervised learning (classification)
  • Supervision: the training data (observations,
    measurements, etc.) are accompanied by labels
    indicating the class of each observation
  • New data are classified based on the training set
  • Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc.,
    the aim is to establish the existence of classes
    or clusters in the data

7
Discrimination: Introduction
  • Discrimination is a technique concerned with
    allocating new observations to previously defined
    groups.
  • There are k samples from k distinct populations.
  • One wants to find a so-called discriminant
    function and a related rule to classify new
    observations.

8
Example 11.3 Bivariate case
9
Discriminant function and rule
10
Example 11.1 Riding mowers
Consider two groups in a city: riding-mower owners
and those without riding mowers. In order to
identify the best sales prospects for an
intensive sales campaign, a riding-mower
manufacturer is interested in classifying
families as prospective owners or non-owners on
the basis of income and lot size.
11
Example 11.1 Riding mowers
12
Example 11.1 Riding mowers
13
8.2 Discriminant by Distance
  • Assume k = 2 for simplicity

14
8.2 Discriminant by Distance
  • Consider the Mahalanobis distance
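The distance formula on this slide was not transcribed. A standard form of the squared Mahalanobis distance, assuming a common covariance matrix Σ for the two populations, and the resulting distance rule, is:

\[
d^2(\mathbf{x}, G_i) = (\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_i), \qquad i = 1, 2,
\]
\[
\text{allocate } \mathbf{x} \text{ to } G_1 \text{ if } d^2(\mathbf{x}, G_1) \le d^2(\mathbf{x}, G_2), \text{ and to } G_2 \text{ otherwise.}
\]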

15
8.2 Discriminant by Distance
Let
16
8.2 Discriminant by Distance
17
Example Univariate Case with equal variance
18
Example Univariate Case with equal variance
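The formulas and figure for this example were not transcribed. Under the usual setup assumed here (two univariate populations with means μ1 < μ2 and equal variance σ²), the distance rule reduces to comparing x with the midpoint of the means:

\[
d^2(x, G_1) \le d^2(x, G_2)
\;\Longleftrightarrow\;
(x - \mu_1)^2 \le (x - \mu_2)^2
\;\Longleftrightarrow\;
x \le \frac{\mu_1 + \mu_2}{2},
\]
so the cutting point is \( a = (\mu_1 + \mu_2)/2 \).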
19
8.3 Fisher's Discriminant Function
  • Idea: projection, ANOVA

20
8.3 Fisher's Discriminant Function

Training samples
21
8.3 Fisher's Discriminant Function

Project the data onto a direction; the F-statistic
compares between-group with within-group variation
of the projected values, where
22
8.3 Fisher's Discriminant Function

To find the direction such that the F-statistic is
maximized: the solution is the eigenvector
associated with the largest eigenvalue of the
product of the inverse within-group matrix and the
between-group matrix (a sketch follows).
Discriminant function
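The equations on these two slides were not transcribed. A standard statement of Fisher's criterion, in notation assumed here (B and W the between-group and within-group sums-of-squares matrices of the training samples), is:

\[
\max_{\mathbf{a}} \; \frac{\mathbf{a}^{\top} B \mathbf{a}}{\mathbf{a}^{\top} W \mathbf{a}},
\]
whose maximizer \( \mathbf{a} \) is the eigenvector of \( W^{-1} B \) associated with its
largest eigenvalue; the discriminant function is then \( y(\mathbf{x}) = \mathbf{a}^{\top} \mathbf{x} \).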
23
(B) Two Populations
Note
In the case of two populations the between-group
matrix has rank one, so there is only one non-zero
eigenvalue.

24
(B) Two Populations
The associated eigenvector is
where
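The formulas here were not transcribed; with the notation above and group means x̄1, x̄2, a standard form of the two-population result is:

\[
B = \frac{n_1 n_2}{n_1 + n_2} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^{\top},
\qquad
\lambda = \frac{n_1 n_2}{n_1 + n_2} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^{\top} W^{-1} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2),
\]
where \( \lambda \) is the only non-zero eigenvalue of \( W^{-1} B \), with associated
eigenvector \( \mathbf{a} \propto W^{-1} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) \).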
25
(B) Two Populations
When the unknown population parameters are replaced
by their sample estimates, where
26
Example: Insect Classification
Note: x1 and x2 are measured characteristics of
insects (Hoel, 1947); n.g. means natural group
(species), c.g. the classified group, and y the
value of the discriminant function.
27
Example: Insect Classification
The only non-zero eigenvalue is 1.9187, and the
associated eigenvector is
28
Example: Insect Classification
The discriminant function and the associated value
y of each observation are given in the table,
together with the cutting point and the resulting
classification. Using the alternative given on the
slide, we obtain the same classification (a
computational sketch follows).
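The insect data and the computed eigenvector are not reproduced in this transcript; the following is a minimal numpy sketch, on made-up two-group data, of how Fisher's two-group discriminant direction and cutting point would be computed (all names and numbers are illustrative assumptions):

    import numpy as np

    # Hypothetical two-group training data (columns: x1, x2); the actual
    # Hoel (1947) insect measurements are not reproduced here.
    g1 = np.array([[191., 131.], [185., 134.], [200., 137.], [173., 127.], [171., 128.]])
    g2 = np.array([[186., 107.], [211., 122.], [201., 114.], [242., 131.], [184., 108.]])

    m1, m2 = g1.mean(axis=0), g2.mean(axis=0)

    # Pooled within-group scatter matrix W
    W = (g1 - m1).T @ (g1 - m1) + (g2 - m2).T @ (g2 - m2)

    # Fisher's direction: eigenvector of W^{-1}B for its only non-zero
    # eigenvalue, proportional to W^{-1}(m1 - m2)
    a = np.linalg.solve(W, m1 - m2)

    # Discriminant scores y = a'x and the midpoint cutting point
    y1, y2 = g1 @ a, g2 @ a
    cut = (y1.mean() + y2.mean()) / 2.0

    # Classify a new observation: group 1 if its score lies on group 1's side
    x = np.array([190.0, 125.0])
    on_side_1 = (x @ a - cut) * np.sign(y1.mean() - cut) > 0
    print("assigned to group", 1 if on_side_1 else 2)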
29
8.4 Bayes Discriminant Analysis
  • Idea
  • There are k populations G1, ..., Gk in Rp.
  • A partition of Rp into regions R1, ..., Rk is
    determined based on a training sample.

Rule: an observation is allocated to Gi if it falls
into Ri. A loss is incurred when an observation
from Gi falls into Rj; the probability of this
misclassification is the integral over Rj of the
density of Gi.
30
8.4 Bayes Discriminant Analysis
The expected cost of misclassification, ECM, weights
each misclassification probability by its cost and
by the prior probabilities q1, ..., qk. We want to
minimize ECM(R1, ..., Rk) with respect to the
partition R1, ..., Rk (a standard form is sketched
below).
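The ECM formula itself was not transcribed; a standard form, with c(j|i) the cost of allocating to Gj an observation that comes from Gi and fi the density of Gi, is:

\[
\mathrm{ECM}(R_1, \ldots, R_k)
  = \sum_{i=1}^{k} q_i \sum_{j \ne i} c(j\,|\,i) \int_{R_j} f_i(\mathbf{x})\, d\mathbf{x}.
\]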
31
B. Method
Theorem 6.4.1 gives the optimal regions Rt (a
standard form is sketched below).
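The statement was not transcribed; in the standard minimum-ECM result (the slide's exact notation may differ) the optimal regions are

\[
R_t = \Bigl\{ \mathbf{x} :
  \sum_{i \ne t} q_i\, c(t\,|\,i)\, f_i(\mathbf{x})
  \;\le\;
  \sum_{i \ne j} q_i\, c(j\,|\,i)\, f_i(\mathbf{x})
  \ \text{ for all } j = 1, \ldots, k \Bigr\},
\]
i.e. \( \mathbf{x} \) is allocated to the population \( G_t \) for which the expected cost
\( \sum_{i \ne t} q_i\, c(t\,|\,i)\, f_i(\mathbf{x}) \) is smallest.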
32
Corollary 1
Take the costs c(j|i) = 1 if j differs from i and 0
if j = i. Then each optimal region reduces to
Rt = { x : qt ft(x) is largest among q1 f1(x), ...,
qk fk(x) }.
Proof
33

Corollary 2
In the case of k = 2
we have
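The displayed rule was not transcribed; the standard two-population minimum-ECM rule is

\[
R_1 = \Bigl\{ \mathbf{x} : \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})}
      \;\ge\; \frac{c(1\,|\,2)}{c(2\,|\,1)} \cdot \frac{q_2}{q_1} \Bigr\},
\qquad R_2 = R_1^{c},
\]
i.e. allocate \( \mathbf{x} \) to \( G_1 \) when the density ratio exceeds the cost-and-prior ratio.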
34
(No Transcript)
35
Corollary 3
In the case of k = 2 and
36
Then
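The conditions and conclusion of this corollary were not transcribed; assuming, as is standard for this result, two normal populations Np(μ1, Σ) and Np(μ2, Σ) with common covariance matrix, the rule becomes linear:

\[
\text{allocate } \mathbf{x} \text{ to } G_1 \text{ if }
(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top} \Sigma^{-1}
\Bigl( \mathbf{x} - \tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) \Bigr)
\;\ge\; \ln\!\Bigl( \frac{c(1\,|\,2)\, q_2}{c(2\,|\,1)\, q_1} \Bigr),
\]
and to \( G_2 \) otherwise.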
37
C. Example 11.3 Detection of hemophilia A
carriers
To construct a procedure for detecting potential
hemophilia A carriers, blood samples were assayed
for two groups of women and measurements were
taken on two variables. The first group of 30
women was selected from a population of women who
did not carry the hemophilia gene; this group was
called the normal group. The second group of 22
women was selected from known hemophilia A
carriers; this group was called the obligatory
carriers.
38
C. Example 11.3 Detection of hemophilia A
carriers
Variables: log10(AHF activity), log10(AHF-like antigen)
Populations: women who did not carry the hemophilia
gene (n1 = 30); women who are known hemophilia A
carriers (n2 = 45)
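As a rough illustration of how such a rule can be fitted in practice (this is not the SAS analysis shown on the later slides; the data below are placeholders, not the study's measurements), one could use scikit-learn's linear discriminant analysis:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)

    # Placeholder data: two groups of women, columns =
    # (log10 AHF activity, log10 AHF-like antigen)
    normal  = rng.normal(loc=[-0.10, -0.05], scale=0.15, size=(30, 2))
    carrier = rng.normal(loc=[-0.35,  0.05], scale=0.15, size=(45, 2))

    X = np.vstack([normal, carrier])
    y = np.array([0] * len(normal) + [1] * len(carrier))  # 0 = normal, 1 = carrier

    # Fit a linear discriminant rule (priors estimated from group sizes)
    lda = LinearDiscriminantAnalysis().fit(X, y)

    # Classify a new woman from her two measurements
    new_obs = np.array([[-0.21, -0.04]])
    print("predicted group:", lda.predict(new_obs)[0])
    print("posterior probabilities:", lda.predict_proba(new_obs)[0])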
39
C. Example 11.3 Detection of hemophilia A
carriers
40
C. Example 11.3 Detection of hemophilia A
carriers
Data set
-0.0056 -0.1698 -0.3469 -0.0894 -0.1679
-0.0836 -0.1979 -0.0762 -0.1913 -0.1092
-0.5268 -0.0842 -0.0225 0.0084 -0.1827
0.1237 -0.4702 -0.1519 0.0006 -0.2015
-0.1932 0.1507 -0.1259 -0.1551 -0.1952
0.0291 -0.228 -0.0997 -0.1972
-0.0867 -0.1657 -0.1585 -0.1879 0.0064
0.0713 0.0106 -0.0005 0.0392 -0.2123 -0.119
-0.4773 0.0248 -0.058 0.0782 -0.1138 0.214
-0.3099 -0.0686 -0.1153 -0.0498 -0.2293
0.0933 -0.0669 -0.1232 -0.1007 0.0442
-0.171 -0.0733 -0.0607 -0.056
-0.3478 -0.3618 -0.4986 -0.5015 -0.1326
-0.6911 -0.3608 -0.4535 -0.3479 -0.3539
-0.4719 -0.361 -0.3226 -0.4319 -0.2734
-0.5573 -0.3755 -0.495 -0.5107 -0.1652
-0.2447 -0.4232 -0.2375 -0.2205 -0.2154
-0.3447 -0.254 -0.3778 -0.4046 -0.0639
-0.3351 -0.0149 -0.0312 -0.174 -0.1416
-0.1508 -0.0964 -0.2642 -0.0234 -0.3352
-0.1878 -0.1744 -0.4055 -0.2444
-0.4784   0.1151 -0.2008 -0.086 -0.2984
0.0097 -0.339 0.1237 -0.1682 -0.1721 0.0722
-0.1079 -0.0399 0.167 -0.0687 -0.002 0.0548
-0.1865 -0.0153 -0.2483 0.2132 -0.0407
-0.0998 0.2876 0.0046 -0.0219 0.0097 -0.0573
-0.2682 -0.1162 0.1569 -0.1368 0.1539 0.14
-0.0776 0.1642 0.1137 0.0531 0.0867 0.0804
0.0875 0.251 0.1892 -0.2418 0.1614 0.0282
41
C. Example 11.3 Detection of hemophilia A
carriers
SAS output
42
C. Example 11.3 Detection of hemophilia A
carriers
43
C. Example 11.3 Detection of hemophilia A
carriers
44
C. Example 11.3 Detection of hemophilia A
carriers