Clustering, Classification and Validation via the L1 Data Depth
1
Clustering, Classification and Validation via the
L1 Data Depth
  • Rebecka Jornsten
  • Department of Statistics
  • Rutgers University
  • http://www.stat.rutgers.edu/rebecka
  • June 19, 2003

2
  • Outline
  • Cluster validation via the Relative Data Depth
    (ReD)
  • DDclust - Improved Clustering Accuracy via ReD
  • DDclass - Classification based on the L1 Data
    Depth
  • Conclusion

3
  • Clustering
  • - Group similar observations
  • - Find cluster representatives
  • Multitude of different clustering algorithms
  • Hierarchical
  • K-means, PAM
  • Model-based (e.g. mixture of MVN)
  • ...

4
  • Clustering
  • In the first part of this talk we focus on
  • K-median clustering
  • Popular in applications
  • Robust, tight clusters
  • Fast approximation algorithms exist: PAM
    (Partitioning Around Medoids)
  • K-median algorithm developed with Vardi and
    Zhang. Cluster representatives: multivariate
    medians

5
  • Cluster Validation
  • Select an appropriate number of clusters for the
    data set
  • Identify outliers

- Often based on: within-cluster sum of squares,
  distance to cluster representative, cluster
  membership

6
The Silhouette width
  • a_i is the average distance from observation i to
    all members of the cluster i has been allocated
    to
  • b_i is the average distance from observation i to
    all members of the nearest competing cluster
  • sil_i = (b_i - a_i) / max(a_i, b_i)
  • Choose the number of clusters K with the maximum
    average sil_i
  • sil_i can also identify outliers: look out for
    negative sil_i values (a sketch of the
    computation follows below)
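A minimal computational sketch of the silhouette width above (hypothetical
names: dist is a precomputed n x n pairwise distance matrix, labels holds the
cluster assignments; at least two clusters are assumed):

import numpy as np

def silhouette_widths(dist, labels):
    """sil_i = (b_i - a_i) / max(a_i, b_i) for each observation i."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    sil = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels[i]
        same = (labels == own)
        same[i] = False                     # exclude i from its own cluster
        a = dist[i, same].mean() if same.any() else 0.0
        # b_i: smallest average distance to any competing cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != own)
        sil[i] = (b - a) / max(a, b)
    return sil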
7
Silhouette width plot (one observation flagged: outlier?)
8
Problems with the silhouette width
  • If the cluster scales (within-cluster variances)
    differ, the sil values can be misleading
  • A loose cluster will look bad if the nearest
    competing cluster is tight
  • Observations in a loose cluster may be mislabeled
    as outliers, and outliers in a tight cluster be
    missed
  • Can lead to under-fitting, selecting too few
    clusters for the data set

9
L1 Data Depth
  • e(z,i) is the unit vector from observation i to z
  • e(z) is the avg. of the unit vectors from all
    observations to z
  • ||e(z)|| is close to 1 if z is close to the edge
    of the data, close to 0 if z is close to the
    center
  • D(z) = 1 - ||e(z)|| is the L1 data depth of z
    (a computational sketch follows below)
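A minimal sketch of D(z) as defined above (hypothetical names; observations
coinciding with z are simply skipped):

import numpy as np

def l1_data_depth(z, X, eps=1e-12):
    """Depth of point z with respect to the observations in the rows of X."""
    diffs = np.asarray(z) - np.asarray(X)        # vectors from each observation to z
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > eps                           # skip observations coinciding with z
    unit = diffs[keep] / norms[keep][:, None]    # unit vectors e(z, i)
    e_bar = unit.mean(axis=0)                    # average unit vector e(z)
    return 1.0 - np.linalg.norm(e_bar)           # near 1 at the center, near 0 at the edge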

10
The Relative Data Depth
  • D_i^w is the data depth of observation i with
    respect to the cluster i has been allocated to
  • D_i^b is the data depth of observation i with
    respect to the nearest competing cluster
  • ReD_i = D_i^w - D_i^b
  • Choose the number of clusters K with the maximum
    average ReD_i
  • ReD_i can also identify outliers: look out for
    negative ReD_i values (see the sketch below)
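A sketch of ReD_i, reusing the hypothetical l1_data_depth() from the previous
block; the nearest competing cluster is taken here to be the competing cluster
in which observation i lies deepest (an assumption of this sketch):

import numpy as np

def relative_data_depth(X, labels):
    """ReD_i = D_i^w - D_i^b for each observation."""
    X, labels = np.asarray(X), np.asarray(labels)
    n = len(labels)
    red = np.zeros(n)
    for i in range(n):
        own = labels[i]
        within = X[(labels == own) & (np.arange(n) != i)]    # own cluster, minus i
        d_w = l1_data_depth(X[i], within)                    # D_i^w
        # D_i^b: depth with respect to the best competing cluster
        d_b = max(l1_data_depth(X[i], X[labels == c])
                  for c in np.unique(labels) if c != own)
        red[i] = d_w - d_b
    return red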
11
Relative Data Depth plot (several observations flagged: outliers? outlier?)
12
Data Depth plot
Another example: gene expression data (drug
experiment). Panels: within-cluster depths;
between-cluster depths, colored by cluster.
13
Silhouette widths and Relative Data Depths for a
loose and a tight cluster.
Error rates: tight cluster 17%, loose cluster 14%
14
  • ReD vs gap and sil
  • ReD is less sensitive than sil and gap to the
    inclusion of unrelated features, and more robust
    with respect to the noise level of the data
  • Performs as well or better than sil and gap in
    many simulated scenarios
  • For more details see the paper with Vardi and
    Zhang, and slides at
    http://www.stat.rutgers.edu/rebecka
15
Improved clustering accuracy via ReD
Clustering and cluster validation: a two-step
process. But if ReD (and sil) can identify
outliers, why not include the validation criterion
in the clustering objective function directly?
This suggests a Data Depth Vector Quantizer
(DDVQ): find the partition that minimizes the
K-median objective function - λ · ReD
16
Improved clustering accuracy via ReD
Constrained VQ: a tool often used in engineering.
Example: entropy-constrained VQ. Approach: search
for the λ that satisfies the constraint.
What makes the present problem different? We don't
have a natural constraint, and so don't know what
an appropriate value for λ might be. Furthermore,
the optimal value for λ will vary from data set
to data set.
17
DDclust
Our approach: use the following clustering
criterion C, where sil plays the role of the
K-median cost (scale dependent) and ReD the role
of the depth penalty (scale independent). Now λ is
the trade-off between scale-dependent and
scale-independent costs, and the optimal λ is
unaffected by shifts and scale changes. We use
simulated annealing to find the partition I(K)
that maximizes C, for a given number of clusters
K.
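The exact expression for C is on the slide graphic and not in the transcript;
the sketch below assumes a simple lambda-weighted combination of the average
silhouette width and the average ReD, consistent with the trade-off described
above, and reuses the hypothetical helpers from the earlier blocks:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def ddclust_criterion(X, labels, lam=0.5):
    """Assumed form: C = (1 - lam) * mean(sil_i) + lam * mean(ReD_i)."""
    dist = squareform(pdist(np.asarray(X)))    # pairwise distances for sil_i
    sil = silhouette_widths(dist, labels)
    red = relative_data_depth(X, labels)
    return (1.0 - lam) * sil.mean() + lam * red.mean()

A simulated annealing search over partitions would then repeatedly propose
relabeling a single observation and accept the move if C increases, or with a
temperature-dependent probability if it decreases.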
18
  • Results
  • Gene expression data
  • Simulated data
  • Leukemia: PAM and DDclust agree (2 errors)
  • Colon: PAM 18/62 errors, DDclust 6 errors
  • Prostate: PAM 3/25 errors, DDclust 1 error
  • MVN data
  • Equal and unequal cluster scale scenarios
  • λ equal to 0, 0.25, 0.5, 0.9, 1

19
(No Transcript)
20
(No Transcript)
21
Unequal scale model
Test error decreases with increasing λ (compared
with PAM).
22
Unequal scale model
23
Unequal scale model
Clustering with sil can even increase the error
rate when scales are unequal
24
Equal scale model
Still see some improvements, but now for more
moderate λ
25
Equal scale model
26
Equal scale models
27
  • DDclass - Classification via the L1 Data Depth
  • We expect that the L1 Data depth of an
    observation is maximized with respect to the
    cluster corresponding to the correct class label.
  • This suggests a very simple classification rule
    (a sketch follows below):
  • 1. Classify unlabeled observation x by the
    class in the training set with respect to which
    x is deepest.
  • 2. Validate the classification by the Relative
    Data Depth.
  • 3. ReD(training x) < 0: a training error
    ReD(test x) small: low classification
    confidence
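A minimal sketch of this rule (hypothetical names), reusing the earlier
l1_data_depth(); it returns the chosen class together with the depth margin
over the runner-up class, a small margin signalling low confidence:

import numpy as np

def ddclass_predict(x, X_train, y_train):
    """Assign x to the training class with respect to which x is deepest."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    depths = {c: l1_data_depth(x, X_train[y_train == c])
              for c in np.unique(y_train)}
    label = max(depths, key=depths.get)                       # deepest class wins
    runner_up = max(v for c, v in depths.items() if c != label)
    return label, depths[label] - runner_up                   # small margin: low confidence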

28
Leukemia data: 72 observations, cross-validation
TE and ReD (annotations: high test error, low ReD
value)
29
SILclass - Classification via avg. distance and
sil
DDclass using the average distance in place of
the depth. Classification rule:
1. Classify unlabeled observation x by the class
in the training set with respect to which the
average distance to x is minimized.
2. Validate the classification by sil.
3. sil(training x) < 0: a training error
sil(test x) small: low classification confidence
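A matching sketch for SILclass (hypothetical names), with average distance in
place of depth and a sil-style confidence score:

import numpy as np

def silclass_predict(x, X_train, y_train):
    """Assign x to the class with the smallest average distance to x."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    avg = {c: np.linalg.norm(X_train[y_train == c] - x, axis=1).mean()
           for c in np.unique(y_train)}
    label = min(avg, key=avg.get)
    a = avg[label]                                      # distance to the chosen class
    b = min(v for c, v in avg.items() if c != label)    # nearest competing class
    return label, (b - a) / max(a, b)                   # sil-style confidence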
30
Leukemia data 72 observations, cross-validation
TE and sil
high test error
Low sil value
31
DDclass-CV and DDclass-DD
Two tuning methods for removing noisy or
mislabeled observations from the training set.
The aim is two-fold:
- improve test error rate performance
- reduce the size of the training set (comp. time)
DDclass-CV: remove any training observations that
are misclassified via leave-one-out cross
validation.
DDclass-DD: remove any training observations with
ReD value below a threshold (chosen to minimize
CV error).
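A sketch of the DDclass-CV pruning step (hypothetical names), reusing the
earlier ddclass_predict(); DDclass-DD would instead drop observations whose
ReD value falls below the chosen threshold:

import numpy as np

def ddclass_cv_prune(X_train, y_train):
    """Drop training observations misclassified under leave-one-out CV."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    n = len(y_train)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        mask = np.arange(n) != i                  # leave observation i out
        pred, _ = ddclass_predict(X_train[i], X_train[mask], y_train[mask])
        keep[i] = (pred == y_train[i])            # remove if misclassified
    return X_train[keep], y_train[keep]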
32
Leukemia data
Observations that were frequently removed from
the training set across 500 cross-validation sets
(solid black: DDclass-CV, dashed red: DDclass-DD)
33
Leukemia: Results on 500 10-fold CV data sets
Fivenum summary of error rates over the CV sets;
% w. best rate (rank in parentheses)

Method                     Fivenum summary      % w. best rate
DDclass                    (0,0,0,12.5,25)      92.6 (1)
DDclass-CV
DDclass-DD                                      91.4 (2)
SILclass                   (0,0,0,12.5,37.5)    86.8
SILclass-CV/SILclass-DD
NN                         (0,0,0,12.5,25)      88.8 (3)
DLDA                       (0,0,0,12.5,25)      87.6
Centroid                   (0,0,0,12.5,37.5)    86.6
Median                                          87.2
34
Colon: Results on 500 10-fold CV data sets
Fivenum summary of error rates over the CV sets;
% w. best rate (rank in parentheses)

Method                     Fivenum summary        % w. best rate
DDclass                    (0,0,16.7,16.7,66.7)   85.8 (3)
DDclass-CV
DDclass-DD
SILclass                   (0,0,16.7,16.7,66.7)   81.2
SILclass-CV/SILclass-DD                           79.2
NN                         (0,0,16.7,16.7,66.7)   76.2
DLDA                       (0,0,16.7,16.7,50)     92.8 (1)
Centroid                                          92.6 (2)
Median
35
Simulated data - Unequal scale
(panels: kNN, DDclass, DA, Prototypes, SILclass)
36
Equal scale
(panels: kNN, DDclass, DA, SILclass, Prototypes)
37
  • Conclusions
  • ReD is a robust cluster validation tool
  • DDclust can improve clustering accuracy over PAM,
    significantly so when cluster scales differ.
    ReD plots identify outliers
  • DDclass is competitive with the best reported
    methods on gene expression data, and comparable
    with the Bayes rules on simulated data. ReD plots
    identify observations we classify with low
    confidence.
  • Paper, preprints and R code available at
    http://www.stat.rutgers.edu/rebecka/papers
  • Current work: extensions to missing data scenarios

38
Acknowledgements: Yehuda Vardi and Cun-Hui Zhang
(Dept. of Statistics, Rutgers) Ron Hart,
Jonathan Zan (Dept. of Neuroscience and the
William Keck center, Rutgers)