Title: New Tools for Evaluating the Results of Cluster Analyses
1. Fourth German Stata Users Group Meeting
Hilde Schaeper, Higher Education Information System (HIS), Hannover/Germany
schaeper_at_his.de
Mannheim, March 31st, 2006
2. Main features of cluster analysis
Basic idea: to form groups of similar objects (observations or variables) such that the classification objects are homogeneous within groups/clusters and heterogeneous between clusters
Type of analysis: heuristic tool of discovery lacking an underlying coherent body of statistical theory
Range of methods: cluster analysis is a family of more or less closely related techniques
3. Steps and decisions in cluster analysis
I Selection of a sample (outliers may influence the results)
II Selection and transformation of variables (irrelevant and correlated variables can bias the classification; cluster analysis requires the variables to have equal scales)
III Choice of the basic approach (in particular agglomerative hierarchical vs. partitioning cluster analysis)
IV Choice of a particular clustering technique
V Selection of a dissimilarity or similarity measure (depends partly on the measurement level of the variables and the clustering technique chosen)
VI Choice of the initial partition in case of partition methods
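Step II notes that the variables need equal scales. A minimal sketch of one common way to achieve this, z-standardization, is given below in Python. The data and the function name `standardize` are hypothetical, and the slides do not say which transformation was actually used.

```python
from statistics import mean, pstdev

def standardize(columns):
    """Return z-scores for each variable (one list of values per variable)."""
    result = []
    for values in columns:
        m = mean(values)
        s = pstdev(values)  # population SD; the sample SD would serve equally well
        if s == 0:
            raise ValueError("a constant variable cannot be standardized")
        result.append([(v - m) / s for v in values])
    return result

# Hypothetical data: income (euros) and age (years) differ hugely in scale,
# so income would dominate any Euclidean distance if left untransformed.
income = [1800.0, 2400.0, 3100.0, 5200.0]
age = [23.0, 35.0, 41.0, 58.0]
z_income, z_age = standardize([income, age])
```

After the transformation both variables have mean 0 and standard deviation 1, so neither dominates the distance measure merely because of its unit.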
4. Criteria for a good classification
- Interpretability
  - Clusters should be substantively interpretable.
- Internal validity (internal homogeneity and external heterogeneity)
  - Objects that belong to the same cluster should be similar.
  - Objects of different clusters should be different. The clusters should be well isolated from each other.
  - The classification should fit the data and should be able to explain the variation in the data.
- Reasonable number and size of clusters (additional)
  - The number of clusters should be as small as possible. The size of the clusters should not be too small.
- Stability
  - Small modifications in data and methods should not change the results.
5. Criteria for a good classification (cont.)
- External validity
  - Clusters should correlate with external variables that are known to be correlated with the classification and that are not used for clustering.
- Relative validity
  - The classification should be better than the null model, which assumes that no clusters are present.
  - The classification should be better than other classifications.
6. Tools for decision making and evaluation
- Tools for determining the number of clusters
- Tools for testing the stability of a classification
- (Tools for assessing the internal validity of a classification)
7. Determining the number of clusters: hierarchical methods
- (Visual) inspection of the fusion/agglomeration levels
  - dendrogram (official Stata program)
  - scree diagram (easy to produce)
  - agglomeration schedule (new program)
8. Determining the number of clusters: agglomeration schedule
Syntax
    cluster stop clname, rule(schedule) laststeps()
Description: cluster stop, rule(schedule) displays the agglomeration schedule for hierarchical agglomerative cluster analysis and computes the differences between the stages of the clustering process.
Additional options: laststeps() specifies the number of steps to be displayed.
9. Determining the number of clusters: agglomeration schedule
Example: Cluster analysis of 799 observations, using Ward's linkage and squared Euclidean distances
    cluster stop ward, rule(schedule) last(15)

          Number of      Fusion
  Stage    clusters       value     Increase
  ------------------------------------------
    798           1   1529,7205     834,5939
    797           2    695,1265      15,2987
    796           3    679,8278     414,1430
    795           4    265,6848      60,3970
    794           5    205,2878      32,0320
    793           6    173,2559      12,1593
    792           7    161,0966      22,5605
    791           8    138,5361      29,6152
    790           9    108,9209       3,4233
    789          10    105,4976      14,2701
    788          11     91,2275       6,7869
    787          12     84,4405       2,2950
    786          13     82,1455       1,5409
    785          14     80,6046      14,8871
    784          15     65,7175       3,2681
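The "Increase" column of the schedule is the difference between the fusion value of a stage and that of the preceding stage. The following Python sketch recomputes it for the last five stages from the printed fusion values (decimal commas converted to points). It only illustrates the arithmetic and is not the code of the new Stata program; the function name `increases` is hypothetical.

```python
# Fusion values of stages 798 down to 793, copied from the schedule.
fusion = [1529.7205, 695.1265, 679.8278, 265.6848, 205.2878, 173.2559]

def increases(values):
    """Difference between each fusion value and that of the preceding stage."""
    return [round(later - earlier, 4) for later, earlier in zip(values, values[1:])]

inc = increases(fusion)

# Rank the stages by the size of the jump; large jumps mark merges of
# comparatively dissimilar clusters and are candidates for stopping.
stages = [798, 797, 796, 795, 794]
ranked = sorted(zip(inc, stages), reverse=True)
```

In this excerpt the two largest jumps occur at stages 798 and 796, which matches the large values in the printed "Increase" column.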
10. Determining the number of clusters: dendrogram
11. Determining the number of clusters: hierarchical methods
- (Visual) inspection of the fusion/agglomeration levels
  - dendrogram (official Stata program)
  - scree diagram (easy to produce)
  - agglomeration schedule (new program)
- Statistical measures/tests for the number of clusters
  - Duda's and Hart's stopping rule/Calinski's and Harabasz's stopping rule (official Stata program)
  - Mojena's stopping rules (new program)
12. Determining the number of clusters: Stata's stopping rules
13. Determining the number of clusters: Mojena's stopping rules
Model I
- assumes that the agglomeration levels are normally distributed with a particular mean and standard deviation
- tests at level k whether level k+1 comes from the aforementioned distribution
- suggests the choice of the k-cluster solution when the null hypothesis has to be rejected for the first time (i.e. when a sharp increase/decrease of the fusion levels occurs)
Model I modified
- assumes that the agglomeration levels up to level k are normally distributed
Model II
- assumes that the agglomeration levels up to step k can be described by a linear regression line
- tests at level k whether the fusion value of level k+1 equals the predicted value
- suggests setting the number of clusters equal to k when the null hypothesis has to be rejected for the first time
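The idea behind Model I and its modified variant can be sketched in Python as a standardized deviate: the next fusion level is compared with the mean and standard deviation of a reference set of levels. This is a minimal sketch under assumptions. The function name `mojena_t`, the toy fusion levels, and the choice of the sample standard deviation are hypothetical, the p-values reported by the Stata program are not reproduced, and Model II (the regression variant) is omitted.

```python
from statistics import mean, stdev

def mojena_t(levels, modified=False):
    """Standardized deviate of the next fusion level from a reference set.

    levels:   fusion values in merge order (first merge first).
    modified: False -> reference set is all fusion levels (Model I);
              True  -> reference set is only the levels up to the current step.
    Returns (step, t) pairs, where t refers to the level following that step.
    """
    out = []
    for j in range(2, len(levels)):      # at least two earlier levels for an SD
        ref = levels[:j] if modified else levels
        t = (levels[j] - mean(ref)) / stdev(ref)
        out.append((j, t))
    return out

# Hypothetical fusion levels: five small merges followed by one sharp jump.
levels = [1.0, 1.2, 1.1, 1.3, 1.2, 5.0]
plain = mojena_t(levels)
modified = mojena_t(levels, modified=True)
```

With these toy values the jump at the last merge gives a deviate of about 2.04 under Model I and about 33.7 under the modified variant, because the modified reference set excludes the large level itself.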
14. Determining the number of clusters: Mojena's stopping rules
Syntax
    cluster stop clname, rule(mojena) laststeps() m1only
Description: cluster stop, rule(mojena) calculates Mojena's test statistics (Mojena I, Mojena I modified, and Mojena II) for determining the number of clusters of hierarchical agglomerative clustering methods and the corresponding significance levels.
Additional options: laststeps() specifies the number of steps to be displayed. m1only is used to suppress the calculation of Mojena I modified and Mojena II.
15. Determining the number of clusters: Mojena's stopping rules
    cluster stop ward, rule(mojena) last(15)

          No. of        Mojena I         Mojena I mod.         Mojena II
  Stage  clusters       t        p         t        p         t        p
  -------------------------------------------------------------------------
    798         1       .        .         .        .         .        .
    797         2  22,9003   0,0000   39,2306   0,0000   38,8261   0,0000
    796         3  10,3453   0,0000   22,8581   0,0000   22,4229   0,0000
    795         4  10,1152   0,0000   36,7300   0,0000   36,1526   0,0000
    794         5   3,8851   0,0001   16,4988   0,0000   15,8908   0,0000
    793         6   2,9765   0,0015   14,2385   0,0000   13,6099   0,0000
    792         7   2,4946   0,0064   13,2516   0,0000   12,6058   0,0000
    791         8   2,3117   0,0105   13,6952   0,0000   13,0275   0,0000
    790         9   1,9723   0,0245   12,9355   0,0000   12,2483   0,0000
    789        10   1,5268   0,0636   10,8525   0,0000   10,1556   0,0000
    788        11   1,4753   0,0703   11,3345   0,0000   10,6247   0,0000
    787        12   1,2607   0,1039   10,4254   0,0000    9,7065   0,0000
    786        13   1,1586   0,1235   10,2615   0,0000    9,5338   0,0000
    785        14   1,1240   0,1307   10,6825   0,0000    9,9431   0,0000
    784        15   1,1009   0,1356   11,3061   0,0000   10,5505   0,0000
16. Determining the number of clusters: partitioning methods
Measures using the error sum of squares (new program)
- Explained variance (Eta²): specifies to which extent a particular solution improves on the solution with one cluster
- Proportional reduction of errors (PRE): compares a k-cluster solution with the previous (k-1) solution
- F-max statistic: corrects for the fact that more clusters automatically result in a higher explained variance
- Beale's F statistic: tests the null hypothesis that a solution with k clusters is not improved by a solution with more clusters (conservative test; provides convincing results only if the clusters are well separated)
Calinski's and Harabasz's stopping rule (official Stata program)
17. Determining the number of clusters: Stata's stopping rule
Example: Cluster analysis of 799 observations, using the kmeans partition method and squared Euclidean distances
18. Determining the number of clusters: Eta², PRE, F-max, Beale's F
Syntax
    clnumber varlist, maxclus() kmeans_options
Description: clnumber performs kmeans cluster analyses with the variables specified in varlist and computes Eta², the PRE coefficient, the F-max statistic, Beale's F values and the corresponding p-values.
Options: maxclus() is required and specifies the maximum number of clusters for which cluster analyses are performed. maxclus(4), for example, requests cluster analyses for two, three, and four clusters. kmeans_options specify options allowed with kmeans cluster analysis except for k() and start(group(varname)).
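The general procedure described here, repeating a kmeans analysis for an increasing number of clusters and relating each error sum of squares to the one-cluster solution, can be sketched in Python as follows. This is an illustration under assumptions, not the code of clnumber. The helper names, the basic Lloyd-type algorithm, the random start from drawn observations, and the toy data are all hypothetical.

```python
import random

def dist2(x, y):
    """Squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans_sse(data, k, seed=0, max_iter=100):
    """Basic k-means from k randomly drawn observations; returns the error sum of squares."""
    rng = random.Random(seed)
    centres = [list(row) for row in rng.sample(data, k)]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: dist2(row, centres[c])) for row in data]
        if new_assign == assign:      # converged: allocations no longer change
            break
        assign = new_assign
        for c in range(k):
            members = [row for row, a in zip(data, assign) if a == c]
            if members:               # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return sum(dist2(row, centres[a]) for row, a in zip(data, assign))

def eta2_by_k(data, max_clusters, seed=0):
    """Eta² for 1..max_clusters clusters; the one-cluster solution is the baseline."""
    sse = {k: kmeans_sse(data, k, seed) for k in range(1, max_clusters + 1)}
    return {k: 1 - sse[k] / sse[1] for k in sse}

# Hypothetical data: two well-separated groups in two dimensions.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
eta = eta2_by_k(data, max_clusters=3)
```

Because a kmeans solution depends on the initial partition, different seeds can give different Eta² values for the same number of clusters.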
19. Determining the number of clusters: Eta², PRE, F-max, Beale's F
    clnumber v1-v7, max(8) start(prandom(154698))
First part of the output
Eta square, PRE coefficient, F-max value

A[8,3]
            Eta2         Pre       F-max
cl_1           0           .           .
cl_2   ,27878797   ,27878797   308,08417
cl_3   ,44732155   ,23368104   322,12939
cl_4   ,50863156   ,11093251   274,31017
cl_5   ,58504803   ,15551767   279,86862
cl_6   ,61414929   ,07013162    252,4398
cl_7   ,63407795   ,05164863   228,73256
cl_8   ,65036945   ,04452178    210,1983
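The PRE and F-max columns can be recovered from the Eta2 column alone. The Python sketch below uses the usual definitions, PRE_k = 1 - (1 - Eta²_k)/(1 - Eta²_(k-1)) and F-max_k = (Eta²_k/(k-1)) / ((1 - Eta²_k)/(n-k)), which reproduce the printed values. That clnumber computes them in exactly this way is an assumption, and the function names are hypothetical.

```python
# Eta² for 1 to 8 clusters, copied from the output (decimal commas -> points).
eta2 = [0.0, 0.27878797, 0.44732155, 0.50863156,
        0.58504803, 0.61414929, 0.63407795, 0.65036945]
n = 799  # number of observations in the example

def pre(eta2, k):
    """Proportional reduction of error of the k- versus the (k-1)-cluster solution."""
    return 1 - (1 - eta2[k - 1]) / (1 - eta2[k - 2])

def f_max(eta2, k, n):
    """Explained variance per between-cluster df over unexplained variance per within df."""
    e = eta2[k - 1]
    return (e / (k - 1)) / ((1 - e) / (n - k))

pre_3 = pre(eta2, 3)        # compare with ,23368104 in row cl_3
fmax_3 = f_max(eta2, 3, n)  # compare with 322,12939 in row cl_3
```

In this output F-max peaks at three clusters, while PRE shows a second, smaller rise at five clusters.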
20. Determining the number of clusters: Eta², PRE, F-max, Beale's F
Second part of the output
Upper triangle: Beale's F statistic; lower triangle: probability

B[8,8]
           c1         c2         c3         c4         c5         c6         c7         c8
r1          0  1,7527399  2,1746911  2,1056324  2,3824281  2,3440412  2,2895238  2,2479901
r2  ,09228984          0   2,454542  2,1062746  2,4264587  2,3137652  2,2097109  2,1372527
r3   ,0067297  ,01637451          0  1,4336405  2,0737441  1,9334281  1,8205698  1,7502537
r4  ,00225687  ,00910208  ,18682639          0  2,7415018  2,1763191  1,9278474  1,8003233
r5  ,00005713  ,00028111  ,01048918  ,00768272          0  1,3762712  1,2922468   1,261865
r6  ,00001341  ,00010317  ,00641878  ,00668153  ,21053567          0  1,1750857  1,1717306
r7  4,554e-06  ,00005409  ,00517453  ,00663361  ,20309237   ,3133238          0  1,1590454
r8  1,600e-06  ,00002861  ,00408524  ,00598863  ,18868905  ,28969417  ,32290457          0
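The upper-triangle entries can be reproduced with the standard form of Beale's F, sketched below in Python. Two assumptions are made: that the clustering used seven variables (v1 to v7), and that 1 - Eta² from the first part of the output may stand in for the error sum of squares, since only the ratio of the two sums of squares enters the formula. The function name `beale_f` is hypothetical, and the p-values in the lower triangle are not computed here.

```python
def beale_f(sse_small, sse_large, k_small, k_large, n, p):
    """Beale's F for a solution with k_small clusters versus one with k_large clusters.

    sse_small, sse_large: error sums of squares of the two solutions (same scale),
    n: number of observations, p: number of clustering variables.
    """
    gain = (sse_small - sse_large) / sse_large
    penalty = ((n - k_small) / (n - k_large)) * (k_large / k_small) ** (2 / p) - 1
    return gain / penalty

# Eta² for 1 to 4 clusters, copied from the first part of the output.
eta2 = {1: 0.0, 2: 0.27878797, 3: 0.44732155, 4: 0.50863156}
n, p = 799, 7  # p = 7 is an assumption based on the variable list v1 to v7

f_1_vs_2 = beale_f(1 - eta2[1], 1 - eta2[2], 1, 2, n, p)  # entry r1, c2
f_3_vs_4 = beale_f(1 - eta2[3], 1 - eta2[4], 3, 4, n, p)  # entry r3, c4
```

Under these assumptions the two values agree with the printed entries 1,7527399 and 1,4336405 to at least three decimal places.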
22. Testing the stability of a classification
Stability
- is a precondition of validity
- refers to the property of a cluster solution that it is not affected by small modifications of data and methods
- can be measured by comparing two classifications and computing the proportion of consistent allocations
23. Testing the stability of a classification: the Rand index
Original Rand index (Rand 1971)
- ranges between 0 and 1, with 1 = perfect agreement
- values greater than 0.7 are considered sufficient
Adjusted Rand index (Hubert & Arabie 1985)
- accounts for chance agreement
- offers a solution to the problem that the expected value of the Rand index does not take a constant value
- maximum value of 1; expected value of zero if the classifications are selected randomly
- usually yields much smaller values than the Rand index
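Both indices can be computed from pair counts: the number of object pairs placed in the same cluster by both classifications, by the first, and by the second. The Python sketch below follows the standard formulas of Rand (1971) and Hubert & Arabie (1985). The function name `rand_indices` and the example labels are hypothetical, and the sketch is not the code of the Stata program.

```python
from collections import Counter
from math import comb

def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both classifications must cover the same objects")
    pairs = comb(len(labels_a), 2)
    # Number of object pairs that share a cluster in both partitions, in A, in B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    rand = (pairs + 2 * both - in_a - in_b) / pairs
    expected = in_a * in_b / pairs        # agreement expected by chance
    maximum = (in_a + in_b) / 2
    # Undefined (division by zero) if both partitions consist of a single cluster.
    adjusted = (both - expected) / (maximum - expected)
    return rand, adjusted

# Hypothetical allocations of six objects by two cluster analyses.
first = [1, 1, 1, 2, 2, 2]
second = [1, 1, 2, 2, 2, 2]
rand, adjusted = rand_indices(first, second)
```

Cluster labels only need to be consistent within each classification, so relabelled but otherwise identical partitions give a value of 1 on both indices. In the example a single reallocated object lowers the Rand index to about 0.67 and the adjusted index to about 0.32, illustrating that the adjusted index is usually much smaller.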
24. Testing the stability of a classification: the Rand index
Syntax
    clrand groupvar1 groupvar2
Description: clrand compares two classifications with respect to the (in)consistency of assignments of the classification objects to clusters and computes the Rand index and the adjusted Rand index proposed by Hubert & Arabie. The command requires the specification of two grouping variables obtained from previous cluster analyses.
25. Testing the stability of a classification: the Rand index
Comparisons of the 3-cluster solutions using
different start options (adj. Rand)
Comparisons of the 5-cluster solutions using
different start options (adj. Rand)
26. Outlook
- speeding up the program for calculating Mojena's stopping rules
- improvement of clnumber
- improvement of clrand
- new program for checking whether a local minimum is found with kmeans or kmedians cluster analysis
- new programs for calculating additional statistics (e.g. homogeneity measures, measures for the fit of a dendrogram)
27. Basic idea: examples
Finding groups of observations
28. Consequences of decision making: example
Comparison of two kmeans cluster analyses using
different initial group centres
29. Determining the number of clusters: inverse scree test