Title: Cluster Analysis
1 Cluster Analysis
- Purpose and process of clustering
- Profile analysis
- Selection of variables and sample
- Determining the number of clusters
- Profile Similarity Measures
- Cluster combination procedures
- Hierarchical vs. Non-hierarchical Clustering
- Statistical follow-up analyses
- Internal ANOVA & ldf Analyses
- External ANOVA & ldf Analyses
2 Intro to Clustering
- Clustering is like reverse linear discriminant analysis -- you are looking for groups (but can't define them a priori)
- The usual starting point is a belief that a particular population is not homogeneous -- that there are two or more kinds of folks in the group
- Reasons for this belief -- usually your own failure
  - predictive models don't work or seem way too complicated (need lots of unrelated predictors)
  - treatment programs only work for some folks
  - best predictors or best treatments vary across folks
  - gut feeling
3 Process of a Cluster Analysis
- Identification of the target population
- Identification of a likely set of variables to carve the population into groups
- Broad sampling procedures -- represent all groups
- Construct a profile for each participant
- Compute similarities among all participant profiles
- Determine the number of clusters
- Identify who is in what cluster
- Describe/Interpret the clusters
- Plan the next study -- replication & extension
4 Profiles and Profile Comparisons
- A profile is a person's pattern of scores for a set of variables
- Profiles differ in two basic ways
  - level -- average value (height)
  - shape -- peaks & dips
- [Figure: profiles of 4 folks across 5 variables, A B C D E]
- Who's most similar to whom?
- How should we measure similarity -- level, shape or both??
Cluster analysts are usually interested in both shape and level differences among groups.
5 Commonly found MMPI Clusters -- simplified
[Figure: four characteristic MMPI profiles -- normal, elevated, misrepresenting, unhappy -- plotted across the Validity Scales (L, F, K) and Clinical Scales (Hs, D, Hy, Pd, Mf, Pa, Pt, Sz, Ma, Si)]
6 How Hierarchical Clustering works
- Data in an X matrix (cases x variables)
- Compute the profile similarity of all pairs of cases and put those values in a D matrix (cases x cases)
- Start with as many clusters as cases (1 case in each cluster)
- On each step
  - Identify the 2 clusters that are most similar & combine those into a single cluster
  - Compute the profile error (not everyone in a cluster is identical)
  - Re-compute the profile similarity among all cluster pairs
- Repeat until there is a single cluster
(a code sketch of this procedure follows below)
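A minimal sketch of this procedure in Python, using scipy's hierarchical clustering routines. The data matrix X here is simulated (two made-up "kinds of folks"); a real analysis would substitute its own cases-by-variables profile data.

```python
# Hierarchical (agglomerative) clustering with scipy -- illustrative data only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[0, 2, 1, 0, 1], scale=0.5, size=(20, 5)),   # group 1 profiles
               rng.normal(loc=[3, 0, 2, 3, 2], scale=0.5, size=(20, 5))])  # group 2 profiles

D = pdist(X, metric="euclidean")          # the D matrix: all pairwise profile dissimilarities
Z = linkage(D, method="ward")             # start with 1 case per cluster, then repeatedly
                                          # merge the two most similar clusters
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree at 2 clusters
print(labels)                             # cluster membership for each case
```

scipy's `dendrogram(Z)` can then be used to display the full agglomeration tree.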
7 Cluster people to find the number and identity of the different kinds of characteristic profiles
[Figure: the X matrix (cases x variables) is turned into the D matrix (cases x cases), which is summarized in C, a cases x 1 vector of cluster memberships]
D captures the similarities and differences among the cases, which are summarized in C (cluster membership), which provides the basis for deciding how many and what are the sets of people.
8 Determining the number of hierarchical clusters
- With each agglomeration step in the clustering procedure the 2 most similar groups are combined into a single cluster
  - parsimony increases -- fewer clusters & a simpler solution
  - error increases -- cases combined into clusters aren't identical
- We want to identify the parsimony vs. error trade-off
- Examine the error increase at each agglomeration step (a code sketch follows below)
  - a large jump in error indicates too few clusters -- you have just combined two clusters that are very dissimilar
  - frankly this doesn't often work very well by itself -- need to include more info to decide how many clusters you have!!!
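A small sketch of that check, assuming scipy's linkage output: the third column of the linkage matrix holds the "height" (error) of each merge, so a large step-to-step increase flags a merge of dissimilar clusters. Data are simulated for illustration.

```python
# Inspect the error added at each agglomeration step (scipy linkage matrix, column 2).
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(10, 4)), rng.normal(4, 1, size=(10, 4))])

Z = linkage(X, method="ward")
heights = Z[:, 2]                          # merge distance for each step (one row per merge)
n = X.shape[0]
for step in range(len(heights) - 6, len(heights)):   # the last few merges are the interesting ones
    print(f"{n - step} -> {n - step - 1} clusters at height {heights[step]:.2f}")
print("step-to-step increases:", np.round(np.diff(heights[-6:]), 2))
```

With these simulated data the jump should appear at the final merge, where the two distinct groups are forced into one cluster.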
9 Determining the number of clusters, cont.
- There is really no substitute for obtaining and plotting multiple clustering solutions, paying close attention to the number of cases in the different clusters.
- Follow the merging of clusters, asking if importantly dissimilar groups of substantial size have been merged
  - be sure to consider the sizes of the groups -- subgroups of less than 5 are usually too small to trust without theory & replication
  - we'll look at some statistical help for this later
- When evaluating a cluster solution, be sure to consider ...
  - Stability -- are clusters similar if you add or delete clusters?
  - Replicability -- split-half or replication analyses (a rough sketch follows below)
  - Meaningfulness (e.g., knowing a priori about groups helps)
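One rough way to probe stability/replicability is to re-cluster a random subsample and compare its labels for the shared cases to the full-sample solution, e.g. with the adjusted Rand index (1 = identical partitions, near 0 = chance-level agreement). This is only one of many possible checks, and the data below are simulated.

```python
# Subsample stability check for a hierarchical solution -- illustrative data only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(25, 5)), rng.normal(3, 1, size=(25, 5))])

full_labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")

keep = rng.choice(X.shape[0], size=45, replace=False)      # drop ~10% of the cases
sub_labels = fcluster(linkage(X[keep], method="ward"), t=2, criterion="maxclust")

print(adjusted_rand_score(full_labels[keep], sub_labels))  # agreement on the shared cases
```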
10 How different is different enough to keep as separate clusters?
That is a tough one. On how many variables must the clusters differ? By how much must they differ? Are level differences really important? Or only shape differences? How many have to be in the cluster for it to be interesting?
This is a great example of something we've discussed many times before: the more content knowledge you bring to the analysis, the more informative the analysis is likely to be!!! You need to know about the population and related literature to know how much of a difference is a difference that matters.
11 Strays in Hierarchical Analyses
- A stray is a person with a profile that matches no one else.
  - data collection, collation, computation or other error
  - member(s) of a population/group not otherwise represented
- Strays can cause us a couple of kinds of problems
  - a 10-group clustering might be 6 strays and 4 substantial clusters -- the agglomeration error can't tell you; you have to track the cluster frequencies
  - a stray may be forced into a group, without really belonging there, and change the profile of that group such that which other cases join it are changed -- you have to check if group members are really similar (more later)
12 Within-cluster Variability in Cluster Analyses
When we plot profiles, differences in level or shape can look important enough to keep two clusters separated.
Adding whiskers (Std, SEM or CIs) can help us recognize when groups are and aren't really different (these aren't). NHST tests can help too (more later). (A plotting sketch follows below.)
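A quick way to build such a plot, assuming matplotlib and simulated cluster data (the variable names are illustrative):

```python
# Cluster mean profiles with SEM whiskers -- heavily overlapping whiskers suggest
# the apparent differences may not be worth keeping as separate clusters.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
cluster_a = rng.normal([50, 60, 55, 52, 58], 8, size=(30, 5))   # 30 cases x 5 variables
cluster_b = rng.normal([52, 62, 54, 53, 60], 8, size=(25, 5))

x = np.arange(5)
for data, name in [(cluster_a, "Cluster A"), (cluster_b, "Cluster B")]:
    mean = data.mean(axis=0)
    sem = data.std(axis=0, ddof=1) / np.sqrt(data.shape[0])
    plt.errorbar(x, mean, yerr=sem, marker="o", capsize=3, label=name)

plt.xticks(x, ["V1", "V2", "V3", "V4", "V5"])
plt.legend()
plt.show()
```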
13 Making cluster solutions more readable
- Some variable sets and their ordering are well known
  - MMPI, WISC, NEO, MCMI, etc.
  - if so, follow the expected ordering
- Most of the time, the researcher can select the variable order
  - pick an order that highlights and simplifies cluster comparisons
  - minimize the number of humps & cross-overs
  - the one on the left below is probably better
[Figure: the same cluster profiles plotted under two variable orderings, A B C D E F and C D F A B E]
14 Profile Dissimilarity Measures
- For each formula
  - y are data from the 1st person & x are data from the 2nd person being compared
  - summing is across variables
- Euclidean: √Σ(y - x)²
- Squared Euclidean: Σ(y - x)² (probably most popular)
- City-Block: Σ|y - x|
- Chebychev: max|y - x|
- Cosine: cos rxy (a similarity index)
(a code sketch of these measures follows below)
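These are all available in scipy; a small sketch for two made-up 5-variable profiles (note that scipy's cosine and correlation functions return distances, so they are flipped back into similarities here):

```python
# Profile dissimilarity (and similarity) measures for two cases, via scipy.
import numpy as np
from scipy.spatial import distance

y = np.array([2.0, 4.0, 3.0, 1.0, 5.0])   # person 1's profile
x = np.array([1.0, 3.5, 2.0, 2.0, 4.0])   # person 2's profile

print("Euclidean        ", distance.euclidean(y, x))       # sqrt(sum((y - x)**2))
print("Squared Euclidean", distance.sqeuclidean(y, x))      # sum((y - x)**2)
print("City-Block       ", distance.cityblock(y, x))        # sum(|y - x|)
print("Chebychev        ", distance.chebyshev(y, x))        # max(|y - x|)
print("Cosine similarity", 1 - distance.cosine(y, x))       # cosine of the angle between raw profiles
print("Pearson r        ", 1 - distance.correlation(y, x))  # shape-only similarity, i.e. the
                                                            # correlation the cosine measure is based on
```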
15 Euclidean: √Σ(y - x)²
[Figure: X = (2, 0) and Y = (-2, -2) plotted on axes V1 and V2]
√((2 - (-2))² + (0 - (-2))²) = √(4² + 2²) = √20 ≈ 4.47
4.47 represents the multivariate dissimilarity of X & Y.
16 Squared Euclidean: Σ(y - x)²
[Figure: X = (2, 0) and Y = (-2, -2) plotted on axes V1 and V2]
(2 - (-2))² + (0 - (-2))² = 4² + 2² = 20
- 20 represents the multivariate dissimilarity of X & Y
- Squared Euclidean is a little better at noticing strays
  - remember that we use a square root transform to pull in outliers
  - leaving the value squared makes the strays stand out a bit more
17 City-Block: Σ|y - x|
[Figure: X = (2, 0) and Y = (-2, -2) plotted on axes V1 and V2]
|2 - (-2)| + |0 - (-2)| = 4 + 2 = 6
So named because in a city you have to go around the block -- you can't cut the diagonal.
18 Chebychev: max|y - x|
[Figure: X = (2, 0) and Y = (-2, -2) plotted on axes V1 and V2]
|2 - (-2)| = 4, |0 - (-2)| = 2, so max(4, 2) = 4
Uses the greatest univariate difference to represent the multivariate dissimilarity.
19 Cosine: cos rxy
[Figure: X = (2, 0) and Y = (-2, -2) plotted on axes V1 and V2, with the angle Φ between them]
First: correlate the scores from the 2 cases across variables.
Second: find the cosine of the angle (Φ) corresponding to that correlation.
- This is a similarity index -- all the other measures we have looked at are dissimilarity indices
- Using correlations ignores level differences between cases & looks only at shape differences (see next page)
20
- Based on Euclidean or Squared Euclidean these four cases would probably group as
  - orange & blue
  - green & yellow
- While those within the groups have somewhat different shapes, they are at very similar levels
[Figure: four profiles (orange, blue, green, yellow) across variables A B C D E]
- Based on Cos r these four cases would probably group as
  - blue & green
  - orange & yellow
- Because correlation pays attention only to profile shape
It is important to carefully consider how you want to define profile similarity when clustering -- it will likely change the results you get. (A sketch of this point follows below.)
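A demonstration of that point, with four invented profiles built so that two pairs share level and the other pairing shares shape (the colour names match the bullets above only by construction):

```python
# The same four profiles group differently under a level-sensitive metric (squared
# Euclidean) vs. a shape-only metric (correlation).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

profiles = np.array([[8, 7, 6, 6, 5],    # "orange": high level, shape A
                     [8, 6, 7, 5, 6],    # "blue"  : high level, shape B
                     [3, 1, 2, 0, 1],    # "green" : low level,  shape B
                     [3, 2, 1, 1, 0]])   # "yellow": low level,  shape A

for metric in ("sqeuclidean", "correlation"):
    Z = linkage(pdist(profiles, metric=metric), method="average")
    print(metric, fcluster(Z, t=2, criterion="maxclust"))
# sqeuclidean pairs orange+blue and green+yellow (level);
# correlation pairs blue+green and orange+yellow (shape)
```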
21 How Hierarchical Clustering works
- Data in an X matrix (cases x variables)
- Compute the profile similarity of all pairs of cases and put those values in a D matrix (cases x cases)
- Start with as many clusters as cases (1 case in each cluster)
- On each step
  - Identify the 2 clusters that are most similar
  - A cluster may have 1 or more cases
  - Combine those 2 into a single cluster
  - Re-compute the profile similarity among all cluster pairs
- Repeat until there is a single cluster
22 Amalgamation & Linkage Procedures -- which clusters to combine?
- Ward's -- joins the two clusters that will produce the smallest increase in the pooled within-cluster variation (works best with Squared Euclidean)
- Centroid Condensation -- joins the two clusters with the closest centroids -- the profile of the joined cluster is the mean of the two (works best with the squared Euclidean distance metric)
- Median Condensation -- same as centroid, except that equal weighting is used to construct the centroid of the joined cluster (as if the 2 clusters being joined had equal N)
- Between-Groups Average Linkage -- joins the two clusters for which the average distance between members of those two clusters is the smallest
23 Amalgamation & Linkage Procedures, cont.
- Within-Groups Average Linkage -- joins the two clusters for which the average distance between members of the resulting cluster will be smallest
- Single Linkage -- the two clusters are joined which have the most similar two cases
- Complete Linkage -- the two clusters are joined for which the maximum distance between a pair of cases in the two clusters is the smallest
(a code sketch comparing several of these follows below)
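Several of these rules are available in scipy under its own names (ward, centroid, median, average = between-groups average, single, complete; scipy has no built-in within-groups average linkage). A minimal comparison on simulated data:

```python
# Run the same data through several linkage rules and compare the 2-cluster solutions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(15, 5)), rng.normal(3, 1, size=(15, 5))])

for method in ("ward", "centroid", "median", "average", "single", "complete"):
    # ward/centroid/median assume Euclidean geometry, so raw observations are passed
    Z = linkage(X, method=method)
    print(method, fcluster(Z, t=2, criterion="maxclust"))
```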
24 Ward's -- joins the two clusters that will produce the smallest increase in the pooled within-cluster variation
- works well with Squared Euclidean metrics (identifies strays)
- attempts to reduce cluster overlap by minimizing SSerror
- produces more clusters, each with lower variability
- Computationally intensive, but statistically simple. On each step:
  - Take every pair of clusters & combine them
  - Compute the variance across cases for each variable
  - Combine those univariate variances into a multivariate variance index
  - Identify the pair of clusters with the smallest multivariate variance index
  - Those two clusters are combined
(a sketch of one such step follows below)
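A sketch of a single Ward-style step under those assumptions (the clusters and their assignments are invented; the pooled within-cluster sum of squares stands in for the "multivariate variance index"):

```python
# One Ward agglomeration step: merge the pair of clusters whose combination adds the
# least to the pooled within-cluster sum of squares.
import numpy as np

def within_ss(data):
    """Pooled sum of squared deviations from the cluster centroid, across all variables."""
    return ((data - data.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(7)
clusters = [rng.normal(m, 1, size=(5, 3)) for m in (0.0, 0.5, 4.0)]   # three current clusters

best = None
for i in range(len(clusters)):
    for j in range(i + 1, len(clusters)):
        merged = np.vstack([clusters[i], clusters[j]])
        increase = within_ss(merged) - within_ss(clusters[i]) - within_ss(clusters[j])
        print(f"merging clusters {i} and {j} adds {increase:.2f} to SSerror")
        if best is None or increase < best[0]:
            best = (increase, i, j)

print("Ward would combine clusters", best[1], "and", best[2])   # the two nearby clusters
```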
25 Centroid Condensation -- joins the two clusters with the closest centroids -- the profile of the joined cluster is the mean of the two
- Compute the centroid for each cluster
- The distance between every pair of cluster centroids is computed
- The two clusters with the shortest centroid distance are joined
- The centroid for the new cluster is computed as the mean of the joined centroids
  - the new centroid will be closest to the larger group -- it contributes more cases
  - if a stray is added, it is unlikely to mis-position the new centroid
26 Median Condensation -- joins the two clusters with the closest centroids -- the profile of the joined cluster is the median of the two
- is better than the last if you suspect groups of different size
- Compute the centroid for each cluster
- The distance between every pair of cluster centroids is computed
- The two clusters with the shortest centroid distance are joined
- The centroid for the new cluster is computed as the median of the joined centroids
  - the new centroid will be closest to the larger group -- it contributes more cases
  - if a stray is added, it is very likely to mis-position the new centroid
27 Between-Groups Average Linkage -- joins the two clusters with the smallest average cross-linkage -- the profile of the joined cluster is the mean of the two
- For each pair of clusters, find the links across the clusters (links for one pair shown -- yep, there are lots of these)
- The two clusters with the shortest average cross-cluster distance are joined -- more complete than just comparing centroid distances
- The centroid for the new cluster is computed as the mean of the joined centroids
  - the new centroid will be closest to the larger group -- it contributes more cases
  - if a stray is added, it is unlikely to mis-position the new centroid
28 Within-Groups Average Linkage -- joins the two clusters with the smallest average within-cluster linkage -- the profile of the joined cluster is the mean of the two
- For each pair of clusters, find the links within that cluster pair (a few between & within links shown -- yep, there are scads of these)
- The two clusters with the shortest average within-cluster distance are joined -- more complete than between-groups average linkage
- The centroid for the new cluster is computed as the mean of the joined centroids
  - like Ward's, but with the smallest average distance instead of the minimum SS
  - the new centroid will be closest to the larger group
  - if a stray is added, it is unlikely to mis-position the new centroid
29 Single Linkage -- joins the two clusters with the nearest neighbors -- the profile of the joined cluster is computed from the case data
- Compute the nearest-neighbor distance for each cluster pair
- The two clusters with the shortest nearest-neighbor distance are joined
- The centroid for the new cluster is computed from all cases in the new cluster
  - groupings are based on the position of a single pair of cases
  - outlying cases can lead to undisciplined groupings (see above)
30 Complete Linkage -- joins the two clusters with the nearest farthest neighbors -- the profile of the joined cluster is computed from the case data
- Compute the farthest-neighbor distance for each cluster pair
- The two clusters with the shortest farthest-neighbor distance are joined
- The centroid for the new cluster is computed from all cases in the new cluster
  - groupings are based on the position of a single pair of cases
  - can lead to undisciplined groupings (see above)
31 k-means Clustering -- Non-hierarchical
- select the desired number of clusters (k)
- identify the clustering variables
- First Iteration
  - the computer places each case into the multidimensional variable space
  - the computer randomly assigns cases to the k groups & computes the centroid of each group
  - compute the distance from each case to each group centroid
  - cases are re-assigned to the group to which they are closest
- Subsequent Iterations
  - re-compute the centroid for each group
  - for each case re-compute the distance to each group centroid
  - cases are re-assigned to the group to which they are closest
- Stop
  - when cases don't change groups or centroids don't change
  - failure to converge can happen, but doesn't often
(a k-means code sketch follows below)
32 Hierarchical & k-means Clustering
- There are two major issues in cluster analysis -- leading to a third
  1. How many clusters are there?
  2. Who belongs to each cluster?
  3. What are the clusters? That is, how do we describe them, based on a description of who is in each?
- Different combinations of clustering metrics, amalgamation and linkage often lead to different answers to these questions.
- Hierarchical & k-means clustering often lead to different answers as well.
- The more clusters in the solutions derived by different procedures, the more likely those clusters are to disagree
- How different procedures handle strays and small-frequency profiles often accounts for the resulting differences
(a sketch comparing two solutions follows below)
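One way to quantify how much two solutions agree is the adjusted Rand index (1 = identical partitions, near 0 = chance agreement); here a hierarchical and a k-means solution are compared on simulated data:

```python
# How much do a hierarchical and a k-means solution (dis)agree?
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(m, 1.0, size=(25, 4)) for m in (0.0, 2.0, 4.0)])

hier = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
kmns = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("agreement between the two solutions:", adjusted_rand_score(hier, kmns))
```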
33 Using ldf when clustering
- It is common to hear that following a clustering with an ldf is silly -- it depends!
- There are two different kinds of ldfs -- with different goals . . .
- Predicting groups using the same variables used to create the clusters → an internal ldf
  - always works -- there are discriminable groups (duh!!)
  - but you will learn something from what variables separate which groups (may be a small subset of the variables used)
  - reclassification errors tell you about strays & forced cases
  - gives you a spatial model of the clusters (concentrated vs. diffuse structure, etc.)
- Predicting groups using a different set of variables than those used to create the clusters → an external ldf
  - asks if knowing group membership tells you anything
(a sketch of an internal ldf follows below)
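A sketch of an internal ldf, assuming sklearn's linear discriminant analysis and a clustering built from the same simulated variables (an external ldf would be the same call with predictors that were not used to form the clusters):

```python
# Internal ldf: LDA predicting cluster membership from the clustering variables.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(8)
X = np.vstack([rng.normal([0, 2, 1, 0], 0.7, size=(30, 4)),
               rng.normal([2, 0, 2, 3], 0.7, size=(30, 4))])
clusters = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")

ldf = LinearDiscriminantAnalysis().fit(X, clusters)
print(ldf.coef_)                                     # which variables separate which groups
misfits = np.flatnonzero(ldf.predict(X) != clusters)
print("cases not reclassified into their own cluster:", misfits)   # candidate strays / forced cases
```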