Metrics, Algorithms
1
Metrics, Algorithms & Follow-ups
  • Profile Similarity Measures
  • Cluster combination procedures
  • Hierarchical vs. Non-hierarchical Clustering
  • Statistical follow-up analyses
  • Internal ANOVA & ldf analyses
  • External ANOVA & ldf analyses

2
Profile Dissimilarity Measures
  • For each formula:
  • y are the data from the 1st person, x are the data from the 2nd person being compared
  • summing across variables
  • Euclidean: √ Σ (y − x)²
  • Squared Euclidean: Σ (y − x)² (probably most popular)
  • City-Block: Σ |y − x|
  • Chebychev: max |y − x|
  • Cosine: cos(r_xy) (a similarity index)
(a short code sketch of these measures follows)
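As a quick check on these formulas, here is a minimal sketch (my own addition, using SciPy's distance functions rather than anything from the presentation) that reproduces the worked example on the next slides, X = (2, 0) vs. Y = (−2, −2):

```python
# Profile (dis)similarity measures for two cases, using SciPy.
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 0.0])    # scores for the 1st person
y = np.array([-2.0, -2.0])  # scores for the 2nd person

print("Euclidean        :", distance.euclidean(y, x))    # sqrt(20) ~ 4.47
print("Squared Euclidean:", distance.sqeuclidean(y, x))  # 20
print("City-block       :", distance.cityblock(y, x))    # 6
print("Chebychev        :", distance.chebyshev(y, x))    # 4

# Cosine is a *similarity* index; SciPy returns 1 - cosine similarity,
# so convert it back.
print("Cosine similarity:", 1 - distance.cosine(y, x))
```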

3
  • Euclidean: √ Σ (y − x)²

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2]

√[ (−2 − 2)² + (−2 − 0)² ] = √(4² + 2²) = √20 ≈ 4.47

4.47 represents the multivariate dissimilarity of X & Y
4
  • Squared Euclidean: Σ (y − x)²

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2]

(−2 − 2)² + (−2 − 0)² = 4² + 2² = 20

  • 20 represents the multivariate dissimilarity of X & Y
  • Squared Euclidean is a little better at
    noticing strays
  • remember that we use a square root transform to
    pull in outliers
  • leaving the value squared makes the strays stand
    out a bit more

5
  • City-Block: Σ |y − x|

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2]

|−2 − 2| + |−2 − 0| = 4 + 2 = 6
So named because in a city you have to go around the block; you can't cut across the diagonal.
6
Chebychev: max |y − x|

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2]

|−2 − 2| = 4, |−2 − 0| = 2
max(4, 2) = 4
Uses the greatest univariate difference to
represent the multivariate dissimilarity.
7
Cosine: cos(r_xy)

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2, with the angle Φ between them]

First, correlate the scores from the 2 cases across variables.
Second, find the cosine of that correlation's angle (Φ).
  • This is a similarity index; all the other measures we have looked at are dissimilarity indices
  • Using correlations ignores level differences between cases and looks only at shape differences (see next page)

8
  • Based on Euclidean or Squared Euclidean, these four cases would probably group as
  • orange & blue
  • green & yellow
  • While those within the groups have somewhat different shapes, they are at very similar levels
[Figure: four case profiles plotted across variables A–E]
  • Based on cos(r), these four cases would probably group as
  • blue & green
  • orange & yellow
  • Because correlation pays attention only to profile shape
It is important to carefully consider how you want to define profile similarity when clustering; it will likely change the results you get (see the sketch below).
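To make the level-vs-shape point concrete, here is a small illustration with hypothetical profiles (my own toy numbers, not the cases pictured on the slide): squared Euclidean pairs the cases with similar levels, while correlation pairs the cases with the same shape.

```python
# Hypothetical profiles across five variables: p1/p2 share a high level,
# p3/p4 share a low level, while p1/p3 and p2/p4 share a shape.
import numpy as np
from scipy.spatial import distance

profiles = {
    "p1": np.array([5, 6, 5, 6, 5]),   # high level
    "p2": np.array([6, 5, 6, 5, 6]),   # high level, opposite shape
    "p3": np.array([1, 2, 1, 2, 1]),   # low level, same shape as p1
    "p4": np.array([2, 1, 2, 1, 2]),   # low level, same shape as p2
}

for a in profiles:
    for b in profiles:
        if a < b:
            d2 = distance.sqeuclidean(profiles[a], profiles[b])
            r = np.corrcoef(profiles[a], profiles[b])[0, 1]
            print(f"{a}-{b}: squared Euclidean = {d2:5.1f}, r = {r:+.2f}")

# Squared Euclidean is smallest for p1-p2 and p3-p4 (similar levels);
# r = +1 only for p1-p3 and p2-p4 (same shape).
```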
9
How Hierarchical Clustering works
  • Data in an X matrix (cases x variables)
  • Compute the profile similarity of all pairs of
    cases and put those values in a D matrix (cases
    x cases)
  • Start with # clusters = # cases (1 case in each cluster)
  • On each step
  • Identify the 2 clusters that are most similar
  • A cluster may have 1 or more cases
  • Combine those 2 into a single cluster
  • Re-compute the profile similarity among all
    cluster pairs
  • Repeat until there is a single cluster
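The loop above is what SciPy's agglomerative clustering routines carry out; a minimal sketch (my own addition; the squared Euclidean metric and "average" linkage are just one of the combinations discussed on the next slides):

```python
# Agglomerative clustering: cases-by-variables matrix X, pairwise distance
# matrix D, then repeated merging of the two most similar clusters.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))            # 20 cases x 4 clustering variables (toy data)

D = pdist(X, metric="sqeuclidean")      # condensed D matrix of all case pairs
Z = linkage(D, method="average")        # merge history, from 20 clusters down to 1

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree at 3 clusters
print(labels)
```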

10
Amalgamation Linkage Procedures -- which
clusters to combine ?
  • Ward's -- joins the two clusters that will produce the smallest increase in the pooled within-cluster variation (works best with Squared Euclidean)
  • Centroid Condensation -- joins the two clusters
    with the closest centroids -- profile of joined
    cluster is mean of two (works best with squared
    Euclidean distance metric)
  • Median Condensation -- same as centroid, except
    that equal weighting is used to construct the
    centroid of the joined cluster (as if the 2
    clusters being joined had equal-N)
  • Between Groups Average Linkage -- joins the two
    clusters for which the average distance between
    members of those two clusters is the smallest

11
Amalgamation Linkage Procedures, cont.
  • Within-groups Average Linkage -- joins the two
    clusters for which the average distance between
    members of the resulting cluster will be smallest
  • Single Linkage -- two clusters are joined which
    have the most similar two cases
  • Complete Linkage -- two clusters are joined for
    which the maximum distance between a pair of
    cases in the two clusters is the smallest
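For reference, these procedures map roughly onto the `method` argument of `scipy.cluster.hierarchy.linkage` (this mapping is my own note, not part of the presentation):

```python
# Rough correspondence between the procedures above and SciPy's linkage methods.
LINKAGE_METHODS = {
    "Ward's":                         "ward",      # requires Euclidean distances
    "Centroid condensation":          "centroid",  # requires Euclidean distances
    "Median condensation":            "median",    # requires Euclidean distances
    "Between-groups average linkage": "average",
    "Single linkage":                 "single",
    "Complete linkage":               "complete",
    # Within-groups average linkage (offered by SPSS) has no direct SciPy equivalent.
}
```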

12
  • Ward's -- joins the two clusters that will produce the smallest increase in the pooled within-cluster variation
  • works well with Squared Euclidean metrics (identifies strays)
  • attempts to reduce cluster overlap by minimizing SSerror
  • produces more clusters, each with lower variability
  • computationally intensive, but statistically simple
  • On each step (sketched in the code below)
  • take every pair of clusters and combine them
  • compute the variance across cases for each variable
  • combine those univariate variances into a multivariate variance index
  • identify the pair of clusters with the smallest multivariate variance index
  • those two clusters are combined
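A minimal sketch of one such step, assuming Ward's criterion as defined above (the increase in pooled within-cluster sum of squares); the function names and toy structure are my own:

```python
# One Ward step: for every pair of current clusters, compute the increase in
# pooled within-cluster SS that merging them would produce; pick the smallest.
import numpy as np

def within_ss(cases):
    """Sum of squared deviations from the cluster centroid, summed over variables."""
    cases = np.asarray(cases, dtype=float)
    return ((cases - cases.mean(axis=0)) ** 2).sum()

def best_ward_merge(clusters):
    """clusters: list of (n_cases x n_vars) arrays. Returns (pair to merge, SS increase)."""
    best_pair, best_increase = None, np.inf
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            merged = np.vstack([clusters[i], clusters[j]])
            increase = within_ss(merged) - within_ss(clusters[i]) - within_ss(clusters[j])
            if increase < best_increase:
                best_pair, best_increase = (i, j), increase
    return best_pair, best_increase
```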

13
  • Centroid Condensation
  • -- joins the two clusters with the closest
    centroids
  • -- profile of joined cluster is mean of two

Compute the centroid for each cluster
The distance between every pair of cluster
centroids is computed.
The two clusters with the shortest centroid
distance are joined
The centroid for the new cluster is computed as
the mean of the joined centroids
  • new centroid will be closest to the larger group, since it contributes more cases
  • if a stray is added, it is unlikely to
    mis-position the new centroid

14
  • Median Condensation
  • -- joins the two clusters with the closest centroids
  • -- profile of joined cluster is the median of the two
  • -- is better than the last method if you suspect groups of different sizes

Compute the centroid for each cluster
The distance between every pair of cluster
centroids is computed.
The two clusters with the shortest centroid
distance are joined
The centroid for the new cluster is computed as
the median of the joined centroids
  • new centroid is not pulled toward the larger group, since the two joined clusters get equal weight
  • if a stray is added, it is very likely to mis-position the new centroid

15
  • Between Groups Average Linkage
  • -- joins the two clusters with smallest average
    cross-linkage
  • -- profile of joined cluster is mean of two

For each pair of clusters, find all of the links (case-to-case distances) across the two clusters (the figure showed the links for one pair; yep, there are lots of these).
The two clusters with the shortest average cross-cluster distance are joined -- more complete than just comparing centroid distances.
The centroid for the new cluster is computed as the mean of the joined centroids.
  • new centroid will be closest to the larger group, since it contributes more cases
  • if a stray is added, it is unlikely to mis-position the new centroid

16
  • Within Groups Average Linkage
  • -- joins the two clusters with smallest average
    within linkage
  • -- profile of joined cluster is mean of two

For each pair of clusters, find all of the links within the combined cluster pair, both between and within the two clusters (the figure showed a few of each; yep, there are scads of these).
The two clusters that yield the smallest average distance within the resulting cluster are joined -- more complete than between-groups average linkage.
The centroid for the new cluster is computed as the mean of the joined centroids.
  • like Ward's, but using the smallest average distance instead of the minimum SS
  • new centroid will be closest to the larger group
  • if a stray is added, it is unlikely to mis-position the new centroid

17
  • Single Linkage
  • -- joins the two clusters with the nearest
    neighbors
  • -- profile of joined cluster is computed from
    case data

Compute the nearest neighbor distance for each
cluster pair
The two clusters with the shortest nearest
neighbor distance are joined
The centroid for the new cluster is computed from
all cases in the new cluster
  • groupings are based on the position of a single pair of cases
  • outlying cases can lead to undisciplined groupings

18
  • Complete Linkage
  • -- joins the two clusters with the nearest
    farthest neighbors
  • -- profile of joined cluster is computed from
    case data

Compute the farthest neighbor distance for each
cluster pair
The two clusters with the shortest farthest
neighbor distance are joined
The centroid for the new cluster is computed from
all cases in the new cluster
  • groupings are based on the position of a single pair of cases
  • can lead to undisciplined groupings

19
  • k-means Clustering (Non-hierarchical)
  • select the desired number of clusters
  • identify the k clustering variables
  • First Iteration
  • the computer places each case into the
    k-dimensional space
  • the computer randomly assigns cases to the k
    groups computes the k-dim centroid of each
    group
  • compute the distance from each case to each
    group centroid
  • cases are re-assigned to the group to which they
    are closest
  • Subsequent Iterations
  • re-compute the centroid for each group
  • for each case re-compute the distance to each
    group centroid
  • cases are re-assigned to the group to which they
    are closest
  • Stop
  • when cases don't change groups or centroids don't change
  • failure to converge can happen, but doesn't often (a minimal sketch of this loop follows)
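A bare-bones sketch of the loop just described (my own illustration with a hypothetical `kmeans` helper; it does not handle the edge case where a group ends up empty):

```python
# k-means: random initial assignment, then alternate between recomputing
# group centroids and reassigning each case to its nearest centroid.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))              # random initial assignment
    for _ in range(max_iter):
        # centroid of each group (note: an empty group would give NaN here)
        centroids = np.array([X[labels == g].mean(axis=0) for g in range(k)])
        # distance from every case to every group centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)               # reassign to nearest centroid
        if np.array_equal(new_labels, labels):          # stop when no case changes group
            break
        labels = new_labels
    return labels, centroids
```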

20
  • Hierarchical vs. k-means Clustering
  • There are two major issues in cluster analysis, leading to a third
  • 1. How many clusters are there?
  • 2. Who belongs to each cluster?
  • 3. What are the clusters? That is, how do we describe them, based on a description of who is in each?
  • Different combinations of clustering metric and amalgamation/linkage procedure often lead to different answers to these questions.
  • Hierarchical and k-means clustering often lead to different answers as well.
  • The more clusters in the solutions derived by
    different procedures, the more likely those
    clusters are to disagree
  • How different procedures handle strays and
    small-frequency profiles often accounts for the
    resulting differences

21
Using ldf (linear discriminant function) when clustering
  • It is common to hear that following a clustering with an ldf is silly -- it depends!
  • There are two different kinds of ldfs -- with different goals . . .
  • predicting groups using the same variables used to create the clusters → an internal ldf
  • always works -- there are discriminable groups (duh!!)
  • but you will learn something from which variables separate which groups (may be a small subset of the variables used)
  • reclassification errors tell you about strays
  • gives you a spatial model of the clusters (concentrated vs. diffuse structure, etc.)
  • predicting groups using a different set of variables than those used to create the clusters → an external ldf
  • asks whether knowing group membership tells you anything about variables not used to form the clusters (a sketch of an internal ldf follows)
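A minimal sketch of the internal-ldf idea using scikit-learn's LinearDiscriminantAnalysis (the toy data and names here are my own; the presentation itself does not specify software for this step):

```python
# Internal ldf: a linear discriminant analysis predicting the cluster labels
# from the same variables that were used to form the clusters.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))          # cases x clustering variables (toy data)
labels = rng.integers(3, size=60)     # stand-in for real cluster assignments

lda = LinearDiscriminantAnalysis().fit(X, labels)
print("Reclassification accuracy:", lda.score(X, labels))   # errors point to strays
print("Discriminant weights:\n", lda.scalings_)             # which variables separate groups
```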