Metrics, Algorithms
1
Metrics, Algorithms & Follow-ups
  • Profile Similarity Measures
  • Cluster combination procedures
  • Hierarchical vs. Non-hierarchical Clustering
  • Statistical follow-up analyses
  • Internal ANOVA & ldf analyses
  • External ANOVA & ldf analyses

2
Profile Dissimilarity Measures
  • For each formula:
  • y are the data from the 1st person, x are the data from the 2nd person being compared
  • summing across variables
  • Euclidean: √ Σ (y − x)²
  • Squared Euclidean: Σ (y − x)² (probably most popular)
  • City-Block: Σ |y − x|
  • Chebychev: max |y − x|
  • Cosine: cos(r_xy) (a similarity index)
(a short code sketch of these measures follows)
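As a quick check on these formulas, here is a minimal sketch (my own addition, using SciPy's distance functions rather than anything from the presentation) that reproduces the worked example on the next slides, X = (2, 0) vs. Y = (−2, −2):

```python
# Profile (dis)similarity measures for two cases, using SciPy.
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 0.0])    # scores for the 1st person
y = np.array([-2.0, -2.0])  # scores for the 2nd person

print("Euclidean        :", distance.euclidean(y, x))    # sqrt(20) ~ 4.47
print("Squared Euclidean:", distance.sqeuclidean(y, x))  # 20
print("City-block       :", distance.cityblock(y, x))    # 6
print("Chebychev        :", distance.chebyshev(y, x))    # 4

# Cosine is a *similarity* index; SciPy returns 1 - cosine similarity,
# so convert it back.
print("Cosine similarity:", 1 - distance.cosine(y, x))
```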

3
  • Euclidean: √ Σ (y − x)²

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2]

√[ (−2 − 2)² + (−2 − 0)² ] = √(4² + 2²) = √20 ≈ 4.47

4.47 represents the multivariate dissimilarity of X & Y
4
  • Squared Euclidean: Σ (y − x)²

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2]

(−2 − 2)² + (−2 − 0)² = 4² + 2² = 20

  • 20 represents the multivariate dissimilarity of X & Y
  • Squared Euclidean is a little better at
    noticing strays
  • remember that we use a square root transform to
    pull in outliers
  • leaving the value squared makes the strays stand
    out a bit more

5
  • City-Block: Σ |y − x|

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2]

|−2 − 2| + |−2 − 0| = 4 + 2 = 6
So named because in a city you have to go around the block; you can't cut across the diagonal.
6
Chebychev: max |y − x|

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2]

|−2 − 2| = 4, |−2 − 0| = 2
max(4, 2) = 4
Uses the greatest univariate difference to
represent the multivariate dissimilarity.
7
Cosine: cos(r_xy)

[Figure: cases X = (2, 0) and Y = (−2, −2) plotted on axes V1 and V2, with the angle Φ between them]

First, correlate the scores from the 2 cases across variables.
Second, find the cosine of that correlation's angle (Φ).
  • This is a similarity index; all the other measures we have looked at are dissimilarity indices
  • Using correlations ignores level differences between cases and looks only at shape differences (see next page)

8
  • Based on Euclidean or Squared Euclidean, these four cases would probably group as
  • orange & blue
  • green & yellow
  • While those within the groups have somewhat different shapes, they are at very similar levels
[Figure: four case profiles plotted across variables A–E]
  • Based on cos(r), these four cases would probably group as
  • blue & green
  • orange & yellow
  • Because correlation pays attention only to profile shape
It is important to carefully consider how you want to define profile similarity when clustering; it will likely change the results you get (see the sketch below).
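To make the level-vs-shape point concrete, here is a small illustration with hypothetical profiles (my own toy numbers, not the cases pictured on the slide): squared Euclidean pairs the cases with similar levels, while correlation pairs the cases with the same shape.

```python
# Hypothetical profiles across five variables: p1/p2 share a high level,
# p3/p4 share a low level, while p1/p3 and p2/p4 share a shape.
import numpy as np
from scipy.spatial import distance

profiles = {
    "p1": np.array([5, 6, 5, 6, 5]),   # high level
    "p2": np.array([6, 5, 6, 5, 6]),   # high level, opposite shape
    "p3": np.array([1, 2, 1, 2, 1]),   # low level, same shape as p1
    "p4": np.array([2, 1, 2, 1, 2]),   # low level, same shape as p2
}

for a in profiles:
    for b in profiles:
        if a < b:
            d2 = distance.sqeuclidean(profiles[a], profiles[b])
            r = np.corrcoef(profiles[a], profiles[b])[0, 1]
            print(f"{a}-{b}: squared Euclidean = {d2:5.1f}, r = {r:+.2f}")

# Squared Euclidean is smallest for p1-p2 and p3-p4 (similar levels);
# r = +1 only for p1-p3 and p2-p4 (same shape).
```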
9
How Hierarchical Clustering works
  • Data in an X matrix (cases x variables)
  • Compute the profile similarity of all pairs of
    cases and put those values in a D matrix (cases
    x cases)
  • Start with # clusters = # cases (1 case in each cluster)
  • On each step
  • Identify the 2 clusters that are most similar
  • A cluster may have 1 or more cases
  • Combine those 2 into a single cluster
  • Re-compute the profile similarity among all
    cluster pairs
  • Repeat until there is a single cluster
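The loop above is what SciPy's agglomerative clustering routines carry out; a minimal sketch (my own addition; the squared Euclidean metric and "average" linkage are just one of the combinations discussed on the next slides):

```python
# Agglomerative clustering: cases-by-variables matrix X, pairwise distance
# matrix D, then repeated merging of the two most similar clusters.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))            # 20 cases x 4 clustering variables (toy data)

D = pdist(X, metric="sqeuclidean")      # condensed D matrix of all case pairs
Z = linkage(D, method="average")        # merge history, from 20 clusters down to 1

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree at 3 clusters
print(labels)
```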

10
Amalgamation Linkage Procedures -- which
clusters to combine ?
  • Ward's -- joins the two clusters that will produce the smallest increase in the pooled within-cluster variation (works best with Squared Euclidean)
  • Centroid Condensation -- joins the two clusters
    with the closest centroids -- profile of joined
    cluster is mean of two (works best with squared
    Euclidean distance metric)
  • Median Condensation -- same as centroid, except
    that equal weighting is used to construct the
    centroid of the joined cluster (as if the 2
    clusters being joined had equal-N)
  • Between Groups Average Linkage -- joins the two
    clusters for which the average distance between
    members of those two clusters is the smallest

11
Amalgamation Linkage Procedures, cont.
  • Within-groups Average Linkage -- joins the two
    clusters for which the average distance between
    members of the resulting cluster will be smallest
  • Single Linkage -- two clusters are joined which
    have the most similar two cases
  • Complete Linkage -- two clusters are joined for
    which the maximum distance between a pair of
    cases in the two clusters is the smallest
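For reference, these procedures map roughly onto the `method` argument of `scipy.cluster.hierarchy.linkage` (this mapping is my own note, not part of the presentation):

```python
# Rough correspondence between the procedures above and SciPy's linkage methods.
LINKAGE_METHODS = {
    "Ward's":                         "ward",      # requires Euclidean distances
    "Centroid condensation":          "centroid",  # requires Euclidean distances
    "Median condensation":            "median",    # requires Euclidean distances
    "Between-groups average linkage": "average",
    "Single linkage":                 "single",
    "Complete linkage":               "complete",
    # Within-groups average linkage (offered by SPSS) has no direct SciPy equivalent.
}
```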

12
  • Ward's -- joins the two clusters that will produce the smallest increase in the pooled within-cluster variation
  • works well with Squared Euclidean metrics (identifies strays)
  • attempts to reduce cluster overlap by minimizing SSerror
  • produces more clusters, each with lower variability
  • computationally intensive, but statistically simple
  • On each step (sketched in the code below)
  • take every pair of clusters and combine them
  • compute the variance across cases for each variable
  • combine those univariate variances into a multivariate variance index
  • identify the pair of clusters with the smallest multivariate variance index
  • those two clusters are combined
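A minimal sketch of one such step, assuming Ward's criterion as defined above (the increase in pooled within-cluster sum of squares); the function names and toy structure are my own:

```python
# One Ward step: for every pair of current clusters, compute the increase in
# pooled within-cluster SS that merging them would produce; pick the smallest.
import numpy as np

def within_ss(cases):
    """Sum of squared deviations from the cluster centroid, summed over variables."""
    cases = np.asarray(cases, dtype=float)
    return ((cases - cases.mean(axis=0)) ** 2).sum()

def best_ward_merge(clusters):
    """clusters: list of (n_cases x n_vars) arrays. Returns (pair to merge, SS increase)."""
    best_pair, best_increase = None, np.inf
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            merged = np.vstack([clusters[i], clusters[j]])
            increase = within_ss(merged) - within_ss(clusters[i]) - within_ss(clusters[j])
            if increase < best_increase:
                best_pair, best_increase = (i, j), increase
    return best_pair, best_increase
```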

13
  • Centroid Condensation
  • -- joins the two clusters with the closest
    centroids
  • -- profile of joined cluster is mean of two

Compute the centroid for each cluster
The distance between every pair of cluster
centroids is computed.
The two clusters with the shortest centroid
distance are joined
The centroid for the new cluster is computed as
the mean of the joined centroids
  • new centroid will be closest to the larger group, since it contributes more cases
  • if a stray is added, it is unlikely to
    mis-position the new centroid

14
  • Median Condensation
  • -- joins the two clusters with the closest centroids
  • -- profile of joined cluster is the median of the two
  • -- is better than the last method if you suspect groups of different sizes

Compute the centroid for each cluster
The distance between every pair of cluster
centroids is computed.
The two clusters with the shortest centroid
distance are joined
The centroid for the new cluster is computed as
the median of the joined centroids
  • new centroid is not pulled toward the larger group, since the two joined clusters get equal weight
  • if a stray is added, it is very likely to mis-position the new centroid

15
  • Between Groups Average Linkage
  • -- joins the two clusters with smallest average
    cross-linkage
  • -- profile of joined cluster is mean of two

For each pair of clusters, find all of the links (case-to-case distances) across the two clusters (the figure showed the links for one pair; yep, there are lots of these).
The two clusters with the shortest average cross-cluster distance are joined -- more complete than just comparing centroid distances.
The centroid for the new cluster is computed as the mean of the joined centroids.
  • new centroid will be closest to the larger group, since it contributes more cases
  • if a stray is added, it is unlikely to mis-position the new centroid

16
  • Within Groups Average Linkage
  • -- joins the two clusters with smallest average
    within linkage
  • -- profile of joined cluster is mean of two

For each pair of clusters, find all of the links within the combined cluster pair, both between and within the two clusters (the figure showed a few of each; yep, there are scads of these).
The two clusters that yield the smallest average distance within the resulting cluster are joined -- more complete than between-groups average linkage.
The centroid for the new cluster is computed as the mean of the joined centroids.
  • like Ward's, but using the smallest average distance instead of the minimum SS
  • new centroid will be closest to the larger group
  • if a stray is added, it is unlikely to mis-position the new centroid

17
  • Single Linkage
  • -- joins the two clusters with the nearest
    neighbors
  • -- profile of joined cluster is computed from
    case data

Compute the nearest neighbor distance for each
cluster pair
The two clusters with the shortest nearest
neighbor distance are joined
The centroid for the new cluster is computed from
all cases in the new cluster
  • groupings are based on the position of a single pair of cases
  • outlying cases can lead to undisciplined groupings

18
  • Complete Linkage
  • -- joins the two clusters with the nearest
    farthest neighbors
  • -- profile of joined cluster is computed from
    case data

Compute the farthest neighbor distance for each
cluster pair
The two clusters with the shortest farthest
neighbor distance are joined
The centroid for the new cluster is computed from
all cases in the new cluster
  • groupings are based on the position of a single pair of cases
  • can lead to undisciplined groupings

19
  • k-means Clustering (Non-hierarchical)
  • select the desired number of clusters
  • identify the k clustering variables
  • First Iteration
  • the computer places each case into the
    k-dimensional space
  • the computer randomly assigns cases to the k
    groups computes the k-dim centroid of each
    group
  • compute the distance from each case to each
    group centroid
  • cases are re-assigned to the group to which they
    are closest
  • Subsequent Iterations
  • re-compute the centroid for each group
  • for each case re-compute the distance to each
    group centroid
  • cases are re-assigned to the group to which they
    are closest
  • Stop
  • when cases don't change groups or centroids don't change
  • failure to converge can happen, but doesn't often (a minimal sketch of this loop follows)
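A bare-bones sketch of the loop just described (my own illustration with a hypothetical `kmeans` helper; it does not handle the edge case where a group ends up empty):

```python
# k-means: random initial assignment, then alternate between recomputing
# group centroids and reassigning each case to its nearest centroid.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))              # random initial assignment
    for _ in range(max_iter):
        # centroid of each group (note: an empty group would give NaN here)
        centroids = np.array([X[labels == g].mean(axis=0) for g in range(k)])
        # distance from every case to every group centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)               # reassign to nearest centroid
        if np.array_equal(new_labels, labels):          # stop when no case changes group
            break
        labels = new_labels
    return labels, centroids
```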

20
  • Hierarchical vs. k-means Clustering
  • There are two major issues in cluster analysis, leading to a third
  • 1. How many clusters are there?
  • 2. Who belongs to each cluster?
  • 3. What are the clusters? That is, how do we describe them, based on a description of who is in each?
  • Different combinations of clustering metric and amalgamation/linkage procedure often lead to different answers to these questions.
  • Hierarchical and k-means clustering often lead to different answers as well.
  • The more clusters in the solutions derived by
    different procedures, the more likely those
    clusters are to disagree
  • How different procedures handle strays and
    small-frequency profiles often accounts for the
    resulting differences

21
Using ldf (linear discriminant function) when clustering
  • It is common to hear that following a clustering with an ldf is silly -- it depends!
  • There are two different kinds of ldfs -- with different goals . . .
  • predicting groups using the same variables used to create the clusters → an internal ldf
  • always works -- there are discriminable groups (duh!!)
  • but you will learn something from which variables separate which groups (may be a small subset of the variables used)
  • reclassification errors tell you about strays
  • gives you a spatial model of the clusters (concentrated vs. diffuse structure, etc.)
  • predicting groups using a different set of variables than those used to create the clusters → an external ldf
  • asks whether knowing group membership tells you anything about variables not used to form the clusters (a sketch of an internal ldf follows)
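A minimal sketch of the internal-ldf idea using scikit-learn's LinearDiscriminantAnalysis (the toy data and names here are my own; the presentation itself does not specify software for this step):

```python
# Internal ldf: a linear discriminant analysis predicting the cluster labels
# from the same variables that were used to form the clusters.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))          # cases x clustering variables (toy data)
labels = rng.integers(3, size=60)     # stand-in for real cluster assignments

lda = LinearDiscriminantAnalysis().fit(X, labels)
print("Reclassification accuracy:", lda.score(X, labels))   # errors point to strays
print("Discriminant weights:\n", lda.scalings_)             # which variables separate groups
```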