Title: Clustering Analysis: Outline
1. Clustering Analysis: Outline
- General Purpose
- Clustering Categories
- Clustering vs Other Multivariate Data Analysis
- Area of Application
- Distance Measures
- Hierarchical Tree Clustering
- K-Means Clustering
- K-Means vs ANOVA
- Expectation Maximization Clustering (for Categorical Variables)
- Two-Way Clustering (Block Clustering)
- Ex: Concept Clustering in Texts
2. Clustering Analysis: General Purpose
- How to organize observed data into meaningful structures, that is, to develop taxonomies.
- Ex: To organize the different species of animals before a meaningful description of the differences between animals is possible.
- Target: Minimize within-group variation and at the same time maximize between-group variation.
3. Clustering Analysis: Categories
- A. Hierarchical Clustering
- Ex: Tree Clustering
- B. i) K-Means Clustering
- ii) Expectation Maximization Clustering
- C. Block Clustering (Two-Way Joining)
- Ex: Concept Clustering within Texts
4. Clustering Analysis: Similarity to Discriminant and Factor Analysis
- In Factor Analysis, the original set of variables is reduced to a smaller number of factors, while in clustering the original set of variables is grouped into clusters.
- In Discriminant Analysis, the clusters are known in advance and the discriminating variables are worked out, while in clustering we try to discover natural clusters within the data.
5. Clustering Analysis: Statistical Significance Testing
- Unlike many other statistical procedures, cluster
analysis methods are mostly used when we do not
have any a priori hypotheses, but are still in
the exploratory phase of our research. In a
sense, cluster analysis finds the "most
significant solution possible." Therefore,
statistical significance testing is really not
appropriate for clustering analysis.
6. Clustering Analysis: Area of Application
- In general, whenever one needs to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.
- Ex: Clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy.
7. Clustering Analysis: Hierarchical Clustering
- i) Bottom-Up (upward): The purpose of this method is to join variables together into successively larger clusters, using some measure of similarity or distance.
- Initially, each variable is considered a separate cluster.
- At each step, the similarity threshold is relaxed so that clusters are successively joined.
- ii) Top-Down (downward): In this method, a partitioning scheme is followed.
- Initially, the whole data set is considered a single cluster.
- At each step, the similarity threshold is tightened so that the data are split into smaller clusters.
8. Clustering Analysis: Hierarchical Tree Clustering
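A minimal sketch of hierarchical tree clustering, assuming SciPy is available; the sample data, the Euclidean metric, average linkage, and the cut-off threshold are illustrative choices, not part of the slides.

```python
# Minimal sketch of bottom-up (tree) clustering with SciPy.
# Sample data, Euclidean metric, average linkage, and the cut threshold
# are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 8.0], [5.1, 7.8], [9.0, 1.0]])

# Pairwise Euclidean distances, then successive merging of the closest clusters.
distances = pdist(X, metric="euclidean")
tree = linkage(distances, method="average")

# Cutting the tree at a chosen distance threshold yields flat cluster labels.
labels = fcluster(tree, t=2.0, criterion="distance")
print(labels)

# scipy.cluster.hierarchy.dendrogram(tree) would draw the tree itself
# if matplotlib is available.
```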
9. Clustering Analysis: Distance Measures
- Euclidean distance
- The distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers.
- distance(x, y) = { Σ_i (x_i - y_i)² }^½
- Chebychev distance
- Differentiates objects by the single dimension or attribute on which they are furthest apart.
- distance(x, y) = max_i |x_i - y_i|
- Percent disagreement
- This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature.
- distance(x, y) = (number of x_i ≠ y_i) / i
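A small sketch of the three distance measures above; the helper names and sample vectors are illustrative, not from the slides.

```python
# Minimal sketch of the three distance measures above.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def chebychev(x, y):
    return np.max(np.abs(x - y))

def percent_disagreement(x, y):
    # Fraction of dimensions i on which the (categorical) values differ.
    return np.mean(x != y)

x = np.array([1.0, 4.0, 2.0])
y = np.array([2.0, 1.0, 2.0])
print(euclidean(x, y))             # sqrt(1 + 9 + 0) = 3.162...
print(chebychev(x, y))             # largest single-dimension gap = 3.0

a = np.array(["red", "small", "round"])
b = np.array(["red", "large", "round"])
print(percent_disagreement(a, b))  # 1 of 3 dimensions disagree = 0.333...
```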
10. Clustering Analysis: K-Means Clustering
- When we already have hypotheses concerning the number of clusters in our cases or variables, we can apply k-means clustering.
- In general, the k-means method will produce exactly k different clusters of the greatest possible distinction.
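A minimal k-means sketch using scikit-learn (an assumed dependency); the synthetic data and the choice of k = 3 are illustrative only.

```python
# Minimal sketch: choosing k in advance and letting k-means find k maximally
# distinct clusters. The three synthetic groups of points are illustrative data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(30, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(30, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster membership of the first 10 cases
print(kmeans.cluster_centers_)   # the k cluster means
```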
11. Clustering Analysis: K-Means vs ANOVA
- K-means clustering is analogous to "ANOVA in reverse" in the sense that:
- The significance test in ANOVA evaluates the between-group variability against the within-group variability when testing the hypothesis that the group means differ from each other.
- In k-means clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results.
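A short sketch of the "ANOVA in reverse" idea: after fitting k-means, the within-cluster and between-cluster sums of squares can be compared, and a large between-to-within ratio is exactly what an ANOVA F test rewards. The synthetic data and variable names are illustrative assumptions.

```python
# After k-means, within-cluster variability is small relative to
# between-cluster variability; k-means searches for such a partition.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, size=(40, 2)) for m in ([0, 0], [4, 4])])

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
grand_mean = X.mean(axis=0)

within_ss = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                for c in np.unique(labels))
between_ss = sum(len(X[labels == c]) *
                 ((X[labels == c].mean(axis=0) - grand_mean) ** 2).sum()
                 for c in np.unique(labels))

print(within_ss, between_ss, between_ss / within_ss)
```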
12. Clustering Analysis: Expectation Maximization Clustering (Categorical Variables)
- Classification probabilities instead of classifications: each observation belongs to each cluster with a certain probability.
- Categorical variables: The EM algorithm can also accommodate categorical variables. The program will at first randomly assign different probabilities (weights, to be precise) to each class or category, for each cluster. In successive iterations, these probabilities are refined (adjusted) to maximize the likelihood of the data given the specified number of clusters.
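A minimal EM sketch for categorical data, assuming a mixture of independent categorical distributions (a latent class model); this is an illustrative implementation, not the specific program the slide refers to.

```python
# Minimal EM for a mixture of independent categorical distributions.
# X holds integer category codes per column; all names are illustrative.
import numpy as np

def em_categorical(X, n_clusters, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_cats = [int(X[:, j].max()) + 1 for j in range(d)]

    # Start from uniform mixing weights and random category probabilities
    # (weights) for each class or category, per cluster.
    pi = np.full(n_clusters, 1.0 / n_clusters)
    theta = [rng.dirichlet(np.ones(k), size=n_clusters) for k in n_cats]

    for _ in range(n_iter):
        # E-step: probability of each cluster given each observation.
        log_r = np.tile(np.log(pi), (n, 1))
        for j in range(d):
            log_r += np.log(theta[j][:, X[:, j]]).T
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)      # classification probabilities

        # M-step: refine weights and category probabilities to raise the
        # likelihood of the data given the chosen number of clusters.
        pi = r.mean(axis=0)
        for j in range(d):
            counts = np.array([r[X[:, j] == k].sum(axis=0)
                               for k in range(n_cats[j])]).T + 1e-9
            theta[j] = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta, r

# Tiny illustration: six observations, two categorical features.
X = np.array([[0, 0], [0, 1], [1, 2], [2, 2], [2, 1], [0, 0]])
pi, theta, resp = em_categorical(X, n_clusters=2)
print(resp.round(2))   # each row sums to 1: soft cluster membership
```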
13. Clustering Analysis: Two-Way Clustering (Block Clustering)
- Block Clustering is useful in the relatively rare
circumstances when one expects that both cases
and variables will simultaneously contribute to
the uncovering of meaningful patterns of clusters.
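A sketch of two-way clustering of cases and variables at the same time, using scikit-learn's SpectralCoclustering as one available co-clustering routine; it is an illustrative stand-in rather than the specific block-clustering procedure described above, and the planted block structure in the data is assumed for demonstration.

```python
# Sketch of two-way (block) clustering: rows (cases) and columns (variables)
# are clustered simultaneously. Data and parameters are illustrative.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# A data matrix with a planted block structure: some groups of cases score
# high only on some groups of variables.
X = rng.random((12, 8))
X[:6, :4] += 3.0
X[6:, 4:] += 3.0

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
print(model.row_labels_)      # cluster assignment of the cases (rows)
print(model.column_labels_)   # cluster assignment of the variables (columns)
```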