Unsupervised learning - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Unsupervised learning & Cluster Analysis: Basic
Concepts and Algorithms
Assaf Gottlieb
Some of the slides are taken from Introduction to
Data Mining, by Tan, Steinbach, and Kumar
2
What is unsupervised learning / Cluster Analysis?
  • Learning without a priori knowledge about the
    classification of samples; learning without a
    teacher.
  • Kohonen (1995), Self-Organizing Maps
  • Cluster analysis is a set of methods for
    constructing a (hopefully) sensible and
    informative classification of an initially
    unclassified set of data, using the variable
    values observed on each individual.
  • B. S. Everitt (1998), The Cambridge Dictionary
    of Statistics

3
What do we cluster?
Features/Variables
Samples/Instances
4
Applications of Cluster Analysis
  • Understanding: group related documents for
    browsing, group genes and proteins that have
    similar functionality, or group stocks with
    similar price fluctuations
  • Data Exploration
  • Get insight into the data distribution
  • Understand patterns in the data
  • Summarization: reduce the size of large data
    sets; a preprocessing step

5
Objectives of Cluster Analysis
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

Competing objectives
Inter-cluster distances are maximized
Intra-cluster distances are minimized
6
Notion of a Cluster can be Ambiguous
Depends on resolution!
7
Prerequisites
  • Understand the nature of your problem, the type
    of features, etc.
  • The metric that you choose for similarity (for
    example, Euclidean distance or Pearson
    correlation) often impacts the clusters you
    recover.

8
Similarity/Distance measures
  • Euclidean Distance
  • Highly depends on the scale of features; may
    require normalization
  • City Block (Manhattan) distance; see the formulas below
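
For reference, the standard definitions of these two measures for feature
vectors x and y of length n are:

  d_{euc}(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }

  d_{city}(x, y) = \sum_{i=1}^{n} |x_i - y_i|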

9
(Figure omitted: pairs of expression profiles with d_euc = 0.5846,
d_euc = 1.1345, and d_euc = 2.6115.)
These examples of Euclidean distance match our
intuition of dissimilarity pretty well
10
(Figure omitted: two more pairs of profiles with d_euc = 1.41 and
d_euc = 1.22.)
But what about these? What might be going on
with the expression profiles on the left? On the
right?
11
Similarity/Distance measures
  • Cosine
  • Pearson Correlation
  • Invariant to scaling (Pearson also to addition)
  • Spearman correlation for ranks (see the formulas below)
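
The standard forms of the measures above, for vectors x and y with means
\bar{x} and \bar{y} (Spearman correlation is the Pearson formula applied
to the ranks of the values):

  \cos(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}

  r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
                 {\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}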

12
Similarity/Distance measures
  • Jaccard similarity
  • When interested in intersection size

(Venn diagram: sets X and Y, their union X ∪ Y and intersection X ∩ Y.)
  Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|
13
Types of Clusterings
  • Important distinction between hierarchical and
    partitional sets of clusters
  • Partitional Clustering
  • A division of data objects into non-overlapping
    subsets (clusters) such that each data object is
    in exactly one subset
  • Hierarchical clustering
  • A set of nested clusters organized as a
    hierarchical tree

14
Partitional Clustering
Original Points
15
Hierarchical Clustering
Dendrogram 1
Dendrogram 2
16
Other Distinctions Between Sets of Clustering Methods
  • Exclusive versus non-exclusive
  • In non-exclusive clusterings, points may belong
    to multiple clusters.
  • Can represent multiple classes or border points
  • Fuzzy versus non-fuzzy
  • In fuzzy clustering, a point belongs to every
    cluster with some weight between 0 and 1
  • Weights must sum to 1
  • Partial versus complete
  • In some cases, we only want to cluster some of
    the data
  • Heterogeneous versus homogeneous
  • Clusters of widely different sizes, shapes, and
    densities

17
Clustering Algorithms
  • Hierarchical clustering
  • K-means
  • Bi-clustering

18
Hierarchical Clustering
  • Produces a set of nested clusters organized as a
    hierarchical tree
  • Can be visualized as a dendrogram
  • A tree like diagram that records the sequences of
    merges or splits

19
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of
    clusters
  • Any desired number of clusters can be obtained by
    cutting the dendrogram at the proper level
  • They may correspond to meaningful taxonomies
  • Examples in the biological sciences (e.g., animal
    kingdom, phylogeny reconstruction, ...)

20
Hierarchical Clustering
  • Two main types of hierarchical clustering
  • Agglomerative (bottom up)
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters
    until only one cluster (or k clusters) left
  • Divisive (top down)
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster
    contains a point (or there are k clusters)
  • Traditional hierarchical algorithms use a
    similarity or distance matrix
  • Merge or split one cluster at a time

21
Agglomerative Clustering Algorithm
  • More popular hierarchical clustering technique
  • Basic algorithm is straightforward
  • Compute the proximity matrix
  • Let each data point be a cluster
  • Repeat
  • Merge the two closest clusters
  • Update the proximity matrix
  • Until only a single cluster remains
  • Key operation is the computation of the proximity
    of two clusters
  • Different approaches to defining the distance
    between clusters distinguish the different
    algorithms (a usage sketch follows below)
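
As a rough illustration (not part of the original slides), SciPy's
hierarchical-clustering routines implement exactly this agglomerative
procedure; the `method` argument selects how inter-cluster proximity is
defined, and the toy data below are made up:

```python
# Minimal sketch of agglomerative clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(20, 2)                 # 20 points in 2-D, purely for illustration

# Build the merge tree; 'method' chooses the inter-cluster proximity:
# 'single' (MIN), 'complete' (MAX), 'average' (group average), 'ward'.
Z = linkage(X, method='average', metric='euclidean')

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
dendrogram(Z)                                     # plot the sequence of merges
```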

22
Starting Situation
  • Start with clusters of individual points and a
    proximity matrix

Proximity Matrix
23
Intermediate Situation
  • After some merging steps, we have some clusters

(Figure: the current clusters C1-C5 and their proximity matrix.)
24
Intermediate Situation
  • We want to merge the two closest clusters (C2 and
    C5) and update the proximity matrix.

25
After Merging
  • The question is How do we update the proximity
    matrix?

(Proximity matrix after the merge: the row and the column for the new
cluster C2 ∪ C5 are marked "?", since those proximities must be
recomputed; the entries among C1, C3 and C4 are unchanged.)
26
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Ward's method (not discussed)

Proximity Matrix
27
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function
  • Ward's Method uses squared error

Proximity Matrix
30
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids

Proximity Matrix
31
Cluster Similarity: MIN or Single Link
  • Similarity of two clusters is based on the two
    most similar (closest) points in the different
    clusters
  • Determined by one pair of points, i.e., by one
    link in the proximity graph.

32
Hierarchical Clustering: MIN
Nested Clusters
Dendrogram
33
Strength of MIN
Original Points
  • Can handle non-elliptical shapes

34
Limitations of MIN
Original Points
  • Sensitive to noise and outliers

35
Cluster Similarity: MAX or Complete Linkage
  • Similarity of two clusters is based on the two
    least similar (most distant) points in the
    different clusters
  • Determined by all pairs of points in the two
    clusters

36
Hierarchical Clustering: MAX
Nested Clusters
Dendrogram
37
Strength of MAX
Original Points
  • Less susceptible to noise and outliers

38
Limitations of MAX
Original Points
  • Tends to break large clusters
  • Biased towards globular clusters

39
Cluster Similarity: Group Average
  • Proximity of two clusters is the average of
    pairwise proximities between points in the two
    clusters (see the formula below).
  • Need to use average connectivity for scalability
    since total proximity favors large clusters
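
Written out, the group-average proximity of clusters Ci and Cj is:

  proximity(C_i, C_j) = \frac{1}{|C_i|\,|C_j|}
      \sum_{p \in C_i} \sum_{q \in C_j} proximity(p, q)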

40
Hierarchical Clustering: Group Average
Nested Clusters
Dendrogram
41
Hierarchical Clustering: Group Average
  • Compromise between Single and Complete Link
  • Strengths
  • Less susceptible to noise and outliers
  • Limitations
  • Biased towards globular clusters

42
Cluster Similarity: Ward's Method
  • Similarity of two clusters is based on the
    increase in squared error when two clusters are
    merged
  • Similar to group average if distance between
    points is distance squared
  • Less susceptible to noise and outliers
  • Biased towards globular clusters
  • Hierarchical analogue of K-means
  • Can be used to initialize K-means

43
Hierarchical Clustering Comparison
MAX
MIN
Group Average
44
Hierarchical Clustering: Time and Space Requirements
  • O(N²) space, since it uses the proximity matrix
  • N is the number of points
  • O(N³) time in many cases
  • There are N steps, and at each step the N² proximity
    matrix must be updated and searched
  • Complexity can be reduced to O(N² log N) time
    for some approaches

45
Hierarchical Clustering: Problems and Limitations
  • Once a decision is made to combine two clusters,
    it cannot be undone
  • Different schemes have problems with one or more
    of the following:
  • Sensitivity to noise and outliers
  • Difficulty handling clusters of different sizes and
    convex shapes
  • Breaking large clusters (divisive)
  • The dendrogram corresponding to a given hierarchical
    clustering is not unique, since for each merge
    one needs to specify which subtree should go on
    the left and which on the right
  • They impose structure on the data, instead of
    revealing structure in the data
  • How many clusters? (some suggestions later)

46
K-means Clustering
  • Partitional clustering approach
  • Each cluster is associated with a centroid
    (center point)
  • Each point is assigned to the cluster with the
    closest centroid
  • Number of clusters, K, must be specified
  • The basic algorithm is very simple (see the sketch below)
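
A minimal NumPy sketch of this basic algorithm (Lloyd's iterations); the
function name, the random initialization and the stopping test below are
illustrative choices, not taken from the slides:

```python
# Minimal K-means sketch: assign points to the nearest centroid, then
# recompute centroids, until the centroids stop moving.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # choose k initial centroids at random from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids
```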

47
K-means Clustering Details
  • Initial centroids are often chosen randomly.
  • Clusters produced vary from one run to another.
  • The centroid is (typically) the mean of the
    points in the cluster.
  • Closeness is measured mostly by Euclidean
    distance, cosine similarity, correlation, etc.
  • K-means will converge for common similarity
    measures mentioned above.
  • Most of the convergence happens in the first few
    iterations.
  • Often the stopping condition is changed to "until
    relatively few points change clusters"
  • Complexity is O(n · K · I · d)
  • n = number of points, K = number of clusters,
    I = number of iterations, d = number of attributes

48
Evaluating K-means Clusters
  • The most common measure is the Sum of Squared Error (SSE)
  • For each point, the error is the distance to the
    nearest cluster centroid
  • To get the SSE, we square these errors and sum them
    (see the formula below)
  • x is a data point in cluster Ci and mi is the
    representative point for cluster Ci
  • One can show that mi corresponds to the center
    (mean) of the cluster
  • Given two clusterings, we can choose the one with
    the smallest error
  • One easy way to reduce SSE is to increase K, the
    number of clusters
  • A good clustering with a smaller K can have a
    lower SSE than a poor clustering with a higher K
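
In symbols, with x ranging over the points of cluster Ci and mi its
centroid:

  SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2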

49
Issues and Limitations for K-means
  • How to choose initial centers?
  • How to choose K?
  • How to handle Outliers?
  • Clusters different in
  • Shape
  • Density
  • Size
  • Assumes clusters are spherical in vector space
  • Sensitive to coordinate changes

50
Two different K-means Clusterings
Original Points
51
Importance of Choosing Initial Centroids
55
Solutions to Initial Centroids Problem
  • Multiple runs
  • Sample and use hierarchical clustering to
    determine initial centroids
  • Select more than k initial centroids and then
    select among these initial centroids
  • Select most widely separated
  • Bisecting K-means
  • Not as susceptible to initialization issues

56
Bisecting K-means
  • Bisecting K-means algorithm
  • Variant of K-means that can produce a partitional
    or a hierarchical clustering (a sketch follows below)
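
A hedged sketch of the bisecting idea (repeatedly split the cluster with
the largest SSE using 2-means until k clusters remain); it leans on
scikit-learn's KMeans, and the function and parameter names here are
illustrative rather than from the slides:

```python
# Bisecting K-means sketch: start with one all-inclusive cluster and
# keep bisecting the worst (largest-SSE) cluster with 2-means.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trial_splits=5):
    clusters = [np.arange(len(X))]                    # one all-inclusive cluster
    while len(clusters) < k:
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))       # split the cluster with largest SSE
        best_labels, best_sse = None, np.inf
        for _ in range(n_trial_splits):               # keep the best of several 2-means splits
            km = KMeans(n_clusters=2, n_init=1).fit(X[idx])
            if km.inertia_ < best_sse:
                best_labels, best_sse = km.labels_, km.inertia_
        clusters += [idx[best_labels == 0], idx[best_labels == 1]]
    return clusters                                   # list of index arrays, one per cluster
```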

57
Bisecting K-means Example
58
Issues and Limitations for K-means
  • How to choose initial centers?
  • How to choose K?
  • Depends on the problem; some suggestions later
  • How to handle Outliers?
  • Preprocessing
  • Clusters different in
  • Shape
  • Density
  • Size

59
Issues and Limitations for K-means
  • How to choose initial centers?
  • How to choose K?
  • How to handle Outliers?
  • Clusters different in
  • Shape
  • Density
  • Size

60
Limitations of K-means: Differing Sizes
Original Points
K-means (3 Clusters)
61
Limitations of K-means: Differing Density
K-means (3 Clusters)
Original Points
62
Limitations of K-means: Non-globular Shapes
Original Points
K-means (2 Clusters)
63
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts
of clusters, which then need to be put back together.
64
Overcoming K-means Limitations
Original Points K-means Clusters
65
Overcoming K-means Limitations
Original Points K-means Clusters
66
K-means
  • Pros
  • Simple
  • Fast for low dimensional data
  • It can find pure sub-clusters if a large number of
    clusters is specified
  • Cons
  • K-Means cannot handle non-globular data of
    different sizes and densities
  • K-Means will not identify outliers
  • K-Means is restricted to data which has the
    notion of a center (centroid)

67
Biclustering/Co-clustering
(Figure: an expression matrix of N genes × M conditions.)
  • Two genes can have similar expression patterns
    only under some conditions
  • Similarly, in two related conditions, some genes
    may exhibit different expression patterns

68
Biclustering
  • As a result, each cluster may involve only a
    subset of genes and a subset of conditions, which
    form a checkerboard structure

69
Biclustering
  • In general a hard task (NP-hard)
  • Heuristic algorithms, described briefly:
  • Cheng & Church: deletion of rows and columns;
    biclusters are discovered one at a time
  • Order-Preserving SubMatrices (OPSM), Ben-Dor et al.
  • Coupled Two-Way Clustering (Getz et al.)
  • Spectral Co-clustering

70
Cheng and Church
  • Objective function for heuristic methods (to
    minimize): the mean squared residue H(I, J)
    (see the formula below)
  • Greedy method
  • Initialization: the bicluster contains all rows
    and columns
  • Iteration:
  • Compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse
  • Remove the row or column that gives the maximum
    decrease of H
  • Termination: when no action will decrease H, or
    H < δ
  • Mask this bicluster and continue
  • Problem: removing trivial biclusters
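
The mean squared residue minimized by Cheng and Church, for a bicluster
with row set I and column set J (a_iJ and a_Ij are the row and column
means of the submatrix, a_IJ its overall mean):

  H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J}
            \left( a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} \right)^2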

71
Ben-Dor et al. (OPSM)
  • Model
  • For a condition set T and a gene g, the
    conditions in T can be ordered in a way so that
    the expression values are sorted in ascending
    order (suppose the values are all unique).
  • Submatrix A is a bicluster if there is an
    ordering (permutation) of T such that the
    expression values of all genes in G are sorted in
    ascending order.
  • Idea of the algorithm: grow partial models until
    they become complete models.

      t1   t2   t3   t4   t5
g1     7   13   19    2   50
g2    19   23   39    6   42
g3     4    6    8    2   10
Induced permutation: 2 3 4 1 5
72
Ben-Dor et al. (OPSM)
73
Getz et al. (CTWC)
  • Idea: repeatedly perform one-way clustering on
    genes/conditions.
  • Stable clusters of genes are used as the
    attributes for condition clustering, and vice
    versa.

74
Spectral Co-clustering
  • Main idea
  • Normalize along both dimensions
  • Form an m × n matrix and decompose it using SVD
  • Use k-means to cluster both types of data (see the
    example below)
  • http://adios.tau.ac.il/SpectralCoClustering/
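
For a concrete starting point, scikit-learn also provides a spectral
co-clustering implementation in the same spirit; the data shape and
n_clusters below are made up for illustration:

```python
# Illustrative spectral co-clustering of a toy gene-by-condition matrix.
import numpy as np
from sklearn.cluster import SpectralCoclustering

X = np.abs(np.random.rand(50, 12))        # 50 "genes" x 12 "conditions" (toy data)
model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(X)

row_clusters = model.row_labels_          # bicluster label of each gene
col_clusters = model.column_labels_       # bicluster label of each condition
```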

75
Evaluating cluster quality
  • Use known classes (pairwise F-measure, best-class
    F-measure)
  • Clusters can be evaluated with internal as well
    as external measures
  • Internal measures are related to inter/intra-cluster
    distances
  • External measures are related to how representative
    the current clusters are of the true classes

76
Inter/Intra Cluster Distances
  • Intra-cluster distance
  • (Sum/Min/Max/Avg) the (absolute/squared) distance
    between
  • All pairs of points in the cluster OR
  • Between the centroid and all points in the
    cluster OR
  • Between the medoid and all points in the
    cluster
  • Inter-cluster distance
  • Sum the (squared) distance between all pairs of
    clusters
  • Where distance between two clusters is defined
    as
  • distance between their centroids/medoids
  • (Spherical clusters)
  • Distance between the closest pair of points
    belonging to the clusters
  • (Chain shaped clusters)

77
Davies-Bouldin index
  • A function of the ratio of the sum of
    within-cluster (i.e. intra-cluster) scatter to
    between-cluster (i.e. inter-cluster) separation
  • Let C = {C1, ..., Ck} be a clustering of a set of N
    objects, with
    R_ij = (var(C_i) + var(C_j)) / d(c_i, c_j)
    and
    DB = (1/k) · Σ_{i=1..k} max_{j ≠ i} R_ij
  • where Ci is the i-th cluster, ci is the centroid of
    cluster i, and d(c_i, c_j) is the distance between
    the centroids

78
Davies-Bouldin index example
  • For example, for the clusters shown (figure omitted):
  • Compute:
  • var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33
  • The centroid is simply the mean here, so c1 = 3,
    c2 = 8.5, c3 = 18.33
  • So R12 = 1, R13 = 0.152, R23 = 0.797
  • Now compute:
  • R1 = 1 (max of R12 and R13), R2 = 1 (max of R21 and
    R23), R3 = 0.797 (max of R31 and R32)
  • Finally, compute:
  • DB = (1 + 1 + 0.797) / 3 = 0.932

79
Davies-Bouldin index example (ctd)
  • For example, for the clusters shown (figure omitted):
  • Compute:
  • Only 2 clusters here
  • var(C1) = 12.33, var(C2) = 2.33; c1 = 6.67,
    c2 = 18.33
  • R12 = 1.26
  • Now compute:
  • Since we have only 2 clusters here, R1 = R12 = 1.26
    and R2 = R21 = 1.26
  • Finally, compute:
  • DB = (1.26 + 1.26) / 2 = 1.26 (a library
    implementation is shown below)
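
For real data, a ready-made Davies-Bouldin score is available in
scikit-learn; note that its scatter term is the mean distance of points
to their centroid rather than the variance used in the toy examples
above, so absolute values may differ somewhat. The data below are
synthetic:

```python
# Computing the Davies-Bouldin index of a K-means clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.rand(200, 5)                       # toy data: 200 samples, 5 features
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower is better
```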

80
Other criteria
  • Dunn method
  • δ(Xi, Xj): inter-cluster distance between clusters
    Xi and Xj; Δ(Xk): intra-cluster distance of cluster
    Xk; the Dunn index is min_{i≠j} δ(Xi, Xj) / max_k Δ(Xk)
  • Silhouette method (an example follows below)
  • Identifying outliers
  • C-index
  • Compare the sum of distances S over all pairs from
    the same cluster against the corresponding sums of
    the smallest and of the largest pairwise distances
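
The silhouette method is likewise available off the shelf; a small
illustrative scan over candidate numbers of clusters, on synthetic data
(higher average silhouette is better):

```python
# Picking K by the average silhouette width.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 4)                        # synthetic data for illustration
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 4))
```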

81
Example dataset: AML/ALL dataset (Golub et al.)
  • Leukemia
  • 72 patients (samples)
  • 7129 genes
  • 4 groups
  • Two major types: ALL & AML
  • T & B cells in ALL
  • With/without treatment in AML

82
AML/ALL dataset
  • Davies-Bouldin index: C = 4
  • Dunn method: C = 2
  • Silhouette method: C = 2

83
Visual evaluation - coherency
84
Cluster quality example: do you see clusters?
(Figures of the two example datasets omitted; silhouette score per
number of clusters C:)

First dataset:
C    Silhouette
2    0.4922
3    0.5739
4    0.4773
5    0.4991
6    0.5404
7    0.541
8    0.5171
9    0.5956
10   0.6446

Second dataset:
C    Silhouette
2    0.4863
3    0.5762
4    0.5957
5    0.5351
6    0.5701
7    0.5487
8    0.5083
9    0.5311
10   0.5229
85
Kleinberg's Axioms
  • Scale Invariance
  • F(λd) = F(d) for all d and all strictly
    positive λ
  • Consistency
  • If d' equals d except for shrinking
    distances within clusters of F(d) or stretching
    between-cluster distances, then F(d') = F(d)
  • Richness
  • For any partition P of S, there exists a distance
    function d over S so that F(d) = P

86
Quality estimation
  • Gamma is the best-performing measure in
    Milligan's study of 30 internal criteria
    (Milligan, 1981)
  • Let d(+) denote the number of times that points
    which were clustered together in C had a distance
    greater than two points which were not in the
    same cluster
  • Let d(−) denote the opposite result
  • Gamma satisfies scale-invariance, consistency,
    richness, and isomorphism invariance

87
Dimensionality Reduction
  • Map points in a high-dimensional space to a lower
    number of dimensions
  • Preserve structure: pairwise distances, etc.
  • Useful for further processing:
  • Less computation, fewer parameters
  • Easier to understand, visualize

88
Dimensionality Reduction
  • Feature selection vs. Feature Extraction
  • Feature selection: select important features
  • Pros:
  • Meaningful features
  • Less work acquiring them
  • Unsupervised criteria:
  • Variance, fold change
  • UFF (unsupervised feature filtering)

89
Dimensionality Reduction
  • Feature Extraction
  • Transforms the entire feature set to a lower
    dimension
  • Pros:
  • Uses an objective function to select the best
    projection
  • Sometimes single features are not good enough
  • Unsupervised methods:
  • PCA, SVD

90
Principal Components Analysis (PCA)
  • Approximating a high-dimensional data set with a
    lower-dimensional linear subspace

91
Singular Value Decomposition
92
Principal Components Analysis (PCA)
  • Rules of thumb for selecting the number of
    components (illustrated below):
  • Knee ("elbow") in the scree plot
  • Cumulative percentage of variance explained
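
Both rules of thumb can be read off scikit-learn's PCA output; a short
illustrative sketch (the 90% threshold and the toy data are arbitrary
choices):

```python
# Scree-plot values and cumulative explained variance for choosing
# the number of principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 20)                     # toy data: 100 samples, 20 features
pca = PCA().fit(X)

explained = pca.explained_variance_ratio_       # per-component variance (scree plot)
cumulative = np.cumsum(explained)               # cumulative percentage variance
n_components = int(np.searchsorted(cumulative, 0.90) + 1)  # e.g. keep 90% of variance
print(explained, cumulative, n_components)
```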

93
Tools for clustering
  • Matlab COMPACT
  • http://adios.tau.ac.il/compact/

95
Tools for clustering
  • Cluster & TreeView (Eisen et al.)
  • http://rana.lbl.gov/eisen/?page_id=42

96
Summary
  • Clustering is ill-defined and considered an art
  • In practice, this means you need to:
  • Understand your data beforehand
  • Know how to interpret the clusters afterwards
  • The problem determines the best solution (which
    measure, which clustering algorithm); try to
    experiment with different options.