Title: Clustering
1. Clustering
(an ill-defined problem)
- Introduction
- Preprocessing: dimensional reduction with SVD
- Clustering methods: K-means, FCM
- Hierarchical methods
- Model-based methods (at the end)
- Competitive NN (SOM) (not shown here)
- SVC, QC
- Applications
- COMPACT
2. What Is Clustering?
- Clustering is partitioning of data into meaningful (?) groups called clusters.
- Cluster: a collection of objects that are similar to one another. But what is "similar"?
- Unsupervised learning: no predefined classes. Not classification!
- Why? To help understand the natural grouping or structure in a data set.
- When? Used either as
  - a stand-alone tool to get insight into the data distribution, or
  - as a preprocessing step for other algorithms, e.g., to discover classes.
3. Clustering Applications
- Operations research
  - Facility location problem: locate fire stations so as to minimize the maximum/average distance a fire truck must travel.
- Signal processing
  - Vector quantization: transmit large files (e.g., video, speech) by computing quantizers.
- Astronomy
  - SkyCat: clustered 2x10^9 sky objects into stars, galaxies, quasars, etc., based on radiation emitted in different spectral bands.
4. Clustering Applications
- Marketing
  - Segmentation of customers for target marketing.
  - Segmentation of customers based on online clickstream data.
- Web
  - To discover categories of content.
  - Search results.
- Bioinformatics
  - Gene expression
    - Finding groups of individuals (sick vs. healthy)
    - Finding groups of genes
  - Motif search.
- In practice, clustering is one of the most widely used data mining techniques:
  - Association rule algorithms produce too many rules.
  - Other machine learning algorithms require labeled data.
5. Points / Metric Space
- Points could be in R^d, {0,1}^d, ...
- Metric space: dist(x,y) is a distance metric if
  - Reflexive: dist(x,y) = 0 iff x = y
  - Symmetric: dist(x,y) = dist(y,x)
  - Triangle inequality: dist(x,y) ≤ dist(x,z) + dist(z,y)
6. Example of Distance Metrics
- The distance between x = (x1, ..., xn) and y = (y1, ..., yn) is:
  - L2 norm (Euclidean distance)
  - Manhattan distance (L1 norm)
- Documents: cosine measure
  - A similarity measure:
    - more similar → close to 1
    - less similar → close to 0
  - Not a metric space, but 1 − cos θ is.
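
As a concrete reference for the measures listed above, here is a minimal Python sketch (not part of the original slides; the function names are my own) of the L2 norm, the Manhattan (L1) distance, and the cosine similarity together with the 1 − cos θ dissimilarity:

```python
import numpy as np

def l2_distance(x, y):
    """Euclidean (L2) distance."""
    return np.sqrt(np.sum((x - y) ** 2))

def l1_distance(x, y):
    """Manhattan (L1) distance."""
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y: close to 1 means more similar."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_dissimilarity(x, y):
    """1 - cos(theta), the dissimilarity the slides use for documents."""
    return 1.0 - cosine_similarity(x, y)

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, 1.5])
print(l2_distance(x, y), l1_distance(x, y), cosine_dissimilarity(x, y))
```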
7. Correlation
- We might care more about the overall shape of expression profiles rather than the actual magnitudes.
- That is, we might want to consider genes similar when they are up and down together.
- When might we want this kind of measure? What experimental issues might make this appropriate?
8. Pearson Linear Correlation
- We're shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean 0 and std 1).
9. Pearson Linear Correlation
- Pearson linear correlation (PLC) is a measure that is invariant to scaling and shifting (vertically) of the expression values.
- Always between −1 and 1 (perfectly anti-correlated and perfectly correlated).
- This is a similarity measure, but we can easily make it into a dissimilarity measure.
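
A minimal sketch of PLC as a dissimilarity (illustrative; the mapping dp = (1 − ρ)/2 is consistent with the dp value quoted on the next slide):

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson linear correlation: subtract means, scale by std devs, average the product."""
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    return np.mean(xs * ys)

def pearson_dissimilarity(x, y):
    """Map the similarity rho in [-1, 1] to a dissimilarity dp in [0, 1]."""
    return (1.0 - pearson_correlation(x, y)) / 2.0
```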
10. PLC (cont.)
- PLC only measures the degree of a linear relationship between two expression profiles!
- If you want to measure other relationships, there are many other possible measures (see the Jagota book and project 3 for more examples).
Here ρ ≈ 0.0249, so dp ≈ 0.4876: the green curve is the square of the blue curve, and this relationship is not captured by PLC.
11. More Correlation Examples
What do you think the correlation is here? Is this what we want?
How about here? Is this what we want?
12. Missing Values
- A common problem with microarray data.
- One approach with Euclidean distance or PLC is just to ignore missing values (i.e., pretend the data has fewer dimensions).
- There are more sophisticated approaches that use information such as continuity of a time series or related genes to estimate missing values; better to use these if possible.
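
One way to implement the "ignore missing values" idea with Euclidean distance (a sketch; the rescaling to the full dimensionality is a common convention, not something the slide prescribes):

```python
import numpy as np

def euclidean_ignore_missing(x, y):
    """Euclidean distance over the dimensions observed in both profiles.

    Missing values are encoded as NaN.  The squared distance is computed over
    the observed dimensions only and rescaled to the full dimensionality so
    that profiles with different numbers of missing values stay comparable.
    """
    mask = ~(np.isnan(x) | np.isnan(y))
    if not mask.any():
        return np.nan  # no overlapping observations
    d2 = np.sum((x[mask] - y[mask]) ** 2)
    return np.sqrt(d2 * len(x) / mask.sum())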
13. Preprocessing
- For methods that are not applicable in very high dimensions, you may want to apply:
  - Dimensional reduction, e.g., consider the first few SVD components (truncate S at r dimensions) and use the remaining values of the U or V matrices.
  - Dimensional reduction + normalization: after applying dimensional reduction, normalize all resulting vectors to unit length (i.e., consider angles as proximity measures).
  - Feature selection, e.g., consider only features that have large variance. More on feature selection in the future.
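
A small sketch of the SVD-based preprocessing (illustrative; using the rows of U weighted by the leading singular values is one of the variants the slide mentions):

```python
import numpy as np

def svd_preprocess(X, r, normalize=True):
    """Project the rows of X onto the first r SVD components.

    X is an (n_samples x n_features) matrix.  S is truncated at r dimensions
    and the rows of U, weighted by the singular values, serve as the reduced
    representation; optionally each row is normalized to unit length so that
    angles can be used as proximity measures.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :r] * s[:r]          # coordinates in the r leading components
    if normalize:
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        Z = Z / np.where(norms == 0, 1.0, norms)
    return Z
```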
14. Clustering Types
- Exclusive vs. overlapping clustering
- Hierarchical vs. global clustering
- Formal vs. heuristic clustering

First two examples:
- K-means: exclusive, global, heuristic
- FCM (fuzzy c-means): overlapping, global, heuristic
15. Two classes of data, described by (o) and (). The objective is to reproduce the two classes by K = 2 clustering.
16. Place two cluster centres (x) at random. Assign each data point ( and o) to the nearest cluster centre (x).
17. Compute the new centre of each class and move the crosses (x).
18. Iteration 2
19. Iteration 3
20. Iteration 4 (then stop, because there is no visible change). Each data point belongs to the cluster defined by the nearest centre.
21. The membership matrix M:

M =
  0.0000  1.0000
  0.0000  1.0000
  0.0000  1.0000
  0.0000  1.0000
  0.0000  1.0000
  1.0000  0.0000
  1.0000  0.0000
  1.0000  0.0000
  1.0000  0.0000
  1.0000  0.0000

- The last five data points (rows) belong to the first cluster (column).
- The first five data points (rows) belong to the second cluster (column).
22. Membership Matrix M
The membership formula on the slide is annotated with: cluster centre i, cluster centre j, data point k, distance.
Results of K-means depend on the starting point of the algorithm. Repeat it several times to get a better feeling for whether the results are meaningful.
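
The walkthrough in slides 15-20 can be summarized in a short sketch. This is an illustrative implementation, not the code behind the figures; as the slide notes, the result depends on the random starting centres, so it is worth repeating with several seeds:

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Plain K-means: random initial centres, then alternate assignment and update."""
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # k centres at random data points
    for _ in range(n_iter):
        # assign each data point to the nearest cluster centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # compute the new centre of each class (keep the old one if a cluster is empty)
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):
            break  # stop when the centres no longer move
        centres = new_centres
    return labels, centres
```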
23. c-Partition
- All clusters C together fill the whole universe U.
- Clusters do not overlap.
- A cluster C is never empty, and it is smaller than the whole universe U.
- There must be at least 2 clusters in a c-partition and at most as many as the number of data points K.
24. Objective Function
Minimise the total sum of all distances.
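
In the usual c-means formulation (a reconstruction of the formula the slide refers to, not copied from it), the quantity being minimised is the total within-cluster squared distance; FCM weights each term by a fuzzy membership raised to a fuzzifier m > 1:

```latex
% Hard c-means (K-means) objective: total within-cluster squared distance
J \;=\; \sum_{i=1}^{c} \sum_{x_k \in C_i} \lVert x_k - c_i \rVert^2
% Fuzzy c-means generalisation, with memberships \mu_{ik} and fuzzifier m > 1
J_m \;=\; \sum_{i=1}^{c} \sum_{k=1}^{K} \mu_{ik}^{\,m}\, \lVert x_k - c_i \rVert^2
```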
25. Algorithm: fuzzy c-means (FCM)
26. Each data point belongs to two clusters to different degrees.
27. Place two cluster centres. Assign a fuzzy membership to each data point depending on distance.
28. Compute the new centre of each class and move the crosses (x).
29. Iteration 2
30. Iteration 5
31. Iteration 10
32. Iteration 13 (then stop, because there is no visible change). Each data point belongs to the two clusters to a degree.
33. The membership matrix M:

M =
  0.0025  0.9975
  0.0091  0.9909
  0.0129  0.9871
  0.0001  0.9999
  0.0107  0.9893
  0.9393  0.0607
  0.9638  0.0362
  0.9574  0.0426
  0.9906  0.0094
  0.9807  0.0193

- The last five data points (rows) belong mostly to the first cluster (column).
- The first five data points (rows) belong mostly to the second cluster (column).
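
A compact sketch of the FCM iteration shown in slides 26-33 (illustrative only; the update rules are the standard ones with fuzzifier m = 2, and the variable names are my own):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, rng=None):
    """Fuzzy c-means: every point gets a degree of membership in every cluster."""
    rng = np.random.default_rng(rng)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)                    # memberships of each point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                            # avoid division by zero at a centre
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)     # standard FCM membership update
        if np.abs(U_new - U).max() < tol:
            break
        U = U_new
    return U, centres
```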
34. Hard Classifier (HCM)
A cell belongs to either one class or the other, as defined by a colour.
35. Fuzzy Classifier (FCM)
A cell can belong to several classes to a degree, i.e., one column may have several colours.
36. Hierarchical Clustering
- Greedy
- Agglomerative vs. divisive
- Dendrograms allow us to visualize the result
  - the visualization is not unique!
- Tends to be sensitive to small changes in the data
- Provides clusters of every size; where to cut is user-determined
- Large storage demand
- Running time: O(n^2 · levels) = O(n^3)
- Depends on the distance measure and linkage method
37. Hierarchical Agglomerative Clustering
- We start with every data point in a separate cluster.
- We keep merging the most similar pairs of data points/clusters until we have one big cluster left.
- This is called a bottom-up or agglomerative method.
38. Hierarchical Clustering (cont.)
- This produces a binary tree or dendrogram.
- The final cluster is the root and each data item is a leaf.
- The height of the bars indicates how close the items are.
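
For reference, a minimal SciPy sketch of the agglomerative procedure and its dendrogram (the choice of Euclidean distance and average linkage is an assumption for illustration; the slides leave both open):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

# Toy data standing in for expression profiles (rows = items to cluster)
X = np.random.default_rng(0).normal(size=(20, 5))

D = pdist(X, metric='euclidean')     # condensed pairwise distance matrix
Z = linkage(D, method='average')     # bottom-up merging; 'average' is one linkage choice
dendrogram(Z)                        # binary tree: leaves are items, bar height = merge distance
plt.show()

# Cutting the tree (at a user-chosen level) yields flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
```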
39. Hierarchical Clustering Demo
40. Hierarchical Clustering Issues
- Distinct clusters are not produced; sometimes this can be good, if the data has a hierarchical structure without clear boundaries.
- There are methods for producing distinct clusters, but these usually involve specifying somewhat arbitrary cutoff values.
- What if the data doesn't have a hierarchical structure? Is HC appropriate?
41. Support Vector Clustering
- Given points x in data space, define images in Hilbert space.
- Require all images to be enclosed by a minimal sphere in Hilbert space.
- Reflection of this sphere in data space defines the cluster boundaries.
- Two parameters: the width of the Gaussian kernel and the fraction of outliers.

Ben-Hur, Horn, Siegelmann & Vapnik, JMLR 2 (2001) 125-137.
42. (No Transcript)
43. (No Transcript)
44. (No Transcript)
45. Variation of q allows for clustering solutions on various scales.
q = 1, 20, 24, 48
46. (No Transcript)
47. An example that allows for SV clustering only in the presence of outliers.
Procedure: limit β < C = 1/(pN), where p = fraction of assumed outliers in the data.
q = 3.5, p = 0
q = 1, p = 0.3
48. Similarity to the scale-space approach for high values of q and p. Probability distribution obtained from R(x).
q = 4.8, p = 0.7
49. From Scale Space to Quantum Clustering
- Parzen window approach: estimate the probability density by kernel functions (Gaussians) located at the data points.
σ = 1/√(2q)
50. Quantum Clustering
- View the Parzen estimator ψ as the solution of the Schrödinger equation, with the potential V(x) responsible for attraction to cluster centers and the Laplacian (kinetic) term causing the spread.
- Find V(x).

Horn and Gottlieb, Phys. Rev. Lett. 88 (2002) 018702.
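
For reference, the construction described here (following the general form of the equations in Horn and Gottlieb, 2002; treat the details as a sketch rather than a transcription of the slide) is:

```latex
% Parzen-window estimator built from the data points x_i
\psi(\mathbf{x}) \;=\; \sum_i e^{-(\mathbf{x}-\mathbf{x}_i)^2 / 2\sigma^2}
% psi is required to be the ground state of the Schrodinger equation
H\psi \;\equiv\; \Bigl(-\tfrac{\sigma^2}{2}\nabla^2 + V(\mathbf{x})\Bigr)\psi \;=\; E\psi
% which determines the potential, up to the constant E (fixed by min V = 0)
V(\mathbf{x}) \;=\; E + \frac{\sigma^2}{2}\,\frac{\nabla^2\psi}{\psi}
            \;=\; E - \frac{d}{2}
              + \frac{1}{2\sigma^2\,\psi(\mathbf{x})}\sum_i (\mathbf{x}-\mathbf{x}_i)^2\,
                e^{-(\mathbf{x}-\mathbf{x}_i)^2/2\sigma^2}
```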
51. The Crabs Example (from Ripley's textbook): 4 classes, 50 samples each, d = 5.
A topographic map of the probability distribution for the crab data set with σ = 1/√2, using principal components 2 and 3. There exists only one maximum.
52. The Crabs Example: the QC potential exhibits four minima, identified with cluster centers.
A topographic map of the potential for the crab data set with σ = 1/√2, using principal components 2 and 3. The four minima are denoted by crossed circles. The contours are set at values V = cE for c = 0.2, ..., 1.
53. The Crabs Example (cont'd)
A three-dimensional plot of the potential for the crab data set with σ = 1/√3, using principal components 2 and 3.
54. The Crabs Example (cont'd)
A three-dimensional plot of the potential for the crab data set with σ = 1/2, using principal components 2 and 3.
55. Properties of V and E
E is chosen so that min(V) = 0. E sets the scale of structure observed in V(x). The single-point case corresponds to the harmonic potential. In general:
56. Identifying Clusters
- Local minima of the potential are identified with cluster centers.
- Data points are assigned to clusters according to:
  - minimal distance from the centers, or
  - sliding points down the slopes of the potential with gradient descent until they reach the centers.
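
A rough Python sketch of the second assignment option, gradient descent on the QC potential (this is not the authors' code; it uses the sum-of-Gaussians form of V up to additive constants, and a simple fixed-step numerical gradient):

```python
import numpy as np

def qc_potential(x, data, sigma):
    """Quantum-clustering potential V(x), up to additive constants.

    For a sum-of-Gaussians psi, (sigma^2/2) * laplacian(psi)/psi reduces to a
    weighted mean of squared distances minus d/2; the omitted constants do not
    change the positions of the minima or the descent dynamics.
    """
    d2 = np.sum((data - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return np.sum(w * d2) / (2 * sigma ** 2 * np.sum(w))

def descend(point, data, sigma, step=0.1, n_steps=200, eps=1e-4):
    """Slide a point down the slopes of V with a central-difference gradient."""
    x = point.astype(float).copy()
    for _ in range(n_steps):
        grad = np.array([
            (qc_potential(x + eps * e, data, sigma)
             - qc_potential(x - eps * e, data, sigma)) / (2 * eps)
            for e in np.eye(len(x))
        ])
        x -= step * grad
    return x

# Points that converge to (approximately) the same location share a cluster centre.
```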
57. The Iris Example: 3 classes, each containing 50 samples, d = 4.
A topographic map of the potential for the iris data set with σ = 0.25, using principal components 1 and 2. The three minima are denoted by crossed circles. The contours are set at values V = cE for c = 0.2, ..., 1.
58. The Iris Example: Gradient Descent Dynamics
59. The Iris Example: Using Raw Data in 4D
There are only 5 misclassifications. σ = 0.21.
60. Example: Yeast Cell Cycle
Yeast cell cycle data were studied by several groups who have applied SVD (Spellman et al., Molecular Biology of the Cell 9, Dec. 1998). We use it to test clustering of genes whose classification into groups was investigated by Spellman et al. The gene/sample matrix that we start from has dimensions 798x72, using the same selection as made by Shamir and Sharan (2002). We truncate it to r = 4 and obtain, once again, our best results for σ = 0.5, where four clusters follow from the QC algorithm.
61. Example: Yeast Cell Cycle
The five gene families as represented in two coordinates of our r = 4 dimensional space.
62. Example: Yeast Cell Cycle
Cluster assignments of genes for QC with σ = 0.46, as compared to the classification by Spellman into five classes, shown as alternating gray and white areas.
63. Yeast cell cycle in normalized 2 dimensions.
64. Hierarchical Quantum Clustering (HQC)
- Start with the raw data matrix containing gene expression profiles of the samples.
- Apply SVD and truncate to r-space by selecting the first r significant eigenvectors.
- Apply QC in r dimensions, starting at a small scale σ and obtaining many clusters. Move the data points to their cluster centers and reiterate the process at higher σ. This produces a hierarchical clustering that can be represented by a dendrogram.
65. Example: Clustering of Human Cancer Cells
The NCI60 set is a gene expression profile of 8,000 genes in 60 human cancer cell lines. NCI60 includes cell lines derived from cancers of colorectal, renal, ovarian, breast, prostate, lung and central nervous system origin, as well as leukemias and melanomas. After application of selective filters, the number of gene spots is reduced to a 1,376-gene subset (Scherf et al., Nature Genetics 24, 2000). We applied HQC with r = 5 dimensions.
66. Example: Clustering of Human Cancer Cells
Dendrogram of the 60 cancer cell samples. The clustering was done in 5 truncated dimensions. The first 2 letters in each sample represent the tissue/cancer type.
67. Example: Projection onto the Unit Sphere
Representation of data of four classes of cancer cells in two dimensions of the truncated space. The circles denote the locations of the data points before this normalization was applied.
68. COMPACT: A Comparative Package for Clustering Assessment
- COMPACT is a GUI Matlab tool that enables an easy and intuitive way to compare several clustering methods.
- COMPACT is a five-step wizard that contains basic Matlab clustering methods as well as the quantum clustering algorithm. It provides a flexible and customizable interface for clustering data with high dimensionality.
- COMPACT allows both textual and graphical display of the clustering results.
69. How to Install?
- COMPACT is a self-extracting package. In order to install and run the GUI tool, follow these three easy steps:
  1. Download the COMPACT.zip package to your local drive.
  2. Add the COMPACT destination directory to your Matlab path.
  3. Within Matlab, type "compact" at the command prompt.
70. Step 1
71. Step 1
72. Step 2
- Determining the matrix shape and the vectors to cluster
73. Step 3
- Preprocessing procedures
- Component variance graphs
- Preprocessing parameters
74. Step 4
- Points distribution preview and clustering method selection
75. Step 5
- Parameters for the clustering algorithms: K-means
76. Step 5
- Parameters for the clustering algorithms: FCM
77. Step 5
- Parameters for the clustering algorithms: NN
78. Step 5
- Parameters for the clustering algorithms: QC
79. Step 6
80. Step 6: Results
81. Clustering Methods: Model-Based
- Data are generated from a mixture of underlying probability distributions.
82. Some Examples
- Two univariate normal components
- Equal proportions
- Common variance σ² = 1
83.
- Two univariate normal components
- Proportions 0.75 and 0.25
- Common variance σ² = 1
84. ... and some more
85. Probability Models
- Classification likelihood
- θk: the set of parameters of cluster k
- τk: the probability that an observation belongs to cluster k (Σk τk = 1)
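
In the standard model-based clustering notation that these symbols appear to follow (a reconstruction, since the slide's formulas did not survive the transcript), the two likelihoods are:

```latex
% Mixture likelihood: tau_k is the probability that an observation belongs to cluster k
L_M(\theta_1,\dots,\theta_G;\tau_1,\dots,\tau_G \mid \mathbf{x})
   \;=\; \prod_{i=1}^{n} \sum_{k=1}^{G} \tau_k\, f_k(\mathbf{x}_i \mid \theta_k),
\qquad \sum_{k=1}^{G}\tau_k = 1,\ \tau_k \ge 0
% Classification likelihood: each observation i carries a single label gamma_i
L_C(\theta_1,\dots,\theta_G;\gamma_1,\dots,\gamma_n \mid \mathbf{x})
   \;=\; \prod_{i=1}^{n} f_{\gamma_i}(\mathbf{x}_i \mid \theta_{\gamma_i})
```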
86. Probability Models (cont.)
- Most used: the multivariate normal distribution
- θk has a mean vector µk and a covariance matrix Σk
- How is the covariance matrix Σk calculated?
87. Calculating the Covariance Matrix Σk
- The idea: parameterize the covariance matrix.
- Dk: orthogonal matrix of eigenvectors
  - Determines the orientation of the principal components of Σk
- Ak: diagonal matrix whose elements are proportional to the eigenvalues of Σk
  - Determines the shape of the density contours
- λk: scalar
  - Determines the volume of the corresponding ellipsoid
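
Written out, the parameterization described above is the usual eigenvalue decomposition of the cluster covariance (Banfield-Raftery style):

```latex
% lambda_k: volume, D_k: orientation (eigenvectors), A_k: shape (scaled eigenvalues)
\Sigma_k \;=\; \lambda_k\, D_k\, A_k\, D_k^{\mathsf{T}}
```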
88. The Definition of Σk Determines the Model
- Spherical, equal (sum-of-squares criterion)
89. How Is θk Computed? The EM Algorithm
- The complete-data log-likelihood (with the unobserved cluster labels zi).
- The density of an observation xi given zi.
- ẑik is the conditional expectation of zik given xi and θ1, ..., θG.
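
A bare-bones sketch of the EM iteration for a Gaussian mixture (illustrative only, not the implementation behind the slides; the small ridge added to each covariance is one way to postpone the singular-covariance failure discussed on the next slide):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, G, n_iter=100, seed=0):
    """A minimal EM loop for a Gaussian mixture with full covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=G, replace=False)]         # initial means: random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * G)
    tau = np.full(G, 1.0 / G)                             # mixing proportions
    for _ in range(n_iter):
        # E-step: z_hat[i, k] = conditional expectation of z_ik given x_i and the parameters
        dens = np.column_stack([
            tau[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(G)
        ])
        z_hat = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate tau_k, mu_k, Sigma_k from the responsibility-weighted data
        Nk = z_hat.sum(axis=0)
        tau = Nk / n
        mu = (z_hat.T @ X) / Nk[:, None]
        for k in range(G):
            Xc = X - mu[k]
            Sigma[k] = (z_hat[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return tau, mu, Sigma, z_hat
```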
90. (No Transcript)
91. Limitations of the EM Algorithm
- Low rate of convergence.
- You should start with good starting points and hope for separable clusters.
- Not practical for a large number of clusters (probabilities).
- "Crashes" when the covariance matrix becomes singular.
- Problems when there are few observations in a cluster.
- EM must not get more clusters than exist in nature.