Title: Clustering
1. Clustering
(an ill-defined problem)
- Introduction
- Preprocessing: dimensional reduction with SVD
- Clustering methods: K-means, FCM
- Hierarchical methods
- Model-based methods (at the end)
- Competitive NN (SOM) (not shown here)
- SVC, QC
- Applications
- COMPACT
2. What Is Clustering?
- Clustering is partitioning of data into meaningful (?) groups called clusters.
- Cluster: a collection of objects that are similar to one another. But what is "similar"?
- Unsupervised learning: no predefined classes. Not classification!
- Why? To help understand the natural grouping or structure in a data set.
- When? Used either as
  - a stand-alone tool to get insight into the data distribution, or
  - as a preprocessing step for other algorithms, e.g., to discover classes.
3. Clustering Applications
- Operations research
  - Facility location problem: locate fire stations so as to minimize the maximum/average distance a fire truck must travel.
- Signal processing
  - Vector quantization: transmit large files (e.g., video, speech) by computing quantizers.
- Astronomy
  - SkyCat: clustered 2x10^9 sky objects into stars, galaxies, quasars, etc., based on radiation emitted in different spectral bands.
4. Clustering Applications
- Marketing
  - Segmentation of customers for target marketing.
  - Segmentation of customers based on online clickstream data.
- Web
  - To discover categories of content.
  - Search results.
- Bioinformatics
  - Gene expression
    - Finding groups of individuals (sick vs. healthy)
    - Finding groups of genes
  - Motif search.
- In practice, clustering is one of the most widely used data mining techniques:
  - Association rule algorithms produce too many rules.
  - Other machine learning algorithms require labeled data.
5. Points / Metric Space
- Points could be in R^d, {0,1}^d, ...
- Metric space: dist(x,y) is a distance metric if
  - Reflexive: dist(x,y) = 0 iff x = y
  - Symmetric: dist(x,y) = dist(y,x)
  - Triangle inequality: dist(x,y) ≤ dist(x,z) + dist(z,y)
6. Example of Distance Metrics
- The distance between x = (x1, ..., xn) and y = (y1, ..., yn) is:
  - L2 norm (Euclidean distance)
  - Manhattan distance (L1 norm)
- Documents: cosine measure
  - A similarity measure:
    - more similar → close to 1
    - less similar → close to 0
  - Not a metric space, but 1 − cos θ is.
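
As a concrete reference for the measures listed above, here is a minimal Python sketch (not part of the original slides; the function names are my own) of the L2 norm, the Manhattan (L1) distance, and the cosine similarity together with the 1 − cos θ dissimilarity:

```python
import numpy as np

def l2_distance(x, y):
    """Euclidean (L2) distance."""
    return np.sqrt(np.sum((x - y) ** 2))

def l1_distance(x, y):
    """Manhattan (L1) distance."""
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y: close to 1 means more similar."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_dissimilarity(x, y):
    """1 - cos(theta), the dissimilarity the slides use for documents."""
    return 1.0 - cosine_similarity(x, y)

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, 1.5])
print(l2_distance(x, y), l1_distance(x, y), cosine_dissimilarity(x, y))
```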
7. Correlation
- We might care more about the overall shape of expression profiles rather than the actual magnitudes.
- That is, we might want to consider genes similar when they are up and down together.
- When might we want this kind of measure? What experimental issues might make this appropriate?
8. Pearson Linear Correlation
- We're shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean 0 and std 1).
9. Pearson Linear Correlation
- Pearson linear correlation (PLC) is a measure that is invariant to scaling and shifting (vertically) of the expression values.
- Always between −1 and 1 (perfectly anti-correlated and perfectly correlated).
- This is a similarity measure, but we can easily make it into a dissimilarity measure.
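
A minimal sketch of PLC as a dissimilarity (illustrative; the mapping dp = (1 − ρ)/2 is consistent with the dp value quoted on the next slide):

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson linear correlation: subtract means, scale by std devs, average the product."""
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    return np.mean(xs * ys)

def pearson_dissimilarity(x, y):
    """Map the similarity rho in [-1, 1] to a dissimilarity dp in [0, 1]."""
    return (1.0 - pearson_correlation(x, y)) / 2.0
```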
10. PLC (cont.)
- PLC only measures the degree of a linear relationship between two expression profiles!
- If you want to measure other relationships, there are many other possible measures (see the Jagota book and project 3 for more examples).
Here ρ ≈ 0.0249, so dp ≈ 0.4876: the green curve is the square of the blue curve, and this relationship is not captured by PLC.
11. More Correlation Examples
What do you think the correlation is here? Is this what we want?
How about here? Is this what we want?
12. Missing Values
- A common problem with microarray data.
- One approach with Euclidean distance or PLC is just to ignore missing values (i.e., pretend the data has fewer dimensions).
- There are more sophisticated approaches that use information such as continuity of a time series or related genes to estimate missing values; better to use these if possible.
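
One way to implement the "ignore missing values" idea with Euclidean distance (a sketch; the rescaling to the full dimensionality is a common convention, not something the slide prescribes):

```python
import numpy as np

def euclidean_ignore_missing(x, y):
    """Euclidean distance over the dimensions observed in both profiles.

    Missing values are encoded as NaN.  The squared distance is computed over
    the observed dimensions only and rescaled to the full dimensionality so
    that profiles with different numbers of missing values stay comparable.
    """
    mask = ~(np.isnan(x) | np.isnan(y))
    if not mask.any():
        return np.nan  # no overlapping observations
    d2 = np.sum((x[mask] - y[mask]) ** 2)
    return np.sqrt(d2 * len(x) / mask.sum())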
13. Preprocessing
- For methods that are not applicable in very high dimensions, you may want to apply:
  - Dimensional reduction, e.g., consider the first few SVD components (truncate S at r dimensions) and use the remaining values of the U or V matrices.
  - Dimensional reduction + normalization: after applying dimensional reduction, normalize all resulting vectors to unit length (i.e., consider angles as proximity measures).
  - Feature selection, e.g., consider only features that have large variance. More on feature selection in the future.
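
A small sketch of the SVD-based preprocessing (illustrative; using the rows of U weighted by the leading singular values is one of the variants the slide mentions):

```python
import numpy as np

def svd_preprocess(X, r, normalize=True):
    """Project the rows of X onto the first r SVD components.

    X is an (n_samples x n_features) matrix.  S is truncated at r dimensions
    and the rows of U, weighted by the singular values, serve as the reduced
    representation; optionally each row is normalized to unit length so that
    angles can be used as proximity measures.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :r] * s[:r]          # coordinates in the r leading components
    if normalize:
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        Z = Z / np.where(norms == 0, 1.0, norms)
    return Z
```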
14. Clustering Types
- Exclusive vs. overlapping clustering
- Hierarchical vs. global clustering
- Formal vs. heuristic clustering

First two examples:
- K-means: exclusive, global, heuristic
- FCM (fuzzy c-means): overlapping, global, heuristic
15. Two classes of data, described by (o) and (). The objective is to reproduce the two classes by K = 2 clustering.
16. Place two cluster centres (x) at random. Assign each data point ( and o) to the nearest cluster centre (x).
17. Compute the new centre of each class and move the crosses (x).
18. Iteration 2
19. Iteration 3
20. Iteration 4 (then stop, because there is no visible change). Each data point belongs to the cluster defined by the nearest centre.
21. The membership matrix M:

M =
  0.0000  1.0000
  0.0000  1.0000
  0.0000  1.0000
  0.0000  1.0000
  0.0000  1.0000
  1.0000  0.0000
  1.0000  0.0000
  1.0000  0.0000
  1.0000  0.0000
  1.0000  0.0000

- The last five data points (rows) belong to the first cluster (column).
- The first five data points (rows) belong to the second cluster (column).
22. Membership Matrix M
The membership formula on the slide is annotated with: cluster centre i, cluster centre j, data point k, distance.
Results of K-means depend on the starting point of the algorithm. Repeat it several times to get a better feeling for whether the results are meaningful.
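
The walkthrough in slides 15-20 can be summarized in a short sketch. This is an illustrative implementation, not the code behind the figures; as the slide notes, the result depends on the random starting centres, so it is worth repeating with several seeds:

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Plain K-means: random initial centres, then alternate assignment and update."""
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # k centres at random data points
    for _ in range(n_iter):
        # assign each data point to the nearest cluster centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # compute the new centre of each class (keep the old one if a cluster is empty)
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):
            break  # stop when the centres no longer move
        centres = new_centres
    return labels, centres
```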
23. c-Partition
- All clusters C together fill the whole universe U.
- Clusters do not overlap.
- A cluster C is never empty, and it is smaller than the whole universe U.
- There must be at least 2 clusters in a c-partition and at most as many as the number of data points K.
24. Objective Function
Minimise the total sum of all distances.
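
In the usual c-means formulation (a reconstruction of the formula the slide refers to, not copied from it), the quantity being minimised is the total within-cluster squared distance; FCM weights each term by a fuzzy membership raised to a fuzzifier m > 1:

```latex
% Hard c-means (K-means) objective: total within-cluster squared distance
J \;=\; \sum_{i=1}^{c} \sum_{x_k \in C_i} \lVert x_k - c_i \rVert^2
% Fuzzy c-means generalisation, with memberships \mu_{ik} and fuzzifier m > 1
J_m \;=\; \sum_{i=1}^{c} \sum_{k=1}^{K} \mu_{ik}^{\,m}\, \lVert x_k - c_i \rVert^2
```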
25. Algorithm: fuzzy c-means (FCM)
26. Each data point belongs to two clusters to different degrees.
27. Place two cluster centres. Assign a fuzzy membership to each data point depending on distance.
28. Compute the new centre of each class and move the crosses (x).
29. Iteration 2
30. Iteration 5
31. Iteration 10
32. Iteration 13 (then stop, because there is no visible change). Each data point belongs to the two clusters to a degree.
33. The membership matrix M:

M =
  0.0025  0.9975
  0.0091  0.9909
  0.0129  0.9871
  0.0001  0.9999
  0.0107  0.9893
  0.9393  0.0607
  0.9638  0.0362
  0.9574  0.0426
  0.9906  0.0094
  0.9807  0.0193

- The last five data points (rows) belong mostly to the first cluster (column).
- The first five data points (rows) belong mostly to the second cluster (column).
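
A compact sketch of the FCM iteration shown in slides 26-33 (illustrative only; the update rules are the standard ones with fuzzifier m = 2, and the variable names are my own):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, rng=None):
    """Fuzzy c-means: every point gets a degree of membership in every cluster."""
    rng = np.random.default_rng(rng)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)                    # memberships of each point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                            # avoid division by zero at a centre
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)     # standard FCM membership update
        if np.abs(U_new - U).max() < tol:
            break
        U = U_new
    return U, centres
```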
34. Hard Classifier (HCM)
A cell belongs to either one class or the other, as defined by a colour.
35. Fuzzy Classifier (FCM)
A cell can belong to several classes to a degree, i.e., one column may have several colours.
36. Hierarchical Clustering
- Greedy
- Agglomerative vs. divisive
- Dendrograms allow us to visualize the result
  - the visualization is not unique!
- Tends to be sensitive to small changes in the data
- Provides clusters of every size; where to cut is user-determined
- Large storage demand
- Running time: O(n^2 · levels) = O(n^3)
- Depends on the distance measure and linkage method
37. Hierarchical Agglomerative Clustering
- We start with every data point in a separate cluster.
- We keep merging the most similar pairs of data points/clusters until we have one big cluster left.
- This is called a bottom-up or agglomerative method.
38. Hierarchical Clustering (cont.)
- This produces a binary tree or dendrogram.
- The final cluster is the root and each data item is a leaf.
- The height of the bars indicates how close the items are.
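
For reference, a minimal SciPy sketch of the agglomerative procedure and its dendrogram (the choice of Euclidean distance and average linkage is an assumption for illustration; the slides leave both open):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

# Toy data standing in for expression profiles (rows = items to cluster)
X = np.random.default_rng(0).normal(size=(20, 5))

D = pdist(X, metric='euclidean')     # condensed pairwise distance matrix
Z = linkage(D, method='average')     # bottom-up merging; 'average' is one linkage choice
dendrogram(Z)                        # binary tree: leaves are items, bar height = merge distance
plt.show()

# Cutting the tree (at a user-chosen level) yields flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
```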
39. Hierarchical Clustering Demo
40. Hierarchical Clustering Issues
- Distinct clusters are not produced; sometimes this can be good, if the data has a hierarchical structure without clear boundaries.
- There are methods for producing distinct clusters, but these usually involve specifying somewhat arbitrary cutoff values.
- What if the data doesn't have a hierarchical structure? Is HC appropriate?
41. Support Vector Clustering
- Given points x in data space, define images in Hilbert space.
- Require all images to be enclosed by a minimal sphere in Hilbert space.
- Reflection of this sphere in data space defines the cluster boundaries.
- Two parameters: the width of the Gaussian kernel and the fraction of outliers.

Ben-Hur, Horn, Siegelmann & Vapnik, JMLR 2 (2001) 125-137.
42. (No Transcript)
43. (No Transcript)
44. (No Transcript)
45. Variation of q allows for clustering solutions on various scales.
q = 1, 20, 24, 48
46. (No Transcript)
47. An example that allows for SV clustering only in the presence of outliers.
Procedure: limit β < C = 1/(pN), where p = fraction of assumed outliers in the data.
q = 3.5, p = 0
q = 1, p = 0.3
48. Similarity to the scale-space approach for high values of q and p. Probability distribution obtained from R(x).
q = 4.8, p = 0.7
49. From Scale Space to Quantum Clustering
- Parzen window approach: estimate the probability density by kernel functions (Gaussians) located at the data points.
σ = 1/√(2q)
50. Quantum Clustering
- View the Parzen estimator ψ as the solution of the Schrödinger equation, with the potential V(x) responsible for attraction to cluster centers and the Laplacian (kinetic) term causing the spread.
- Find V(x).

Horn and Gottlieb, Phys. Rev. Lett. 88 (2002) 018702.
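
For reference, the construction described here (following the general form of the equations in Horn and Gottlieb, 2002; treat the details as a sketch rather than a transcription of the slide) is:

```latex
% Parzen-window estimator built from the data points x_i
\psi(\mathbf{x}) \;=\; \sum_i e^{-(\mathbf{x}-\mathbf{x}_i)^2 / 2\sigma^2}
% psi is required to be the ground state of the Schrodinger equation
H\psi \;\equiv\; \Bigl(-\tfrac{\sigma^2}{2}\nabla^2 + V(\mathbf{x})\Bigr)\psi \;=\; E\psi
% which determines the potential, up to the constant E (fixed by min V = 0)
V(\mathbf{x}) \;=\; E + \frac{\sigma^2}{2}\,\frac{\nabla^2\psi}{\psi}
            \;=\; E - \frac{d}{2}
              + \frac{1}{2\sigma^2\,\psi(\mathbf{x})}\sum_i (\mathbf{x}-\mathbf{x}_i)^2\,
                e^{-(\mathbf{x}-\mathbf{x}_i)^2/2\sigma^2}
```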
51. The Crabs Example (from Ripley's textbook): 4 classes, 50 samples each, d = 5.
A topographic map of the probability distribution for the crab data set with σ = 1/√2, using principal components 2 and 3. There exists only one maximum.
52. The Crabs Example: the QC potential exhibits four minima, identified with cluster centers.
A topographic map of the potential for the crab data set with σ = 1/√2, using principal components 2 and 3. The four minima are denoted by crossed circles. The contours are set at values V = cE for c = 0.2, ..., 1.
53. The Crabs Example (cont'd)
A three-dimensional plot of the potential for the crab data set with σ = 1/√3, using principal components 2 and 3.
54. The Crabs Example (cont'd)
A three-dimensional plot of the potential for the crab data set with σ = 1/2, using principal components 2 and 3.
55. Properties of V and E
E is chosen so that min(V) = 0. E sets the scale of structure observed in V(x). The single-point case corresponds to the harmonic potential. In general:
56. Identifying Clusters
- Local minima of the potential are identified with cluster centers.
- Data points are assigned to clusters according to:
  - minimal distance from the centers, or
  - sliding points down the slopes of the potential with gradient descent until they reach the centers.
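
A rough Python sketch of the second assignment option, gradient descent on the QC potential (this is not the authors' code; it uses the sum-of-Gaussians form of V up to additive constants, and a simple fixed-step numerical gradient):

```python
import numpy as np

def qc_potential(x, data, sigma):
    """Quantum-clustering potential V(x), up to additive constants.

    For a sum-of-Gaussians psi, (sigma^2/2) * laplacian(psi)/psi reduces to a
    weighted mean of squared distances minus d/2; the omitted constants do not
    change the positions of the minima or the descent dynamics.
    """
    d2 = np.sum((data - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return np.sum(w * d2) / (2 * sigma ** 2 * np.sum(w))

def descend(point, data, sigma, step=0.1, n_steps=200, eps=1e-4):
    """Slide a point down the slopes of V with a central-difference gradient."""
    x = point.astype(float).copy()
    for _ in range(n_steps):
        grad = np.array([
            (qc_potential(x + eps * e, data, sigma)
             - qc_potential(x - eps * e, data, sigma)) / (2 * eps)
            for e in np.eye(len(x))
        ])
        x -= step * grad
    return x

# Points that converge to (approximately) the same location share a cluster centre.
```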
57. The Iris Example: 3 classes, each containing 50 samples, d = 4.
A topographic map of the potential for the iris data set with σ = 0.25, using principal components 1 and 2. The three minima are denoted by crossed circles. The contours are set at values V = cE for c = 0.2, ..., 1.
58. The Iris Example: Gradient Descent Dynamics
59. The Iris Example: Using Raw Data in 4D
There are only 5 misclassifications. σ = 0.21.
60. Example: Yeast Cell Cycle
Yeast cell cycle data were studied by several groups who have applied SVD (Spellman et al., Molecular Biology of the Cell 9, Dec. 1998). We use it to test clustering of genes whose classification into groups was investigated by Spellman et al. The gene/sample matrix that we start from has dimensions 798x72, using the same selection as made by Shamir and Sharan (2002). We truncate it to r = 4 and obtain, once again, our best results for σ = 0.5, where four clusters follow from the QC algorithm.
61. Example: Yeast Cell Cycle
The five gene families as represented in two coordinates of our r = 4 dimensional space.
62. Example: Yeast Cell Cycle
Cluster assignments of genes for QC with σ = 0.46, as compared to the classification by Spellman into five classes, shown as alternating gray and white areas.
63. Yeast cell cycle in normalized 2 dimensions.
64. Hierarchical Quantum Clustering (HQC)
- Start with the raw data matrix containing gene expression profiles of the samples.
- Apply SVD and truncate to r-space by selecting the first r significant eigenvectors.
- Apply QC in r dimensions, starting at a small scale σ and obtaining many clusters. Move the data points to their cluster centers and reiterate the process at higher σ. This produces a hierarchical clustering that can be represented by a dendrogram.
65. Example: Clustering of Human Cancer Cells
The NCI60 set is a gene expression profile of 8,000 genes in 60 human cancer cell lines. NCI60 includes cell lines derived from cancers of colorectal, renal, ovarian, breast, prostate, lung and central nervous system origin, as well as leukemias and melanomas. After application of selective filters, the number of gene spots is reduced to a 1,376-gene subset (Scherf et al., Nature Genetics 24, 2000). We applied HQC with r = 5 dimensions.
66. Example: Clustering of Human Cancer Cells
Dendrogram of the 60 cancer cell samples. The clustering was done in 5 truncated dimensions. The first 2 letters in each sample represent the tissue/cancer type.
67. Example: Projection onto the Unit Sphere
Representation of data of four classes of cancer cells in two dimensions of the truncated space. The circles denote the locations of the data points before this normalization was applied.
68. COMPACT: A Comparative Package for Clustering Assessment
- COMPACT is a GUI Matlab tool that enables an easy and intuitive way to compare several clustering methods.
- COMPACT is a five-step wizard that contains basic Matlab clustering methods as well as the quantum clustering algorithm. It provides a flexible and customizable interface for clustering data with high dimensionality.
- COMPACT allows both textual and graphical display of the clustering results.
69. How to Install?
- COMPACT is a self-extracting package. In order to install and run the GUI tool, follow these three easy steps:
  1. Download the COMPACT.zip package to your local drive.
  2. Add the COMPACT destination directory to your Matlab path.
  3. Within Matlab, type "compact" at the command prompt.
70. Step 1
71. Step 1
72. Step 2
- Determining the matrix shape and the vectors to cluster
73. Step 3
- Preprocessing procedures
- Component variance graphs
- Preprocessing parameters
74. Step 4
- Points distribution preview and clustering method selection
75. Step 5
- Parameters for the clustering algorithms: K-means
76. Step 5
- Parameters for the clustering algorithms: FCM
77. Step 5
- Parameters for the clustering algorithms: NN
78. Step 5
- Parameters for the clustering algorithms: QC
79. Step 6
80. Step 6: Results
81. Clustering Methods: Model-Based
- Data are generated from a mixture of underlying probability distributions.
82. Some Examples
- Two univariate normal components
- Equal proportions
- Common variance σ² = 1
83.
- Two univariate normal components
- Proportions 0.75 and 0.25
- Common variance σ² = 1
84. ... and some more
85. Probability Models
- Classification likelihood
- θk: the set of parameters of cluster k
- τk: the probability that an observation belongs to cluster k (Σk τk = 1)
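
In the standard model-based clustering notation that these symbols appear to follow (a reconstruction, since the slide's formulas did not survive the transcript), the two likelihoods are:

```latex
% Mixture likelihood: tau_k is the probability that an observation belongs to cluster k
L_M(\theta_1,\dots,\theta_G;\tau_1,\dots,\tau_G \mid \mathbf{x})
   \;=\; \prod_{i=1}^{n} \sum_{k=1}^{G} \tau_k\, f_k(\mathbf{x}_i \mid \theta_k),
\qquad \sum_{k=1}^{G}\tau_k = 1,\ \tau_k \ge 0
% Classification likelihood: each observation i carries a single label gamma_i
L_C(\theta_1,\dots,\theta_G;\gamma_1,\dots,\gamma_n \mid \mathbf{x})
   \;=\; \prod_{i=1}^{n} f_{\gamma_i}(\mathbf{x}_i \mid \theta_{\gamma_i})
```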
86. Probability Models (cont.)
- Most used: the multivariate normal distribution
- θk has a mean vector µk and a covariance matrix Σk
- How is the covariance matrix Σk calculated?
87. Calculating the Covariance Matrix Σk
- The idea: parameterize the covariance matrix.
- Dk: orthogonal matrix of eigenvectors
  - Determines the orientation of the principal components of Σk
- Ak: diagonal matrix whose elements are proportional to the eigenvalues of Σk
  - Determines the shape of the density contours
- λk: scalar
  - Determines the volume of the corresponding ellipsoid
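
Written out, the parameterization described above is the usual eigenvalue decomposition of the cluster covariance (Banfield-Raftery style):

```latex
% lambda_k: volume, D_k: orientation (eigenvectors), A_k: shape (scaled eigenvalues)
\Sigma_k \;=\; \lambda_k\, D_k\, A_k\, D_k^{\mathsf{T}}
```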
88. The Definition of Σk Determines the Model
- Spherical, equal (sum-of-squares criterion)
89. How Is θk Computed? The EM Algorithm
- The complete-data log-likelihood (with the unobserved cluster labels zi).
- The density of an observation xi given zi.
- ẑik is the conditional expectation of zik given xi and θ1, ..., θG.
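
A bare-bones sketch of the EM iteration for a Gaussian mixture (illustrative only, not the implementation behind the slides; the small ridge added to each covariance is one way to postpone the singular-covariance failure discussed on the next slide):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, G, n_iter=100, seed=0):
    """A minimal EM loop for a Gaussian mixture with full covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=G, replace=False)]         # initial means: random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * G)
    tau = np.full(G, 1.0 / G)                             # mixing proportions
    for _ in range(n_iter):
        # E-step: z_hat[i, k] = conditional expectation of z_ik given x_i and the parameters
        dens = np.column_stack([
            tau[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(G)
        ])
        z_hat = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate tau_k, mu_k, Sigma_k from the responsibility-weighted data
        Nk = z_hat.sum(axis=0)
        tau = Nk / n
        mu = (z_hat.T @ X) / Nk[:, None]
        for k in range(G):
            Xc = X - mu[k]
            Sigma[k] = (z_hat[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return tau, mu, Sigma, z_hat
```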
90. (No Transcript)
91. Limitations of the EM Algorithm
- Low rate of convergence.
- You should start with good starting points and hope for separable clusters.
- Not practical for a large number of clusters (probabilities).
- "Crashes" when the covariance matrix becomes singular.
- Problems when there are few observations in a cluster.
- EM must not get more clusters than exist in nature.