1
Clustering
(an ill-defined problem)
  • Introduction
  • Preprocessing: dimensional reduction with SVD
  • Clustering methods: K-means, FCM
  • Hierarchical methods
  • Model-based methods (at the end)
  • Competitive NN (SOM) (not shown here)
  • SVC, QC
  • Applications
  • COMPACT

2
What Is Clustering ?
  • Clustering is the partitioning of data into meaningful (?) groups called clusters.
  • Cluster: a collection of objects that are similar to one another. But what does "similar" mean?
  • Hence unsupervised learning: no predefined classes.

Not Classification!!
  • Why? To help understand the natural grouping or
    structure in a data set
  • When? Used either as
  • a stand-alone tool to get insight into data
    distribution or
  • as a preprocessing step for other algorithms,
    e.g., to discover classes

3
Clustering Applications
  • Operations Research
  • Facility location problem: locate fire stations so as to minimize the maximum/average distance a fire truck must travel
  • Signal Processing
  • Vector quantization: transmit large files (e.g., video, speech) by computing quantizers
  • Astronomy
  • SkyCat: clustered 2x10^9 sky objects into stars, galaxies, quasars, etc., based on radiation emitted in different spectral bands

4
Clustering Applications
  • Marketing
  • Segmentation of customers for target marketing
  • Segmentation of customers based on online
    clickstream data.
  • Web
  • To discover categories of content.
  • Search results
  • Bioinformatics
  • Gene expression
  • Finding groups of individuals (sick vs. healthy)
  • Finding groups of genes
  • Motif search
  • In practice, clustering is one of the most widely
    used data mining techniques
  • Association rule algorithms produce too many
    rules
  • Other machine learning algorithms require labeled
    data.

5
Points/Metric Space
  • Points could be in R^d, {0,1}^d, ...
  • Metric space: dist(x, y) is a distance metric if
  • Reflexive: dist(x, y) = 0 iff x = y
  • Symmetric: dist(x, y) = dist(y, x)
  • Triangle inequality: dist(x, y) ≤ dist(x, z) + dist(z, y)

6
Example of Distance Metrics
  • The distance between x = (x1, ..., xn) and y = (y1, ..., yn) is
  • L2 norm (Euclidean distance): sqrt(sum_i (xi - yi)^2)
  • Manhattan distance (L1 norm): sum_i |xi - yi|
  • Documents: cosine measure, cos θ = (x · y) / (||x|| ||y||)
  • Similarity:
  • i.e., more similar → close to 1
  • less similar → close to 0
  • Not a metric space, but 1 - cos θ is (a small sketch follows)
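A minimal sketch (not part of the original slides; assumes NumPy) showing how the cosine measure and the 1 - cos θ dissimilarity would be computed for two vectors:

  import numpy as np

  def cosine_similarity(x, y):
      # cos(theta) = <x, y> / (||x|| * ||y||)
      return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

  def cosine_dissimilarity(x, y):
      # 1 - cos(theta): zero for parallel vectors, larger as they diverge
      return 1.0 - cosine_similarity(x, y)

  x = np.array([1.0, 2.0, 0.0])
  y = np.array([2.0, 4.0, 1.0])
  print(cosine_similarity(x, y), cosine_dissimilarity(x, y))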

7
Correlation
  • We might care more about the overall shape of
    expression profiles rather than the actual
    magnitudes
  • That is, we might want to consider genes similar
    when they are up and down together
  • When might we want this kind of measure? What
    experimental issues might make this appropriate?

8
Pearson Linear Correlation
  • We're shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean 0 and std 1)

9
Pearson Linear Correlation
  • Pearson linear correlation (PLC) is a measure
    that is invariant to scaling and shifting
    (vertically) of the expression values
  • Always between -1 and +1 (-1: perfectly anti-correlated, +1: perfectly correlated)
  • This is a similarity measure, but we can easily make it into a dissimilarity measure (see the sketch below)
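A minimal sketch (not from the slides; assumes NumPy) of the Pearson linear correlation and one common dissimilarity derived from it, d_p = (1 - ρ)/2, which is consistent with the numbers quoted on the next slide:

  import numpy as np

  def pearson_correlation(x, y):
      # subtract means, divide by standard deviations, average the product
      xz = (x - x.mean()) / x.std()
      yz = (y - y.mean()) / y.std()
      return np.mean(xz * yz)

  def pearson_dissimilarity(x, y):
      # map similarity in [-1, 1] to dissimilarity in [0, 1]
      return (1.0 - pearson_correlation(x, y)) / 2.0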

10
PLC (cont.)
  • PLC only measures the degree of a linear
    relationship between two expression profiles!
  • If you want to measure other relationships, there
    are many other possible measures (see Jagota book
    and project 3 for more examples)

ρ = 0.0249, so d_p = 0.4876. The green curve is the square of the blue curve; this relationship is not captured by PLC.
11
More correlation examples
What do you think the correlation is here? Is
this what we want?
How about here? Is this what we want?
12
Missing Values
  • A common problem w/ microarray data
  • One approach with Euclidean distance or PLC is
    just to ignore missing values (i.e., pretend the
    data has fewer dimensions)
  • There are more sophisticated approaches that use information such as continuity of a time series or related genes to estimate missing values; better to use these if possible

13
Preprocessing
  • For methods that are not applicable in very high dimensions you may want to apply:
  • Dimensional reduction, e.g. consider the first few SVD components (truncate S at r dimensions) and use the remaining values of the U or V matrices
  • Dimensional reduction + normalization: after applying dimensional reduction, normalize all resulting vectors to unit length (i.e. use angles as proximity measures); see the sketch after this list
  • Feature selection, e.g. consider only features that have large variance. More on feature selection in the future.
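A minimal sketch of the first two preprocessing steps (my illustration, assuming rows are samples and NumPy is available): truncate the SVD to r dimensions and normalize the reduced vectors to unit length so that angles become the proximity measure.

  import numpy as np

  def svd_reduce(X, r):
      # SVD of the data matrix; keep only the first r components
      U, s, Vt = np.linalg.svd(X, full_matrices=False)
      X_r = U[:, :r] * s[:r]                         # coordinates in r-space
      norms = np.linalg.norm(X_r, axis=1, keepdims=True)
      return X_r / norms                             # project onto the unit sphere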

14
Clustering Types
  • Exclusive vs. Overlapping Clustering
  • Hierarchical vs. Global Clustering
  • Formal vs. Heuristic Clustering

First two examples: K-means (exclusive, global, heuristic) and FCM (fuzzy c-means; overlapping, global, heuristic).
15
Two classes of data, shown with two different markers. The objective is to reproduce the two classes by K = 2 clustering.
16
  1. Place two cluster centres (x) at random.
  2. Assign each data point to the nearest cluster centre (x).

17
  1. Compute the new centre of each class
  2. Move the crosses (x)

18
Iteration 2
19
Iteration 3
20
Iteration 4 (then stop, because no visible
change) Each data point belongs to the cluster
defined by the nearest centre
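The loop illustrated above (random centres, nearest-centre assignment, centre update, stop when nothing moves) can be written as the following minimal sketch; this is my illustration, not the slides' code, and it assumes NumPy and ignores the empty-cluster corner case.

  import numpy as np

  def kmeans(X, k, n_iter=100, seed=0):
      rng = np.random.default_rng(seed)
      centres = X[rng.choice(len(X), size=k, replace=False)]   # 1. place centres at random
      for _ in range(n_iter):
          d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
          labels = d.argmin(axis=1)                            # 2. assign to nearest centre
          # 3. recompute each centre (empty clusters are not handled here)
          new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
          if np.allclose(new_centres, centres):                # stop: no visible change
              break
          centres = new_centres
      return labels, centres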
21
M =
  0.0000  1.0000
  0.0000  1.0000
  0.0000  1.0000
  0.0000  1.0000
  0.0000  1.0000
  1.0000  0.0000
  1.0000  0.0000
  1.0000  0.0000
  1.0000  0.0000
  1.0000  0.0000
  • The membership matrix M
  • The last five data points (rows) belong to the first cluster (column)
  • The first five data points (rows) belong to the second cluster (column)

22
Membership matrix M: entry m_ik is 1 if data point k is closer to cluster centre i than to any other cluster centre j (using the chosen distance), and 0 otherwise.
Results of K-means depend on the starting point of the algorithm. Repeat it several times to get a better feel for whether the results are meaningful.
23
c-partition
All clusters C_i together fill the whole universe U.
Clusters do not overlap.
A cluster C_i is never empty and is never the whole universe U.
There must be at least 2 clusters in a c-partition and at most as many as the number of data points K.
24
Objective function
Minimise the total sum of all (membership-weighted) distances from data points to their cluster centres.
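The slide's formula did not survive the transcript; as a reconstruction, a standard form of the c-means objective, with μ_ik the membership of data point k in cluster i and fuzzifier m ≥ 1 (m = 1 with crisp memberships recovers K-means), is

J = \sum_{i=1}^{c} \sum_{k=1}^{K} \mu_{ik}^{m}\, \mathrm{dist}(\mathbf{x}_k, \mathbf{c}_i)^2 .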
25
  • Algorithm fuzzy c-means (FCM)

26
Each data point belongs to two clusters to
different degrees
27
  1. Place two cluster centres.
  2. Assign a fuzzy membership to each data point depending on its distance to each centre (see the sketch below).
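A minimal fuzzy c-means sketch (my illustration, not the slides' code; assumes NumPy and the common fuzzifier m = 2): memberships are inverse-distance weighted, centres are membership-weighted means.

  import numpy as np

  def fcm(X, c, m=2.0, n_iter=100, seed=0):
      rng = np.random.default_rng(seed)
      centres = X[rng.choice(len(X), size=c, replace=False)]
      for _ in range(n_iter):
          d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
          # membership: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
          u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
          # centres: weighted means with weights u^m
          w = u ** m
          centres = (w.T @ X) / w.sum(axis=0)[:, None]
      return u, centres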

28
  1. Compute the new centre of each class
  2. Move the crosses (x)

29
Iteration 2
30
Iteration 5
31
Iteration 10
32
Iteration 13 (then stop, because no visible
change) Each data point belongs to the two
clusters to a degree
33
M =
  0.0025  0.9975
  0.0091  0.9909
  0.0129  0.9871
  0.0001  0.9999
  0.0107  0.9893
  0.9393  0.0607
  0.9638  0.0362
  0.9574  0.0426
  0.9906  0.0094
  0.9807  0.0193
  • The membership matrix M
  • The last five data points (rows) belong mostly to
    the first cluster (column)
  • The first five data points (rows) belong mostly
    to the second cluster (column)

34
Hard Classifier (HCM)
A cell is either one or the other class defined
by a colour.
35
Fuzzy Classifier (FCM)
A cell can belong to several classes to
a degree, i.e., one column may have several
colours.
36
  • Hierarchical Clustering
  • Greedy
  • Agglomerative vs. divisive
  • Dendrograms allow us to visualize the result (the visualization is not unique!)
  • Tends to be sensitive to small changes in the data
  • Provides clusters of every size; where to cut is user-determined
  • Large storage demand
  • Running time: O(n^2 · levels), up to O(n^3)
  • Depends on the distance measure and linkage method

37
Hierarchical Agglomerative Clustering
  • We start with every data point in a separate
    cluster
  • We keep merging the most similar pairs of data
    points/clusters until we have one big cluster
    left
  • This is called a bottom-up or agglomerative
    method

38
Hierarchical Clustering (cont.)
  • This produces a binary tree or dendrogram
  • The final cluster is the root and each data item
    is a leaf
  • The height of the bars indicates how close the items are

39
Hierarchical Clustering Demo
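The demo figures did not survive the transcript; as a stand-in (my sketch, not the slides' own demo), SciPy's agglomerative clustering produces the same kind of dendrogram and user-determined cut:

  import numpy as np
  import matplotlib.pyplot as plt
  from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

  X = np.random.default_rng(0).normal(size=(30, 4))     # toy data
  Z = linkage(X, method='average', metric='euclidean')  # linkage method + distance measure
  dendrogram(Z)                                         # root on top, one leaf per data item
  plt.show()
  labels = fcluster(Z, t=3, criterion='maxclust')       # user-determined cut: 3 clusters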
40
Hierarchical Clustering Issues
  • Distinct clusters are not produced; sometimes this can be good, if the data has a hierarchical structure without clear boundaries
  • There are methods for producing distinct clusters, but these usually involve specifying somewhat arbitrary cutoff values
  • What if the data doesn't have a hierarchical structure? Is HC appropriate?

41
Support Vector Clustering
  • Given points x in data space, define images in
    Hilbert space.
  • Require all images to be enclosed by a minimal
    sphere in Hilbert space.
  • Reflection of this sphere in data space defines
    cluster boundaries.
  • Two parameters: the width of the Gaussian kernel and the fraction of outliers

Ben-Hur, Horn, Siegelmann and Vapnik, JMLR 2 (2001) 125-137.
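Stated as formulas (a sketch consistent with the cited paper, not a transcript of the slide), the construction is the minimal enclosing sphere problem in feature space,

\min_{R,\,\mathbf{a},\,\xi}\; R^2 + C\sum_j \xi_j \quad \text{s.t.} \quad \lVert \Phi(\mathbf{x}_j)-\mathbf{a}\rVert^2 \le R^2 + \xi_j,\;\; \xi_j \ge 0,

with the Gaussian kernel K(\mathbf{x},\mathbf{y}) = e^{-q\lVert\mathbf{x}-\mathbf{y}\rVert^2} defining the images Φ(x): q controls the kernel width, and C (equivalently the assumed outlier fraction p, via C = 1/(pN)) controls how many points may lie outside the sphere.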
42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
Variation of q allows for clustering solutions on various scales.
q = 1, 20, 24, 48
46
(No Transcript)
47
Example that allows for SV clustering only in the presence of outliers. Procedure: limit β ≤ C = 1/(pN), where p = fraction of assumed outliers in the data.
q = 3.5, p = 0
q = 1, p = 0.3
48
Similarity to the scale-space approach for high values of q and p. Probability distribution obtained from R(x).
q = 4.8, p = 0.7
49
From Scale-space to Quantum Clustering
  • Parzen-window approach: estimate the probability density by kernel functions (Gaussians) located at the data points.

σ = 1/√(2q)
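Written out (a reconstruction of the slide's formula, up to a normalization constant), the Parzen-window estimate built from data points x_1, ..., x_N is

\psi(\mathbf{x}) = \sum_{i=1}^{N} e^{-\lVert \mathbf{x}-\mathbf{x}_i \rVert^{2}/2\sigma^{2}} ,

where σ = 1/√(2q) relates the Gaussian width to the SVC kernel parameter q.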
50
Quantum Clustering
  • View the Parzen estimator ψ as the solution of the Schrödinger equation,
  • with the potential V(x) responsible for attraction to cluster centers and the Laplacian (kinetic) term causing the spread.
  • Find V(x).

Horn and Gottlieb, Phys. Rev. Lett. 88 (2002)
018702
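The equations referred to on this slide, reconstructed from the cited paper (take them as a sketch of the construction rather than a transcript of the slide): ψ is required to be the ground state of

H\psi \equiv \left(-\frac{\sigma^{2}}{2}\nabla^{2} + V(\mathbf{x})\right)\psi = E\psi ,

so the potential can be read off as

V(\mathbf{x}) = E + \frac{\sigma^{2}}{2}\,\frac{\nabla^{2}\psi}{\psi} ,

with E fixed by the convention min V = 0. The minima of V(x) are then taken as cluster centers.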
51
The Crabs Example (from Ripley's textbook): 4 classes, 50 samples each, d = 5.
A topographic map of the probability distribution for the crab data set with σ = 1/√2, using principal components 2 and 3. There exists only one maximum.
52
The Crabs Example: the QC potential exhibits four minima identified with cluster centers.
A topographic map of the potential for the crab data set with σ = 1/√2, using principal components 2 and 3. The four minima are denoted by crossed circles. The contours are set at values V = cE for c = 0.2, ..., 1.
53
The Crabs Example - Cont'd.
A three-dimensional plot of the potential for the crab data set with σ = 1/√3, using principal components 2 and 3.
54
The Crabs Example - Cont'd.
A three-dimensional plot of the potential for the crab data set with σ = 1/2, using principal components 2 and 3.
55
Properties of V and E
E is chosen so that min(V) = 0. E sets the scale of structure observed in V(x). The single-point case corresponds to the harmonic potential V(x) = x²/(2σ²). In general, 0 < E ≤ d/2.
56
Identifying Clusters
  • Local minima of the potential are identified with
    cluster centers.
  • Data points are assigned to clusters according
    to
  • -minimal distance from centers, or,
  • -sliding points down the slopes of the potential
    with gradient descent until they reach the
    centers.

57
The Iris Example: 3 classes, each containing 50 samples, d = 4.
A topographic map of the potential for the iris data set with σ = 0.25, using principal components 1 and 2. The three minima are denoted by crossed circles. The contours are set at values V = cE for c = 0.2, ..., 1.
58
The Iris Example - Gradient Descent Dynamics
59
The Iris Example - Using Raw Data in 4D.
There are only 5 misclassifications. σ = 0.21.
60
Example Yeast cell cycle
Yeast cell cycle data were studied by several groups who have applied SVD (Spellman et al., Molecular Biology of the Cell, 9, Dec. 1998). We use it to test clustering of genes, whose classification into groups was investigated by Spellman et al. The gene/sample matrix that we start from has dimensions 798x72, using the same selection as made by Shamir and Sharan (2002). We truncate it to r = 4 and obtain, once again, our best results for σ = 0.5, where four clusters follow from the QC algorithm.
61
Example Yeast cell cycle
The five gene families as represented in two coordinates of our r = 4 dimensional space.
62
Example Yeast cell cycle
Cluster assignments of genes for QC with σ = 0.46, as compared to the classification by Spellman into five classes, shown as alternating gray and white areas.
63
Yeast cell cycle in normalized 2 dimensions
64
Hierarchical Quantum Clustering (HQC)
  • Start with the raw data matrix containing gene expression profiles of the samples.
  • Apply SVD and truncate to r-space by selecting the first r significant eigenvectors.
  • Apply QC in r dimensions starting at a small scale σ, obtaining many clusters. Move data points to cluster centers and reiterate the process at higher σ. This produces a hierarchical clustering that can be represented by a dendrogram.

65
Example Clustering of human cancer cells
The NCI60 set is a gene expression profile of 8000 genes in 60 human cancer cells. NCI60 includes cell lines derived from cancers of colorectal, renal, ovarian, breast, prostate, lung and central nervous system origin, as well as leukemias and melanomas. After application of selective filters the number of gene spots is reduced to a 1,376-gene subset (Scherf et al., Nature Genetics 24, 2000). We applied HQC with r = 5 dimensions.
66
Example Clustering of human cancer cells
Dendrogram of 60 cancer cell samples. The
clustering was done in 5 truncated dimensions.
The first 2 letters in each sample represent the
tissue/cancer type.
67
Example - Projection onto the unit sphere
Representation of data from four classes of cancer cells in two dimensions of the truncated space. The circles denote the locations of the data points before this normalization was applied.
68
COMPACT: a comparative package for clustering assessment
  • COMPACT is a Matlab GUI tool that provides an easy and intuitive way to compare several clustering methods.
  • Compact is a five-step wizard that contains basic
    Matlab clustering methods as well as the quantum
    clustering algorithm. Compact provides a flexible
    and customizable interface for clustering data
    with high dimensionality.
  • Compact allows both textual and graphical display
    of the clustering results

69
How to Install?
  • COMPACT is a self-extracting package. In order to install and run the GUI tool, follow these three easy steps:
  • Download the COMPACT.zip package to your local drive.
  • Add the COMPACT destination directory to your Matlab path.
  • Within Matlab, type compact at the command prompt.

70
Step 1
  • Input parameters

71
Step 1
  • Selecting variables

72
Step 2
  • Determining the matrix shape and vectors to
    cluster

73
Step 3
  • Preprocessing Procedures
  • Components variance graphs
  • Preprocessing parameters

74
Step 4
  • Points distribution preview
  • and clustering method selection

75
Step 5
  • Parameters for clustering algorithms
  • K-means

76
Step 5
  • Parameters for clustering algorithms
  • FCM

77
Step 5
  • Parameters for clustering algorithms
  • NN

78
Step 5
  • Parameters for clustering algorithms
  • QC

79
Step 6
  • COMPACT results

80
Step 6: Results
81
Clustering Methods: Model-Based
  • Data are generated from a mixture of underlying
    probability distributions

82
Some Examples
  • Two univariate normal components
  • Equal proportions
  • Common variance σ² = 1

83
  • Two univariate normal components
  • proportions 0.75 and 0.25
  • Common variance σ² = 1

84
and some more
85
Probability Models
  • Classification likelihood
  • θ_k: the set of parameters of cluster k
  • Mixture likelihood
  • τ_k is the probability that an observation belongs to cluster k (Σ_k τ_k = 1); see the formulas below
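In symbols (a standard statement of these two criteria, supplied here because the slide's formulas were not preserved; γ_i is the cluster label of observation i and f_k the density of component k):

L_C(\theta_1,\dots,\theta_G;\,\gamma_1,\dots,\gamma_n \mid \mathbf{x}) = \prod_{i=1}^{n} f_{\gamma_i}(\mathbf{x}_i \mid \theta_{\gamma_i}),

L_M(\theta_1,\dots,\theta_G;\,\tau_1,\dots,\tau_G \mid \mathbf{x}) = \prod_{i=1}^{n} \sum_{k=1}^{G} \tau_k\, f_k(\mathbf{x}_i \mid \theta_k), \qquad \sum_{k=1}^{G}\tau_k = 1.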

86
Probability Models (Cont.)
  • Most used: the multivariate normal distribution
  • θ_k consists of a mean vector μ_k and a covariance matrix Σ_k
  • How is the covariance matrix Σ_k calculated?

87
Calculating the covariance matrix Σ_k
  • The idea: parameterize the covariance matrix (see the decomposition below)
  • D_k: orthogonal matrix of eigenvectors
  • Determines the orientation of the principal components of Σ_k
  • A_k: diagonal matrix whose elements are proportional to the eigenvalues of Σ_k
  • Determines the shape of the density contours
  • λ_k: scalar
  • Determines the volume of the corresponding ellipsoid
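Putting the three factors together, the eigen-decomposition parameterization these bullets describe is

\Sigma_k = \lambda_k\, D_k\, A_k\, D_k^{\mathsf{T}} ,

so the choice of which factors are shared across clusters and which vary per cluster defines the different models on the next slide.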

88
The Definition of Σ_k Determines the Model
  • Spherical, equal (Σ_k = λI): the sum-of-squares (SOS) criterion
  • All ellipsoids are equal (Σ_k = Σ for all k)

89
How is θ_k computed? The EM algorithm.
The complete-data log-likelihood (*) treats the cluster labels z_i as unobserved data. The density of an observation x_i given z_i is the corresponding component density. ẑ_ik is the conditional expectation of z_ik given x_i and θ_1, ..., θ_G.
90
  • E-step: calculate ẑ_ik (the formula is sketched below)
  • M-step: given ẑ, maximize the complete-data log-likelihood (*)
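A sketch of the E-step formula, standard for Gaussian mixtures (the slide's own equation was not preserved):

\hat{z}_{ik} = \frac{\tau_k\, f_k(\mathbf{x}_i \mid \theta_k)}{\sum_{j=1}^{G} \tau_j\, f_j(\mathbf{x}_i \mid \theta_j)} ,

after which the M-step maximizes (*) with z replaced by ẑ.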

91
Limitations of the EM Algorithm
  • Low rate of convergence
  • You should start with good starting points and hope for separable clusters
  • Not practical for a large number of clusters (too many probabilities to estimate)
  • "Crashes" when a covariance matrix becomes singular
  • Problems when there are few observations in a cluster
  • The number of clusters requested must not exceed the number that actually exists in the data