Title: Bioinformatics Pattern recognition - Clustering
1. Bioinformatics Pattern Recognition - Clustering
- Ulf Schmitz
- ulf.schmitz_at_informatik.uni-rostock.de
- Bioinformatics and Systems Biology Group
- www.sbi.informatik.uni-rostock.de
2. Outline
- Introduction
- Hierarchical clustering
- Partitional clustering
- k-means and derivatives
- Fuzzy Clustering
3. Introduction to Clustering Algorithms
- Clustering is the classification of similar objects into separate groups - or the partitioning of a data set into subsets (clusters) - so that the data in each subset (ideally) share some common trait
- Machine learning typically regards clustering as a form of unsupervised learning
- we distinguish
  - Hierarchical Clustering (finds successive clusters using previously established clusters)
  - Partitional Clustering (determines all clusters at once)
4. Introduction to Clustering Algorithms
Applications
- gene expression data analysis
- identification of regulatory binding sites
- phylogenetic tree clustering (for inference of horizontally transferred genes)
- protein domain identification
- identification of structural motifs
5. Introduction to Clustering Algorithms
Data matrix
- the data matrix X collects observations of n objects, described by m measurements
- rows refer to objects, characterised by the values in the columns
- if the units of measurement associated with the columns of X differ, it is necessary to normalise the data
- column vector: x_j = (x_1j, ..., x_nj)^T
- mean: x̄_j = (1/n) * Σ_i x_ij
- standard deviation: s_j = sqrt( (1/(n-1)) * Σ_i (x_ij - x̄_j)^2 )
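The normalisation the slide calls for can be sketched as a z-score transform of each column. The function name is mine, and the sample standard deviation (denominator n-1, as in the formula above) is assumed:

```python
def normalize_columns(X):
    """Scale each column of X to zero mean and unit standard deviation.

    X is a list of n rows (objects); columns are the m measurements.
    Minimal sketch without external libraries; assumes n > 1 and
    non-constant columns.
    """
    n = len(X)
    m = len(X[0])
    # column means: mean_j = (1/n) * sum_i x_ij
    means = [sum(row[j] for row in X) / n for j in range(m)]
    # sample standard deviation with denominator n - 1
    stds = [
        (sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)) ** 0.5
        for j in range(m)
    ]
    return [
        [(row[j] - means[j]) / stds[j] for j in range(m)]
        for row in X
    ]
```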
6. Hierarchical clustering
produces a sequence of nested partitions; the steps are
- find the dis/similarity between every pair of objects in the data set by evaluating a distance measure
- group the objects into a hierarchical cluster tree (dendrogram) by linking newly formed clusters
- obtain a partition of the data set into clusters by selecting a suitable cut-level of the cluster tree
7. Hierarchical clustering
Agglomerative Hierarchical clustering
- start with n clusters, each containing one object, and calculate the distance matrix D1
- determine from D1 which of the objects are least distant (e.g. I and J)
- merge these objects into one cluster and form a new distance matrix by deleting the entries for the clustered objects and adding distances for the new cluster
- repeat steps 2 and 3 a total of n-1 times, until a single cluster is formed
- record which clusters are merged at each step
- record the distances between the clusters that are merged in that step
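The steps above can be sketched in Python; `agglomerate` is a hypothetical helper, and single linkage (minimum pairwise distance) is assumed as the intercluster distance - the slides discuss the linkage choices later. The usage example runs it on the distance matrix shown on slide 11:

```python
def agglomerate(D, labels):
    """Single-linkage agglomerative clustering on a precomputed distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    # step 1: start with n singleton clusters
    clusters = [frozenset([name]) for name in labels]
    dist = {(a, b): D[i][j]
            for i, a in enumerate(labels)
            for j, b in enumerate(labels)}

    def d(c1, c2):
        # single linkage: minimum distance over all cross-cluster pairs
        return min(dist[(a, b)] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        # step 2: find the least distant pair of clusters
        pairs = [(d(c1, c2), c1, c2)
                 for i, c1 in enumerate(clusters)
                 for c2 in clusters[i + 1:]]
        dmin, c1, c2 = min(pairs, key=lambda t: t[0])
        # step 3: merge them and record the step
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((set(c1), set(c2), dmin))
    return merges

# distance matrix for the five objects x1..x5 of slide 11
D = [[0.0000, 2.9155, 1.0000, 3.0414, 3.0414],
     [2.9155, 0.0000, 2.5495, 3.3541, 2.5000],
     [1.0000, 2.5495, 0.0000, 2.0616, 2.0616],
     [3.0414, 3.3541, 2.0616, 0.0000, 1.0000],
     [3.0414, 2.5000, 2.0616, 1.0000, 0.0000]]
merges = agglomerate(D, ["x1", "x2", "x3", "x4", "x5"])
# first two merges: {x1, x3} and {x4, x5}, both at distance 1.0
```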
8. Hierarchical clustering
calculating the distances
- one treats the data matrix X as a set of n (row) vectors with m elements
- Euclidean distance: d(x_i, x_j) = sqrt( Σ_k (x_ik - x_jk)^2 ), where x_i and x_j are row vectors of X
- city block distance: d(x_i, x_j) = Σ_k |x_ik - x_jk|
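The two distance measures translate directly into code (function names are mine):

```python
def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def city_block(x, y):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))
```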
9. Hierarchical clustering
an example: Euclidean distance and city block distance
10. Hierarchical clustering
Agglomerative Hierarchical clustering (steps repeated from slide 7)
11. Hierarchical clustering
distance matrix

        x1      x2      x3      x4      x5
x1      0       2.9155  1.0000  3.0414  3.0414
x2      2.9155  0       2.5495  3.3541  2.5000
x3      1.0000  2.5495  0       2.0616  2.0616
x4      3.0414  3.3541  2.0616  0       1.0000
x5      3.0414  2.5000  2.0616  1.0000  0

after merging x1 with x3, and x4 with x5:

        x1,x3   x2      x4,x5
x1,x3   0       2.9155  2.0616
x2      2.9155  0       2.5000
x4,x5   2.0616  2.5000  0
12. Hierarchical clustering
Methods to define a distance d_IJ between clusters I and J
- single linkage: d_IJ = the minimum distance between any member of I and any member of J
- complete linkage: d_IJ = the maximum distance between any member of I and any member of J
- group average: d_IJ = the average over all N_I * N_J cross-cluster pairs, where N is the number of members in a cluster (e.g. for I = {1, 2} and J = {3, 4, 5}: the average of d13, d14, d15, d23, d24, d25)
- centroid linkage: d_IJ = the distance between the centroids of I and J
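The four linkage rules can be sketched as functions that take two clusters (lists of points) and a point-level distance function d; all names here are mine:

```python
def single_linkage(C1, C2, d):
    """Minimum distance over all cross-cluster pairs."""
    return min(d(x, y) for x in C1 for y in C2)

def complete_linkage(C1, C2, d):
    """Maximum distance over all cross-cluster pairs."""
    return max(d(x, y) for x in C1 for y in C2)

def group_average(C1, C2, d):
    """Average over all N1 * N2 cross-cluster pairs (N = cluster size)."""
    return sum(d(x, y) for x in C1 for y in C2) / (len(C1) * len(C2))

def centroid_linkage(C1, C2, d):
    """Distance between the cluster centroids (coordinate-wise means)."""
    c1 = tuple(sum(v) / len(C1) for v in zip(*C1))
    c2 = tuple(sum(v) / len(C2) for v in zip(*C2))
    return d(c1, c2)
```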
13. Hierarchical clustering
14. Hierarchical clustering
Agglomerative Hierarchical clustering (steps repeated from slide 7)
15. Hierarchical clustering
Limits of hierarchical clustering
- the choice of distance measure is important
- there is no provision for reassigning objects that have been incorrectly grouped
- errors are not handled explicitly in the procedure
- no method of calculating intercluster distances is universally the best
- single-linkage clustering tends to be the least successful
- group average clustering tends to perform fairly well
16. Partitional clustering: k-means
- involves prior specification of the number of clusters, k
- no pairwise distance matrix is required
- the relevant distance is the distance from each object to the cluster centre (centroid)
17. Partitional clustering: k-means
- partition the objects into k clusters (can be done by random partitioning or by arbitrarily clustering around two or more objects)
- calculate the centroids of the clusters
- assign or reassign each object to the cluster whose centroid is closest (distance is calculated as Euclidean distance)
- recalculate the centroids of the new clusters formed after the gain or loss of objects to or from the previous clusters
- repeat steps 3 and 4 for a predetermined number of iterations or until membership of the groups no longer changes
18. Partitional clustering: k-means

object  x1  x2
A       1   1
B       3   1
C       4   8
D       8   10
E       9   6

step 1: make an arbitrary partition of the objects into clusters, e.g. A, B and C into Cluster 1, and D and E into Cluster 2
step 2: calculate the centroids of the clusters: cluster 1: (2.67, 3.33), cluster 2: (8.5, 8)
step 3: calculate the Euclidean distance between each object and each of the two cluster centroids:

object  d(x, c1)  d(x, c2)
A       2.87      10.26
B       2.35      8.90
C       4.86      4.50
D       8.54      2.06
E       6.87      2.06
19. Partitional clustering: k-means
(algorithm steps repeated from slide 17)
20. Partitional clustering: k-means
step 4: C turns out to be closer to Cluster 2 and has to be reassigned; repeat steps 2 and 3
new centroids: cluster 1: (2, 1), cluster 2: (7, 8)

object  d(x, c1)  d(x, c2)
A       1.00      9.22
B       1.00      8.06
C       7.28      3.00
D       10.82     2.24
E       8.60      2.83

no further reassignments are necessary
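The iteration in this worked example can be reproduced with a minimal sketch of the algorithm; `kmeans` is a hypothetical helper, and it assumes no cluster ever becomes empty:

```python
def kmeans(points, assignment, max_iter=100):
    """k-means given an initial assignment (cluster index per object).

    Assumes every cluster keeps at least one member throughout.
    """
    k = max(assignment) + 1
    centroids = []
    for _ in range(max_iter):
        # steps 2/4: centroids as coordinate-wise means of each cluster
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        # step 3: (re)assign each object to the closest centroid
        new = [min(range(k),
                   key=lambda c: sum((pi - ci) ** 2
                                     for pi, ci in zip(p, centroids[c])))
               for p in points]
        if new == assignment:
            break  # step 5: membership no longer changes
        assignment = new
    return assignment, centroids

# objects A..E and the arbitrary initial partition from step 1
points = [(1, 1), (3, 1), (4, 8), (8, 10), (9, 6)]
labels, centroids = kmeans(points, [0, 0, 0, 1, 1])
# C is reassigned to the second cluster; final clusters: {A, B} and {C, D, E}
```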
21. Partitional clustering: k-means
22. Fuzzy clustering
- is an extension of k-means clustering
- an object belongs to a cluster to a certain degree
- for each object, the degrees of membership in the k clusters add up to one
- a fuzzy weight m is introduced, which determines the fuzziness of the resulting clusters
- for m → 1, the clustering becomes a hard partition
- for m → ∞, the degrees of membership approximate 1/k
- typical values are m = 1.25 and m = 2
23. Fuzzy clustering
fix k (2 ≤ k < n), choose a distance measure (Euclidean, city block, etc.), a termination tolerance δ > 0 (e.g. 0.01 or 0.001), and fix the fuzzy weight m (1 < m < ∞); initialise the first partition matrix U randomly
step 1: compute the cluster centres: c_i = Σ_j (u_ij)^m x_j / Σ_j (u_ij)^m
step 2: compute the distances d_ij between the cluster centres c_i and the objects x_j
24. Fuzzy clustering
step 3: update the partition matrix: u_ij = 1 / Σ_l (d_ij / d_lj)^(2/(m-1))
repeat steps 1-3 until ||U(t) - U(t-1)|| < δ
the algorithm is terminated when changes in the partition matrix are negligible
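Steps 1-3 can be sketched as a minimal fuzzy c-means loop. The function name, the random initialisation scheme, and the small floor on distances (to avoid division by zero) are my assumptions:

```python
import random

def fuzzy_cmeans(points, k, m=2.0, tol=1e-3, max_iter=100, seed=0):
    """Fuzzy c-means sketch: memberships in [0, 1] that sum to 1 per object.

    m is the fuzzy weight; m -> 1 approaches a hard partition, large m
    drives every membership toward 1/k.
    """
    rng = random.Random(seed)
    n = len(points)
    # random partition matrix U: U[j][i] = membership of object j in cluster i
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        U.append([u / s for u in row])

    def dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5

    centers = []
    for _ in range(max_iter):
        # step 1: cluster centres as membership-weighted means
        centers = []
        for i in range(k):
            w = [U[j][i] ** m for j in range(n)]
            centers.append(tuple(
                sum(wj * p[dim] for wj, p in zip(w, points)) / sum(w)
                for dim in range(len(points[0]))))
        # step 2: distances between objects and cluster centres
        D = [[max(dist(points[j], centers[i]), 1e-12) for i in range(k)]
             for j in range(n)]
        # step 3: update the partition matrix
        exp = 2.0 / (m - 1.0)
        newU = [[1.0 / sum((D[j][i] / D[j][l]) ** exp for l in range(k))
                 for i in range(k)] for j in range(n)]
        # terminate when changes in the partition matrix are negligible
        change = max(abs(newU[j][i] - U[j][i])
                     for j in range(n) for i in range(k))
        U = newU
        if change < tol:
            break
    return U, centers

# two well-separated groups; memberships should become nearly crisp
U, centers = fuzzy_cmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```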
25. Clustering Software
- Cluster 3.0 (for gene expression data analysis)
- PyCluster (Python module)
- Algorithm::Cluster (Perl package)
- C clustering library
http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster/software.htm
26. Outlook
27. Thanks for your attention!