Title: Bioinformatics Pattern recognition - Clustering
1. Bioinformatics Pattern Recognition - Clustering
- Ulf Schmitz
- ulf.schmitz_at_informatik.uni-rostock.de
- Bioinformatics and Systems Biology Group
- www.sbi.informatik.uni-rostock.de
2. Outline
- Introduction
- Hierarchical clustering
- Partitional clustering
- k-means and derivatives
- Fuzzy Clustering
3. Introduction to Clustering Algorithms
- Clustering is the classification of similar objects into separate groups - or the partitioning of a data set into subsets (clusters) - so that the data in each subset (ideally) share some common trait
- Machine learning typically regards clustering as a form of unsupervised learning
- we distinguish
  - Hierarchical Clustering (finds successive clusters using previously established clusters)
  - Partitional Clustering (determines all clusters at once)
4. Introduction to Clustering Algorithms
Applications
- gene expression data analysis
- identification of regulatory binding sites
- phylogenetic tree clustering (for inference of horizontally transferred genes)
- protein domain identification
- identification of structural motifs
5. Introduction to Clustering Algorithms
Data matrix
- the data matrix X collects observations of n objects, described by m measurements
- rows refer to objects, characterised by the values in the columns
- if the units of measurement associated with the columns of X differ, it is necessary to normalise the data
- column vector: x_j = (x_1j, ..., x_nj)^T
- mean: x̄_j = (1/n) * Σ_i x_ij
- standard deviation: s_j = sqrt( (1/(n-1)) * Σ_i (x_ij - x̄_j)^2 )
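The normalisation the slide calls for can be sketched as a z-score transform of each column. The function name is mine, and the sample standard deviation (denominator n-1, as in the formula above) is assumed:

```python
def normalize_columns(X):
    """Scale each column of X to zero mean and unit standard deviation.

    X is a list of n rows (objects); columns are the m measurements.
    Minimal sketch without external libraries; assumes n > 1 and
    non-constant columns.
    """
    n = len(X)
    m = len(X[0])
    # column means: mean_j = (1/n) * sum_i x_ij
    means = [sum(row[j] for row in X) / n for j in range(m)]
    # sample standard deviation with denominator n - 1
    stds = [
        (sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)) ** 0.5
        for j in range(m)
    ]
    return [
        [(row[j] - means[j]) / stds[j] for j in range(m)]
        for row in X
    ]
```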
6. Hierarchical clustering
produces a sequence of nested partitions; the steps are
- find the dis/similarity between every pair of objects in the data set by evaluating a distance measure
- group the objects into a hierarchical cluster tree (dendrogram) by linking newly formed clusters
- obtain a partition of the data set into clusters by selecting a suitable cut-level of the cluster tree
7. Hierarchical clustering
Agglomerative Hierarchical clustering
- start with n clusters, each containing one object, and calculate the distance matrix D1
- determine from D1 which of the objects are least distant (e.g. I and J)
- merge these objects into one cluster and form a new distance matrix by deleting the entries for the clustered objects and adding distances for the new cluster
- repeat steps 2 and 3 a total of n-1 times, until a single cluster is formed
- record which clusters are merged at each step
- record the distances between the clusters that are merged in that step
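The steps above can be sketched in Python; `agglomerate` is a hypothetical helper, and single linkage (minimum pairwise distance) is assumed as the intercluster distance - the slides discuss the linkage choices later. The usage example runs it on the distance matrix shown on slide 11:

```python
def agglomerate(D, labels):
    """Single-linkage agglomerative clustering on a precomputed distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    # step 1: start with n singleton clusters
    clusters = [frozenset([name]) for name in labels]
    dist = {(a, b): D[i][j]
            for i, a in enumerate(labels)
            for j, b in enumerate(labels)}

    def d(c1, c2):
        # single linkage: minimum distance over all cross-cluster pairs
        return min(dist[(a, b)] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        # step 2: find the least distant pair of clusters
        pairs = [(d(c1, c2), c1, c2)
                 for i, c1 in enumerate(clusters)
                 for c2 in clusters[i + 1:]]
        dmin, c1, c2 = min(pairs, key=lambda t: t[0])
        # step 3: merge them and record the step
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((set(c1), set(c2), dmin))
    return merges

# distance matrix for the five objects x1..x5 of slide 11
D = [[0.0000, 2.9155, 1.0000, 3.0414, 3.0414],
     [2.9155, 0.0000, 2.5495, 3.3541, 2.5000],
     [1.0000, 2.5495, 0.0000, 2.0616, 2.0616],
     [3.0414, 3.3541, 2.0616, 0.0000, 1.0000],
     [3.0414, 2.5000, 2.0616, 1.0000, 0.0000]]
merges = agglomerate(D, ["x1", "x2", "x3", "x4", "x5"])
# first two merges: {x1, x3} and {x4, x5}, both at distance 1.0
```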
8. Hierarchical clustering
calculating the distances
- one treats the data matrix X as a set of n (row) vectors with m elements
- Euclidean distance: d(x_i, x_j) = sqrt( Σ_k (x_ik - x_jk)^2 ), where x_i and x_j are row vectors of X
- city block distance: d(x_i, x_j) = Σ_k |x_ik - x_jk|
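The two distance measures translate directly into code (function names are mine):

```python
def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def city_block(x, y):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))
```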
9. Hierarchical clustering
an example: Euclidean distance and city block distance
10. Hierarchical clustering
Agglomerative Hierarchical clustering (steps repeated from slide 7)
11. Hierarchical clustering
distance matrix

        x1      x2      x3      x4      x5
x1      0       2.9155  1.0000  3.0414  3.0414
x2      2.9155  0       2.5495  3.3541  2.5000
x3      1.0000  2.5495  0       2.0616  2.0616
x4      3.0414  3.3541  2.0616  0       1.0000
x5      3.0414  2.5000  2.0616  1.0000  0

after merging x1 with x3, and x4 with x5:

        x1,x3   x2      x4,x5
x1,x3   0       2.9155  2.0616
x2      2.9155  0       2.5000
x4,x5   2.0616  2.5000  0
12. Hierarchical clustering
Methods to define a distance d_IJ between clusters I and J
- single linkage: d_IJ = the minimum distance between any member of I and any member of J
- complete linkage: d_IJ = the maximum distance between any member of I and any member of J
- group average: d_IJ = the average over all N_I * N_J cross-cluster pairs, where N is the number of members in a cluster (e.g. for I = {1, 2} and J = {3, 4, 5}: the average of d13, d14, d15, d23, d24, d25)
- centroid linkage: d_IJ = the distance between the centroids of I and J
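The four linkage rules can be sketched as functions that take two clusters (lists of points) and a point-level distance function d; all names here are mine:

```python
def single_linkage(C1, C2, d):
    """Minimum distance over all cross-cluster pairs."""
    return min(d(x, y) for x in C1 for y in C2)

def complete_linkage(C1, C2, d):
    """Maximum distance over all cross-cluster pairs."""
    return max(d(x, y) for x in C1 for y in C2)

def group_average(C1, C2, d):
    """Average over all N1 * N2 cross-cluster pairs (N = cluster size)."""
    return sum(d(x, y) for x in C1 for y in C2) / (len(C1) * len(C2))

def centroid_linkage(C1, C2, d):
    """Distance between the cluster centroids (coordinate-wise means)."""
    c1 = tuple(sum(v) / len(C1) for v in zip(*C1))
    c2 = tuple(sum(v) / len(C2) for v in zip(*C2))
    return d(c1, c2)
```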
13. Hierarchical clustering
14. Hierarchical clustering
Agglomerative Hierarchical clustering (steps repeated from slide 7)
15. Hierarchical clustering
Limits of hierarchical clustering
- the choice of distance measure is important
- there is no provision for reassigning objects that have been incorrectly grouped
- errors are not handled explicitly in the procedure
- no method of calculating intercluster distances is universally the best
- single-linkage clustering tends to be the least successful
- group average clustering tends to perform fairly well
16. Partitional clustering: k-means
- involves prior specification of the number of clusters, k
- no pairwise distance matrix is required
- the relevant distance is the distance from each object to the cluster centre (centroid)
17. Partitional clustering: k-means
- partition the objects into k clusters (can be done by random partitioning or by arbitrarily clustering around two or more objects)
- calculate the centroids of the clusters
- assign or reassign each object to the cluster whose centroid is closest (distance is calculated as Euclidean distance)
- recalculate the centroids of the new clusters formed after the gain or loss of objects to or from the previous clusters
- repeat steps 3 and 4 for a predetermined number of iterations or until membership of the groups no longer changes
18. Partitional clustering: k-means

object  x1  x2
A       1   1
B       3   1
C       4   8
D       8   10
E       9   6

step 1: make an arbitrary partition of the objects into clusters, e.g. A, B and C into Cluster 1, and D and E into Cluster 2
step 2: calculate the centroids of the clusters: cluster 1: (2.67, 3.33), cluster 2: (8.5, 8)
step 3: calculate the Euclidean distance between each object and each of the two cluster centroids:

object  d(x, c1)  d(x, c2)
A       2.87      10.26
B       2.35      8.90
C       4.86      4.50
D       8.54      2.06
E       6.87      2.06
19. Partitional clustering: k-means
(algorithm steps repeated from slide 17)
20. Partitional clustering: k-means
step 4: C turns out to be closer to Cluster 2 and has to be reassigned; repeat steps 2 and 3
new centroids: cluster 1: (2, 1), cluster 2: (7, 8)

object  d(x, c1)  d(x, c2)
A       1.00      9.22
B       1.00      8.06
C       7.28      3.00
D       10.82     2.24
E       8.60      2.83

no further reassignments are necessary
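The iteration in this worked example can be reproduced with a minimal sketch of the algorithm; `kmeans` is a hypothetical helper, and it assumes no cluster ever becomes empty:

```python
def kmeans(points, assignment, max_iter=100):
    """k-means given an initial assignment (cluster index per object).

    Assumes every cluster keeps at least one member throughout.
    """
    k = max(assignment) + 1
    centroids = []
    for _ in range(max_iter):
        # steps 2/4: centroids as coordinate-wise means of each cluster
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        # step 3: (re)assign each object to the closest centroid
        new = [min(range(k),
                   key=lambda c: sum((pi - ci) ** 2
                                     for pi, ci in zip(p, centroids[c])))
               for p in points]
        if new == assignment:
            break  # step 5: membership no longer changes
        assignment = new
    return assignment, centroids

# objects A..E and the arbitrary initial partition from step 1
points = [(1, 1), (3, 1), (4, 8), (8, 10), (9, 6)]
labels, centroids = kmeans(points, [0, 0, 0, 1, 1])
# C is reassigned to the second cluster; final clusters: {A, B} and {C, D, E}
```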
21. Partitional clustering: k-means
22. Fuzzy clustering
- is an extension of k-means clustering
- an object belongs to a cluster to a certain degree
- for each object, the degrees of membership in the k clusters add up to one
- a fuzzy weight m is introduced, which determines the fuzziness of the resulting clusters
- for m → 1, the clustering becomes a hard partition
- for m → ∞, the degrees of membership approximate 1/k
- typical values are m = 1.25 and m = 2
23. Fuzzy clustering
fix k (2 ≤ k < n), choose a distance measure (Euclidean, city block, etc.), a termination tolerance δ > 0 (e.g. 0.01 or 0.001), and fix the fuzzy weight m (1 < m < ∞); initialise the first partition matrix U randomly
step 1: compute the cluster centres: c_i = Σ_j (u_ij)^m x_j / Σ_j (u_ij)^m
step 2: compute the distances d_ij between the cluster centres c_i and the objects x_j
24. Fuzzy clustering
step 3: update the partition matrix: u_ij = 1 / Σ_l (d_ij / d_lj)^(2/(m-1))
repeat steps 1-3 until ||U(t) - U(t-1)|| < δ
the algorithm is terminated when changes in the partition matrix are negligible
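Steps 1-3 can be sketched as a minimal fuzzy c-means loop. The function name, the random initialisation scheme, and the small floor on distances (to avoid division by zero) are my assumptions:

```python
import random

def fuzzy_cmeans(points, k, m=2.0, tol=1e-3, max_iter=100, seed=0):
    """Fuzzy c-means sketch: memberships in [0, 1] that sum to 1 per object.

    m is the fuzzy weight; m -> 1 approaches a hard partition, large m
    drives every membership toward 1/k.
    """
    rng = random.Random(seed)
    n = len(points)
    # random partition matrix U: U[j][i] = membership of object j in cluster i
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        U.append([u / s for u in row])

    def dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5

    centers = []
    for _ in range(max_iter):
        # step 1: cluster centres as membership-weighted means
        centers = []
        for i in range(k):
            w = [U[j][i] ** m for j in range(n)]
            centers.append(tuple(
                sum(wj * p[dim] for wj, p in zip(w, points)) / sum(w)
                for dim in range(len(points[0]))))
        # step 2: distances between objects and cluster centres
        D = [[max(dist(points[j], centers[i]), 1e-12) for i in range(k)]
             for j in range(n)]
        # step 3: update the partition matrix
        exp = 2.0 / (m - 1.0)
        newU = [[1.0 / sum((D[j][i] / D[j][l]) ** exp for l in range(k))
                 for i in range(k)] for j in range(n)]
        # terminate when changes in the partition matrix are negligible
        change = max(abs(newU[j][i] - U[j][i])
                     for j in range(n) for i in range(k))
        U = newU
        if change < tol:
            break
    return U, centers

# two well-separated groups; memberships should become nearly crisp
U, centers = fuzzy_cmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```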
25. Clustering Software
- Cluster 3.0 (for gene expression data analysis)
- PyCluster (Python module)
- Algorithm::Cluster (Perl package)
- C clustering library
http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster/software.htm
26. Outlook
27. Thanks for your attention!