Title: Network clustering
1. Network clustering
- Presented by Wooyoung Kim
- 2/6/2009
- CSc 8910 Analysis of Biological Network, Spring 2009 - Dr. Yi Pan
2. Outline
- Introduction
- Definitions and basic concepts
- Network clustering problem
- Hierarchical clustering
- Clique-based clustering
- Center-based clustering
- Conclusion
3. Introduction
- Clustering is
  - Loosely defined as the process of grouping objects into sets called clusters, so that each cluster consists of elements that are similar in some way
- Examples
  - Distance-based clustering: objects are close under a given distance metric
  - Conceptual clustering: based on descriptive concepts
4. Introduction
- Clustering is used for multiple purposes, including
  - Finding natural clusters (modules) and describing their properties
  - Classifying the data
  - Detecting unusual data points (outliers)
  - Data reduction: treating a cluster, or one of its elements, as a single representative unit
5. Introduction
- Network clustering
  - Deals with clustering data represented as a network or a graph (link analysis)
  - Data points are represented by vertices
  - An edge exists if two data points are similar or related in a certain way
- Similarity criteria
  - Pairwise relations for the network model
  - Cohesiveness for cluster similarity
6. Introduction
- Network clustering approaches are used to perform
  - Distance-based clustering
    - Vertices are data points, and edges join close points
    - Alternatively, distances can be used to weight the edges of a complete graph
7. Introduction
- Conceptual clustering
  - Generate a concept description for each generated cluster
  - Designate a matching field in database networks; vertices are then connected by an edge if the two matching fields are close
- Examples
  - Protein interaction networks: proteins are vertices, and a pair is connected by an edge if they are known to interact
  - Gene co-expression networks: genes are vertices, and an edge indicates that the pair of genes (its end points) is co-expressed over some cut-off value, based on microarray experiments
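As a minimal sketch of this network construction (the gene names, similarity scores, and cut-off below are illustrative, not from the slides), two items are joined by an edge whenever their pairwise similarity exceeds a threshold:

```python
# Sketch: build a network from pairwise similarity scores (e.g. gene
# co-expression values), connecting two items when their similarity
# reaches a cut-off.  All values here are made up for illustration.

def build_similarity_graph(sim, cutoff):
    """sim: dict mapping frozenset({u, v}) -> similarity score.
    Returns an adjacency dict {vertex: set of neighbours}."""
    adj = {}
    for pair, score in sim.items():
        u, v = tuple(pair)
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if score >= cutoff:
            adj[u].add(v)
            adj[v].add(u)
    return adj

sim = {frozenset({"g1", "g2"}): 0.9,
       frozenset({"g1", "g3"}): 0.2,
       frozenset({"g2", "g3"}): 0.8}
g = build_similarity_graph(sim, cutoff=0.7)
```

The adjacency-dict representation used here is reused by the other sketches in this deck.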
8. Introduction
- Applications of network clustering
  - Understanding the structure and function of proteins based on protein interaction maps of organisms
  - Clustering protein interaction networks (PINs) using cliques to decompose the network into functional modules and protein complexes
  - Using cliques and other high-density subgraphs to identify protein complexes (splicing machinery, transcription factors, etc.) and functional modules (signalling cascades, cell cycle regulation)
9. Introduction
- Applications of network clustering
  - Protein complexes: groups of proteins that interact with each other at the same time and place
  - Functional modules: groups of proteins that are known to have pairwise interactions, binding with each other to participate in different cellular processes at different times
10. Definitions and basic concepts
- G = (V, E) is a simple, undirected graph
  - n = |V|, e = |E|
- Ḡ is the complement graph of G
  - Its edge set Ē is the complement of E
- G(S) is the subgraph of G induced by a subset S of V
- N(v) is the set of neighbours of a vertex v in G (excluding v)
  - Degree: deg(v) = |N(v)|
  - Closed neighbourhood: N[v] = N(v) ∪ {v}
- d(u,v) is the distance between u and v
  - The length of a shortest path from u to v
- diam(G) = max d(i,j), over all vertex pairs i and j, is the diameter of the graph
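The distance and diameter definitions above translate directly into breadth-first search; a small self-contained sketch (the adjacency-dict graph representation is an assumption of these notes, not from the slides):

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path (hop) distances d(source, v) in an unweighted graph."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter(adj):
    """diam(G) = max over all pairs of d(i, j), for a connected graph."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)
```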
11. Definitions and basic concepts
- Edge connectivity λ(G) of a graph: the minimum number of edges that must be removed to disconnect the graph
- Vertex connectivity (or connectivity) κ(G) of a graph: the minimum number of vertices whose removal disconnects the graph (or results in a trivial graph)
  - Trivial graph: one vertex, no edges
  - Connected graph: every pair of vertices is joined by a path
12. Definitions and basic concepts
Example: (vertex) connectivity κ(G) = 2; removal of vertices 3 and 5 would disconnect the graph
[Figure: example graph on vertices 1-12]
13. Definitions and basic concepts
Example: edge connectivity λ(G) = 2; removal of edges (9,11) and (10,12) would disconnect the graph
[Figure: example graph on vertices 1-12]
14. Definitions and basic concepts
- A clique C is a subset of vertices such that an edge exists between every pair of vertices in C
  - The induced subgraph G(C) is a complete graph
  - A clique is maximal if it is not a subset of any larger clique
  - A clique is maximum if there is no larger clique in the graph
- A subset of vertices I is called an independent set (also called a stable set) if for every pair of vertices i, j in I, (i, j) is not an edge
  - The induced subgraph G(I) is edgeless
  - An independent set is maximal if it is not a subset of any larger independent set
  - An independent set is maximum if there is no larger independent set in the graph
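These definitions can be checked mechanically; a short sketch (graph representation as an adjacency dict is an assumption of these notes):

```python
from itertools import combinations

def is_clique(adj, S):
    """Every pair in S must be joined by an edge."""
    return all(v in adj[u] for u, v in combinations(S, 2))

def is_independent_set(adj, S):
    """No pair in S may be joined by an edge."""
    return all(v not in adj[u] for u, v in combinations(S, 2))

def is_maximal_clique(adj, S):
    """S is a clique that no outside vertex can extend."""
    return is_clique(adj, S) and not any(
        is_clique(adj, set(S) | {w}) for w in adj if w not in S)
```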
15. Definitions and basic concepts
Example: maximal clique; {1,2,3} is a maximal clique
[Figure: example graph on vertices 1-12]
16. Definitions and basic concepts
Example: maximum clique; {7,8,9,10} is the maximum clique
[Figure: example graph on vertices 1-12]
17. Definitions and basic concepts
Example: maximal independent set; I = {3,7,11} is a maximal independent set, as there is no larger independent set containing it
18. Definitions and basic concepts
Example: maximum independent set; the set {1,4,5,10,11} is a maximum independent set, one of largest cardinality in the graph
19. Definitions and basic concepts
20. Definitions and basic concepts
Algorithm for maximal independent set
21. Definitions and basic concepts
Algorithm for maximal independent set (worked on an example)
[Figure: example graph on vertices 1-7]
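The algorithm itself did not survive in this transcript; a standard greedy sketch that produces a maximal independent set (scan the vertices, keep any vertex whose neighbours are all still uncovered) is:

```python
def maximal_independent_set(adj, order=None):
    """Greedily build a maximal independent set: visit vertices in the
    given order, adding each one not already covered by the set."""
    I = set()
    covered = set()          # vertices in I or adjacent to I
    for v in (order or sorted(adj)):
        if v not in covered:
            I.add(v)
            covered.add(v)
            covered.update(adj[v])
    return I
```

Every vertex ends up covered, so no vertex can be added: the result is maximal by construction (though not necessarily maximum).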
22. Definitions and basic concepts
- A dominating set is a set of vertices such that every vertex in the graph is either in this set or has a neighbour in this set
  - A dominating set is minimal if it contains no proper subset which is dominating
  - A dominating set is minimum if it is of the smallest cardinality
  - The cardinality of a minimum dominating set is called the domination number γ(G) of the graph
23. Definitions and basic concepts
D = {7, 11, 3} is a minimal and minimum dominating set
[Figure: example graph on vertices 1-12]
24. Definitions and basic concepts
- A connected dominating set is one in which the subgraph induced by the dominating set is connected
- An independent dominating set is one in which the dominating set is also independent
25. Network clustering problem
- Given a graph G = (V, E), find subsets (not necessarily disjoint) V1, ..., Vr of V such that V = V1 ∪ ... ∪ Vr, where
  - Each subset is a cluster, modelled by structures such as cliques or other distance- and diameter-based models
  - The model used as a cluster represents the cohesiveness required of the cluster
26. Network clustering problem
- Clustering models can be classified
  - By the constraints on relations between clusters (clusters may be disjoint or overlapping)
  - By the objective function used to achieve the goal of clustering (minimizing the number of clusters or maximizing the cohesiveness)
- When clusters are required to be disjoint
  - V1, ..., Vr is a cluster partition: exclusive clustering
- When clusters are allowed to overlap
  - V1, ..., Vr is a cluster cover: overlapping clustering
27. Network clustering problem
- Assume there is a measure of cohesiveness of a cluster that can be varied for a graph G; this defines two types of optimization problems
  - Type I: minimize the number of clusters while ensuring that every cluster formed has cohesiveness over a prescribed threshold
    - Example: the problem of clustering an incomplete graph, with cliques used as clusters and the objective of minimizing the number of clusters
28. Network clustering problem
- Type II: maximize the cohesiveness of each cluster formed, while the number of clusters is at most K (this last requirement may be relaxed by setting K to infinity)
  - Example: assume that G has non-negative edge weights w; for a cluster Vi, let Ei denote the edges in the subgraph induced by Vi
    - Use w as a dissimilarity measure (distance)
    - Then, for example, w(Ei) = Σ_{e ∈ Ei} w(e) is a meaningful measure of cohesiveness
    - It can be used to formulate a Type II clustering problem
- We will refer to problems as Type I and Type II based on their objective
29. Hierarchical clustering
- After performing clustering, we can abstract the graph G0 to a graph G1 = (V1, E1) as follows
  - There is a vertex v_i^1 in V1 for every subset (cluster) V_i^0
  - There is an edge between v_i^1 and v_j^1 if and only if there exist a vertex x in cluster Vi and a vertex y in cluster Vj with (x, y) an edge of G0
- In other words, if any two vertices from different clusters have an edge between them in the original graph G0, the clusters containing them are made adjacent in G1
30. Hierarchical clustering
- We can recursively cluster the abstracted graph G1 in a similar fashion to obtain a multilevel hierarchy
  - This process is called hierarchical clustering
- Example
  - The following subsets form clusters in this graph: C1 = {7,8,9,10}, C2 = {1,2,3}, C3 = {4,6}, C4 = {11,12}, C5 = {5}
  - Given the clusters of the example graph G0, we can construct an abstracted graph G1
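The abstraction step can be sketched as follows (the graph and clusters in the test are illustrative, since the example graph's edge list is not in this transcript):

```python
def abstract_graph(adj, clusters):
    """Collapse each cluster to a single vertex; two cluster-vertices
    become adjacent iff some edge of the original graph crosses them.
    clusters: dict mapping cluster id -> set of original vertices."""
    # which cluster each original vertex belongs to
    owner = {v: c for c, members in clusters.items() for v in members}
    abs_adj = {c: set() for c in clusters}
    for u in adj:
        for v in adj[u]:
            cu, cv = owner[u], owner[v]
            if cu != cv:
                abs_adj[cu].add(cv)
                abs_adj[cv].add(cu)
    return abs_adj
```

Applying the same clustering procedure to the returned graph yields the next level of the hierarchy.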
31. Clique-based clustering
- A clique is a natural choice for a highly cohesive cluster
- Cliques have
  - The minimum possible diameter
  - The maximum connectivity
  - The maximum possible degree for each vertex
- Given an arbitrary graph
  - Type I approaches try to partition it into (or cover it using) a minimum number of cliques
  - Type II approaches usually work with a weighted complete graph, and hence every partition of the vertex set is a clique partition
32. Clique-based clustering
- Minimum clique partitioning
  - The Type I clique partitioning and clique covering problems are both NP-hard [Garey79]
  - Heuristic approaches are preferred for large graphs
  - Note that the clique-partitioning and clique-covering problems are closely related
    - The minimum numbers of clusters produced in clique covering and in partitioning are the same
33. Clique-based clustering
34. Clique-based clustering
Simple heuristic for clique partitioning
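The heuristic itself is not reproduced in this transcript; one common greedy sketch (an assumption of these notes: grow a clique among the unassigned vertices, remove it as a cluster, repeat) is:

```python
def greedy_clique_partition(adj):
    """Heuristic clique partition: repeatedly grow a clique greedily
    among the vertices not yet assigned, then remove it as a cluster."""
    remaining = set(adj)
    clusters = []
    while remaining:
        v = min(remaining)        # arbitrary start vertex for a new clique
        clique = {v}
        for u in sorted(remaining - {v}):
            # u joins only if adjacent to every current clique member
            if all(u in adj[w] for w in clique):
                clique.add(u)
        clusters.append(clique)
        remaining -= clique
    return clusters
```

This gives a valid (not necessarily minimum) clique partition in polynomial time.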
35. Clique-based clustering
Example
[Figure: example graph on vertices 1-12]
36. Clique-based clustering
Example
37. Clique-based clustering
Example
38. Clique-based clustering
- Min-max k-clustering
  - A Type II clique partitioning problem with a min-max objective
  - Consider a weighted complete graph G = (V, E) with weights w(e1) ≤ w(e2) ≤ ... ≤ w(em), m = n(n-1)/2
  - Partition the graph into no more than k cliques such that the maximum weight of an edge between two vertices inside a clique is minimized
  - In other words, if V1, ..., Vk is the clique partition, then we wish to minimize max_{i=1..k} max_{u,v ∈ Vi} w(u,v)
  - The weight w(i,j) can be thought of as a measure of dissimilarity
    - A larger w(i,j) means i and j are more dissimilar
  - The problem tries to cluster the graph into at most k cliques such that the maximum dissimilarity between any two vertices inside a clique is minimized
39. Clique-based clustering
- Min-max k-clustering
  - Given any graph G = (V, E), the required edge-weighted complete graph can be obtained in different ways using meaningful measures of dissimilarity
    - The weight w(i,j) could be d(i,j), the shortest distance between i and j in G
    - The weight could be based on κ(i,j) and λ(i,j), the minimum numbers of vertices and edges, respectively, that need to be removed from G to disconnect i and j
    - Since these are measures of similarity, we could obtain the required weights as w(i,j) = |V| - κ(i,j) or w(i,j) = |E| - λ(i,j)
40. Clique-based clustering
41. Clique-based clustering
Bottleneck heuristic for the min-max k-clustering problem
42. Clique-based clustering
- The procedure bottleneck() returns the bottleneck graph G_i = (V, {e1, ..., ei}), containing the i smallest-weight edges
- MIS() is an arbitrary procedure for finding a maximal independent set (MIS) in a graph
- The algorithm would be optimal if we managed to find a maximum independent set (one of largest size) in every iteration
  - That problem is NP-hard
  - To obtain a polynomial-time algorithm, we have to restrict ourselves to finding an MIS using heuristic approaches, such as the greedy approach described earlier
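As an illustrative bottleneck-style sketch (this is my own simplification, not the slides' exact procedure: it raises the weight threshold and greedily partitions the threshold graph into cliques until at most k remain):

```python
def minmax_k_clustering(vertices, w, k):
    """Bottleneck-style heuristic for min-max k-clustering.
    w: dict frozenset({u, v}) -> weight for every pair (complete graph).
    Returns (clusters, threshold) for the smallest workable threshold."""
    weights = sorted(set(w.values()))
    for t in weights:
        # threshold (bottleneck) graph: keep only pairs with weight <= t
        adj = {v: set() for v in vertices}
        for pair, wt in w.items():
            if wt <= t:
                u, v = tuple(pair)
                adj[u].add(v)
                adj[v].add(u)
        # greedy clique partition of the threshold graph
        remaining, parts = set(vertices), []
        while remaining:
            v0 = min(remaining)
            clique = {v0}
            for u in sorted(remaining - {v0}):
                if all(u in adj[x] for x in clique):
                    clique.add(u)
            parts.append(clique)
            remaining -= clique
        if len(parts) <= k:
            return parts, t
    return [set(vertices)], weights[-1]
```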
43. Clique-based clustering
Example: clustering output of the bottleneck min-max k-clustering algorithm with k = 2 for the following graph
44. Clique-based clustering
Result
45. Center-based clustering
- In center-based clustering models, the elements of a cluster are determined based on their similarity with the cluster's center (or cluster-head)
- Center-based clustering algorithms usually consist of two steps
  - First, an optimization procedure is used to determine the cluster-heads
  - Second, the cluster-heads are used to form clusters around them
46. Center-based clustering
- Clustering with dominating sets
  - The minimum dominating set and related problems provide a modelling tool for centre-based clustering of Type I
  - The minimum dominating set problem is NP-hard, so heuristic approaches and approximation algorithms are used to find a small dominating set
  - If D denotes a dominating set, then for each vertex v in D the closed neighbourhood N[v] forms a cluster
  - By the definition of domination, every vertex not in the dominating set has a neighbour in it, and hence is assigned to some cluster
47. Center-based clustering
- Each v in D is called a cluster-head, and the number of clusters that result is exactly the size of the dominating set
- Minimizing the size of the dominating set minimizes the number of clusters produced, resulting in a Type I clustering problem
- This approach yields a cluster cover, since the resulting clusters need not be disjoint
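The closed-neighbourhood construction above is a one-liner (the graph and dominating set in the test are illustrative):

```python
def clusters_from_dominating_set(adj, D):
    """Each cluster-head v in D yields the cluster N[v] = {v} plus its
    neighbours.  Clusters may overlap: this is a cluster cover, not
    necessarily a partition."""
    return {v: {v} | adj[v] for v in D}
```

Because D is dominating, the union of the clusters covers every vertex of the graph.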
48. Center-based clustering
- Each cluster has diameter at most two, as every vertex in the cluster is adjacent to its cluster-head, and the cluster-head is similar to all the other vertices in its cluster
- However, neighbours of the cluster-head may be poorly connected among themselves
- Some post-processing may be required, as a cluster formed in this fashion from an arbitrary dominating set could completely contain another cluster
- Clustering with dominating sets is especially suited for clustering protein interaction networks
  - It can reveal groups of proteins that interact through a central protein, which is identified as a cluster-head in this method
49. Center-based clustering
- Independent dominating sets
  - Finding a maximal independent set also yields a minimal independent dominating set
    - It can be used in clustering the graph
  - Here, no cluster formed can completely contain another cluster, as the cluster-heads are independent and distinct
50. Center-based clustering
- Greedy algorithm for minimal independent dominating sets
  - Proceeds by adding a maximum-degree vertex to the current independent set and then deleting that vertex along with its neighbours
  - It is greedy because it adds a maximum-degree vertex, so that a larger number of vertices is removed in each iteration, yielding a small independent dominating set
  - This is repeated until no vertices remain
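The steps above can be sketched directly (tie-breaking between equal-degree vertices is left to Python's max and is an assumption):

```python
def greedy_independent_dominating_set(adj):
    """Repeatedly pick a vertex of maximum remaining degree, add it to
    the set, and delete it together with its neighbours."""
    live = {v: set(adj[v]) for v in adj}   # mutable working copy
    D = set()
    while live:
        v = max(live, key=lambda u: len(live[u]))
        D.add(v)
        removed = {v} | live[v]
        for r in removed:
            live.pop(r, None)
        for u in live:
            live[u] -= removed
    return D
```

The result is independent by construction (neighbours of chosen vertices are deleted) and dominating (every deleted vertex was a chosen vertex or its neighbour).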
51. Center-based clustering
Progress of the greedy minimal independent dominating set algorithm on the example graph
52. Center-based clustering
- Connected dominating sets
  - In some situations, cluster-heads should be connected rather than independent
  - A connected dominating set (CDS) is a dominating set D such that G(D) is connected
  - Finding a minimum CDS is NP-hard
    - An approximation algorithm is needed
53. Center-based clustering
- Greedy vertex-elimination heuristic for finding a CDS [Butenko04]
  - Let D = V and F = ∅ initially
  - Pick a minimum-degree vertex u and delete it from D, if its deletion does not disconnect G(D); otherwise u is added to the set F (u is fixed)
  - If N(u) ∩ F is empty after u is deleted, a vertex of maximum degree in G(D) that is also in N(u) is fixed, ensuring that u is dominated
  - In every iteration, D is connected and is a dominating set in G
  - The algorithm terminates when all the vertices left in D are fixed (D = F); that D is the output CDS of the algorithm
54. Center-based clustering
greedy_CDS_algorithm(connected graph G = (V, E))
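A self-contained sketch of the vertex-elimination heuristic described on the previous slide (details such as tie-breaking among minimum-degree vertices are assumptions of these notes):

```python
def greedy_cds(adj):
    """Vertex-elimination CDS heuristic: start with D = V, repeatedly
    delete a minimum-degree non-fixed vertex while G(D) stays connected,
    fixing a dominator for each deleted vertex."""
    def connected(vs):
        vs = set(vs)
        if not vs:
            return True
        seen, stack = set(), [next(iter(vs))]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend(adj[u] & vs - seen)
        return seen == vs

    D, F = set(adj), set()
    while D - F:
        # minimum-degree (within G(D)) vertex that is not yet fixed
        u = min(D - F, key=lambda v: len(adj[v] & D))
        if connected(D - {u}):
            D.remove(u)
            if not (adj[u] & F):
                # fix a dominator for u: a max-degree neighbour still in D
                w = max(adj[u] & D, key=lambda v: len(adj[v] & D))
                F.add(w)
        else:
            F.add(u)          # deleting u would disconnect G(D): fix it
    return D
```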
55. Center-based clustering
CDS
[Figure: example graph on vertices 1-12, with the CDS marked]
56. Center-based clustering
Cluster cover
[Figure: example graph on vertices 1-12]
57. Center-based clustering
Trace of the greedy CDS heuristic on the example graph:

D                 | F            | u  | N(u) ∩ D | fixed w
V                 | ∅            | 1  | {2,3}    | 3
V \ {1}           | {3}          | 2  | ---      | ---
V \ {1,2}         | {3}          | 4  | ---      | ---
V \ {1,2,4}       | {3}          | 6  | {7}      | 7
V \ {1,2,4,6}     | {3,7}        | 11 | {9,12}   | 9
{3,5,7,8,9,10,12} | {3,7,9}      | 12 | {10}     | 10
{3,5,7,8,9,10}    | {3,7,9,10}   | 5  | ---      | ---
{3,5,7,8,9,10}    | {3,7,9,10,5} | 8  | ---      | ---
{3,5,7,9,10}      | {3,7,9,10,5} |    |          |
58. Center-based clustering
- k-center clustering
  - A variant of the k-means clustering approach [Birnhaum03]
  - Seeks to identify k cluster-heads such that some objective function measuring the dissimilarity of the members of a cluster is minimized
  - Different choices of dissimilarity measures and objectives often yield different clustering problems and solutions
59. Center-based clustering
- The k-center clustering problem is a Type II center-based clustering model with a min-max objective [Hochbaum85]

bottleneck_k-center_algorithm(connected graph G = (V, E), with weights)
60. Center-based clustering
- bottleneck_k-center_algorithm
  - Let I be MIS(G_i²), a maximal independent set in the square of the bottleneck graph. Then the distance between every pair of vertices in I is at least three in the original graph G.
  - No vertex in I can dominate any other vertex of I, since they are distance three apart.
  - Therefore, any dominating set in G is at least as large as I.
  - In the algorithm, if we find an MIS of G_i² of size k+1 or more, then no dominating set of size k exists in G.
  - Thus, we proceed until we find an MIS of size at most k, and terminate.
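A sketch of this bottleneck k-center scheme (increase the weight threshold, take a greedy MIS in the square of the bottleneck graph, stop once it has at most k vertices; the example weights are illustrative):

```python
def bottleneck_k_center(vertices, w, k):
    """For increasing thresholds t, build the bottleneck graph (edges of
    weight <= t), square it, and take a greedy maximal independent set;
    return the first such set of size <= k as the cluster-heads."""
    for t in sorted(set(w.values())):
        adj = {v: set() for v in vertices}
        for pair, wt in w.items():
            if wt <= t:
                u, v = tuple(pair)
                adj[u].add(v)
                adj[v].add(u)
        # square of the graph: also join vertices with a common neighbour
        sq = {v: set(adj[v]) for v in vertices}
        for v in vertices:
            for u in adj[v]:
                sq[v] |= adj[u] - {v}
        # greedy MIS in the squared graph
        I, covered = set(), set()
        for v in sorted(vertices):
            if v not in covered:
                I.add(v)
                covered |= {v} | sq[v]
        if len(I) <= k:
            return I, t
    return set(vertices), None
```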
61. Conclusion
- Numerous models exist for clustering, such as clique-based, graph partitioning, min-cut, connectivity-based, etc.
- However, more sophisticated approaches require a rigorous background in optimization and algorithms.
- Commercial software packages are available for clustering problems of moderate size.
- For large-scale instances, meta-heuristic approaches such as simulated annealing, tabu search, or GRASP are offered.
- The basic algorithms are too restrictive for real-life data, so alternatively the distance-k neighborhood method is used. More detail is in [Holme05].
62. Conclusion
- Distance-k neighborhood N_k(v)
  - N_k(v): the vertices that are at distance k or less from v, excluding v itself
  - BFS can be used to find N_k(v)
  - Used to identify molecular complexes, starting from a seed vertex
    - Vertex weights are based on k-cores (subgraphs of minimum degree at least k) in the neighborhood of a vertex
  - k-cores were first introduced in social network analysis
    - They identify dense regions of the network
    - They resemble cliques if k is large enough in relation to the size of the k-core found
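The distance-k neighborhood is a depth-limited BFS, as the bullet above notes; a minimal sketch:

```python
from collections import deque

def distance_k_neighborhood(adj, v, k):
    """N_k(v): all vertices within distance k of v, excluding v itself,
    found with a breadth-first search cut off at depth k."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        if dist[u] == k:      # do not expand beyond depth k
            continue
        for x in adj[u]:
            if x not in dist:
                dist[x] = dist[u] + 1
                q.append(x)
    return set(dist) - {v}
```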
63. Conclusion
- Several models that relax cliques and dominating sets based on distance-k neighborhoods also exist, such as quasi-cliques.
- These relaxations are more robust when clustering real-life data containing errors (noise).
64. Conclusion
- A dilemma for clustering
  - Very rarely does real-life data present a unique clustering solution, because
    - Deciding the best model is hard, and it requires experimentation with different models
    - Even under one model, several optimal solutions are possible
  - These issues are in addition to the general issues of clustering, which are
    - The interpretation of clusters
    - What they represent
- "The whole is more than the sum of its parts" -- Aristotle
65. References
[Junker08] Junker and Schreiber. Analysis of Biological Networks. Wiley-Interscience, 2008.
[Butenko04] Butenko, Cheng, Oliveira, and Pardalos. A new heuristic for the minimum connected dominating set problem on ad hoc wireless networks. Cooperative Control and Optimization, pp. 61-73, 2004.
[Garey79] Garey and Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. W. H. Freeman and Company, New York, 1979.
[Hochbaum85] Hochbaum and Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10:180-184, 1985.
[Holme05] Holme. Core-periphery organization of complex networks. Physical Review E, 72:046111, 2005.