Network clustering - PowerPoint PPT Presentation

Provided by: ITSu177
Learn more at: http://www.cs.gsu.edu

1
Network clustering
  • Presented by Wooyoung Kim
  • 2/6/2009
  • CSc 8910 Analysis of Biological Network, Spring
    2009
  • Dr. Yi Pan

2
Outline
  • Introduction
  • Definitions and Basic Concepts
  • Network Clustering Problem
  • Hierarchical clustering
  • Clique-based clustering
  • Center-based clustering
  • Conclusion
3
Introduction
  • Clustering is
  • Loosely defined as the process of grouping
    objects into sets called clusters so that each
    cluster consists of elements that are similar in
    some way.
  • Examples
  • Distance-based clustering: objects are grouped
    when they are close under a given distance metric
  • Conceptual clustering: objects are grouped based
    on descriptive concepts

4
Introduction
  • Clustering is
  • used for multiple purposes, including
  • Finding natural clusters (modules) and
    describing their properties
  • Classifying the data
  • Detecting unusual data (outliers)
  • Data reduction by treating a cluster or one of
    its elements as a single representative unit

5
Introduction
  • Network clustering
  • deals with clustering the data represented as a
    network or a graph
  • Link analysis
  • Data points are represented by vertices
  • An edge exists if two data points are similar or
    related in a certain way
  • Similarity criterion
  • Pairwise relations, for the network model
  • Cohesiveness, for cluster similarity

6
Introduction
  • Network clustering approaches are used to perform
  • Distance-based clustering
  • Vertices are data points, and edges connect close
    points
  • Distances can be used to weight the edges of a
    complete graph

7
Introduction
  • Conceptual clustering
  • Generating a concept description for each
    generated cluster
  • Design a matching field in database networks,
    then vertices are connected by an edge if the
    two matching fields are close
  • Examples
  • Protein interaction networks: proteins are
    vertices and a pair is connected by an edge if
    they are known to interact
  • Gene co-expression networks: genes are vertices
    and an edge indicates that the pair of genes (end
    points) is co-expressed over some cut-off value,
    based on microarray experiments.
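
The co-expression construction described above can be sketched in a few lines. This is only a minimal illustration, assuming Pearson correlation as the co-expression measure; the gene names, expression values, and the 0.95 cut-off are hypothetical and do not come from the slides.

```python
from itertools import combinations
from math import sqrt

# Hypothetical expression profiles: gene -> values over five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
}

def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def coexpression_edges(profiles, cutoff):
    """Connect two genes when their correlation reaches the cut-off."""
    return {
        (g, h)
        for g, h in combinations(sorted(profiles), 2)
        if pearson(profiles[g], profiles[h]) >= cutoff
    }

edges = coexpression_edges(expression, cutoff=0.95)
```

With these made-up profiles only geneA and geneB end up joined by an edge; lowering the cut-off adds more edges and produces a denser network.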

8
Introduction
  • Application of network clustering
  • Understand the structure and function of
    proteins based on protein interaction maps of
    organisms
  • Clustering protein interaction networks (PINs)
    using cliques to decompose the Protein
    interaction network into functional modules and
    protein complexes
  • Use of cliques and other high density subgraphs
    to identify protein complexes (splicing
    machinery, transcription factors, etc.) and
    functional modules (signalling cascades, cell
    cycle regulation)

9
Introduction
  • Application of network clustering
  • Protein complexes: groups of proteins that
    interact with each other at the same time and
    place.
  • Functional modules: groups of proteins that
    are known to have pairwise interactions by
    binding with each other to participate in
    different cellular processes at different times

10
Definition and basic concepts
  • $G = (V, E)$ is a simple, undirected graph
  • $n = |V|$, $e = |E|$
  • $\bar{G} = (V, \bar{E})$ is the complement graph of $G$
  • $\bar{E}$ is the complement of the edge set $E$
  • $G[S]$ is the induced subgraph of $G$ (induced by a
    subset $S$ of $V$)
  • $N(v)$ is the set of neighbours of a vertex $v$ in $G$
    (excluding $v$)
  • Degree: $\deg(v) = |N(v)|$
  • Closed neighbourhood: $N[v] = N(v) \cup \{v\}$
  • $d(u,v)$ is the distance between $u$ and $v$
  • Length of the shortest path from $u$ to $v$
  • $\max_{i,j \in V} d(i,j)$ over all vertex pairs i and j
    is the diameter of a graph
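
The definitions on this slide map directly onto a dictionary-of-sets graph representation. The sketch below is illustrative only: the five-vertex graph is a hypothetical example (not the graph drawn on the slides), and the function names are chosen here for readability.

```python
from collections import deque

# Hypothetical example graph stored as adjacency sets.
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5},
    5: {4},
}

def closed_neighbourhood(g, v):
    """N[v]: the neighbours of v together with v itself."""
    return g[v] | {v}

def degree(g, v):
    """deg(v) = |N(v)|."""
    return len(g[v])

def induced_subgraph(g, s):
    """G[S]: keep only vertices in s and the edges between them."""
    s = set(s)
    return {v: g[v] & s for v in s}

def distances_from(g, source):
    """Breadth-first search distances d(source, v) to every reachable v."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in g[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def diameter(g):
    """Largest shortest-path distance; assumes g is connected."""
    return max(max(distances_from(g, v).values()) for v in g)
```

For the example graph, vertex 3 has degree 3 and the diameter is 3 (the shortest path from 1 to 5 goes through 3 and 4).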

11
Definition and basic concepts
  • Edge connectivity $\kappa'(G)$ of a graph: the
    minimum number of edges that must be removed to
    disconnect the graph
  • Vertex connectivity (or connectivity) $\kappa(G)$ of a
    graph: the minimum number of vertices that must
    be removed to disconnect the graph (or to leave
    a trivial graph)
  • Trivial graph: one vertex, no edges
  • Connected graph: every pair of vertices is
    joined by a path
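
For very small graphs both connectivity numbers can be found by brute force straight from the definitions. The sketch below tries removal sets of increasing size; it is exponential and meant only to make the definitions concrete. The 4-cycle used at the end is a hypothetical example.

```python
from itertools import combinations

def is_connected(vertices, edges):
    """True if the graph on `vertices` with `edges` has a single component."""
    vertices = set(vertices)
    if not vertices:
        return True
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices

def edge_connectivity(vertices, edges):
    """Smallest number of edges whose removal disconnects the graph."""
    edges = list(edges)
    for k in range(len(edges) + 1):
        for removed in combinations(edges, k):
            rest = [e for e in edges if e not in removed]
            if not is_connected(vertices, rest):
                return k
    return 0  # trivial graph: nothing to disconnect

def vertex_connectivity(vertices, edges):
    """Smallest number of vertices whose removal disconnects the graph."""
    vertices = list(vertices)
    for k in range(len(vertices) - 1):
        for removed in combinations(vertices, k):
            rest = set(vertices) - set(removed)
            kept = [(u, v) for u, v in edges if u in rest and v in rest]
            if not is_connected(rest, kept):
                return k
    # Complete graph: removing n-1 vertices leaves a trivial graph.
    return max(len(vertices) - 1, 0)

cycle_vertices = [1, 2, 3, 4]
cycle_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
```

On the 4-cycle both functions return 2: one removed edge or vertex leaves a path, while two suitable removals split the graph.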

12
Definition and basic concepts
Example: (vertex) connectivity $\kappa(G) = 2$; removal of
vertices 3 and 5 would disconnect the graph
[Figure: example graph on vertices 1 to 12]
13
Definition and basic concepts
Example: edge connectivity $\kappa'(G) = 2$; removal of
edges (9,11) and (10,12) would disconnect the
graph
[Figure: example graph on vertices 1 to 12]
14
Definition and basic concepts
  • A clique C is a subset of vertices such that an
    edge exists between every pair of vertices in C
  • The induced subgraph $G[C]$ is a complete graph
  • A clique is maximal if it is not a subset of
    any larger clique
  • A clique is maximum if there are no larger
    cliques in the graph
  • A subset of vertices I is called an independent
    set (also called a stable set) if for every pair
    of vertices in I, (i, j) is not an edge
  • The induced subgraph $G[I]$ is edgeless
  • An independent set is maximal if it is not a
    subset of any larger independent set
  • An independent set is maximum if there are no
    larger independent sets in the graph
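
The difference between "maximal" and "maximum" is easy to check mechanically on a small graph. The following sketch uses a hypothetical five-vertex graph (not the slide graph); the maximum-clique search is brute force and only suitable for tiny inputs.

```python
from itertools import combinations

# Hypothetical example graph stored as adjacency sets.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def is_clique(g, s):
    """Every pair of vertices in s is joined by an edge."""
    return all(v in g[u] for u, v in combinations(s, 2))

def is_independent(g, s):
    """No pair of vertices in s is joined by an edge."""
    return all(v not in g[u] for u, v in combinations(s, 2))

def is_maximal_clique(g, s):
    """A clique that cannot be extended by any further vertex."""
    s = set(s)
    return is_clique(g, s) and not any(
        is_clique(g, s | {v}) for v in g if v not in s
    )

def is_maximal_independent(g, s):
    """An independent set that cannot be extended by any further vertex."""
    s = set(s)
    return is_independent(g, s) and not any(
        is_independent(g, s | {v}) for v in g if v not in s
    )

def maximum_clique(g):
    """Brute force: try subsets from largest to smallest."""
    for size in range(len(g), 0, -1):
        for s in combinations(sorted(g), size):
            if is_clique(g, s):
                return set(s)
    return set()
```

In this graph {3, 4} is a maximal clique (no vertex can be added) but not a maximum one, since {1, 2, 3} is larger.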

15
Definition and basic concepts
Example (maximal clique): {1,2,3} is a maximal
clique
[Figure: example graph on vertices 1 to 12]
16
Definition and basic concepts
Example (maximum clique): {7,8,9,10} is the maximum
clique
[Figure: example graph on vertices 1 to 12]
17
Definition and basic concepts
Example (maximal independent set): $I = \{3,7,11\}$ is a
maximal independent set, as there is no larger
independent set containing it
18
Definition and basic concepts
Example (maximum independent set): the set
{1,4,5,10,11} is a maximum independent set, one
of largest cardinality in the graph
19
Definition and basic concepts
20
Definition and basic concepts
Algorithm for maximal independent set
21
Definition and basic concepts
Algorithm for maximal independent set
[Figure: algorithm illustrated on a graph with vertices 1 to 7]
22
Definition and basic concepts
  • A dominating set is a set of
    vertices such that every vertex in the graph is
    either in this set or has a neighbour in this set
  • Dominating set is minimal if it contains no
    proper subset which is dominating
  • Dominating set is a minimum dominating set if
    it is of the smallest cardinality
  • Cardinality of a minimum dominating set is
    called the domination number $\gamma(G)$ of a graph

23
Definition and basic concepts
$D = \{7, 11, 3\}$ is a minimal and minimum
dominating set
[Figure: example graph on vertices 1 to 12]
24
Definition and basic concepts
  • A connected dominating set is one in which the
    subgraph induced by the dominating set is
    connected
  • An independent dominating set is one in which
    the dominating set is also independent
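
The dominating-set definitions can be verified directly on a small graph. The sketch below again uses a hypothetical five-vertex graph rather than the slide graph, and the domination number is found by exhaustive search, so it is only practical for tiny inputs.

```python
from itertools import combinations

# Hypothetical example graph stored as adjacency sets.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def is_dominating(g, d):
    """Every vertex is in d or has a neighbour in d."""
    d = set(d)
    return all(v in d or bool(g[v] & d) for v in g)

def is_minimal_dominating(g, d):
    """Dominating, and dropping any single vertex breaks domination."""
    d = set(d)
    return is_dominating(g, d) and not any(
        is_dominating(g, d - {v}) for v in d
    )

def domination_number(g):
    """Brute force over subsets by increasing size."""
    for size in range(1, len(g) + 1):
        if any(is_dominating(g, s) for s in combinations(sorted(g), size)):
            return size
    return 0

def is_independent(g, d):
    """No two vertices of d are adjacent."""
    return all(v not in g[u] for u, v in combinations(sorted(d), 2))

def is_connected_in(g, d):
    """True if the subgraph induced by the non-empty set d is connected."""
    d = set(d)
    start = next(iter(d))
    seen, stack = {start}, [start]
    while stack:
        for w in g[stack.pop()] & d:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == d
```

Here {3, 4} is a connected dominating set and {3, 5} is an independent dominating set; both have the minimum size of 2.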

25
Network clustering problem
  • Given a graph $G = (V, E)$, find subsets (not
    necessarily disjoint) $V_1, \dots, V_r$ of $V$ with
    $V = \bigcup_{i=1}^{r} V_i$ such that
  • Each subset is a cluster modelled by structures
    such as cliques or other distance and
    diameter-based models
  • The model used as a cluster represents the
    cohesiveness required of the cluster

26
Network clustering problem
  • The clustering models can be classified
  • By the constraints on relations between
    clusters (clusters may be disjoint or
    overlapping)
  • By the objective function used to achieve the goal
    of clustering (minimizing the number of clusters
    or maximizing the cohesiveness)
  • When clusters are required to be disjoint
  • → $\{V_1, \dots, V_r\}$ is a cluster-partition → exclusive
    clustering
  • When clusters are allowed to overlap
  • → $\{V_1, \dots, V_r\}$ is a cluster-cover → overlapping
    clustering
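
Whether a family of clusters is a partition or only a cover can be checked with two set operations. The helper name and the six-vertex data below are hypothetical, chosen purely to illustrate the distinction.

```python
def classify_clustering(vertices, clusters):
    """Return 'partition', 'cover', or 'invalid' for a family of clusters."""
    vertices = set(vertices)
    clusters = [set(c) for c in clusters]
    covered = set().union(*clusters)
    if covered != vertices:
        return "invalid"  # some vertex is unclustered or unknown
    # If the sizes add up exactly, no vertex is in two clusters.
    total = sum(len(c) for c in clusters)
    return "partition" if total == len(vertices) else "cover"

vertices = range(1, 7)
exclusive = [{1, 2, 3}, {4, 5}, {6}]
overlapping = [{1, 2, 3}, {3, 4, 5}, {6}]
```

The first family is an exclusive clustering; in the second, vertex 3 belongs to two clusters, so it is an overlapping clustering.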

27
Network clustering problem
  • Assume that there is a measure of cohesiveness of
    the cluster that can be varied for a graph G →
    define two types of optimization problems
  • Type I: Minimize the number of clusters while
    ensuring that every cluster formed has
    cohesiveness over a prescribed threshold
  • Example: The problem of clustering an incomplete
    graph with cliques used as clusters and the
    objective of minimizing the number of clusters

28
Network clustering problem
  • Type II: Maximize the cohesiveness of each
    cluster formed, while the number of clusters is
    at most K (the last requirement may be relaxed by
    setting K to infinity)
  • Example: assume that G has non-negative edge
    weights w; for a cluster $V_i$ let $E_i$ denote the
    edges in the subgraph induced by $V_i$
  • Use w as a dissimilarity measure (distance)
  • For example, $w(E_i) = \sum_{e \in E_i} w(e)$ is a
    meaningful measure of cohesiveness
  • It can be used to formulate a Type II clustering
    problem
  • We will refer to problems as Type I and Type II
    based on their objective

29
Hierarchical clustering
  • After performing clustering, we can abstract the
    graph $G^0$ to a graph
  • $G^1 = (V^1, E^1)$ as follows
  • There exists a vertex $v_i^1$ in $V^1$ for every subset
    (cluster) $V_i^0$
  • There exists an edge between $v_i^1$ and $v_j^1$ if and
    only if there exist a vertex x in the cluster $V_i^0$
    and a vertex y in the cluster $V_j^0$ such that (x, y)
    is an edge in $G^0$
  • In other words, if any two vertices from
    different clusters have an edge between them in
    the original graph $G^0$ → the clusters containing them
    are made adjacent in $G^1$
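
The abstraction step can be written as a short function: one node per cluster, and two cluster nodes are joined whenever some edge of the original graph runs between them. The edge list and clusters below are hypothetical, and clusters are identified by their position in the list.

```python
def abstract_graph(edges, clusters):
    """Build the cluster-level graph from an edge list and a list of clusters."""
    adjacency = {i: set() for i in range(len(clusters))}
    for x, y in edges:
        for i, ci in enumerate(clusters):
            for j, cj in enumerate(clusters):
                # An edge (x, y) with x in cluster i and y in cluster j
                # makes the two clusters adjacent.
                if i != j and x in ci and y in cj:
                    adjacency[i].add(j)
                    adjacency[j].add(i)
    return adjacency

# Hypothetical original graph and clustering.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
clusters = [{1, 2, 3}, {4, 5}, {6}]
abstracted = abstract_graph(edges, clusters)
```

Applying a clustering step to `abstracted` and calling `abstract_graph` again would give the next level of the hierarchy.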

30
Hierarchical clustering
  • We can recursively cluster the abstracted graph
    G1 in a similar fashion to obtain a multilevel
    hierarchy
  • Process is called hierarchical clustering
  • Example
  • The following subsets form clusters in this graph:
    $C_1 = \{7,8,9,10\}$, $C_2 = \{1,2,3\}$, $C_3 = \{4,6\}$,
    $C_4 = \{11,12\}$, $C_5 = \{5\}$
  • Given the clusters of the example graph G → we
    can construct an abstracted graph $G^1$

31
Clique-based clustering
  • Natural choice for a highly cohesive cluster
  • Cliques have
  • Minimum possible diameter
  • Maximum connectivity
  • Maximum possible degree for each vertex
  • Given an arbitrary graph
  • Type I approach tries to partition it into (or
    cover it using) minimum number of cliques
  • Type II approaches usually work with a weighted
    complete graph and hence every partition of the
    vertex set is a clique partition

32
Clique-based clustering
  • Minimum clique partitioning
  • Type I clique partitioning and clique covering
    problems are both NP-hard [Garey79]
  • Heuristic approaches are preferred for large
    graphs
  • Note that clique-partitioning and
    clique-covering problems are closely related
  • The minimum number of clusters produced in clique
    covering and in clique partitioning is the same
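
As an illustration of what a heuristic for clique partitioning can look like, the sketch below repeatedly grows a maximal clique from the highest-degree remaining vertex and removes it. This is one common greedy scheme, shown here as an assumption; it is not necessarily the heuristic used in the presentation, and the graph is a hypothetical example.

```python
# Hypothetical example graph stored as adjacency sets.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def greedy_clique_partition(g):
    """Partition the vertices into cliques, one greedy maximal clique at a time."""
    remaining = set(g)
    cliques = []
    while remaining:
        # Seed with the remaining vertex of highest remaining degree.
        seed = max(sorted(remaining), key=lambda v: len(g[v] & remaining))
        clique = {seed}
        # Candidates are adjacent to every vertex already in the clique.
        candidates = g[seed] & remaining
        while candidates:
            v = max(sorted(candidates), key=lambda u: len(g[u] & candidates))
            clique.add(v)
            candidates &= g[v]
        cliques.append(clique)
        remaining -= clique
    return cliques

partition = greedy_clique_partition(graph)
```

The result is always a valid clique partition, but the number of cliques is not guaranteed to be minimum, which is consistent with the problem being NP-hard.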

33
Clique-based clustering
34
Clique-based clustering
Simple heuristic for clique partitioning
35
Clique-based clustering
Example
[Figure: example graph on vertices 1 to 12]
36
Clique-based clustering
Example
37
Clique-based clustering
Example
38
Clique-based clustering
  • Min-Max k-Clustering
  • A Type II clique partitioning problem with
    min-max objective
  • Consider a weighted complete graph $G = (V, E)$
    with weights $w(e_1) \le w(e_2) \le \dots \le w(e_m)$,
    $m = n(n-1)/2$
  • Partition the graph into no more than k cliques
    s.t. the maximum weight of an edge between two
    vertices inside a clique is minimized
  • In other words, if $V_1, \dots, V_k$ is the clique
    partition, then we wish to minimize
    $\max_{i=1,\dots,k} \max_{u,v \in V_i} w(u,v)$
  • The weight w(i,j) can be thought of as a measure
    of dissimilarity
  • Larger w(i,j) means more dissimilar i and j are
  • Problem tries to cluster the graph into at most
    k cliques such that the maximum dissimilarity
    between any two vertices inside a clique is
    minimized
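
The min-max objective is simple to evaluate for a given partition, and for a handful of vertices the best split into two clusters can even be found exhaustively. The sketch below assumes four hypothetical points on a line with the absolute difference as dissimilarity; none of this data comes from the slides.

```python
from itertools import combinations

# Hypothetical points on a line; the weight is their absolute difference.
position = {"a": 0, "b": 1, "c": 5, "d": 6}

def weight(u, v):
    """Dissimilarity w(u, v) between two vertices."""
    return abs(position[u] - position[v])

def min_max_objective(clusters):
    """Largest dissimilarity between two vertices that share a cluster."""
    return max(
        (weight(u, v) for c in clusters for u, v in combinations(sorted(c), 2)),
        default=0,
    )

def best_partition_into_two(vertices):
    """Exhaustive search over all splits into at most two clusters."""
    vertices = sorted(vertices)
    best = None
    for size in range(len(vertices) + 1):
        for part in combinations(vertices, size):
            clusters = [set(part), set(vertices) - set(part)]
            value = min_max_objective(clusters)
            if best is None or value < best[0]:
                best = (value, clusters)
    return best

best_value, best_clusters = best_partition_into_two(position)
```

The exhaustive search is exponential in the number of vertices, which is why heuristics are used for graphs of realistic size.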

39
Clique-based clustering
  • Min-Max k-Clustering
  • Given any graph G(V,E), the required edge
    weighted complete graph G can be obtained in
    different ways using meaningful measures of
    dissimilarity
  • The weight w(i,j) could be $d(i,j)$, the shortest
    distance between i and j in G
  • The weight could be based on $\kappa(i,j)$ or $\kappa'(i,j)$
    (minimum number of vertices and edges,
    respectively, that need to be removed from G to
    disconnect i and j)
  • Since these are measures of similarity → we
    could obtain the required weights as
    $w(i,j) = |V| - \kappa(i,j)$ or $w(i,j) = |E| - \kappa'(i,j)$

40
Clique-based clustering
41
Clique-based clustering
Bottleneck heuristic for the min-max k-clustering
problem
42
Clique-based clustering
  • Procedure bottleneck() returns the bottleneck
    graph G()
  • MIS() is an arbitrary procedure for finding a
    maximal independent set (MIS) in G
  • This algorithm will be optimal if we manage to
    find a maximum independent set (one of largest
    size) in every iteration
  • However, finding a maximum independent set is
    NP-hard
  • To have a polynomial-time algorithm, we restrict
    ourselves to finding a maximal independent set
    with heuristic approaches such as the greedy
    approach described earlier

43
Clique-based clustering
Example: clustering output of the bottleneck
min-max k-clustering algorithm with k = 2 for
the following graph
44
Clique-based clustering
Result
45
Center-based clustering
  • In center-based clustering models, the elements
    of a cluster are determined based on their
    similarity with the cluster's center (or
    cluster-head)
  • Center-based clustering algorithms usually
    consist of two steps
  • First, an optimization procedure is used to
    determine the cluster-heads
  • Second, the cluster-heads are then used to form
    clusters around them

46
Center-based clustering
  • Clustering with dominating sets
  • Minimum dominating set and related problems
    provide a modelling tool for centre-based
    clustering of Type I
  • The minimum dominating set problem is NP-hard →
    heuristic approaches and approximation algorithms
    are used to find a small dominating set
  • If D denotes a dominating set, then for each
    vertex v in D the closed neighbourhood $N[v]$ forms
    a cluster
  • By the definition of domination, every vertex
    not in the dominating set has a neighbour in it
    and hence is assigned to some cluster

47
Center-based clustering
  • Each v in D is called a cluster-head and the
    number of clusters that result is exactly the
    size of the dominating set
  • Minimizing the size of the dominating set →
    minimizes the number of clusters produced,
    resulting in a Type I clustering problem
  • This approach results in a cluster cover since
    the resulting clusters need not be disjoint

48
Center-based clustering
  • Each cluster has diameter at most two as every
    vertex in the cluster is adjacent to its
    cluster-head and the cluster-head is similar to
    all the other vertices in its cluster
  • However, neighbours of the cluster-head may be
    poorly connected among themselves
  • Some post-processing may be required as a cluster
    formed in this fashion from an arbitrary
    dominating set could completely contain another
    cluster
  • Clustering with dominating sets is especially
    suited for clustering protein interaction
    networks
  • To reveal groups of proteins that interact
    through a central protein which could be
    identified as a cluster-head in this method

49
Center-based clustering
  • Independent Dominating Sets
  • Finding a maximal independent set results also in
    a minimal independent dominating set
  • Can be used in clustering the graph
  • Here, no cluster formed can contain another
    cluster completely, as the cluster-heads are
    independent and different

50
Center-based clustering
  • Greedy algorithm for minimal independent
    dominating sets
  • Proceeds by adding a maximum degree vertex to the
    current independent set and then deleting that
    vertex along with its neighbours
  • Greedy because it adds a maximum degree vertex so
    that a larger number of vertices are removed in
    each iteration, yielding a small independent
    dominating set
  • This is repeated until no more vertices exist

51
Center-based clustering
Progress of greedy minimal independent dominating
set algorithm on the example graph
52
Center-based clustering
  • Connected Dominating Sets
  • In some situations, cluster-heads are required to
    be connected rather than independent.
  • A connected dominating set (CDS) is a dominating
    set D such that $G[D]$ is connected.
  • Finding a minimum CDS is NP-hard.
  • Heuristics or approximation algorithms are needed

53
Center-based clustering
  • Greedy vertex elimination type heuristic for
    finding CDS [Butenko04]
  • Let D = V, and F be empty initially.
  • Pick a minimum degree vertex u in $G[D]$ that is
    not yet fixed and delete it from D, if the
    deletion does not disconnect the graph. Otherwise
    u is added to set F (u is fixed).
  • If u is deleted and $N(u) \cap F$ is empty, a
    vertex of maximum degree in $G[D]$ that is also in
    N(u) is fixed, ensuring that u is dominated.
  • In every iteration D is connected and is a
    dominating set in G.
  • The algorithm terminates when all the vertices
    left in D are fixed (D = F) and that is the output
    CDS of the algorithm.
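
A minimal sketch of the vertex-elimination steps listed above is given below. It follows the bullet points as written rather than the original pseudocode of [Butenko04], which is not reproduced in this transcript; tie-breaking by smallest label and the five-vertex graph are assumptions made for the example.

```python
# Hypothetical connected example graph stored as adjacency sets.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def induces_connected(g, d):
    """True if d is non-empty and the subgraph induced by d is connected."""
    d = set(d)
    if not d:
        return False
    start = next(iter(d))
    seen, stack = {start}, [start]
    while stack:
        for w in g[stack.pop()] & d:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == d

def greedy_cds(g):
    """Shrink D = V to a connected dominating set by vertex elimination."""
    d = set(g)      # current candidate set D
    fixed = set()   # F: vertices that must stay in D
    while d != fixed:
        # Non-fixed vertex of minimum degree in G[D].
        u = min(sorted(d - fixed), key=lambda v: len(g[v] & d))
        if induces_connected(g, d - {u}):
            d.remove(u)
            if not (g[u] & fixed):
                # Fix a max-degree neighbour so that u stays dominated.
                w = max(sorted(g[u] & d), key=lambda v: len(g[v] & d))
                fixed.add(w)
        else:
            fixed.add(u)  # deleting u would disconnect G[D]
    return d

cds = greedy_cds(graph)
```

On the example graph the run deletes 5 (fixing 4), deletes 1 (fixing 3), deletes 2, and stops with the set {3, 4}.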

54
Center-based clustering
greedy_CDS_algorithm (connected graph G = (V,E))
55
Center-based clustering
CDS
[Figure: example graph on vertices 1 to 12]
56
Center-based clustering
Cluster cover
[Figure: example graph on vertices 1 to 12]
57
Center-based clustering

D                  F             u    N(u)    w
V                  ∅             1    2,3     3
V\{1}              3             2    ---     ---
V\{1,2}            3             4    ---     ---
V\{1,2,4}          3             6    7       7
V\{1,2,4,6}        3,7           11   9,12    9
3,5,7,8,9,10,12    3,7,9         12   10      10
3,5,7,8,9,10       3,7,9,10      5    ---     ---
3,5,7,8,9,10       3,7,9,10,5    8    ---     ---
3,5,7,9,10         3,7,9,10,5

58
Center-based clustering
  • k-Center clustering
  • Variant of the k-means clustering approach
    [Birnhaum03]
  • Seeks to identify k cluster-heads such that some
    objective function measuring the dissimilarity of
    the members of a cluster is minimized.
  • Different choices of dissimilarity measures and
    objectives often yield different clustering
    problems and solutions.

59
Center-based clustering
  • The k-center clustering problem is a Type II
    center-based clustering model with a min-max
    objective [Hochbaum85].

bottleneck_k-center_algorithm (connected graph
G = (V,E), with weights)
60
Center-based clustering
  • bottleneck_k-center_algorithm
  • Let I be the set returned by MIS( ) on the
    squared graph. Then the distance between every
    pair of vertices in I is at least three in the
    original (unsquared) graph G.
  • No vertex of G can dominate two vertices of I,
    since they are at least distance three apart.
  • Therefore, any dominating set in G is at least
    as large as I.
  • In the algorithm, if we find an MIS of
    size at least k+1, then no dominating set of
    size k exists in G.
  • Thus, we proceed until we find an MIS of size at
    most k and terminate.
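
The argument above suggests the following sketch in the spirit of the bottleneck approach of [Hochbaum85]. Since the slide's own pseudocode is not part of this transcript, the details are assumptions: thresholds are tried in increasing order, the bottleneck graph keeps edges of weight at most the threshold, and a greedy maximal independent set is computed in its square. The point data are hypothetical.

```python
from itertools import combinations

# Hypothetical points on a line; the weight is their absolute difference.
position = {"a": 0, "b": 1, "c": 5, "d": 6}

def weight(u, v):
    return abs(position[u] - position[v])

def greedy_mis(adj):
    """Greedy maximal independent set: max degree first, then delete."""
    remaining, chosen = set(adj), []
    while remaining:
        v = max(sorted(remaining), key=lambda u: len(adj[u] & remaining))
        chosen.append(v)
        remaining -= adj[v] | {v}
    return chosen

def bottleneck_k_center(vertices, k):
    """Return (centres, threshold) for the first threshold with |MIS| <= k."""
    vertices = sorted(vertices)
    thresholds = sorted({weight(u, v) for u, v in combinations(vertices, 2)})
    for w in thresholds:
        # Bottleneck graph: keep only edges of weight at most w.
        adj = {u: {v for v in vertices if v != u and weight(u, v) <= w}
               for u in vertices}
        # Square: also join vertices at distance two in the bottleneck graph.
        square = {u: set(adj[u]) for u in vertices}
        for u in vertices:
            for v in adj[u]:
                square[u] |= adj[v]
            square[u].discard(u)
        centres = greedy_mis(square)
        if len(centres) <= k:
            return centres, w
    return vertices, 0  # fewer than two vertices: nothing to cluster

centres, threshold = bottleneck_k_center(position, k=2)
```

Every vertex is adjacent, in the squared bottleneck graph, to some returned centre, so when the weights satisfy the triangle inequality it lies within twice the returned threshold of a centre.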

61
Conclusion
  • Numerous models exist for clustering, such as
    clique-based, graph partitioning, min-cut,
    connectivity-based, etc.
  • However, more sophisticated approaches require
    rigorous background on optimization and
    algorithms.
  • Commercial software packages are available for
    clustering problems of moderate sizes.
  • For large-scale instances, meta-heuristic
    approaches such as simulated annealing, tabu
    search, or GRASP are used.
  • The basic algorithms are too restrictive for
    real-life data, so alternatively, the distance-k
    neighborhood method is used. More detail is in
    [Holme05]

62
Conclusion
  • Distance-k neighborhood $N_k(v)$
  • $N_k(v)$: the vertices that are at distance k or
    less from v, excluding v itself.
  • BFS can be used to find $N_k(v)$.
  • Used to identify molecular complexes starting
    with a seed vertex.
  • The vertex weights are based on k-cores
    (subgraphs of minimum degree at least k) in the
    neighborhood of a vertex.
  • k-cores were first introduced in social network
    analysis
  • Identify dense regions of the network.
  • Resemble cliques if k is large enough in
    relation to the size of the k-core found.

63
Conclusion
  • Several models that relax cliques and dominating
    sets based on distance-k neighborhood also exist
    such as quasi-cliques.
  • Those relaxations are more robust in clustering
    real-life data containing errors (noise)

64
Conclusion
  • Dilemma for clustering
  • Very rarely does real-life data present a unique
    clustering solution, because
  • Deciding the best model is hard and it requires
    experimentation with different models.
  • Even under one model, several optimal solutions
    are possible.
  • These issues are in addition to the general
    issues of clustering which are
  • The interpretation of clusters
  • What they represent.
  • The whole is more than the sum of its parts --
    Aristotle

65
References
[Junker08] Junker and Schreiber. Analysis of
Biological Networks. Wiley-Interscience, 2008.
[Butenko04] Butenko, Cheng, Oliveira, and Pardalos.
A new heuristic for the minimum connected
dominating set problem on ad hoc wireless
networks. Cooperative Control and Optimization,
pp. 61-73, 2004.
[Garey79] Garey and Johnson. Computers and
Intractability: A Guide to the Theory of
NP-Completeness. W.H. Freeman and Company,
New York, 1979.
[Hochbaum85] Hochbaum and Shmoys. A best possible
heuristic for the k-center problem. Mathematics
of Operations Research, 10:180-184, 1985.
[Holme05] Holme. Core-periphery organization of
complex networks. Physical Review E,
72:046111-1-046111-4, 2005.