Detecting Community Structure in Network - PowerPoint PPT Presentation

1 / 32
About This Presentation
Title:

Detecting Community Structure in Network

Description:

Title: Detecting Community Structure in Network Author: welcome Last modified by: welcome Created Date: 8/9/2004 4:24:17 PM Document presentation format – PowerPoint PPT presentation

Number of Views:91
Avg rating:3.0/5.0
Slides: 33
Provided by: welc185
Category:

less

Transcript and Presenter's Notes

Title: Detecting Community Structure in Network


1
Detecting Community Structure in Network
2004 summer intensive studies on complex networks
  • Seung Woo Son
  • KAIST

2004. 8. 11.
2
http//cnrl.snu.ac.kr/
3
Clustering of data
  • Partitional clustering methods
  • Important technique in data analysis
  • Divide the data according to natural classes
  • Pattern recognition, learning, astrophysics, and
    network analysis

N multivariable data points
4
On network
N vertices (nodes)
No prior information
Only know the edge (link) connectivity
Structural information
How can we divide the network into several parts?
How can we find the community structure?
5
Community, cluster
  • Functional modules in cellular and genetic
    network
  • P. Holme, M. Huss, and H. Jeong, Bioinformatics
    19, 532 (2003).
  • D. Wilkinson and B. A. Huberman, Proc. Natl.
    Acad. Sci. USA 10.1073/pnas.0307740100 (2004).
  • A. Vespignani, Nature Genetics 35, 118 (2003).
  • Cultural society or important source of a
    persons identity in social network
  • J. Scott, Social Network Analysis A Handbook,
    Sage Publications 2nd ed. (2000).
  • A bundle of web pages on common topics etc.
  • Community, module, (cohesive) subgroup, cluster,
    clique etc.
  • Computer science, mathematics, sociology,
    biology, and physics are related in this
    community finding problem.

6
Structural definition of community
  • Groups of vertices within which connections are
    dense, but between which connections are sparser.
  • Because we dont have any prior information about
    network.

7
Key points (Highlight)
  • What property or measure of network is used in
    this algorithm or method?
  • eigenvalue and eigenvector, spectrum of adjacency
    matrix.
  • Edge betweenness, information centrality.
  • Distance, dissimilarity index, edge clustering
    coefficient, etc.
  • Agglomerative or divisive?
  • What is the required prior information here?
  • Whether there is community or not.
  • How many modules are there.
  • Performance of partitioning results and
    computational complexity.
  • We will review about 11 different methods
    recently studied. ? If you are boring, ask me a
    question.

Physical meaning?
8
1. Spectral bisection (old one)
M. Fiedler, Czrch. Math. J. 23, 298 (1973)A.
Pothen, H. Simon, and K.-P. Liou, SIAM J. Matrix
Anal. Appl. 11, 430 (1990)F. R. K. Chung,
Spectral Graph Theory, Amer. Math. Soc.
(1997) http//www.cs.berkeley.edu/demmel/cs267/le
cture20/lecture20.html
Laplacian L of n-vertex undirected graph G
- D is the diagonal matrix of vertex degree k.
- A is the adjacency matrix.
9
Bisect !
Algebraic connectivity
The eigenvector corresponding to the lowest
eigenvalue must haveboth positive and negative
elements.
How good the split is, with smaller values
corresponding to better splits.
G. H. Golub and C. F. Van Loan, Matrix
computations. Johns Hopkins University Press,
Baltimore, MD (1989)
10
2. The Kernighan-Lin (KL) algorithm
B. W. Kernighan and S. Lin, Bell System Technical
Journal 49, 291 (1970)http//www.cs.berkeley.edu/
demmel/cs267/lecture18/lecture18.html
Benefit function Q
The number of edges that lie within the two
groups minus the number that lie between them.
1. We should Specify the size of the two groups.
N(A), N(B)2. Calculate the ?Q for all possible
exchange pair from A and B. 3. Choose the pair
that maximizes the change of Q. (greedy
algorithm) 4. Repeat 2 3 until all vertices
have been swapped once. (any vertex that has
been swapped is never swapped. ) 5. Go back over
the sequence of swaps and find the highest Q.
Bisect !
- This algorithm requires a priori what the size
of the groups will be.
- It runs moderately quickly, in worst case time
O(n2). However, if we dont know the size, It
will increase to O(n3).
- The best values of Q are always achieved for
very asymmetric trivial division.
11
3. Newman fast algorithm
M. E. J. Newman, cond-mat/0309508 (PRE in press)
Maximize Q by greedy algorithm !
  1. Separate each vertex solely into n community.
  2. Calculate the increase of Q for all possible
    community pairs.
  3. Choose the mergence of the greatest increase in
    Q.
  4. Repeat 2 3 until the modularity Q reaches the
    maximal value.

Time Complexity - O(mn)
O(n2) on sparse graph.
Agglomerative hierarchical clustering method!
12
4. q-state Potts method or RB method
(Reihardt-Bornholdt method)
J. Reichardt and S. Bornholdt, cond-mat/0402349
(2004)
q-state Potts model on network
Hamiltonian
Nearest neighbor ferromagnetic interaction of the
Potts model homogeneous distribution of spin
Diversity global anti-ferromagnetic interaction.
q N/5 is reasonable for application.
Monte-Carlo heat-bath algorithm and simulated
annealing
magnetization
13
128 nodes computer-generated (proposed by Newman)
network, 4 groups of 32 nodes each. Average of 16
links ( zinzout16 )
14
5. Hierarchical clustering
Dendrogram
metric
  1. Measure of similarity xij between pairs (i,j) of
    vertices.
  2. Single linkage, complete linkage, or average
    linkage.

Structural equivalence Two vertices are said
to be structurally equivalent if they have the
same set of neighbours. How many same friends
they have.
Pearson correlation
Euclidean distance
K-components Two vertices in the same community
have at least k independent paths between them.
The count of edge-independent path (max-flow)
betweenvertices.
Time complexity Max(O(mn), O(n2logn) )
because of the sorting of n2 similarity.
15
6. Zhou dissimilarity index method
H. Zhou, Phys. Rev. E 67, 061901 (2003) H. Zhou,
Phys. Rev. E 67, 041908 (2003)
The distance dij from vertex i to vertex j is
defined as the average number of steps needed for
a Brownian particle on this network to move from
vertex i to vertex j.
Transfer matrix (jumping probability)
I is N by N identity matrix.
Distance
B(j) is equals to P except that Blj(j) 0 for
all l.
Dissimilarity index
16
7. Girvan-Newman (GN) algorithm
M. Girvan and M. E. J. Newman, PNAS 99, 7821
(2002) M. E. J. Newman and M. Girvan, Phys. Rev.
E 69, 026113 (2004)
The few edges that lie between communities can be
thought of as forming bottlenecks between the
communities.
Betweenness and edge betweenness
The number of geodesic (i.e., shortest) paths
between vertex pairs that run along the edge in
question, summed over all vertex pairs.
Edge removal
After calculating the betweenness of all edges in
the network, remove the one with highest
betweenness.
Recalculate after edge removal and repeat it
until the modularity Q is maximum.
Time complexity O(m2n)
17
8. Tyler-Wilkinson-Huberman (TWM) method
J. R. Tyler, D. M. Wilkinson, and B. A. Huberman,
cond-mat/0303264 (2003)
Variation of Girvan-Newman algorithm to improve
the calculating speed.
Tyler et al. suggest instead summing up over all
node only a subset of vertices i be summed over,
giving partial betweenness score for all edges
if a random sample is chosen, this will give a
Monte Carlo estimate of betweenness.
The number of vertices sampled is chosen so as to
make the betweenness of at least one edge in the
network greater than a certain threshold.
This stochastic approach reduces the time
complexity from O(m2n) to O(m2)
18
9. RCCLP method or Parisi method
(Radicchi-Castellano-Cecconi-Loreto-Parisi method)
F. Radicchi, C. Castellano, F. Cecconi, V.
Loreto, and D. Parisi, PNAS 101, 2658 (2004)
Edge clustering coefficient
the number of triangles built on that edge.
Edge coefficient of order g
19
Edge clustering coefficient is strongly
negatively correlated with edge betweenness.
Time complexity O(m4/n2) O(n2)
This algorithm relies on the presence of
triangles in the network. Clearly if a network
has few triangles in the first place, then the
edge clustering coefficient will be small for all
edges, and the algorithm will be fail to find the
communities.
20
10. Information centrality method
(Fortunato-Latora-Marchiori method)
S. Fortunato, V. Latora, and M. Marchiori,
cond-mat/042522 (2004)
Network efficiency E
Information centrality CI
Time complexity O(m3n)
Iterative removal of the edges with the highest
information centrality
64 nodes computer-generated network. 256 edges, 4
groups of 16 nodes each.
21
11. Flakes max-flow method
(Flake-Lawrence-Giles-Coetzee method)
G. W. Flake, S. R. Lawrence, C. L. Giles, and F.
M. Coetzee, IEEE Computer 35, 66 (2002)
Web community
Find the boundary of community using max-flow and
min-cut.
Without the text information only link
information.Ex) Page Rank, Hyperlink Induced
Topic Search(HIT)
Starting page orseed Web sites
22
Simple example of max-flow, min-cut
23
Optimization Approach Hamiltonian, benefit
function, or modularity Q
Spectral analysis eignevalue and eigenvector
of Laplacian or transfer matrix
Edge removal betweenness, information
centrality, clustering coefficient, etc.
Hierarchical clustering metric ( Euclidian,
correlation, similarity, etc. )
24
(No Transcript)
25
12. ESMS method or K. Sneppen method
( Eriksen-Simonsen-Maslov-Sneppen method )
K. A. Eriksen, I. Simonsen, S. Maslov, and K.
Sneppen, Phys. Rev. Lett. 90, 148701 (2003)
26
(No Transcript)
27
13. CSCC method or Capocci method
( Capocci-Servedio-Caldarelli-Colaiori method )
A. Capocci, V. D. P. Servedio, G. Caldarelli, and
F. Colaiori, cond-mat/0402499
28
14. Donetti-Muñoz (DM) method
L. Donetti and M. A. Muñoz, cond-mat/0404652
29
15. Wu-Huberman (WH) method
F. Wu and B. A. Huberman, cond-mat/310600
30
16. Costas Hub-based flooding method
L. F. Costa, cond-mat/0405022 (2004)
31
(No Transcript)
32
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com