Title: A scalable multilevel algorithm for community structure detection
1A scalable multilevel algorithm for community
structure detection
- Melih Onus (Arizona State University)
- Hristo Djidjev (Los Alamos National Laboratory)
Models and Algorithms for the Web Graph (WAW 2006), November 29 to December 2, 2006
2Community Structure Detection Problem
- The problem of identifying communities in a network is usually modeled as a graph clustering problem
- Vertices correspond to individual items
- Edges describe relationships
- The communities correspond to subgraphs
- Dense connections between vertices from the same subgraph
- Fewer connections between vertices in different subgraphs
3Motivation: Why detect communities?
- Analyze and understand the information contained in the huge amount of data available on the WWW
- Finding related commercial items
- Recommendation systems
- Important for
- Social networks
- Ad-hoc networks
- Protein interaction networks
- Genetic networks
4Motivation: Why detect communities?
- Predict how much someone is going to love a movie based on their movie preferences
Grand Prize: 1,000,000
5Outline of the talk
- Previous work
- Graph partitioning problem
- Our approach
- Modularity
- Reduction
- Multilevel graph partitioning
- Experimental results
- Conclusions
6Previous Work
- Two main classes
- Agglomerative Methods (addition of edges)
- Divisive Methods (removal of edges)
- Algorithms based on
- Laplacian Matrix
- Centrality measures
- Flow models
- Random walks
- Resistor networks
- Optimization
- Not fast enough or inaccurate
7Graph Partitioning Problem
- Given a graph G = (V, E), find a partition such that
- The partition is balanced (i.e., the numbers of vertices in all subsets are roughly equal)
- Cut size is minimized (i.e., the number of edges with endpoints in different subsets is minimized)
- Previous Work
- Kernighan-Lin algorithm
- Spectral partitioning
- Multilevel algorithms
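The two quantities in the problem statement above can be made concrete with a minimal Python sketch. The edge list, the `part` dictionary, and the function names are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def cut_size(edges, part):
    """Number of edges whose endpoints lie in different subsets."""
    return sum(1 for u, v in edges if part[u] != part[v])

def subset_sizes(part):
    """Number of vertices assigned to each subset (used to check balance)."""
    return Counter(part.values())

# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(cut_size(edges, part))     # 1
print(dict(subset_sizes(part)))  # {0: 3, 1: 3}
```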
8Kernighan - Lin Algorithm
- Find an initial random partition
- Improve by a greedy procedure that swaps pairs of vertices from different partitions
- Minimize the size of the cut set
[Figure: a partition with vertices u and v labeled]
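A minimal sketch of the swap step, assuming an unweighted graph stored as adjacency sets and a two-way partition labeled 0 and 1. It uses the classical Kernighan-Lin gain D(u) + D(v) - 2c(u, v), where D is external minus internal degree; the function names and the example graph are hypothetical.

```python
def d_value(adj, part, v):
    """External minus internal degree of v (the Kernighan-Lin D value)."""
    ext = sum(1 for w in adj[v] if part[w] != part[v])
    return ext - (len(adj[v]) - ext)

def swap_gain(adj, part, u, v):
    """Reduction in cut size if u and v (in different parts) are swapped."""
    c_uv = 1 if v in adj[u] else 0
    return d_value(adj, part, u) + d_value(adj, part, v) - 2 * c_uv

def best_swap(adj, part):
    """Return (gain, u, v) for the best single swap across the cut."""
    side0 = [x for x in adj if part[x] == 0]
    side1 = [x for x in adj if part[x] == 1]
    return max((swap_gain(adj, part, u, v), u, v) for u in side0 for v in side1)

# Hypothetical example: two triangles joined by edge (2, 3), badly partitioned.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
part = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}   # cut size 5

print(best_swap(adj, part))  # (4, 3, 2): swapping 3 and 2 lowers the cut to 1
```

The full algorithm repeats such swaps in passes and locks swapped vertices; that bookkeeping is left out here.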
9Graph Partitioning vs Graph Clustering
- Graph clustering: find clusters; community sizes may differ; number of subsets varies
- Graph partitioning: minimize cut size; equal number of vertices in each subset; number of subsets is an input
- Algorithms for graph partitioning cannot be directly used to produce good-quality clusterings
10Our approach
- Convert the original graph G into a complete graph G'
- Find a min-cut of G' using a modified graph partitioning method
- This will produce a good-quality (high-modularity) clustering for G
11Modularity
- A useful measure of clustering quality
- Introduced by Newman [6]
- Modularity of a partitioning = (number of edges within communities) - (expected number of such edges)
- We are trying to find a division of the graph with high modularity
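As a rough illustration of the definition above, the sketch below computes modularity in its commonly used normalized form, Q = sum over communities c of m_c/m - (d_c/2m)^2, where m_c is the number of edges inside c and d_c is the sum of degrees in c. The division by m and the degree-based expectation are assumptions taken from that common form; the slide states the unnormalized difference. The function name and the example are hypothetical.

```python
def modularity(edges, community):
    """Normalized modularity: sum over communities of m_c/m - (d_c / 2m)^2."""
    m = len(edges)
    inside = {}   # m_c: edges with both endpoints in community c
    degree = {}   # d_c: sum of degrees of the vertices in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())

# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in range(6)}

print(round(modularity(edges, two_triangles), 4))  # 0.3571
print(modularity(edges, one_cluster))              # 0.0
```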
12Reduction
Min-Cut Problem: the problem of finding a minimum cut in a complete edge-weighted graph G'
Graph Clustering Problem: the problem of finding a clustering of maximum modularity in G
13Reduction
- Maximize modularity of a partitioning = (number of edges within communities) - (expected number of such edges)
Graph Clustering Problem: Maximize modularity
Minimize (-modularity) = (cut size) - (expected cut size)
Min-Cut Problem: Minimize cut size
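One way to see why maximizing modularity matches minimizing cut size minus expected cut size is the short derivation below. The notation is an assumption introduced for illustration: A_ij is 1 if vertices i and j are adjacent and 0 otherwise, p_ij is the edge probability under the random graph model, c_i is the community of vertex i, and delta is 1 when its arguments are equal. Modularity is taken in its unnormalized form.

```latex
\begin{align*}
Q &= \sum_{i<j} (A_{ij} - p_{ij})\,\delta(c_i, c_j) \\
  &= \underbrace{\sum_{i<j} (A_{ij} - p_{ij})}_{\text{constant}}
     - \sum_{i<j} (A_{ij} - p_{ij})\,\bigl(1 - \delta(c_i, c_j)\bigr) \\
-Q &= \underbrace{\sum_{i<j} A_{ij}\,\bigl(1 - \delta(c_i, c_j)\bigr)}_{\text{cut size}}
     - \underbrace{\sum_{i<j} p_{ij}\,\bigl(1 - \delta(c_i, c_j)\bigr)}_{\text{expected cut size}}
     - \text{constant}
\end{align*}
```

The constant does not depend on the partition, so a partition that minimizes the weight of the cut in a complete graph with pair weights A_ij - p_ij also maximizes Q.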
14Random Graph Models
pij: the probability that there is an edge between vertices i and j in a random graph from a given distribution
Chung-Lu Model
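A small sketch of two choices for pij, assuming the standard forms: a constant p for the Erdos-Renyi model (here chosen as m / C(n, 2) so that the expected number of edges is m, which is an assumption) and d_i d_j / 2m for the Chung-Lu model. The helper `gprime_weight`, which returns the weight a pair would carry in the complete graph G', is a hypothetical name.

```python
def erdos_renyi_p(n, m):
    """Erdos-Renyi: every pair has the same probability p = m / C(n, 2)."""
    return m / (n * (n - 1) / 2)

def chung_lu_p(deg, i, j, m):
    """Chung-Lu: p_ij = d_i * d_j / (2m), proportional to the degree product."""
    return deg[i] * deg[j] / (2 * m)

def gprime_weight(adj, deg, i, j, m):
    """Weight of pair (i, j) in the complete graph G': a_ij - p_ij (Chung-Lu)."""
    a_ij = 1 if j in adj[i] else 0
    return a_ij - chung_lu_p(deg, i, j, m)

# Hypothetical example: two triangles joined by edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
deg = {v: len(adj[v]) for v in adj}
m = sum(deg.values()) // 2

print(erdos_renyi_p(len(adj), m))
print(gprime_weight(adj, deg, 2, 3, m))  # adjacent pair: positive weight
print(gprime_weight(adj, deg, 0, 4, m))  # non-adjacent pair: negative weight
```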
15Multilevel graph partitioning
- A fast and accurate method for producing high-quality partitions
- Consists of three phases
- Coarsening phase
- Partitioning phase
- Uncoarsening and refinement phase
16Coarsening Phase
- Find a maximal matching and collapse each matched edge into a single vertex
- Recursive coarsening
- <G = G1, G2, ..., Gk>
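One coarsening step can be sketched as follows, assuming a symmetric weighted adjacency dictionary and a simple greedy maximal matching. Production partitioners typically prefer heavy-edge matching and also carry vertex weights; both are omitted here, and all names are illustrative.

```python
def maximal_matching(adj):
    """Greedily match each unmatched vertex with an unmatched neighbor."""
    mate = {}
    for u in adj:
        if u in mate:
            continue
        for v in adj[u]:
            if v not in mate and v != u:
                mate[u], mate[v] = v, u
                break
    return mate

def coarsen(adj):
    """Collapse matched pairs; return (coarse_adj, map from fine to coarse vertex)."""
    mate = maximal_matching(adj)
    cmap, next_id = {}, 0
    for u in adj:
        if u in cmap:
            continue
        cmap[u] = next_id
        if u in mate:
            cmap[mate[u]] = next_id
        next_id += 1
    coarse = {c: {} for c in range(next_id)}
    for u in adj:
        for v, w in adj[u].items():
            cu, cv = cmap[u], cmap[v]
            if cu != cv:  # an edge inside a collapsed pair disappears
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse, cmap

# Hypothetical example: the path 0-1-2-3 with unit edge weights.
path = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}
coarse, cmap = coarsen(path)
print(cmap)    # {0: 0, 1: 0, 2: 1, 3: 1}
print(coarse)  # {0: {1: 1}, 1: {0: 1}}
```

Applying `coarsen` repeatedly to its own output gives the sequence of successively smaller graphs.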
17Partitioning Phase
- Greedy graph growing partitioning
- Partition Gk
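A minimal sketch of greedy graph growing on an unweighted graph: start from a seed vertex and repeatedly add the frontier vertex whose move reduces the cut the most. The `target_size` stopping rule belongs to the classical balanced variant and is an assumption here; the seed choice, function name, and example are also hypothetical.

```python
def greedy_graph_growing(adj, seed, target_size):
    """Grow a region from `seed`; return a 0/1 label for every vertex."""
    region = {seed}
    while len(region) < target_size:
        frontier = {v for u in region for v in adj[u]} - region
        if not frontier:
            break

        def gain(v):
            # Decrease in cut size if v joins the region.
            inside = sum(1 for w in adj[v] if w in region)
            return inside - (len(adj[v]) - inside)

        region.add(max(sorted(frontier), key=gain))
    return {v: (0 if v in region else 1) for v in adj}

# Hypothetical example: two triangles joined by edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(greedy_graph_growing(adj, seed=0, target_size=3))
# {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```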
18Uncoarsening and Refinement Phase
- Project the partitioning Pi of Gi to a partitioning Pi-1 of Gi-1
- More degrees of freedom at Gi-1 than at Gi
- Improve Pi-1 using the KL algorithm
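The projection step is simple enough to show directly. In this sketch `cmap` is assumed to map each vertex of the finer graph to the coarse vertex it was collapsed into; the names are illustrative, and the refinement that follows the projection is not shown.

```python
def project(coarse_part, cmap):
    """Give each fine vertex the part of the coarse vertex it was collapsed into."""
    return {v: coarse_part[c] for v, c in cmap.items()}

# Hypothetical example: fine vertices 0,1 were merged into coarse vertex 0
# and fine vertices 2,3 into coarse vertex 1.
cmap = {0: 0, 1: 0, 2: 1, 3: 1}
coarse_part = {0: 0, 1: 1}
print(project(coarse_part, cmap))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

Because each fine vertex can afterwards be moved on its own, refinement on the finer graph can reach partitions that the coarser graph could not represent.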
19Implementation
- Our implementation is based on the graph partitioning package METIS [3], which employs a multilevel strategy
- Convert the graph partitioning algorithm into a clustering one
- The optimal clustering might not be balanced.
- We ignore the restrictions that control the sizes of the parts.
- The number of parts in the optimal clustering is not known.
- We employ a recursive bisection procedure.
- The original graph G might be sparse, while the transformed one G' is complete. Our algorithm does not explicitly generate G'.
20Modularity: Erdos-Renyi Model
- (-Modularity) = cut size - n1 n2 p
- After moving one vertex from part 2 to part 1: (-Modularity) = cut size - (n1 + 1)(n2 - 1) p
- n1, n2: numbers of vertices in the two parts
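The two expressions for the Erdos-Renyi model differ only in the expected-cut term, so the effect of moving one vertex from part 2 to part 1 can be updated in constant time once the change in cut size is known. A short worked step, with the Delta notation introduced here for illustration:

```latex
\begin{align*}
\Delta(\text{expected cut size}) &= (n_1 + 1)(n_2 - 1)\,p - n_1 n_2\,p = (n_2 - n_1 - 1)\,p \\
\Delta(-\text{Modularity}) &= \Delta(\text{cut size}) - (n_2 - n_1 - 1)\,p
\end{align*}
```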
21Modularity: Chung-Lu Model
- (-Modularity) = cut size - w1 w2 / 2m
- After moving vertex v from part 2 to part 1: (-Modularity) = cut size - (w1 + w(v))(w2 - w(v)) / 2m
- wi: sum of degrees in partition i
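The Chung-Lu expressions suggest that the effect of moving a single vertex can be evaluated from its own neighbors and the two degree sums, without building the complete graph. The sketch below illustrates that idea for a bisection with parts labeled 0 and 1; it assumes w(v) is the degree of v, and the function name and data layout are hypothetical rather than taken from the implementation.

```python
def move_gain(adj, part, v, w, m):
    """Change in (-modularity) if v moves to the other part (negative improves)."""
    src = part[v]
    d = len(adj[v])                               # w(v): degree of v
    same = sum(1 for u in adj[v] if part[u] == src)
    delta_cut = same - (d - same)                 # same-part edges become cut edges
    w_src, w_dst = w[src], w[1 - src]
    delta_expected = ((w_src - d) * (w_dst + d) - w_src * w_dst) / (2 * m)
    return delta_cut - delta_expected

# Hypothetical example: two triangles joined by edge (2, 3), split by triangle.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
m = sum(len(nbrs) for nbrs in adj.values()) // 2
w = {0: 7, 1: 7}                                  # sum of degrees in each part

print(round(move_gain(adj, part, 2, w, m), 4))    # 1.6429, so the move is rejected
```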
22Analysis
- Time Complexity O(nm)
- Experiments
- Random Graphs
- k-community graphs
- nd.edu
23Experiment I: Random Graphs
- We generated random graphs with 128 vertices and 4 communities of size 32 each
- The expected degree of any vertex is 16
- Out degree varies
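A generator for graphs of this kind might look like the sketch below. It assumes that "out degree" means the expected number of edges from a vertex to other communities (called `z_out` here) and that edges are drawn independently, as in the usual planted-partition benchmark; these assumptions, the function name, and the fixed random seed are illustrative and not taken from the talk.

```python
import random

def planted_partition(z_out, n=128, k=4, degree=16, seed=0):
    """Random graph with k equal communities and expected degree `degree`."""
    rng = random.Random(seed)
    size = n // k
    p_in = (degree - z_out) / (size - 1)   # probability of a within-community edge
    p_out = z_out / (n - size)             # probability of a between-community edge
    community = {v: v // size for v in range(n)}
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < (p_in if community[u] == community[v] else p_out)]
    return edges, community

edges, community = planted_partition(z_out=6)
print(len(edges), 2 * len(edges) / len(community))  # edge count, average degree
```

Increasing `z_out` toward 8 makes the communities harder to separate, since each vertex then has as many expected neighbors outside its community as inside.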
24Experiment II: k-community graphs
- We generated graphs with k communities
- The size of each community is 100
- The expected number of edges inside a community equals the expected number of edges leaving that community
- The probability of an edge inside a community varies between 0.5 and 0.1
- Results show that the graphs are clustered approximately 99% correctly
25Experiment III: nd.edu
- Data consists of the complete map of the nd.edu domain, which contains 325,729 documents and 1,090,108 links
- Our algorithm clusters this graph into 280 clusters with modularity 0.925579
- This high modularity indicates strong community structure in the graph
- We show the dendrogram generated by our algorithm
- The sizes of the rectangles are proportional to the sizes of the communities
26Conclusions
- Community structure detection problem
- A scalable algorithm
- Based on multilevel graph partitioning
- Uses modularity as a quality measure