Title: The overlapping community structure of complex networks
Introduction
- Networks and complex systems
- The structure of networks
- Finding communities
- Divisive and agglomerative methods
- Network construction in examples
- Statistical features
- The importance of observing networks
1. Networks and complex systems
- Purpose
- understand the structural and fundamental properties of networks
- describe the global organization as the coexistence of structural subunits (communities)
- Local structural units: distribution and clustering properties
- Global features
- Communities: larger units in the network
- vertices that are more densely connected to each other than to the rest of the network
Examples
A person belongs to several communities at once: the scientific community, family, schoolmates, and the connections related to their hobby.
- Such building blocks include
- industrial sectors
- functionally related proteins
- word-association communities
- (next illustration)
The communities of the word "bright"
Problems with the identification of communities
- Different kinds of methods exist
- usually they do not allow for overlapping communities
- However, overlapping is important.
- they divide networks into smaller pieces
Nested and overlapping structure of the communities
Divisive and agglomerative methods fail to identify the communities when overlaps are significant
- We would like to discuss an approach to analysing the main statistical features
- we need new characteristic quantities
- Introduce a technique for exploring overlapping communities on a large scale
2. The structure of networks
- Clusters/communities
- Those parts of the network in which the nodes are more highly connected to each other than to the rest of the network.
- Membership number $m_i$
- the number of communities that node $i$ belongs to
- Overlap size between communities $\alpha$ and $\beta$
- $s^{ov}_{\alpha,\beta}$
- the number of nodes that communities $\alpha$ and $\beta$ share
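- Community degree $d^{com}_\alpha$
- the number of links of community $\alpha$ in the network of communities, where each link represents an overlap with another community
- Size of community $\alpha$: $s^{com}_\alpha$
- the number of nodes in community $\alpha$
- We would like to examine the distributions of these quantities
- $m$ → $P(m)$
- $s^{ov}$ → $P(s^{ov})$
- $d^{com}$ → $P(d^{com})$
- $s^{com}$ → $P(s^{com})$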
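The four quantities defined above can be computed directly once the communities are known. The following is a minimal sketch, assuming the communities are given as a list of Python sets of node labels; the function name `community_statistics` and the input format are illustrative choices, not part of the original material.

```python
from collections import Counter
from itertools import combinations

def community_statistics(communities):
    """communities: list of sets of nodes (assumed input format)."""
    # Membership number m_i: how many communities contain node i.
    membership = Counter(node for c in communities for node in c)
    # Community size s^com: number of nodes in each community.
    sizes = [len(c) for c in communities]
    # Overlap size s^ov: number of shared nodes, stored for overlapping pairs only.
    overlaps = {}
    for a, b in combinations(range(len(communities)), 2):
        shared = len(communities[a] & communities[b])
        if shared > 0:
            overlaps[(a, b)] = shared
    # Community degree d^com: number of other communities a community overlaps with.
    degree = [0] * len(communities)
    for a, b in overlaps:
        degree[a] += 1
        degree[b] += 1
    return membership, sizes, overlaps, degree

# Toy example with three overlapping communities.
communities = [{1, 2, 3, 4}, {4, 5, 6}, {3, 4, 7, 8}]
membership, sizes, overlaps, degree = community_statistics(communities)
```

In this toy example node 4 belongs to all three communities, so its membership number is 3, and every community overlaps with the other two.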
- k-clique: a complete subgraph of size k
- k-clique community
- the union of all k-cliques that can be reached from each other through a series of adjacent k-cliques
- → two k-cliques are adjacent if they share k-1 nodes
3-cliques and 3-clique percolation clusters
Figure: overlapping k-clique communities at k = 4. Overlaps: yellow-blue, 1 node; yellow-green, 2 nodes and 1 link.
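The definition of a k-clique community can be turned into code almost word for word. The sketch below is a brute-force illustration for small graphs only: it enumerates every k-node subset, keeps the complete ones, and merges k-cliques that share k-1 nodes using a union-find structure. The function name and the edge-list input are assumptions made for the example, and this is not the efficient procedure discussed later in the slides.

```python
from itertools import combinations

def k_clique_communities(edges, k):
    """edges: iterable of (u, v) pairs with sortable node labels (assumed format)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # All k-cliques: k-node subsets in which every pair of nodes is linked.
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over the k-cliques.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Adjacent k-cliques (sharing k-1 nodes) end up in the same group.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    # A community is the union of the nodes of all k-cliques in one group.
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(groups.values(), key=sorted)

# Two triangles sharing a link, plus a third triangle attached through node 4.
edges = [(1, 2), (2, 3), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (4, 6)]
result = k_clique_communities(edges, 3)
```

Here the triangles {1, 2, 3} and {2, 3, 4} share two nodes and merge into one community, while {4, 5, 6} shares only node 4 with them and forms a second community that overlaps the first in that node.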
3. Finding communities
- Requirements
- The method of identification
- cannot be too restrictive
- should be based on the density of links
- should be local
- should not yield any cut-node or cut-link
- should allow overlaps
- Algorithm
- We use an exponential algorithm
- → it proved to be more efficient than polynomial algorithms
- Procedure
- Locating all cliques of the network
- Identifying the communities by carrying out a standard component analysis of the clique-clique overlap matrix
- We use the method for binary networks
- undirected, unweighted links
- Arbitrary networks can always be transformed to binary ones
- ignore any directionality
- keep only those links that are stronger than a threshold w
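The transformation to a binary network can be sketched in a few lines. The example below assumes the input is a dictionary mapping directed (source, target) pairs to weights, and it assumes that two oppositely directed links are combined by adding their weights before the threshold is applied. The name `binarize` and the sample weights are illustrative only.

```python
def binarize(weighted_links, w):
    """weighted_links: dict mapping (source, target) to a weight (assumed format)."""
    undirected = {}
    for (u, v), weight in weighted_links.items():
        pair = frozenset((u, v))      # ignore directionality
        if len(pair) == 2:            # skip self-loops
            # Assumption: opposite directions are combined by summing their weights.
            undirected[pair] = undirected.get(pair, 0.0) + weight
    # Keep only the links that are stronger than the threshold w.
    return {pair for pair, weight in undirected.items() if weight > w}

links = {("a", "b"): 0.02, ("b", "a"): 0.01, ("b", "c"): 0.01, ("c", "d"): 0.5}
binary = binarize(links, 0.025)
```

With the threshold 0.025 the pair a-b survives only because its two directions are summed, while b-c is dropped.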
- Strategy
- experience with real networks shows that the typical size of the complete subgraphs is between 10 and 100
- → a complete subgraph of size s contains $\binom{s}{k}$ different k-cliques
- → locating the k-cliques individually and examining the adjacency between them would be extremely slow
- → do not look for k-cliques directly; rather
- 1. locate the large complete subgraphs
- 2. look for the k-clique-connected subsets for a given k by studying the overlaps between them
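To see why enumerating k-cliques individually is impractical, it helps to count them. A complete subgraph on s nodes contains one k-clique for every choice of k of its nodes, so the count is the binomial coefficient. The snippet below prints this count for a few subgraph sizes in the range quoted above; the choice k = 4 is just an example.

```python
from math import comb

k = 4
for s in (10, 20, 50, 100):
    # Number of distinct k-node subsets of a complete subgraph with s nodes.
    print(f"s = {s:3d}: {comb(s, k)} different {k}-cliques")
```

A single complete subgraph of 100 nodes already contains millions of 4-cliques, which motivates working with the large complete subgraphs themselves.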
- Method
- Extract all complete subgraphs (cliques)
- cliques have to be located in decreasing order of their size
- (first of all, the largest possible clique size has to be determined)
- start with this size
- repeatedly choose a node
- extract every clique of this size containing that node
- delete the node and its edges
- (so the same clique will not be found multiple times)
- when no nodes are left, the clique size is decreased by one
- Find the cliques of size s that contain node v
- construct set A
- A: nodes that are all linked to each other
- initially contains only v, then is enlarged until it reaches size s
- construct set B
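For experimenting with small graphs, the clique extraction step can be replaced by a textbook routine. The sketch below uses the basic Bron-Kerbosch recursion, which is a different technique from the size-ordered node-deletion procedure described above, and then sorts the maximal cliques by decreasing size. The adjacency-dictionary input and the function name are assumptions made for this example.

```python
def maximal_cliques(adj):
    """adj: dict mapping each node to the set of its neighbours (assumed format)."""
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates that can extend it, x: nodes already handled.
        if not p and not x:
            found.append(set(r))   # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(found, key=len, reverse=True)

edges = [(1, 2), (2, 3), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (4, 6), (6, 7)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
cliques = maximal_cliques(adj)
```

Without pivoting this recursion can be slow on dense graphs, so it should be read as an illustration of what the extraction step produces rather than as a substitute for the method in the slides.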
- Prepare the clique-clique overlap matrix
- (symmetric)
- Diagonal elements → size of the clique
- Off-diagonal elements → the number of common nodes
- k-clique communities: adjacent cliques share at least k-1 nodes
- → erase every off-diagonal entry smaller than k-1
- → erase every diagonal element smaller than k
- → replace the remaining elements by 1
- → carry out a component analysis of this matrix
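The matrix steps above can be sketched as follows, assuming the cliques have already been extracted and are given as a list of sets. The function name `communities_from_cliques` and the plain list-of-lists matrix are illustrative choices; the component analysis is done here with a simple depth-first search.

```python
def communities_from_cliques(cliques, k):
    """cliques: list of node sets, one per extracted clique (assumed format)."""
    n = len(cliques)
    # Overlap matrix: diagonal = clique size, off-diagonal = number of common nodes.
    overlap = [[len(cliques[i] & cliques[j]) for j in range(n)] for i in range(n)]
    # Erase diagonal entries below k and off-diagonal entries below k-1; set the rest to 1.
    keep = [[1 if overlap[i][j] >= (k if i == j else k - 1) else 0
             for j in range(n)] for i in range(n)]
    # Component analysis over the cliques that survived on the diagonal.
    seen, communities = set(), []
    for start in range(n):
        if keep[start][start] == 0 or start in seen:
            continue
        stack, members = [start], set()
        seen.add(start)
        while stack:
            i = stack.pop()
            members |= cliques[i]
            for j in range(n):
                if j not in seen and keep[j][j] and keep[i][j]:
                    seen.add(j)
                    stack.append(j)
        communities.append(members)
    return overlap, communities

cliques = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6, 7}, {7, 8}]
overlap, communities = communities_from_cliques(cliques, 3)
```

For k = 3 the first two cliques share two nodes and merge, the third stays separate although it shares node 5 with the merged community, and the clique of size 2 is discarded.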
- Efficiency
- CPU time depends very strongly on the structure of the input data
- Plotting the CPU time (t) against the number of edges (M) gives the fit
- $t = A M^{B \ln(M)}$
- (A, B: fitting parameters)
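A fit of this form becomes a straight-line fit after taking logarithms, since ln t = ln A + B (ln M)^2. The sketch below performs that least-squares fit in plain Python. The timings are synthetic, generated from the formula itself with arbitrary parameter values chosen only to check that the fit recovers them; they are not measurements from the original work.

```python
import math

def fit_runtime(edge_counts, times):
    """Fit t = A * M**(B * ln M) by linear regression of ln t on (ln M)**2."""
    x = [math.log(m) ** 2 for m in edge_counts]
    y = [math.log(t) for t in times]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic data: arbitrary illustrative parameters, not values from the source.
true_a, true_b = 1e-4, 0.11
edge_counts = [1000, 5000, 20000, 100000]
times = [true_a * m ** (true_b * math.log(m)) for m in edge_counts]
a, b = fit_runtime(edge_counts, times)
```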
- Further examples of local community structure
- The four communities of the word "gold"
- k = 4
- w = 0.025
- Communities of the word "day"
- k = 4
- w = 0.025
- Communities of the word "play"
- k = 4
- w = 0.025
- Community structure around a particular node
- We should scan through some ranges of k and w
- Examples
- 1. The social network of scientific collaborators
- 2. The communities of the word "bright" in the South Florida Free Association norms list
- 3. The communities of the protein Zds1 in the DIP core list of the protein-protein interactions of Saccharomyces cerevisiae
Social network of scientific collaborators (k = 4, w = 0.75)
The communities of the word "bright" (k = 4, w = 0.025)
The molecular-biological network of protein-protein interactions (k = 4, w = 0.75)
- We try to find the communities of proteins based on their interactions
- Most proteins can be associated with
- protein complexes
- certain functions
- For some proteins no function is known yet
- → appearing as part of a community can serve as a prediction of their function
- Example
- protein Ycr072c (essential for the viability of the cell)
- no biological function is available for it yet
- the most important biological process for its community
- ribosome biogenesis/assembly
- → our protein is likely to be involved in this process
- Network of the protein-protein interactions of S. cerevisiae (k = 4)
Divisive and agglomerative methods
- Divisive methods
- → cut the network into smaller and smaller pieces
- each node is forced to remain in only one community and becomes separated from its other communities
- → usually these fall apart and disappear
- example: "bright" → stays together with the words connected to light
- → most of its other communities disintegrate
- Agglomerative methods
- do the same in the reverse direction
- lead to a tree-like hierarchical rendering of the communities
The construction of the networks mentioned above
- 1. Co-authorship: each article → a contribution of 1/(n-1) to the weight of the link between every pair of its n authors
- 2. South Florida Free Association norms list
- → the weight of a directed link from one word to another indicates the frequency with which the people in the survey associated the end point of the link with its starting point
- → replace the directed links with undirected ones
- → weight equal to the sum of the weights of the corresponding two oppositely directed links
- 3. DIP (Database of Interacting Proteins) core list of the protein-protein interactions of Saccharomyces cerevisiae
- each interaction represents an unweighted link between the interacting proteins
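As an illustration of the first construction, the sketch below accumulates co-authorship link weights from a list of articles. It assumes that each article with n authors contributes 1/(n-1) to the link between every pair of its authors; the input format (a list of author-name lists) and the function name are hypothetical.

```python
from itertools import combinations

def coauthorship_weights(articles):
    """articles: list of author-name lists, one per article (assumed format)."""
    weights = {}
    for authors in articles:
        unique = sorted(set(authors))
        n = len(unique)
        if n < 2:
            continue  # a single-author article creates no links
        for pair in combinations(unique, 2):
            # Assumed contribution per article: 1/(n-1) for each pair of its n authors.
            weights[pair] = weights.get(pair, 0.0) + 1.0 / (n - 1)
    return weights

articles = [["A", "B"], ["A", "B", "C"], ["C"]]
weights = coauthorship_weights(articles)
```

In this toy input, authors A and B share a two-author and a three-author article, so their link weight is 1 + 1/2.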
4. Statistical features
- Values of k, w
- Purpose: we would like to analyse the statistical properties of the community structure of the entire network
- → find a community structure that is as highly structured as possible
- → this leads us to the percolation phenomenon
- If the number of links is increased above a critical point, a giant component appears.
- Approach the critical point!
- for each value of k (typically 3-6) we lower the threshold w until the largest community becomes twice as big as the second largest one
- → find as many communities as possible, but
- no giant community that smears out the details of the community structure by merging many smaller communities
- f: the fraction of links stronger than w
- use those k values for which f is not too small
- (not smaller than 0.5)
- co-authorship: k = 6, f = 0.93
- protein interaction network: k = 5, f = 0.75
- word association: k = 4, f = 0.67
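The threshold scan described above can be sketched as a simple loop. In the example below the community finder is represented by a hypothetical callback `community_sizes(w)` that returns the community sizes obtained at threshold w; the candidate thresholds are taken to be the distinct link weights, and the stopping rule (largest at least twice the second largest) is one plausible reading of the criterion.

```python
def choose_threshold(weights, community_sizes, ratio=2.0):
    """weights: list of link weights.
    community_sizes: hypothetical callback, w -> list of community sizes at threshold w."""
    chosen = None
    for w in sorted(set(weights), reverse=True):      # lower w step by step
        sizes = sorted(community_sizes(w), reverse=True)
        if len(sizes) >= 2 and sizes[0] >= ratio * sizes[1]:
            chosen = w
            break
    if chosen is None:
        return None, None
    # f: fraction of links that are stronger than the chosen threshold.
    f = sum(1 for x in weights if x > chosen) / len(weights)
    return chosen, f

# Stub standing in for a real community finder (made-up sizes for illustration).
weights = [0.1, 0.2, 0.3, 0.4, 0.5]
table = {0.5: [], 0.4: [4], 0.3: [4, 4], 0.2: [9, 4], 0.1: [20, 4]}
chosen, f = choose_threshold(weights, lambda w: table[w])
```

A practical version would also need to guard against the criterion being met by chance at very high thresholds, where only a handful of small communities exist.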
Statistics of the k-clique communities
- Cumulative distribution function of the community size: power law
- $P(s^{com}) \sim (s^{com})^{-\tau}$
- $\tau$ ranges between 1 and 1.6 (that is, the exponent $-\tau$ lies between -1 and -1.6)
- valid over nearly the entire range of community sizes
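A cumulative distribution of this kind, and a rough estimate of its power-law exponent, can be computed as sketched below. The convention used here, P(x) as the fraction of values greater than or equal to x, is an assumption, the sample sizes are made up for illustration, and a least-squares line on log-log axes is only a crude estimator of the exponent.

```python
import math

def cumulative_distribution(values):
    """Return (x, P(x)) pairs, with P(x) the fraction of values >= x (assumed convention)."""
    n = len(values)
    return [(x, sum(1 for v in values if v >= x) / n) for x in sorted(set(values))]

def loglog_slope(points):
    """Least-squares slope of log P against log x."""
    lx = [math.log(x) for x, _ in points]
    ly = [math.log(p) for _, p in points]
    n = len(points)
    mx, my = sum(lx) / n, sum(ly) / n
    return (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
            / sum((a - mx) ** 2 for a in lx))

# Made-up community sizes, for illustration only.
sizes = [3, 3, 3, 3, 4, 4, 6, 8, 12, 24]
points = cumulative_distribution(sizes)
slope = loglog_slope(points)   # a negative slope; its magnitude estimates the exponent
```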
- The cumulative distribution of the community degree
- starts exponentially, then crosses over to a power law
- exponential decay
- $P(d^{com}) \sim \exp(-d^{com}/d^{com}_0)$
- most of the communities have a size of the order of k, and
- their distribution dominates this part of the curve
- → a characteristic scale appears: $d^{com}_0 \approx k\delta$
- power-law tail: $P(d^{com}) \sim (d^{com})^{-\tau}$
- on average each node of a community contributes $\delta$ to the community degree
- → this power-law tail is proportional to that of the community size distribution
- The cumulative distribution of the overlap size
- close to a power law
- large exponent
- there is no characteristic overlap size in the network
- The cumulative distribution of the membership number, $P(m)$
- a node can belong to several communities
- collaboration and word-association networks
- no characteristic value
- the data are close to a power-law dependence with a large exponent
- protein-protein interaction network: the largest membership number is only 4
- (consistent with the similarly short distribution of its community degree)
- From the statistical features
- two communities overlapping with a given community are likely to overlap with each other as well
- (the average clustering coefficient is high)
- The specific scaling of $P(d^{com})$ is the signature of the hierarchical nature of the system
- (the network of the communities still exhibits a degree distribution with a fat tail, but a characteristic scale appears below which the distribution is exponential)
- → Complex systems have different levels of organization with units specific to each level
5. The importance of observing networks
- Community structure
- → prediction of some essential features of the system
- → possibility to zoom in on a unit and uncover its communities
- → interpret the local organization of large networks
- → predict how the modular structure changes if a unit is removed
- We can simultaneously look at the network at a higher level of organization and locate the communities.