Title: Structure and models of real-world graphs and networks
1Structure and models of real-world graphs and
networks
- Jure Leskovec
- Machine Learning Department
- Carnegie Mellon University
- jure_at_cs.cmu.edu
- http://www.cs.cmu.edu/jure/
2Networks (graphs)
3Examples of networks
- Internet (a)
- citation network (b)
- World Wide Web (c)
- sexual network (d)
- food web (e)
4Networks of the real-world (1)
- Information networks
- World Wide Web hyperlinks
- Citation networks
- Blog networks
- Social networks: people interactions
- Organizational networks
- Communication networks
- Collaboration networks
- Sexual networks
- Technological networks
- Power grid
- Airline, road, river networks
- Telephone networks
- Internet
- Autonomous systems
Florence families
Karate club network
Collaboration network
Friendship network
5Networks of the real-world (2)
- Biological networks
- metabolic networks
- food web
- neural networks
- gene regulatory networks
- Language networks
- Semantic networks
- Software networks
Semantic network
Yeast protein interactions
Language network
XFree86 network
6Types of networks
- Directed/undirected
- Multi graphs (multiple edges between nodes)
- Hyper graphs (edges connecting multiple nodes)
- Bipartite graphs (e.g., papers to authors)
- Weighted networks
- Different type nodes and edges
- Evolving networks
- Nodes and edges only added
- Nodes, edges added and removed
7Traditional approach
- Sociologists were first to study networks
- Study of patterns of connections between people to understand the functioning of society
- People are nodes, interactions are edges
- Questionnaires are used to collect link data (hard to obtain, inaccurate, subjective)
- Typical questions: centrality and connectivity
- Limited to small graphs (around 10 nodes) and properties of individual nodes and edges
8New approach (1)
- Large networks (e.g., web, internet, on-line social networks) with millions of nodes
- Many traditional questions are not useful anymore
- Traditional: What happens if a node U is removed?
- Now: What percentage of nodes needs to be removed to affect network connectivity?
- Focus moves from a single node to the study of statistical properties of the network as a whole
- Cannot draw (plot) the network and examine it
9New approach (2)
- What does the network look like, even if I can't look at it?
- Need statistical methods and tools to quantify large networks
- 3 parts/goals
- Statistical properties of large networks
- Models that help understand these properties
- Predict behavior of networked systems based on
measured structural properties and local rules
governing individual nodes
10Statistical properties of networks
- Features that are common to networks of different types
- Properties of static networks
- Small-world effect
- Transitivity or clustering
- Degree distributions (scale free networks)
- Network resilience
- Community structure
- Subgraphs or motifs
- Temporal properties
- Densification
- Shrinking diameter
11Small-world effect (1)
- Six degrees of separation (Milgram, 1960s)
- Random people in Nebraska were asked to send letters to stockbrokers in Boston
- Letters could only be passed to first-name acquaintances
- Only 25% of the letters reached the goal
- But they reached it in about 6 steps
- Measuring path lengths
- Diameter (longest shortest path): $\max_{i,j} d_{ij}$
- Effective diameter: distance at which 90% of all connected pairs of nodes can be reached
- Mean geodesic (shortest) distance $\ell$: the mean of $d_{ij}$ over all pairs of nodes, or the harmonic mean when some pairs are disconnected
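The three path-length measures above can be sketched with plain breadth-first search. This is a minimal illustration, assuming an unweighted, undirected graph stored as a dict of neighbor lists with integer node ids; the function names are hypothetical, and the effective diameter here is the integer (non-interpolated) version.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_length_stats(adj, quantile=0.9):
    """Return (diameter, effective diameter, mean distance) over connected pairs."""
    lengths = []
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if s < t:                      # count each unordered pair once
                lengths.append(d)
    lengths.sort()
    diameter = lengths[-1]
    # smallest distance within which at least `quantile` of connected pairs fall
    effective = lengths[math.ceil(quantile * len(lengths)) - 1]
    mean = sum(lengths) / len(lengths)
    return diameter, effective, mean

# A path graph 0-1-2-3-4 as a tiny example
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(path_length_stats(path))
```

Running BFS from every node costs O(n(n + m)), so on networks with millions of nodes one would sample source nodes instead of visiting all of them.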
12Small-world effect (2)
- Distribution of shortest path lengths
- Microsoft Messenger network
- 180 million people
- 1.3 billion edges
- Edge if two people exchanged at least one message
in one month period
13Small-world effect (3)
- Fact
- If the number of vertices within distance r grows exponentially with r, then the mean shortest path length $\ell$ increases as log n
- Implications
- Information (viruses) spreads quickly
- Erdos numbers are small
- Peer-to-peer networks (for navigation purposes)
- Shortest paths exist
- Humans are able to find the paths
- People only know their friends
- People do not have global knowledge of the network
- This suggests something special about the structure of the network
- On a random graph short paths exist but no one would be able to find them
14Transitivity or Clustering
- A friend of a friend is a friend
- If a connects to b, and b to c, then with high probability a connects to c.
- Clustering coefficient C
- $C = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}}$
- Alternative definition
- $C_i = \frac{\text{number of triangles connected to vertex } i}{\text{number of triples centered on vertex } i}$
- Clustering coefficients in the example graph: $C_i = 1, 1, 1/6, 0, 0$
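Both definitions can be checked on a small graph. The sketch below assumes an undirected simple graph stored as a dict of neighbor sets. The five-node example (a triangle with two pendant nodes attached to one corner) is an assumed graph chosen because it gives the local values 1, 1, 1/6, 0, 0; nodes with fewer than two neighbors are given $C_i = 0$ by convention.

```python
from itertools import combinations

def clustering(adj):
    """Return (global C, dict of local C_i) for an undirected simple graph."""
    closed = 0    # connected triples whose two end points are also linked
    triples = 0   # all connected triples (paths of length 2)
    local = {}
    for i, nbrs in adj.items():
        pairs = list(combinations(sorted(nbrs), 2))
        linked = sum(1 for a, b in pairs if b in adj[a])
        closed += linked
        triples += len(pairs)
        local[i] = linked / len(pairs) if pairs else 0.0
    # every triangle closes three triples, so `closed` = 3 * (number of triangles)
    return (closed / triples if triples else 0.0), local

# Triangle 0-1-2, with nodes 3 and 4 hanging off node 2
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2}, 4: {2}}
C, C_i = clustering(example)
print(C)     # 0.375, i.e. 3 * 1 triangle / 8 connected triples
print(C_i)
```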
15Transitivity or Clustering (2)
- Clustering coefficient scales as
- It is considerably higher than in a random graph
- It is speculated that in real networks
- $C = O(1)$ as $n \to \infty$
- In Erdos-Renyi random graph $C = O(n^{-1})$
Synonyms network
World Wide Web
16Degree distributions (1)
- Let $p_k$ denote the fraction of nodes with degree k
- We can plot a histogram of $p_k$ vs. k
- In an Erdos-Renyi random graph the degree distribution follows a Poisson distribution
- Degrees in real networks are heavily skewed to the right
- Distribution has a long tail of values that are far above the mean
- Heavy (long) tail
- Amazon sales
- word length distribution,
17Detour: how long is the long tail?
This is not directly related to graphs, but it nicely explains the long tail effect. It shows that there is a big market for niche products.
18Degree distributions (2)
- Many real world networks contain hubs: highly connected nodes
- We can easily distinguish between exponential and power-law tails by plotting on log-lin and log-log axes
- In scale-free networks maximum degree scales as $n^{1/(\alpha-1)}$
[Figure: degree distribution in a blog network, $p_k$ vs. k on lin-lin, log-lin and log-log axes]
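The log-lin versus log-log test can be made numeric: an exponential tail is a straight line on log-lin axes, a power-law tail is a straight line on log-log axes. The sketch below compares the two straight-line fits by their R². The function names are hypothetical, and a least-squares fit on log-log axes is only a rough visual-style check, not a rigorous estimator of the exponent.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope, intercept and R^2 for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, (sxy * sxy) / (sxx * syy)

def tail_shape(ks, pks):
    """Say which axes give the straighter line; also return the log-log slope."""
    log_p = [math.log(p) for p in pks]
    _, _, r2_loglin = fit_line(ks, log_p)
    slope, _, r2_loglog = fit_line([math.log(k) for k in ks], log_p)
    kind = "power-law" if r2_loglog > r2_loglin else "exponential"
    return kind, slope

ks = list(range(1, 51))
print(tail_shape(ks, [k ** -2.5 for k in ks]))          # straight on log-log axes
print(tail_shape(ks, [math.exp(-0.3 * k) for k in ks]))  # straight on log-lin axes
```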
19Poisson vs. Scale-free network
Poisson network (Erdos-Renyi random graph)
- Degree distribution is Poisson
Scale-free (power-law) network
- Degree distribution is power-law
- A function is scale free if $f(ax) = b f(x)$
20Degree distribution: number of people a person talks to on Microsoft Messenger
[Figure: count vs. node degree, with the highest degree marked]
21Network resilience (1)
- We observe how the connectivity (length of the paths) of the network changes as the vertices get removed
- Vertices can be removed
- Uniformly at random
- In order of decreasing degree
- It is important for epidemiology
- Removal of vertices corresponds to vaccination
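A small experiment along these lines can be sketched as follows. As a simplification, this sketch tracks the size of the largest connected component rather than path lengths, and it ranks nodes by their degree in the original graph (degrees are not recomputed after each removal). Function names and the star-graph example are illustrative assumptions.

```python
import random

def largest_component(adj, removed):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen, best = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

def resilience_curve(adj, strategy, rng=random):
    """Largest-component size after removing 0, 1, 2, ... nodes."""
    nodes = list(adj)
    if strategy == "degree":      # targeted attack: highest degree first
        nodes.sort(key=lambda u: len(adj[u]), reverse=True)
    else:                         # random failures
        rng.shuffle(nodes)
    return [largest_component(adj, nodes[:i]) for i in range(len(nodes) + 1)]

# A star graph: removing the hub first shatters it immediately
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
print(resilience_curve(star, "degree"))
print(resilience_curve(star, "random", random.Random(1)))
```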
22Network resilience (2)
- Real-world networks are resilient to random attacks
- One has to remove all web pages of degree > 5 to disconnect the web
- But this is a very small percentage of web pages
- Random network has better resilience to targeted attacks
[Figure: mean path length vs. fraction of removed nodes, under preferential removal and random removal, for a random network and for the Internet (autonomous systems)]
23Community structure
- Most social networks show community structure
- Groups have higher density of edges within than across groups
- People naturally divide into groups based on interests, age, occupation, ...
- How to find communities
- Spectral clustering (embedding into a low-dim space)
- Hierarchical clustering based on connection strength
- Combinatorial algorithms
- Block models
- Diffusion methods
Friendship network of children in a school
24MSN Messenger
- Graphs have a giant component
- Distribution of connected components follows a power law
[Figure: distribution of connected components in the MSN Messenger network, count vs. size (number of nodes), with the largest component marked]
[Figure: growth of the largest component over time in a citation network]
25Network motifs (1)
- What are the building blocks (motifs) of networks?
- Do motifs have specific roles in networks?
- Network motifs detection process
- Count how many times each subgraph appears
- Compute statistical significance for each subgraph: the probability of it appearing in a random network as often as in the real network
3 node motifs
26Network motifs (2)
- Biological networks
- Feed-forward loop
- Bi-fan motif
- Web graph
- Feedback with two mutual dyads
- Mutual dyad
- Fully connected triad
27Network motifs (3)
Transcription networks
Signal transduction networks
WWW and friendship networks
Word adjacency networks
28Networks over time: Densification
- A very basic question: What is the relation between the number of nodes and the number of edges in a network?
- Networks are becoming denser over time
- The number of edges grows faster than the number of nodes: average degree is increasing
- a: densification exponent, $E(t) \propto N(t)^a$ with $1 \le a \le 2$
- a = 1: linear growth, constant out-degree (assumed in the literature so far)
- a = 2: quadratic growth, clique
[Figure: E(t) vs. N(t) for the Internet (a = 1.2) and for citations (a = 1.7)]
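Given snapshots of a growing network, the densification exponent can be estimated as the slope of log E(t) against log N(t). The sketch below assumes the relation $E(t) \propto N(t)^a$ and uses an ordinary least-squares fit; the function name and the synthetic snapshot data are illustrative, not measurements from the talk.

```python
import math

def densification_exponent(nodes, edges):
    """Least-squares slope of log E(t) against log N(t)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic snapshots generated with a = 1.5, so the fit should recover 1.5
nodes = [100, 200, 400, 800, 1600]
edges = [n ** 1.5 for n in nodes]
print(densification_exponent(nodes, edges))
```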
29Densification and degree distribution
- How does densification affect the degree distribution?
- Given densification exponent a, the degree exponent $\gamma$ behaves as follows
- (a) For $\gamma$ constant over time, we obtain densification only for $1 < \gamma < 2$, and then $\gamma = 2/a$
- (b) For $\gamma > 2$ the degree distribution has to evolve over time
- Power-law: $y = b x^{-\gamma}$; for $\gamma < 2$, $E[y] = \infty$
- $p_k \propto k^{-\gamma}$
[Figure: degree exponent $\gamma(t)$ over time; (a) a = 1.1, (b) a = 1.6]
30Shrinking diameters
- Intuition says that distances between the nodes slowly grow as the network grows (like log n)
- But as the network grows the distances between nodes slowly decrease
[Figures: Internet; Citations]
31Models of network generation and evolution
32Recap (1)
- Last time we saw
- Large networks (web, on-line social networks) are here
- Many traditional questions are not useful anymore
- We cannot plot the network so we need statistical methods and tools to quantify large networks
- 3 parts/goals
- Statistical properties of large networks
- Models that help understand these properties
- Predict behavior of networked systems based on
measured structural properties and local rules
governing individual nodes
33Recap (2)
- We also saw features that are common to networks of various types
- Properties of static networks
- Small-world effect
- Transitivity or clustering
- Degree distributions (scale free networks)
- Network resilience
- Community structure
- Subgraphs or motifs
- Temporal properties
- Densification
- Shrinking diameter
34Outline for today
- We will see the network generative models for modeling network features
- Erdos-Renyi random graph
- Exponential random graphs (p*) model
- Small world model
- Preferential attachment
- Community guided attachment
- Forest fire model
- Fitting models to real data
- How to generate a synthetic realistic looking
network?
35(Erdos-Renyi) Random graphs
- Also known as Poisson random graphs or Bernoulli graphs
- Given n vertices connect each pair i.i.d. with probability p
- Two variants
- $G_{n,p}$: a graph with m edges appears with probability $p^m (1-p)^{M-m}$, where $M = \frac{1}{2} n(n-1)$ is the max number of edges
- $G_{n,m}$: graphs with n nodes, m edges
- Very rich mathematical theory: many properties are exactly solvable
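Both variants are short to write down. A minimal sketch, with hypothetical function names: `gnp` flips an independent coin for each of the M possible edges, and `gnm` draws exactly m distinct edges uniformly at random. Enumerating all M pairs is quadratic in n, which is fine for illustration but not for very large graphs.

```python
import random
from itertools import combinations

def gnp(n, p, rng=random):
    """G(n,p): include each of the n(n-1)/2 possible edges independently with prob. p."""
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]

def gnm(n, m, rng=random):
    """G(n,m): choose exactly m distinct edges uniformly at random."""
    return rng.sample(list(combinations(range(n), 2)), m)

rng = random.Random(0)
g = gnp(1000, 0.004, rng)
# average degree k = 2m/n should be close to p * (n - 1), here about 4
print(2 * len(g) / 1000)
```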
36Properties of random graphs
- Degree distribution is Poisson since the presence and absence of edges is independent
- Giant component: average degree $k = 2m/n$
- $k = 1 - \epsilon$: all components are of size on the order of log n
- $k = 1 + \epsilon$: there is 1 component of size on the order of n
- All others are of size on the order of log n
- They are a tree plus an edge, i.e., cycles
- Diameter: $\log n / \log k$
37Evolution of a random graph
[Figure: evolution of a random graph as a function of average degree k, with a curve computed for non-GCC vertices]
38Subgraphs in random graphs
- Expected number of subgraphs $H(v,e)$ in $G_{n,p}$ is $E[X] = \binom{n}{v} \frac{v!}{a} p^e \approx \frac{n^v p^e}{a}$
a... of isomorphic graphs
39Random graphs conclusion
Configuration model
- Pros
- Simple and tractable model
- Phase transitions
- Giant component
- Cons
- Degree distribution
- No community structure
- No degree correlations
- Extensions
- Configuration model
- Random graphs with arbitrary degree sequence
- Excess degree: degree of a vertex at the end of a random edge, $q_k \propto k p_k$
40Exponential random graphs (p* models)
- Comes from social sciences
- Let $\{\epsilon_i\}$ be a set of measurable properties of a graph (number of edges, number of nodes of a given degree, number of triangles, ...)
- Exponential random graph model defines a probability distribution over graphs
Examples of $\epsilon_i$
41Exponential random graphs
- Includes Erdos-Renyi as a special case
- Assume parameters $\beta_i$ are specified
- No analytical solutions for the model
- But can use simulation to sample the graphs
- Define local moves on a graph
- Addition/removal of edges
- Movement of edges
- Edge swaps
- Parameter estimation
- maximum likelihood
- Problem
- Can't solve for transitivity (produces cliques)
- Used to analyze small networks
Example of parameter estimates
42Small-world model
- Used for modeling network transitivity
- Many networks assume some kind of geographical proximity
- Small-world model
- Start with a low-dimensional regular lattice
- Rewire
- Add/remove edges to create shortcuts to join remote parts of the lattice
- For each edge, with prob. p move the other end to a random vertex
- Rewiring allows interpolation between a regular lattice and a random graph
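The rewiring step can be sketched on a one-dimensional ring lattice. This is a minimal illustration under stated assumptions: each node starts linked to its k nearest neighbors on each side (so n must exceed 2k), and a rewired edge avoids self-loops and duplicate edges. The function name is hypothetical.

```python
import random

def small_world(n, k, p, rng=random):
    """Ring lattice (k neighbors on each side) with each edge rewired with prob. p."""
    nbrs = {i: set() for i in range(n)}
    lattice = [(i, (i + s) % n) for i in range(n) for s in range(1, k + 1)]
    for u, v in lattice:
        nbrs[u].add(v)
        nbrs[v].add(u)
    for u, v in lattice:
        if rng.random() < p:
            # keep end u fixed, move the other end to a random vertex
            candidates = [w for w in range(n) if w != u and w not in nbrs[u]]
            if not candidates:
                continue
            w = rng.choice(candidates)
            nbrs[u].discard(v)
            nbrs[v].discard(u)
            nbrs[u].add(w)
            nbrs[w].add(u)
    return nbrs

regular = small_world(20, 2, 0.0)                       # p = 0: regular lattice
shortcut = small_world(20, 2, 0.1, random.Random(0))    # a few shortcuts
print(sorted(len(v) for v in regular.values()))
print(sorted(len(v) for v in shortcut.values()))
```

Every rewiring removes one edge and adds one, so the number of edges stays at n·k for any p.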
43Small-world model
- Regular lattice (p = 0)
- Clustering coefficient $C = (3k-3)/(4k-2) \approx 3/4$
- Mean distance $L/4k$
- Almost random graph (p = 1)
- Clustering coefficient $C = 2k/L$
- Mean distance $\log L / \log k$
- No power-law degree distribution
Rewiring probability p
Degree distribution
44Models of evolution
- Models of network evolution
- Preferential attachment
- Edge copying model
- Community Guided Attachment
- Forest Fire model
- Models for realistic network generation
- Kronecker graphs
45Preferential attachment
- Models the growth of the network
- Preferential attachment (Price 1965, Albert and Barabasi 1999)
- Add a new node, create m out-links
- Probability of linking to a node i is proportional to its degree $k_i$
- Based on Herbert Simon's result
- Power-laws arise from "rich get richer" (cumulative advantage)
- Examples (Price 1965 for modeling citations)
- Citations: new citations of a paper are proportional to the number it already has
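The growth rule can be sketched in a few lines. The trick used here is to keep a list in which each node appears once per incident edge, so that a uniform draw from the list picks a node with probability proportional to its degree. The seed graph (a clique on m + 1 nodes) and the rule that a new node links to m distinct targets are assumptions of this sketch.

```python
import random

def preferential_attachment(n, m, rng=random):
    """Grow a graph to n nodes; each new node creates m degree-proportional links."""
    # assumed seed: a clique on m + 1 nodes, so every node starts with degree > 0
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # each node appears in `ends` once per incident edge
    ends = [u for edge in edges for u in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))   # degree-proportional choice
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges

edges = preferential_attachment(2000, 2, random.Random(0))
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(max(degree.values()), sum(degree.values()) / len(degree))
```

The maximum degree is typically far above the mean, which is the hub behaviour the model is meant to produce.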
46Preferential attachment
- Leads to power-law degree distributions
- But
- all nodes have equal (constant) out-degree
- one needs a complete knowledge of the network
- There are many generalizations and variants, but
the preferential selection is the key ingredient
that leads to power-laws
47Edge copying model
- Copying model
- Add a node and choose k, the number of edges to add
- With prob. $\beta$ select k random vertices and link to them
- With prob. $1-\beta$ edges are copied from a randomly chosen node
- Generates power-law degree distributions with exponent $1/(1-\beta)$
- Generates communities
- Related: random-surfer model
48Community guided attachment
- Want to model/explain densification in networks
- Assume community structure
- One expects many within-group friendships and
fewer cross-group ones
[Figure: self-similar university community structure, a tree with nodes University, Arts, Science, CS, Drama, Music, Math]
49Community guided attachment
- Assuming cross-community linking probability $f(h) = c^{-h}$ for nodes at tree distance h
- The Community Guided Attachment leads to Densification Power Law with exponent $a = 2 - \log_b c$
- a: densification exponent
- b: community tree branching factor
- c: difficulty constant, $1 \le c \le b$
- If c = 1: easy to cross communities
- Then a = 2, quadratic growth of edges, near clique
- If c = b: hard to cross communities
- Then a = 1, linear growth of edges, constant out-degree
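The two limiting cases can be checked numerically. The exponent formula used here, $a = 2 - \log_b c$, is an assumption taken from the Leskovec, Kleinberg and Faloutsos densification paper listed in the references; the function name is hypothetical.

```python
import math

def cga_densification_exponent(b, c):
    """Densification exponent a = 2 - log_b(c), assumed form, for 1 <= c <= b."""
    if b <= 1 or not 1 <= c <= b:
        raise ValueError("need b > 1 and 1 <= c <= b")
    return 2 - math.log(c, b)

print(cga_densification_exponent(4, 1))   # easy to cross communities: a = 2
print(cga_densification_exponent(4, 4))   # hard to cross communities: a = 1
print(cga_densification_exponent(4, 2))   # in between
```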
50Forest Fire Model
- Want to model graphs that densify and have shrinking diameters
- Intuition
- How do we meet friends at a party?
- How do we identify references when writing papers?
51Forest Fire Model for directed graphs
- The model has 2 parameters
- p: forward burning probability
- r: backward burning probability
- The model
- Each turn a new node v arrives
- Uniformly at random chooses an ambassador w
- Flip two geometric coins to determine the number of in- and out-links of w to follow (burn)
- Fire spreads recursively until it dies
- Node v links to all burned nodes
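The steps above can be sketched as follows. This is one possible reading, not a reference implementation: the "geometric coins" are assumed to count successes before the first failure, with success probability p for out-links and r for in-links (both must be below 1), and the links to follow are chosen uniformly among those leading to unburned nodes.

```python
import random

def geometric(prob, rng):
    """Number of successes before the first failure (each success has prob. `prob`)."""
    count = 0
    while rng.random() < prob:
        count += 1
    return count

def forest_fire(n, p, r, rng=random):
    """Directed Forest Fire sketch; returns a dict node -> set of out-neighbors."""
    out_links, in_links = {0: set()}, {0: set()}
    for v in range(1, n):
        ambassador = rng.randrange(v)        # uniformly random existing node
        burned, frontier = {ambassador}, [ambassador]
        while frontier:                      # fire spreads until it dies
            w = frontier.pop()
            fwd = [x for x in out_links[w] if x not in burned]
            bwd = [x for x in in_links[w] if x not in burned]
            rng.shuffle(fwd)
            rng.shuffle(bwd)
            chosen = fwd[:geometric(p, rng)] + bwd[:geometric(r, rng)]
            for x in chosen:
                if x not in burned:
                    burned.add(x)
                    frontier.append(x)
        out_links[v] = set(burned)           # v links to all burned nodes
        in_links[v] = set()
        for x in burned:
            in_links[x].add(v)
    return out_links

g = forest_fire(500, 0.35, 0.2, random.Random(0))
print(sum(len(s) for s in g.values()) / len(g))   # average out-degree
```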
52Forest Fire Model
- Simulation experiments
- Forest Fire generates graphs that densify and
have shrinking diameter
[Figure: densification, E(t) vs. N(t), labelled 1.32; diameter vs. N(t)]
53Forest Fire Model
- Forest Fire also generates graphs with
heavy-tailed degree distribution
[Figure: count vs. in-degree; count vs. out-degree]
54Forest Fire Parameter Space
- Fix backward probability r and vary forward burning probability p
- We observe a sharp transition between sparse and clique-like graphs
- Sweet spot is very narrow
[Figure: regions of the parameter space labelled sparse graph, increasing diameter, constant diameter, decreasing diameter and clique-like graph]
55Kronecker graphs
- Want to have a model that can generate a realistic graph
- Static Patterns
- Power Law Degree Distribution
- Small Diameter
- Power Law Eigenvalue and Eigenvector Distribution
- Temporal Patterns
- Densification Power Law
- Shrinking/Constant Diameter
- For Kronecker graphs all these properties can
actually be proven
56Kronecker Product a Graph
Intermediate stage
Adjacency matrix
Adjacency matrix
57Kronecker Product Definition
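- The Kronecker product of an N x M matrix A and a K x L matrix B is the NK x ML matrix given by
$A \otimes B = \begin{pmatrix} a_{1,1} B & \cdots & a_{1,M} B \\ \vdots & \ddots & \vdots \\ a_{N,1} B & \cdots & a_{N,M} B \end{pmatrix}$
- We define a Kronecker product of two graphs as a Kronecker product of their adjacency matrices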
58Stochastic Kronecker Graphs
- Create an $N_1 \times N_1$ probability matrix $P_1$
- Compute the kth Kronecker power $P_k$
- For each entry $p_{uv}$ of $P_k$ include an edge (u,v) with probability $p_{uv}$
[Figure: Kronecker multiplication takes $P_1$ to $P_2$; flipping biased coins gives the instance matrix $G_2$]
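The three steps can be sketched with plain nested lists. The function names and the 2 x 2 initiator values below are illustrative assumptions, not fitted parameters. Materializing $P_k$ costs $N_1^{2k}$ entries, so this only works for small k.

```python
import random

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def kron_power(p1, k):
    """k-th Kronecker power of the initiator matrix p1."""
    result = p1
    for _ in range(k - 1):
        result = kron(result, p1)
    return result

def sample_graph(pk, rng=random):
    """Include directed edge (u, v) independently with probability pk[u][v]."""
    return [(u, v) for u, row in enumerate(pk)
            for v, puv in enumerate(row) if rng.random() < puv]

p1 = [[0.9, 0.5], [0.5, 0.1]]      # illustrative initiator matrix
p3 = kron_power(p1, 3)             # 8 x 8 probability matrix
edges = sample_graph(p3, random.Random(0))
print(len(p3), len(edges))
```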
59Fitting Kronecker to Real Data
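- Given a graph G and Kronecker matrix $P_1$ we can calculate the probability that $P_1$ generated G, $P(G|P_1)$
[Figure: $P_1$, its Kronecker power $P_k$ and the graph G, related through $P(G|P_1)$ and the node labeling $\sigma$]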
60Fitting Kronecker: 2 challenges
- Invariance to node labeling $\sigma$ (there are N! labelings)
- Calculating $P(G|P_1)$ takes $O(N^2)$ (since one needs to consider every cell of the adjacency matrix)
[Figure: $P_1$, $P_k$, G and $P(G|P_1)$; the same graph under node labelings 1, 2, 3, 4 and 2, 1, 4, 3]
61Fitting Kronecker: Solutions
- Node labeling: can use MCMC sampling to average over (all) node labelings $\sigma$
- $P(G|P_1)$ takes $O(N^2)$: real graphs are sparse, so calculate $P(G_{empty})$ and then add the edges. This takes $O(E)$.
62Experiments on real AS graph
Degree distribution
Hop plot
Network value
Adjacency matrix eigen values
63Why fit generative models?
- Parameters tell us about the structure of a graph
- Extrapolation: given a graph today, how will it look in a year?
- Sampling: can I get a smaller graph with similar properties?
- Anonymization: instead of releasing the real graph (e.g., email network), we can release a synthetic version of it
64Processes taking place on networks
65Epidemiological processes
- The simplest way to spread a virus over the network
- S: Susceptible
- I: Infected
- R: Recovered (removed)
- SIS model: 2 parameters
- $\beta$: virus birth rate
- $\delta$: virus death (recovery) rate
- SIR model: as one gets cured, he or she cannot get infected again
SIS model
?it depends on ß and topology
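A discrete-time version of the SIS model can be sketched as follows. The synchronous update is an assumption of this sketch: in each step every infected node infects each susceptible neighbor with probability beta, and then recovers (becoming susceptible again) with probability delta. Function names are hypothetical.

```python
import random

def sis_step(adj, infected, beta, delta, rng=random):
    """One synchronous step of a discrete-time SIS process."""
    newly = set()
    for u in infected:
        for v in adj[u]:
            if v not in infected and rng.random() < beta:
                newly.add(v)
    recovered = {u for u in infected if rng.random() < delta}
    return (infected - recovered) | newly

def simulate_sis(adj, seeds, beta, delta, steps, rng=random):
    """Return the number of infected nodes after each step (index 0 = start)."""
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        history.append(len(infected))
    return history

# A ring of 50 nodes, one initially infected node
ring = {i: [(i - 1) % 50, (i + 1) % 50] for i in range(50)}
print(simulate_sis(ring, [0], beta=0.4, delta=0.2, steps=30, rng=random.Random(0)))
```

For the SIR variant, recovered nodes would be moved to a separate removed set and never infected again.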
66Epidemic threshold for SIS model
- How infectious does the virus need to be to survive in the network?
- First results on power-law networks suggested that any virus will prevail
- New result that works for any topology, with $s = \lambda_1 \beta / \delta$
- For s > 1 virus prevails
- For s < 1 virus dies
$\lambda_1$: largest eigenvalue of the graph adjacency matrix
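The threshold test needs only the largest eigenvalue of the adjacency matrix, which power iteration can approximate. The sketch below assumes the score has the form s = λ1·β/δ (an assumption, since the formula itself is not spelled out in the slide text) and iterates on A + I rather than A, a shift that avoids oscillation on bipartite graphs.

```python
def largest_eigenvalue(adj, iterations=200):
    """Approximate the largest adjacency eigenvalue by power iteration on A + I."""
    x = {u: 1.0 for u in adj}
    value = 1.0
    for _ in range(iterations):
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}   # (A + I) x
        value = max(abs(t) for t in y.values())
        x = {u: t / value for u, t in y.items()}
    return value - 1.0            # undo the shift by the identity

def virus_prevails(adj, beta, delta):
    """Return (s > 1, s) with the assumed score s = lambda_1 * beta / delta."""
    s = largest_eigenvalue(adj) * beta / delta
    return s > 1, s

# Complete graph on 4 nodes: largest eigenvalue is 3
k4 = {i: [j for j in range(4) if j != i] for i in range(4)}
print(virus_prevails(k4, beta=0.2, delta=0.5))
print(virus_prevails(k4, beta=0.1, delta=0.5))
```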
67Navigation in small-world networks
- Milgram's experiment showed
- (a) short paths exist in networks
- (b) humans are able to find them
- Assume the following setting
- Nodes of a graph are scattered on a plane
- Given a starting node u, we want to reach target node v
- A small-world navigation algorithm navigates the network by always moving to the neighbor that is closest (in Manhattan distance) to target node v
68Navigation in small-world networks
Network creation
- Start with random lattice
- Each node connects with its 4 immediate neighbors
- Long-range links are added with probability that decays as a power of the distance between the points, $p(u,v) \propto d(u,v)^{-\alpha}$
- It can be shown that only for $\alpha = 2$ the delivery time is poly-log in the number of nodes n
[Figure: delivery time T in relation to $n^{\beta}$]
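The network creation and the greedy navigation rule can be sketched together. Assumptions of this sketch: a regular side x side grid, exactly one long-range link per node drawn with probability proportional to $d^{-\alpha}$ in Manhattan distance, and directed long-range links. Building the network this way is quadratic in the number of nodes, so it is meant for small grids only.

```python
import random

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def build_grid(side, alpha, rng=random):
    """Grid neighbors plus one long-range link per node, p(u,v) ~ d(u,v)^(-alpha)."""
    nodes = [(x, y) for x in range(side) for y in range(side)]
    links = {}
    for u in nodes:
        x, y = u
        local = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < side and 0 <= y + dy < side]
        others = [v for v in nodes if v != u]
        weights = [manhattan(u, v) ** -alpha for v in others]
        links[u] = local + rng.choices(others, weights=weights, k=1)
    return links

def greedy_route(links, source, target):
    """Always step to the neighbor closest to the target (Manhattan distance)."""
    path = [source]
    while path[-1] != target:
        path.append(min(links[path[-1]], key=lambda v: manhattan(v, target)))
    return path

links = build_grid(20, 2, random.Random(0))
route = greedy_route(links, (0, 0), (19, 19))
print(len(route) - 1, "steps; grid distance is", manhattan((0, 0), (19, 19)))
```

Because some grid neighbor is always one step closer to the target, the greedy walk always terminates, and it never takes more steps than the grid distance.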
69Navigation in a real-world network
- Take a social network of 500k bloggers where for each blogger we know their geographical location
- Pick two nodes at random and navigate the network greedily by geography
- Results
- 13% success rate (vs. 18% for Milgram)
Distribution of path lengths
Friendships vs. distance
70Navigation in real-world network
- Geographical distance may not be the right kind of distance
- Since population is non-uniform, let's use rank-based friendship distance
- i.e., we measure the distance d(u,v) by the number of people living closer to v than u does
- Then
And the proof still works
71Some references used to prepare this talk
- The Structure and Function of Complex Networks, by Mark Newman
- Statistical mechanics of complex networks, by Reka Albert and Albert-Laszlo Barabasi
- Graph Mining: Laws, Generators and Algorithms, by Deepayan Chakrabarti and Christos Faloutsos
- An Introduction to Exponential Random Graph (p*) Models for Social Networks, by Garry Robins, Pip Pattison, Yuval Kalish and Dean Lusher
- Graph Evolution: Densification and Shrinking Diameters, by Jure Leskovec, Jon Kleinberg and Christos Faloutsos
- Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, by Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg and Christos Faloutsos
- Navigation in a Small World, by Jon Kleinberg
- Geographic routing in social networks, by David Liben-Nowell, Jasmine Novak, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins
- Some plots and slides borrowed from Lada Adamic, Mark Newman, Mark Joseph, Albert Barabasi, Jon Kleinberg, David Liben-Nowell, Sergi Valverde and Ricard Sole
72Rough random material that did not make it into the presentation
73Bow-tie structure of the web
[Figure: bow-tie structure with components SCC (56M), IN (44M), OUT (44M), TENDRILS (44M) and DISC (17M)]
Broder et al., WWW 2000; Dill et al., VLDB 2001
74Study of 3 websites
- Study of three universities' publicly indexable Web sites
75Australia
In- and out-degree distributions
76New Zealand
In- and out-degree distributions
77United Kingdom
In- and out-degree distributions
78We assume this node would like to connect to a centrally located node: a node whose distance to other nodes is minimized.
$d_{ij}$ is the Euclidean distance; $h_j$ is some measure of the centrality of node j; $\alpha$ is a parameter, a function of the final number n of points, gauging the relative importance of the two objectives
79Fabrikant et al. define 3 possible measures of centrality:
1. The average number of hops from other nodes
2. The maximum number of hops from another node
3. The number of hops from a fixed center of the tree
80$\alpha$ is the crux of the theorem! Why? Here are some examples
Fabrikant et al.
81If $\alpha$ is too low, then the Euclidean distances become unimportant, and the network resembles a star
Fabrikant et al.
82But if $\alpha$ grows at least as fast as $\sqrt{n}$, where n is the final number of points, then distance becomes too important, and minimum spanning trees with high degree occur, but with exponentially vanishing probability, thus not a power law. If $\alpha$ is anywhere in between, we have a power law.
Through a rather complex and elaborate proof, Fabrikant et al. prove this initial assumption will produce a power law distribution. I'll save you the math!