Structure and models of real-world graphs and networks

Structure and models of real-world graphs and networks. Jure Leskovec, Machine Learning Department, Carnegie Mellon University, jure_at_cs.cmu.edu, http://www.cs.cmu.edu/~jure/


1
Structure and models of real-world graphs and
networks

Jure Leskovec
Machine Learning Department
Carnegie Mellon University
jure_at_cs.cmu.edu
http://www.cs.cmu.edu/~jure/

2
Networks (graphs)
3
Examples of networks
Figure panels: (a) Internet, (b) citation network, (c) World Wide Web, (d) sexual network, (e) food web

4
Networks of the real-world (1)

Information networks
World Wide Web hyperlinks
Citation networks
Blog networks
Social networks: people and their interactions
Organizational networks
Communication networks
Collaboration networks
Sexual networks
Technological networks
Power grid
Airline, road, river networks
Telephone networks
Internet
Autonomous systems

Figures: Florence families, karate club network, collaboration network, friendship network
5
Networks of the real-world (2)

Biological networks
metabolic networks
food web
neural networks
gene regulatory networks
Language networks
Semantic networks
Software networks

Figures: semantic network, yeast protein interactions, language network, XFree86 network
6
Types of networks

Directed/undirected
Multigraphs (multiple edges between nodes)
Hypergraphs (edges connecting multiple nodes)
Bipartite graphs (e.g., papers to authors)
Weighted networks
Nodes and edges of different types
Evolving networks
Nodes and edges only added
Nodes, edges added and removed

7
Traditional approach

Sociologists were the first to study networks
Study of patterns of connections between people to understand the functioning of society
People are nodes, interactions are edges
Questionnaires are used to collect link data (hard to obtain, inaccurate, subjective)
Typical questions: centrality and connectivity
Limited to small graphs (about 10 nodes) and properties of individual nodes and edges

8
New approach (1)

Large networks (e.g., web, internet, on-line social networks) with millions of nodes
Many traditional questions are not useful anymore
Traditional: What happens if a node u is removed?
Now: What percentage of nodes needs to be removed to affect network connectivity?
Focus moves from a single node to the study of statistical properties of the network as a whole
Cannot draw (plot) the network and examine it

9
New approach (2)

What does the network look like, even if I can't look at it?
Need statistical methods and tools to quantify large networks
3 parts/goals:
Statistical properties of large networks
Models that help understand these properties
Predict behavior of networked systems based on measured structural properties and local rules governing individual nodes

10
Statistical properties of networks

Features that are common to networks of different
types
Properties of static networks
Small-world effect
Transitivity or clustering
Degree distributions (scale free networks)
Network resilience
Community structure
Subgraphs or motifs
Temporal properties
Densification
Shrinking diameter

11
Small-world effect (1)

Six degrees of separation (Milgram, 1960s)
Random people in Nebraska were asked to send letters to a stockbroker in Boston
Letters could only be passed to first-name acquaintances
Only 25% of the letters reached the goal
But they reached it in about 6 steps
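Measuring path lengths
Diameter (longest shortest path): $\max_{i,j} d_{ij}$
Effective diameter: distance at which 90% of all connected pairs of nodes can be reached
Mean geodesic (shortest) distance $\ell$:
$\ell = \frac{1}{\frac{1}{2} n (n-1)} \sum_{i > j} d_{ij}$
or, when some pairs are disconnected, the harmonic mean:
$\ell^{-1} = \frac{1}{\frac{1}{2} n (n-1)} \sum_{i > j} d_{ij}^{-1}$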
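
The three path-length measures on this slide can be computed with breadth-first search on a small graph. The sketch below is only an illustration: the function names, the adjacency-dict representation and the 5-node path graph are assumptions, and the mean is taken over connected pairs only.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return {node: hop distance} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_stats(adj, percent=90):
    """Diameter, effective diameter and mean distance over connected pairs."""
    lengths = []
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if s < t:  # count each unordered pair once
                lengths.append(d)
    lengths.sort()
    diameter = lengths[-1]
    # Smallest distance within which at least `percent` % of pairs fall.
    index = (percent * len(lengths) + 99) // 100 - 1
    effective = lengths[max(index, 0)]
    mean = sum(lengths) / len(lengths)
    return diameter, effective, mean


# Hypothetical example: a path 0 - 1 - 2 - 3 - 4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
diameter, effective, mean = path_length_stats(path_graph)
```

Running one BFS per node costs O(n(n + m)), which is fine for toy graphs but not for networks with millions of nodes; there one would sample source nodes instead.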
12
Small-world effect (2)

Distribution of shortest path lengths
Microsoft Messenger network
180 million people
1.3 billion edges
An edge connects two people if they exchanged at least one message in a one-month period
(Figure: distribution of shortest path lengths, with 7 marked)
13
Small-world effect (3)

Fact
If number of vertices within distance r grows
exponentially with r, then mean shortest path
length l increases as log n
Implications
Information (viruses) spread quickly
Erdős numbers are small
Peer-to-peer networks (for navigation purposes)
Short paths exist
Humans are able to find the paths
People only know their friends
People do not have global knowledge of the network
This suggests something special about the structure of the network
On a random graph short paths exist, but no one would be able to find them

14
Transitivity or Clustering

"A friend of a friend is a friend"
If a connects to b, and b to c, then with high probability a connects to c.
Clustering coefficient C:
C = 3 × (number of triangles) / (number of connected triples)
Alternative definition:
C_i = (number of triangles connected to vertex i) / (number of triples centered on vertex i)
The clustering coefficient is then the average of C_i over all vertices

Example: C_i = 1, 1, 1/6, 0, 0
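
Both definitions can be checked on a tiny graph. The sketch below assumes a 5-node graph (a triangle a-b-c with two extra nodes d and e attached to c) chosen because it reproduces the local values 1, 1, 1/6, 0, 0 listed above; the original figure is not available, so the graph and all function names are assumptions.

```python
from fractions import Fraction
from itertools import combinations


def local_clustering(adj):
    """C_i = triangles through i / triples centered on i (0 if degree < 2)."""
    result = {}
    for i, nbrs in adj.items():
        triples = len(nbrs) * (len(nbrs) - 1) // 2
        triangles = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        result[i] = Fraction(triangles, triples) if triples else Fraction(0)
    return result


def global_clustering(adj):
    """C = 3 * (number of triangles) / (number of connected triples)."""
    triples = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    # Each triangle is counted once from each of its three corners,
    # so this sum already equals 3 * (number of triangles).
    closed = sum(
        1 for n in adj.values() for u, v in combinations(n, 2) if v in adj[u]
    )
    return Fraction(closed, triples)


# Assumed example graph: triangle a-b-c, plus d and e hanging off c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c"},
    "e": {"c"},
}
local = local_clustering(graph)
overall = global_clustering(graph)
mean_local = sum(local.values()) / len(local)
```

On this graph the two definitions disagree (3/8 for the triangle-count ratio versus 13/30 for the mean of the local values), which is why it matters to state which one is being reported.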
15
Transitivity or Clustering (2)

Clustering coefficient scales as
It is considerably higher than in a random graph
It is speculated that in real networks C = O(1) as n → ∞
In an Erdős-Rényi random graph C = O(n^{-1})

Figures: synonyms network, World Wide Web
16
Degree distributions (1)

Let p_k denote the fraction of nodes with degree k
We can plot a histogram of p_k vs. k
In an Erdős-Rényi random graph the degree distribution follows a Poisson distribution
Degrees in real networks are heavily skewed to the right
The distribution has a long tail of values that are far above the mean
Heavy (long) tail examples:
Amazon sales
word length distribution, ...

17
Detour: how long is the long tail?
This is not directly related to graphs, but it nicely explains the long tail effect. It shows that there is a big market for niche products.
18
Degree distributions (2)

Many real-world networks contain hubs: highly connected nodes
We can easily distinguish between an exponential and a power-law tail by plotting on log-lin and log-log axes
In scale-free networks the maximum degree scales as $n^{1/(\alpha-1)}$, where α is the power-law exponent

Figure: degree distribution in a blog network, p_k vs. k on lin-lin, log-lin and log-log axes
19
Poisson vs. scale-free network
Poisson network (Erdős-Rényi random graph): degree distribution is Poisson
Scale-free (power-law) network: degree distribution is a power law
A function is scale-free if f(ax) = b f(x)
20
Degree distribution: number of people a person talks to on Microsoft Messenger
(Figure: count vs. node degree, with the highest degree marked)
21
Network resilience (1)

We observe how the connectivity (length of the
paths) of the network changes as the vertices get
removed
Vertices can be removed
Uniformly at random
In order of decreasing degree
It is important for epidemiology
Removal of vertices corresponds to vaccination
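
The two removal strategies can be compared by tracking the largest connected component as vertices are deleted. This is a minimal sketch under stated assumptions: the helper names and the star-shaped test graph are hypothetical, connectivity is measured by largest component size rather than path length, and the targeted order uses the initial degrees without recomputing them after each removal.

```python
import random


def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def removal_order(adj, targeted, rng):
    """Order in which to delete vertices: by degree, or uniformly at random."""
    nodes = list(adj)
    if targeted:
        return sorted(nodes, key=lambda u: len(adj[u]), reverse=True)
    rng.shuffle(nodes)
    return nodes


# Hypothetical example: a star with hub 0 and five leaves.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
rng = random.Random(0)
after_targeted = largest_component_size(star, removal_order(star, True, rng)[:1])
after_random = largest_component_size(star, removal_order(star, False, rng)[:1])
```

A star is the extreme case: removing the hub shatters it, while a random removal most likely hits a leaf and changes almost nothing.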

22
Network resilience (2)

Real-world networks are resilient to random
attacks
One has to remove all web pages of degree > 5 to disconnect the web
But this is a very small percentage of web pages
A random network has better resilience to targeted attacks

Figure: mean path length vs. fraction of removed nodes, under random and preferential removal, for a random network and the Internet (autonomous systems)
23
Community structure

Most social networks show community structure
Groups have a higher density of edges within than across groups
People naturally divide into groups based on interests, age, occupation, ...
How to find communities
Spectral clustering (embedding into a low-dim
space)
Hierarchical clustering based on connection
strength
Combinatorial algorithms
Block models
Diffusion methods

Friendship network of children in a school
24
Connected components in MSN Messenger

Graphs have a giant component
Distribution of connected components follows a power law

Figures: distribution of connected component sizes in the MSN Messenger network (count vs. size in number of nodes, largest component marked); growth of the largest component over time in a citation network
25
Network motifs (1)

What are the building blocks (motifs) of
networks?
Do motifs have specific roles in networks?
Network motif detection process
Count how many times each subgraph appears
Compute statistical significance for each subgraph: the probability of it appearing in a random network as often as in the real network

Figure: 3-node motifs
26
Network motifs (2)

Biological networks
Feed-forward loop
Bi-fan motif
Web graph
Feedback with two mutual dyads
Mutual dyad
Fully connected triad

27
Network motifs (3)
Transcription networks
Signal transduction networks
WWW and friendship networks
Word adjacency networks
28
Networks over time: Densification

A very basic question: what is the relation between the number of nodes and the number of edges in a network?
Networks are becoming denser over time
The number of edges grows faster than the number of nodes: average degree is increasing
$E(t) \propto N(t)^a$
a: densification exponent, 1 ≤ a ≤ 2
a = 1: linear growth, constant out-degree (assumed in the literature so far)
a = 2: quadratic growth, clique
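
A densification exponent is the slope of log E(t) against log N(t), so it can be estimated with an ordinary least-squares line. The sketch below uses made-up snapshot sizes that follow an exact power law with exponent 1.5; the function name and the data are assumptions for illustration only.

```python
import math


def fit_densification_exponent(nodes, edges):
    """Least-squares slope of log(edges) against log(nodes)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic snapshots: N(t) doubles each step and E(t) = N(t) ** 1.5.
node_counts = [100, 200, 400, 800, 1600]
edge_counts = [n ** 1.5 for n in node_counts]
exponent = fit_densification_exponent(node_counts, edge_counts)
```

On real data the points scatter around the line, so it is worth plotting them on log-log axes before trusting the fitted slope.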

Figures: E(t) vs. N(t); Internet, a = 1.2; citations, a = 1.7
29
Densification and degree distribution

How does densification affect the degree distribution?
Degree distribution: $p_k \propto k^{-\gamma}$
Given densification exponent a, the degree exponent γ behaves as follows:
(a) For γ constant over time, we obtain densification only for 1 < γ < 2, and then a = 2/γ
(b) For γ > 2, the degree distribution has to evolve over time
Power law $y = b x^{-\gamma}$: for γ < 2, E[y] = ∞

Figures: degree exponent γ(t) over time; (a) a = 1.1, (b) a = 1.6
30
Shrinking diameters

Intuition says that distances between the nodes slowly grow as the network grows (like log n)
But as the network grows the distances between nodes slowly decrease

Figures: Internet, citations
31
Models of network generation and evolution
32
Recap (1)

Last time we saw
Large networks (web, on-line social networks) are
here
Many traditional questions not useful anymore
We cannot plot the network, so we need statistical methods and tools to quantify large networks
3 parts/goals
Statistical properties of large networks
Models that help understand these properties
Predict behavior of networked systems based on
measured structural properties and local rules
governing individual nodes

33
Recap (2)

We also saw features that are common to networks of various types
Properties of static networks
Small-world effect
Transitivity or clustering
Degree distributions (scale free networks)
Network resilience
Community structure
Subgraphs or motifs
Temporal properties
Densification
Shrinking diameter

34
Outline for today

We will see generative models for modeling network features
Erdős-Rényi random graph
Exponential random graphs (p*) model
Small world model
Preferential attachment
Community guided attachment
Forest fire model
Fitting models to real data
How to generate a synthetic realistic looking
network?

35
(Erdős-Rényi) Random graphs

Also known as Poisson random graphs or Bernoulli graphs
Given n vertices, connect each pair i.i.d. with probability p
Two variants
G_{n,p}: a graph with m edges appears with probability $p^m (1-p)^{M-m}$, where $M = \frac{1}{2} n(n-1)$ is the maximum number of edges
G_{n,m}: graphs with n nodes and m edges
Very rich mathematical theory: many properties are exactly solvable
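
Sampling from G_{n,p} amounts to one independent coin flip per pair of vertices. The following is a minimal sketch; the function names are assumptions, and the quadratic loop over all pairs is only practical for small n.

```python
import random
from itertools import combinations


def sample_gnp(n, p, rng):
    """Sample G(n, p): each of the n(n-1)/2 pairs becomes an edge w.p. p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def graph_probability(n, m, p):
    """Probability of one specific graph with m edges: p^m (1-p)^(M-m)."""
    max_edges = n * (n - 1) // 2
    return p ** m * (1 - p) ** (max_edges - m)


rng = random.Random(42)
edges = sample_gnp(200, 0.05, rng)
mean_degree = 2 * len(edges) / 200  # should be close to p * (n - 1) = 9.95
```

Because edges are independent, the degree of each node is binomial with parameters n − 1 and p, which approaches a Poisson distribution for large n and fixed mean degree.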

36
Properties of random graphs

Degree distribution is Poisson, since the presence and absence of edges are independent
Giant component: average degree k = 2m/n
k = 1 − ε: all components are of order log n in size
k = 1 + ε: there is one component of order n in size
All others are of order log n in size
They are a tree plus an edge, i.e., cycles
Diameter: log n / log k

37
Evolution of a random graph
(Figure: statistics for non-GCC vertices as a function of average degree k)
38
Subgraphs in random graphs

Expected number of subgraphs H(v,e) in G_{n,p} is
$E(X) = \binom{n}{v} \frac{v!}{a} p^e \approx \frac{n^v p^e}{a}$
a: number of isomorphic graphs
39
Random graphs: conclusion

Pros
Simple and tractable model
Phase transitions
Giant component
Cons
Degree distribution
No community structure
No degree correlations
Extensions
Configuration model
Random graphs with arbitrary degree sequence
Excess degree: degree of a vertex at the end of a random edge, $q_k \propto k\, p_k$

40
Exponential random graphs (p* models)

Comes from social sciences
Let {e_i} be a set of measurable properties of a graph (number of edges, number of nodes of a given degree, number of triangles, ...)
Exponential random graph model defines a
probability distribution over graphs

Examples of ei
41
Exponential random graphs

Includes Erdős-Rényi as a special case
Assume parameters β_i are specified
No analytical solutions for the model
But can use simulation to sample the graphs
Define local moves on a graph
Addition/removal of edges
Movement of edges
Edge swaps
Parameter estimation
maximum likelihood
Problem
Can't solve for transitivity (produces cliques)
Used to analyze small networks

Example of parameter estimates
42
Small-world model

Used for modeling network transitivity
Many networks assume some kind of geographical
proximity
Small-world model
Start with a low-dimensional regular lattice
Rewire
Add/remove edges to create shortcuts that join remote parts of the lattice
For each edge, with probability p move the other end to a random vertex
Rewiring allows us to interpolate between a regular lattice and a random graph
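
The rewiring step can be sketched on a one-dimensional ring lattice. The code below is an illustration under assumptions: each node starts with k neighbours on each side, rewired edges avoid self-loops and duplicate edges, and the function names are hypothetical.

```python
import random


def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbours on each side."""
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for j in range(1, k + 1):
            v = (u + j) % n
            adj[u].add(v)
            adj[v].add(u)
    return adj


def rewire(adj, k, p, rng):
    """With probability p, move the far end of each lattice edge elsewhere."""
    n = len(adj)
    for u in range(n):
        for j in range(1, k + 1):
            v = (u + j) % n
            if v not in adj[u] or rng.random() >= p:
                continue
            # New endpoints must not create a self-loop or a duplicate edge.
            candidates = [w for w in range(n) if w != u and w not in adj[u]]
            if not candidates:
                continue
            w = rng.choice(candidates)
            adj[u].discard(v)
            adj[v].discard(u)
            adj[u].add(w)
            adj[w].add(u)
    return adj


rng = random.Random(7)
lattice = ring_lattice(20, 2)                  # p = 0: regular lattice
small_world = rewire(ring_lattice(20, 2), 2, 0.1, rng)
randomized = rewire(ring_lattice(20, 2), 2, 1.0, rng)
```

Every rewiring removes one edge and adds one, so the number of edges stays fixed while p moves the graph from the lattice towards a random graph.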

43
Small-world model

Regular lattice (p = 0)
Clustering coefficient C = (3k − 3)/(4k − 2) ≈ 3/4
Mean distance L/(4k)
Almost random graph (p = 1)
Clustering coefficient C = 2k/L
Mean distance log L / log k
No power-law degree distribution

Figures: plot against rewiring probability p; degree distribution
44
Models of evolution

Models of network evolution
Preferential attachment
Edge copying model
Community Guided Attachment
Forest Fire model
Models for realistic network generation
Kronecker graphs

45
Preferential attachment

Models the growth of the network
Preferential attachment (Price 1965, Albert and Barabási 1999)
Add a new node, create m out-links
Probability of linking to a node i is proportional to its degree k_i
Based on Herbert Simon's result
Power laws arise from "rich get richer" (cumulative advantage)
Examples (Price 1965, for modeling citations)
Citations: new citations of a paper are proportional to the number it already has
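
Degree-proportional selection has a simple implementation: keep a list in which every node appears once per incident edge and draw uniformly from it. The sketch below assumes a seed clique on m + 1 nodes and distinct targets for each new node; neither detail comes from the slides, and the function name is hypothetical.

```python
import random


def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes; each new node adds m links chosen by degree."""
    # Assumed starting point: a clique on m + 1 nodes, so all degrees are > 0.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # Each node appears in `ends` once per incident edge, so a uniform draw
    # from `ends` picks a node with probability proportional to its degree.
    ends = [x for edge in edges for x in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges


rng = random.Random(3)
edges = preferential_attachment(50, 2, rng)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
```

Early nodes tend to accumulate the largest degrees, which is the cumulative-advantage effect the slide describes.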

46
Preferential attachment

Leads to power-law degree distributions
But
all nodes have equal (constant) out-degree
one needs a complete knowledge of the network
There are many generalizations and variants, but
the preferential selection is the key ingredient
that leads to power-laws

47
Edge copying model

Copying model
Add a node and choose k, the number of edges to add
With probability β select k random vertices and link to them
With probability 1 − β edges are copied from a randomly chosen node
Generates power-law degree distributions with exponent 1/(1 − β)
Generates communities
Related: random-surfer model

48
Community guided attachment

Want to model/explain densification in networks
Assume community structure
One expects many within-group friendships and
fewer cross-group ones

Figure: self-similar university community structure (University; Arts, Science; CS, Drama, Music, Math)
49
Community guided attachment

Assuming cross-community linking probability $f(h) = c^{-h}$, where h is the tree distance between two nodes
Community Guided Attachment leads to a Densification Power Law with exponent $a = 2 - \log_b(c)$
a: densification exponent
b: community tree branching factor
c: difficulty constant, 1 ≤ c ≤ b
If c = 1: easy to cross communities
Then a = 2, quadratic growth of edges (near clique)
If c = b: hard to cross communities
Then a = 1, linear growth of edges (constant out-degree)

50
Forest Fire Model

Want to model graphs that densify and have shrinking diameters
Intuition
How do we meet friends at a party?
How do we identify references when writing papers?

51
Forest Fire Model for directed graphs

The model has 2 parameters
p: forward burning probability
r: backward burning probability
The model
Each turn a new node v arrives
It chooses an ambassador w uniformly at random
Flip two geometric coins to determine the number of in- and out-links of w to follow (burn)
Fire spreads recursively until it dies
Node v links to all burned nodes
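
The steps above can be sketched as a short simulation. This is one possible reading of the slide, not the authors' implementation: the geometric coins are assumed to count successes before the first failure (so p and r must be below 1), out-links are burned with p and in-links with r, and all names are hypothetical.

```python
import random


def geometric(prob, rng):
    """Count successes before the first failure; mean is prob / (1 - prob)."""
    count = 0
    while rng.random() < prob:
        count += 1
    return count


def forest_fire(n, p, r, rng):
    """Grow a directed graph; returns {node: set of nodes it links to}."""
    out_links = {0: set()}
    in_links = {0: set()}
    for v in range(1, n):
        ambassador = rng.randrange(v)  # uniform over existing nodes
        burned = {ambassador}
        frontier = [ambassador]
        while frontier:
            w = frontier.pop()
            # Unburned neighbours of w, split by link direction.
            forward = sorted(out_links[w] - burned)
            backward = sorted(in_links[w] - burned)
            rng.shuffle(forward)
            rng.shuffle(backward)
            # Two geometric coins decide how many links of each kind burn.
            picked = forward[:geometric(p, rng)] + backward[:geometric(r, rng)]
            for x in picked:
                if x not in burned:
                    burned.add(x)
                    frontier.append(x)
        # The new node links to everything the fire reached.
        out_links[v] = burned
        in_links[v] = set()
        for x in burned:
            in_links[x].add(v)
    return out_links


graph = forest_fire(200, 0.35, 0.2, random.Random(5))
edge_count = sum(len(targets) for targets in graph.values())
```

With p = r = 0 the fire never spreads and every new node links only to its ambassador, giving a tree; larger burning probabilities produce denser graphs.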

52
Forest Fire Model

Simulation experiments
Forest Fire generates graphs that densify and
have shrinking diameter

Figures: densification, E(t) vs. N(t), with exponent 1.32; diameter vs. N(t)
53
Forest Fire Model

Forest Fire also generates graphs with
heavy-tailed degree distribution

Figures: count vs. in-degree; count vs. out-degree
54
Forest Fire Parameter Space

Fix backward probability r and vary forward
burning probability p
We observe a sharp transition between sparse and
clique-like graphs
Sweet spot is very narrow

Figure labels: clique-like graph, increasing diameter, constant diameter, sparse graph, decreasing diameter
55
Kronecker graphs

Want to have a model that can generate a
realistic graph
Static Patterns
Power Law Degree Distribution
Small Diameter
Power Law Eigenvalue and Eigenvector Distribution
Temporal Patterns
Densification Power Law
Shrinking/Constant Diameter
For Kronecker graphs all these properties can
actually be proven

56
Kronecker product: a graph
Figures: intermediate stage; adjacency matrices
57
Kronecker Product Definition

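The Kronecker product of an N × M matrix A and a K × L matrix B is the NK × ML matrix
$C = A \otimes B = \begin{pmatrix} a_{1,1} B & \cdots & a_{1,M} B \\ \vdots & \ddots & \vdots \\ a_{N,1} B & \cdots & a_{N,M} B \end{pmatrix}$
We define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices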
58
Stochastic Kronecker Graphs

Create an N_1 × N_1 probability matrix P_1
Compute the kth Kronecker power P_k
For each entry p_uv of P_k, include an edge (u,v) with probability p_uv

Figure: P_1, Kronecker multiplication, P_2, flip biased coins, instance matrix G_2
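
The three steps of the stochastic Kronecker construction fit in a few lines. The sketch below works on plain lists of rows; the function names and the 2 × 2 initiator values are assumptions chosen for illustration, not fitted parameters.

```python
import random


def kronecker(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def kronecker_power(p1, k):
    """k-th Kronecker power of p1 (k >= 1)."""
    result = p1
    for _ in range(k - 1):
        result = kronecker(result, p1)
    return result


def sample_graph(pk, rng):
    """Include edge (u, v) independently with probability pk[u][v]."""
    return [
        (u, v)
        for u, row in enumerate(pk)
        for v, p_uv in enumerate(row)
        if rng.random() < p_uv
    ]


# Hypothetical 2 x 2 initiator matrix.
p1 = [[0.9, 0.5], [0.5, 0.1]]
p3 = kronecker_power(p1, 3)            # 8 x 8 matrix of edge probabilities
edges = sample_graph(p3, random.Random(11))
```

Materializing P_k takes N^2 memory, which is exactly the cost the fitting slides try to avoid for large sparse graphs.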
59
Fitting Kronecker to Real Data

Given a graph G and Kronecker matrix P_1 we can calculate the probability that P_1 generated G: P(G | P_1)

Figure: P_1, P_k, G, P(G | P_1); σ: node labeling
60
Fitting Kronecker: 2 challenges

Invariance to node labeling σ (there are N! labelings)
Calculating P(G | P_1) takes O(N^2) (since one needs to consider every cell of the adjacency matrix)

Figure: P_1, P_k, G, P(G | P_1); the same graph under two node labelings (1, 2, 3, 4 and 2, 1, 4, 3)
61
Fitting Kronecker: Solutions

Node labeling: can use MCMC sampling to average over (all) node labelings σ
P(G | P_1) takes O(N^2): real graphs are sparse, so calculate P(G_empty) for the empty graph and then add the edges. This takes O(E).

62
Experiments on real AS graph
Degree distribution
Hop plot
Network value
Adjacency matrix eigenvalues
63
Why fit generative models?

Parameters tell us about the structure of a graph
Extrapolation: given a graph today, how will it look in a year?
Sampling: can I get a smaller graph with similar properties?
Anonymization: instead of releasing a real graph (e.g., email network), we can release a synthetic version of it

64
Processes taking place on networks
65
Epidemiological processes

The simplest way to spread a virus over the
network
S: Susceptible
I: Infected
R: Recovered (removed)
SIS model: 2 parameters
β: virus birth rate
δ: virus death (recovery) rate
SIR model: once one gets cured, he or she cannot get infected again

SIS model
?it depends on ß and topology
66
Epidemic threshold for SIS model

How infectious does the virus need to be to survive in the network?
First results on power-law networks suggested that any virus will prevail
New result that works for any topology
For s > 1 the virus prevails
For s < 1 the virus dies

λ_1: largest eigenvalue of the graph adjacency matrix
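
The defining formula for s did not survive in this transcript. The sketch below assumes s = λ_1 · β / δ, the form used in the known eigenvalue threshold result for SIS epidemics, and estimates λ_1 by power iteration on a connected undirected graph; the function names and the star graph are hypothetical.

```python
import math


def largest_eigenvalue(adj, iterations=200):
    """Estimate lambda_1 of a connected undirected graph by power iteration."""
    x = {u: 1.0 for u in adj}
    for _ in range(iterations):
        # Multiply by A + I; the shift keeps bipartite graphs from oscillating.
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {u: val / norm for u, val in y.items()}
    # Rayleigh quotient x^T A x for the unit vector x.
    return sum(x[u] * sum(x[v] for v in adj[u]) for u in adj)


def virus_survives(adj, beta, delta):
    """Assumed threshold test: s = lambda_1 * beta / delta > 1."""
    s = largest_eigenvalue(adj) * beta / delta
    return s > 1


# Hypothetical example: a star with 4 leaves has lambda_1 = sqrt(4) = 2.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
lambda_1 = largest_eigenvalue(star)
```

Under this assumption the threshold depends on the topology only through λ_1, so graphs with large hubs (and hence large λ_1) let much weaker viruses survive.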
67
Navigation in small-world networks

Milgram's experiment showed
(a) short paths exist in networks
(b) humans are able to find them
Assume the following setting
Nodes of a graph are scattered on a plane
Given a starting node u, we want to reach a target node v
A small-world navigation algorithm navigates the network by always moving to the neighbor that is closest (in Manhattan distance) to the target node v
68
Navigation in small-world networks
Network creation

Start with a regular lattice
Each node connects to its 4 immediate neighbors
Long-range links are added with probability that decays as a power of the distance between the points, $p(u,v) \propto d(u,v)^{-\alpha}$
It can be shown that only for α = 2 is the delivery time poly-logarithmic in the number of nodes n

Figure: delivery time T < n^β
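
Network creation and greedy routing can be sketched together on a small grid. The code assumes one directed long-range contact per node and Manhattan distance on a square grid; these choices, the grid size and the function names are illustrative assumptions.

```python
import random


def manhattan(a, b):
    """Manhattan (lattice) distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_grid(side, alpha, rng):
    """side x side grid; each node gets one long-range contact, P ~ d**-alpha."""
    nodes = [(x, y) for x in range(side) for y in range(side)]
    contacts = {}
    for u in nodes:
        x, y = u
        local = [
            (x + dx, y + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < side and 0 <= y + dy < side
        ]
        others = [v for v in nodes if v != u]
        weights = [manhattan(u, v) ** -alpha for v in others]
        far = rng.choices(others, weights=weights, k=1)[0]
        contacts[u] = local + [far]
    return contacts


def greedy_route(contacts, source, target):
    """Hops taken when always stepping to the contact nearest the target."""
    hops, current = 0, source
    while current != target:
        current = min(contacts[current], key=lambda v: manhattan(v, target))
        hops += 1
    return hops


rng = random.Random(2)
grid = build_grid(10, 2.0, rng)
hops = greedy_route(grid, (0, 0), (9, 9))
```

Greedy routing always terminates here because some lattice neighbour is one step closer to the target, so the hop count never exceeds the Manhattan distance; long-range contacts can only shorten the route.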
69
Navigation in a real-world network

Take a social network of 500k bloggers where for each blogger we know their geographical location
Pick two nodes at random and navigate the network greedily by geography
Results
13% success rate (vs. 18% for Milgram)

Figures: distribution of path lengths; friendships vs. distance
70
Navigation in real-world network

Geographical distance may not be the right kind
of distance
Since population is non-uniform, let's use rank-based friendship distance
i.e., we measure the distance d(u,v) by the number of people living closer to v than u does
Then $P(u \to v) \propto 1 / d(u,v)$

And the proof still works
71

Some references used to prepare this talk
The Structure and Function of Complex Networks, by Mark Newman
Statistical Mechanics of Complex Networks, by Réka Albert and Albert-László Barabási
Graph Mining: Laws, Generators and Algorithms, by Deepayan Chakrabarti and Christos Faloutsos
An Introduction to Exponential Random Graph (p*) Models for Social Networks, by Garry Robins, Pip Pattison, Yuval Kalish and Dean Lusher
Graph Evolution: Densification and Shrinking Diameters, by Jure Leskovec, Jon Kleinberg and Christos Faloutsos
Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, by Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg and Christos Faloutsos
Navigation in a Small World, by Jon Kleinberg
Geographic Routing in Social Networks, by David Liben-Nowell, Jasmine Novak, Ravi Kumar, Prabhakar Raghavan and Andrew Tomkins
Some plots and slides borrowed from Lada Adamic, Mark Newman, Mark Joseph, Albert Barabási, Jon Kleinberg, David Liben-Nowell, Sergi Valverde and Ricard Solé

72
Rough random material that did not make it into the presentation
73
Bow-tie structure of the web
Tendrils: 44M
SCC: 56M
OUT: 44M
IN: 44M
DISC: 17M
Broder et al., WWW 2000; Dill et al., VLDB 2001
74
Study of 3 websites

Study over three universities' publicly indexable Web sites

75
Australia
In- and out-degree distributions
76
New Zealand
In- and out-degree distributions
77
United Kingdom
In- and out-degree distributions
78
We assume this node would like to connect to a centrally located node: a node whose distance to other nodes is minimized.
d_ij is the Euclidean distance; h_j is some measure of the centrality of node j; α is a parameter, a function of the final number n of points, gauging the relative importance of the two objectives.
79
Fabrikant et al. define 3 possible measures of centrality:
1. The average number of hops from other nodes
2. The maximum number of hops from another node
3. The number of hops from a fixed center of the tree
80
α is the crux of the theorem! Why? Here are some examples
Fabrikant et al.
81
If α is too low, then the Euclidean distances become unimportant, and the network resembles a star
Fabrikant et al.
82
But if α grows at least as fast as √n, where n is the final number of points, then distance becomes too important, and minimum spanning trees with high degree occur, but with exponentially vanishing probability, thus not a power law. If α is anywhere in between, we have a power law.
Through a rather complex and elaborate proof, Fabrikant et al. prove this initial assumption will produce a power-law distribution. I'll save you the math!
