Title: Respondent-driven Sampling for Characterizing Unstructured Overlays
1Respondent-driven Sampling for Characterizing
Unstructured Overlays
A. H. Rasti University of Oregon
M. Torkjazi University of Oregon
R. Rejaie University of Oregon
N. Duffield ATT Labs - Research
W. Willinger ATT Labs - Research
D. Stutzbach Stutzbach Enterprises
Graciously Presented By Shubho Sen ATT Labs
- Research
2Motivation
- P2P systems are very popular in practice.
- Millions of simultaneous users.
- A significant fraction of Internet traffic
- Measurement studies aid understanding existing
systems and user behavior. - Capturing an accurate global snapshot is often
infeasible. - P2P systems are distributed, large, and rapidly
changing. - P2P crawlers are likely to capture incomplete or
distorted snapshots - Sampling is a natural approach, and has been used
implicitly in most earlier P2P measurement
studies. - How can we collect representative samples?
3The Graph Sampling Problem
- We focus on sampling peer properties, such as
number of neighbors (degree), access link
bandwidth, session time, files - Sampling peer properties has two steps
- Discovering and selecting peers (or samples)
- Measuring the desired properties of selected
peers - Selecting peers uniformly at random is hard
there are two sources of bias StutzbachIMC06 - Topological high-degree peers are more likely to
be selected - Temporal short-lived peers are more likely to
be selected - Random walks are a promising approach to sampling
- The resulting bias is precisely known
- Samples can be collected in parallel by multiple
walkers
4Sampling Using Random Walk
- Random walks can be described with a transition
matrix P(x,y) - P(x,y) probability of moving from x to y
- Pr(x,y) probability of moving from x to y after
r moves - Random walks converge to a stationary
distribution - Problem we need a uniform distribution
5Metropolized Random Walk (MRW)
- The Metropolis-Hastings method modifies the
transition matrix to yield the desired uniform
distribution StutzbachIMC06 - MRW method
- Select a neighbor y of x uniformly at random
- Transition to y with probability min(
deg(x)/deg(y) , 1) - Otherwise, self-loop to x.
- Results in uniform stationary dist. ?(x) 1/V
- MRW compensates for bias as samples are collected
y
x
6This paper
- Presents a new graph sampling technique,
Respondent-Driven Sampling (RDS) - Compares the performance of RDS and MRW sampling
techniques using simulations experiments
7Respondent-driven Sampling
- A development of Snowball Sampling Salganik04
- Commonly used in social sciences to sample
hidden populations, e.g. HIV individuals - Social relationships (references) are used by
sampler to diffuse into hidden populations - Each person introduces n other persons
- Similar to random walk (n 1)
- We adopt the RDS technique from social sciences
for sampling P2P networks
8RDS Formulation
- Goal Estimate the distribution of node property
X - Perform regular random walk, collect values of
property X and node degree (deg(v)) at each
visited node - Deal with the bias during the post-processing as
follows - Divide possible values for X into several ranges
R1, . . . ,Rm - Partition nodes with the X value within the same
range V1, . . . ,Vm - Using Hansen-Hurwitz estimator to compensate for
the bias, the proportion of all nodes in group i
is estimated as follows
- Ti visited samples in group i
- T all visited samples
9Evaluation Overview
- Performance metric
- Consider only peer properties that may interact
with the walk - 1) Peer Degree, 2) Peer Uptime, 3) Peer RTT
- Compare the dist. of the these peer properties
from samples and ground truth using
Kolmogorov-Smirnov (KS) statistics - Evaluation Methodology
- Evaluation over static graphs
- Effect of graph structure
- Evaluation over dynamic graphs (session level
simulation) - Benefits of parallel Sampling (see the paper)
- Effect of 1) churn, 2) peer discovery, 3)
target peer degree - Experiments over Gnutella network
10Evaluation Static Graphs
- Using graphs with different degree distribution
clustering characteristics - Random graphs (ER) Erdos-Renyi
- Small-world graphs (SW) Watts and Strogatz
- Scale-free graphs (BA) Barabasi and Albert
- Hierarchical Scale-Free graphs (HSF) Barabasi
02 - Power-law degree distribution
- Node clustering is inversely proportional to node
degree - Gnutella graphs (GA) Snapshots of Gnutella
Ultrapeer topology
11Hierarchical Scale-Free (HSF)
12Static Graphs
- Accuracy of both techniques is improved with the
number of samples in most cases - The rate of improvement in accuracy is much lower
over HSF especially for MRW - Walkers are likely to get trapped within clusters
in HSF graphs - Leaving a cluster requires visiting high degree
nodes but MRW is less likely to visit these nodes - Rewiring a small fraction of randomly selected
edges in HSF significantly improves accuracy for
both techniques - RDS is less sensitive to graph clustering than MRW
13Dynamic Graphs
- Churn is a primary limiting factor for accuracy
- Session len.gt 5m ? Very good sampling accuracy
- Churn model has little effect
- Similar impact on other peer properties (see the
paper) - Sampling error is small once nodes have
sufficient connectivity (gt 5) - Lower accuracy for smaller degree is due to graph
partitioning - Partitioned nodes in History mech. reduce the
accuracy of sampling
14Experiment Gnutella
- Run crawler, 1000 RDS 1000 MRW walkers in
parallel - 500 steps per walker
- Use captured snapshots by crawler as a rough
reference - Show min, max, avg KS over 6 experiments
- Focus only on degree dist
- The degree dist from samples crawls are very
similar (KS0.03) - The accuracy is an order of magnitude lower than
dynamic sim due to inaccurate reference. - Both sampling technique achieve similar accuracy
15Conclusions Future Work
- RDS always performs as good or better than MRW
- High level of graph clustering can significantly
degrade the accuracy of both RDS and MRW - RDS is less sensitive than MRW to graph
clustering - There is sweet spot for the number of parallel
samplers. - Poor connectivity high dynamics adversely
affect the accuracy of both techniques. - Future Work
- RDS is a promising approach for sampling user
properties in Online Social Networks - Sampling over directed graphs raises new
challenges.
16 17Different Grpah Structures
18Dynamic Simulation Setting
- Simulation environment
- Session-time distributions Weibull,
Exponential, Pareto - Poisson arrival process
- Peer discovery Oracle, FIFO, HeartBeat, History
- Target population 100000
- Min. Degree 3-30
- Sampling Parameters
- Node degree (DEG)
- Node query latency (RTT)
- Session length/uptime (UT)
19Evaluation Static Graphs Conclusion
- Combination of highly skewed degree distribution
and highly skewed clustering traps samplers - RDS samplers get out of the clusters quickly
- MRW samplers get stuck in low-degree clusters
- Shuffling provides short-cuts out of clusters
- HSF can be a model for some natural and social
networks
20Dynamic Graphs Effect of Parallelism
- Too much parallelism does not improve performance
- Too long random walks have negative effect
- Sweet spot exists