Respondent-driven Sampling for Characterizing Unstructured Overlays presentation

About This Presentation

Transcript and Presenter's Notes

Title: Respondent-driven Sampling for Characterizing Unstructured Overlays

1
Respondent-driven Sampling for Characterizing
Unstructured Overlays
A. H. Rasti University of Oregon
M. Torkjazi University of Oregon
R. Rejaie University of Oregon

N. Duffield ATT Labs - Research
W. Willinger ATT Labs - Research

D. Stutzbach Stutzbach Enterprises
Graciously Presented By Shubho Sen ATT Labs
- Research
2
Motivation

P2P systems are very popular in practice.
Millions of simultaneous users.
A significant fraction of Internet traffic
Measurement studies aid understanding existing
systems and user behavior.
Capturing an accurate global snapshot is often
infeasible.
P2P systems are distributed, large, and rapidly
changing.
P2P crawlers are likely to capture incomplete or
distorted snapshots
Sampling is a natural approach, and has been used
implicitly in most earlier P2P measurement
studies.
How can we collect representative samples?

3
The Graph Sampling Problem

We focus on sampling peer properties, such as
number of neighbors (degree), access link
bandwidth, session time, files
Sampling peer properties has two steps
Discovering and selecting peers (or samples)
Measuring the desired properties of selected
peers
Selecting peers uniformly at random is hard
there are two sources of bias StutzbachIMC06
Topological high-degree peers are more likely to
be selected
Temporal short-lived peers are more likely to
be selected
Random walks are a promising approach to sampling
The resulting bias is precisely known
Samples can be collected in parallel by multiple
walkers

4
Sampling Using Random Walk

Random walks can be described with a transition
matrix P(x,y)
P(x,y) probability of moving from x to y
Pr(x,y) probability of moving from x to y after
r moves
Random walks converge to a stationary
distribution
Problem we need a uniform distribution

5
Metropolized Random Walk (MRW)

The Metropolis-Hastings method modifies the
transition matrix to yield the desired uniform
distribution StutzbachIMC06
MRW method
Select a neighbor y of x uniformly at random
Transition to y with probability min(
deg(x)/deg(y) , 1)
Otherwise, self-loop to x.
Results in uniform stationary dist. ?(x) 1/V
MRW compensates for bias as samples are collected

y
x
6
This paper

Presents a new graph sampling technique,
Respondent-Driven Sampling (RDS)
Compares the performance of RDS and MRW sampling
techniques using simulations experiments

7
Respondent-driven Sampling

A development of Snowball Sampling Salganik04
Commonly used in social sciences to sample
hidden populations, e.g. HIV individuals
Social relationships (references) are used by
sampler to diffuse into hidden populations
Each person introduces n other persons
Similar to random walk (n 1)
We adopt the RDS technique from social sciences
for sampling P2P networks

8
RDS Formulation

Goal Estimate the distribution of node property
X
Perform regular random walk, collect values of
property X and node degree (deg(v)) at each
visited node
Deal with the bias during the post-processing as
follows
Divide possible values for X into several ranges
R1, . . . ,Rm
Partition nodes with the X value within the same
range V1, . . . ,Vm
Using Hansen-Hurwitz estimator to compensate for
the bias, the proportion of all nodes in group i
is estimated as follows

Ti visited samples in group i
T all visited samples

9
Evaluation Overview

Performance metric
Consider only peer properties that may interact
with the walk
1) Peer Degree, 2) Peer Uptime, 3) Peer RTT
Compare the dist. of the these peer properties
from samples and ground truth using
Kolmogorov-Smirnov (KS) statistics
Evaluation Methodology
Evaluation over static graphs
Effect of graph structure
Evaluation over dynamic graphs (session level
simulation)
Benefits of parallel Sampling (see the paper)
Effect of 1) churn, 2) peer discovery, 3)
target peer degree
Experiments over Gnutella network

10
Evaluation Static Graphs

Using graphs with different degree distribution
clustering characteristics
Random graphs (ER) Erdos-Renyi
Small-world graphs (SW) Watts and Strogatz
Scale-free graphs (BA) Barabasi and Albert
Hierarchical Scale-Free graphs (HSF) Barabasi
02
Power-law degree distribution
Node clustering is inversely proportional to node
degree
Gnutella graphs (GA) Snapshots of Gnutella
Ultrapeer topology

11
Hierarchical Scale-Free (HSF)
12
Static Graphs

Accuracy of both techniques is improved with the
number of samples in most cases
The rate of improvement in accuracy is much lower
over HSF especially for MRW
Walkers are likely to get trapped within clusters
in HSF graphs
Leaving a cluster requires visiting high degree
nodes but MRW is less likely to visit these nodes
Rewiring a small fraction of randomly selected
edges in HSF significantly improves accuracy for
both techniques
RDS is less sensitive to graph clustering than MRW

13
Dynamic Graphs

Churn is a primary limiting factor for accuracy
Session len.gt 5m ? Very good sampling accuracy
Churn model has little effect
Similar impact on other peer properties (see the
paper)
Sampling error is small once nodes have
sufficient connectivity (gt 5)
Lower accuracy for smaller degree is due to graph
partitioning
Partitioned nodes in History mech. reduce the
accuracy of sampling

14
Experiment Gnutella

Run crawler, 1000 RDS 1000 MRW walkers in
parallel
500 steps per walker
Use captured snapshots by crawler as a rough
reference
Show min, max, avg KS over 6 experiments
Focus only on degree dist
The degree dist from samples crawls are very
similar (KS0.03)
The accuracy is an order of magnitude lower than
dynamic sim due to inaccurate reference.
Both sampling technique achieve similar accuracy

15
Conclusions Future Work

RDS always performs as good or better than MRW
High level of graph clustering can significantly
degrade the accuracy of both RDS and MRW
RDS is less sensitive than MRW to graph
clustering
There is sweet spot for the number of parallel
samplers.
Poor connectivity high dynamics adversely
affect the accuracy of both techniques.
Future Work
RDS is a promising approach for sampling user
properties in Online Social Networks
Sampling over directed graphs raises new
challenges.

Thank You !

17
Different Grpah Structures
18
Dynamic Simulation Setting

Simulation environment
Session-time distributions Weibull,
Exponential, Pareto
Poisson arrival process
Peer discovery Oracle, FIFO, HeartBeat, History
Target population 100000
Min. Degree 3-30
Sampling Parameters
Node degree (DEG)
Node query latency (RTT)
Session length/uptime (UT)

19
Evaluation Static Graphs Conclusion

Combination of highly skewed degree distribution
and highly skewed clustering traps samplers
RDS samplers get out of the clusters quickly
MRW samplers get stuck in low-degree clusters
Shuffling provides short-cuts out of clusters
HSF can be a model for some natural and social
networks

Respondent-driven Sampling for Characterizing Unstructured Overlays PowerPoint PPT Presentation