Title: On Unbiased Sampling for Unstructured Peer-to-Peer Networks
1 On Unbiased Sampling for Unstructured Peer-to-Peer Networks
- Daniel Stutzbach University of Oregon
- Reza Rejaie University of Oregon
- Nick Duffield AT&T Labs-Research
- Subhabrata Sen AT&T Labs-Research
- Walter Willinger AT&T Labs-Research
Internet Measurement Conference, Rio de Janeiro, Brazil, October 25th, 2006
2 Motivation
- P2P systems are very popular in practice.
- Several million simultaneous users collectively.
- 60% of all Internet traffic (CacheLogic Research, 2005).
- Measurement studies aid understanding of existing systems and user behavior.
- Capturing an accurate global picture is often infeasible.
- P2P systems are distributed, large, and rapidly changing.
- Capturing a global picture is time-consuming, resulting in a blurry picture.
- Sampling is a natural approach, and has been used implicitly in most earlier P2P measurement studies.
- But how do we know the samples are representative?
3 The Problem
- We focus on sampling peer properties.
- Number of neighbors (degree)
- Link bandwidth
- Number of shared files
- Remaining uptime
- Sampling peer properties occurs in two steps
- Discover and select peers
- Collect measurements from the selected peers
- Selecting peers uniformly at random is hard.
- Temporal: Peer dynamics can introduce bias.
- Topological: The graph topology can introduce bias.
- We first examine these two problems in isolation.
- We then examine them together.
4 Sampling with Dynamics
- Define V_t as the set of peers present at time t.
- We gather samples over a measurement window of length Δ.
- The most common approach is to gather peers from the set of all peers present at any time during the window.
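Written out, the set that the common approach samples from is presumably the union of the instantaneous peer sets over the window. The start time t_0 is notation assumed here for illustration, with Δ standing for the window length:

```latex
V_{t_0,\,t_0+\Delta} \;=\; \bigcup_{t = t_0}^{t_0+\Delta} V_t
```

Every peer that appears at any instant of the window counts exactly once in this set, no matter how briefly it was present.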
5 Bias towards Short-Lived Peers
[Figure: timeline (axis: Time) showing one long-lived peer alongside a succession of short-lived peers]
- Consider a simple two-peer system, containing:
- One long-lived peer
- One rapidly-changing short-lived peer
- The common approach over-selects short-lived peers.
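The over-selection can be illustrated with a small simulation. This is only a sketch under assumed conditions: the function name, the peer labels, and the idea that the second slot is filled by a fixed number of distinct short-lived peers during the window are all hypothetical choices made for the example.

```python
import random

def common_approach_short_fraction(num_short_sessions, num_samples=100_000, seed=0):
    """Sample uniformly from every peer seen at any time during the window.

    The window contains one long-lived peer plus `num_short_sessions`
    distinct short-lived peers that occupied the second slot one after
    another. Returns the fraction of samples that hit a short-lived peer.
    """
    rng = random.Random(seed)
    peers_seen = ["long"] + [f"short-{i}" for i in range(num_short_sessions)]
    picks = [rng.choice(peers_seen) for _ in range(num_samples)]
    return sum(p != "long" for p in picks) / num_samples

if __name__ == "__main__":
    # At any instant half of the present peers are short-lived, yet the
    # sampled fraction approaches n / (n + 1) as n short sessions pile up.
    for n in (1, 4, 9):
        print(n, round(common_approach_short_fraction(n), 3))
```

With n short sessions in the window, the sampled fraction of short-lived peers tends to n / (n + 1), even though only half of the peers present at any single moment are short-lived.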
6 Handling Temporal Causes of Bias
- The common approach is intuitive but incorrect.
- Sampling peers is the wrong goal.
- We want to sample peer properties.
- Two samples from the same peer, but at different times, are distinct.
- Allow sampling the same peer more than once, at different points in time.
7 Example of avoiding bias towards Short-Lived Peers
[Figure: timeline (axis: Time) showing one long-lived peer alongside a succession of short-lived peers]
- Allowing re-selecting a peer solves the problem.
- The long-lived peer will be selected half the time, reflecting the actual state of the system.
- How do we select a peer uniformly at random at a particular moment?
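The effect of allowing re-selection can be sketched by sampling a moment first and then a peer present at that moment. The window length, the session length, and the labels below are assumptions chosen for illustration; an oracle that knows who is present at each instant stands in for the still-open question of how to select uniformly at a particular moment.

```python
import random

def long_lived_fraction(window=100.0, short_session=1.0, num_samples=100_000, seed=0):
    """Pick a random instant, then pick uniformly among peers present then.

    Two peers are present at every instant: the long-lived peer and
    whichever short-lived peer currently occupies the second slot.
    The same peer may be picked many times, at different instants.
    """
    rng = random.Random(seed)
    long_hits = 0
    for _ in range(num_samples):
        t = rng.uniform(0.0, window)
        short_peer = f"short-{int(t // short_session)}"
        present = ["long", short_peer]
        if rng.choice(present) == "long":
            long_hits += 1
    return long_hits / num_samples

if __name__ == "__main__":
    print(round(long_lived_fraction(), 3))
```

The returned fraction stays near one half regardless of how short the short-lived sessions are, which matches the state of the system at any instant.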
8 Sampling from Static Graphs
- Assume for the moment a static graph.
- Goal: Select a peer uniformly from the graph.
- Discover:
- Begin with one peer.
- Query peers to discover neighbors.
- Classic algorithms: Breadth-First Search, Depth-First Search
- Select:
- Choose a subset of discovered peers.
- Gather samples from the selected peers.
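The discover-then-select procedure can be sketched as follows. The dictionary `TOY_GRAPH`, the `limit` parameter, and both function names are hypothetical; looking up a peer in the dictionary stands in for querying it over the network.

```python
import random
from collections import deque

# Hypothetical static overlay: peer -> list of neighbors.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

def bfs_discover(neighbors, start, limit):
    """Discover up to `limit` peers by breadth-first search from `start`."""
    discovered = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(discovered) < limit:
        peer = queue.popleft()
        for nb in neighbors[peer]:  # "query" the peer for its neighbors
            if nb not in seen:
                seen.add(nb)
                discovered.append(nb)
                queue.append(nb)
                if len(discovered) >= limit:
                    break
    return discovered

def select_peers(discovered, k, rng):
    """Choose k distinct peers from the discovered set."""
    return rng.sample(discovered, k)

if __name__ == "__main__":
    found = bfs_discover(TOY_GRAPH, "a", 3)
    print(found, select_peers(found, 2, random.Random(0)))
```

A partial crawl such as `bfs_discover(TOY_GRAPH, "a", 3)` only ever reaches peers close to the starting point, and each peer can be selected at most once.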
9 Advantages of Random Walks
- Problems with classic approaches:
- Peers are correlated by their neighbor relationship.
- Peers with higher degree are discovered more often.
- A peer can only be selected once.
- Random walks are a promising alternative:
- The information in the starting location is lost by repeatedly injecting randomness at each step.
- The results are biased, but the bias is precisely known.
- Random walks can implicitly visit the same peer twice.
10 Random walks, formally
- Random walks can be described with a transition matrix, P(x,y).
- P(x,y) is the probability of moving from x to y.
- P^r(x,y) is the probability of moving from x to y after r moves.
- Random walks converge to a stationary distribution, in which a peer is visited in proportion to its degree.
- Problem: we want a uniform distribution.
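A small numerical check makes the convergence concrete. The sketch below assumes a simple random walk (each neighbor equally likely) on a hypothetical four-peer graph that is connected and contains a triangle, so the walk converges. The helper names are invented for the example.

```python
# Hypothetical undirected graph: a triangle a-b-c with a tail c-d.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def transition_matrix(neighbors):
    """Build P(x, y) = 1 / deg(x) for each neighbor y of x, else 0."""
    nodes = sorted(neighbors)
    index = {v: i for i, v in enumerate(nodes)}
    P = [[0.0] * len(nodes) for _ in nodes]
    for x in nodes:
        for y in neighbors[x]:
            P[index[x]][index[y]] = 1.0 / len(neighbors[x])
    return nodes, P

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(P, r):
    """Return P^r, whose (x, y) entry is the r-step transition probability."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(r):
        result = mat_mul(result, P)
    return result

if __name__ == "__main__":
    nodes, P = transition_matrix(GRAPH)
    for name, row in zip(nodes, mat_power(P, 100)):
        print(name, [round(p, 4) for p in row])
```

After enough moves every row of P^r is the same: the starting peer no longer matters, and each peer's probability is its degree divided by the sum of all degrees, which is not uniform.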
11 The Metropolis-Hastings Method
- The Metropolis-Hastings method modifies the transition matrix to yield the desired distribution.
- Proven for static graphs.
- Plugging in our P(x,y) and µ(x):
- Select a neighbor y of x uniformly at random.
- Transition to y with probability min(deg(x) / deg(y), 1).
- Otherwise, self-transition to x.
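The three steps above can be sketched as a Metropolized random walk on a static graph. The graph, the function names, and the number of walks are assumptions made for the demonstration. The acceptance probability is capped at 1, since deg(x) / deg(y) exceeds 1 whenever y has fewer neighbors than x.

```python
import random
from collections import Counter

# Hypothetical undirected graph: a triangle a-b-c with a tail c-d.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def metropolized_step(neighbors, x, rng):
    """One step: propose a uniform neighbor, accept with min(1, deg(x)/deg(y))."""
    y = rng.choice(neighbors[x])
    accept = min(1.0, len(neighbors[x]) / len(neighbors[y]))
    return y if rng.random() < accept else x  # otherwise self-transition

def metropolized_walk(neighbors, start, hops, rng):
    x = start
    for _ in range(hops):
        x = metropolized_step(neighbors, x, rng)
    return x

def sample_frequencies(neighbors, start, hops, num_walks, seed=0):
    """Run independent walks and report how often each peer is the endpoint."""
    rng = random.Random(seed)
    counts = Counter(metropolized_walk(neighbors, start, hops, rng)
                     for _ in range(num_walks))
    return {peer: counts[peer] / num_walks for peer in neighbors}

if __name__ == "__main__":
    print(sample_frequencies(GRAPH, "d", hops=25, num_walks=20_000))
```

On this toy graph each of the four peers ends up selected about a quarter of the time, even though their degrees range from 1 to 3 and every walk starts at the lowest-degree peer.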
12 Sampling from Dynamic Graphs
- Adapting to vanishing peers:
- We maintain a stack of visited peers.
- If a query times out, go back in the stack.
- Hypothesis: A Metropolized random walk will yield approximately unbiased samples in practice.
- Trivially valid for extremely slowly changing graphs.
- Trivially false for extremely rapidly changing graphs.
- Where is the transition?
- Methodology:
- Session-level simulations of a wide variety of situations.
- Determine what conditions lead to biased samples.
- Do those conditions arise in practice?
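One way the stack-based adaptation might look is sketched below. This is an illustrative reading of the two bullets on vanishing peers, not the authors' exact algorithm. The `query` callback is hypothetical: it returns a peer's neighbor list, or None when the query times out.

```python
import random

def mrwb(query, start, hops, rng):
    """Metropolized random walk with backtracking (illustrative sketch).

    `query(peer)` is an assumed callback returning the peer's neighbor
    list, or None if the peer did not answer in time.
    """
    stack = [start]  # visited peers, most recent on top
    for _ in range(hops):
        x = stack[-1]
        x_neighbors = query(x)
        if not x_neighbors:
            # The current peer vanished: go back to the previous one.
            if len(stack) > 1:
                stack.pop()
            continue
        y = rng.choice(x_neighbors)
        y_neighbors = query(y)
        if not y_neighbors:
            continue  # the proposed peer vanished: stay at x
        if rng.random() < min(1.0, len(x_neighbors) / len(y_neighbors)):
            stack.append(y)
    return stack[-1]

if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"]}
    # Peer d is still listed by c but has departed, so its query times out.
    print(mrwb(lambda p: graph.get(p), "a", 25, random.Random(0)))
```

In this sketch a timed-out hop still consumes one of the `hops`. Whether a real tool should count it that way is a design choice the slides do not settle.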
13 Metrics: Fundamental properties
- We focus on three fundamental properties that affect the walk:
- Degree
- Session length
- Query latency (in paper only)
- We compute the KS statistic (D) for each distribution versus a snapshot from an oracle.
- We evaluate these metrics under a variety of conditions:
- Several models of churn
- Several models of degree distribution
- Four different peer discovery mechanisms
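For reference, the Kolmogorov-Smirnov statistic D is the largest vertical gap between two empirical cumulative distribution functions. A minimal two-sample version might look like the following; the function name and the sample values are made up for illustration, with one list standing in for the sampled values and the other for the oracle snapshot.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(v) - F_b(v)| over all values v."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        # Advance past every observation <= v in both samples.
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

if __name__ == "__main__":
    sampled_degrees = [3, 5, 5, 7, 9]   # hypothetical sampled values
    oracle_degrees = [3, 4, 5, 7, 8]    # hypothetical snapshot values
    print(ks_statistic(sampled_degrees, oracle_degrees))
```

D is 0 when the two empirical distributions coincide and 1 when they do not overlap at all, so small values indicate that the samples track the snapshot closely.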
14 Base case
- Base case:
- Session length distribution is Weibull (k = 0.59, λ = 40)
- Maximum degree: 30
- Target degree: 15
- Peer discovery mechanism: FIFO rendezvous point
- Sampled and expected distributions are visually indistinguishable.
- Very low KS statistic: D < 0.004
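To get a feel for the base-case churn model, session lengths can be drawn from a Weibull distribution with Python's standard library. The sketch assumes 0.59 is the shape parameter k and 40 is the scale parameter, and it leaves the time unit unspecified. The function names are illustrative.

```python
import math
import random
import statistics

SHAPE_K = 0.59   # assumed to be the Weibull shape parameter
SCALE = 40.0     # assumed to be the Weibull scale parameter

def weibull_median(shape, scale):
    """Closed-form median of a Weibull distribution: scale * ln(2)^(1/shape)."""
    return scale * math.log(2.0) ** (1.0 / shape)

def sample_session_lengths(n, rng):
    # random.weibullvariate takes (scale, shape), in that order.
    return [rng.weibullvariate(SCALE, SHAPE_K) for _ in range(n)]

if __name__ == "__main__":
    sessions = sample_session_lengths(50_000, random.Random(0))
    print("analytic median:", round(weibull_median(SHAPE_K, SCALE), 2))
    print("sample median:  ", round(statistics.median(sessions), 2))
    print("sample mean:    ", round(statistics.fmean(sessions), 2))
```

A shape below 1 gives a heavy-tailed mix: many short sessions and a few very long ones, so the median sits well below the mean.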
15 Varying churn
- Each point represents a simulation; the y-axis shows the KS statistic (D).
- Error is low over a wide range of session lengths.
- Becomes significant for median < 2 min.
- High for median < 30 s.
- Type of distribution does not have a large impact.
16 Varying topology
- Little bias when target degree > 2.
- Degree 2 means network fragmentation.
- History mechanism bias is due to 2% of peers with no neighbors.
- More simulation results in the paper.
17 Empirical results
- We developed the technique into a tool called ion-sampler.
- Available from our website.
bash$ ./ion-sampler gnutella --hops 25 -n 10
10.8.65.171:6348
10.199.20.183:5260
10.8.45.103:34717
10.21.0.29:6346
10.32.170.200:6346
10.201.162.49:30274
10.222.183.129:47272
10.245.64.85:6348
10.79.198.44:36520
10.216.54.169:44380
18 Empirical validation
- Empirical validation is tricky because there is no perfect baseline for comparison.
- Full crawling performed by Cruiser (Stutzbach, IMC 2005).
- The full crawl may be slightly biased towards higher degree.
- Ion-sampler records slightly fewer higher-degree peers than a full crawl.
- Conclusion: ion-sampler is close to a full crawl in accuracy, and may even be more accurate!
19 Conclusions and Future Work
- Summary:
- Temporal and topological bias can lead to sampling error.
- We present the Metropolized Random Walk with Backtracking (MRWB) technique.
- Extensive simulations show that it gathers nearly unbiased samples in a wide variety of circumstances.
- Ion-sampler is a tool for gathering nearly unbiased samples from real P2P systems.
- Future work:
- Explore improving sampling efficiency for uncommon events.
- Evaluate MRWB under flash crowd scenarios.
- Develop additional plug-ins for ion-sampler.