Sampling Biases in IP Topology Measurements - PowerPoint PPT Presentation


Slides: 26

Provided by: lakhinabye

Learn more at: https://www.cs.bu.edu


Transcript and Presenter's Notes

Title: Sampling Biases in IP Topology Measurements

1
Sampling Biases in IP Topology Measurements
John Byers with Anukool Lakhina, Mark Crovella
and Peng Xie
Department of Computer Science
Boston University
2
Discovering the Internet topology

Goal: Discover the Internet router graph.
Vertices represent routers.
Edges connect routers that are one IP hop apart.

3
Traceroute studies today

k sources: Few active sources, strategically located.
m destinations: Many passive destinations, globally dispersed.
Union of many traceroute paths: a (k,m)-traceroute study.

[Figure: traceroute paths from sources to destinations]
4
Heavy tails in Topology Measurements
A surprising finding [FFF99]: Let d be a given node degree. Let f_d be the
frequency of degree-d vertices in a graph. Power-law relationship: f_d is
proportional to a constant power of d.
5
We're skeptical

We will argue that the evidence for power laws is at best insufficient.
"Insufficient" does not mean noisy or incomplete (which these datasets
certainly are!).
For us, insufficient means that measurements are statistically biased.
We will show that (k,m)-traceroute studies exhibit significant sampling bias.

6
A thought experiment

Idea: Simulate topology measurements on a random graph.
Generate a sparse Erdős-Rényi random graph, G = (V,E). Each edge is present
independently with probability p. Assign weights w(e) = 1 + ε, where ε in
[range not preserved in this transcript].
Pick k unique source nodes, uniformly at random.
Pick m unique destination nodes, uniformly at random.
Simulate traceroute from k sources to m destinations, i.e. learn shortest
paths between k sources and m destinations.
Let G' be the union of shortest paths.
Ask: How does G' compare with G?
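The thought experiment above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulator: it uses unweighted breadth-first search with shuffled neighbor order in place of the perturbed edge weights, a much smaller graph than the one on the slides, and hypothetical helper names (`er_graph`, `bfs_parents`, `measure`).

```python
import random
from collections import deque

def er_graph(n, p, rng):
    """Adjacency sets for a G(n, p) random graph (O(n^2); fine for small n)."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def bfs_parents(adj, src, rng):
    """BFS tree from src; neighbors are shuffled so ties break at random."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        nbrs = list(adj[u])
        rng.shuffle(nbrs)
        for v in nbrs:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def measure(adj, k, m, rng):
    """Union of one shortest path from each of k sources to each of m destinations."""
    nodes = list(adj)
    sources = rng.sample(nodes, k)
    dests = rng.sample(nodes, m)
    edges = set()
    for s in sources:
        parent = bfs_parents(adj, s, rng)
        for d in dests:
            v = d
            # Walk back toward the source; unreachable destinations are skipped.
            while v in parent and parent[v] is not None:
                edges.add(frozenset((v, parent[v])))
                v = parent[v]
    return sources, dests, edges

rng = random.Random(0)
G = er_graph(2000, 0.005, rng)            # assumed toy size, not N = 100000
sources, dests, measured_edges = measure(G, k=3, m=200, rng=rng)
true_edges = sum(len(nb) for nb in G.values()) // 2
print(len(measured_edges), "of", true_edges, "edges observed")
```

The set `measured_edges` plays the role of the measured graph: every edge in it exists in G, but only a fraction of G's edges appear.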

7
[Figure: log(Pr[X > x]) vs. log(Degree) for the underlying random graph, G,
and the measured graph, G']
Underlying graph: N = 100000, p = 0.00015. Measured graph: k = 3, m = 1000.
G' is a biased sample of G that looks heavy-tailed.
Are heavy tails a measurement artifact?
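The plots in this deck show the complementary cumulative distribution Pr[X > x] of node degree on log-log axes. A minimal sketch of how such points could be computed from a list of degrees follows; the function names `ccdf` and `loglog` are illustrative, not taken from the slides.

```python
from collections import Counter
import math

def ccdf(degrees):
    """Return sorted (x, Pr[X > x]) pairs for the empirical degree distribution."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n
    points = []
    for x in sorted(counts):
        remaining -= counts[x]          # nodes with degree strictly greater than x
        points.append((x, remaining / n))
    return points

def loglog(points):
    """Drop points where either coordinate is zero, then take log10 of both."""
    return [(math.log10(x), math.log10(p)) for x, p in points if x > 0 and p > 0]

example = [1, 1, 1, 2, 2, 3, 5]
pts = ccdf(example)
log_pts = loglog(pts)
```

On these axes a heavy-tailed distribution appears roughly as a straight line, while the degree distribution of a sparse random graph drops off much faster.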
8
Outline

Motivation and Thought Experiments
Understanding Bias on Simulated Topologies: Where and Why
Detecting and Defining Bias: Statistical hypotheses to infer presence of bias
Examining Internet Maps

9
Understanding Bias
(k,m)-traceroute sampling of graphs is biased.
An intuitive explanation: When traces are run from few sources to many
destinations, some portions of the underlying graph are explored more than
others.
We now investigate the causes behind bias.
10
Are nodes sampled unevenly?

Conjecture: Shortest path routing favors higher degree nodes ⇒ nodes sampled
unevenly.
Validation: Examine true degrees of nodes in the measured graph, G'. Expect
true degrees of nodes in G' to be higher than degrees of nodes in G, on
average.

11
Are edges sampled unevenly?

Conjecture: Edges selected incident to a node in G' are not proportional to
its true degree.
Validation: For each node in G', plot true degree vs. measured degree. If
unbiased, the ratio of true to measured degree should be constant, with points
clustered around a y = cx line (c < 1).
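The validation step above amounts to pairing each measured node's true degree with its measured degree and checking how far the ratio strays from a constant. A minimal sketch, assuming the underlying graph is available as adjacency sets and the measured graph as a set of edges (both hypothetical input shapes):

```python
def degree_pairs(adj, measured_edges):
    """Map each node seen in the measured graph to (true degree, measured degree)."""
    measured_deg = {}
    for u, v in measured_edges:
        measured_deg[u] = measured_deg.get(u, 0) + 1
        measured_deg[v] = measured_deg.get(v, 0) + 1
    return {v: (len(adj[v]), d) for v, d in measured_deg.items()}

def ratio_spread(pairs):
    """Min and max of measured/true degree; a tight range suggests y = c*x."""
    ratios = [m / t for t, m in pairs.values()]
    return min(ratios), max(ratios)

# Toy example: a star centered on node 0 plus the extra edge 1-2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
measured = {(0, 1), (0, 3)}             # only two of the four edges were observed
pairs = degree_pairs(adj, measured)
spread = ratio_spread(pairs)
```

A wide spread between the minimum and maximum ratio is the signature of uneven edge sampling that the slide describes.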

12
Why: Analyzing Bias

Question: Given some vertex in G' that is h hops from the source, what
fraction of its true edges are contained in G'?
Message: As h increases, the number of edges discovered falls off sharply.

[Figure: fraction of node edges discovered vs. distance from source, with
curves for 100, 600, and 1000 destinations]
We can prove exponential fall-off analytically, in a simplified model.
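The quantity plotted on this slide can be estimated empirically as the average, per hop distance, of measured degree divided by true degree. The sketch below is one way to compute it, assuming adjacency sets for the underlying graph and an edge set for the measured graph; the function names are hypothetical and the averaging choice is an assumption, not necessarily what the authors plotted.

```python
from collections import deque, defaultdict

def hop_distances(adj, sources):
    """Multi-source BFS: hop distance from each node to its nearest source."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def discovered_fraction_by_hop(adj, measured_edges, sources):
    """Per hop distance h, the mean of (measured degree / true degree)."""
    measured_deg = defaultdict(int)
    for u, v in measured_edges:
        measured_deg[u] += 1
        measured_deg[v] += 1
    dist = hop_distances(adj, sources)
    by_hop = defaultdict(list)
    for v, d in measured_deg.items():
        if v in dist:                    # ignore nodes unreachable from any source
            by_hop[dist[v]].append(d / len(adj[v]))
    return {h: sum(fr) / len(fr) for h, fr in sorted(by_hop.items())}

# Toy graph with edges 0-1, 0-2, 1-3, 1-4, 3-4 and a single source at node 0.
adj = {0: {1, 2}, 1: {0, 3, 4}, 2: {0}, 3: {1, 4}, 4: {1, 3}}
measured = {(0, 1), (0, 2), (1, 3)}
fractions = discovered_fraction_by_hop(adj, measured, sources=[0])
```

In this toy case the fraction drops from 1.0 at the source to 0.5 two hops away, mirroring the fall-off the slide reports.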
13
What does this suggest?
Summary: Edges are sampled unevenly by (k,m)-traceroute methods. Edges close
to the source are sampled more often than edges further away.
Intuitive picture: The neighborhood near sources is well explored, but
visibility of edges declines sharply with hop distance from sources.
14
Outline

Motivation and Thought Experiments
Understanding Bias in Simulated Topologies: Where and Why
Detecting and Defining Bias: Statistical hypotheses to infer presence of bias
Examining Internet Maps

15
Inferring Bias
Goal: Given a measured G', does it appear to be biased?
Why this is difficult: We don't have the underlying graph. We don't have
formal criteria for checking bias.
General approach: Examine statistical properties as a function of distance
from nearest source.
Unbiased sample ⇒ no change. Change ⇒ bias.
16
Detecting Bias
Examine Pr[D = d | H = h], the conditional probability that a node has
degree d, given that it is at distance h from the source.
[Figure: log(Pr[X > x]) vs. log(Degree) for the underlying graph, G' degrees
at H = 2, and G' degrees at H = 3]
Two observations:
1. Highest degree nodes are near the source.
2. Degree distribution of nodes near the source is different from that of
nodes far away.
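The conditional distribution of degree given hop distance can be tabulated directly from two per-node maps. A minimal sketch, assuming `degree` and `dist` dictionaries keyed by node (hypothetical inputs):

```python
from collections import Counter, defaultdict

def degree_given_distance(degree, dist):
    """Empirical Pr[D = d | H = h] as {h: {d: probability}}."""
    groups = defaultdict(Counter)
    for v, h in dist.items():
        groups[h][degree[v]] += 1
    return {h: {d: c / sum(cnt.values()) for d, c in cnt.items()}
            for h, cnt in groups.items()}

# Toy data: higher degrees sit closer to the source.
degree = {"a": 5, "b": 3, "c": 3, "d": 1, "e": 1, "f": 1}
dist = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2, "f": 2}
cond = degree_given_distance(degree, dist)
```

If sampling were unbiased with respect to distance, the inner distributions would look alike for every h; here they clearly differ.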
17
A Statistical Test for C1
C1: Are the highest-degree nodes near the source? If so, then consistent with
bias.
H0^C1: The 1% highest degree nodes occur at random with distance to nearest
source.

Cut the vertex set in half, N (near) and F (far), by distance from nearest
source.
Let v = (0.01)|V|.
κ = fraction of v that lies in N.
Can bound the likelihood that κ deviates from 1/2 using Chernoff bounds.

Reject the hypothesis with confidence 1 - α if [condition not preserved in
this transcript].
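The exact rejection condition is not recoverable from the transcript, so the sketch below substitutes a standard Hoeffding bound (a Chernoff-type inequality): if each of the v top-degree nodes lands in N with probability 1/2, then Pr[kappa - 1/2 >= t] <= exp(-2 v t^2). This is an assumed stand-in, not the authors' formula, and `c1_test` is a hypothetical name.

```python
import math

def c1_test(degree, dist, top_fraction=0.01, alpha=0.05):
    """Return (kappa, threshold, reject) for the 'top nodes are near' check."""
    nodes = sorted(dist, key=lambda n: dist[n])
    near = set(nodes[: len(nodes) // 2])            # N: closer half by distance
    v = max(1, int(top_fraction * len(nodes)))      # size of the top-degree set
    top = sorted(nodes, key=lambda n: degree[n], reverse=True)[:v]
    kappa = sum(1 for n in top if n in near) / v
    # Assumed Hoeffding threshold: exp(-2 v t^2) = alpha  =>  t = sqrt(ln(1/alpha) / (2 v))
    threshold = 0.5 + math.sqrt(math.log(1 / alpha) / (2 * v))
    return kappa, threshold, kappa > threshold

# Toy data: 200 nodes, degree decreasing with distance from the source.
dist = {i: i // 20 for i in range(200)}
degree = {i: 200 - i for i in range(200)}
kappa, threshold, reject = c1_test(degree, dist, top_fraction=0.05, alpha=0.05)
```

Nodes tied at the median distance are split arbitrarily by the sort, which is another simplification of this sketch.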
18
A Statistical Test for C2
C2: Is the degree distribution of nodes near the source different from those
further away? If so, consistent with bias.
H0^C2: Chi-Square Test succeeds on degree distribution for nodes near the
source and far from the source.
Partition vertices across the median distance: N (near) and F (far). Compare
the degree distributions of nodes in N and F, using the Chi-Square Test:

\chi^2 = \sum_{i} (O_i - E_i)^2 / E_i

where O and E are observed and expected degree frequencies and l is histogram
bin size. Reject the hypothesis with confidence 1 - α if [condition not
preserved in this transcript].
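A minimal sketch of the comparison, using the standard Pearson statistic. Two choices here are assumptions, not taken from the slides: the expected counts are the far-group histogram rescaled to the near-group size, and the decision compares against the textbook 5% critical value for one degree of freedom (3.841) because the toy example uses two bins.

```python
def bin_counts(degrees, edges):
    """Histogram with bins [edges[i], edges[i+1]); the last bin is open-ended.
    Assumes every degree is >= edges[0]."""
    counts = [0] * len(edges)
    for d in degrees:
        i = max(j for j, lo in enumerate(edges) if d >= lo)
        counts[i] += 1
    return counts

def chi_square_stat(observed, expected):
    """Pearson statistic: sum over bins of (O - E)^2 / E (requires E > 0)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

near_degrees = [1, 2, 2, 5, 6, 7, 8, 9]     # toy degrees of nodes in N
far_degrees = [1, 1, 1, 2, 2, 2, 3, 6]      # toy degrees of nodes in F
edges = [1, 4]                              # two bins: degree 1-3 and degree >= 4

observed = bin_counts(near_degrees, edges)
far_counts = bin_counts(far_degrees, edges)
expected = [c * len(near_degrees) / len(far_degrees) for c in far_counts]

stat = chi_square_stat(observed, expected)
reject = stat > 3.841                       # 5% critical value, 1 degree of freedom
```

With more bins, the critical value would have to come from a chi-square table or a statistics library for the matching degrees of freedom.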
19
Our Definition of Bias

Bias (Definition): Failure of a sampled graph to meet statistical tests for
randomness associated with C1 and C2.
Disclaimers: Tests are not conclusive. Tests are binary and don't tell us how
biased datasets are.
But a dataset that fails both tests is a poor choice for making
generalizations about the underlying graph.

20
Introducing datasets
[Figure: log(Pr[X > x]) vs. log(Degree) for the Pansiot-Grad, Mercator, and
Skitter datasets]
21
Testing C1
H0^C1: The 1% highest degree nodes occur at random with distance to source.
Pansiot-Grad: 93% of the highest degree nodes are in N.
Mercator: 90% of the highest degree nodes are in N.
Skitter: 84% of the highest degree nodes are in N.
22
Testing C2
H0^C2
23
Summary of Statistical Tests

All datasets show evidence of bias under both statistical tests.
It is likely that the true degree distribution of the routers is different
from that of these datasets.

24
Final Remarks

Using (k,m)-traceroute methods to discover Internet topology yields biased
samples.
Rocketfuel [SMW02] is limited-scale but may avoid some pitfalls of
(k,m)-traceroute studies.
One open question: How to sample the degree of a router at random?
Node degree distributions are especially sensitive to biased sampling ⇒ may
not be a sufficiently robust metric for characterizing or comparing graphs.

25
Sampling Power-Law Graphs
[Figure: log(Pr[X > x]) vs. log(Degree) for the underlying power-law graph
and the measured graph]
Underlying PLRG: N = 100000. Measured graph: k = 3, m = 1000.
Even though the distributional shape is similar, different exponents matter
for topology modeling.
Again, G' is a biased sample of G.
