Title: Tomography-based Overlay Network Monitoring

1. Tomography-based Overlay Network Monitoring
Yan Chen, David Bindel, and Randy H. Katz
2. Motivation
- Infrastructure ossification led to the thrust of overlay and P2P applications
- Such applications are flexible on paths and targets, and thus can benefit from E2E distance monitoring
  - Overlay routing/location
  - VPN management/provisioning
  - Service redirection/placement
- Requirements for an E2E monitoring system
  - Scalable and efficient: small amount of probing traffic
  - Accurate: captures congestion/failures
  - Incrementally deployable
  - Easy to use
3. Existing Work
- General metrics: RON (O(n^2) measurements)
- Latency estimation
  - Clustering-based: IDMaps, Internet Isobar, etc.
  - Coordinate-based: GNP, ICS, Virtual Landmarks
- Network tomography
  - Focuses on inferring the characteristics of physical links rather than E2E paths
  - Limited measurements => under-constrained system, unidentifiable links
4. Problem Formulation
- Given an overlay of n end hosts and O(n^2) paths, how to select a minimal subset of paths to monitor so that the loss rates/latency of all other paths can be inferred
- Assumptions
  - Topology is measurable
  - Can only measure the E2E path, not the individual links
5. Our Approach
- Select a basis set of k paths that fully describes all O(n^2) paths (k << O(n^2))
- Monitor the loss rates of the k paths, and infer the loss rates of all other paths
- Applicable to any additive metric, such as latency
6. Modeling of Path Space
[Figure: example overlay with end hosts A, B, C, D connected by links 1, 2, 3]
- Path loss rate p, link loss rate l
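Concretely, the linearization that makes the path space a vector space works as follows (a sketch of the standard formulation; the symbols b, G, x match the matrix notation on the following slides):

```latex
% A packet survives path i iff it survives every link j on the path:
%   1 - p_i = \prod_{j \in \mathrm{path}(i)} (1 - l_j)
% Taking logarithms turns the product into a sum, giving a linear system:
b_i \;=\; \log(1 - p_i) \;=\; \sum_{j=1}^{s} G_{ij}\, x_j,
\qquad
x_j = \log(1 - l_j),
\qquad
G_{ij} =
\begin{cases}
1 & \text{if link } j \text{ lies on path } i,\\
0 & \text{otherwise,}
\end{cases}
```

i.e., b = G x with a {0,1}-valued routing matrix G of size r x s.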
7. Putting All Paths Together
[Figure: the same example topology with all overlay paths drawn together]
- In total, r = O(n^2) paths over s links, with s << r
8. Sample Path Matrix
- x1 + x2 is measurable, but x1 - x2 is unknown => cannot compute x1, x2 individually
- The set of vectors invisible to measurement forms the null space of G
- To separate identifiable vs. unidentifiable components: x = x_G + x_N
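A tiny numerical sketch of this situation (an assumed two-link example, not the figure's exact topology): both monitored paths traverse the same two links, so only x1 + x2 is observable, and the minimum-norm least-squares solution recovers exactly the identifiable component x_G.

```python
import numpy as np

# Hypothetical example: both monitored paths traverse links 1 and 2
# together, so only the sum x1 + x2 is observable.
G = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.3, 0.3])           # measured path quantities (log-loss terms)

# Minimum-norm least-squares solution = projection onto the row space of G,
# i.e., the identifiable component x_G.
x_G, *_ = np.linalg.lstsq(G, b, rcond=None)

# The direction (1, -1) spans the null space: x1 - x2 is unidentifiable.
null_vec = np.array([1.0, -1.0]) / np.sqrt(2)

print(x_G)              # [0.15 0.15]
print(G @ null_vec)     # [0. 0.] -- adding any x_N changes no measurement
```

Any true x splits as x = x_G + x_N; measurements pin down x_G only, which is still enough to predict every path in the row space.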
9. Intuition through Topology Virtualization
- Virtual links
  - Minimal path segments whose loss rates are uniquely identifiable
  - Can fully describe all paths
  - x_G is composed of virtual links
- All E2E paths are in the path space, i.e., G x_N = 0
10. More Examples
[Figure: virtualization examples - real links (solid) and all of the overlay paths (dotted) traversing them, mapped to virtual links]
11. Algorithms
- Select k = rank(G) linearly independent paths to monitor
  - Use QR decomposition
  - Leverage sparse matrices: time O(rk^2) and memory O(k^2)
  - E.g., 10 minutes for n = 350 (r = 61,075) and k = 2,958
- Compute the loss rates of the other paths
  - Time O(k^2) and memory O(k^2)
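A minimal sketch of the two steps, using a greedy rank check as a dense stand-in for the sparse QR decomposition the slide describes (the toy matrix G and link values are illustrative assumptions):

```python
import numpy as np

def select_basis_paths(G):
    """Pick rows of G that are linearly independent (a simple dense
    stand-in for the QR-based selection; same resulting basis)."""
    selected = []
    basis = np.zeros((0, G.shape[1]))
    for i, row in enumerate(G):
        cand = np.vstack([basis, row])
        if np.linalg.matrix_rank(cand) > basis.shape[0]:
            basis = cand
            selected.append(i)
    return selected

# Toy topology: 4 paths over 3 links; rank(G) = 3, so monitor 3 paths.
G = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
x_true = np.array([0.01, 0.02, 0.03])       # link log-loss terms
b_all = G @ x_true

monitored = select_basis_paths(G)            # paths actually probed
Gk, bk = G[monitored], b_all[monitored]      # measure only these k paths
x_G, *_ = np.linalg.lstsq(Gk, bk, rcond=None)
b_inferred = G @ x_G                         # loss terms of ALL paths
print(monitored, np.allclose(b_inferred, b_all))
```

With k = rank(G) monitored paths, the least-squares solution x_G reproduces the measurement of every path, monitored or not.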
12. How Many Measurements Are Saved?
- Is k << O(n^2)? For a power-law Internet topology:
  - If the Internet were a pure hierarchical structure (a tree): k = O(n)
  - If the Internet had no hierarchy at all (worst case, a clique): k = O(n^2)
  - The Internet has a moderately hierarchical structure [TGJ02]
- When the majority of end hosts are on the overlay: k = O(n) (with proof)
- When a small portion of end hosts are on the overlay: for reasonably large n (e.g., 100), k = O(n log n) (extensive linear regression tests on both synthetic and real topologies)
13. Practical Issues
- Tolerance of topology measurement errors
- Measurement load balancing on end hosts
  - Randomized algorithm
- Adapting to topology changes
  - Adding/removing end hosts and routing changes
  - Efficient algorithms for incrementally updating the selected paths
14. Evaluation
- Extensive simulations
- Experiments on PlanetLab
  - 51 hosts, each from a different organization
  - 51 x 50 = 2,550 paths
  - On average, k = 872
- Results highlights
  - Average real loss rate: 0.023
  - Absolute error: mean 0.0027, 90% < 0.014
  - Relative error: mean 1.1, 90% < 2.0
  - On average, 248 out of 2,550 paths have no or incomplete routing information
  - No router aliases resolved
15. Conclusions
- A tomography-based overlay network monitoring system
- Given n end hosts, characterizes O(n^2) paths with a basis set of O(n log n) paths
- Selectively monitors the basis set for loss rates, then infers the loss rates of all other paths
- Both simulations and PlanetLab experiments show promising results
16. Backup Slides
17. Problem Formulation
- Given an overlay of n end hosts and O(n^2) paths, how to select a minimal subset of paths to monitor so that the loss rates/latency of all other paths can be inferred
- Key idea: based on topology, select a basis set of k paths that fully describes all O(n^2) paths (k << O(n^2))
  - Monitor the loss rates of the k paths, and infer the loss rates of all other paths
  - Applicable to any additive metric, such as latency
18. Modeling of Path Space
[Figure: example overlay with end hosts A, B, C, D connected by links 1, 2, 3]
- Path loss rate p, link loss rate l
- Put all r = O(n^2) paths together, over s links in total
19. Sample Path Matrix
- x1 + x2 is measurable, but x1 - x2 is unknown => cannot compute x1, x2 individually
- The set of vectors invisible to measurement forms the null space of G
- To separate identifiable vs. unidentifiable components: x = x_G + x_N
- All E2E paths are in the path space, i.e., G x_N = 0
20. More Examples
[Figure: virtualization examples - real links (solid) and all of the overlay paths (dotted) traversing them, mapped to virtual links]
21. Linear Regression Tests of the Hypothesis
- BRITE router-level topologies
  - Barabasi-Albert, Waxman, and hierarchical models
- Mercator real topology
- Most fit best with O(n), except the hierarchical ones, which fit best with O(n log n)
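The regression test itself can be sketched like this (the (n, k) pairs below are synthetic, generated to follow n log n purely for illustration; the real tests used the k values measured on the topologies above):

```python
import numpy as np

# Compare least-squares fits k ~ c1*n vs. k ~ c2*n*log(n)
# by their residual norms. Synthetic (n, k) pairs for illustration only.
n = np.array([100, 200, 400, 800, 1600], dtype=float)
k = 0.9 * n * np.log(n)            # pretend measurements following n log n

def fit_residual(feature, k):
    """Fit k ~ c * feature and return the residual norm."""
    c, *_ = np.linalg.lstsq(feature[:, None], k, rcond=None)
    return np.linalg.norm(feature * c[0] - k)

r_lin = fit_residual(n, k)
r_nlogn = fit_residual(n * np.log(n), k)
print(r_nlogn < r_lin)             # n log n model fits better here
```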
22. Algorithms
- Select k = rank(G) linearly independent paths to monitor
  - Use a rank-revealing decomposition
  - Leverage sparse matrices: time O(rk^2) and memory O(k^2)
  - E.g., 10 minutes for n = 350 (r = 61,075) and k = 2,958
- Compute the loss rates of the other paths
  - Time O(k^2) and memory O(k^2)
23. Practical Issues
- Tolerance of topology measurement errors
  - We care more about path loss rates than about any interior links
  - Poor router alias resolution => assign similar loss rates to the same links
  - Unidentifiable routers => add virtual links to bypass them
- Measurement load balancing on end hosts
  - Randomly order the paths for scanning and selection
- Topology changes
  - Efficient algorithms for incremental updates when adding/removing end hosts or upon routing changes
24. Work in Progress
- Provide it as a continuous service on PlanetLab
- Network diagnostics
  - Which links or path segments are down?
- Iterative methods for better speed and scalability
25. Topology Changes
- Basic building block: add/remove one path
  - Incremental changes: O(k^2) time (vs. O(n^2 k^2) for a full re-scan)
  - Add path: check linear dependency with the old basis set
  - Delete path p: hard when essential information is described only by p
- Adding/removing end hosts, routing changes
  - Topology is relatively stable on the order of a day => incremental detection
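The add-path building block can be sketched with a least-squares independence test (toy rows; the tolerance and matrix sizes are illustrative assumptions):

```python
import numpy as np

def try_add_path(basis, new_row, tol=1e-9):
    """Add new_row to the monitored basis only if it is linearly
    independent of the current basis (checked via the least-squares
    residual), avoiding a full re-scan of all O(n^2) paths."""
    if basis.shape[0] == 0:
        return np.atleast_2d(new_row), True
    coef, *_ = np.linalg.lstsq(basis.T, new_row, rcond=None)
    residual = new_row - basis.T @ coef
    if np.linalg.norm(residual) > tol:       # independent: grow the basis
        return np.vstack([basis, new_row]), True
    return basis, False                      # dependent: just infer it

basis = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0]])
basis, added_new = try_add_path(basis, np.array([1.0, 0.0, 1.0]))
print(added_new)        # True: the new path adds information
basis, added_dup = try_add_path(basis, np.array([1.0, 1.0, 0.0]))
print(added_dup)        # False: already spanned by the basis
```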
26. Evaluation
- Simulation
  - Topology
    - BRITE: Barabasi-Albert, Waxman, hierarchical; 1K - 20K nodes
    - Real topology from Mercator: 284K nodes
  - Fraction of end hosts on the overlay: 1% - 10%
  - Loss rate distribution (90% of links are good)
    - Good link: 0-1% loss rate; bad link: 5-10% loss rate
    - Good link: 0-1% loss rate; bad link: 1-100% loss rate
  - Loss model
    - Bernoulli: independent packet drops
    - Gilbert: bursty packet drops
  - Path loss rate simulated via transmission of 10K packets
- Experiments on PlanetLab
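The two loss models listed above can be sketched as follows (the transition probabilities are illustrative assumptions, not the paper's parameters):

```python
import random

def bernoulli_losses(n_pkts, p=0.05, rng=random.Random(1)):
    """Bernoulli model: each packet dropped independently with prob p."""
    return [rng.random() < p for _ in range(n_pkts)]

def gilbert_losses(n_pkts, p_good_to_bad=0.01, p_bad_to_good=0.2,
                   rng=random.Random(1)):
    """Gilbert model: a two-state chain that drops packets in bursts
    while in the bad state."""
    bad, out = False, []
    for _ in range(n_pkts):
        bad = (rng.random() < p_good_to_bad) if not bad \
              else (rng.random() >= p_bad_to_good)
        out.append(bad)                 # packet dropped in bad state
    return out

# Path loss rate estimated from 10K simulated transmissions:
drops = gilbert_losses(10_000)
loss_rate = sum(drops) / len(drops)
print(loss_rate)
```

With these sample parameters the chain spends roughly p_g2b / (p_g2b + p_b2g) of its time in the bad state, so the estimate hovers near that stationary fraction.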
27. Experiments on PlanetLab
- 51 hosts, each from a different organization
- 51 x 50 = 2,550 paths
- Simultaneous loss rate measurement
  - 300 trials, 300 msec each
  - In each trial, send a 40-byte UDP packet to every other host
- Simultaneous topology measurement
  - Traceroute
- Experiments ran 6/24 - 6/27
  - 100 experiments in peak hours
28. Sensitivity Test of Sending Frequency
- Big jump in the number of lossy paths when the sending rate exceeds 12.8 Mbps
29. PlanetLab Experiment Results
- Loss rate distribution
- Metrics
  - Absolute error: |p̂ - p|
    - Average 0.0027 for all paths, 0.0058 for lossy paths
  - Relative error [BDPT02]
  - Lossy path inference: coverage and false positive ratio
- On average, k = 872 out of 2,550 paths
30. Accuracy Results for One Experiment
- 95% of absolute errors < 0.0014
- 95% of relative errors < 2.1
31. Accuracy Results for All Experiments
- For each experiment, take its 95th-percentile absolute and relative errors
- Most have absolute error < 0.0135 and relative error < 2.0
32. Lossy Path Inference Accuracy
- 90 out of 100 runs have coverage over 85% and a false positive ratio under 10%
- Many errors are caused by boundary effects of the 5% lossy-path threshold
33. Topology/Dynamics Issues
- Out of 13 sets of pair-wise traceroutes
  - On average, 248 out of 2,550 paths have no or incomplete routing information
  - No router aliases resolved
- Conclusion: robust against topology measurement errors
- Simulations of adding/removing end hosts and routing changes also give good results
34. Performance Improvement with Overlay
- With single-node relay
- Loss rate improvement
  - Among 10,980 lossy paths
  - 5,705 paths (52.0%) have their loss rate reduced by 0.05 or more
  - 3,084 paths (28.1%) change from lossy to non-lossy
- Throughput improvement
  - Estimated with a TCP throughput formula
  - 60,320 paths (24%) have non-zero loss rate, so throughput is computable
  - Among them, 32,939 paths (54.6%) have throughput improved; 13,734 paths (22.8%) have throughput doubled or more
- Implication: use overlay paths to bypass congestion or failures
35. Adaptive Overlay Streaming Media
[Figure: streaming testbed spanning Stanford, UC San Diego, UC Berkeley, and HP Labs; X marks a congested/failed link]
- Implemented with a Winamp client and a SHOUTcast server
- Congestion introduced with a Packet Shaper
- Skip-free playback: server buffering and rewinding
- Total adaptation time < 4 seconds
36. Adaptive Streaming Media Architecture
37. Conclusions
- A tomography-based overlay network monitoring system
- Given n end hosts, characterizes O(n^2) paths with a basis set of O(n log n) paths
- Selectively monitors the O(n log n) basis paths to compute their loss rates, then infers the loss rates of all other paths
- Both simulations and real Internet experiments are promising
- Built an adaptive overlay streaming media system on top of the monitoring services
  - Bypasses congestion/failures for smooth playback within seconds