Title: An Algebraic Approach to Practical and Scalable Overlay Network Monitoring
1. An Algebraic Approach to Practical and Scalable Overlay Network Monitoring
Yan Chen (Northwestern University)
David Bindel, Hanhee Song, and Randy H. Katz (University of California at Berkeley)
ACM SIGCOMM 2004
2. Motivation
- Infrastructure ossification led to the thrust of overlay and P2P applications
- Such applications are flexible on paths and targets, and thus can benefit from E2E distance monitoring
  - Overlay routing/location
  - VPN management/provisioning
  - Service redirection/placement
- Requirements for an E2E monitoring system
  - Scalable and efficient: small amount of probing traffic
  - Accurate: capture congestion/failures
  - Adaptive: nodes join/leave, topology changes
  - Robust: tolerate measurement errors
  - Balanced measurement load
3. Related Work
- General metrics: RON (O(n²) measurements)
- Latency estimation
  - Link-level measurement: min set cover (Ozmutlu et al.); a similar approach for giving bounds on other metrics (Tang and McKinley)
  - Clustering-based: IDMaps, Internet Isobar, etc.
  - Coordinate-based: GNP, Virtual Landmarks, Vivaldi, etc.
- Network tomography
  - Focuses on inferring the characteristics of physical links rather than E2E paths
  - Limited measurements → under-constrained system, unidentifiable links
4. Problem Formulation
- Given an overlay of n end hosts and O(n²) paths, how to select a minimal subset of paths to monitor so that the loss rates/latency of all other paths can be inferred
- Assumptions
  - Topology is measurable
  - Can only measure the E2E path, not the link
5. Outline
- An algebraic approach framework
- Algorithms for a fixed set of overlay nodes
- Scalability analysis
- Adaptive dynamic algorithms
- Measurement load balancing
- Handling topology measurement errors
- Simulations and Internet experiments
6. Our Approach
- Select a basis set of k paths that fully describes all O(n²) paths (k ≪ O(n²))
- Monitor the loss rates of the k basis paths, and infer the loss rates of all other paths
- Applicable to any additive metric, like latency
7. Modeling of Path Space
[Figure: example topology with end hosts A, B, C, D connected by links 1, 2, 3]
- Path loss rate p, link loss rate l
8. Putting All Paths Together
- Totally r = O(n²) paths, s links, s ≪ r
9. Sample Path Matrix
- x1 − x2 unknown ⇒ cannot compute x1, x2 individually
- To separate identifiable vs. unidentifiable components: x = xG + xN
- All E2E paths (rows of G) are orthogonal to xN, i.e., G·xN = 0
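The decomposition x = xG + xN can be computed with a pseudoinverse: xG = G⁺Gx is the projection of x onto the row space of G, and the remainder xN lies in the null space, so no E2E measurement can see it. A toy under-constrained case (one path over two links; the numeric values are illustrative):

```python
import numpy as np

# One monitored path over two links: x1 - x2 is invisible to any
# measurement, exactly as on the slide.
G = np.array([[1.0, 1.0]])
x = np.array([0.04, 0.01])     # true link vector (unknown in practice)

xG = np.linalg.pinv(G) @ (G @ x)   # projection onto the row space of G
xN = x - xG                        # null-space (unidentifiable) component

assert np.allclose(G @ xN, 0.0)    # all paths are orthogonal to xN
assert np.allclose(G @ xG, G @ x)  # xG reproduces every path measurement
```

Here xG = [0.025, 0.025]: the measurement fixes only the sum x1 + x2, and the projection splits it evenly.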
10. Intuition through Topology Virtualization
- Virtual links: minimal path segments whose loss rates are uniquely identifiable
  - Can fully describe all paths
  - xG is composed of virtual links
[Figure: virtualization — real links (solid) with all of the overlay paths (dotted) traversing them become virtual links]
11. Algorithms
- Select k = rank(G) linearly independent paths to monitor (one time)
  - Use QR decomposition
  - Leverage sparse matrix: time O(rk²) and memory O(k²)
  - E.g., 79 seconds for n = 300 (r = 44,850) and k = 2,541
- Compute the loss rates of the other paths (continuously)
  - Time O(k²) and memory O(k²)
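The two phases above can be sketched with numpy on a random toy path matrix. The slide's O(rk²) method maintains an incremental QR factorization; for clarity this sketch uses plain rank checks for the scan (slower, same selection), and the continuous phase infers all r paths from measurements of the k basis paths alone:

```python
import numpy as np

rng = np.random.default_rng(0)
r, s = 40, 12                                  # toy sizes: r paths, s links
G = (rng.random((r, s)) < 0.3).astype(float)   # random 0/1 path matrix

# One-time selection: keep a path only if it is linearly independent
# of the paths kept so far.
basis = []
for i in range(r):
    if np.linalg.matrix_rank(G[basis + [i]]) == len(basis) + 1:
        basis.append(i)
Gbar = G[basis]
k = len(basis)
assert k == np.linalg.matrix_rank(G)

# Continuous phase: measure only the k basis paths, infer all r.
x_true = rng.random(s) * 0.1                   # hidden link quantities
b_bar = Gbar @ x_true                          # measured basis paths only
xG, *_ = np.linalg.lstsq(Gbar, b_bar, rcond=None)  # minimum-norm solution
assert np.allclose(G @ xG, G @ x_true)         # every path inferred exactly
```

The inference is exact because every row of G lies in the row space of Gbar, so any solution consistent with the basis measurements reproduces all path values.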
12. Outline
- An algebraic approach framework
- Algorithms for a fixed set of overlay nodes
- Scalability analysis
- Adaptive dynamic algorithms
- Measurement load balancing
- Handling topology measurement errors
- Simulations and Internet experiments
13. How many measurements saved?
- Is k ≪ O(n²)?
- For a power-law Internet topology
  - When the majority of end hosts are on the overlay:
    - If the Internet were a pure hierarchical structure (tree): k = O(n)
    - If the Internet had no hierarchy at all (worst case, a clique): k = O(n²)
    - The Internet has a moderate hierarchical structure [TGJ02] ⇒ k = O(n) (with proof)
  - When a small portion of end hosts are on the overlay: for reasonably large n (e.g., 100), k = O(n log n)
14. Linear Regression Tests of the Hypothesis
- BRITE router-level topologies
  - Barabási-Albert, Waxman, and hierarchical models
- Mercator real topology
- Most have the best fit with O(n), except the hierarchical ones, which fit best with O(n log n)
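The regression test can be sketched with numpy's least squares: fit k against both candidate models, a·n + b and a·n·log n + b, and compare goodness of fit. The (n, k) pairs below are synthetic, made up to be roughly linear; the real data comes from the BRITE/Mercator runs:

```python
import numpy as np

# Hypothetical (n, k) pairs, roughly linear by construction.
n = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
k = 2.5 * n + np.array([3.0, -5.0, 8.0, -2.0, 4.0])

def r2(cols, y):
    """R^2 of a least-squares fit y ~ cols (with intercept in cols)."""
    coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
    resid = y - cols @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones_like(n)
fit_n = r2(np.column_stack([n, ones]), k)                   # k ~ a*n + b
fit_nlogn = r2(np.column_stack([n * np.log(n), ones]), k)   # k ~ a*n log n + b

# With (synthetically) linear data, the O(n) model fits essentially perfectly.
assert fit_n > 0.999
```

Over moderate ranges of n, the two models correlate strongly, which is why the slide needs topologies spanning 1K to 20K nodes to separate them.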
15. Outline
- An algebraic approach framework
- Algorithms for a fixed set of overlay nodes
- Scalability analysis
- Adaptive dynamic algorithms
- Measurement load balancing
- Handling topology measurement errors
- Simulations and Internet experiments
16. Topology Changes
- Basic building block: add/remove one path
  - Incremental changes: O(k²) time (vs. O(n²k²) for a full re-scan)
  - Add path: check linear dependency with the old basis set
  - Delete path: hard when the path is in the basis set; intuitively handled in two steps
- Add/remove end hosts, routing changes
  - Routing is relatively stable, on the order of a day
  - ⇒ incremental detection
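The add-path building block can be sketched as a linear-independence test against the current basis. The paper does this in O(k²) by updating an R factor; the sketch below uses a least-squares projection instead for clarity (the topology rows are made-up examples):

```python
import numpy as np

def add_path(Gbar, g, tol=1e-8):
    """Handle one newly appeared path (row vector g).

    Monitor the new path only if it is linearly independent of the
    current basis Gbar; otherwise its loss rate is already inferable.
    """
    coef, *_ = np.linalg.lstsq(Gbar.T, g, rcond=None)
    independent = np.linalg.norm(Gbar.T @ coef - g) > tol
    return (np.vstack([Gbar, g]) if independent else Gbar), independent

Gbar = np.array([[1.0, 1.0, 0.0],
                 [0.0, 1.0, 1.0]])
# [1, 0, -1] = row1 - row2: dependent, no new measurement needed.
Gbar, indep = add_path(Gbar, np.array([1.0, 0.0, -1.0]))
assert not indep
# A path over link 3 alone carries new information: it joins the basis.
Gbar, indep = add_path(Gbar, np.array([0.0, 0.0, 1.0]))
assert indep and Gbar.shape == (3, 3)
```

Deleting a basis path is the harder direction, since a replacement must be found among the remaining measured paths before the row can be dropped.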
17. Topology Change Example
18. Other Practical Issues
- Measurement load balancing
  - Randomly reorder the paths in G before scanning them for selection of the basis set
  - Has no effect on the loss rate estimation accuracy
- Topology measurement error tolerance
  - Care about path loss rates more than any interior links
  - Router aliases
    - ⇒ let it be: assign similar loss rates to the same links
  - Path (segments) without topology info
    - ⇒ add virtual links to bypass them
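The load-balancing trick is safe because a row permutation changes which paths the scan picks (spreading probing load across end hosts) but never how many, since the rank of G is invariant under reordering. A tiny check on a random toy matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
G = (rng.random((30, 10)) < 0.3).astype(float)   # toy 0/1 path matrix

# Random reordering before the scan: the basis size k = rank(G)
# is unchanged by any permutation of the rows.
perm = rng.permutation(G.shape[0])
assert np.linalg.matrix_rank(G[perm]) == np.linalg.matrix_rank(G)
```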
19. Outline
- An algebraic approach framework
- Algorithms for a fixed set of overlay nodes
- Scalability analysis
- Adaptive dynamic algorithms
- Measurement load balancing
- Handling topology measurement errors
- Simulations and Internet experiments
20. Evaluation

Hosts by area and domain:
  US (40):             .edu 33, .org 3, .net 2, .gov 1, .us 1
  International (11):  Europe (6): France 1, Sweden 1, Denmark 1, Germany 1, UK 2
                       Asia (2): Taiwan 1, Hong Kong 1
                       Canada 2, Australia 1

- Extensive simulations (see paper)
- Experiments on PlanetLab
  - 51 hosts, each from a different organization
  - 51 × 50 = 2,550 paths
  - Simultaneous loss rate measurement
    - 300 trials, 300 msec each
    - In each trial, send a 40-byte UDP pkt to every other host
  - Topology measurement (traceroute)
  - 100 experiments in peak hours of North America
21. PlanetLab Experiment Results
- Loss rate distribution
  - On average k = 872 out of 2,550
- Metrics
  - Absolute error |p̂ − p|
    - Average 0.0027 for all paths, 0.0058 for lossy paths
  - Relative error [BDPT02]
    - Average 1.1 for all paths, and 1.7 for lossy paths

Loss-rate distribution of paths (%):
  [0, 0.05): 95.9  |  lossy paths [0.05, 1.0]: 4.1 in total, broken down as
  [0.05, 0.1): 15.2   [0.1, 0.3): 31.0   [0.3, 0.5): 23.9   [0.5, 1.0): 4.3   1.0: 25.6
22. More Experiment Results
- Running time
  - Setup (path selection): 0.75 seconds
  - Update (for all 2,550 paths): 0.16 seconds
- More results on topology change adaptation: see paper
- Robustness
  - Out of 14 sets of pairwise traceroutes
    - On average 245 out of 2,550 paths have no or incomplete routing information
    - No router aliases resolved
  - Conclusion: robust against topology measurement errors
23. Results for Measurement Load Balancing
- Simulation on an overlay of 300 end hosts, average load 8.5
- With balancing: Gaussian-like load distribution
- Without: heavily skewed, with the maximum almost 20 times the average
24. Conclusions
- A tomography-based overlay network monitoring system
  - Given n end hosts, characterize O(n²) paths with a basis set of O(n log n) paths
  - Selectively monitor the basis set for their loss rates, then infer the loss rates of all other paths
  - Adaptive to topology changes
  - Balanced measurement load
  - Tolerant of topology measurement errors
- Both simulation and PlanetLab experiments show promising results
- Built an adaptive overlay streaming media system on top of it
25. Backup Slides
26. Other Practical Issues
- Topology measurement error tolerance
  - Care about path loss rates more than any interior links
  - Poor router alias resolution
    - ⇒ assign similar loss rates to the same links
  - Unidentifiable routers
    - ⇒ add virtual links to bypass them
- Measurement load balancing on end hosts
  - Randomly order the paths for scan and selection of the basis set
27. Modeling of Path Space
[Figure: example topology with end hosts A, B, C, D connected by links 1, 2, 3]
- Path loss rate p, link loss rate l
- Put all r = O(n²) paths together; totally s links
28. Sample Path Matrix
- x1 − x2 unknown ⇒ cannot compute x1, x2 individually
- The set of such vectors forms the null space of G
- To separate identifiable vs. unidentifiable components: x = xG + xN
- All E2E paths are in the path space, i.e., G·xN = 0
29. Intuition through Topology Virtualization
- Virtual links
  - Minimal path segments whose loss rates are uniquely identifiable
  - Can fully describe all paths
  - xG is composed of virtual links
- All E2E paths are in the path space, i.e., G·xN = 0
30. Algorithms
- Select k = rank(G) linearly independent paths to monitor
  - Use a rank-revealing decomposition
  - Leverage sparse matrix: time O(rk²) and memory O(k²)
  - E.g., 10 minutes for n = 350 (r = 61,075) and k = 2,958
- Compute the loss rates of the other paths
  - Time O(k²) and memory O(k²)
31. Practical Issues
- Topology measurement error tolerance
  - Care about path loss rates more than any interior links
  - Poor router alias resolution → assign similar loss rates to the same links
  - Unidentifiable routers → add virtual links to bypass them
- Measurement load balancing on end hosts
  - Randomly order the paths for scan and selection of the basis set
- Topology changes
  - Efficient algorithms for incremental updates of the basis set
  - For adding/removing end hosts and routing changes
32. More Experiment Results
- Measurement load balancing
  - Putting the load values of each node in 10 equally spaced bins
  - [Figures: load distribution with vs. without load balancing]
- Running time
  - Setup (path selection): 0.75 seconds
  - Update (for all 2,550 paths): 0.16 seconds
- More results on topology change adaptation: see paper
33. Work in Progress
- Provide it as a continuous service on PlanetLab
- Network diagnostics
  - Which links or path segments are down?
- Iterative methods for better speed and scalability
34. Evaluation
- Simulation
  - Topology
    - BRITE: Barabási-Albert, Waxman, and hierarchical models; 1K to 20K nodes
    - Real topology from Mercator: 284K nodes
  - Fraction of end hosts on the overlay: 1–10%
  - Loss rate distribution (90% of links are good)
    - Good link: 0–1% loss rate; bad link: 5–10% loss rate
    - Good link: 0–1% loss rate; bad link: 1–100% loss rate
  - Loss model
    - Bernoulli: independent drop of packets
    - Gilbert: bursty drop of packets
  - Path loss rate simulated via transmission of 10K pkts
- Experiments on PlanetLab
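The Gilbert model above is a two-state chain: a good state that delivers packets and a bad state that drops them, producing the bursty losses real links show. A minimal sketch; the transition probabilities are illustrative values, not the paper's settings:

```python
import numpy as np

def gilbert_drops(n_pkts, p_gb=0.05, p_bg=0.5, seed=0):
    """Simulate bursty packet drops with a two-state Gilbert chain.

    p_gb: prob. of moving good -> bad; p_bg: prob. of moving bad -> good.
    Packets sent in the bad state are dropped.
    """
    rng = np.random.default_rng(seed)
    bad = False
    lost = np.zeros(n_pkts, dtype=bool)
    for i in range(n_pkts):
        lost[i] = bad
        bad = (rng.random() >= p_bg) if bad else (rng.random() < p_gb)
    return lost

# Path loss rate simulated via transmission of 10K pkts, as on the slide.
drops = gilbert_drops(10_000)
loss_rate = drops.mean()   # near the stationary bad-state prob. p_gb/(p_gb+p_bg)
```

With these illustrative parameters the long-run loss rate is about 0.05/0.55 ≈ 9%, but the drops arrive in runs rather than independently as in the Bernoulli model.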
35. Evaluation
- Extensive simulations
- Experiments on PlanetLab
  - 51 hosts, each from a different organization
  - 51 × 50 = 2,550 paths
  - On average k = 872
- Results highlights
  - Avg real loss rate 0.023
  - Absolute error: mean 0.0027, 90% < 0.014
  - Relative error: mean 1.1, 90% < 2.0
  - On average 248 out of 2,550 paths have no or incomplete routing information
  - No router aliases resolved

Hosts by area and domain:
  US (40):             .edu 33, .org 3, .net 2, .gov 1, .us 1
  International (11):  Europe (6): France 1, Sweden 1, Denmark 1, Germany 1, UK 2
                       Asia (2): Taiwan 1, Hong Kong 1
                       Canada 2, Australia 1
36. Sensitivity Test of Sending Frequency
- Big jump in the number of lossy paths when the sending rate exceeds 12.8 Mbps
37. PlanetLab Experiment Results
- Loss rate distribution
  - On average k = 872 out of 2,550
- Metrics
  - Absolute error |p̂ − p|
    - Average 0.0027 for all paths, 0.0058 for lossy paths
  - Relative error [BDPT02]
  - Lossy path inference: coverage and false positive ratio

Loss-rate distribution of paths (%):
  [0, 0.05): 95.9  |  lossy paths [0.05, 1.0]: 4.1 in total, broken down as
  [0.05, 0.1): 15.2   [0.1, 0.3): 31.0   [0.3, 0.5): 23.9   [0.5, 1.0): 4.3   1.0: 25.6
38. Accuracy Results for One Experiment
- 95% of absolute errors < 0.0014
- 95% of relative errors < 2.1
39. Accuracy Results for All Experiments
- For each experiment, get its 95th-percentile absolute and relative errors
- Most have absolute error < 0.0135 and relative error < 2.0
40. Lossy Path Inference Accuracy
- 90 out of 100 runs have coverage over 85% and false positives less than 10%
- Many misses are caused by boundary effects at the 5% lossy-path threshold
41. Performance Improvement with Overlay
- With single-node relay
- Loss rate improvement
  - Among 10,980 lossy paths:
    - 5,705 paths (52.0%) have loss rate reduced by 0.05 or more
    - 3,084 paths (28.1%) change from lossy to non-lossy
- Throughput improvement
  - Estimated with a TCP throughput formula
  - 60,320 paths (24%) with non-zero loss rate have computable throughput
  - Among them, 32,939 paths (54.6%) have throughput improved; 13,734 paths (22.8%) have throughput doubled or more
- Implication: use overlay paths to bypass congestion or failures
42. Adaptive Overlay Streaming Media
[Figure: streaming demo topology with Stanford, UC San Diego, UC Berkeley, and HP Labs; the failed direct link is marked with an X]
- Implemented with a Winamp client and a SHOUTcast server
- Congestion introduced with a Packet Shaper
- Skip-free playback: server buffering and rewinding
- Total adaptation time < 4 seconds
43. Adaptive Streaming Media Architecture
44. Conclusions
- A tomography-based overlay network monitoring system
  - Given n end hosts, characterize O(n²) paths with a basis set of O(n log n) paths
  - Selectively monitor the O(n log n) basis paths to compute their loss rates, then infer the loss rates of all other paths
- Both simulation and real Internet experiments are promising
- Built an adaptive overlay streaming media system on top of the monitoring services
  - Bypass congestion/failures for smooth playback within seconds