Title: Correlations in E2E Network Metrics: Impact on Large Scale Network Monitoring
1. Correlations in E2E Network Metrics: Impact on Large Scale Network Monitoring
Praveen Yalagandula, Sung-Ju Lee, Puneet Sharma, Sujata Banerjee
- HP Labs, Palo Alto
- http://networking.hpl.hp.com
2. Motivation
- Large scale E2E network monitoring
- Application management, flow control, fault diagnosis, etc.
- A key question: What granularity should we measure?
- Coarse-grained: lower cost but higher inaccuracy
- Fine-grained: lower inaccuracy but higher cost
- Observation: Heterogeneity in measurement costs
- PING < TRACEROUTE < PATHRATE
- Our investigation
- Are different E2E network metrics correlated?
- Can we leverage such dependencies (if any) to lower monitoring cost while maintaining high accuracy?
3. Our Approach
- We consider two correlations in the current work
- Changes in Hop and Latency → Changes in Route
- Changes in Route → Changes in Capacity
- We use data from the S3 deployment on PlanetLab
- 2 years of data
- E2E measurements: Traceroute and Pathrate (capacity)
- On thousands of paths
- Perform Cost vs. Accuracy analysis for two cases
- Base: Only higher cost measurements are performed
- Strategy
- Perform lower cost measurements
- If a change is detected, perform higher cost measurements
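The strategy above can be sketched as one step of a control loop. This is a minimal illustration under stated assumptions, not the S3 implementation: `monitor_once`, `cheap_probe`, `expensive_probe`, and `changed` are hypothetical names, and the probes are passed in as plain callables.

```python
from typing import Callable, Optional, Tuple, TypeVar

C = TypeVar("C")  # result type of the cheap probe (e.g. hops and latency)
E = TypeVar("E")  # result type of the expensive probe (e.g. the full route)


def monitor_once(
    cheap_probe: Callable[[], C],
    expensive_probe: Callable[[], E],
    changed: Callable[[C, C], bool],
    last_cheap: Optional[C],
    last_expensive: Optional[E],
) -> Tuple[C, E, bool]:
    """Run one monitoring step; return (cheap, expensive, ran_expensive)."""
    cheap = cheap_probe()
    # Always run the expensive probe the first time, then only on a change.
    need_expensive = (
        last_cheap is None
        or last_expensive is None
        or changed(last_cheap, cheap)
    )
    if need_expensive:
        return cheap, expensive_probe(), True
    # No change seen by the cheap probe: reuse the last expensive result.
    return cheap, last_expensive, False
```

The caller feeds the returned values back in as `last_cheap` and `last_expensive` on the next step, so the expensive probe runs only when the cheap one signals a change.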
4. State-of-the-art
- Correlations assumed by previous systems
- GNP, Vivaldi, and other coordinate-based systems
- Correlation in latencies across paths
- NetQuest
- Correlation between hop changes and route changes
- CoDeeN
- Correlation between route changes and capacity
- Our work
- Quantify the correlation
- Perform accuracy vs. cost tradeoff analysis
5. Outline
- Motivation: Quantify and leverage metric correlations
- S3: Scalable Sensing Service
- Deployment on PlanetLab
- Correlations
- Changes in Hop and Latency → Changes in Route
- Changes in Route → Changes in Capacity
- Cost-Accuracy Tradeoff Analysis
- Summary and Future work
6. S3 Architecture
- Sensor pods
- Collection of sensors
- Measure system state from a node's view
- Backplane
- Programmable fabric
- Connects pods and aggregates measured system state
- Inference Engines
- Infer O(n^2) E2E path information by measuring a few paths
- Schedule measurements on pods
- Aggregate data on backplane
- Applications
7. Sensor Pod
[Figure: sensor pod block diagram. Labels: Configuration Data, SNMP Agent, Repository, Load, Memory, Secure Web Interface, Capacity, API (query, control, and notification), Lossrate, Controller, Bandwidth, Latency]
8. S3 Deployment on PlanetLab
- Running since January 2006
- All-pair network metrics
- Latency: Inferred by Netvigator
- Lossrate: Measured using Tulip lossrate tool
- Available Bandwidth: Measured using Spruce and PathChirp
- Capacity: Measured using Pathrate
- Stats: 14 GB raw data every day, 1 GB compressed
9. Two correlations quantified
- Changes in hop and latency → changes in route (HL → R)?
- PING can be used to measure both hops and latency
- Original TTL - Remaining TTL = Number of hops
- A change in the number of hops always means a change in the route
- But does a change in the route imply a change in the number of hops?
- Obviously NO, but how often, and how does it affect monitoring accuracy?
- Changes in route → changes in capacity (R → C)?
- Capacity can change when the route is not changed
- CAP limits
- Especially in PlanetLab
- Becoming common in other networks, e.g., cable networks
- Same route, but link upgraded or link-level change not visible in IP route
- Question
- How often does this happen, and how does it affect monitoring accuracy?
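The hop-count rule on this slide needs the original TTL, which a PING reply does not carry. A common heuristic, assumed here and not stated in the slides, is to guess the smallest widely used initial TTL (64, 128, or 255) that is not below the observed value. `COMMON_INITIAL_TTLS` and `hops_from_ttl` are illustrative names. Note that the TTL in a reply is decremented on the return path, so the estimate describes that direction.

```python
COMMON_INITIAL_TTLS = (64, 128, 255)  # assumed common defaults, not from the slides


def hops_from_ttl(remaining_ttl: int) -> int:
    """Estimate hops as (guessed original TTL) - (remaining TTL)."""
    if not 0 < remaining_ttl <= 255:
        raise ValueError("TTL must be in 1..255")
    # Guess the sender's initial TTL: the smallest common value that is
    # at least as large as what we observed.
    original = min(t for t in COMMON_INITIAL_TTLS if t >= remaining_ttl)
    return original - remaining_ttl
```

For example, a reply arriving with TTL 52 is taken to have started at 64, giving an estimate of 12 hops.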
10. S3 Dataset
- HL → R
- Use Traceroute measurements
- Performed at each node to 20 landmark nodes
- Landmark nodes (20) chosen across the globe
- Performed once every 30 minutes
- R → C
- Use Traceroute and Pathrate measurements
- Each node performs Pathrate to all other nodes
- In a round-robin fashion
- Takes about a day (avg.) to complete a round of measurements
- We use Pathrate measurements iff 0 < COV < 1
11. Defining metric changes
- Route changes (R)
- R = 1: If the current route does not match the previous sample
- Else R = 0
- Sometimes routers do not respond in the Traceroute output
- We ignore those hops during the above route change detection
- Latency changes (L)
- L = 1: If the current latency is p% or more different than the previous sample
- Else L = 0
- We use p = 5% for this analysis
- Hop changes (H)
- H = 1: If the current number of hops does not match the previous sample
- H = 0 otherwise
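The three change definitions on this slide can be sketched as small predicates over consecutive samples. This is an illustration under assumptions: a route is a list of hop addresses with `None` for a router that did not respond, "ignoring" a hop is read as skipping any position that is unanswered in either sample, and all function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Route = Sequence[Optional[str]]  # None marks a hop that did not respond
Sample = Tuple[Route, float]     # (route, latency in ms)


def route_changed(prev: Route, cur: Route) -> int:
    """R: 1 if the routes differ on hops that responded in both samples."""
    if len(prev) != len(cur):
        return 1  # a different hop count is always a different route
    return int(any(a is not None and b is not None and a != b
                   for a, b in zip(prev, cur)))


def latency_changed(prev_ms: float, cur_ms: float, p: float = 5.0) -> int:
    """L: 1 if latency differs from the previous sample by p percent or more."""
    return int(abs(cur_ms - prev_ms) >= (p / 100.0) * prev_ms)


def hops_changed(prev_hops: int, cur_hops: int) -> int:
    """H: 1 if the hop count differs from the previous sample."""
    return int(prev_hops != cur_hops)


def change_flags(prev: Sample, cur: Sample, p: float = 5.0) -> Tuple[int, int, int]:
    """Return (H, L, R) for two consecutive samples of one path."""
    (prev_route, prev_ms), (cur_route, cur_ms) = prev, cur
    return (hops_changed(len(prev_route), len(cur_route)),
            latency_changed(prev_ms, cur_ms, p),
            route_changed(prev_route, cur_route))
```

With these definitions, a sample that only gains an unresponsive hop at an existing position yields R = 0, while a swapped router at the same hop count yields R = 1 with H = 0.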
12. Case counts
- Averaged across all paths
- H: Change in hops; L: Change in latency; R: Change in route
13. Case counts
Measurements where the route changed but neither hops nor latency changed → If we use changes in hops and/or latency to detect route changes, we will miss these
- Averaged across all paths
- H: Change in hops; L: Change in latency; R: Change in route
14. Case counts
Overall, these two numbers are small → changes in hop and latency can be a good indicator of changes in route
- Averaged across all paths
- H: Change in hops; L: Change in latency; R: Change in route
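Case counting as described on these slides can be sketched as follows: for each path, tally how often each (H, L, R) combination occurs, convert the tallies to fractions, then average the fractions across paths. The function name and the input layout (a mapping from a path label to its list of flag triples) are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Mapping, Tuple

Flags = Tuple[int, int, int]  # (H, L, R) for one pair of consecutive samples


def average_case_fractions(
    per_path: Mapping[str, Iterable[Flags]],
) -> Dict[Flags, float]:
    """Per path, compute the fraction of samples in each (H, L, R) case,
    then average those fractions across paths."""
    cases = list(product((0, 1), repeat=3))
    sums = {case: 0.0 for case in cases}
    n_paths = 0
    for flags in per_path.values():
        counts = Counter(flags)
        total = sum(counts.values())
        if total == 0:
            continue  # skip paths with no samples
        n_paths += 1
        for case in cases:
            sums[case] += counts[case] / total
    return {case: (sums[case] / n_paths if n_paths else 0.0) for case in cases}
```

The entry for `(0, 0, 1)` is the share of samples where the route changed but neither hops nor latency did, which is the case a PING-only trigger would miss.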
15. Cost-Accuracy Tradeoff
- What if we perform only PING and then perform Traceroute only when a hop or latency change is observed?
- Reduces cost: PING is relatively inexpensive
- Increases inaccuracy: Might miss some route changes
- Base method: Traceroutes every T seconds
- Strategy
- Perform Traceroutes every s·T seconds
- We refer to s as the sampling factor
- Perform PING every t seconds when a Traceroute is not performed
- Further, perform a Traceroute if a change in hop/latency is observed
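The tradeoff on this slide can be explored with a small trace-driven simulation. This is a sketch under several assumptions that are not in the slides: time is divided into slots of length T, PING runs once per slot (that is, t = T), a route change counts as caught only if a Traceroute runs in the slot where it occurs, and the relative costs of PING and Traceroute are placeholder values. `simulate` is a hypothetical name.

```python
from typing import Sequence, Tuple


def simulate(
    hl_changed: Sequence[bool],    # per slot: hop or latency change seen by PING
    r_changed: Sequence[bool],     # per slot: route change in the trace
    s: int,                        # sampling factor
    ping_cost: float = 1.0,        # assumed relative cost
    traceroute_cost: float = 10.0, # assumed relative cost
) -> Tuple[float, float]:
    """Return (cost relative to the base method, fraction of route changes caught)."""
    if s < 1 or len(hl_changed) != len(r_changed):
        raise ValueError("need s >= 1 and equal-length inputs")
    n = len(r_changed)
    cost = 0.0
    caught = 0
    for i in range(n):
        if i % s == 0:
            run_traceroute = True          # scheduled Traceroute every s slots
        else:
            cost += ping_cost              # PING in every other slot
            run_traceroute = hl_changed[i] # triggered Traceroute on a change
        if run_traceroute:
            cost += traceroute_cost
            caught += int(r_changed[i])
    base_cost = n * traceroute_cost        # base: Traceroute in every slot
    total_changes = sum(int(x) for x in r_changed)
    accuracy = caught / total_changes if total_changes else 1.0
    return (cost / base_cost if n else 0.0), accuracy
```

Sweeping `s` over a recorded trace gives one cost and accuracy pair per sampling factor; with `s = 1` the strategy reduces to the base method.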
16. Cost-Accuracy Tradeoff
17. Defining capacity changes for a path
- Pathrate gives an estimate of capacity (with some error)
- Link-Mapping based change detection
- Map the result of a Pathrate measurement to one of several known link types
- C = 1: If the current link type is different from the previous link type
- Percent-Change
- C = 1: If the current value is p% or more different from the previous value
- We use p = 10% for our analysis
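The two capacity change detectors on this slide can be sketched side by side. The table of link types and the nearest-on-a-log-scale mapping rule are assumptions for illustration, since the slides do not list the link types or the mapping rule that were used. All names are hypothetical.

```python
import math

# Assumed table of nominal link capacities in Mbps; the slides do not say
# which link types were used.
LINK_TYPES = {
    "T1": 1.544,
    "Ethernet": 10.0,
    "T3": 44.736,
    "FastEthernet": 100.0,
    "OC-3": 155.52,
    "GigE": 1000.0,
}


def map_to_link_type(capacity_mbps: float) -> str:
    """Map an estimate to the link type closest on a log scale (assumption)."""
    if capacity_mbps <= 0:
        raise ValueError("capacity must be positive")
    return min(LINK_TYPES,
               key=lambda name: abs(math.log(capacity_mbps / LINK_TYPES[name])))


def changed_by_link_type(prev_mbps: float, cur_mbps: float) -> int:
    """C = 1 if the mapped link type differs from the previous one."""
    return int(map_to_link_type(prev_mbps) != map_to_link_type(cur_mbps))


def changed_by_percent(prev_mbps: float, cur_mbps: float, p: float = 10.0) -> int:
    """C = 1 if the value differs from the previous one by p percent or more."""
    return int(abs(cur_mbps - prev_mbps) >= (p / 100.0) * prev_mbps)
```

The two detectors can disagree: an estimate moving from 97 Mbps to 85 Mbps is a change under Percent-Change with p = 10%, but both values map to the same link type under this table, so Link-Mapping reports no change.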
18. Case counts
- Averaged across all paths
- C: Change in capacity; R: Change in route
R and C take the same value in only 63% and 58% of cases → Modest positive correlation
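The comparison of R and C on this slide is a count of how often the two flags agree. A minimal sketch of that count follows. It also computes the phi coefficient, a standard correlation measure for two binary variables that is added here for illustration and is not mentioned in the slides. `agreement_and_phi` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple


def agreement_and_phi(r: Sequence[int], c: Sequence[int]) -> Tuple[float, float]:
    """Return (fraction of samples with R == C, phi coefficient of R and C)."""
    if len(r) != len(c) or not r:
        raise ValueError("need two non-empty sequences of equal length")
    # 2x2 contingency table of the two binary flags.
    n11 = sum(1 for a, b in zip(r, c) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(r, c) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(r, c) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(r, c) if a == 0 and b == 1)
    agreement = (n11 + n00) / len(r)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # phi is undefined when a row or column of the table is empty; report 0.0.
    phi = (n11 * n00 - n10 * n01) / denom if denom else 0.0
    return agreement, phi
```

Agreement alone can look high when changes are rare in both series, which is one reason to look at a correlation measure alongside it.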
19. Cost-Accuracy Tradeoff
20. Cost-Accuracy Tradeoff
21. Conclusions and Next Steps
- Methodology for correlation quantification
- Case counting
- Cost-Accuracy tradeoff analysis
- Hop and Latency changes → Route changes
- Route changes → Capacity changes
- Promising results in both cases
- Low cost measurements can be used to trigger high cost measurements
- Further steps
- Other correlations: Capacity and Available Bandwidth correlation
- Application level inaccuracy, aka impact on E2E apps
22. Ongoing work
- http://networking.hpl.hp.com/s-cube
- Email: s-cube_at_hpl.hp.com