Title: Untangling the Web from DNS
1. A case for resource discovery in shared distributed platforms
David Oppenheimer
UCB ROC Retreat, 12 January 2005
2. Introduction
- Application performance is a function of
  - resources available to the application
  - resources needed by the application
    - or, application sensitivity to resource constraints
- At summer retreat, described SWORD
  - at app deployment time, find best set of nodes given
    - resources available on a set of distributed nodes
    - application sensitivity to resource constraints
  - assumptions
    (1) available resources vary among nodes enough to matter
        - spare CPU, mem, disk space; inter-node latency, avail. bw ...
    (2) applications are sensitive to resource constraints enough to matter
- Focus of this talk: verify assumption (1)
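The placement problem described above can be illustrated with a minimal sketch. This is not SWORD's algorithm (the slides do not describe it); the `Node` fields, the hard memory requirement, and the linear penalty with its two weights are all assumptions chosen only to show how "resources available" and "application sensitivity" might be combined into a ranking.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_mem_mb: float
    load: float


def pick_nodes(nodes, k, min_free_mem_mb, load_weight=1.0, mem_weight=0.01):
    """Return the k lowest-penalty nodes that meet a hard memory requirement.

    The weights are hypothetical stand-ins for application sensitivity:
    a larger load_weight means the application suffers more from CPU contention.
    """
    eligible = [n for n in nodes if n.free_mem_mb >= min_free_mem_mb]
    # Penalty rises with load and falls with free memory.
    ranked = sorted(
        eligible,
        key=lambda n: load_weight * n.load - mem_weight * n.free_mem_mb,
    )
    return ranked[:k]


# Synthetic nodes, not values from the trace.
nodes = [
    Node("a", free_mem_mb=120.0, load=1.0),
    Node("b", free_mem_mb=10.0, load=0.5),   # fails the memory requirement
    Node("c", free_mem_mb=80.0, load=12.0),
    Node("d", free_mem_mb=60.0, load=2.0),
]
chosen = [n.name for n in pick_nodes(nodes, k=2, min_free_mem_mb=50.0)]
```

If all nodes looked alike (assumption (1) false), every ranking would be equally good and placement would not matter, which is why the talk checks the variation first.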
3. Introduction (cont.)
- Questions we will address
  - is there enough variation among nodes at any given (deployment) time to justify service placement?
  - is there enough variation over time on a single node to justify periodic task migration?
  - are there correlations between attributes on a single node, or among nodes at the same site?
- All of these questions are important in designing a system for resource discovery and service placement (like SWORD)
4. Outline
- How much does the available amount of per-node resources vary among nodes at a fixed time?
- How much does the available amount of per-node resources vary over time? How much do inter-node latency and available bandwidth vary over time?
- On a given node, are any per-node attributes strongly correlated? Are inter-node latency and available bandwidth correlated?
5. Experimental environment
- Per-node attributes: Ganglia, CoMon
  - two-week period (Oct 10-Oct 24, 2004)
  - each node polled every 5 minutes
  - free memory, free swap, free disk, load average, network bytes sent and received/sec, active slices
- Inter-node latency: all-pairs pings
  - one month period ending Oct 24, 2004
  - each pair of nodes measured every 15 minutes
- Inter-node bandwidth: Iperf
  - one month period ending Oct 24, 2004
  - each pair of nodes measured 1-2x/week
- About 250 nodes in the trace each day
6. Outline
- How much does the available amount of per-node resources vary among nodes at a fixed time?
- How much does the available amount of per-node resources vary over time? How much do inter-node latency and available bandwidth vary over time?
- On a given node, are any per-node attributes strongly correlated? Are inter-node latency and available bandwidth correlated?
7. Resource heterogeneity: averages
- How much do available resources vary over the trace?

attribute            mean     std. dev.   10th %ile   90th %ile
# of CPUs            1.0      0.0         1.0         1.0
CPU speed (MHz)      1942     572         1263        2652
Total disk (GB)      127      88.5        35.1        232
Total memory (MB)    1153     467         628         2017
Total swap (GB)      1.0      0.0         1.0         1.0
8. Resource heterogeneity: averages
- How much do available resources vary over the trace?

attribute            mean      std. dev.   10th %ile   90th %ile
1 min load average   6.81      20.06       1.05        11.86
Free memory (MB)     62.359    125.234     13.668      105.432
Free swap (MB)       755.596   178.795     524.336     1000.268
Free disk (GB)       102.8     86.04       8.088       208.3
Active slices        13.3      5.96        0.0         20.0
Bytes/s in           50477     117023      5568        92877
Bytes/s out          52543     130112      5476        96214
9. Resource heterogeneity: CV vs. time
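The plot for this slide did not survive in the transcript. As a rough sketch of what "CV vs. time" means, the coefficient of variation (standard deviation divided by mean) can be computed across nodes at each polling instant. The input layout (a dict from timestamp to one value per node) and the use of the population standard deviation are assumptions, not details taken from the talk.

```python
from statistics import mean, pstdev


def cv_across_nodes(samples):
    """Map each timestamp to the coefficient of variation across nodes.

    samples: {timestamp: [one attribute value per node]}
    """
    out = {}
    for t, values in samples.items():
        m = mean(values)
        # CV is undefined when the mean is zero.
        out[t] = pstdev(values) / m if m != 0 else float("nan")
    return out


# Synthetic load averages for four nodes at two polls, 300 s apart.
load = {
    0: [1.0, 1.0, 1.0, 1.0],    # identical nodes: no heterogeneity
    300: [1.0, 3.0, 1.0, 3.0],  # mean 2, std. dev. 1
}
cv = cv_across_nodes(load)
```

A CV that stays well above zero over the whole trace indicates that nodes differ from one another at any given deployment time.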
10. Outline
- How much does the available amount of per-node resources vary among nodes at a fixed time?
- How much does the available amount of per-node resources vary over time? How much do inter-node latency and available bandwidth vary over time?
- On a given node, are any per-node attributes strongly correlated? Are inter-node latency and available bandwidth correlated?
11. Variability of per-node attributes over time
12. Variability of per-node attributes over time
13. Variability of per-node attributes over time
14. Variability of per-node attributes over time
- Can rank degree of variability of each attribute
  - disk, swap < mem, load < net bytes, slices: mod. to sig.
- CDF curve shifts to right as interval length incrs.
  - attributes vary less over short time periods than long
  - migration interval: find sweet spot in curve of variability vs. interval length
- CDF slope decreases as median var. of attr. incr.
  - may be able to classify nodes as high/low var. over time for mem, load, net bytes (they have high median var.)
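The slides do not state which variability metric the CDFs plot. The following sketch assumes one plausible choice: for each node, the median coefficient of variation over consecutive non-overlapping windows of a given length, with the per-node values then sorted into an empirical CDF. The function names and the synthetic traces are illustrative only.

```python
from statistics import mean, median, pstdev


def window_cv(series, width):
    """Median CV over consecutive non-overlapping windows of `width` samples."""
    cvs = []
    for start in range(0, len(series) - width + 1, width):
        chunk = series[start:start + width]
        m = mean(chunk)
        if m != 0:  # skip windows where CV is undefined
            cvs.append(pstdev(chunk) / m)
    return median(cvs) if cvs else float("nan")


def empirical_cdf(values):
    """Return sorted (value, fraction of nodes <= value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


# Synthetic per-node traces, one sample per polling interval.
traces = {
    "steady": [10.0] * 8,
    "drifting": [1.0, 1.0, 2.0, 2.0, 4.0, 4.0, 8.0, 8.0],
}
short = {name: window_cv(s, 2) for name, s in traces.items()}
long_ = {name: window_cv(s, 8) for name, s in traces.items()}
cdf_long = empirical_cdf(long_.values())
```

In this toy data the drifting node shows no variation within two-sample windows but clear variation over the full eight samples, which mirrors the observation that the CDF shifts right as the interval grows.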
15. Inter-node latency and BW variation over time
- Most nodes have low latency (and bw) variability even over a month-long trace
  - migration may not be worthwhile
16. Outline
- How much does the available amount of per-node resources vary among nodes at a fixed time?
- How much does the available amount of per-node resources vary over time? How much do inter-node latency and available bandwidth vary over time?
- On a given node, are any per-node attributes strongly correlated? Are inter-node latency and available bandwidth correlated?
17. Correlation among per-node attributes

r           loadone   memfree   swapfree   diskfree   actvslice   byte_in   byte_out
loadone      .080
memfree     -.050      .627
swapfree    -.231      .274      .473
diskfree    -.035      .192      .212       .929
actvslice    .079     -.050     -.219       .049       .773
byte_in      .059     -.033     -.074       .059       .140        .209
byte_out     .058     -.033     -.059       .078       .137        .443      .188

- No strong correlations between different attrs.
  - though some one-hour trace segments had some
- Some correlation between nodes at same site
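For reference, a correlation coefficient r between two attributes can be computed as the Pearson coefficient over paired samples. The sketch below is a plain implementation of that standard formula; whether the talk pooled samples across nodes or averaged per-node coefficients is not stated, so the sample data here is synthetic and purely illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Synthetic paired samples from one hypothetical node.
byte_in = [100.0, 200.0, 300.0, 400.0]
byte_out = [210.0, 390.0, 620.0, 800.0]   # rises with byte_in
memfree = [90.0, 70.0, 50.0, 30.0]        # falls as byte_in rises

r_in_out = pearson_r(byte_in, byte_out)
r_in_mem = pearson_r(byte_in, memfree)
```

Values near +1 or -1 indicate a strong linear relationship; the off-diagonal entries in the table are mostly far from either extreme.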
18. Correlation between latency and avail BW
r = -.59
- Moderate inverse power law correlation
- Using latency to estimate BW gives 233% error
  - some nodes are bandwidth-capped, some in weird ways
- Some node pairs showed strong lat-BW correlation
  - 17% within 25%, 56% within 50%
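One way to use an inverse power-law relationship is to fit bw = c * latency^b by least squares in log-log space and then measure how often the estimate lands within a relative tolerance of the measured bandwidth. The slides do not say how their fit or error figures were computed, so this is only a sketch under that assumption, with synthetic data and hypothetical function names.

```python
from math import exp, log


def fit_power_law(lat, bw):
    """Least-squares fit of log(bw) = log(c) + b * log(lat); returns (c, b)."""
    xs = [log(v) for v in lat]
    ys = [log(v) for v in bw]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    c = exp(my - b * mx)
    return c, b


def fraction_within(lat, bw, c, b, tol):
    """Share of pairs whose estimated bw is within `tol` relative error."""
    errs = [abs(c * l ** b - w) / w for l, w in zip(lat, bw)]
    return sum(e <= tol for e in errs) / len(errs)


# Synthetic node pairs following bw = 800 / latency exactly.
lat_ms = [10.0, 20.0, 40.0, 80.0]
bw_mbps = [80.0, 40.0, 20.0, 10.0]
c, b = fit_power_law(lat_ms, bw_mbps)
share = fraction_within(lat_ms, bw_mbps, c, b, tol=0.25)
```

On real measurements, bandwidth caps break the power-law shape for the affected pairs, which is consistent with the large average error reported on the slide.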
19. Conclusion
- How much does the available amount of per-node resources vary among nodes at a fixed time?
  - significantly; enough to warrant svc. placement
- How much does the available amount of per-node resources vary over time? How much do inter-node latency and available bandwidth vary over time?
  - moderate variability; may warrant migration
- On a given node, are any per-node attributes strongly correlated? Are inter-node latency and available bandwidth correlated?
  - no strong correlation between diff. attrs.
  - some correlation between same attr, same site
  - latency can predict avail. bandwidth
20. Future work
- Ask same questions but use application model to answer, rather than analysis of raw data
  - different apps have different resource sensitivities
  - different apps have different migration costs
- Can we predict attribute values?
  - give warning before migration
  - or just don't bother to deploy on bad nodes
- How much better could we do if SWORD could schedule jobs?