Title: Directing a Datacenter: Predicting Resource Utilization from Workload
1 Directing a Datacenter: Predicting Resource Utilization from Workload
- Peter Bodík
- RAD Lab, UC Berkeley
2 Overview slide
3 Introduction
- resource allocation: main task of the director
- satisfy SLA requirements, but don't waste power/resources
- minimize cost
- need to model/predict resource utilization
- depends on workload, changes with the application
- modeling resource utilization
- input: workload
- output: CPU utilization, disk, network, memory activity
- this talk: 3 models
- linear regression
- quantile regression
- sampling method
4 The datacenter cost function
[Diagram: input (workload, proposed physical config); intermediate quantities (per-tier resource utilization, server resource utilization, performance, availability, server power consumption, server temperature, HVAC power consumption); output (cost). Slide label: Jeff Chen.]
5 Experimental setup
- Rubis: PHP application, like eBay
- 2 tiers: Apache/PHP and MySQL, running on VMware ESX
- just one web server
- 27 types of requests, but only using 10
- 3 HTML files, 7 PHP (dynamic) requests
- no writes, all reads cached in memory
- measure workload and metrics every 20 seconds
- workload model
- vector of request rates for 10 classes of requests
- workload generation
- not using the Rubis workload generator
- use exponential interarrival times
- we use interarrivals from World Cup 98 and Ebates.com traces
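As a rough illustration of the workload generation step, the sketch below draws exponential interarrival gaps for one request class and turns the arrivals into request rates per 20-second measurement window. The function names, the rate of 10 requests/s, and the 200 s duration are assumptions made for this example; the trace-driven variant (World Cup 98, Ebates.com) is not shown.

```python
import random

def arrival_times(rate_per_sec, duration_sec, seed=0):
    """Arrival times with exponential interarrival gaps (a Poisson process)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_sec)  # gap with mean 1 / rate_per_sec
        if t >= duration_sec:
            return times
        times.append(t)

def request_rates(times, duration_sec, window_sec=20):
    """Requests per second observed in each measurement window."""
    n_windows = int(duration_sec // window_sec)
    counts = [0] * n_windows
    for t in times:
        idx = int(t // window_sec)
        if idx < n_windows:
            counts[idx] += 1
    return [c / window_sec for c in counts]

# Hypothetical numbers: one request class at 10 requests/s for 200 s.
times = arrival_times(rate_per_sec=10.0, duration_sec=200.0)
rates = request_rates(times, duration_sec=200.0)
```

Repeating this for each of the 10 request classes would give one 10-element rate vector per window, which is the workload representation described above.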
6 Model 1: linear regression
- assumption: utilization is a linear function of workload
- 1 request -> 1M CPU cycles, 10 requests -> 10M cycles
- same for disk, network, memory
- reasonable when requests are independent
- web server, Ruby on Rails, MySQL
- linear regression
- train and evaluate on same data: 2-4% error (CPU and net)
- extrapolation: two experiments
- increase workload magnitude
- different workload
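A minimal sketch of the linear model, assuming utilization is a weighted sum of per-class request rates with no intercept. The data is synthetic: the two request classes, their per-request costs, and the noise level are made up for illustration and are not the measurements from the talk. The fit solves the normal equations directly so that the example needs only the standard library.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_linear(X, y):
    """Least squares: solve the normal equations (X^T X) beta = X^T y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(beta, w):
    """Predicted utilization for a workload vector w of request rates."""
    return sum(b * wi for b, wi in zip(beta, w))

# Synthetic training data (assumed): two request classes, made-up costs.
rng = random.Random(0)
true_cost = [1.0, 3.0]  # utilization units per request/s, hypothetical
X = [[rng.uniform(0, 50), rng.uniform(0, 20)] for _ in range(200)]
y = [predict(true_cost, w) + rng.gauss(0, 1.0) for w in X]

beta = fit_linear(X, y)
# Extrapolation: predict at twice the largest training workload.
cpu_at_double = predict(beta, [100.0, 40.0])
```

The last line mirrors the "increase workload magnitude" experiment: the model is asked about a workload outside the range it was trained on, which is where a linear assumption is most likely to show its error.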
7 Increasing workload magnitude
- used 10 different workloads
- web CPU: 3-10% error
- larger increase in magnitude -> larger error
- web network, DB net/CPU: 3-4% error
[Figure: training and evaluation workloads; axis: request rates]
8 Changing workload
- train on one workload, evaluate on a different workload
- results: 6-8% error
- web server and DB, net and CPU
[Figure: training and evaluation workloads]
9 What else do we need?
- what's the variance of the resource utilization?
- mean not really useful
- what's the error of the predictions?
- can compute, but only by assuming a Normal distribution
- how much data do we need?
- what's the effect of workload variation?
10 Model 2: quantile regression
- estimate 95th percentile, not mean
- formulate as a linear program
- assumption: P( cpu | w ) same for different workloads
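The slides do not spell out the linear program. One standard way to write quantile regression as an LP is given below, as a sketch rather than the exact formulation used in the talk. The symbols are assumptions introduced here: w_i is the workload vector of the i-th measurement, y_i the measured CPU utilization, beta the per-class coefficients, and u_i, v_i the positive and negative parts of the residual.

```latex
\begin{aligned}
\min_{\beta,\,u,\,v}\quad & \sum_{i=1}^{n} \bigl(\tau\,u_i + (1-\tau)\,v_i\bigr), \qquad \tau = 0.95 \\
\text{s.t.}\quad & y_i - w_i^{\top}\beta = u_i - v_i, \quad i = 1,\dots,n \\
& u_i \ge 0,\quad v_i \ge 0
\end{aligned}
```

With tau = 0.95, under-predictions (u_i > 0) cost 19 times more than over-predictions, which pushes the fitted line up toward the 95th percentile of utilization instead of its mean.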
11 Estimating error -- bootstrap
- resample the data, fit QR, repeat 100x
- estimate error from bootstrap samples
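The resampling loop can be sketched as follows. To keep the example self-contained, the estimator is a stand-in: it takes the empirical 95th percentile of CPU measurements at a single fixed workload instead of fitting a quantile regression. The measurements are synthetic, and the helper names are assumptions made for this example.

```python
import math
import random
import statistics

def percentile(values, q):
    """Empirical q-quantile (0 < q <= 1), nearest-rank method."""
    s = sorted(values)
    return s[max(1, math.ceil(q * len(s))) - 1]

def bootstrap(data, estimator, n_boot=100, seed=0):
    """Re-run `estimator` on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    # Mean of the bootstrap estimates and their spread (the error estimate).
    return statistics.mean(estimates), statistics.stdev(estimates)

# Synthetic CPU measurements at one fixed workload (made-up numbers).
rng = random.Random(1)
cpu = [rng.gauss(300.0, 25.0) for _ in range(500)]

# Stand-in estimator: empirical 95th percentile, not a quantile regression fit.
est, err = bootstrap(cpu, lambda d: percentile(d, 0.95))
```

Swapping the lambda for a function that fits a quantile regression on the resample and evaluates it at the workload of interest would give the procedure named on the slide.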
12 Estimating error (2)
13 Just 40 points -- larger error
14 How much data do we need?
- use 95th percentile + 2 stdev as prediction
- model gets more accurate with more data
15 Model 3: sampling
- previous models estimated CPU at a fixed workload
- however, workload fluctuates
- need to estimate
- workload distribution
- P( cpu | w )
- generate samples of CPU
- sample workload w
- sample cpu from P( cpu | w )
- repeat
- compute sample 95th percentile
[Figure: CPU utilization vs. request rate]
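The sampling loop above can be sketched as follows. Both distributions are hypothetical stand-ins: the workload distribution is a list of observed request rates sampled uniformly at random, and P( cpu | w ) is modeled as a linear mean plus a residual drawn from a pool of residuals. The coefficient 2.0 and all data are made up for illustration.

```python
import math
import random

def percentile(values, q):
    """Empirical q-quantile (0 < q <= 1), nearest-rank method."""
    s = sorted(values)
    return s[max(1, math.ceil(q * len(s))) - 1]

def sample_cpu_percentile(workloads, cpu_given_w, n_samples=10000, q=0.95, seed=0):
    """Sample w, then cpu ~ P(cpu | w), repeat, and return the sample q-quantile."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        w = rng.choice(workloads)            # sample a workload
        samples.append(cpu_given_w(w, rng))  # sample CPU given that workload
    return percentile(samples, q)

# Hypothetical stand-ins for the two estimated distributions.
rng0 = random.Random(1)
observed_w = [rng0.uniform(50.0, 150.0) for _ in range(500)]  # request rates
residuals = [rng0.gauss(0.0, 5.0) for _ in range(500)]        # model residuals

def cpu_given_w(w, rng):
    """Assumed P(cpu | w): linear mean plus a residual drawn from the pool."""
    return 2.0 * w + rng.choice(residuals)

p95 = sample_cpu_percentile(observed_w, cpu_given_w)
```

Because the workload is drawn fresh on every iteration, the resulting percentile reflects workload fluctuation as well as the spread of CPU at a given workload, which the two regression models ignore.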
16 Estimating workload distribution
17 Estimating P( CPU | w )
18 Sampling workload and CPU
19 Results
- actual 95th percentile: 347.5 ± 4.2 (stdev)
- quantile regression: 339.8 ± 2.6
- workload and CPU sampling: 345.3 ± 3.9
- quantile regression already pretty close
- but sampling can accommodate any workload fluctuation
20 Comparison of algorithms
- linear regression
- assumptions: error from Normal distribution
- ignores workload fluctuation
- running time: fast
- quantile regression
- assumptions: arbitrary error distribution, same for different workloads
- ignores workload fluctuation
- running time: fast, but need bootstrap to estimate error
- workload and CPU sampling
- assumption: arbitrary error distribution, same for different workloads
- running time: slower, sampling + bootstrap to estimate error
21 Future work
- model lifecycle
- when to retrain the model?
- workload characterization
- real web app: 100s - 1000s of request classes
- what are the important request classes?
- modeling response time, non-linear resource utilization
22 Summary
- use algorithms that make few assumptions
- linear regression: simple to analyze, but not useful in practice
- similar to M/M/1 queue
- quantile regression, sampling
- harder to analyze/estimate
- but work for a broader class of web apps and workloads
- estimate the prediction error of the model
- model still useful even with large error