Title: The Running Time Advisor: A Resource Signal-based Approach to Predicting Task Running Time and Its Applications
1. The Running Time Advisor: A Resource Signal-based Approach to Predicting Task Running Time and Its Applications
- Peter A. Dinda
- Carnegie Mellon University
http://www.cs.cmu.edu/pdinda
2. High Level Goals
Build systems that use statistics to help distributed applications adapt to highly variable resource availability.
Focus on information:
- Application-level performance predictions
- Running time of compute-bound tasks
- Adaptation advice
- Host selection to meet soft real-time deadline
- Resource signal approach
- Host load signals
This Talk
3. Outline
- Bird's-eye view
- Adapting to highly variable resource availability
- Dv/QuakeViz
- Real-time scheduling advisor
- Running time advisor
- Confidence intervals
- Performance results (feasible, practical, useful)
- Prototype system
- Host load prediction
- Traces, structure, linear models, evaluation
- RPS Toolkit
- Conclusion
4. A Universal Challenge in High Performance Distributed Applications
- Highly variable resource availability
- Shared resources
- No reservations
- No globally respected priorities
- Competition from other users (background workload)
- Running time can vary drastically
- Adaptation
5. A Universal Problem
Which host should the application send the task
to so that its running time is appropriate?
[Figure: a task with known resource requirements and several candidate hosts. "What will the running time be if I..."]
6. DV Framework for Distributed Interactive Visualization
- Large datasets (e.g., earthquake simulations)
- Distributed VTK visualization pipelines
- Active frames
- Encapsulate data, computation, path through pipeline
- Launched from server by user interaction
- Annotated with deadline
- Dynamically choose on which host each pipeline stage will execute and what quality settings to use
http://www.cs.cmu.edu/dv
7. Example DV Pipeline for QuakeViz
[Figure, logical view: simulation output, reading, interpolation, isosurface extraction, scene synthesis, rendering, local display and user. Other labels: morphology reconstruction, interpolation, resolution, contours, ROI.]
[Figure, physical view: active frames n, n+1, and n+2, each with a deadline, passing through interpolation, isosurface extraction, and scene synthesis stages whose hosts are still undecided (?).]
8. Real-time Scheduling Advisor
- Distributed interactive applications
- Examples: CMU Dv/QuakeViz, BBN OpenMap
- Assumptions
- Sequential tasks initiated by user actions
- Aperiodic arrivals
- Resilient deadlines (soft real-time)
- Compute-bound tasks
- Known computational requirements
- Best-effort semantics
- Recommend host where deadline is likely to be met
- Predict running time on that host
- No guarantees
9. Running Time Advisor
- Application notifies advisor of task's computational requirements (nominal time)
- Advisor predicts running time on each host
- Application assigns task to most appropriate host
[Figure: a task with a nominal time and the predicted running time on each candidate host.]
10. Real-time Scheduling Advisor
- Application notifies advisor of task's computational requirements (nominal time) and its deadline
- Advisor acquires predicted task running times for all hosts
- Advisor recommends one of the hosts where the deadline can be met
[Figure: a task with a nominal time and a deadline; the predicted running time on each candidate host is compared against the deadline.]
11. Variability and Prediction
[Figure: a resource signal with high availability variability over time t is fed to a predictor, which yields a prediction, an error signal with low variability over time t, and a characterization of that variability (ACF).]
Exchange high resource availability variability for low prediction error variability and a characterization of that variability.
12. Confidence Intervals to Characterize Variability
Example: "3 to 5 seconds with 95% confidence"
- Application specifies confidence level (e.g., 95%)
- Running time advisor predicts running times as a confidence interval (CI)
- Real-time scheduling advisor chooses host where CI is less than deadline
- CI captures variability to the extent the application is interested in it
[Figure: a task with a nominal time and a deadline; the predicted running time on each host is shown as a 95% confidence interval and compared against the deadline.]
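The host selection rule on this slide (choose a host where the CI is less than the deadline) could be sketched roughly as follows. This is only an illustration: the `RunningTimeCI` type, the `recommend_host` function, the host names, and the tie-breaking rule (prefer the smallest upper bound) are assumptions for the sketch, not part of the advisor described in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningTimeCI:
    """Predicted running time of one task on one host (hypothetical type)."""
    host: str
    lower: float  # seconds
    upper: float  # seconds


def recommend_host(predictions: List[RunningTimeCI], deadline: float) -> Optional[str]:
    """Return a host whose whole CI lies below the deadline, or None.

    Assumption: among feasible hosts, prefer the smallest CI upper bound.
    The slides only say "one of the hosts where the deadline can be met".
    """
    feasible = [p for p in predictions if p.upper < deadline]
    if not feasible:
        return None  # best effort: no host is likely to meet the deadline
    return min(feasible, key=lambda p: p.upper).host


predictions = [
    RunningTimeCI("host-a", 3.0, 5.0),
    RunningTimeCI("host-b", 2.0, 9.0),
    RunningTimeCI("host-c", 4.0, 4.5),
]
print(recommend_host(predictions, deadline=6.0))  # host-c
print(recommend_host(predictions, deadline=1.0))  # None
```

Note that host-b has the lowest possible running time but is rejected because its interval extends past the deadline, which is the sense in which the CI captures variability.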
13. Confidence Intervals and Predictor Quality
[Figure: predicted running time CIs on each host compared against the deadline, in two cases. Bad predictor: no obvious choice. Good predictor: two good choices.]
Good predictors provide smaller CIs.
Smaller CIs simplify scheduling decisions.
14. Overview of Research Results
- Predicting CIs is feasible
- Host load prediction using AR(16) models
- Running time estimation using host load predictions
- Predicting CIs is practical
- RPS Toolkit (inc. in CMU Remos, BBN QuO)
- Extremely low-overhead online system
- Predicting CIs is useful
- Performance of real-time scheduling advisor
Measured performance of real system
Statistically rigorous analysis and evaluation
15. Experimental Setup
- Environment
- AlphaStation 255s, Digital Unix 4.0
- Workload: host load trace playback
- Prediction system on each host
- Tasks
- Nominal time: U(0.1, 10) seconds
- Interarrival time: U(5, 15) seconds
- Methodology
- Predict CIs / Host recommendations
- Run task and measure
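As an illustration of the task workload above, the following sketch draws nominal times from U(0.1, 10) seconds and interarrival times from U(5, 15) seconds. The function name `generate_tasks`, the seed, and the tuple layout are assumptions made for the example; the actual experimental harness is not shown in the slides.

```python
import random
from typing import List, Tuple


def generate_tasks(count: int, seed: int = 0) -> List[Tuple[float, float]]:
    """Return (arrival_time, nominal_time) pairs, both in seconds."""
    rng = random.Random(seed)  # fixed seed so the workload is reproducible
    arrival = 0.0
    tasks = []
    for _ in range(count):
        arrival += rng.uniform(5.0, 15.0)   # interarrival time ~ U(5, 15)
        nominal = rng.uniform(0.1, 10.0)    # nominal time ~ U(0.1, 10)
        tasks.append((arrival, nominal))
    return tasks


tasks = generate_tasks(5)
for arrival, nominal in tasks:
    print(f"arrive at {arrival:7.2f} s, nominal time {nominal:5.2f} s")
```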
16. Predicting CIs is Feasible
Near-perfect CIs on typical hosts
3000 randomized tasks
17. Predicting CIs is Practical - RPS System
<2% of CPU at appropriate rate
1-2 ms latency from measurement to prediction
2 KB/sec transfer rate
18. Predicting CIs is Useful - Real-time Scheduling Advisor
[Figure: results for 16000 tasks comparing three host selection strategies: Host With Lowest Load, Predicted CI < Deadline, Random Host.]
19. Predicting CIs is Useful - Real-time Scheduling Advisor
[Figure: results for 16000 tasks comparing three host selection strategies: Predicted CI < Deadline, Host With Lowest Load, Random Host.]
20. Outline
- Bird's-eye view
- Adapting to highly variable resource availability
- Dv/QuakeViz
- Real-time scheduling advisor
- Running time advisor
- Confidence intervals
- Performance results (feasible, practical, useful)
- Prototype system
- Host load prediction
- Traces, structure, linear models, evaluation
- RPS Toolkit
- Conclusion
21. Design Space
Can the gap between the resources and the application be spanned? Yes!
22. Resource Signals
- Characteristics
- Easily measured, time-varying scalar quantities
- Strongly correlated with resource availability
- Periodically sampled (discrete-time signal)
- Examples
- Host load (Digital Unix 5 second load average)
- Network flow bandwidth and latency
Leverage existing statistical signal analysis and
prediction techniques
23. RPS Toolkit
- Extensible toolkit for implementing resource signal prediction systems
- Easy buy-in for users
- C and sockets (no threads)
- Prebuilt prediction components
- Libraries (sensors, time series, communication)
- Users have bought in
- Incorporated in CMU Remos, BBN QuO
- Research users: Bruce Lowekamp, Nancy Miller, LeMonte Green
http://www.cs.cmu.edu/pdinda/RPS.html
24. Prototype System
RPS components can be composed in other ways
25. Research Results
- Host load on real hosts has exploitable structure
- Strong autocorrelation, self-similarity, epochal behavior
- Trace database and host load trace playback
- Host load is predictable using simple linear models
- Recommendation: AR(16) models or better for 1-30 sec predictions
- RPS Toolkit for low overhead systems (<2% of CPU)
- C, ported to 5 OSes, incorporated in CMU Remos, BBN QuO
- Running time CIs can be computed from load predictions
- Load discounting, error covariances
- Effective real-time scheduling advice can be based on CIs
- Know if deadline will be met before running task
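To make the step from load predictions to a running time concrete, here is a minimal point-estimate sketch. It rests on an assumption that is not stated in the slides: a compute-bound task on a host with load L receives roughly a 1/(1+L) share of the CPU. It deliberately omits the load discounting and error covariances mentioned above, so it produces a single estimate rather than a confidence interval. The function name and parameters are hypothetical.

```python
from typing import Sequence


def estimate_running_time(nominal_time: float,
                          predicted_loads: Sequence[float],
                          step: float = 1.0) -> float:
    """Estimate the wall-clock running time (seconds) of a compute-bound task.

    nominal_time: CPU seconds the task needs on an unloaded host.
    predicted_loads: predicted host load for successive intervals of
        length `step` seconds, starting now.
    Assumption: in an interval with load L the task progresses at rate 1/(1+L).
    """
    remaining = nominal_time
    elapsed = 0.0
    for load in predicted_loads:
        rate = 1.0 / (1.0 + load)
        work = rate * step  # CPU seconds of progress in this interval
        if work >= remaining:
            return elapsed + remaining / rate
        remaining -= work
        elapsed += step
    # Past the prediction horizon: assume the last predicted load persists.
    last = predicted_loads[-1] if predicted_loads else 0.0
    return elapsed + remaining * (1.0 + last)


print(estimate_running_time(2.0, [0.0, 0.0, 0.0]))  # 2.0 on an idle host
print(estimate_running_time(2.0, [1.0] * 10))       # 4.0 with one competitor
```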
26. Outline
- Bird's-eye view
- Adapting to highly variable resource availability
- Dv/QuakeViz
- Real-time scheduling advisor
- Running time advisor
- Confidence intervals
- Performance results (feasible, practical, useful)
- Prototype system
- Host load prediction
- Traces, structure, linear models, evaluation
- RPS Toolkit
- Conclusion
27. Questions
- What are the properties of host load?
- Is host load predictable?
- What predictive models are appropriate?
- Are host load predictions useful?
28. Overview of Answers
- Host load exhibits complex behavior
- Strong autocorrelation, self-similarity, epochal behavior
- Host load is predictable
- 1 to 30 second timeframe
- Simple linear models are sufficient
- Recommend AR(16) or better
- Predictions are useful
- Can compute effective CIs from them
29. Host Load Traces
- DEC Unix 5 second exponential average
- Full bandwidth captured (1 Hz sample rate)
- Long durations
30. If Host Load Was Random (White Noise)...
[Figure panels: time domain, autocorrelation, spectrogram, frequency domain.]
31. Host Load Has Exploitable Structure
[Figure panels: time domain, autocorrelation, spectrogram, frequency domain.]
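The contrast between white noise and a signal with exploitable structure can be illustrated with a sample autocorrelation function. The sketch below uses synthetic data only: the "smoothed" series is an exponentially averaged noise sequence standing in for a load-like signal, not a real host load trace, and the function name and smoothing constant are assumptions.

```python
import random
from typing import List


def autocorrelation(x: List[float], max_lag: int) -> List[float]:
    """Sample autocorrelation of x at lags 0..max_lag (x must not be constant)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[i] * dev[i + k] for i in range(n - k)) / c0
            for k in range(max_lag + 1)]


rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Exponential averaging of the noise introduces strong correlation over time.
alpha = 0.1
smoothed = [white[0]]
for v in white[1:]:
    smoothed.append((1.0 - alpha) * smoothed[-1] + alpha * v)

print("white noise ACF:", [round(r, 2) for r in autocorrelation(white, 3)])
print("smoothed ACF:   ", [round(r, 2) for r in autocorrelation(smoothed, 3)])
```

For white noise the autocorrelation is near zero at every nonzero lag, so past samples say little about future ones; the smoothed series stays strongly correlated across nearby lags, which is the kind of structure a linear predictor can exploit.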
32. Linear Time Series Models
Pole-zero / state-space models capture
autocorrelation parsimoniously
(2000 sample fits, largest models in study, 30
secs ahead)
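As a rough illustration of fitting one of these linear models, the sketch below estimates AR(p) coefficients from a sample with the Yule-Walker equations, solved by the Levinson-Durbin recursion, and makes a one step ahead prediction. It is a textbook method shown on synthetic data, not the RPS Toolkit's fitting code; the function names are hypothetical.

```python
import random
from typing import List, Tuple


def fit_ar(x: List[float], p: int) -> Tuple[float, List[float], float]:
    """Fit an AR(p) model; return (mean, coefficients, innovation variance).

    coefficients[j] multiplies the (mean-removed) sample at lag j + 1.
    x must not be constant, and len(x) must exceed p.
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    # Biased sample autocovariances r[0..p].
    r = [sum(dev[i] * dev[i + k] for i in range(n - k)) / n for k in range(p + 1)]
    phi: List[float] = []
    err = r[0]
    for k in range(1, p + 1):
        # Levinson-Durbin: reflection coefficient for order k.
        acc = r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))
        reflection = acc / err
        phi = [phi[j] - reflection * phi[k - 2 - j] for j in range(k - 1)]
        phi.append(reflection)
        err *= 1.0 - reflection * reflection
    return mean, phi, err


def predict_next(x: List[float], mean: float, phi: List[float]) -> float:
    """One step ahead prediction from the most recent len(phi) samples."""
    return mean + sum(c * (x[-1 - j] - mean) for j, c in enumerate(phi))


# Synthetic AR(1) series with coefficient 0.8 (not a real load trace).
rng = random.Random(1)
series = [0.0]
for _ in range(4999):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))

mean, phi, noise_var = fit_ar(series, p=2)
print("coefficients:", [round(c, 3) for c in phi])
print("next value prediction:", round(predict_next(series, mean, phi), 3))
```

The same `fit_ar` call with `p=16` would give an AR(16) model of the order recommended in the talk; the low order here just keeps the synthetic example easy to check.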
33. Evaluation Methodology
- Ran 190,000 randomly chosen testcases on the traces
- Evaluate models independently of prediction/evaluation framework
- No monitoring
- 30 testcases per trace, model class, parameter set
- Data-mine results
Offline and online systems implemented using RPS Toolkit
34. Testcases
- Models
- MEAN, LAST/BM(32)
- Randomly chosen model from AR(1..32), MA(1..8),
ARMA(1..8,1..8), ARIMA(1..8,1..2,1..8),
ARFIMA(1..8,d,1..8)
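35. Evaluating a Testcase
[Figure: evaluation dataflow. The measurements in the fit interval, $\langle z_{t-m}, \ldots, z_{t-2}, z_{t-1} \rangle$, and a model type are given to the Modeler, which produces a Model (one-time use). The resulting Load Predictor consumes the measurements in the test interval, $z_t, z_{t+1}, \ldots, z_{t+n-1}$ (production stream), and emits a prediction stream of 1 to w step ahead predictions ($z_{t,t+1}, z_{t,t+2}, \ldots, z_{t,t+w}$; $z_{t+1,t+2}, z_{t+1,t+3}, \ldots, z_{t+1,t+1+w}$; $z_{t+2,t+3}, z_{t+2,t+4}, \ldots, z_{t+2,t+2+w}$; ...) along with error estimates (characterization of variation). The Evaluator compares the prediction stream with the measurements and produces error metrics (measurement of variation).]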
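36. Measured Prediction Variance: Mean Squared Error
[Figure: the Load Predictor consumes the measurements $\ldots, z_{t+1}, z_t$ and emits 1 step ahead predictions ($z_{t,t+1}, z_{t+1,t+2}, z_{t+2,t+3}, \ldots$), 2 step ahead predictions ($z_{t,t+2}, z_{t+1,t+3}, z_{t+2,t+4}, \ldots$), ..., and w step ahead predictions ($z_{t,t+w}, z_{t+1,t+1+w}, z_{t+2,t+2+w}, \ldots$).]
Variance of z: $\sigma_z^2 = \operatorname{mean}_i\,(\mu - z_{t+i})^2$
1 step ahead mean squared error: $\sigma_{a,1}^2 = \operatorname{mean}_i\,(z_{t+i,t+i+1} - z_{t+i+1})^2$
2 step ahead mean squared error: $\sigma_{a,2}^2$
...
w step ahead mean squared error: $\sigma_{a,w}^2$
Good load predictor: $\sigma_{a,1}^2, \sigma_{a,2}^2, \ldots, \sigma_{a,w}^2 \ll \sigma_z^2$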
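The comparison between k step ahead mean squared error and the variance of the signal can be sketched as follows. The predictor used here is the LAST model (predict the most recent measurement), and the data are synthetic; the function names and the `predict(z, i, k)` calling convention are assumptions made for the example.

```python
import random
from typing import Callable, List


def variance(z: List[float]) -> float:
    """Mean of (mu - z_i)^2, i.e. the error of always predicting the mean."""
    mu = sum(z) / len(z)
    return sum((mu - v) ** 2 for v in z) / len(z)


def k_step_mse(z: List[float],
               predict: Callable[[List[float], int, int], float],
               k: int) -> float:
    """Mean squared error of k step ahead predictions over the series z.

    predict(z, i, k) must use only z[0..i] to predict z[i + k].
    """
    errors = [(predict(z, i, k) - z[i + k]) ** 2 for i in range(len(z) - k)]
    return sum(errors) / len(errors)


def last_predictor(z: List[float], i: int, k: int) -> float:
    """LAST model: predict that the value k steps ahead equals z[i]."""
    return z[i]


# Synthetic, strongly autocorrelated signal (not a real host load trace).
rng = random.Random(2)
z = [0.0]
for _ in range(4999):
    z.append(0.9 * z[-1] + rng.gauss(0.0, 1.0))

var_z = variance(z)
for k in (1, 2, 5):
    mse = k_step_mse(z, last_predictor, k)
    print(f"{k} step ahead MSE / variance = {mse / var_z:.2f}")
```

A ratio well below 1 means the predictor beats simply predicting the mean; on a signal like this the ratio grows with the prediction horizon, while on white noise the LAST model does worse than the mean.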
37. Unpaired Box Plot Comparisons
[Figure: box plots of mean squared error for Model A, Model B, and Model C, each marking the 2.5, 25, 50, 75, and 97.5 percentiles and the mean. Annotations: inconsistent low error, consistent high error, consistent low error.]
Good models achieve consistently low error
38. 1 Second Predictions, All Hosts
[Figure: box plots marking the 2.5, 25, 50, 75, and 97.5 percentiles and the mean.]
Predictive models clearly worthwhile
39. 30 Second Predictions, All Hosts
[Figure: box plots marking the 2.5, 25, 50, 75, and 97.5 percentiles and the mean.]
Predictive models clearly beneficial even at long prediction horizons
40. 30 Second Predictions, High Load, Dynamic Host
[Figure: box plots marking the 2.5, 25, 50, 75, and 97.5 percentiles and the mean.]
Predictive models clearly worthwhile.
Begin to see differentiation between models.
41. Outline
- Bird's-eye view
- Adapting to highly variable resource availability
- Dv/QuakeViz
- Real-time scheduling advisor
- Running time advisor
- Confidence intervals
- Performance results (feasible, practical, useful)
- Prototype system
- Host load prediction
- Traces, structure, linear models, evaluation
- RPS Toolkit
- Conclusion
42. Related Work
- Distributed interactive applications
- QuakeViz/Dv, Aeschlimann [PDPTA99]
- Quality of service
- QuO, Zinky, Bakken, Schantz [TPOS, April 97]
- QRAM, Rajkumar, et al. [RTSS97]
- Distributed soft real-time systems
- Lawrence, Jensen [assorted]
- Workload studies for load balancing
- Mutka, et al. [PerfEval 91]
- Harchol-Balter, et al. [SIGMETRICS 96]
- Resource signal measurement systems
- Remos [HPDC98]
- Network Weather Service [HPDC97, HPDC99]
- Host load prediction
- Wolski, et al. [HPDC99] (NWS)
- Samadani, et al. [PODC95]
- Hailperin [93]
- Application-level scheduling
- Berman, et al. [HPDC96]
43. Conclusions
- Help applications adapt to highly variable resource availability
- Resource signal prediction
- Predict running times as confidence intervals
- Predicting CIs is feasible
- Host load prediction using AR(16) models
- Running time estimation using host load predictions
- Predicting CIs is practical
- RPS Toolkit (inc. in CMU Remos, BBN QuO)
- Extremely low-overhead online system
- Predicting CIs is useful
- Performance of real-time scheduling advisor
44. Future Work
- New resource signals
- Network bandwidth and latency (Remos)
- New prediction approaches
- Wavelets, nonlinearity, cointegration
- Resource scheduler models
- Better Unix scheduler model
- Network models
- Adaptation advisors
- Applications and workloads
- DV/QuakeViz, GIMP, Instrumentation
45. Tools/Venues for Future Work
- Resource signal methodology
- RPS Toolkit
- Remos
- QuakeViz/DV
- Grid Forum
46. Future Work (Long Term)
- Experimental computer science research
- Application-oriented view
- Measurement studies and analysis
- Statistical approach
- Application services
- Systems building
systems X applications X statistics
47. Teaching
- Signals, systems, and statistics for computer scientists
- Performance data analysis
- Introduction to computer systems
48. Response of Typical AR(16)
49. Response of AR(1024)