Title: Designing Parallel Operating Systems using Modern Interconnects
Pitfalls in Parallel Job Scheduling Evaluation
Eitan Frachtenberg and Dror Feitelson
Computer and Computational Sciences Division, Los Alamos National Laboratory
2. Scope
- Numerous methodological issues arise in the evaluation of parallel job schedulers:
  - Experiment theory and design
  - Workloads and applications
  - Implementation issues and assumptions
  - Metrics and statistics
- The paper covers 32 recurring pitfalls, organized into topics and sorted by severity
- This talk describes a real case study and the heroic attempts to avoid most such pitfalls
  - As well as the less-heroic oversight of several others
3. Evaluation Paths
- Theoretical analysis (queuing theory)
  - Reproducible, rigorous, and resource-friendly
  - Hard for time slicing, due to unknown parameters, application structure, and feedbacks
- Simulation
  - Relatively simple and flexible
  - Many assumptions, not all known or reported; hard to reproduce; rarely factors in application characteristics
- Experiments with real sites and workloads
  - Most representative (at least locally)
  - Largely impractical and irreproducible
- Emulation
4. Emulation Environment
- Experimental platform: three clusters with a high-end network
- Software: several job scheduling algorithms implemented on top of STORM
  - Batch / space sharing, with optional EASY backfilling
  - Gang Scheduling, Implicit Coscheduling (SB), Flexible Coscheduling
- Results described in JSSPP'03 and TPDS'05
5. Step One: Choosing a Workload
- Static vs. dynamic
- Size of workload
- How many different workloads are needed?
- Use trace data? (see the trace-reading sketch below)
  - Different sites have different workload characteristics
  - Inconvenient sizes may require imprecise scaling
  - Polluted data, flurries
- Use model-generated data?
  - Several models exist, with different strengths
  - By trying to capture everything, a model may capture nothing
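If trace data is used, a job record only needs a handful of fields. Below is a minimal sketch of reading a trace in the Standard Workload Format (SWF) used by the Parallel Workloads Archive; the Job record is our own illustrative simplification, not a complete SWF parser.

```python
# Minimal SWF trace reader (sketch). SWF data lines hold 18
# whitespace-separated fields; only a few are used here:
# field 1 = job number, 2 = submit time, 4 = run time,
# 5 = number of allocated processors.
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    submit: float    # seconds since trace start
    runtime: float   # seconds
    procs: int

def load_swf(path):
    jobs = []
    with open(path) as f:
        for line in f:
            if line.startswith(';') or not line.strip():
                continue  # skip SWF header comments and blank lines
            fields = line.split()
            jobs.append(Job(int(fields[0]), float(fields[1]),
                            float(fields[3]), int(fields[4])))
    return jobs
```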
6. Static Workloads
- We start with a synthetic application and static workloads
  - Simple enough to model, debug, and calibrate
- Bulk-synchronous application
  - Can control granularity, variability, and communication pattern (see the sketch below)
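As a rough illustration, a bulk-synchronous synthetic application can be as small as the sketch below. It uses mpi4py; the granularity and variability knobs are illustrative names, not those of the actual testbed application.

```python
# Sketch of a bulk-synchronous synthetic application (assumes mpi4py).
# Each iteration "computes" for about `granularity` seconds, with a
# controlled amount of variability, then synchronizes globally.
import random
import time
from mpi4py import MPI

def bulk_synchronous(iterations, granularity, variability=0.0):
    comm = MPI.COMM_WORLD
    for _ in range(iterations):
        # Compute phase: busy-wait for the chosen grain size.
        work = granularity * (1.0 + random.uniform(-variability, variability))
        deadline = time.monotonic() + work
        while time.monotonic() < deadline:
            pass
        # Communication phase: a barrier stands in for the
        # application's communication pattern.
        comm.Barrier()

if __name__ == "__main__":
    bulk_synchronous(iterations=100, granularity=0.01, variability=0.5)
```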
7. Synthetic Scenarios
[Figure: the four static job-mix scenarios: balanced, complementing, imbalanced, and mixed]
8. Example: Turnaround Time
9. Dynamic Workloads
- We chose Lublin's model [JPDC'03]
- 1,000 jobs per workload
- Multiplying run times AND arrival times by a constant to shrink the experiment's duration (to 2-4 hours)
  - Shrinking too much is problematic (system constants)
- Multiplying arrival times by a range of factors to modify the load (see the scaling sketch below)
  - Unrepresentative, since it deviates from the real correlations with run times and job sizes
  - A better solution is to use different workloads
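Both scaling operations are one-liners over a job list; this sketch reuses the illustrative Job records from the trace-reading sketch above.

```python
def shrink(jobs, factor):
    # Scale run times AND arrival times by the same constant:
    # the offered load is preserved while the experiment shortens.
    return [Job(j.job_id, j.submit * factor, j.runtime * factor, j.procs)
            for j in jobs]

def change_load(jobs, factor):
    # Scale only the arrival times to raise or lower the offered load.
    # The pitfall: this breaks the natural correlations between
    # inter-arrival times, run times, and job sizes.
    return [Job(j.job_id, j.submit * factor, j.runtime, j.procs)
            for j in jobs]
```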
10. Step Two: Choosing Applications
- Synthetic applications are easy to control, but:
  - Some characteristics are ignored (e.g., I/O, memory)
  - Others may not be representative, in particular communication, which is the salient feature of parallel applications
    - Granularity, pattern, network performance
  - If not sure, conduct a sensitivity analysis (see the sketch below)
  - Applications might be assumed malleable, moldable, or to have linear speedup; many MPI applications are none of these
- Real applications have no hidden assumptions
  - But may also have limited generality
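A sensitivity analysis here just means rerunning the evaluation while sweeping one application characteristic at a time. A minimal sketch, assuming a hypothetical run_workload driver that runs the scheduler under test and returns a mean response time:

```python
# One-factor-at-a-time sensitivity analysis (sketch).
# `run_workload` is a hypothetical driver, not part of the testbed.
def sensitivity(run_workload, granularities):
    baseline = run_workload(granularity=granularities[0])
    for g in granularities[1:]:
        result = run_workload(granularity=g)
        change = (result - baseline) / baseline
        print(f"granularity={g}: mean response time {result:.1f}s "
              f"({change:+.1%} vs. baseline)")
```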
11. Example: Sensitivity Analysis
12. Application Choices
- Synthetic applications in the first set
  - Allow control over more parameters
  - Allow testing unrealistic but interesting conditions (e.g., a high multiprogramming level)
- LANL applications in the second set (Sweep3D, Sage)
  - Real memory and communication use (MPL = 2)
  - Important applications for LANL's evaluations
    - But probably only for LANL
- Run-time estimates: the f-model on batch, MPL on the others
13. Step Three: Choosing Parameters
- What are reasonable input parameters to use in the evaluation? (see the sweep sketch below)
  - Maximum multiprogramming level (MPL)
  - Timeslice quantum
  - Input load
  - Backfilling method and its effect on multiprogramming
  - Run-time estimate factor (not tested)
  - Algorithm constants, tuning, etc.
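When in doubt, sweep the whole grid rather than fix a single arbitrary value. A minimal sketch, again assuming the hypothetical run_workload driver; the parameter values are illustrative:

```python
import itertools

def parameter_sweep(run_workload):
    # `run_workload` is a hypothetical driver that runs the scheduler
    # under test and returns a mean response time in seconds.
    mpls = [1, 2, 4, 8]             # maximum multiprogramming level
    timeslices = [0.1, 1.0, 10.0]   # timeslice quantum (seconds)
    loads = [0.5, 0.7, 0.9]         # offered load (fraction of capacity)
    for mpl, ts, load in itertools.product(mpls, timeslices, loads):
        result = run_workload(mpl=mpl, timeslice=ts, load=load)
        print(f"MPL={mpl} timeslice={ts}s load={load}: "
              f"mean response time {result:.1f}s")
```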
14. Example 1: MPL
- Verified with different offered loads
15. Example 2: Timeslice
- Dividing into quantiles allows analysis of the effect on different job types
16. Considerations for Parameters
- Realistic MPLs
- Scaling traces to different machine sizes
- Scaling the offered load
- Artificial user estimates and multiprogramming estimates
17. Step Four: Choosing Metrics
- Not all metrics are easily comparable
  - Absolute times, slowdown with time slicing, etc.
- Metrics may need to be limited to a relevant context
- Use multiple metrics to understand the characteristics
- Measuring utilization for an open model is a pitfall
  - It is a direct measure of the offered load up to saturation
  - The same goes for throughput and makespan
- Better metrics: slowdown, response time, wait time (see the sketch below)
- Beware of using the mean with asymmetric distributions
- Beware of inferring scalability from O(1) nodes
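For reference, the bounded slowdown used in the following examples is conventionally defined as in the sketch below; the threshold tau (10 seconds is a common choice) keeps very short jobs from dominating the metric.

```python
def bounded_slowdown(wait, runtime, tau=10.0):
    # Response time over run time, with the run time clamped from
    # below by `tau` so near-zero-length jobs do not explode the
    # metric; the result is also clamped to be at least 1.
    return max(1.0, (wait + runtime) / max(runtime, tau))
```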
18. Example: Bounded Slowdown
19. Example (continued)
20. Response Time
21. Bounded Slowdown
22. Step Five: Measurement
- Never measure saturated workloads
  - When the arrival rate is higher than the service rate, queues grow to infinity and all metrics become meaningless
  - But finding the saturation point can be tricky
- Discard warm-up and cool-down results (see the sketch below)
- May need to measure subgroups separately (long/short, day/night, weekday/weekend, ...)
- Measurements should still include enough data points for statistical meaning; workload length matters especially here
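A common way to discard warm-up and cool-down effects is simply to drop a fraction of jobs at each end of the run; the fractions below are illustrative, not the values used in this study.

```python
def steady_state(jobs_by_completion, warmup=0.1, cooldown=0.1):
    # Keep only the middle of the run: the first jobs see an empty
    # system and the last ones a draining system, so their metrics
    # do not reflect steady-state behavior.
    n = len(jobs_by_completion)
    return jobs_by_completion[int(n * warmup): n - int(n * cooldown)]
```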
23. Example: Saturation Point
24. Example: Shortest Jobs CDF
25. Example: Longest Jobs CDF
26. Conclusion
- Parallel job scheduling evaluation is complex
  - But we can avoid past mistakes
- The paper can be used as a checklist when designing and executing evaluations
- Additional information in the paper:
  - Pitfalls, examples, and scenarios
  - Suggestions on how to avoid the pitfalls
  - Open research questions (for the next JSSPP?)
  - Many references to positive examples
- Be cognizant when choosing your compromises
27. References
- Workload archive
  - http://www.cs.huji.ac.il/~feit/workload
  - Contains several workload traces and models
- Dror's publication page
  - http://www.cs.huji.ac.il/~feit/pub.html
- Eitan's publication page
  - http://www.cs.huji.ac.il/~etcs/pubs
- Email: eitanf@lanl.gov