Title: Hybrid Preemptive Scheduling of MPI applications
1. Hybrid Preemptive Scheduling of MPI Applications
- Aurélien Bouteiller, Hinde Lilia Bouziane, Thomas Hérault, Pierre Lemarinier, Franck Cappello
- MPICH-V team
- INRIA Grand-Large
- LRI, University Paris South
2. Problem definition
- Context: clusters and Grids (made of clusters) shared by many users, so fewer resources are available than required at a given time. In this study: finite sets of MPI applications.
- Time sharing of parallel applications is attractive to increase fairness between users, compared to batch scheduling.
- It is very likely that several applications will reside in virtual memory at the same time, exceeding the total physical memory → out-of-core scheduling of parallel applications on clusters (scheduling parallel applications on a cluster under memory constraints).
- Most of the proposed approaches try to avoid this situation by limiting job admission based on memory requirements, which delays some jobs (unpredictably, if job execution times are not known).
- Issue: a novel (out-of-core) approach that avoids delaying some jobs?
- Constraint: no OS modification (no kernel patch).
3. Outline
- Introduction (related work)
- A Hybrid approach dedicated to out-of-core
- Evaluation
- Concluding remarks
4. Related work (1)
Scheduling parallel applications on distributed memory machines → a long history of research, still very active (5 papers in 2004 in main conferences: IPDPS, Cluster, SC, Grid, Europar)!
- Co-scheduling: all processes of each application are scheduled independently (no coordination).
- Gang scheduling: all processes of each application are executed simultaneously (coordination); sometimes called co-scheduling.
[Figure: time-sharing diagrams contrasting co-scheduling and gang scheduling]
7. Outline
- Introduction (related work)
- A Hybrid approach dedicated to out-of-core
- Evaluation
- Concluding remarks
8. Our approach (1/2): Hybrid
[Figure-only slide]
9. Our approach (2/2): Checkpointing
[Figure-only slide]
10. Implementation using the MPICH-V framework
The MPICH-V framework is a set of components; an MPICH-V protocol is a composition of a subset of these components.
[Figure: component layout on a node]
12. Coordinated checkpoint: 2 ways
[Figure: two ways of storing the checkpoint image of P1]
13. MPICH-V/CL protocol
Reference protocol for coordinated checkpointing.
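Since the slide only names the protocol, here is a minimal, illustrative Python sketch of a Chandy-Lamport-style coordinated checkpoint [Ch85], the idea behind MPICH-V/CL: a marker flushes each channel, and every process saves its state together with the in-flight messages logged ahead of the marker. All class and function names here are invented for the sketch, not the actual MPICH-V code.

from collections import deque

MARKER = object()  # channel-flushing marker broadcast by the scheduler

class Process:
    """One MPI process with a single logical incoming channel (toy model)."""
    def __init__(self, rank):
        self.rank = rank
        self.state = {"rank": rank, "iteration": 42}
        self.inbox = deque()   # messages in flight toward this process
        self.image = None      # last checkpoint image

    def checkpoint(self):
        # Log every message that was in flight ahead of the marker, so the
        # channel state is saved together with the process state.
        in_flight = []
        while self.inbox and self.inbox[0] is not MARKER:
            in_flight.append(self.inbox.popleft())
        if self.inbox:
            self.inbox.popleft()  # consume the marker itself
        # In MPICH-V/CL the daemon writes this image to the local disk.
        self.image = {"state": dict(self.state), "in_flight": in_flight}

def coordinated_checkpoint(processes):
    # The checkpoint scheduler pushes a marker on every channel, then
    # every process snapshots; the set of images forms a consistent cut.
    for p in processes:
        p.inbox.append(MARKER)
    for p in processes:
        p.checkpoint()

procs = [Process(r) for r in range(4)]
procs[1].inbox.extend(["msg from 0", "msg from 2"])  # in-flight traffic
coordinated_checkpoint(procs)
print(procs[1].image)  # state plus the two logged in-flight messages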
14. Implementation details
Dispatchers:
- Co-scheduling: several dispatchers (no master/checkpoint scheduler).
- Gang (and Hybrid): a Master Scheduler and several Checkpoint Schedulers.
- The Master Scheduler issues a checkpoint order to the Checkpoint Scheduler(s) of the running application(s).
- When receiving this order, a Checkpoint Scheduler launches a coordinated checkpoint: every running daemon computes the MPI process image and stores it on the local disk, then all daemons send a completion message to the Checkpoint Scheduler.
- All running daemons stop the MPI process and their own execution.
- The Master Scheduler selects the Checkpoint Scheduler(s) of other application(s) and sends a restart order. Every Checkpoint Scheduler receiving this order spawns new daemons restarting the MPI processes from local images.
A sketch of this gang-switch exchange follows below.
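The following Python sketch condenses the exchange just listed into a single loop: a master scheduler preempts the running gang via checkpoint, then restarts the next one from local images. The class and method names (MasterScheduler, CheckpointScheduler, checkpoint_and_stop) are invented placeholders, and the network messages and daemon acknowledgements are collapsed into plain calls.

import time

class CheckpointScheduler:
    """One per application; drives its daemons (invented placeholder)."""
    def __init__(self, app_name, n_daemons):
        self.app, self.n = app_name, n_daemons
        self.running = False

    def checkpoint_and_stop(self):
        # Coordinated checkpoint: every daemon writes the MPI process
        # image to its local disk and acks; then daemons and processes stop.
        acks = [f"{self.app}:daemon{i}" for i in range(self.n)]
        assert len(acks) == self.n  # wait for all completion messages
        self.running = False

    def restart(self):
        # Spawn new daemons restarting the MPI processes from local images.
        self.running = True

class MasterScheduler:
    """Issues checkpoint/restart orders to rotate gangs (invented placeholder)."""
    def __init__(self, apps, time_slice):
        self.apps, self.slice = apps, time_slice

    def run(self, n_slices):
        current = 0
        self.apps[current].restart()
        for _ in range(n_slices):
            time.sleep(self.slice)                    # let the gang compute
            self.apps[current].checkpoint_and_stop()  # preempt via checkpoint
            current = (current + 1) % len(self.apps)  # pick the next gang
            self.apps[current].restart()

# Two applications time-share the machine (tiny slice for the demo).
MasterScheduler([CheckpointScheduler("BT", 4),
                 CheckpointScheduler("CG", 4)],
                time_slice=0.01).run(n_slices=4)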
15. Outline
- Introduction (related work)
- A Hybrid approach dedicated to out-of-core
- Evaluation
- Concluding remarks
16. Methodology
- LRI cluster:
  - Athlon 1800
  - 1 GB memory
  - IDE ATA100 disk
  - Ethernet 100 Mb/s
  - Linux 2.4.2
- Benchmarks (MPI):
  - NAS BT (computation bound)
  - NAS CG (communication bound)
- Time measurement:
  - Homogeneous applications
  - Simultaneous launch (scripts)
  - Time is measured between the first launch and the last termination
  - Fairness is measured by the standard deviation of response times (see the sketch after this list)
- Gang scheduling time slice: 200 or 600 sec
- Gang scheduling also implemented by checkpointing (not OS signals)
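As a concrete reading of the two measurements above, here is a small Python sketch computing the reported time and the fairness metric; the launch and termination times are made-up sample values, not data from the experiments.

from statistics import pstdev

# Per-application launch and termination times (seconds; made-up values).
launches     = [0.0, 0.4, 0.9]
terminations = [610.2, 598.7, 605.1]

# Reported time: first launch to last termination.
total_time = max(terminations) - min(launches)

# Fairness: standard deviation of response times (lower = fairer).
response_times = [end - start for start, end in zip(launches, terminations)]
fairness = pstdev(response_times)

print(f"total: {total_time:.1f} s, fairness (std dev): {fairness:.1f} s")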
17Context switch overlap policy
In core
Near out-of-core
- Policies for NAS Bench. BT C- 25
- Overlapping policies do not provide substantial
- improvements for the in-core situation
- 2) They need 2x the memory capacity to stay
in-core. - the sequential policy is the best
- We used it for the other xps.
lt3
2X
2X
2X
1X
2X
2X
1X
17
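To make the 1X/2X trade-off concrete, here is a toy Python model of the two context-switch policies; the checkpoint/restart durations and memory size are invented parameters, not measurements from the slides.

def switch_time(ckpt_s, restart_s, overlap):
    # Sequential: checkpoint application A fully, then restart B.
    # Overlapping: restart B while A is still being checkpointed.
    return max(ckpt_s, restart_s) if overlap else ckpt_s + restart_s

def peak_memory(app_gb, overlap):
    # Overlapping keeps both application images resident at once (2X),
    # so it stays in-core only with twice the memory capacity.
    return 2 * app_gb if overlap else app_gb

for overlap in (False, True):
    print(f"overlap={overlap}: switch = {switch_time(30, 25, overlap)} s, "
          f"peak memory = {peak_memory(0.9, overlap)} GB")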
18. Co-scheduling vs. Gang scheduling (checkpoint based)
- Which scheduling strategy is the best for communication-bound and computation-bound applications?
- Co-scheduling is the best for in-core executions (but the advantage is small, due to checkpoint overhead and tiny communication/computation overlap).
- Gang scheduling outperforms co-scheduling for out-of-core (checkpoint-based) executions.
- → The memory constraint is managed by checkpointing, not by delaying jobs.
19. Checkpoint-based Gang vs. checkpoint-based Hybrid
[Figure-only slide]
20. Overhead comparison
- What is the performance degradation due to time sharing?
- Gang and Hybrid scheduling add no performance penalty to CG (and also no improvement).
- Gang scheduling adds a 10% performance penalty to BT.
- Hybrid scheduling improves the performance by almost 10%.
- The difference is mostly due to communication/computation overlap.
21. Co-scheduling fairness (Linux)
- How fair is co-scheduling for in-core and out-of-core executions?
- → Response time of 9 BT instances with modified memory sizes.
[Figure: page-miss statistics for 7 and 9 BT C-25 instances (out-of-core)]
22. Outline
- Introduction (related work)
- A Hybrid approach dedicated to out-of-core
- Evaluation
- Concluding remarks
23. Concluding remarks
- Checkpoint-based Gang scheduling outperforms Co-scheduling, and certainly classical (OS signal based) Gang scheduling, in out-of-core situations (thanks to better memory management).
- Compared to known approaches based on job admission control, the benefit of checkpointing is that it avoids delaying some jobs.
- Hybrid scheduling, combining the two approaches through checkpointing, outperforms Gang scheduling on BT (presumably thanks to overlapping communications and computations).
- More generally, Hybrid scheduling can take advantage of advanced co-scheduling approaches within a gang subset.
- Work in progress:
  - Test with other applications / benchmarks
  - Compare with traditional gang scheduling based on OS signals
  - Experiments with high-speed networks
  - Experiments on Hybrid scheduling with co-scheduling optimizations
24. Meet us at the INRIA booth (#2345)!
Mail contact: bouteiller_at_mpich-v.net
25. References
[Ag03] S. Agarwal, G. Choi, C. R. Das, A. B. Yoo, and S. Nagar. Coordinated Coscheduling in Time-Sharing Clusters through a Generic Framework. In Proceedings of the International Conference on Cluster Computing, December 2003.
[Ar98] A. C. Arpaci-Dusseau, D. E. Culler, and A. M. Mainwaring. Implicit Scheduling with Implicit Information in Distributed Systems. In Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 233-243, June 1998.
[Ba00] Anat Batat and Dror G. Feitelson. Gang Scheduling with Memory Considerations. In Proceedings of IPDPS 2000.
[Bo03] Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, and Franck Cappello. Coordinated Checkpoint versus Message Log for Fault Tolerant MPI. In IEEE International Conference on Cluster Computing (Cluster 2003), IEEE CS Press, December 2003.
[Ch85] K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1):63-75, February 1985.
[Fe98] D. G. Feitelson and L. Rudolph. Metrics and Benchmarking for Parallel Job Scheduling. In Job Scheduling Strategies for Parallel Processing, LNCS vol. 1495, pages 1-24, Springer-Verlag, March 1998.
[Fr03] Eitan Frachtenberg, Dror G. Feitelson, Fabrizio Petrini, and Juan Fernandez. Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources. IPDPS 2003.
[Ho98] Atsushi Hori, Hiroshi Tezuka, and Yutaka Ishikawa. Overhead Analysis of Preemptive Gang Scheduling. Lecture Notes in Computer Science, 1459:217-230, April 1998.
[Ky04] Kyung Dong Ryu, Nimish Pachapurkar, and Liana L. Fong. Adaptive Memory Paging for Efficient Gang Scheduling of Parallel Applications. In Proceedings of IPDPS 2004.
[Na99] S. Nagar, A. Banerjee, A. Sivasubramaniam, and C. R. Das. Alternatives to Coscheduling a Network of Workstations. Journal of Parallel and Distributed Computing, 59(2):302-327, November 1999.
[Ni02] Dimitrios S. Nikolopoulos and Constantine D. Polychronopoulos. Adaptive Scheduling under Memory Pressure on Multiprogrammed Clusters. CCGRID 2002.
[Sa04] Gyu Sang Choi, Jin-Ha Kim, Deniz Ersoz, Andy B. Yoo, and Chita R. Das. Coscheduling in Clusters: Is It a Viable Alternative? To appear in SC2004.
[Se99] S. Setia, M. S. Squillante, and V. K. Naik. The Impact of Job Memory Requirements on Gang-Scheduling Performance. ACM SIGMETRICS Performance Evaluation Review, 26(4):30-39, 1999.
[So98] P. G. Sobalvarro, S. Pakin, W. E. Weihl, and A. A. Chien. Dynamic Coscheduling on Workstation Clusters. In Proceedings of the IPPS Workshop on Job Scheduling Strategies for Parallel Processing, pages 231-256, March 1998.
[St04] Peter Strazdins and John Uhlmann. Local Scheduling Outperforms Gang Scheduling on a Beowulf Cluster. Technical report, Department of Computer Science, Australian National University, January 2004; to appear in Cluster 2004.
[Wi03] Yair Wiseman and Dror G. Feitelson. Paired Gang Scheduling. IEEE TPDS, June 2003.
27. Is the result for the in-core situation kernel dependent (Linux)?
Kernel 2.4.2 was used in our experiments. How does time-sharing efficiency evolve with Linux kernel maturation (from 2.4 to 2.6)?