Title: ParCo 2003 Presentation
1. Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros, nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
2. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
3. Introduction
- Motivation:
  - SMP clusters
  - Hybrid programming models
  - Mostly fine-grain MPI-OpenMP paradigms
  - Mostly DOALL parallelization
4. Introduction
- Contribution:
  - 3 programming models for the parallelization of nested loop algorithms:
    - pure MPI
    - fine-grain hybrid MPI-OpenMP
    - coarse-grain hybrid MPI-OpenMP
  - Advanced hyperplane scheduling:
    - minimizes the need for synchronization
    - overlaps computation with communication
5. Introduction
- Algorithmic model:

  FOR j_0 = min_0 TO max_0 DO
    ...
    FOR j_{n-1} = min_{n-1} TO max_{n-1} DO
      Computation(j_0, ..., j_{n-1});
    ENDFOR
    ...
  ENDFOR

- Perfectly nested loops
- Constant flow data dependencies
6. Introduction
- Target architecture: SMP clusters
7. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
8. Pure MPI Model
- Tiling transformation groups iterations into atomic execution units (tiles)
- Pipelined execution
- Overlapping computation with communication
- Makes no distinction between inter-node and intra-node communication
9. Pure MPI Model
- Example (a plain C version of this loop nest follows below):

  FOR j1 = 0 TO 9 DO
    FOR j2 = 0 TO 7 DO
      A[j1,j2] = A[j1-1,j2] + A[j1,j2-1];
    ENDFOR
  ENDFOR
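For reference, a minimal plain C sketch of this example is given below. The indices are shifted by one and the array is padded with a boundary row and column so that the A[j1-1][j2] and A[j1][j2-1] reads stay in bounds; the array size padding and the initial values are illustrative choices, not taken from the slides.

  #include <stdio.h>

  #define N1 10   /* j1 range of the example: 10 iterations */
  #define N2 8    /* j2 range of the example: 8 iterations  */

  int main(void)
  {
      /* pad with one boundary row and column so the left/upper reads are defined */
      double A[N1 + 1][N2 + 1];

      for (int i = 0; i <= N1; i++)
          for (int j = 0; j <= N2; j++)
              A[i][j] = 1.0;                     /* arbitrary initial values */

      /* the loop nest of the example: constant flow dependencies (1,0) and (0,1) */
      for (int j1 = 1; j1 <= N1; j1++)
          for (int j2 = 1; j2 <= N2; j2++)
              A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];

      printf("A[%d][%d] = %g\n", N1, N2, A[N1][N2]);
      return 0;
  }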
10. Pure MPI Model
[Figure: the example iteration space tiled and mapped onto 4 MPI nodes (NODE0/NODE1, each with CPU0/CPU1)]
11. Pure MPI Model
[Figure: pipelined execution of the tiles on the 4 MPI nodes (NODE0/NODE1, each with CPU0/CPU1)]
12. Pure MPI Model

  tile_0 = nod_0;
  ...
  tile_{n-2} = nod_{n-2};
  FOR tile_{n-1} = 0 TO ... DO
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    Compute(tile);
    MPI_Waitall();
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  END FOR
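To make this pattern concrete, here is a minimal, self-contained C/MPI sketch of one pipelined dimension: each process packs the boundary produced so far, posts non-blocking send/receive, computes the current tile while the messages are in flight, and only then waits and unpacks. The buffer size, tile count, neighbour ranks, and the compute/pack/unpack bodies are illustrative assumptions, not the exact code behind the slides.

  #include <mpi.h>
  #include <stdlib.h>

  #define BUF    512    /* illustrative size of a tile boundary (elements)        */
  #define NTILES 16     /* illustrative number of tiles along the pipelined dim.  */

  /* illustrative stand-in for Compute(tile) */
  static void compute_tile(double *data, int n, int t)
  {
      for (int i = 1; i < n; i++)
          data[i] += data[i - 1] + (double)t;
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* neighbours along the pipeline; MPI_PROC_NULL makes edge transfers no-ops */
      int prev = (rank > 0) ? rank - 1 : MPI_PROC_NULL;          /* src(nod)  */
      int next = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;   /* dest(nod) */

      double *tile_data = calloc(BUF, sizeof(double));
      double *snd_buf   = calloc(BUF, sizeof(double));
      double *rcv_buf   = calloc(BUF, sizeof(double));

      for (int t = 0; t < NTILES; t++) {
          MPI_Request req[2];

          /* Pack the boundary and post the non-blocking calls, so the
             communication overlaps with the computation of the current tile */
          for (int i = 0; i < BUF; i++)
              snd_buf[i] = tile_data[i];
          MPI_Isend(snd_buf, BUF, MPI_DOUBLE, next, t, MPI_COMM_WORLD, &req[0]);
          MPI_Irecv(rcv_buf, BUF, MPI_DOUBLE, prev, t, MPI_COMM_WORLD, &req[1]);

          compute_tile(tile_data, BUF, t);                        /* Compute(tile) */

          MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

          /* Unpack: fold the received boundary into the local data for later tiles */
          for (int i = 0; i < BUF; i++)
              tile_data[i] += rcv_buf[i];
      }

      free(tile_data);
      free(snd_buf);
      free(rcv_buf);
      MPI_Finalize();
      return 0;
  }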
13. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
14. Hyperplane Scheduling
- Implements coarse-grain parallelism assuming inter-tile data dependencies
- Tiles are organized into data-independent subsets (groups)
- Tiles of the same group can be concurrently executed by multiple threads (see the small illustration after this slide)
- Barrier synchronization between threads
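As a small illustration of the grouping, the sketch below enumerates which tiles of a 2D tile space fall on each hyperplane, assuming (as is common for this kind of scheduling, though not spelled out on this slide) that the group index of a tile is the sum of its coordinates; tiles of the same group have no dependence on each other and can be computed concurrently. The 4x3 tile space is an arbitrary choice for the example.

  #include <stdio.h>

  #define T1 4   /* tiles along dimension 1 (arbitrary) */
  #define T2 3   /* tiles along dimension 2 (arbitrary) */

  int main(void)
  {
      /* group (hyperplane) index of tile (t1, t2) = t1 + t2 */
      for (int g = 0; g <= (T1 - 1) + (T2 - 1); g++) {
          printf("group %d:", g);
          for (int t1 = 0; t1 < T1; t1++) {
              int t2 = g - t1;                 /* solve t1 + t2 = g */
              if (t2 >= 0 && t2 < T2)
                  printf(" (%d,%d)", t1, t2);
          }
          printf("\n");
      }
      return 0;
  }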
15. Hyperplane Scheduling
[Figure: the example mapped onto 2 MPI nodes (NODE0/NODE1) x 2 OpenMP threads (CPU0/CPU1)]
16. Hyperplane Scheduling
[Figure: hyperplane groups executed on 2 MPI nodes x 2 OpenMP threads (NODE0/NODE1, CPU0/CPU1)]
17. Hyperplane Scheduling

  #pragma omp parallel
  {
    group_0 = nod_0;
    ...
    group_{n-2} = nod_{n-2};
    tile_0 = nod_0 * m_0 + th_0;
    ...
    tile_{n-2} = nod_{n-2} * m_{n-2} + th_{n-2};
    FOR(group_{n-1})
    {
      tile_{n-1} = group_{n-1} - ...;
      if (0 <= tile_{n-1} <= ...)
        compute(tile);
      #pragma omp barrier
    }
  }
18. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
19. Fine-grain Model
- Incremental parallelization of the computationally intensive parts
- Relatively straightforward to derive from pure MPI
- Threads are (re-)spawned at each computation phase
- Inter-node communication outside of the multi-threaded part
- Thread synchronization through the implicit barrier of the omp parallel directive
20. Fine-grain Model

  FOR(group_{n-1})
  {
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    #pragma omp parallel
    {
      thread_id = omp_get_thread_num();
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
    }
    MPI_Waitall();
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  }
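A condensed C sketch of this fine-grain structure, restricted to one pipelined dimension: MPI communication is issued by the single master thread outside the parallel region, threads are re-spawned for every group, and the implicit barrier at the end of omp parallel provides the intra-node synchronization. The group count, buffer size, and the valid()/compute helpers are illustrative assumptions, not the exact code behind the slides.

  #include <mpi.h>
  #include <omp.h>
  #include <stdlib.h>

  #define NGROUPS 16      /* illustrative number of hyperplane groups   */
  #define BUF     1024    /* illustrative boundary/tile size (elements) */

  /* illustrative stand-in for valid(tile, thread_id, group): does this
     thread own a tile of the current group?  Always true in this toy sketch. */
  static int valid(int thread_id, int group)
  {
      (void)thread_id;
      (void)group;
      return 1;
  }

  /* illustrative stand-in for Compute(tile): each thread updates its own slice */
  static void compute_slice(double *data, int lo, int hi, int group)
  {
      for (int i = lo; i < hi; i++)
          data[i] += (double)group;
  }

  int main(int argc, char **argv)
  {
      int provided;
      /* communication happens only between parallel regions, so
         FUNNELED-level thread support is sufficient here */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      (void)provided;   /* a real code would check the granted level */

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int prev = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
      int next = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      double *data    = calloc(BUF, sizeof(double));
      double *snd_buf = calloc(BUF, sizeof(double));
      double *rcv_buf = calloc(BUF, sizeof(double));

      for (int group = 0; group < NGROUPS; group++) {
          MPI_Request req[2];

          /* inter-node communication stays outside the multi-threaded part */
          for (int i = 0; i < BUF; i++) snd_buf[i] = data[i];            /* Pack   */
          MPI_Isend(snd_buf, BUF, MPI_DOUBLE, next, group, MPI_COMM_WORLD, &req[0]);
          MPI_Irecv(rcv_buf, BUF, MPI_DOUBLE, prev, group, MPI_COMM_WORLD, &req[1]);

          /* threads are (re)spawned for every group; the implicit barrier at the
             end of the parallel region synchronizes them */
          #pragma omp parallel
          {
              int thread_id = omp_get_thread_num();
              int nthr      = omp_get_num_threads();
              int chunk     = (BUF + nthr - 1) / nthr;
              int lo        = thread_id * chunk;
              int hi        = (lo + chunk < BUF) ? lo + chunk : BUF;

              if (valid(thread_id, group))
                  compute_slice(data, lo, hi, group);
          }

          MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
          for (int i = 0; i < BUF; i++) data[i] += rcv_buf[i];           /* Unpack */
      }

      free(data);
      free(snd_buf);
      free(rcv_buf);
      MPI_Finalize();
      return 0;
  }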
21. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
22. Coarse-grain Model
- SPMD paradigm
- Requires more programming effort
- Threads are only spawned once
- Inter-node communication inside the multi-threaded part (requires MPI_THREAD_MULTIPLE)
- Thread synchronization through an explicit barrier (omp barrier directive)
23. Coarse-grain Model

  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    FOR(group_{n-1})
    {
      #pragma omp master
      {
        Pack(snd_buf, tile_{n-1} - 1, nod);
        MPI_Isend(snd_buf, dest(nod));
        MPI_Irecv(recv_buf, src(nod));
      }
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
      #pragma omp master
      {
        MPI_Waitall();
        Unpack(recv_buf, tile_{n-1} + 1, nod);
      }
      #pragma omp barrier
    }
  }
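A matching C sketch of the coarse-grain structure: threads are spawned once, the master thread performs the MPI communication inside the parallel region, and an explicit omp barrier closes every group. To keep the sketch race-free, the communication buffers are touched only by the master thread, and feeding the received boundary back into the computation is omitted; sizes and helpers are again illustrative assumptions rather than the slides' actual code.

  #include <mpi.h>
  #include <omp.h>
  #include <stdlib.h>

  #define NGROUPS 16      /* illustrative number of hyperplane groups   */
  #define BUF     1024    /* illustrative boundary/tile size (elements) */

  static void compute_slice(double *data, int lo, int hi, int group)
  {
      for (int i = lo; i < hi; i++)
          data[i] += (double)group;
  }

  int main(int argc, char **argv)
  {
      int provided;
      /* the slide's scheme asks for MPI_THREAD_MULTIPLE; this simplified
         master-only version would also work with MPI_THREAD_FUNNELED */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_FUNNELED)
          MPI_Abort(MPI_COMM_WORLD, 1);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int prev = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
      int next = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      double *data     = calloc(BUF, sizeof(double));  /* updated by all threads   */
      double *boundary = calloc(BUF, sizeof(double));  /* master-only, sent around */
      double *snd_buf  = calloc(BUF, sizeof(double));
      double *rcv_buf  = calloc(BUF, sizeof(double));
      MPI_Request req[2];

      /* threads are spawned once and stay alive for the whole computation */
      #pragma omp parallel
      {
          int thread_id = omp_get_thread_num();
          int nthr      = omp_get_num_threads();
          int chunk     = (BUF + nthr - 1) / nthr;
          int lo        = thread_id * chunk;
          int hi        = (lo + chunk < BUF) ? lo + chunk : BUF;

          for (int group = 0; group < NGROUPS; group++) {
              /* only the master thread talks to MPI */
              #pragma omp master
              {
                  for (int i = 0; i < BUF; i++) snd_buf[i] = boundary[i];   /* Pack  */
                  MPI_Isend(snd_buf, BUF, MPI_DOUBLE, next, group, MPI_COMM_WORLD, &req[0]);
                  MPI_Irecv(rcv_buf, BUF, MPI_DOUBLE, prev, group, MPI_COMM_WORLD, &req[1]);
              }

              compute_slice(data, lo, hi, group);     /* overlaps with the messages */

              #pragma omp master
              {
                  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                  for (int i = 0; i < BUF; i++) boundary[i] = rcv_buf[i];   /* Unpack */
              }

              /* explicit synchronization before the next group */
              #pragma omp barrier
          }
      }

      free(data);
      free(boundary);
      free(snd_buf);
      free(rcv_buf);
      MPI_Finalize();
      return 0;
  }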
24. Summary: Fine-grain vs Coarse-grain
- Thread management: fine-grain re-spawns threads at each parallel region; coarse-grain spawns threads only once
- Inter-node MPI communication: fine-grain performs it outside the multi-threaded region; coarse-grain performs it inside the multi-threaded region, assumed by the master thread
- Intra-node synchronization: fine-grain relies on the implicit barrier of omp parallel; coarse-grain uses an explicit OpenMP barrier
25. Overview
- Introduction
- Pure MPI model
- Hybrid MPI-OpenMP models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
26. Experimental Results
- 8-node SMP Linux cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
- MPICH v.1.2.5 (--with-device=ch_p4, --with-comm=shared)
- Intel C compiler 7.0 (-O3 -mcpu=pentiumpro -static)
- FastEthernet interconnection
- ADI micro-kernel benchmark (3D)
27. Alternating Direction Implicit (ADI)
- Unitary data dependencies
- 3D Iteration Space (X x Y x Z)
28. ADI, 4 nodes [results plot]
29. ADI, 4 nodes [results plot]
30. ADI X=512, Y=512, Z=8192, 4 nodes [results plot]
31. ADI X=128, Y=512, Z=8192, 4 nodes [results plot]
32. ADI X=512, Y=128, Z=8192, 4 nodes [results plot]
33. ADI, 2 nodes [results plot]
34. ADI, 2 nodes [results plot]
35. ADI X=128, Y=512, Z=8192, 2 nodes [results plot]
36. ADI X=256, Y=512, Z=8192, 2 nodes [results plot]
37. ADI X=512, Y=512, Z=8192, 2 nodes [results plot]
38. ADI X=512, Y=256, Z=8192, 2 nodes [results plot]
39. ADI X=512, Y=128, Z=8192, 2 nodes [results plot]
40. ADI X=128, Y=512, Z=8192, 2 nodes [plot: computation vs. communication breakdown]
41. ADI X=512, Y=128, Z=8192, 2 nodes [plot: computation vs. communication breakdown]
42. Overview
- Introduction
- Pure MPI model
- Hybrid MPI-OpenMP models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
43. Conclusions
- Nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
- Hybrid models can be competitive with the pure MPI paradigm
- The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
- Programming efficiently in OpenMP is not easier than programming efficiently in MPI
44. Future Work
- Application of the methodology to real applications and benchmarks
- Work balancing for the coarse-grain model
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)
- Generalization as a compiler technique
45. Questions?
http://www.cslab.ece.ntua.gr/ndros