Title: IPDPS 2004 Presentation
1. Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros, nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
2. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
3. Motivation
- Active research interest in:
  - SMP clusters
  - Hybrid programming models
- However:
  - Mostly fine-grain hybrid paradigms (masteronly model)
  - Mostly DOALL multi-threaded parallelization
4. Contribution
- Comparison of 3 programming models for the parallelization of tiled loop algorithms:
  - pure message-passing
  - fine-grain hybrid
  - coarse-grain hybrid
- Advanced hyperplane scheduling:
  - minimizes synchronization need
  - overlaps computation with communication
  - preserves data dependencies
5. Algorithmic Model
Tiled nested loops with constant flow data dependencies:

FORACROSS tile_0 DO
  ...
  FORACROSS tile_{n-2} DO
    FOR tile_{n-1} DO
      Receive(tile)
      Compute(tile)
      Send(tile)
    END FOR
  END FORACROSS
  ...
END FORACROSS
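A minimal C sketch of how this template reads for one process, assuming a 3-dimensional tile space and hypothetical helpers receive_tile, compute_tile and send_tile: the FORACROSS dimensions are fixed by the process's coordinates in the grid, and only the last tile dimension is swept sequentially.

/* Hypothetical helpers (communication and tile computation). */
void receive_tile(int t0, int t1, int t2);
void compute_tile(int t0, int t1, int t2);
void send_tile(int t0, int t1, int t2);

/* One process's view of the template for n = 3: pr0, pr1 are the
 * process's coordinates in the grid, T2 the number of tiles along
 * the last (sequential, pipelined) dimension. */
void process_tiles(int pr0, int pr1, int T2)
{
    int tile0 = pr0;   /* FORACROSS tile_0: one value per process */
    int tile1 = pr1;   /* FORACROSS tile_1: one value per process */

    for (int tile2 = 0; tile2 < T2; tile2++) {   /* FOR tile_2 */
        receive_tile(tile0, tile1, tile2);  /* boundary data from neighbours */
        compute_tile(tile0, tile1, tile2);
        send_tile(tile0, tile1, tile2);     /* boundary data to neighbours */
    }
}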
6. Target Architecture
SMP clusters
7. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
8. Pure Message-passing Model

tile_0 = pr_0
...
tile_{n-2} = pr_{n-2}
FOR tile_{n-1} = 0 TO ... DO
  Pack(snd_buf, tile_{n-1} - 1, pr)
  MPI_Isend(snd_buf, dest(pr))
  MPI_Irecv(recv_buf, src(pr))
  Compute(tile)
  MPI_Waitall
  Unpack(recv_buf, tile_{n-1} - 1, pr)
END FOR
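A hedged C/MPI sketch of this loop for a 3D tile space, using hypothetical helpers pack_boundary, unpack_boundary and compute_tile; dest and src are assumed to be the neighbouring ranks along the pipeline (MPI_PROC_NULL at the ends), and count the boundary size in doubles.

#include <mpi.h>

/* Hypothetical helpers; assumed to tolerate the out-of-range index
 * (tile2 - 1 == -1) that occurs at the first pipeline step. */
void pack_boundary(double *snd_buf, int tile2);
void unpack_boundary(const double *recv_buf, int tile2);
void compute_tile(int tile2);

/* One process's pipelined loop: post non-blocking sends/receives for
 * the previous tile's boundary, compute the current tile, then wait,
 * so communication overlaps with computation. */
void pure_mpi_pipeline(double *snd_buf, double *recv_buf, int count,
                       int dest, int src, int T2)
{
    for (int tile2 = 0; tile2 < T2; tile2++) {
        MPI_Request req[2];
        pack_boundary(snd_buf, tile2 - 1);
        MPI_Isend(snd_buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req[1]);
        compute_tile(tile2);                       /* overlapped work */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        unpack_boundary(recv_buf, tile2 - 1);
    }
}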
9. Pure Message-passing Model
10. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
11. Hyperplane Scheduling
- Implements coarse-grain parallelism, assuming inter-tile data dependencies
- Tiles are organized into data-independent subsets (groups)
- Tiles of the same group can be concurrently executed by multiple threads (see the sketch below)
- Barrier synchronization between threads
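One simple grouping consistent with these properties, shown as a hypothetical helper: with non-negative, non-zero inter-tile dependence vectors, a tile can only depend on tiles whose coordinate sum is strictly smaller, so tiles with equal coordinate sums are mutually independent.

/* Hypothetical illustration: take a tile's group to be the sum of its
 * coordinates.  Any inter-tile dependence strictly increases this sum,
 * so all tiles of one group may run concurrently. */
int group_of(const int *tile, int n)
{
    int g = 0;
    for (int d = 0; d < n; d++)
        g += tile[d];
    return g;
}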
12. Hyperplane Scheduling
Each tile is identified by (mpi_rank, omp_tid, tile) and mapped to a group.
13. Hyperplane Scheduling

#pragma omp parallel
{
  group_0 = pr_0
  ...
  group_{n-2} = pr_{n-2}
  tile_0 = pr_0 * m_0 + th_0
  ...
  tile_{n-2} = pr_{n-2} * m_{n-2} + th_{n-2}
  FOR(group_{n-1}) {
    tile_{n-1} = group_{n-1} - ...
    if (0 <= tile_{n-1} <= ...)
      compute(tile)
    #pragma omp barrier
  }
}
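A hedged C/OpenMP sketch of this schedule for n = 3. It assumes the group index equals the sum of the tile coordinates (so tile_{n-1} = group_{n-1} - (tile_0 + tile_1), which preserves the inter-tile dependencies) and a row-major thread layout th0 = tid / m1, th1 = tid % m1; compute_tile is a hypothetical helper.

#include <omp.h>

void compute_tile(int t0, int t1, int t2);   /* hypothetical helper */

/* pr0, pr1: MPI process coordinates; m0, m1: threads per dimension
 * inside a node; T2: tiles along the last dimension; n_groups: number
 * of hyperplanes to sweep. */
void hyperplane_schedule(int pr0, int pr1, int m0, int m1,
                         int T2, int n_groups)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int th0 = tid / m1, th1 = tid % m1;    /* assumed thread layout */
        int tile0 = pr0 * m0 + th0;
        int tile1 = pr1 * m1 + th1;

        for (int group = 0; group < n_groups; group++) {
            int tile2 = group - (tile0 + tile1);   /* assumed group index */
            if (tile2 >= 0 && tile2 < T2)
                compute_tile(tile0, tile1, tile2);
            #pragma omp barrier   /* next group starts only when all
                                     threads have finished this one */
        }
    }
}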
14. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
15. Fine-grain Model
- Incremental parallelization of computationally intensive parts
- Pure MPI hyperplane scheduling
- Inter-node communication outside of multi-threaded part (MPI_THREAD_MASTERONLY)
- Thread synchronization through implicit barrier of omp parallel directive
16. Fine-grain Model

FOR(group_{n-1}) {
  Pack(snd_buf, tile_{n-1} - 1, pr)
  MPI_Isend(snd_buf, dest(pr))
  MPI_Irecv(recv_buf, src(pr))
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num()
    if (valid(tile, thread_id, group_{n-1}))
      Compute(tile)
  }
  MPI_Waitall
  Unpack(recv_buf, tile_{n-1} - 1, pr)
}
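A hedged C sketch of this fine-grain (masteronly) loop, with hypothetical helpers pack_boundary, unpack_boundary, tile_is_valid and compute_my_tile: MPI calls stay outside the parallel region, and a fresh thread team is forked for every group.

#include <mpi.h>
#include <omp.h>

/* Hypothetical helpers. */
void pack_boundary(double *snd_buf, int group);
void unpack_boundary(const double *recv_buf, int group);
int  tile_is_valid(int thread_id, int group);
void compute_my_tile(int thread_id, int group);

void fine_grain_hybrid(double *snd_buf, double *recv_buf, int count,
                       int dest, int src, int n_groups)
{
    for (int group = 0; group < n_groups; group++) {
        MPI_Request req[2];
        pack_boundary(snd_buf, group - 1);
        MPI_Isend(snd_buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req[1]);

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            if (tile_is_valid(tid, group))
                compute_my_tile(tid, group);
        }   /* implicit barrier: every tile of the group is finished */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        unpack_boundary(recv_buf, group - 1);
    }
}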
17. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
18. Coarse-grain Model
- Threads are only initialized once
- SPMD paradigm (requires more programming effort)
- Inter-node communication inside multi-threaded part (requires MPI_THREAD_FUNNELED; see the initialization sketch below)
- Thread synchronization through explicit barrier (omp barrier directive)
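Because MPI calls are issued from inside the parallel region (by the master thread only), the MPI library must provide at least funneled thread support. A minimal sketch using the standard MPI-2 initialization call; the slides do not state whether their MPICH 1.2.5 setup used this exact call.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;
    /* Request funneled support: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library lacks MPI_THREAD_FUNNELED support\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }
    /* ... coarse-grain hybrid computation ... */
    MPI_Finalize();
    return 0;
}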
19. Coarse-grain Model

#pragma omp parallel
{
  thread_id = omp_get_thread_num()
  FOR(group_{n-1}) {
    #pragma omp master
    {
      Pack(snd_buf, tile_{n-1} - 1, pr)
      MPI_Isend(snd_buf, dest(pr))
      MPI_Irecv(recv_buf, src(pr))
    }
    if (valid(tile, thread_id, group_{n-1}))
      Compute(tile)
    #pragma omp master
    {
      MPI_Waitall
      Unpack(recv_buf, tile_{n-1} - 1, pr)
    }
    #pragma omp barrier
  }
}
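A hedged C sketch of the coarse-grain (SPMD) loop, with the same hypothetical helpers as the fine-grain sketch: one parallel region spans the whole sweep, communication is funneled through the master thread, and an explicit barrier closes every group.

#include <mpi.h>
#include <omp.h>

/* Hypothetical helpers, as in the fine-grain sketch. */
void pack_boundary(double *snd_buf, int group);
void unpack_boundary(const double *recv_buf, int group);
int  tile_is_valid(int thread_id, int group);
void compute_my_tile(int thread_id, int group);

void coarse_grain_hybrid(double *snd_buf, double *recv_buf, int count,
                         int dest, int src, int n_groups)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        MPI_Request req[2];          /* used by the master thread only */

        for (int group = 0; group < n_groups; group++) {
            #pragma omp master
            {
                pack_boundary(snd_buf, group - 1);
                MPI_Isend(snd_buf, count, MPI_DOUBLE, dest, 0,
                          MPI_COMM_WORLD, &req[0]);
                MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0,
                          MPI_COMM_WORLD, &req[1]);
            }   /* no implied barrier: other threads start computing */

            if (tile_is_valid(tid, group))
                compute_my_tile(tid, group);

            #pragma omp master
            {
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                unpack_boundary(recv_buf, group - 1);
            }
            #pragma omp barrier   /* computation and communication for this
                                     group complete before the next group */
        }
    }
}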
20. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
21. Experimental Results
- 8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM, kernel 2.4.20)
- MPICH v1.2.5 (--with-device=ch_p4, --with-comm=shared)
- Intel C compiler 7.0 (-O3 -mcpu=pentiumpro -static)
- Fast Ethernet interconnection
- ADI micro-kernel benchmark (3D)
22. Alternating Direction Implicit (ADI)
- Stencil computation used for solving partial differential equations
- Unitary data dependencies (an illustrative kernel with this dependence pattern is sketched below)
- 3D iteration space (X x Y x Z)
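A hypothetical kernel (not the actual ADI update formulas, which the slides do not show) with the same dependence structure: a 3D sweep in which every point uses the updated values of its predecessors along each dimension, i.e. unit flow dependencies in X, Y and Z.

/* Sweep with unit flow dependencies along all three dimensions;
 * the coefficients are illustrative only. */
void adi_like_sweep(int X, int Y, int Z, double A[X][Y][Z])
{
    for (int i = 1; i < X; i++)
        for (int j = 1; j < Y; j++)
            for (int k = 1; k < Z; k++)
                A[i][j][k] = 0.25 * (A[i][j][k] + A[i-1][j][k]
                                     + A[i][j-1][k] + A[i][j][k-1]);
}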
23. ADI, 2 dual-SMP nodes
24. ADI, X=128, Y=512, Z=8192, 2 nodes
25. ADI, X=256, Y=512, Z=8192, 2 nodes
26. ADI, X=512, Y=512, Z=8192, 2 nodes
27. ADI, X=512, Y=256, Z=8192, 2 nodes
28. ADI, X=512, Y=128, Z=8192, 2 nodes
29. ADI, X=128, Y=512, Z=8192, 2 nodes (computation vs. communication breakdown)
30. ADI, X=512, Y=128, Z=8192, 2 nodes (computation vs. communication breakdown)
31. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
32. Conclusions
- Tiled loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
- Hybrid models can be competitive with the pure message-passing paradigm
- The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
- Programming efficiently in OpenMP is not easier than programming efficiently in MPI
33. Future Work
- Application of the methodology to real applications and standard benchmarks
- Work balancing for the coarse-grain model
- Investigation of alternative topologies, irregular communication patterns
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)
34. Thank You!
Questions?