Title: IPDPS 2004 Presentation
1. Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros, nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
2. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
3. Motivation
- Active research interest in:
  - SMP clusters
  - Hybrid programming models
- However:
  - Mostly fine-grain hybrid paradigms (masteronly model)
  - Mostly DOALL multi-threaded parallelization
4. Contribution
- Comparison of 3 programming models for the parallelization of tiled loop algorithms:
  - pure message-passing
  - fine-grain hybrid
  - coarse-grain hybrid
- Advanced hyperplane scheduling:
  - minimizes synchronization need
  - overlaps computation with communication
  - preserves data dependencies
5. Algorithmic Model
Tiled nested loops with constant flow data dependencies:

FORACROSS tile_0 DO
  ...
  FORACROSS tile_{n-2} DO
    FOR tile_{n-1} DO
      Receive(tile)
      Compute(tile)
      Send(tile)
    END FOR
  END FORACROSS
  ...
END FORACROSS
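A minimal C sketch of how this template reads for one process, assuming a 3-dimensional tile space and hypothetical helpers receive_tile, compute_tile and send_tile: the FORACROSS dimensions are fixed by the process's coordinates in the grid, and only the last tile dimension is swept sequentially.

/* Hypothetical helpers (communication and tile computation). */
void receive_tile(int t0, int t1, int t2);
void compute_tile(int t0, int t1, int t2);
void send_tile(int t0, int t1, int t2);

/* One process's view of the template for n = 3: pr0, pr1 are the
 * process's coordinates in the grid, T2 the number of tiles along
 * the last (sequential, pipelined) dimension. */
void process_tiles(int pr0, int pr1, int T2)
{
    int tile0 = pr0;   /* FORACROSS tile_0: one value per process */
    int tile1 = pr1;   /* FORACROSS tile_1: one value per process */

    for (int tile2 = 0; tile2 < T2; tile2++) {   /* FOR tile_2 */
        receive_tile(tile0, tile1, tile2);  /* boundary data from neighbours */
        compute_tile(tile0, tile1, tile2);
        send_tile(tile0, tile1, tile2);     /* boundary data to neighbours */
    }
}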
6. Target Architecture
SMP clusters
7. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
8. Pure Message-passing Model

tile_0 = pr_0
...
tile_{n-2} = pr_{n-2}
FOR tile_{n-1} = 0 TO ... DO
  Pack(snd_buf, tile_{n-1} - 1, pr)
  MPI_Isend(snd_buf, dest(pr))
  MPI_Irecv(recv_buf, src(pr))
  Compute(tile)
  MPI_Waitall
  Unpack(recv_buf, tile_{n-1} - 1, pr)
END FOR
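A hedged C/MPI sketch of this loop for a 3D tile space, using hypothetical helpers pack_boundary, unpack_boundary and compute_tile; dest and src are assumed to be the neighbouring ranks along the pipeline (MPI_PROC_NULL at the ends), and count the boundary size in doubles.

#include <mpi.h>

/* Hypothetical helpers; assumed to tolerate the out-of-range index
 * (tile2 - 1 == -1) that occurs at the first pipeline step. */
void pack_boundary(double *snd_buf, int tile2);
void unpack_boundary(const double *recv_buf, int tile2);
void compute_tile(int tile2);

/* One process's pipelined loop: post non-blocking sends/receives for
 * the previous tile's boundary, compute the current tile, then wait,
 * so communication overlaps with computation. */
void pure_mpi_pipeline(double *snd_buf, double *recv_buf, int count,
                       int dest, int src, int T2)
{
    for (int tile2 = 0; tile2 < T2; tile2++) {
        MPI_Request req[2];
        pack_boundary(snd_buf, tile2 - 1);
        MPI_Isend(snd_buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req[1]);
        compute_tile(tile2);                       /* overlapped work */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        unpack_boundary(recv_buf, tile2 - 1);
    }
}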
9. Pure Message-passing Model
10. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
11. Hyperplane Scheduling
- Implements coarse-grain parallelism, assuming inter-tile data dependencies
- Tiles are organized into data-independent subsets (groups)
- Tiles of the same group can be concurrently executed by multiple threads (see the sketch below)
- Barrier synchronization between threads
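One simple grouping consistent with these properties, shown as a hypothetical helper: with non-negative, non-zero inter-tile dependence vectors, a tile can only depend on tiles whose coordinate sum is strictly smaller, so tiles with equal coordinate sums are mutually independent.

/* Hypothetical illustration: take a tile's group to be the sum of its
 * coordinates.  Any inter-tile dependence strictly increases this sum,
 * so all tiles of one group may run concurrently. */
int group_of(const int *tile, int n)
{
    int g = 0;
    for (int d = 0; d < n; d++)
        g += tile[d];
    return g;
}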
12. Hyperplane Scheduling
Each tile is identified by (mpi_rank, omp_tid, tile) and mapped to a group.
13. Hyperplane Scheduling

#pragma omp parallel
{
  group_0 = pr_0
  ...
  group_{n-2} = pr_{n-2}
  tile_0 = pr_0 * m_0 + th_0
  ...
  tile_{n-2} = pr_{n-2} * m_{n-2} + th_{n-2}
  FOR(group_{n-1}) {
    tile_{n-1} = group_{n-1} - ...
    if (0 <= tile_{n-1} <= ...)
      compute(tile)
    #pragma omp barrier
  }
}
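A hedged C/OpenMP sketch of this schedule for n = 3. It assumes the group index equals the sum of the tile coordinates (so tile_{n-1} = group_{n-1} - (tile_0 + tile_1), which preserves the inter-tile dependencies) and a row-major thread layout th0 = tid / m1, th1 = tid % m1; compute_tile is a hypothetical helper.

#include <omp.h>

void compute_tile(int t0, int t1, int t2);   /* hypothetical helper */

/* pr0, pr1: MPI process coordinates; m0, m1: threads per dimension
 * inside a node; T2: tiles along the last dimension; n_groups: number
 * of hyperplanes to sweep. */
void hyperplane_schedule(int pr0, int pr1, int m0, int m1,
                         int T2, int n_groups)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int th0 = tid / m1, th1 = tid % m1;    /* assumed thread layout */
        int tile0 = pr0 * m0 + th0;
        int tile1 = pr1 * m1 + th1;

        for (int group = 0; group < n_groups; group++) {
            int tile2 = group - (tile0 + tile1);   /* assumed group index */
            if (tile2 >= 0 && tile2 < T2)
                compute_tile(tile0, tile1, tile2);
            #pragma omp barrier   /* next group starts only when all
                                     threads have finished this one */
        }
    }
}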
14. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
15. Fine-grain Model
- Incremental parallelization of computationally intensive parts
- Pure MPI hyperplane scheduling
- Inter-node communication outside of multi-threaded part (MPI_THREAD_MASTERONLY)
- Thread synchronization through implicit barrier of omp parallel directive
16. Fine-grain Model

FOR(group_{n-1}) {
  Pack(snd_buf, tile_{n-1} - 1, pr)
  MPI_Isend(snd_buf, dest(pr))
  MPI_Irecv(recv_buf, src(pr))
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num()
    if (valid(tile, thread_id, group_{n-1}))
      Compute(tile)
  }
  MPI_Waitall
  Unpack(recv_buf, tile_{n-1} - 1, pr)
}
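A hedged C sketch of this fine-grain (masteronly) loop, with hypothetical helpers pack_boundary, unpack_boundary, tile_is_valid and compute_my_tile: MPI calls stay outside the parallel region, and a fresh thread team is forked for every group.

#include <mpi.h>
#include <omp.h>

/* Hypothetical helpers. */
void pack_boundary(double *snd_buf, int group);
void unpack_boundary(const double *recv_buf, int group);
int  tile_is_valid(int thread_id, int group);
void compute_my_tile(int thread_id, int group);

void fine_grain_hybrid(double *snd_buf, double *recv_buf, int count,
                       int dest, int src, int n_groups)
{
    for (int group = 0; group < n_groups; group++) {
        MPI_Request req[2];
        pack_boundary(snd_buf, group - 1);
        MPI_Isend(snd_buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req[1]);

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            if (tile_is_valid(tid, group))
                compute_my_tile(tid, group);
        }   /* implicit barrier: every tile of the group is finished */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        unpack_boundary(recv_buf, group - 1);
    }
}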
17. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
18. Coarse-grain Model
- Threads are only initialized once
- SPMD paradigm (requires more programming effort)
- Inter-node communication inside multi-threaded part (requires MPI_THREAD_FUNNELED; see the initialization sketch below)
- Thread synchronization through explicit barrier (omp barrier directive)
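Because MPI calls are issued from inside the parallel region (by the master thread only), the MPI library must provide at least funneled thread support. A minimal sketch using the standard MPI-2 initialization call; the slides do not state whether their MPICH 1.2.5 setup used this exact call.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;
    /* Request funneled support: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library lacks MPI_THREAD_FUNNELED support\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }
    /* ... coarse-grain hybrid computation ... */
    MPI_Finalize();
    return 0;
}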
19. Coarse-grain Model

#pragma omp parallel
{
  thread_id = omp_get_thread_num()
  FOR(group_{n-1}) {
    #pragma omp master
    {
      Pack(snd_buf, tile_{n-1} - 1, pr)
      MPI_Isend(snd_buf, dest(pr))
      MPI_Irecv(recv_buf, src(pr))
    }
    if (valid(tile, thread_id, group_{n-1}))
      Compute(tile)
    #pragma omp master
    {
      MPI_Waitall
      Unpack(recv_buf, tile_{n-1} - 1, pr)
    }
    #pragma omp barrier
  }
}
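A hedged C sketch of the coarse-grain (SPMD) loop, with the same hypothetical helpers as the fine-grain sketch: one parallel region spans the whole sweep, communication is funneled through the master thread, and an explicit barrier closes every group.

#include <mpi.h>
#include <omp.h>

/* Hypothetical helpers, as in the fine-grain sketch. */
void pack_boundary(double *snd_buf, int group);
void unpack_boundary(const double *recv_buf, int group);
int  tile_is_valid(int thread_id, int group);
void compute_my_tile(int thread_id, int group);

void coarse_grain_hybrid(double *snd_buf, double *recv_buf, int count,
                         int dest, int src, int n_groups)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        MPI_Request req[2];          /* used by the master thread only */

        for (int group = 0; group < n_groups; group++) {
            #pragma omp master
            {
                pack_boundary(snd_buf, group - 1);
                MPI_Isend(snd_buf, count, MPI_DOUBLE, dest, 0,
                          MPI_COMM_WORLD, &req[0]);
                MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0,
                          MPI_COMM_WORLD, &req[1]);
            }   /* no implied barrier: other threads start computing */

            if (tile_is_valid(tid, group))
                compute_my_tile(tid, group);

            #pragma omp master
            {
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                unpack_boundary(recv_buf, group - 1);
            }
            #pragma omp barrier   /* computation and communication for this
                                     group complete before the next group */
        }
    }
}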
20. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
21. Experimental Results
- 8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM, kernel 2.4.20)
- MPICH v1.2.5 (--with-device=ch_p4, --with-comm=shared)
- Intel C compiler 7.0 (-O3 -mcpu=pentiumpro -static)
- Fast Ethernet interconnection
- ADI micro-kernel benchmark (3D)
22. Alternating Direction Implicit (ADI)
- Stencil computation used for solving partial differential equations
- Unitary data dependencies (an illustrative kernel with this dependence pattern is sketched below)
- 3D iteration space (X x Y x Z)
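A hypothetical kernel (not the actual ADI update formulas, which the slides do not show) with the same dependence structure: a 3D sweep in which every point uses the updated values of its predecessors along each dimension, i.e. unit flow dependencies in X, Y and Z.

/* Sweep with unit flow dependencies along all three dimensions;
 * the coefficients are illustrative only. */
void adi_like_sweep(int X, int Y, int Z, double A[X][Y][Z])
{
    for (int i = 1; i < X; i++)
        for (int j = 1; j < Y; j++)
            for (int k = 1; k < Z; k++)
                A[i][j][k] = 0.25 * (A[i][j][k] + A[i-1][j][k]
                                     + A[i][j-1][k] + A[i][j][k-1]);
}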
23. ADI, 2 dual-SMP nodes
24. ADI, X=128, Y=512, Z=8192, 2 nodes
25. ADI, X=256, Y=512, Z=8192, 2 nodes
26. ADI, X=512, Y=512, Z=8192, 2 nodes
27. ADI, X=512, Y=256, Z=8192, 2 nodes
28. ADI, X=512, Y=128, Z=8192, 2 nodes
29. ADI, X=128, Y=512, Z=8192, 2 nodes (computation vs. communication breakdown)
30. ADI, X=512, Y=128, Z=8192, 2 nodes (computation vs. communication breakdown)
31. Overview
- Introduction
- Pure Message-passing Model
- Hybrid Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions
- Future Work
32. Conclusions
- Tiled loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
- Hybrid models can be competitive with the pure message-passing paradigm
- The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
- Programming efficiently in OpenMP is not easier than programming efficiently in MPI
33. Future Work
- Application of the methodology to real applications and standard benchmarks
- Work balancing for the coarse-grain model
- Investigation of alternative topologies, irregular communication patterns
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)
34. Thank You!
Questions?