Title: ParCo 2003 Presentation
1. Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros, nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
2. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
3. Introduction
- Motivation:
  - SMP clusters
  - Hybrid programming models
  - Mostly fine-grain MPI-OpenMP paradigms
  - Mostly DOALL parallelization
4. Introduction
- Contribution:
  - 3 programming models for the parallelization of nested loop algorithms:
    - pure MPI
    - fine-grain hybrid MPI-OpenMP
    - coarse-grain hybrid MPI-OpenMP
  - Advanced hyperplane scheduling:
    - minimizes the need for synchronization
    - overlaps computation with communication
5. Introduction
- Algorithmic model:

  FOR j_0 = min_0 TO max_0 DO
    ...
    FOR j_{n-1} = min_{n-1} TO max_{n-1} DO
      Computation(j_0, ..., j_{n-1});
    ENDFOR
    ...
  ENDFOR

- Perfectly nested loops
- Constant flow data dependencies
6. Introduction
- Target architecture: SMP clusters
7. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
8. Pure MPI Model
- Tiling transformation groups iterations into atomic execution units (tiles)
- Pipelined execution
- Overlapping computation with communication
- Makes no distinction between inter-node and intra-node communication
9. Pure MPI Model
- Example (a plain C version of this loop nest follows below):

  FOR j1 = 0 TO 9 DO
    FOR j2 = 0 TO 7 DO
      A[j1,j2] = A[j1-1,j2] + A[j1,j2-1];
    ENDFOR
  ENDFOR
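For reference, a minimal plain C sketch of this example is given below. The indices are shifted by one and the array is padded with a boundary row and column so that the A[j1-1][j2] and A[j1][j2-1] reads stay in bounds; the array size padding and the initial values are illustrative choices, not taken from the slides.

  #include <stdio.h>

  #define N1 10   /* j1 range of the example: 10 iterations */
  #define N2 8    /* j2 range of the example: 8 iterations  */

  int main(void)
  {
      /* pad with one boundary row and column so the left/upper reads are defined */
      double A[N1 + 1][N2 + 1];

      for (int i = 0; i <= N1; i++)
          for (int j = 0; j <= N2; j++)
              A[i][j] = 1.0;                     /* arbitrary initial values */

      /* the loop nest of the example: constant flow dependencies (1,0) and (0,1) */
      for (int j1 = 1; j1 <= N1; j1++)
          for (int j2 = 1; j2 <= N2; j2++)
              A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];

      printf("A[%d][%d] = %g\n", N1, N2, A[N1][N2]);
      return 0;
  }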
10. Pure MPI Model
[Figure: the example iteration space tiled and mapped onto 4 MPI nodes (NODE0/NODE1, each with CPU0/CPU1)]
11. Pure MPI Model
[Figure: pipelined execution of the tiles on the 4 MPI nodes (NODE0/NODE1, each with CPU0/CPU1)]
12. Pure MPI Model

  tile_0 = nod_0;
  ...
  tile_{n-2} = nod_{n-2};
  FOR tile_{n-1} = 0 TO ... DO
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    Compute(tile);
    MPI_Waitall();
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  END FOR
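To make this pattern concrete, here is a minimal, self-contained C/MPI sketch of one pipelined dimension: each process packs the boundary produced so far, posts non-blocking send/receive, computes the current tile while the messages are in flight, and only then waits and unpacks. The buffer size, tile count, neighbour ranks, and the compute/pack/unpack bodies are illustrative assumptions, not the exact code behind the slides.

  #include <mpi.h>
  #include <stdlib.h>

  #define BUF    512    /* illustrative size of a tile boundary (elements)        */
  #define NTILES 16     /* illustrative number of tiles along the pipelined dim.  */

  /* illustrative stand-in for Compute(tile) */
  static void compute_tile(double *data, int n, int t)
  {
      for (int i = 1; i < n; i++)
          data[i] += data[i - 1] + (double)t;
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* neighbours along the pipeline; MPI_PROC_NULL makes edge transfers no-ops */
      int prev = (rank > 0) ? rank - 1 : MPI_PROC_NULL;          /* src(nod)  */
      int next = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;   /* dest(nod) */

      double *tile_data = calloc(BUF, sizeof(double));
      double *snd_buf   = calloc(BUF, sizeof(double));
      double *rcv_buf   = calloc(BUF, sizeof(double));

      for (int t = 0; t < NTILES; t++) {
          MPI_Request req[2];

          /* Pack the boundary and post the non-blocking calls, so the
             communication overlaps with the computation of the current tile */
          for (int i = 0; i < BUF; i++)
              snd_buf[i] = tile_data[i];
          MPI_Isend(snd_buf, BUF, MPI_DOUBLE, next, t, MPI_COMM_WORLD, &req[0]);
          MPI_Irecv(rcv_buf, BUF, MPI_DOUBLE, prev, t, MPI_COMM_WORLD, &req[1]);

          compute_tile(tile_data, BUF, t);                        /* Compute(tile) */

          MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

          /* Unpack: fold the received boundary into the local data for later tiles */
          for (int i = 0; i < BUF; i++)
              tile_data[i] += rcv_buf[i];
      }

      free(tile_data);
      free(snd_buf);
      free(rcv_buf);
      MPI_Finalize();
      return 0;
  }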
13. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
14. Hyperplane Scheduling
- Implements coarse-grain parallelism assuming inter-tile data dependencies
- Tiles are organized into data-independent subsets (groups)
- Tiles of the same group can be concurrently executed by multiple threads (see the small illustration after this slide)
- Barrier synchronization between threads
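As a small illustration of the grouping, the sketch below enumerates which tiles of a 2D tile space fall on each hyperplane, assuming (as is common for this kind of scheduling, though not spelled out on this slide) that the group index of a tile is the sum of its coordinates; tiles of the same group have no dependence on each other and can be computed concurrently. The 4x3 tile space is an arbitrary choice for the example.

  #include <stdio.h>

  #define T1 4   /* tiles along dimension 1 (arbitrary) */
  #define T2 3   /* tiles along dimension 2 (arbitrary) */

  int main(void)
  {
      /* group (hyperplane) index of tile (t1, t2) = t1 + t2 */
      for (int g = 0; g <= (T1 - 1) + (T2 - 1); g++) {
          printf("group %d:", g);
          for (int t1 = 0; t1 < T1; t1++) {
              int t2 = g - t1;                 /* solve t1 + t2 = g */
              if (t2 >= 0 && t2 < T2)
                  printf(" (%d,%d)", t1, t2);
          }
          printf("\n");
      }
      return 0;
  }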
15. Hyperplane Scheduling
[Figure: the example mapped onto 2 MPI nodes (NODE0/NODE1) x 2 OpenMP threads (CPU0/CPU1)]
16. Hyperplane Scheduling
[Figure: hyperplane groups executed on 2 MPI nodes x 2 OpenMP threads (NODE0/NODE1, CPU0/CPU1)]
17. Hyperplane Scheduling

  #pragma omp parallel
  {
    group_0 = nod_0;
    ...
    group_{n-2} = nod_{n-2};
    tile_0 = nod_0 * m_0 + th_0;
    ...
    tile_{n-2} = nod_{n-2} * m_{n-2} + th_{n-2};
    FOR(group_{n-1})
    {
      tile_{n-1} = group_{n-1} - ...;
      if (0 <= tile_{n-1} <= ...)
        compute(tile);
      #pragma omp barrier
    }
  }
18. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
19. Fine-grain Model
- Incremental parallelization of the computationally intensive parts
- Relatively straightforward to derive from pure MPI
- Threads are (re-)spawned at each computation phase
- Inter-node communication outside of the multi-threaded part
- Thread synchronization through the implicit barrier of the omp parallel directive
20. Fine-grain Model

  FOR(group_{n-1})
  {
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    #pragma omp parallel
    {
      thread_id = omp_get_thread_num();
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
    }
    MPI_Waitall();
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  }
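A condensed C sketch of this fine-grain structure, restricted to one pipelined dimension: MPI communication is issued by the single master thread outside the parallel region, threads are re-spawned for every group, and the implicit barrier at the end of omp parallel provides the intra-node synchronization. The group count, buffer size, and the valid()/compute helpers are illustrative assumptions, not the exact code behind the slides.

  #include <mpi.h>
  #include <omp.h>
  #include <stdlib.h>

  #define NGROUPS 16      /* illustrative number of hyperplane groups   */
  #define BUF     1024    /* illustrative boundary/tile size (elements) */

  /* illustrative stand-in for valid(tile, thread_id, group): does this
     thread own a tile of the current group?  Always true in this toy sketch. */
  static int valid(int thread_id, int group)
  {
      (void)thread_id;
      (void)group;
      return 1;
  }

  /* illustrative stand-in for Compute(tile): each thread updates its own slice */
  static void compute_slice(double *data, int lo, int hi, int group)
  {
      for (int i = lo; i < hi; i++)
          data[i] += (double)group;
  }

  int main(int argc, char **argv)
  {
      int provided;
      /* communication happens only between parallel regions, so
         FUNNELED-level thread support is sufficient here */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      (void)provided;   /* a real code would check the granted level */

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int prev = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
      int next = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      double *data    = calloc(BUF, sizeof(double));
      double *snd_buf = calloc(BUF, sizeof(double));
      double *rcv_buf = calloc(BUF, sizeof(double));

      for (int group = 0; group < NGROUPS; group++) {
          MPI_Request req[2];

          /* inter-node communication stays outside the multi-threaded part */
          for (int i = 0; i < BUF; i++) snd_buf[i] = data[i];            /* Pack   */
          MPI_Isend(snd_buf, BUF, MPI_DOUBLE, next, group, MPI_COMM_WORLD, &req[0]);
          MPI_Irecv(rcv_buf, BUF, MPI_DOUBLE, prev, group, MPI_COMM_WORLD, &req[1]);

          /* threads are (re)spawned for every group; the implicit barrier at the
             end of the parallel region synchronizes them */
          #pragma omp parallel
          {
              int thread_id = omp_get_thread_num();
              int nthr      = omp_get_num_threads();
              int chunk     = (BUF + nthr - 1) / nthr;
              int lo        = thread_id * chunk;
              int hi        = (lo + chunk < BUF) ? lo + chunk : BUF;

              if (valid(thread_id, group))
                  compute_slice(data, lo, hi, group);
          }

          MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
          for (int i = 0; i < BUF; i++) data[i] += rcv_buf[i];           /* Unpack */
      }

      free(data);
      free(snd_buf);
      free(rcv_buf);
      MPI_Finalize();
      return 0;
  }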
21. Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
22. Coarse-grain Model
- SPMD paradigm
- Requires more programming effort
- Threads are only spawned once
- Inter-node communication inside the multi-threaded part (requires MPI_THREAD_MULTIPLE)
- Thread synchronization through an explicit barrier (omp barrier directive)
23. Coarse-grain Model

  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    FOR(group_{n-1})
    {
      #pragma omp master
      {
        Pack(snd_buf, tile_{n-1} - 1, nod);
        MPI_Isend(snd_buf, dest(nod));
        MPI_Irecv(recv_buf, src(nod));
      }
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
      #pragma omp master
      {
        MPI_Waitall();
        Unpack(recv_buf, tile_{n-1} + 1, nod);
      }
      #pragma omp barrier
    }
  }
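A matching C sketch of the coarse-grain structure: threads are spawned once, the master thread performs the MPI communication inside the parallel region, and an explicit omp barrier closes every group. To keep the sketch race-free, the communication buffers are touched only by the master thread, and feeding the received boundary back into the computation is omitted; sizes and helpers are again illustrative assumptions rather than the slides' actual code.

  #include <mpi.h>
  #include <omp.h>
  #include <stdlib.h>

  #define NGROUPS 16      /* illustrative number of hyperplane groups   */
  #define BUF     1024    /* illustrative boundary/tile size (elements) */

  static void compute_slice(double *data, int lo, int hi, int group)
  {
      for (int i = lo; i < hi; i++)
          data[i] += (double)group;
  }

  int main(int argc, char **argv)
  {
      int provided;
      /* the slide's scheme asks for MPI_THREAD_MULTIPLE; this simplified
         master-only version would also work with MPI_THREAD_FUNNELED */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_FUNNELED)
          MPI_Abort(MPI_COMM_WORLD, 1);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int prev = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
      int next = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      double *data     = calloc(BUF, sizeof(double));  /* updated by all threads   */
      double *boundary = calloc(BUF, sizeof(double));  /* master-only, sent around */
      double *snd_buf  = calloc(BUF, sizeof(double));
      double *rcv_buf  = calloc(BUF, sizeof(double));
      MPI_Request req[2];

      /* threads are spawned once and stay alive for the whole computation */
      #pragma omp parallel
      {
          int thread_id = omp_get_thread_num();
          int nthr      = omp_get_num_threads();
          int chunk     = (BUF + nthr - 1) / nthr;
          int lo        = thread_id * chunk;
          int hi        = (lo + chunk < BUF) ? lo + chunk : BUF;

          for (int group = 0; group < NGROUPS; group++) {
              /* only the master thread talks to MPI */
              #pragma omp master
              {
                  for (int i = 0; i < BUF; i++) snd_buf[i] = boundary[i];   /* Pack  */
                  MPI_Isend(snd_buf, BUF, MPI_DOUBLE, next, group, MPI_COMM_WORLD, &req[0]);
                  MPI_Irecv(rcv_buf, BUF, MPI_DOUBLE, prev, group, MPI_COMM_WORLD, &req[1]);
              }

              compute_slice(data, lo, hi, group);     /* overlaps with the messages */

              #pragma omp master
              {
                  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                  for (int i = 0; i < BUF; i++) boundary[i] = rcv_buf[i];   /* Unpack */
              }

              /* explicit synchronization before the next group */
              #pragma omp barrier
          }
      }

      free(data);
      free(boundary);
      free(snd_buf);
      free(rcv_buf);
      MPI_Finalize();
      return 0;
  }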
24. Summary: Fine-grain vs Coarse-grain
- Thread management: fine-grain re-spawns threads at each parallel region; coarse-grain spawns threads only once
- Inter-node MPI communication: fine-grain performs it outside the multi-threaded region; coarse-grain performs it inside the multi-threaded region, assumed by the master thread
- Intra-node synchronization: fine-grain relies on the implicit barrier of omp parallel; coarse-grain uses an explicit OpenMP barrier
25. Overview
- Introduction
- Pure MPI model
- Hybrid MPI-OpenMP models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
26. Experimental Results
- 8-node SMP Linux cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
- MPICH v.1.2.5 (--with-device=ch_p4, --with-comm=shared)
- Intel C compiler 7.0 (-O3 -mcpu=pentiumpro -static)
- FastEthernet interconnection
- ADI micro-kernel benchmark (3D)
27. Alternating Direction Implicit (ADI)
- Unitary data dependencies
- 3D Iteration Space (X x Y x Z)
28. ADI, 4 nodes [results plot]
29. ADI, 4 nodes [results plot]
30. ADI X=512, Y=512, Z=8192, 4 nodes [results plot]
31. ADI X=128, Y=512, Z=8192, 4 nodes [results plot]
32. ADI X=512, Y=128, Z=8192, 4 nodes [results plot]
33. ADI, 2 nodes [results plot]
34. ADI, 2 nodes [results plot]
35. ADI X=128, Y=512, Z=8192, 2 nodes [results plot]
36. ADI X=256, Y=512, Z=8192, 2 nodes [results plot]
37. ADI X=512, Y=512, Z=8192, 2 nodes [results plot]
38. ADI X=512, Y=256, Z=8192, 2 nodes [results plot]
39. ADI X=512, Y=128, Z=8192, 2 nodes [results plot]
40. ADI X=128, Y=512, Z=8192, 2 nodes [plot: computation vs. communication breakdown]
41. ADI X=512, Y=128, Z=8192, 2 nodes [plot: computation vs. communication breakdown]
42. Overview
- Introduction
- Pure MPI model
- Hybrid MPI-OpenMP models
- Hyperplane Scheduling
- Fine-grain Model
- Coarse-grain Model
- Experimental Results
- Conclusions and Future Work
43. Conclusions
- Nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
- Hybrid models can be competitive with the pure MPI paradigm
- The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
- Programming efficiently in OpenMP is not easier than programming efficiently in MPI
44. Future Work
- Application of the methodology to real applications and benchmarks
- Work balancing for the coarse-grain model
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)
- Generalization as a compiler technique
45. Questions?
http://www.cslab.ece.ntua.gr/ndros