Matrix Transpose Results with Hybrid OpenMP / MPI

1
Matrix Transpose Results with Hybrid OpenMP / MPI
  • O. Haan
  • Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen, Germany (GWDG)

SCICOMP 2000, SDSC, La Jolla
2
Overview
  • Hybrid Programming Model
  • Distributed Matrix Transpose
  • Performance Measurements
  • Summary of Results

3
Architecture of Scalable Parallel Computers
  • Two-level hierarchy
  • cluster of SMP nodes: distributed memory, high-speed interconnect
  • SMP nodes with multiple processors: shared memory, bus- or switch-connected

4
Programming Models
  • message passing over all processors (MPI): implementation for shared memory, multiple access to switch adapters; SP: 4-way Winterhawk2, 8-way Nighthawk (-)
  • shared memory over all processors: virtual global address space; SP: (-)
  • hybrid message passing / shared memory: message passing between nodes, shared memory within nodes; SP

5
Hybrid Programming Model
  • SPMD program with MPI tasks
  • OpenMP threads within each task
  • communication between MPI tasks
6
Example of Hybrid Program
      program hybrid_example
      include 'mpif.h'
! declarations added; the slide omits them (array size assumed)
      integer com, ierr, nk, my_task, kp, my_thread, i
      integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
      real node_res, glob_res, thread_res(0:127)
      com = MPI_COMM_WORLD
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(com, nk, ierr)
      call MPI_COMM_RANK(com, my_task, ierr)
      kp = OMP_GET_NUM_PROCS()
!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
! work (not shown on the slide) computes one partial result per thread
      call work(my_thread, kp, my_task, nk, thread_res)
!$OMP END PARALLEL
      node_res = 0.0
      do i = 0, kp-1
         node_res = node_res + thread_res(i)
      end do
      call MPI_REDUCE(node_res, glob_res, 1,
     &                MPI_REAL, MPI_SUM, 0, com, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end

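On the SP a program like this would typically be compiled with the thread-safe MPI Fortran wrapper and OpenMP enabled (e.g. mpxlf_r -qsmp=omp) and started with one MPI task per node, with OMP_NUM_THREADS set to the number of processors per node; the exact invocation is installation dependent and not part of the slides.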
7
Hybrid Programming vs. Pure Message Passing
  • pro: works on all SP configurations
  • pro: coarser internode communication granularity
  • pro: faster intranode communication
  • con: larger programming effort
  • con: additional synchronization steps
  • con: reduced reuse of cached data

the net score depends on the problem
8
Distributed Matrix Transpose
9
3-step Transpose
n1 x n2 matrix A( i1, i2 )  ->  n2 x n1 matrix B( i2, i1 )
decompose n1, n2 into local and global parts: n1 = n1l * np , n2 = n2l * np
write matrices A, B as 4-dim arrays A( i1l, i1g, i2l, i2g ) , B( i2l, i2g, i1l, i1g )
step 1, local reorder:    A( i1l, i1g, i2l, i2g )   ->  a1( i1l, i2l, i1g, i2g )
step 2, global reorder:   a1( i1l, i2l, i1g, i2g )  ->  a2( i1l, i2l, i2g, i1g )
step 3, local transpose:  a2( i1l, i2l, i2g, i1g )  ->  B( i2l, i2g, i1l, i1g )
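The slides give only the index mapping. Below is a minimal Fortran sketch of the three steps, assuming real*8 data, n1 = n1l*np, n2 = n2l*np, one block of the last index per MPI task, and MPI_ALLTOALL for the global reorder; the subroutine name and the work arrays a1, a2 are illustrative, not from the talk.

      subroutine transpose_3step(A, B, a1, a2, n1l, n2l, np, comm)
      implicit none
      include 'mpif.h'
      integer n1l, n2l, np, comm, ierr
      integer i1l, i2l, i1g, i2g
      real*8 A(n1l, np, n2l), B(n2l, np, n1l)
      real*8 a1(n1l, n2l, np), a2(n1l, n2l, np)

! step 1: local reorder  A(i1l,i1g,i2l) -> a1(i1l,i2l,i1g)
!$OMP PARALLEL DO PRIVATE(i2l, i1l)
      do i1g = 1, np
         do i2l = 1, n2l
            do i1l = 1, n1l
               a1(i1l, i2l, i1g) = A(i1l, i1g, i2l)
            end do
         end do
      end do

! step 2: global reorder swaps the global indices i1g and i2g:
!         block i1g goes to task i1g, the block from task i2g arrives
      call MPI_ALLTOALL(a1, n1l*n2l, MPI_DOUBLE_PRECISION,
     &                  a2, n1l*n2l, MPI_DOUBLE_PRECISION,
     &                  comm, ierr)

! step 3: local transpose  a2(i1l,i2l,i2g) -> B(i2l,i2g,i1l)
!$OMP PARALLEL DO PRIVATE(i1l, i2l)
      do i2g = 1, np
         do i1l = 1, n1l
            do i2l = 1, n2l
               B(i2l, i2g, i1l) = a2(i1l, i2l, i2g)
            end do
         end do
      end do
      end

In the hybrid version the two local steps run multithreaded under OpenMP and one MPI task per node performs the exchange over nk partners; in the pure MPI version every processor runs its own task and the same exchange involves np partners.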
10
Local Steps: Copy with Reorder
  • data in memory: speed limited by the performance of the bus and memory subsystem; on the Winterhawk2 all processors share the same bus, bandwidth 1.6 GB/s
  • data in cache: speed limited by processor performance; the Winterhawk2 sustains one load plus one store per cycle, i.e. 8 B per cycle at 375 MHz, bandwidth about 3 GB/s

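The following slides show timing plots for these copy variants (the plots themselves are not reproduced in the transcript). For reference, a copy-with-reorder kernel of the kind being timed might look like this sketch; the loop structure is assumed, not taken from the talk.

      subroutine copy_reorder(a, b, n1l, n2l)
      implicit none
      integer n1l, n2l, i1l, i2l
      real*8 a(n1l, n2l), b(n2l, n1l)
! transposing copy: one load and one store per element
!$OMP PARALLEL DO PRIVATE(i1l)
      do i2l = 1, n2l
         do i1l = 1, n1l
            b(i2l, i1l) = a(i1l, i2l)
         end do
      end do
      end

When a and b fit into the cache the rate is limited by the processor (about 3 GB/s per CPU); for large arrays all four processors of a Winterhawk2 node compete for the 1.6 GB/s memory bus.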
11
Copy Data in Memory
12
Copy Prefetch
13
Copy Data in Cache
14
Global Reorder
a1( :, :, i1g, i2g )  ->  a2( :, :, i2g, i1g ) : global reorder on np processors in np steps
(diagram: processors p0, p1, p2 exchange one block per step in steps 0, 1, 2)
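Instead of a single MPI_ALLTOALL, the np-step schedule can be written as a cyclic sequence of pairwise exchanges. The partner assignment below (send to mod(p+s,np), receive from mod(p-s+np,np)) is an assumption for illustration; the transcript only states that the reorder proceeds in np steps.

      subroutine global_reorder(a1, a2, n1l, n2l, np, comm)
      implicit none
      include 'mpif.h'
      integer n1l, n2l, np, comm
      integer s, dest, src, my_task, ierr
      integer status(MPI_STATUS_SIZE)
      real*8 a1(n1l, n2l, np), a2(n1l, n2l, np)

      call MPI_COMM_RANK(comm, my_task, ierr)
      do s = 0, np-1
! in step s, block i1g = dest leaves this task and block i2g = src arrives
         dest = mod(my_task + s, np)
         src  = mod(my_task - s + np, np)
         call MPI_SENDRECV(a1(1, 1, dest+1), n1l*n2l,
     &        MPI_DOUBLE_PRECISION, dest, 0,
     &        a2(1, 1, src+1), n1l*n2l,
     &        MPI_DOUBLE_PRECISION, src, 0,
     &        comm, status, ierr)
      end do
      end

Step s = 0 degenerates into a copy of the task's own diagonal block, which MPI_SENDRECV to self handles correctly.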
15
Performance Modelling
  • Hardware model: nk nodes with kp procs each; np = nk * kp is the total processor count
  • Switch model: nk concurrent links between nodes; latency tlat, bandwidth c
  • Execution model for the hybrid reorder on nk nodes: nk steps with n1*n2 / nk^2 data per node
  • Execution model for the MPI reorder on np processors: np steps with n1*n2 / np^2 data per processor; switch links shared between the kp procs of a node

16
Performance Modelling
  • Hybrid timing model
  • MPI timing model
(formulas shown as images, not preserved in the transcript)
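From the step counts and data volumes given on the previous slide the two models can be reconstructed as follows (an inference, not copied from the slide):

  t_hybrid ≈ nk * ( tlat + n1*n2 / (nk^2 * c) )
  t_MPI    ≈ np * ( tlat + kp * n1*n2 / (np^2 * c) )

with the bandwidth per MPI task reduced to c/kp because the kp processors of a node share the switch link. Both models contain the same total bandwidth term n1*n2 / (nk * c); they differ in the number of latencies paid (nk versus np) and in the message size per step.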
17
Timing of Global Reorder (internode part)
18
Timing of Global Reorder (internode part)
19
Timing of Global Reorder
20
Timing of Transpose
21
Scaling of Transpose
22
Timing of Transpose Steps
23
Summary of Results: Hardware
  • Memory access in the Winterhawk2 is not adequate: a copy rate of 400 MB/s = 50 Mwords/s against a peak CPU rate of 6000 Mflop/s per node, a factor of about 100 between computational speed and memory speed
  • Sharing of a switch link by 4 processors degrades communication speed: bandwidth smaller by more than a factor of 4 (a factor of 4 was expected), latency larger by nearly a factor of 4 (a factor of 1 was expected)

24
Summary of Results: Hybrid vs. MPI
  • hybrid OpenMP / MPI programming is profitable for the distributed matrix transpose: a 1000 x 1000 matrix on 16 nodes is 2.3 times faster, a 10000 x 10000 matrix on 16 nodes 1.1 times faster than with pure MPI
  • competing influences: pure MPI programming enhances the reuse of cached data, while hybrid programming has lower communication latency and coarser communication granularity

25
Summary of Results: Use of Transpose in FFT
  • 2-dim complex array of size n

Execution time on nk nodes (formula shown as an image),
where r = computational speed per node and c = transpose speed per node;
effective execution speed per node (formula shown as an image)
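Neither formula is preserved in the transcript. A reconstruction consistent with the numbers on the next slide, assuming 5 n log2(n) floating point operations for the complex FFT of n elements and one distributed transpose moving 2n real words:

  t(nk) ≈ 5 n log2(n) / (nk * r) + 2 n / (nk * c)

  effective execution speed per node:
  r_eff = 5 n log2(n) / (nk * t(nk)) = r / ( 1 + 2 r / (5 c log2(n)) )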
26
Summary of Results: Use of Transpose in FFT - Example SP
r = 4 x 200 Mflop/s = 800 Mflop/s
c depends on n, nk and the programming model

nk = 16                                  n = 10^6   n = 10^9
c (hybrid)                               5.6        7.8     Mword/s
c (MPI)                                  2.5        7.0     Mword/s
effective execution speed (hybrid)       208        338     Mflop/s
effective execution speed (MPI)          108        317     Mflop/s
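As a consistency check of the reconstruction above: with nk = 16, n = 10^6, r = 800 Mflop/s and c = 5.6 Mword/s it gives r_eff = 800 / (1 + 2*800 / (5 * 5.6 * 19.9)) ≈ 207 Mflop/s, in line with the 208 Mflop/s quoted for the hybrid case; the other entries agree similarly.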