Title: Compiling shared memory programs for distributed memory machines
1. Compiling shared memory programs for distributed memory machines
2. Rationale
- We are used to writing shared memory programs.
- Sequential programs are, in a sense, all shared memory programs.
- There is a good chance that the future programming paradigm for large parallel machines will have the nature of shared memory programming.
- The memory in large (scalable) machines is most likely distributed (Blue Gene, Roadrunner, etc.).
- We would like to know how to compile shared memory programs for execution on distributed memory machines.
- So far, using the shared memory programming paradigm for large parallel machines has NOT been very successful (despite the huge amount of compiler effort).
- See "The Rise and Fall of High Performance Fortran" by Ken Kennedy et al.
3. for (I = 1; I < 100; I++) a(I) = b(I-1)
- First problem: the global arrays need to be partitioned.

      a:  | 0..24 | 25..49 | 50..74 | 75..99 |
      b:  | 0..24 | 25..49 | 50..74 | 75..99 |
              P0       P1       P2       P3
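As a concrete illustration (not from the original slides), the small C program below computes the block of the 100-element arrays owned by each of four processes, reproducing the P0..P3 layout sketched above; N, P, and block_bounds are names chosen for this sketch.

    #include <stdio.h>

    #define N 100   /* global array length, matching the running example */
    #define P 4     /* number of processes P0..P3 (assumed) */

    /* Half-open range [lo, hi) of global indices owned by rank p under a
       simple block distribution: 25 consecutive elements per process. */
    static void block_bounds(int p, int *lo, int *hi) {
        int block = (N + P - 1) / P;   /* ceil(N / P) = 25 */
        *lo = p * block;
        *hi = (p + 1) * block < N ? (p + 1) * block : N;
    }

    int main(void) {
        for (int p = 0; p < P; p++) {
            int lo, hi;
            block_bounds(p, &lo, &hi);
            printf("P%d owns a(%d..%d) and b(%d..%d)\n", p, lo, hi - 1, lo, hi - 1);
        }
        return 0;
    }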
4. Second problem: how to partition the computation in the loop?
- Partition the iteration space.
- The owner computes rule.
- What about referencing remote array elements?
  - Generate explicit communication when needed.

      for (I = 1; I < 100; I++) a(I) = b(I-1)

      a:  | 0..24 | 25..49 | 50..74 | 75..99 |
      b:  | 0..24 | 25..49 | 50..74 | 75..99 |
              P0       P1       P2       P3
5. A naive implementation
- Assume that every processor has space for the whole arrays.

      for (I = 1; I < 100; I++)
          if (I own b(I-1) but not a(I)) then
              send b(I-1) to the processor that owns a(I)
          if (I own a(I) but not b(I-1)) then
              receive b(I-1) from the processor that owns b(I-1)
          if (I own a(I)) then perform a(I) = b(I-1)

- The assumption is exactly what we don't want: too much space is required.
  - Replace the whole global array on each node with a partial array per node, plus buffer space for communication.
- Too many communications with small messages (each communication transfers a single element).
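A minimal sketch of this naive scheme in C with MPI, under the slide's assumption that every rank allocates the whole arrays and uses the block distribution of 25 elements per rank; the owner() helper and the choice of 4 ranks are illustrative. Every boundary element of b becomes its own one-element message, which is exactly the overhead criticized above.

    #include <mpi.h>

    #define N 100
    #define BLOCK 25                              /* 25 elements per rank, as above */

    static int owner(int i) { return i / BLOCK; } /* rank that owns global index i */

    /* Run with 4 MPI processes, e.g. mpirun -np 4 ./naive */
    int main(int argc, char **argv) {
        double a[N] = {0}, b[N];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++) b[i] = i;     /* arbitrary initial data */

        /* Naive translation of: for (I = 1; I < 100; I++) a(I) = b(I-1) */
        for (int i = 1; i < N; i++) {
            int owner_a = owner(i), owner_b = owner(i - 1);

            /* I own b(i-1) but not a(i): ship one element to the owner of a(i). */
            if (rank == owner_b && rank != owner_a)
                MPI_Send(&b[i - 1], 1, MPI_DOUBLE, owner_a, i, MPI_COMM_WORLD);

            /* I own a(i) but not b(i-1): receive that one element. */
            if (rank == owner_a && rank != owner_b)
                MPI_Recv(&b[i - 1], 1, MPI_DOUBLE, owner_b, i,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Owner computes rule: only the owner of a(i) does the assignment. */
            if (rank == owner_a)
                a[i] = b[i - 1];
        }

        MPI_Finalize();
        return 0;
    }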
6. Issues in compiling shared memory programs for distributed memory machines
- Data partitioning
- Computation partitioning
- Communication optimization
- Efficient code generation
7. Data partitioning
- Specified by hand: HPF (High Performance Fortran), Fortran D, Craft Fortran, Fortran 90D.
  - Allow manual specification of data alignment and data distribution.
  - Block, cyclic, and block-cyclic distributions.
  - In many cases, such simple distributions do not allow efficient code to be generated.
- Automatically determined by the compiler.
  - Done in a number of research compilers (Fx (CMU), Paradigm (UIUC), ...).
  - The compiler looks at the program and determines the data distribution that minimizes the communication requirement.
  - Different loops usually require different distributions.
  - Fancy distributions are usually useless.
  - Related to another difficult problem: communication optimization.
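To make the three standard distributions concrete, the sketch below (illustrative, not from the slides) gives the global-index-to-owner mapping for block, cyclic, and block-cyclic layouts; N, P, and the function names are assumptions.

    #include <stdio.h>

    #define N 100   /* array length, matching the running example */
    #define P 4     /* number of processes (assumed) */

    /* Owner of global index i under the three classic HPF-style distributions. */
    static int owner_block(int i)           { return i / ((N + P - 1) / P); } /* BLOCK     */
    static int owner_cyclic(int i)          { return i % P; }                 /* CYCLIC    */
    static int owner_cyclic_b(int i, int b) { return (i / b) % P; }           /* CYCLIC(b) */

    int main(void) {
        for (int i = 0; i < 8; i++)
            printf("i=%d  block->P%d  cyclic->P%d  cyclic(2)->P%d\n",
                   i, owner_block(i), owner_cyclic(i), owner_cyclic_b(i, 2));
        return 0;
    }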
8. Computation partitioning
- With the owner computes rule, the computation partitioning depends only on the data partitioning.
  - Communication is only needed for reads.
- Other computation partitioning techniques are also possible.
  - Communication may then also be needed for writes.
- The basic method for compiling a program is the same regardless of the computation partitioning method.
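A sketch of owner-computes computation partitioning for the running example a(I) = b(I-1): each process iterates only over the elements of a it owns, rather than scanning the whole iteration space with ownership guards. The names my_lo, my_hi, and local_loop are assumptions for this sketch.

    /* Owner-computes partitioning of: for (I = 1; I < 100; I++) a(I) = b(I-1).
       Each process executes only the iterations whose left-hand side a(I) it
       owns; my_lo/my_hi are this process's block bounds and a/b are its local
       (or replicated) arrays. */
    static void local_loop(double *a, const double *b, int my_lo, int my_hi) {
        int start = my_lo > 1 ? my_lo : 1;   /* clip to the global loop bounds */
        for (int i = start; i < my_hi; i++)
            a[i] = b[i - 1];   /* on ranks > 0, b[my_lo - 1] is the only remote
                                  read, so the compiler inserts one receive for
                                  it before this loop */
    }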
9. Communication optimization
- Message vectorization: combine the small messages inside a loop into one large message before the loop.

  Source loop:

      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  Naive translation (per-element messages):

      for (i = 1; i < 500; i++)
          send/recv b(i-1);  a(i) = b(i-1);
          send/recv c(i-1);  d(i) = c(i-1);
          send/recv c(i-1);  e(i) = c(i-1);

  After message vectorization:

      send/recv b(0..499)
      send/recv c(0..499)
      send/recv c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
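A hedged C/MPI sketch of the vectorized code on the receiving side, assuming (as the slide appears to) that the b and c blocks being read live on another rank called peer; the matching sends are not shown, and only one transfer per array is kept for clarity (the slide's duplicate transfer of c is removed on the next slide).

    #include <mpi.h>

    #define COUNT 500   /* loop length from the slide */

    /* Message vectorization: the per-iteration element transfers of b and c
       are hoisted out of the loop and become one COUNT-element message per
       array.  'peer' is the rank assumed to own b and c and runs matching
       MPI_Send calls (not shown). */
    static void vectorized(double *a, double *b, double *c, double *d, double *e,
                           int peer) {
        MPI_Recv(b, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(c, COUNT, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i < COUNT; i++) {   /* loop body is now communication-free */
            a[i] = b[i - 1];
            d[i] = c[i - 1];
            e[i] = c[i - 1];
        }
    }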
10. Communication optimization
- Redundant communication elimination: eliminate communications that have already been performed.
  - Can be done partially.

  Before:

      send/recv b(0..499)
      send/recv c(0..499)
      send/recv c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  After:

      send/recv b(0..499)
      send/recv c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
11. Communication optimization
- Communication scheduling: combine messages with the same pattern into one larger message.

  Before:

      send/recv b(0..499)
      send/recv c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  After:

      send/recv b(0..499) and c(0..499) as one message
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
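A sketch of the coalescing step in C with MPI: since b(0..499) and c(0..499) travel between the same pair of ranks, they are packed into one buffer and moved as a single message. The buffer layout and function names are assumptions.

    #include <mpi.h>
    #include <string.h>

    #define COUNT 500

    /* Sender side: pack b and c into one buffer and send a single message. */
    static void send_coalesced(const double *b, const double *c, int dest) {
        double buf[2 * COUNT];
        memcpy(buf,         b, COUNT * sizeof(double));
        memcpy(buf + COUNT, c, COUNT * sizeof(double));
        MPI_Send(buf, 2 * COUNT, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    /* Receiver side: receive the single message and unpack b and c. */
    static void recv_coalesced(double *b, double *c, int src) {
        double buf[2 * COUNT];
        MPI_Recv(buf, 2 * COUNT, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        memcpy(b, buf,         COUNT * sizeof(double));
        memcpy(c, buf + COUNT, COUNT * sizeof(double));
    }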
12. Communication optimization
- Overlap communication with computation.
- Separate the send and the receive as far as possible.

  Before:

      send/recv b(0..499) and c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  After:

      non-blocking send b(0..499)/c(0..499)
      ...
      recv b(0..499)/c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
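A sketch of the overlapped version in C with MPI, using non-blocking MPI_Irecv posted early and MPI_Waitall placed just before the first use; peer and the surrounding function are assumptions, and the sending side would use MPI_Isend symmetrically.

    #include <mpi.h>

    #define COUNT 500

    /* Overlap communication with computation: post the receives for b and c
       early, run independent work while the data is in flight, and wait only
       right before the loop that reads the data. */
    static void overlapped(double *a, double *b, double *c, double *d, double *e,
                           int peer) {
        MPI_Request req[2];

        MPI_Irecv(b, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(c, COUNT, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &req[1]);

        /* ... computation that does not read b or c can run here ... */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* wait just before first use */

        for (int i = 1; i < COUNT; i++) {
            a[i] = b[i - 1];
            d[i] = c[i - 1];
            e[i] = c[i - 1];
        }
    }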
13. Communication optimization
- Exploit pipelined communication.
- Allow smaller messages instead of one large message.

  Chunked version (10-element messages):

      for (i = 1; i < 500; i++)
          if (i % 10 == 0)
              send/recv 10 elements of b
              send/recv 10 elements of c
              send/recv 10 elements of c
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  Per-element version:

      for (i = 1; i < 500; i++)
          send/recv b(i-1);  send/recv c(i-1);  send/recv c(i-1);
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
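A sketch of the chunked, pipelined version in C with MPI: every 10 iterations the next 10-element chunks of b and c are received, so the producer can already be sending the next chunk while this rank computes on the current one. The chunk size comes from the slide; peer, the chunk alignment, and the matching sends (not shown) are assumptions.

    #include <mpi.h>

    #define COUNT 500
    #define CHUNK 10    /* chunk size from the slide */

    /* Pipelined communication: instead of one 500-element message before the
       loop, fetch the next CHUNK elements of b and c as they are about to be
       used.  'peer' owns the remote copies and runs matching MPI_Send calls. */
    static void pipelined(double *a, double *b, double *c, double *d, double *e,
                          int peer) {
        for (int i = 1; i < COUNT; i++) {
            if (i % CHUNK == 1) {   /* fetch the chunk covering b(i-1 .. i+CHUNK-2) */
                MPI_Recv(&b[i - 1], CHUNK, MPI_DOUBLE, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Recv(&c[i - 1], CHUNK, MPI_DOUBLE, peer, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            a[i] = b[i - 1];
            d[i] = c[i - 1];
            e[i] = c[i - 1];
        }
    }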
14. Generate efficient communication code
- Communication buffer management.
  - Cannot assume the whole array is present on every node.
- Packing and unpacking messages.
  - Memory references may not be contiguous, but communication is usually done in contiguous blocks.
  - Need to compute the address of each element and pack the elements into the buffer.
  - When the data distribution is complicated, the formula for computing which array elements to communicate may be very complicated.
  - Fancy data distribution methods result in high overheads.
- Considering the features of the underlying communication system.
  - If the hardware has two NICs, you may want to do something different.
- In general, this is a complicated process.
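A sketch of the pack/unpack step in C for one common non-contiguous case, a strided column of a row-major 2-D array (an illustrative assumption): the generated code computes each element's address and copies it into a contiguous buffer, and the receiver runs the mirror-image unpack.

    #include <stddef.h>

    /* Pack a non-contiguous column of a row-major nrows x ncols array into a
       contiguous communication buffer; the stride between elements is ncols. */
    static void pack_column(const double *array, int nrows, int ncols,
                            int col, double *buf) {
        for (int r = 0; r < nrows; r++)
            buf[r] = array[(size_t)r * ncols + col];
    }

    /* Unpack the received buffer back into the strided column. */
    static void unpack_column(double *array, int nrows, int ncols,
                              int col, const double *buf) {
        for (int r = 0; r < nrows; r++)
            array[(size_t)r * ncols + col] = buf[r];
    }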
15. Jacobi in HPF
16. Generated Jacobi code
17. HPF limitations
- HPF targets large distributed memory machines.
- The performance of HPF code is still not comparable to that of MPI programs.
- Not able to achieve portable performance.
  - Companies focus on optimizing for their customers' applications.
  - Architecture-dependent optimization is hard.
- HPF performance tuning (by the programmer) is problematic; users do not have many options to play with.
18. HPF performance issues
- The limited set of data distribution methods supported by the language.
  - Fancy distributions usually do not work, or cannot be handled effectively by the compiler.
  - Human programmers sometimes do fancy things.
- Good and bad sides to the lack of fine control:
  - Good for usability.
  - Bad for performance tuning.
- Insufficient support for irregular problems.
  - These are much harder to handle and cannot be made efficient with the simple data distribution schemes.
- Limited support for task parallelism.
  - Even OpenMP is slightly better here (parallel sections).
19. Last words
- HPF may not have been particularly successful by itself.
- It has had a profound impact on the development of parallel programming paradigms.
  - The state-of-the-art compilation techniques for distributed memory machines came mainly from HPF compilation research.
- It might come back, in a different form.