Title: Compiling shared memory programs for distributed memory machines
1. Compiling shared memory programs for distributed memory machines
2. Rationale
- We are used to writing shared memory programs.
- Sequential programs are, in a sense, all shared memory programs.
- There is a good chance that the future programming paradigm for large parallel machines will have the nature of shared memory programming.
- The memory in large (scalable) machines is most likely distributed (Blue Gene, Roadrunner, etc.).
- We would like to know how to compile shared memory programs for execution on distributed memory machines.
- So far, using the shared memory programming paradigm for large parallel machines has NOT been very successful (despite the huge amount of compiler effort).
- See "The Rise and Fall of High Performance Fortran" by Ken Kennedy et al.
3. for (I = 1; I < 100; I++) a(I) = b(I-1)
- First problem: the global arrays need to be partitioned.

      a:  | 0..24 | 25..49 | 50..74 | 75..99 |
      b:  | 0..24 | 25..49 | 50..74 | 75..99 |
              P0       P1       P2       P3
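As a concrete illustration (not from the original slides), the small C program below computes the block of the 100-element arrays owned by each of four processes, reproducing the P0..P3 layout sketched above; N, P, and block_bounds are names chosen for this sketch.

    #include <stdio.h>

    #define N 100   /* global array length, matching the running example */
    #define P 4     /* number of processes P0..P3 (assumed) */

    /* Half-open range [lo, hi) of global indices owned by rank p under a
       simple block distribution: 25 consecutive elements per process. */
    static void block_bounds(int p, int *lo, int *hi) {
        int block = (N + P - 1) / P;   /* ceil(N / P) = 25 */
        *lo = p * block;
        *hi = (p + 1) * block < N ? (p + 1) * block : N;
    }

    int main(void) {
        for (int p = 0; p < P; p++) {
            int lo, hi;
            block_bounds(p, &lo, &hi);
            printf("P%d owns a(%d..%d) and b(%d..%d)\n", p, lo, hi - 1, lo, hi - 1);
        }
        return 0;
    }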
4. Second problem: how to partition the computation in the loop?
- Partition the iteration space.
- The owner computes rule.
- What about referencing remote array elements?
  - Generate explicit communication when needed.

      for (I = 1; I < 100; I++) a(I) = b(I-1)

      a:  | 0..24 | 25..49 | 50..74 | 75..99 |
      b:  | 0..24 | 25..49 | 50..74 | 75..99 |
              P0       P1       P2       P3
5. A naive implementation
- Assume that every processor has space for the whole arrays.

      for (I = 1; I < 100; I++)
          if (I own b(I-1) but not a(I)) then
              send b(I-1) to the processor that owns a(I)
          if (I own a(I) but not b(I-1)) then
              receive b(I-1) from the processor that owns b(I-1)
          if (I own a(I)) then perform a(I) = b(I-1)

- The assumption is exactly what we don't want: too much space is required.
  - Replace the whole global array on each node with a partial array per node, plus buffer space for communication.
- Too many communications with small messages (each communication transfers a single element).
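A minimal sketch of this naive scheme in C with MPI, under the slide's assumption that every rank allocates the whole arrays and uses the block distribution of 25 elements per rank; the owner() helper and the choice of 4 ranks are illustrative. Every boundary element of b becomes its own one-element message, which is exactly the overhead criticized above.

    #include <mpi.h>

    #define N 100
    #define BLOCK 25                              /* 25 elements per rank, as above */

    static int owner(int i) { return i / BLOCK; } /* rank that owns global index i */

    /* Run with 4 MPI processes, e.g. mpirun -np 4 ./naive */
    int main(int argc, char **argv) {
        double a[N] = {0}, b[N];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++) b[i] = i;     /* arbitrary initial data */

        /* Naive translation of: for (I = 1; I < 100; I++) a(I) = b(I-1) */
        for (int i = 1; i < N; i++) {
            int owner_a = owner(i), owner_b = owner(i - 1);

            /* I own b(i-1) but not a(i): ship one element to the owner of a(i). */
            if (rank == owner_b && rank != owner_a)
                MPI_Send(&b[i - 1], 1, MPI_DOUBLE, owner_a, i, MPI_COMM_WORLD);

            /* I own a(i) but not b(i-1): receive that one element. */
            if (rank == owner_a && rank != owner_b)
                MPI_Recv(&b[i - 1], 1, MPI_DOUBLE, owner_b, i,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Owner computes rule: only the owner of a(i) does the assignment. */
            if (rank == owner_a)
                a[i] = b[i - 1];
        }

        MPI_Finalize();
        return 0;
    }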
6. Issues in compiling shared memory programs for distributed memory machines
- Data partitioning
- Computation partitioning
- Communication optimization
- Efficient code generation
7. Data partitioning
- Specified by hand: HPF (High Performance Fortran), Fortran D, Craft Fortran, Fortran 90D.
  - Allow manual specification of data alignment and data distribution.
  - Block, cyclic, and block-cyclic distributions.
  - In many cases, such simple distributions do not allow efficient code to be generated.
- Automatically determined by the compiler.
  - Done in a number of research compilers (Fx (CMU), Paradigm (UIUC), ...).
  - The compiler looks at the program and determines the data distribution that minimizes the communication requirement.
  - Different loops usually require different distributions.
  - Fancy distributions are usually useless.
  - Related to another difficult problem: communication optimization.
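To make the three standard distributions concrete, the sketch below (illustrative, not from the slides) gives the global-index-to-owner mapping for block, cyclic, and block-cyclic layouts; N, P, and the function names are assumptions.

    #include <stdio.h>

    #define N 100   /* array length, matching the running example */
    #define P 4     /* number of processes (assumed) */

    /* Owner of global index i under the three classic HPF-style distributions. */
    static int owner_block(int i)           { return i / ((N + P - 1) / P); } /* BLOCK     */
    static int owner_cyclic(int i)          { return i % P; }                 /* CYCLIC    */
    static int owner_cyclic_b(int i, int b) { return (i / b) % P; }           /* CYCLIC(b) */

    int main(void) {
        for (int i = 0; i < 8; i++)
            printf("i=%d  block->P%d  cyclic->P%d  cyclic(2)->P%d\n",
                   i, owner_block(i), owner_cyclic(i), owner_cyclic_b(i, 2));
        return 0;
    }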
8. Computation partitioning
- With the owner computes rule, the computation partitioning depends only on the data partitioning.
  - Communication is only needed for reads.
- Other computation partitioning techniques are also possible.
  - Communication may then also be needed for writes.
- The basic method for compiling a program is the same regardless of the computation partitioning method.
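A sketch of owner-computes computation partitioning for the running example a(I) = b(I-1): each process iterates only over the elements of a it owns, rather than scanning the whole iteration space with ownership guards. The names my_lo, my_hi, and local_loop are assumptions for this sketch.

    /* Owner-computes partitioning of: for (I = 1; I < 100; I++) a(I) = b(I-1).
       Each process executes only the iterations whose left-hand side a(I) it
       owns; my_lo/my_hi are this process's block bounds and a/b are its local
       (or replicated) arrays. */
    static void local_loop(double *a, const double *b, int my_lo, int my_hi) {
        int start = my_lo > 1 ? my_lo : 1;   /* clip to the global loop bounds */
        for (int i = start; i < my_hi; i++)
            a[i] = b[i - 1];   /* on ranks > 0, b[my_lo - 1] is the only remote
                                  read, so the compiler inserts one receive for
                                  it before this loop */
    }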
9. Communication optimization
- Message vectorization: combine the small messages inside a loop into one large message before the loop.

  Source loop:

      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  Naive translation (per-element messages):

      for (i = 1; i < 500; i++)
          send/recv b(i-1);  a(i) = b(i-1);
          send/recv c(i-1);  d(i) = c(i-1);
          send/recv c(i-1);  e(i) = c(i-1);

  After message vectorization:

      send/recv b(0..499)
      send/recv c(0..499)
      send/recv c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
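A hedged C/MPI sketch of the vectorized code on the receiving side, assuming (as the slide appears to) that the b and c blocks being read live on another rank called peer; the matching sends are not shown, and only one transfer per array is kept for clarity (the slide's duplicate transfer of c is removed on the next slide).

    #include <mpi.h>

    #define COUNT 500   /* loop length from the slide */

    /* Message vectorization: the per-iteration element transfers of b and c
       are hoisted out of the loop and become one COUNT-element message per
       array.  'peer' is the rank assumed to own b and c and runs matching
       MPI_Send calls (not shown). */
    static void vectorized(double *a, double *b, double *c, double *d, double *e,
                           int peer) {
        MPI_Recv(b, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(c, COUNT, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i < COUNT; i++) {   /* loop body is now communication-free */
            a[i] = b[i - 1];
            d[i] = c[i - 1];
            e[i] = c[i - 1];
        }
    }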
10. Communication optimization
- Redundant communication elimination: eliminate communications that have already been performed.
  - Can be done partially.

  Before:

      send/recv b(0..499)
      send/recv c(0..499)
      send/recv c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  After:

      send/recv b(0..499)
      send/recv c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
11. Communication optimization
- Communication scheduling: combine messages with the same pattern into one larger message.

  Before:

      send/recv b(0..499)
      send/recv c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  After:

      send/recv b(0..499) and c(0..499) as one message
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
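A sketch of the coalescing step in C with MPI: since b(0..499) and c(0..499) travel between the same pair of ranks, they are packed into one buffer and moved as a single message. The buffer layout and function names are assumptions.

    #include <mpi.h>
    #include <string.h>

    #define COUNT 500

    /* Sender side: pack b and c into one buffer and send a single message. */
    static void send_coalesced(const double *b, const double *c, int dest) {
        double buf[2 * COUNT];
        memcpy(buf,         b, COUNT * sizeof(double));
        memcpy(buf + COUNT, c, COUNT * sizeof(double));
        MPI_Send(buf, 2 * COUNT, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    /* Receiver side: receive the single message and unpack b and c. */
    static void recv_coalesced(double *b, double *c, int src) {
        double buf[2 * COUNT];
        MPI_Recv(buf, 2 * COUNT, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        memcpy(b, buf,         COUNT * sizeof(double));
        memcpy(c, buf + COUNT, COUNT * sizeof(double));
    }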
12. Communication optimization
- Overlap communication with computation.
- Separate the send and the receive as far as possible.

  Before:

      send/recv b(0..499) and c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  After:

      non-blocking send b(0..499)/c(0..499)
      ...
      recv b(0..499)/c(0..499)
      for (i = 1; i < 500; i++)
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
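A sketch of the overlapped version in C with MPI, using non-blocking MPI_Irecv posted early and MPI_Waitall placed just before the first use; peer and the surrounding function are assumptions, and the sending side would use MPI_Isend symmetrically.

    #include <mpi.h>

    #define COUNT 500

    /* Overlap communication with computation: post the receives for b and c
       early, run independent work while the data is in flight, and wait only
       right before the loop that reads the data. */
    static void overlapped(double *a, double *b, double *c, double *d, double *e,
                           int peer) {
        MPI_Request req[2];

        MPI_Irecv(b, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(c, COUNT, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &req[1]);

        /* ... computation that does not read b or c can run here ... */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* wait just before first use */

        for (int i = 1; i < COUNT; i++) {
            a[i] = b[i - 1];
            d[i] = c[i - 1];
            e[i] = c[i - 1];
        }
    }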
13. Communication optimization
- Exploit pipelined communication.
- Allow smaller messages instead of one large message.

  Chunked version (10-element messages):

      for (i = 1; i < 500; i++)
          if (i % 10 == 0)
              send/recv 10 elements of b
              send/recv 10 elements of c
              send/recv 10 elements of c
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);

  Per-element version:

      for (i = 1; i < 500; i++)
          send/recv b(i-1);  send/recv c(i-1);  send/recv c(i-1);
          a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
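A sketch of the chunked, pipelined version in C with MPI: every 10 iterations the next 10-element chunks of b and c are received, so the producer can already be sending the next chunk while this rank computes on the current one. The chunk size comes from the slide; peer, the chunk alignment, and the matching sends (not shown) are assumptions.

    #include <mpi.h>

    #define COUNT 500
    #define CHUNK 10    /* chunk size from the slide */

    /* Pipelined communication: instead of one 500-element message before the
       loop, fetch the next CHUNK elements of b and c as they are about to be
       used.  'peer' owns the remote copies and runs matching MPI_Send calls. */
    static void pipelined(double *a, double *b, double *c, double *d, double *e,
                          int peer) {
        for (int i = 1; i < COUNT; i++) {
            if (i % CHUNK == 1) {   /* fetch the chunk covering b(i-1 .. i+CHUNK-2) */
                MPI_Recv(&b[i - 1], CHUNK, MPI_DOUBLE, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Recv(&c[i - 1], CHUNK, MPI_DOUBLE, peer, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            a[i] = b[i - 1];
            d[i] = c[i - 1];
            e[i] = c[i - 1];
        }
    }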
14. Generate efficient communication code
- Communication buffer management.
  - Cannot assume the whole array is present on every node.
- Packing and unpacking messages.
  - Memory references may not be contiguous, but communication is usually done in contiguous blocks.
  - Need to compute the address of each element and pack the elements into the buffer.
  - When the data distribution is complicated, the formula for computing which array elements to communicate may be very complicated.
  - Fancy data distribution methods result in high overheads.
- Considering the features of the underlying communication system.
  - If the hardware has two NICs, you may want to do something different.
- In general, this is a complicated process.
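A sketch of the pack/unpack step in C for one common non-contiguous case, a strided column of a row-major 2-D array (an illustrative assumption): the generated code computes each element's address and copies it into a contiguous buffer, and the receiver runs the mirror-image unpack.

    #include <stddef.h>

    /* Pack a non-contiguous column of a row-major nrows x ncols array into a
       contiguous communication buffer; the stride between elements is ncols. */
    static void pack_column(const double *array, int nrows, int ncols,
                            int col, double *buf) {
        for (int r = 0; r < nrows; r++)
            buf[r] = array[(size_t)r * ncols + col];
    }

    /* Unpack the received buffer back into the strided column. */
    static void unpack_column(double *array, int nrows, int ncols,
                              int col, const double *buf) {
        for (int r = 0; r < nrows; r++)
            array[(size_t)r * ncols + col] = buf[r];
    }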
15. Jacobi in HPF
16. Generated Jacobi code
17. HPF limitations
- HPF targets large distributed memory machines.
- The performance of HPF code is still not comparable to that of MPI programs.
- Not able to achieve portable performance.
  - Companies focus on optimizing for their customers' applications.
  - Architecture-dependent optimization is hard.
- HPF performance tuning (by the programmer) is problematic; users do not have many options to play with.
18. HPF performance issues
- The limited set of data distribution methods supported by the language.
  - Fancy distributions usually do not work, or cannot be handled effectively by the compiler.
  - Human programmers sometimes do fancy things.
- Good and bad sides to the lack of fine control:
  - Good for usability.
  - Bad for performance tuning.
- Insufficient support for irregular problems.
  - These are much harder to handle and cannot be made efficient with the simple data distribution schemes.
- Limited support for task parallelism.
  - Even OpenMP is slightly better here (parallel sections).
19. Last words
- HPF may not have been particularly successful by itself.
- It has had a profound impact on the development of parallel programming paradigms.
  - The state-of-the-art compilation techniques for distributed memory machines came mainly from HPF compilation research.
- It might come back, in a different form.