1
Compiling shared memory programs for distributed
memory machines
2
Rationale
  • We are used to writing shared memory programs
  • Sequential programs are all shared memory
    programs in a sense.
  • There is a good chance that the future
    programming paradigm for large parallel machines
    will have the flavor of shared memory
    programming.
  • The memory in large (scalable) machines is most
    likely distributed (Blue Gene, Roadrunner, etc.).
  • We would like to know how to compile shared
    memory programs for execution on distributed
    memory machines.
  • So far, using the shared memory programming
    paradigm for large parallel machines has NOT been
    very successful (despite a huge amount of
    compiler effort).
  • See "The Rise and Fall of High Performance
    Fortran" by Ken Kennedy et al.

3
  • An example

    for (I = 1; I < 100; I++)
        a(I) = b(I-1)

First problem: the global arrays need to be partitioned.

(Figure: arrays a and b, 100 elements each, block-partitioned across four
processors: P0 owns 0..24, P1 owns 25..49, P2 owns 50..74, P3 owns 75..99.)
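As a concrete sketch (not from the slides), the block bounds a generated
SPMD program might use for this partitioning; the names lower_bound and
upper_bound and the assumption that N divides evenly by P are illustrative.

    /* Block distribution of N elements over P processes:
       process p owns indices p*(N/P) .. (p+1)*(N/P)-1.      */
    #include <stdio.h>

    #define N 100   /* global array size   */
    #define P 4     /* number of processes */

    static int lower_bound(int p) { return p * (N / P); }
    static int upper_bound(int p) { return (p + 1) * (N / P) - 1; }

    int main(void) {
        for (int p = 0; p < P; p++)
            printf("P%d owns a[%d..%d] and b[%d..%d]\n", p,
                   lower_bound(p), upper_bound(p),
                   lower_bound(p), upper_bound(p));
        return 0;
    }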
4
  • Second problem: how to partition the computation
    in the loop?
  • Partition the iteration space.
  • The owner-computes rule (a sketch follows the
    figure below).
  • What about referencing remote array elements?
  • Generate explicit communication when needed.

    for (I = 1; I < 100; I++)
        a(I) = b(I-1)

(Figure: arrays a and b, 100 elements each, block-partitioned across four
processors: P0 owns 0..24, P1 owns 25..49, P2 owns 50..74, P3 owns 75..99.)
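A minimal sketch (not from the slides) of the owner-computes rule applied
to this loop under the block distribution above; my_rank and the array
arguments are illustrative, and each rank is assumed to address the
arrays with global indices.

    /* Owner-computes: each rank executes only the iterations whose
       left-hand side a(I) it owns.                                  */
    #define N 100
    #define P 4

    void owner_computes_loop(int my_rank, double *a, const double *b) {
        int lb = my_rank * (N / P);            /* first owned index       */
        int ub = (my_rank + 1) * (N / P) - 1;  /* last owned index        */
        int start = (lb > 1) ? lb : 1;         /* global loop starts at 1 */

        for (int I = start; I <= ub; I++) {
            /* b[I-1] is remote only when I == lb and lb > 0; that single
               element must be communicated first (next slide).          */
            a[I] = b[I - 1];
        }
    }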
5
  • A naive implementation
  • Assuming that all processors have space for the
    whole arrays (a hedged MPI sketch of this scheme
    follows this list).

    for (I = 1; I < 100; I++) {
        if (I own b(I-1) but not a(I))
            send b(I-1) to the processor that owns a(I)
        if (I own a(I) but not b(I-1))
            receive b(I-1) from the processor that owns b(I-1)
        if (I own a(I))
            perform a(I) = b(I-1)
    }

  • The assumption is exactly what we do not want:
  • Too much space is required.
  • Instead, replace the whole global array on each
    node with a partial array and allocate buffer
    space for communication.
  • Too many communications with small messages (each
    communication carries a single element).
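A hedged MPI sketch of this naive scheme, keeping the slide's assumption
that every rank allocates the whole arrays; owner() and the four-rank
block distribution are illustrative.

    #include <mpi.h>

    #define N 100

    static int owner(int i) { return i / (N / 4); }   /* 4 ranks assumed */

    /* One send or receive per remote element: correct but very slow. */
    void naive_loop(int rank, double *a, double *b) {
        for (int I = 1; I < N; I++) {
            int src = owner(I - 1);            /* owns b(I-1) */
            int dst = owner(I);                /* owns a(I)   */
            if (src != dst) {
                if (rank == src)
                    MPI_Send(&b[I - 1], 1, MPI_DOUBLE, dst, I,
                             MPI_COMM_WORLD);
                else if (rank == dst)
                    MPI_Recv(&b[I - 1], 1, MPI_DOUBLE, src, I,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            if (rank == dst)                   /* owner-computes */
                a[I] = b[I - 1];
        }
    }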

6
  • Issues for compiling shared memory programs for
    distributed memory machines
  • Data partitioning
  • Computational partitioning
  • Communication optimization
  • Efficient code generation

7
  • Data partitioning
  • Specified by hand: HPF (High Performance
    Fortran), Fortran D, Craft Fortran, Fortran 90D.
  • Allowing manual specification of data alignment
    and data distribution.
  • Block, cyclic, and block-cyclic distributions
    (see the mapping sketch below).
  • In many cases, such simple distributions do not
    allow efficient code to be generated.
  • Automatically determined by the compiler
  • In a number of research compilers (Fx (CMU),
    PARADIGM (UIUC), etc.).
  • The compiler looks at the program and determines
    the data distribution that minimizes the
    communication requirement.
  • Different loops usually require different
    distributions.
  • Fancy distributions are usually useless.
  • Related to another difficult problem:
    communication optimization.
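A sketch (not from the slides) of the index-to-owner mappings behind the
three standard distributions; N, P, and the block-cyclic block size B are
illustrative parameters.

    /* Owner of global index i under each distribution. */
    #define N 100   /* array size          */
    #define P 4     /* number of processes */
    #define B 5     /* block-cyclic block  */

    int owner_block(int i)        { return i / ((N + P - 1) / P); }
    int owner_cyclic(int i)       { return i % P; }
    int owner_block_cyclic(int i) { return (i / B) % P; }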

8
  • Computation partitioning
  • With the owner computes rule, the computation
    partitioning only depends on the data
    partitioning.
  • Communication is only needed for read.
  • Other computation partitioning techniques are
    also possible.
  • Communication may also be needed for writes.
  • The basic method for compiling a program is the
    same regardless of the computation partitioning
    method.

9
  • Communication optimization
  • Message vectorization: combining the small
    messages within a loop into a large message
    before the loop (an MPI sketch follows the
    example below).

Original loop:

    for (i = 1; i < 500; i++) {
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }

Element-wise communication:

    for (i = 1; i < 500; i++) {
        send/recv b(i-1);  a(i) = b(i-1);
        send/recv c(i-1);  d(i) = c(i-1);
        send/recv c(i-1);  e(i) = c(i-1);
    }

After message vectorization:

    send/recv b(0..499)
    send/recv c(0..499)
    send/recv c(0..499)
    for (i = 1; i < 500; i++) {
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }
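A hedged MPI sketch of message vectorization for the a(i) = b(i-1) part of
the example under a block distribution: the one remote element each rank
needs is fetched once, before the loop. The lb/ub bounds, the neighbor
ranks, and the global-index addressing are illustrative assumptions.

    #include <mpi.h>

    void vectorized(int rank, int nprocs, int lb, int ub,
                    double *a, double *b) {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* One boundary exchange before the loop: send my last owned b
           element to the right, receive the element just below my block
           from the left.  The dummy is only used when there is no left
           neighbor (the receive from MPI_PROC_NULL does nothing).       */
        double dummy;
        double *recv_lo = (lb > 0) ? &b[lb - 1] : &dummy;
        MPI_Sendrecv(&b[ub], 1, MPI_DOUBLE, right, 0,
                     recv_lo, 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Communication-free local loop (owner-computes). */
        for (int i = (lb > 1 ? lb : 1); i <= ub; i++)
            a[i] = b[i - 1];
    }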
10
  • Communication optimization
  • Redundant communication elimination: eliminate
    communications that have already been performed.
  • Can be done partially.

Before:

    send/recv b(0..499)
    send/recv c(0..499)
    send/recv c(0..499)
    for (i = 1; i < 500; i++) {
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }

After eliminating the redundant exchange of c:

    send/recv b(0..499)
    send/recv c(0..499)
    for (i = 1; i < 500; i++) {
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }
11
  • Communication optimization
  • Communication scheduling: combining messages with
    the same pattern into one larger message (an MPI
    sketch follows the example below).

Before:

    send/recv b(0..499)
    send/recv c(0..499)
    for (i = 1; i < 500; i++) {
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }

After combining the two messages:

    send/recv b(0..499) and c(0..499)
    for (i = 1; i < 500; i++) {
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }
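A hedged MPI sketch of combining the b and c boundary exchanges into one
message for the block-distributed example; lb/ub and the neighbor
computation are illustrative.

    #include <mpi.h>

    void combined_exchange(int rank, int nprocs, int lb, int ub,
                           double *b, double *c) {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        double sendbuf[2] = { b[ub], c[ub] };   /* pack b and c together */
        double recvbuf[2];

        MPI_Sendrecv(sendbuf, 2, MPI_DOUBLE, right, 0,
                     recvbuf, 2, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (lb > 0) {                           /* unpack what the loop needs */
            b[lb - 1] = recvbuf[0];
            c[lb - 1] = recvbuf[1];
        }
    }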
12
  • Communication optimization
  • Overlap communication with computation.
  • Separate the send and the receive as far as
    possible (a non-blocking MPI sketch follows the
    example below).

Before:

    send/recv b(0..499) and c(0..499)
    for (i = 1; i < 500; i++) {
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }

After splitting into a non-blocking send and a late receive:

    non-blocking send of b(0..499) and c(0..499)
    ... other computation ...
    recv b(0..499) and c(0..499)
    for (i = 1; i < 500; i++) {
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }
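A hedged non-blocking MPI sketch of this overlap for the a(i) = b(i-1)
part of the example: the boundary exchange is started early, the interior
iterations (which need no remote data) run while the messages are in
flight, and the wait happens only before the boundary iteration. Names
and the block-distribution assumptions are illustrative.

    #include <mpi.h>

    void overlapped(int rank, int nprocs, int lb, int ub,
                    double *a, double *b) {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Request reqs[2];
        double dummy;                 /* used only when there is no left neighbor */

        MPI_Isend(&b[ub], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv((lb > 0) ? &b[lb - 1] : &dummy, 1, MPI_DOUBLE, left, 0,
                  MPI_COMM_WORLD, &reqs[1]);

        /* Interior iterations do not depend on the incoming element,
           so they run while the messages are in flight.             */
        for (int i = lb + 1; i <= ub; i++)
            a[i] = b[i - 1];

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* Boundary iteration, now that b(lb-1) has arrived. */
        if (lb > 0)
            a[lb] = b[lb - 1];
    }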
13
  • Communication optimization
  • Exploit pipelined communication.
  • Allow smaller messages instead of one large
    message (a chunked MPI sketch follows the example
    below).

Element-wise communication:

    for (i = 1; i < 500; i++) {
        send/recv b(i-1)
        send/recv c(i-1)
        send/recv c(i-1)
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }

Pipelined version (exchange 10 elements every 10 iterations):

    for (i = 1; i < 500; i++) {
        if (i % 10 == 0) {
            send/recv 10 elements of b
            send/recv 10 elements of c
            send/recv 10 elements of c
        }
        a(i) = b(i-1);  d(i) = c(i-1);  e(i) = c(i-1);
    }
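A hedged MPI sketch of the pipelined pattern on the receiving side: remote
data arrives in chunks of 10 elements and is consumed as it arrives, so the
communication of later chunks overlaps with computation on earlier ones.
The function, its arguments, and the producing rank (assumed to send
matching chunks) are all illustrative.

    #include <mpi.h>

    #define CHUNK 10

    /* The consuming rank receives b in CHUNK-element pieces and computes
       on each piece as soon as it arrives, instead of waiting for one
       large message.  a is assumed to have nelems + 1 elements.        */
    void pipelined_consume(int src, int nelems, double *a, double *b) {
        for (int base = 0; base < nelems; base += CHUNK) {
            int n = (nelems - base < CHUNK) ? nelems - base : CHUNK;

            MPI_Recv(&b[base], n, MPI_DOUBLE, src, base,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* a(i) = b(i-1) on the freshly received chunk. */
            for (int i = base; i < base + n; i++)
                a[i + 1] = b[i];
        }
    }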
14
  • Generate efficient communication code
  • Communication buffer management
  • Cannot assume the whole array is there.
  • Pack and unpack messages
  • Memory references may not be contiguous, while
    communication is usually done in contiguous
    blocks.
  • Need to compute the location of each element and
    pack the elements into the buffer (a packing
    sketch follows this list).
  • When the data distribution is complicated, the
    formula to compute the array elements to be
    communicated may be very complicated.
  • Fancy data distribution methods result in high
    overheads.
  • Consider the features of the underlying
    communication system.
  • If the hardware has two NICs, you may want to do
    something different.
  • In general, this is a complicated process.
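A hedged sketch (not from the slides) of packing non-contiguous (strided)
elements into a contiguous buffer before sending, and unpacking them on
the receiving side; the stride and names are illustrative, and the packed
buffer would then travel as a single message. MPI derived datatypes such
as MPI_Type_vector can often express the same layout without hand packing.

    /* Pack every 'stride'-th element of src into a contiguous buffer. */
    static int pack_strided(const double *src, int count, int stride,
                            double *buf) {
        for (int k = 0; k < count; k++)
            buf[k] = src[k * stride];
        return count;                    /* number of packed elements */
    }

    /* Unpack a contiguous buffer back into a strided destination. */
    static void unpack_strided(const double *buf, int count, int stride,
                               double *dst) {
        for (int k = 0; k < count; k++)
            dst[k * stride] = buf[k];
    }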

15
Jacobi in HPF
16
Generated Jacobi
17
HPF limitations
  • HPF targets large distributed memory machines.
  • The performance of HPF code is still not
    comparable to that of MPI programs.
  • Not able to achieve portable performance.
  • Companies focus on optimizing for their
    customers' applications.
  • Architecture-dependent optimization is hard.
  • HPF performance tuning (by the programmer) is
    problematic; users do not have many options to
    play with.

18
HPF performance issues
  • The limited data distribution methods supported
    by the language.
  • Fancy distributions usually do not work, or
    cannot be handled effectively by the compiler.
  • Human programmers do fancy things sometimes.
  • The lack of fine control is both good and bad:
  • Good for usability.
  • Bad for performance tuning.
  • Insufficient support for irregular problems.
  • These are much harder to handle and cannot be
    made efficient with the simple data distribution
    schemes.
  • Limited support for task parallelism.
  • Even OpenMP is slightly better (parallel
    sections).

19
Last words
  • HPF may not have been particularly successful by
    itself.
  • It has a profound impact on the development of
    parallel programming paradigms.
  • The state-of-the-art compilation techniques for
    distributed memory machines mainly came from HPF
    compilation research.
  • It might be back, in a different form.