PLB: A Programming Language for the Blues - PowerPoint PPT Presentation

1 / 49
About This Presentation
Title:

PLB: A Programming Language for the Blues

Description:

Example 2: The SUMMA Algorithm (1 of 6) Use now the outer-product method (n2-parallelism) ... Example 2: The SUMMA Algorithm (6 of 6) Example 4: Jacobi ... – PowerPoint PPT presentation

Number of Views:54
Avg rating:3.0/5.0
Slides: 50
Provided by: david905
Category:

less

Transcript and Presenter's Notes

Title: PLB: A Programming Language for the Blues


1
PL/B A Programming Language for the Blues
  • George Almasi, Luiz DeRose, Jose Moreira, and
    David Padua

2
Objective of this Project
  • To develop a programming system for the Blue Gene
    series (and distributed-memory machines in
    general) that is
  • Convenient to use for program development and
    maintenance. Resulting parallel programs should
    be very readable easy to modify.
  • Not too difficult to implement so that the system
    could be completed in a reasonable time.
  • Main focus is numerical computing.

3
Most Popular Parallel Programming Models Today
  • SPMD
  • Library-based
  • MPI
  • BSP
  • Compiler-based
  • Co-Array Fortran
  • UPC
  • Single thread of execution - loop/array-based
  • Implicit code partitioning and communication
  • Sophisticated compiler transformations
  • HPF
  • ZPL
  • Explicit code partitioning and communication.
  • Simple compiler transformations.
  • OpenMP

4
Design Principles (1 of 2)
  • Wanted to avoid the SPMD programming model.
  • In its more general form can lead to 4D spaghetti
    code. More structure is needed.
  • The most popular of the SPMD implementations,
    MPI, is cumbersome to use, in part due to the
    lack of compiler support. It has been called the
    assembly language of parallel computing.
  • It is difficult to reason about SPMD programs.
    When we reason about them, we rely on an
    (implicit) global view of communication and
    computation.
  • Wanted to avoid the compiler complexity and
    limitations that led to the failure of HPF.

5
Design Principles (2 of 2)
  • OpenMP is a good model, but Blue Gene is not a
    shared-memory machine. OpenMP could be
    implemented on top of TreadMarks, but efficient
    implementations would require untested,
    sophisticated compilers.
  • We propose a few language extensions and a
    programming model based on a single thread of
    execution. We would like the extensions to
    require only simple transformations similar to
    those used by OpenMP compilers.
  • We would like the extensions to be of a general
    nature so that they can be applied to any
    conventional language.

6
MATLAB Implementation
  • We will present next our solution in the context
    of MATLAB.
  • There are at least two reasons why developing a
    parallel extension to MATLAB is a good idea.
  • MATLAB is an excellent language for prototyping
    conventional algorithms. There is nothing
    equivalent for parallel algorithms.
  • Programmers of parallel machines should
    appreciate the MATLAB environment as much as
    conventional programmers.
  • In fact, during a site visit to TJ Watson,
    several government representatives requested that
    a parallel MATLAB be included in the PERCS
    project.
  • It is relatively simple to develop a prototype of
    the proposed extensions in MATLAB.

7
PL/B Data Types and Statements
  • The constructs of PL/B are
  • A new type of objects Hierarchically tiled
    arrays.
  • Array (and hierarchical array) operations and
    assignment statements similar to those found in
    MATLAB, Fortran 90, and APL.
  • Hierarchically tiled arrays (HTAs) are used
  • to specify data distribution in parallel programs
    and
  • to facilitate the writing of blocked sequential
    algorithms (for locality).

8
Example 1 Matrix Multiplication (1 of 4)
1. Tiled Matrix Multiplication in a Conventional
Language
for I1qn for J1qn for K1qn
for iIIq-1 for
jJJq-1 for kKKq-1
c(i,j)c(i,j)a(i,k)b(k,j)
end end
end end end end
9
Example 1 Matrix Multiplication (2 of 4)
2. Tiled Matrix Multiplication Using HTAs
  • Here ci,j, ai,k, bk,j, and T represent
    submatrices.
  • The operator represents matrix multiplication
    in MATLAB.

for i1m for j1m T0 for k1m
TTai,kbk,j end ci,jT
end end
10
Example 1 Matrix Multiplication (3 of 4)
3. Parallel Matrix Multiplication Using HTAs
for j1m for k1m
c,jc,ja,kbk,j end end
  • Use the middle product method to do parallel
    matrix multiply,
  • Assume a two dimensional processor mesh.
  • Assume i,j submatrices are stored in processor
    (i,j)
  • All communications are explicit in separate array
    assignment statements.
  • Parallel computations are also explicit in
    separate array assignment statements and involve
    no communication.

block i,j is in processor (i,j) for j1m
for k1m copy kth column of a
r,ja,k broadcast
bk,j s,jbk,j
compute w/o communication
c,jc,jr,js,j end end
11
Example 1 Matrix Multiplication (4 of 4)
4. A Second Version of Parallel Matrix
Multiplication
for i1m for j1m ci,jdotproduct(a,b,i
,j) end end function x dotproduct(a,b,i,j)
ti,b,j communication
ti,ti,ai, computation
si,ti, for i1ceil(log2(size(a)))
si,eoshift(ti,,2i,eye)
ti,si,ti, end xti,1 end
  • Use now the inner product method,
  • Collective operations that involve both
    computations and communications can be
    represented using intrinsic functions or user
    functions.

12
HTAs
  • HTAs are n-dimensional arrays that have been
    tiled (tiles are d-dimensional, d n).
  • The tiles could be recursive tiled with the tiles
    at each level having the same shape except for
    boundary cases.
  • The topmost levels could be distributed across
    modules of a parallel system.

13
Examples of HTA
A(14,118) ?A1214,13(13)
A(14,117) ?A114,1313,
A214,13(12)
14
Flattening
  • Elements of a HTA are referenced using a tile
    index for each level in the hierarchy followed by
    an array index. Each tile index tuple is enclosed
    within s and the array index is enclosed within
    parentheses.
  • In the matrix multiplication code, ci,j(3,4)
    would represent element 3,4 of submatrix i,j.
  • In the first example of the previous page
    A2,4,3(3) represents the element on the
    bottom right corner.
  • Alternatively, the tiled array could be accessed
    as a flat array.
  • In the matrix multiplication example, assuming
    that each tile is p p, c(r,s) would represent
    element 3,4 of submatrix i,j if r(i-1)p3 and
    s(j-1)p4.

15
Two Ways of Referencing the Elements of an 8 x 8
Array.
16
Parallel Communication and Computation in PL/B (1
of 2)
  • PL/B programs are single-threaded and contain
    array operations on HTAs. The sequential part of
    the program
  • Like in HPF and MPI processors can be arranged in
    different forms. In MPF and MPI, processors are
    arranged into meshes, in PL/B we also consider
    (virtual) nodes so that processors can be
    arranged into hierarchical meshes.
  • The top levels of HTAs can be distributed onto a
    subset of nodes.

17
Parallel Communication and Computation in PL/B (1
of 2)
  • Array operations on HTAs can represent
    communication or computation.
  • Assignment statements where all HTA indices are
    identical are computations executed in the home
    of each of the HTA elements involved.
  • Assignment statements where this is not the case
    represent communication operations.

18
HTA Distribution Examples (1 of 4)
  • Consider a 3x6 processor arrangement.
  • A HTA with shape 13,16(110) will be
    distributed on the processor mesh by assigning
    each 10 element vector o a different processor.
  • A HTA with shape 16,118(110) will be
    distributed cyclically on both dimensions

19
HTA Distribution Examples (2 of 4)
  • Consider the two level processor arrangement
    shown on the left. It is a 3x2 arrangement of 2x2
    meshes of processors.
  • A HTA with shape 13,1212,12(110,110)
    could be distributed by assigning each 10x10
    array to a different processor.
  • Also, the processor arrangement shown on left
    could be flattened so that an HTA with shape
    16,14(110,110) could be distributed by
    assigning each 10x10 array to a different
    processor

20
HTA Distribution Example (3 of 4)
  • A HTA, b, with shape 13,1212,120(15)
    would be distributed cyclically on the
    arrangement on the left. The pair of processors
    i,jk,12 would host the following vectors

bi,jk,1(15) bi,jk,3(15) bi,jk,19(1
5)
bi,jk,2(15) bi,jk,4(15) bi,jk,20(1
5)
  • A HTA, b, with shape 16,140(15) would be
    distributed cyclically on the flattened
    arrangement so that processors i,14 would
    host the following vectors

bi,1(15) bi,5(15) bi,17(15)
bi,2(15) bi,6(15) bi,18(15)
bi,3(15) bi,7(15) bi,19(15)
bi,4(15) bi,8(15) bi,20(15)
21
Advantages of the PL/B Programming Model
  • We believe that by relying on a single thread our
    programming model facilitates the transition from
    sequential to parallel form.
  • In fact, a conventional program can be
    incrementally modified by converting one data
    structure at a time into distributed cell array
    form (cf. OpenMP).

22
Additional Examples
  • We now present several additional examples of
    PL/B
  • Except where indicated, we assume that
  • Each HTA has a single level of tiling,
  • All HTAs have identical sizes and shapes,
  • The HTAs are identically distributed distributed
    on a processor mesh that matches in size and
    shape all HTAs.
  • Our objective in these examples is to show the
    expressiveness of the language for both dense and
    sparse computations.

23
Example 2 Cannons Algorithm (1 of 3)
24
Example 2 Cannons Algorithm (2 of 3)
forall version
c1n,1n zeros(p,p) communication forall
i1n a,i cshift(a(i,,i-1)
communication end forall i1n b,i
cshift(b,i,i-1) communication end for
k1n forall i1n, j1n ci,j
ci,jai,jbi,j computation end forall
i1n ai, cshift(ai,,1)
communication end forall i1n b,i
cshift(b,i,1) communication end end
25
Example 2 Cannons Algorithm (3 of 3)
Triplet version
c1n,1n zeros(p,p) communication for
i2n ain, cshift(a(in,,dim2,shift1
) communication b,in
cshift(b,in,dim1,shift1) communication end
for k1n c, c,a,b,
computation a, cshift(a,,dim2,
shift1) communication b,
cshift(b,,dim1,shift1) communication end
26
Example 2 The SUMMA Algorithm (1 of 6)
Use now the outer-product method
(n2-parallelism) Interchanging the loop headers
of the loop in Example 1 produce for k1n
for i1n for j1n
Ci,jCi,jAi,kBk,
j end
end end To obtain n2 parallelism, the inner
two loops should take the form of a block
operations for k1n
C,C,A,k ? Bk, end Where the
operator ? represents the outer product operations
27
Example 2 The SUMMA Algorithm (2 of 6)
  • The SUMMA Algorithm

A
B
C
b11 b12
Switch Orientation -- By using a column of A and
a row of B broadcast to all, compute the next
terms of the dot product
a11 a21
a11b11
a11b12
a21b11
a21b12
28
Example 2 The SUMMA Algorithm (3 of 6)
c1n,1n zeros(p,p) communication for
i1n t1,spread(a(,i),dim2,ncopiesN)
communication t2,spread(b(i,),dim
1,ncopiesN) communication
c,c,t1,t2, computation end
29
Example 2 The SUMMA Algorithm (4 of 6)
30
Example 2 The SUMMA Algorithm (5 of 6)
31
Example 2 The SUMMA Algorithm (6 of 6)
32
Example 4 Jacobi Relaxation
while dif gt epsilon v2,(0,)
vn-1,(p,) communication vn-1,
(p1,) v2, (1,) communication
v,2 (,0) v,1n-1(,p)
communication v, n-1(,p1)
v,2(,1) communication
u,(1p,1p) a (v, (1p,0p-1)
v, (0p-1,1p)
v, (1p,2p1) v, (2p1,1p))
computation difmax(max( abs ( v u ) ))
v u end
33
Example 5 Sparse MVM/copy vector (1 of 2)
P1
P2
P3
A
b

P4
34
Example 5 Sparse MVM/copy vector (1 of 2)
Distribute a c1n
a(DIST(1n)DIST(1n1)-1,) Broadcast vector
b v1n b Local multiply t c
v Get result forall i1N vi t()
flattened t end
Important observation In MATLAB sparse
computations can be represented as dense
computations. The interpreter only performs the
necessary operations.
35
Example 6 Sparse MVM/distribute vector (1 of 7)
P1
P2
P3
A
b

P4
36
Example 6 Sparse MVM/distribute vector (2 of 7)
Step 1 Compute set of indices needed in
processor i from chunk of vector in processor j
P1
P2
P3
P4
37
Example 6 Sparse MVM/distribute vector (3 of 7)
Step 2 Perform collective communication
by transposing matrix of set of indices
P1
P2
P3
P4
38
Example 6 Sparse MVM/distribute vector (4 of 7)
Step 3 In each processor, collect the
vector elements needed by the other processors
P1
P2
P3
P4
39
Example 6 Sparse MVM/distribute vector (5 of 7)
Step 4 Transpose again to communicate vector
elements to the processor where they are needed
P1
P2
P3
P4
40
Example 6 Sparse MVM/distribute vector (6 of 7)
c1n a(DIST(1n)DIST(2n1)-1,) v1n(DI
ST(1n)DIST(2n1)-1,1) b(DIST(1n)DIST(2n1)
-1) Step 1 PIJ contains index of elements
of vector v from processor J needed by
processor I. First component of P is
distributed and the second is a regular cell
array. forall I1N PI fun(cI,DIST) end
41
Example 6 Sparse MVM/distribute vector (7 of 7)
Step 2 RIJ contains the index of elements
of vector v from processor I needed by
processor J R transpose(P) Step
3 QIJ contains all elements of vector v
from processor I needed by processor J forall
I1N, k1N, (k ! I) QIk
vI(RIk()) end Step 4 RIJ
contains all elements of vector v
from processor J needed by processor I R
transpose(Q ) forall I1N vI(PI(
))RI() end v c v
42
Example 7 SuperLU_DIST
Step 1 Apply LU decomposition to corner block
CK,K
43
Example 7 SuperLU_DIST
Step 2 Factorize block column L,K
C,K/UK,K
44
Example 7 SuperLU_DIST
Step 3 Factorize block row UK,
45
Example 7 SuperLU_DIST
Step 4 Update trailing Submatrix
CI,JCI,J-LI,KUK,J
46
Example 7 SuperLU_DIST
for K1N LK,K,UK,Klu(CK,K) Step
1 LK1N,KCK1N,K/UK,K Step
2 UK,K1NLK,K\CK,K1N Step
3 forall IK1N LI,K1LI,K UK1
,IUK1,I end CK1,K1CK1,K1
-LK1,K1UK1,K1 Step 4 end
47
Example 7 SuperLU_DIST
Avoiding unnecessary communication
48
Example 7 SuperLU_DIST
Tranpose Send
T
T
T
T
T
T
F
F
F
F
F
F
Spread
T
T
T
T
T
T
F
F
F
F
F
F
T
T
T
T
T
T
49
Example 7 SuperLU_DIST
for K1N LK,K,UK,Klu(CK,K) Step
1 TKN,K CKN,K ! 0 SKN,K(KN)s
pread(TKN,K,) R K,KNtranspose(SLK1N,
KKN) LK1N,KCK1N,K/UK,K
Step 2 UK,K1NLK,K\CK,K1N Step
3 forall IK1N LI,K1LI,K end
forall IK1N URK,I,IUK,I end C
K1,K1CK1,K1-LK1,K1UK1,K1
Step 4 end
Write a Comment
User Comments (0)
About PowerShow.com