1
CSc 8530
Matrix Multiplication and Transpose
By
Jaman Bhola
2
Outline
  • Matrix multiplication with one processor
  • Matrix multiplication in parallel
  • Algorithm
  • Example
  • Analysis
  • Matrix transpose
  • Algorithm
  • Example
  • Analysis

3
Matrix multiplication with one processor
  • Using one processor: O(n³) time
  • Algorithm:
  •   for (i = 0; i < n; i++)
  •     for (j = 0; j < n; j++) {
  •       t = 0;
  •       for (k = 0; k < n; k++)
  •         t = t + a[i][k] * b[k][j];
  •       c[i][j] = t;
  •     }
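  • A runnable C version of this sequential loop is sketched below
    (an illustration, not the course code); the 2 x 2 test matrices
    are the same ones used in the worked example later in the
    presentation.

    /* Sequential O(n^3) matrix multiplication, C = A x B. */
    #include <stdio.h>
    #define N 2

    int main(void) {
        int a[N][N] = {{1, 2}, {3, 4}};
        int b[N][N] = {{-1, -2}, {-3, -4}};
        int c[N][N];

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int t = 0;
                for (int k = 0; k < N; k++)
                    t += a[i][k] * b[k][j];   /* row i of A times column j of B */
                c[i][j] = t;
            }

        for (int i = 0; i < N; i++) {         /* prints -7 -10 / -15 -22 */
            for (int j = 0; j < N; j++) printf("%4d", c[i][j]);
            printf("\n");
        }
        return 0;
    }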

4
Matrix multiplication in parallel
  • Using a hypercube: the algorithm given in the book assumes the
    multiplication of two n x n matrices where n is a power of 2.
  • This facilitates a hypercube network.
  • We need N = n³ = 2^(3q) processors, where n = 2^q is the size of
    the matrix.

5
  • N processors, allowing each processor to occupy a vertex of the
    hypercube.
  • Each processor Pr has a given position, where r = i·n² + j·n + k
    for 0 ≤ i, j, k ≤ n−1.
  • If r is represented by
  •   r_{3q−1} r_{3q−2} … r_{2q}  r_{2q−1} … r_q  r_{q−1} … r_0
  • then the binary representations of i, j and k are
  •   r_{3q−1} … r_{2q},  r_{2q−1} … r_q  and  r_{q−1} … r_0
    respectively.
  • This allows the processors to be positioned so that adjacent
    positions differ in only one binary digit.
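  • As an illustration (the helper code below is ours, not from the
    slides), the coordinates i, j, k are simply bit fields of r when
    n = 2^q, and flipping any single bit of r moves to a hypercube
    neighbour:

    /* Sketch: recover (i, j, k) from r = i*n^2 + j*n + k when n = 2^q. */
    #include <stdio.h>

    int main(void) {
        int q = 1, n = 1 << q;            /* n = 2, so N = n^3 = 8 processors */
        for (int r = 0; r < n * n * n; r++) {
            int k = r & (n - 1);          /* bits r_(q-1) ... r_0     */
            int j = (r >> q) & (n - 1);   /* bits r_(2q-1) ... r_q    */
            int i = r >> (2 * q);         /* bits r_(3q-1) ... r_(2q) */
            int neighbour = r ^ 1;        /* r with bit 0 flipped     */
            printf("P%d: i=%d j=%d k=%d  (neighbour across bit 0: P%d)\n",
                   r, i, j, k, neighbour);
        }
        return 0;
    }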

6
  • This also means that all processors that agree in one or two of
    the coordinates i, j, k form a hypercube.
  • Example: building a hypercube for q = 1, so that
    N = n³ = 2^(3q) = 8 processors.
  • For Pr, where r = i·n² + j·n + k, we get:

7

  •        i·n² + j·n + k = r     i j k
  • P0  →    0  +  0  + 0 = 0     0 0 0
  • P1  →    0  +  0  + 1 = 1     0 0 1
  • P2  →    0  +  2  + 0 = 2     0 1 0
  • P3  →    0  +  2  + 1 = 3     0 1 1
  • P4  →    4  +  0  + 0 = 4     1 0 0
  • P5  →    4  +  0  + 1 = 5     1 0 1
  • P6  →    4  +  2  + 0 = 6     1 1 0
  • P7  →    4  +  2  + 1 = 7     1 1 1

8
[Figure: the eight processors at the vertices of a 3-dimensional
hypercube, labelled 000, 001, 010, 011, 100, 101, 110 and 111.]
9
Processor Layout
  • Each processor Pr has 3 registers: Ar, Br and Cr.
  • The following is a step-by-step description of the algorithm.
  • [Figure: processor P0 with its three registers A, B and C.]
10
  • Step 1: The elements of A and B are distributed to the n³
    processors so that the processor in position (i,j,k) will contain
    a_{ji} and b_{ik}.
  • (1.1) Copies of the data initially in A(0,j,k) and B(0,j,k) are
    sent to the processors in positions (i,j,k), where 1 ≤ i ≤ n−1.
    Resulting in A(i,j,k) = a_{jk} and B(i,j,k) = b_{jk} for
    0 ≤ i, j, k ≤ n−1.
  • (1.2) Copies of the data in A(i,j,i) are sent to the processors
    in positions (i,j,k), where 0 ≤ k ≤ n−1. Resulting in
    A(i,j,k) = a_{ji} for 0 ≤ k ≤ n−1.
  • (1.3) Copies of the data in B(i,i,k) are sent to the processors
    in positions (i,j,k), where 0 ≤ j ≤ n−1. Resulting in
    B(i,j,k) = b_{ik} for 0 ≤ j ≤ n−1.

11
  • Step 2: Each processor in position (i,j,k) computes the product
  •   C(i,j,k) = A(i,j,k) · B(i,j,k)
  • Thus C(i,j,k) = a_{ji} · b_{ik} for 0 ≤ i, j, k ≤ n−1.
  • Step 3: The sum C(0,j,k) = Σ_i C(i,j,k), for 0 ≤ i ≤ n−1, is
    computed for 0 ≤ j, k ≤ n−1.

12
  • The algorithm
  • Step 1: (1.1)
  •   for m = 3q−1 downto 2q do
  •     for all r ∈ N(r_m = 0) do in parallel
  •       (i)  A_{r(m)} ← A_r
  •       (ii) B_{r(m)} ← B_r
  •     end for
  •   end for
  • (1.2)
  •   for m = q−1 downto 0 do
  •     for all r ∈ N(r_m = r_{2q+m}) do in parallel
  •       A_{r(m)} ← A_r
  •     end for
  •   end for

13
  • (1.3)
  •   for m = 2q−1 downto q do
  •     for all r ∈ N(r_m = r_{q+m}) do in parallel
  •       B_{r(m)} ← B_r
  •     end for
  •   end for
  • Step 2:
  •   for r = 0 to N−1 do in parallel
  •     C_r ← A_r · B_r
  •   end for

14
  • Step 3:
  •   for m = 2q to 3q−1 do
  •     for all r ∈ N(r_m = 0) do in parallel
  •       C_r ← C_r + C_{r(m)}
  •     end for
  •   end for

15
An example using 2 x 2 matrices
  • This example requires n³ = 8 processors. The matrices are
  •   A = | 1  2 |        B = | -1  -2 |
  •       | 3  4 |            | -3  -4 |

16
[Figure: the (A, B) register contents of the eight processors during
and after the distribution of Step 1. At the end of Step 1, processor
(i,j,k) holds a_{ji} in A and b_{ik} in B; for example, processor
(0,0,0) holds (1, -1) and processor (1,1,1) holds (4, -4).]
17
[Figure: the C registers after Step 2 hold the products a_{ji}·b_{ik}
(-1, -2, -3, -6, -6, -8, -12, -16); after the Step 3 sums, layer i = 0
holds the result C = A·B with entries -7, -10, -15, -22.]
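  • To make the data movement of Steps 1-3 concrete, here is a small
    serial emulation of the algorithm on the 2 x 2 example (a sketch;
    processor r's registers are modelled as array entries A[r], B[r],
    C[r], and the neighbour r(m) is computed as r XOR 2^m). It
    reproduces the products and sums shown above.

    #include <stdio.h>

    #define Q 1                        /* q = 1, so n = 2 and N = n^3 = 8 */
    #define NDIM (1 << Q)
    #define NPROC (NDIM * NDIM * NDIM)

    static int bit(int r, int m) { return (r >> m) & 1; }

    int main(void) {
        int a[2][2] = {{1, 2}, {3, 4}};
        int b[2][2] = {{-1, -2}, {-3, -4}};
        int A[NPROC] = {0}, B[NPROC] = {0}, C[NPROC] = {0};
        int q = Q, n = NDIM, r, m;

        /* Initial distribution: processor (0,j,k) holds a_jk and b_jk. */
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                A[j*n + k] = a[j][k];
                B[j*n + k] = b[j][k];
            }

        /* Step 1.1: replicate layer i = 0 across the i bits. */
        for (m = 3*q - 1; m >= 2*q; m--)
            for (r = 0; r < NPROC; r++)
                if (bit(r, m) == 0) { A[r ^ (1 << m)] = A[r]; B[r ^ (1 << m)] = B[r]; }

        /* Step 1.2: spread A(i,j,i) along the k bits, so A(i,j,k) = a_ji. */
        for (m = q - 1; m >= 0; m--)
            for (r = 0; r < NPROC; r++)
                if (bit(r, m) == bit(r, 2*q + m)) A[r ^ (1 << m)] = A[r];

        /* Step 1.3: spread B(i,i,k) along the j bits, so B(i,j,k) = b_ik. */
        for (m = 2*q - 1; m >= q; m--)
            for (r = 0; r < NPROC; r++)
                if (bit(r, m) == bit(r, q + m)) B[r ^ (1 << m)] = B[r];

        /* Step 2: local products. */
        for (r = 0; r < NPROC; r++) C[r] = A[r] * B[r];

        /* Step 3: sum over i; the entry c_jk ends up in C(0,j,k). */
        for (m = 2*q; m <= 3*q - 1; m++)
            for (r = 0; r < NPROC; r++)
                if (bit(r, m) == 0) C[r] += C[r ^ (1 << m)];

        for (int j = 0; j < n; j++) {      /* prints -7 -10 / -15 -22 */
            for (int k = 0; k < n; k++) printf("%4d", C[j*n + k]);
            printf("\n");
        }
        return 0;
    }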
18
Analysis of algorithm
  • If the layout of the processors is viewed as an n x n x n array,
    then it consists of n layers, each an n x n array of processors.
  • Initially, the first layer (i = 0) holds a distinct value from
    matrix A in each A register and a distinct value from matrix B in
    each B register. This is a constant-time operation.
  • Step 1.1: the data is copied to the other layers by repeated
    doubling (to 1 layer, then 2, then 4, and so on), so O(log n)
    iterations copy the data from layer 0 to layers 1 through n−1.

19
  • Steps 1.2 and 1.3: each processor in column i of layer i sends
    data to the processors in its row, and similarly each processor
    in row i sends data to the processors in its column. Each
    iteration takes constant time.
  • Step 2 requires constant time.
  • Step 3 is again a sequence of constant-time iterations.
  • Overall, the algorithm requires O(log n) time.
  • But the cost is O(n³ log n), which is not optimal.

20
A faster algorithm (Quinn)
  • For all Pm, where 1 ≤ m ≤ p, do in parallel
  •   for i = m to n step p do
  •     for j = 1 to n do
  •       t = 0
  •       for k = 1 to n do
  •         t = t + a_{ik} · b_{kj}
  •       c_{ij} = t
  • time → O(n³/p); maximum number of processors: n²

21
An actual implementation
  • Get the processor id.
  • The if statement below makes sure that the entire matrix is
    computed when p does not divide n evenly.
  •   chunksize = (int)(n / p);
  •   if ((chunksize * nprocs) != n) {
  •     int differ = n - (chunksize * p);
  •     if (id == 0)
  •       lower = id * chunksize;
  •     else
  •       lower = id * chunksize + differ;
  •     upper = (id + 1) * chunksize + differ;

22
  •   } else {
  •     lower = id * chunksize;
  •     upper = (id + 1) * chunksize;
  •   }
  •   for (i = lower; i < upper; i++)
  •     for (j = 0; j < n; j++) {
  •       total = 0;
  •       for (k = 0; k < n; k++)
  •         total = total + mat1[i][k] * mat2[k][j];
  •       mat3[i][j] = total;
  •     }
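  • A common alternative to the special-case code above is to give
    each processor either floor(n/p) or ceil(n/p) consecutive rows.
    The sketch below (an illustration, not the original
    implementation) computes such a range directly:

    #include <stdio.h>

    /* First and one-past-last row owned by processor id (0 <= id < p). */
    static void row_range(int id, int p, int n, int *lower, int *upper) {
        *lower = (id * n) / p;
        *upper = ((id + 1) * n) / p;
    }

    int main(void) {
        int lower, upper, n = 10, p = 4;
        for (int id = 0; id < p; id++) {
            row_range(id, p, n, &lower, &upper);
            printf("processor %d: rows %d..%d\n", id, lower, upper - 1);
        }
        return 0;
    }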

23
Another faster algorithm (Gupta and Sadayappan)
  • The 3-D diagonal algorithm is a 3-phase algorithm.
  • The concept: a hypercube of p processors is viewed as a 3-D mesh
    of size p^(1/3) x p^(1/3) x p^(1/3).
  • Matrices A and B are partitioned into p^(2/3) blocks, with
    p^(1/3) blocks along each dimension.
  • Initially, it is assumed that A and B are mapped onto the 2-D
    plane x = y, and the 2-D plane y = j is responsible for
    calculating the outer product of A_{*,j} (the set of columns of A
    stored at the processors p_{j,j,*}) and B_{j,*} (the set of rows
    of B).

24
  • Phase 1: Point-to-point communication of B_{k,i} by p_{i,i,k} to
    p_{i,k,k}.
  • Phase 2: One-to-all broadcasts of the blocks of A along the
    x-direction and of the newly acquired (from Phase 1) blocks of B
    along the z-direction, i.e. processor p_{i,i,k} broadcasts
    A_{k,i} to p_{*,i,k}, and each processor p_{i,k,k} broadcasts
    B_{k,i} to p_{i,k,*}.
  • At the end of Phase 2, every processor p_{i,j,k} has the blocks
    A_{k,j} and B_{j,i}.
  • Each processor now calculates the product of its pair of blocks
    of A and B.

25
  • Phase 3: After the computation, a reduction by addition along the
    y-direction produces the final matrix C.

26
Algorithm Analysis
  • Phase 1: Passing messages of size n²/p^(2/3) requires time
    log(p^(1/3)) · (t_s + t_w · n²/p^(2/3)), where t_s is the
    start-up time for sending a message and t_w is the time to send
    one word from a processor to its neighbour.
  • Phase 2 takes twice as much time as Phase 1.
  • Phase 3 can be completed in the same amount of time as Phase 1.
  • Overall, the algorithm takes
    (4/3 · log p,  (n²/p^(2/3)) · (4/3 · log p)),
    where the communication cost of a pair (a, b) is t_s·a + t_w·b.

27
  • Some added conditions:
  • 1. p
  • 2. Overall space used → 2n² · p^(1/3)
  • The above description is for a one-port hypercube architecture,
    in which a processor can use at most one communication link to
    send and receive data.
  • With a multi-port architecture, in which a processor can use all
    of its communication ports simultaneously, the algorithm is
    faster, reducing the above time by a factor of log(p^(1/3)).

28
The algorithm
  • Initial distribution: processor p_{i,i,k} contains A_{ki} and
    B_{ki}.
  • Program of processor p_{i,j,k}:
  •   if (i = j) then
  •     send B_{ki} to p_{i,k,k}
  •     broadcast B_{ji} to all processors p_{i,j,*}
  •   endif
  •   receive A_{kj} from p_{j,j,k}
  •   calculate I_{ki} = A_{kj} x B_{ji}
  •   send I_{ki} to p_{i,i,k}

29
  •   if (i = j) then
  •     for l = 0 to p^(1/3) − 1
  •       receive I_{ki} from p_{i,l,k}
  •       C_{ki} = C_{ki} + I_{ki}
  •     endfor
  •   endif
  • I is an intermediate matrix.
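  • A serial sketch of the three phases, using 1 x 1 "blocks" (so
    p = n³ and the cube root of p equals n) and the same 2 x 2
    example as before. The arrays A and B model what each processor
    p(i,j,k) holds once Phases 1 and 2 are done, and the summation
    models the Phase 3 reduction; this is an illustration, not the
    paper's code.

    #include <stdio.h>
    #define N 2

    int main(void) {
        int a[N][N] = {{1, 2}, {3, 4}};
        int b[N][N] = {{-1, -2}, {-3, -4}};
        int A[N][N][N], B[N][N][N], C[N][N] = {0};
        int i, j, k;

        /* Phases 1 and 2 (the communication) are collapsed into a direct
           assignment of what each processor ends up holding:
           p(i,j,k) has block A_kj and block B_ji. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++) {
                    A[i][j][k] = a[k][j];
                    B[i][j][k] = b[j][i];
                }

        /* Phase 3: each p(i,j,k) forms I_ki = A_kj * B_ji, and the I blocks
           are reduced (added) over j at p(i,i,k), giving C_ki. */
        for (i = 0; i < N; i++)
            for (k = 0; k < N; k++)
                for (j = 0; j < N; j++)
                    C[k][i] += A[i][j][k] * B[i][j][k];

        for (k = 0; k < N; k++) {          /* prints C = A x B: -7 -10 / -15 -22 */
            for (i = 0; i < N; i++) printf("%4d", C[k][i]);
            printf("\n");
        }
        return 0;
    }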

30
Matrix Transposition
  • The same concept is used here as in matrix multiplication.
  • The number of processors used is N = n² = 2^(2q), and processor
    Pr occupies position (i,j), where r = i·n + j and 0 ≤ i, j ≤ n−1.
  • Initially, processor Pr holds element a_{ij} of matrix A, where
    r = i·n + j.
  • Upon termination, processor Ps holds element a_{ij}, where
    s = j·n + i.

31
  • If r is represented by
  •   r_{2q−1} r_{2q−2} … r_q  r_{q−1} … r_1 r_0
  • then the binary representations of i and j are
  •   r_{2q−1} … r_q  and  r_{q−1} … r_0  respectively.
  • And if s is represented by
  •   s_{2q−1} s_{2q−2} … s_q  s_{q−1} … s_1 s_0
  • then the binary representations of j and i are
  •   s_{2q−1} … s_q  and  s_{q−1} … s_0  respectively.
  • Thus it can be seen that
  •   r_{2q−1} … r_q = s_{q−1} … s_0   and
  •   r_{q−1} … r_0 = s_{2q−1} … s_q
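  • In other words, s is r with its two q-bit halves exchanged. A
    small illustrative check (helper code, not from the slides):

    #include <stdio.h>

    int main(void) {
        int q = 2, n = 1 << q;                /* n = 4, N = 16 as in the example below */
        for (int r = 0; r < n * n; r++) {
            int i = r >> q, j = r & (n - 1);  /* r = i*n + j */
            int s = (j << q) | i;             /* s = j*n + i: the two halves swapped */
            printf("a(%d,%d): held by P%d, ends up in P%d\n", i, j, r, s);
        }
        return 0;
    }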

32
The algorithm
  • First, the requirements for the algorithm: each processor Pu
    needs two registers, Au and Bu.
  • The index of Pu is
  •   u = u_{2q−1} u_{2q−2} … u_q  u_{q−1} … u_1 u_0,
  • matching that of r.

33
  • for m = 2q−1 downto q do
  •   for u = 0 to N−1 do in parallel
  •     (1) if u_m ≠ u_{m−q}
  •           then B_{u(m)} ← A_u
  •         endif
  •     (2) if u_m = u_{m−q}
  •           then A_{u(m−q)} ← B_u
  •         endif
  •   endfor
  • endfor

34
Explanation of algorithm
  • This algorithm achieves the transpose of A recursively.
  • Divide the matrix into 4 submatrices of size n/2 x n/2.
  • In the first iteration, when m = 2q−1, swap the elements of the
    top-right submatrix with those of the bottom-left submatrix. The
    other 2 submatrices are not touched.
  • Now repeat this recursively within each submatrix until all of
    the elements are swapped.

35
Example.
  • We want the transpose of the following matrix:
  •       a b c d
  •   A = e f g h
  •       i j k l
  •       m n o p

36
  • We use 16 processors with the following indices
  • 0000 0001 0010 0011
  • 0100 0101 0110 0111
  • 1000 1001 1010 1011
  • 1100 1101 1110 1111

37
Drawing a hypercube for this
  • [Figure: the 16 processors at the vertices of a 4-dimensional
    hypercube, drawn as two cubes:
        6  7 | 14 15
        4  5 | 12 13
        2  3 | 10 11
        0  1 |  8  9 ]

38
  • Processor 0 (binary 0000) holds a_{00}, which is the value a.
  • Processor 1 (binary 0001) holds a_{01}, which is the value b.
  • Processor 2 (binary 0010) holds a_{02}, which is the value c.
    And so on.
  • In the first iteration, m = 2q−1 where q = 2 in this example, so
    m = 3.
  • Step 1: Each Pu with u_3 ≠ u_1 sends its element A_u to P_{u(3)},
    which stores the value in B_{u(3)}, i.e.
  • processors 2, 3, 6 and 7 send to processors 10, 11, 14 and 15
    respectively,
  • and
  • processors 8, 9, 12 and 13 send to processors 0, 1, 4 and 5
    respectively.

39
  • Step 2: Each processor that received data in Step 1 now sends the
    data from B_u to P_{u(1)}, to be stored in A_{u(1)}, i.e.
  • processors 0, 1, 4 and 5 send to 2, 3, 6 and 7 respectively,
  • processors 10, 11, 14 and 15 send to 8, 9, 12 and 13
    respectively.
  • By the end of the first iteration our matrix A looks like
  •       a b i j
  •   A = e f m n
  •       c d k l
  •       g h o p

40
  • In the second iteration, m = q = 2.
  • Step 1: Each Pu with u_2 ≠ u_0 sends A_u to P_{u(2)}, which
    stores it in B_{u(2)}. This is a simultaneous transfer:
  • from processor 4 to processor 0,
  • processor 1 to processor 5,
  • processor 6 to processor 2,
  • processor 3 to processor 7,
  • processor 12 to processor 8,
  • processor 9 to processor 13,
  • processor 14 to processor 10,
  • processor 11 to processor 15.

41
  • Step 2: For u_2 = u_0, each Pu sends B_u to P_{u(0)}, where it is
    stored in A_{u(0)}; this swaps the element in the top-right
    corner processor with that in the bottom-left corner for each of
    the 2 x 2 submatrices:
  • from processor 0 to processor 1,
  • processor 5 to processor 4,
  • processor 2 to processor 3,
  • processor 7 to processor 6,
  • processor 8 to processor 9,
  • processor 13 to processor 12,
  • processor 10 to processor 11,
  • processor 15 to processor 14.

42
  • After the second iteration the A registers hold the transposed
    matrix:
  •       a e i m
  •   A = b f j n
  •       c g k o
  •       d h l p
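  • The whole sequence above can be checked with a small serial
    emulation of the transpose algorithm (a sketch; processor u's
    registers are modelled as A[u] and B[u], and u(m) is computed as
    u XOR 2^m):

    #include <stdio.h>

    #define Q 2                      /* q = 2, so n = 4 and N = n^2 = 16 */
    #define NDIM (1 << Q)
    #define NPROC (NDIM * NDIM)

    static int bit(int u, int m) { return (u >> m) & 1; }

    int main(void) {
        char A[NPROC], B[NPROC] = {0};
        int q = Q, n = NDIM, u, m;

        /* Processor u = i*n + j initially holds a_ij ('a' .. 'p'). */
        for (u = 0; u < NPROC; u++) A[u] = (char)('a' + u);

        for (m = 2*q - 1; m >= q; m--) {
            /* (1) if u_m != u_(m-q): send A_u to P_u(m), stored in B_u(m). */
            for (u = 0; u < NPROC; u++)
                if (bit(u, m) != bit(u, m - q)) B[u ^ (1 << m)] = A[u];
            /* (2) if u_m == u_(m-q): send B_u to P_u(m-q), stored in A_u(m-q). */
            for (u = 0; u < NPROC; u++)
                if (bit(u, m) == bit(u, m - q)) A[u ^ (1 << (m - q))] = B[u];
        }

        /* The processor grid now holds the transpose of A. */
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) printf(" %c", A[i*n + j]);
            printf("\n");
        }
        return 0;
    }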

43
Algorithm Analysis
  • The algorithm takes q constant-time iterations, giving
  •   t(n) = O(log n).
  • But it takes n² processors.
  • Therefore the cost is O(n² log n), which is not optimal.

44
Bibliography
  • Akl, S.G., Parallel Computation: Models and Methods, Prentice
    Hall, 1997.
  • Drake, J.B. and Luo, Q., "A Scalable Parallel Strassen's Matrix
    Multiplication Algorithm for Distributed-Memory Computers,"
    Proceedings of the 1995 ACM Symposium on Applied Computing,
    February 1995, pp. 221-226.
  • Gupta, H. and Sadayappan, P., "Communication Efficient Matrix
    Multiplication on Hypercubes," Proceedings of the Sixth Annual
    ACM Symposium on Parallel Algorithms and Architectures, August
    1994, pp. 320-329.
  • Quinn, M.J., Parallel Computing: Theory and Practice, McGraw-Hill,
    1997.