1
CSc 8530
Matrix Multiplication and Transpose
By
Jaman Bhola
2
Outline
  • Matrix multiplication with one processor
  • Matrix multiplication in parallel
  • Algorithm
  • Example
  • Analysis
  • Matrix transpose
  • Algorithm
  • Example
  • Analysis

3
Matrix multiplication with one processor
  • Using one processor: O(n³) time
  • Algorithm:
  •   for (i = 0; i < n; i++)
  •     for (j = 0; j < n; j++) {
  •       t = 0;
  •       for (k = 0; k < n; k++)
  •         t = t + a[i][k] * b[k][j];
  •       c[i][j] = t;
  •     }
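  • A runnable C version of this sequential loop is sketched below
    (an illustration, not the course code); the 2 x 2 test matrices
    are the same ones used in the worked example later in the
    presentation.

    /* Sequential O(n^3) matrix multiplication, C = A x B. */
    #include <stdio.h>
    #define N 2

    int main(void) {
        int a[N][N] = {{1, 2}, {3, 4}};
        int b[N][N] = {{-1, -2}, {-3, -4}};
        int c[N][N];

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int t = 0;
                for (int k = 0; k < N; k++)
                    t += a[i][k] * b[k][j];   /* row i of A times column j of B */
                c[i][j] = t;
            }

        for (int i = 0; i < N; i++) {         /* prints -7 -10 / -15 -22 */
            for (int j = 0; j < N; j++) printf("%4d", c[i][j]);
            printf("\n");
        }
        return 0;
    }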

4
Matrix multiplication in parallel
  • Using a hypercube: the algorithm given in the book assumes the
    multiplication of two n x n matrices where n is a power of 2.
  • This facilitates a hypercube network.
  • We need N = n³ = 2^(3q) processors, where n = 2^q is the size of
    the matrix.

5
  • N processors, allowing each processor to occupy a vertex of the
    hypercube.
  • Each processor Pr has a given position, where r = i·n² + j·n + k
    for 0 ≤ i, j, k ≤ n−1.
  • If r is represented by
  •   r_{3q−1} r_{3q−2} … r_{2q}  r_{2q−1} … r_q  r_{q−1} … r_0
  • then the binary representations of i, j and k are
  •   r_{3q−1} … r_{2q},  r_{2q−1} … r_q  and  r_{q−1} … r_0
    respectively.
  • This allows the processors to be positioned so that adjacent
    positions differ in only one binary digit.
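  • As an illustration (the helper code below is ours, not from the
    slides), the coordinates i, j, k are simply bit fields of r when
    n = 2^q, and flipping any single bit of r moves to a hypercube
    neighbour:

    /* Sketch: recover (i, j, k) from r = i*n^2 + j*n + k when n = 2^q. */
    #include <stdio.h>

    int main(void) {
        int q = 1, n = 1 << q;            /* n = 2, so N = n^3 = 8 processors */
        for (int r = 0; r < n * n * n; r++) {
            int k = r & (n - 1);          /* bits r_(q-1) ... r_0     */
            int j = (r >> q) & (n - 1);   /* bits r_(2q-1) ... r_q    */
            int i = r >> (2 * q);         /* bits r_(3q-1) ... r_(2q) */
            int neighbour = r ^ 1;        /* r with bit 0 flipped     */
            printf("P%d: i=%d j=%d k=%d  (neighbour across bit 0: P%d)\n",
                   r, i, j, k, neighbour);
        }
        return 0;
    }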

6
  • This also means that all processors that agree in one or two of
    the coordinates i, j, k form a hypercube.
  • Example: building a hypercube for q = 1, so that
    N = n³ = 2^(3q) = 8 processors.
  • For Pr, where r = i·n² + j·n + k, we get:

7

  •        i·n² + j·n + k = r     i j k
  • P0  →    0  +  0  + 0 = 0     0 0 0
  • P1  →    0  +  0  + 1 = 1     0 0 1
  • P2  →    0  +  2  + 0 = 2     0 1 0
  • P3  →    0  +  2  + 1 = 3     0 1 1
  • P4  →    4  +  0  + 0 = 4     1 0 0
  • P5  →    4  +  0  + 1 = 5     1 0 1
  • P6  →    4  +  2  + 0 = 6     1 1 0
  • P7  →    4  +  2  + 1 = 7     1 1 1

8
[Figure: the eight processors at the vertices of a 3-dimensional
hypercube, labelled 000, 001, 010, 011, 100, 101, 110 and 111.]
9
Processor Layout
  • Each processor Pr has 3 registers: Ar, Br and Cr.
  • The following is a step-by-step description of the algorithm.
  • [Figure: processor P0 with its three registers A, B and C.]
10
  • Step 1: The elements of A and B are distributed to the n³
    processors so that the processor in position (i,j,k) will contain
    a_{ji} and b_{ik}.
  • (1.1) Copies of the data initially in A(0,j,k) and B(0,j,k) are
    sent to the processors in positions (i,j,k), where 1 ≤ i ≤ n−1.
    Resulting in A(i,j,k) = a_{jk} and B(i,j,k) = b_{jk} for
    0 ≤ i, j, k ≤ n−1.
  • (1.2) Copies of the data in A(i,j,i) are sent to the processors
    in positions (i,j,k), where 0 ≤ k ≤ n−1. Resulting in
    A(i,j,k) = a_{ji} for 0 ≤ k ≤ n−1.
  • (1.3) Copies of the data in B(i,i,k) are sent to the processors
    in positions (i,j,k), where 0 ≤ j ≤ n−1. Resulting in
    B(i,j,k) = b_{ik} for 0 ≤ j ≤ n−1.

11
  • Step 2: Each processor in position (i,j,k) computes the product
  •   C(i,j,k) = A(i,j,k) · B(i,j,k)
  • Thus C(i,j,k) = a_{ji} · b_{ik} for 0 ≤ i, j, k ≤ n−1.
  • Step 3: The sum C(0,j,k) = Σ_i C(i,j,k), for 0 ≤ i ≤ n−1, is
    computed for 0 ≤ j, k ≤ n−1.

12
  • The algorithm
  • Step 1: (1.1)
  •   for m = 3q−1 downto 2q do
  •     for all r ∈ N(r_m = 0) do in parallel
  •       (i)  A_{r(m)} ← A_r
  •       (ii) B_{r(m)} ← B_r
  •     end for
  •   end for
  • (1.2)
  •   for m = q−1 downto 0 do
  •     for all r ∈ N(r_m = r_{2q+m}) do in parallel
  •       A_{r(m)} ← A_r
  •     end for
  •   end for

13
  • (1.3)
  •   for m = 2q−1 downto q do
  •     for all r ∈ N(r_m = r_{q+m}) do in parallel
  •       B_{r(m)} ← B_r
  •     end for
  •   end for
  • Step 2:
  •   for r = 0 to N−1 do in parallel
  •     C_r ← A_r · B_r
  •   end for

14
  • Step 3:
  •   for m = 2q to 3q−1 do
  •     for all r ∈ N(r_m = 0) do in parallel
  •       C_r ← C_r + C_{r(m)}
  •     end for
  •   end for

15
An example using 2 x 2 matrices
  • This example requires n³ = 8 processors. The matrices are
  •   A = | 1  2 |        B = | -1  -2 |
  •       | 3  4 |            | -3  -4 |

16
[Figure: the (A, B) register contents of the eight processors during
and after the distribution of Step 1. At the end of Step 1, processor
(i,j,k) holds a_{ji} in A and b_{ik} in B; for example, processor
(0,0,0) holds (1, -1) and processor (1,1,1) holds (4, -4).]
17
[Figure: the C registers after Step 2 hold the products a_{ji}·b_{ik}
(-1, -2, -3, -6, -6, -8, -12, -16); after the Step 3 sums, layer i = 0
holds the result C = A·B with entries -7, -10, -15, -22.]
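  • To make the data movement of Steps 1-3 concrete, here is a small
    serial emulation of the algorithm on the 2 x 2 example (a sketch;
    processor r's registers are modelled as array entries A[r], B[r],
    C[r], and the neighbour r(m) is computed as r XOR 2^m). It
    reproduces the products and sums shown above.

    #include <stdio.h>

    #define Q 1                        /* q = 1, so n = 2 and N = n^3 = 8 */
    #define NDIM (1 << Q)
    #define NPROC (NDIM * NDIM * NDIM)

    static int bit(int r, int m) { return (r >> m) & 1; }

    int main(void) {
        int a[2][2] = {{1, 2}, {3, 4}};
        int b[2][2] = {{-1, -2}, {-3, -4}};
        int A[NPROC] = {0}, B[NPROC] = {0}, C[NPROC] = {0};
        int q = Q, n = NDIM, r, m;

        /* Initial distribution: processor (0,j,k) holds a_jk and b_jk. */
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                A[j*n + k] = a[j][k];
                B[j*n + k] = b[j][k];
            }

        /* Step 1.1: replicate layer i = 0 across the i bits. */
        for (m = 3*q - 1; m >= 2*q; m--)
            for (r = 0; r < NPROC; r++)
                if (bit(r, m) == 0) { A[r ^ (1 << m)] = A[r]; B[r ^ (1 << m)] = B[r]; }

        /* Step 1.2: spread A(i,j,i) along the k bits, so A(i,j,k) = a_ji. */
        for (m = q - 1; m >= 0; m--)
            for (r = 0; r < NPROC; r++)
                if (bit(r, m) == bit(r, 2*q + m)) A[r ^ (1 << m)] = A[r];

        /* Step 1.3: spread B(i,i,k) along the j bits, so B(i,j,k) = b_ik. */
        for (m = 2*q - 1; m >= q; m--)
            for (r = 0; r < NPROC; r++)
                if (bit(r, m) == bit(r, q + m)) B[r ^ (1 << m)] = B[r];

        /* Step 2: local products. */
        for (r = 0; r < NPROC; r++) C[r] = A[r] * B[r];

        /* Step 3: sum over i; the entry c_jk ends up in C(0,j,k). */
        for (m = 2*q; m <= 3*q - 1; m++)
            for (r = 0; r < NPROC; r++)
                if (bit(r, m) == 0) C[r] += C[r ^ (1 << m)];

        for (int j = 0; j < n; j++) {      /* prints -7 -10 / -15 -22 */
            for (int k = 0; k < n; k++) printf("%4d", C[j*n + k]);
            printf("\n");
        }
        return 0;
    }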
18
Analysis of algorithm
  • If the layout of the processors is viewed as an n x n x n array,
    then it consists of n layers, each an n x n array of processors.
  • Initially, the first layer (i = 0) holds a distinct value from
    matrix A in each A register and a distinct value from matrix B in
    each B register. This is a constant-time operation.
  • Step 1.1: the data is copied to the other layers by repeated
    doubling (to 1 layer, then 2, then 4, and so on), so O(log n)
    iterations copy the data from layer 0 to layers 1 through n−1.

19
  • Steps 1.2 and 1.3: each processor in column i of layer i sends
    data to the processors in its row, and similarly each processor
    in row i sends data to the processors in its column. Each
    iteration takes constant time.
  • Step 2 requires constant time.
  • Step 3 is again a sequence of constant-time iterations.
  • Overall, the algorithm requires O(log n) time.
  • But the cost is O(n³ log n), which is not optimal.

20
A faster algorithm (Quinn)
  • For all Pm, where 1 ≤ m ≤ p, do in parallel
  •   for i = m to n step p do
  •     for j = 1 to n do
  •       t = 0
  •       for k = 1 to n do
  •         t = t + a_{ik} · b_{kj}
  •       c_{ij} = t
  • time → O(n³/p); maximum number of processors: n²

21
An actual implementation
  • Get the processor id.
  • The if statement below makes sure that the entire matrix is
    computed when p does not divide n evenly.
  •   chunksize = (int)(n / p);
  •   if ((chunksize * nprocs) != n) {
  •     int differ = n - (chunksize * p);
  •     if (id == 0)
  •       lower = id * chunksize;
  •     else
  •       lower = id * chunksize + differ;
  •     upper = (id + 1) * chunksize + differ;

22
  •   } else {
  •     lower = id * chunksize;
  •     upper = (id + 1) * chunksize;
  •   }
  •   for (i = lower; i < upper; i++)
  •     for (j = 0; j < n; j++) {
  •       total = 0;
  •       for (k = 0; k < n; k++)
  •         total = total + mat1[i][k] * mat2[k][j];
  •       mat3[i][j] = total;
  •     }
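  • A common alternative to the special-case code above is to give
    each processor either floor(n/p) or ceil(n/p) consecutive rows.
    The sketch below (an illustration, not the original
    implementation) computes such a range directly:

    #include <stdio.h>

    /* First and one-past-last row owned by processor id (0 <= id < p). */
    static void row_range(int id, int p, int n, int *lower, int *upper) {
        *lower = (id * n) / p;
        *upper = ((id + 1) * n) / p;
    }

    int main(void) {
        int lower, upper, n = 10, p = 4;
        for (int id = 0; id < p; id++) {
            row_range(id, p, n, &lower, &upper);
            printf("processor %d: rows %d..%d\n", id, lower, upper - 1);
        }
        return 0;
    }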

23
Another faster algorithm (Gupta and Sadayappan)
  • The 3-D diagonal algorithm is a 3-phase algorithm.
  • The concept: a hypercube of p processors is viewed as a 3-D mesh
    of size p^(1/3) x p^(1/3) x p^(1/3).
  • Matrices A and B are partitioned into p^(2/3) blocks, with
    p^(1/3) blocks along each dimension.
  • Initially, it is assumed that A and B are mapped onto the 2-D
    plane x = y, and the 2-D plane y = j is responsible for
    calculating the outer product of A_{*,j} (the set of columns of A
    stored at the processors p_{j,j,*}) and B_{j,*} (the set of rows
    of B).

24
  • Phase 1: Point-to-point communication of B_{k,i} by p_{i,i,k} to
    p_{i,k,k}.
  • Phase 2: One-to-all broadcasts of the blocks of A along the
    x-direction and of the newly acquired (from Phase 1) blocks of B
    along the z-direction, i.e. processor p_{i,i,k} broadcasts
    A_{k,i} to p_{*,i,k}, and each processor p_{i,k,k} broadcasts
    B_{k,i} to p_{i,k,*}.
  • At the end of Phase 2, every processor p_{i,j,k} has the blocks
    A_{k,j} and B_{j,i}.
  • Each processor now calculates the product of its pair of blocks
    of A and B.

25
  • Phase 3: After the computation, a reduction by addition along the
    y-direction produces the final matrix C.

26
Algorithm Analysis
  • Phase 1: Passing messages of size n²/p^(2/3) requires time
    log(p^(1/3)) · (t_s + t_w · n²/p^(2/3)), where t_s is the
    start-up time for sending a message and t_w is the time to send
    one word from a processor to its neighbour.
  • Phase 2 takes twice as much time as Phase 1.
  • Phase 3 can be completed in the same amount of time as Phase 1.
  • Overall, the algorithm takes
    (4/3 · log p,  (n²/p^(2/3)) · (4/3 · log p)),
    where the communication cost of a pair (a, b) is t_s·a + t_w·b.

27
  • Some added conditions:
  • 1. p
  • 2. Overall space used → 2n² · p^(1/3)
  • The above description is for a one-port hypercube architecture,
    in which a processor can use at most one communication link to
    send and receive data.
  • With a multi-port architecture, in which a processor can use all
    of its communication ports simultaneously, the algorithm is
    faster, reducing the above time by a factor of log(p^(1/3)).

28
The algorithm
  • Initial distribution: processor p_{i,i,k} contains A_{ki} and
    B_{ki}.
  • Program of processor p_{i,j,k}:
  •   if (i = j) then
  •     send B_{ki} to p_{i,k,k}
  •     broadcast B_{ji} to all processors p_{i,j,*}
  •   endif
  •   receive A_{kj} from p_{j,j,k}
  •   calculate I_{ki} = A_{kj} x B_{ji}
  •   send I_{ki} to p_{i,i,k}

29
  •   if (i = j) then
  •     for l = 0 to p^(1/3) − 1
  •       receive I_{ki} from p_{i,l,k}
  •       C_{ki} = C_{ki} + I_{ki}
  •     endfor
  •   endif
  • I is an intermediate matrix.
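  • A serial sketch of the three phases, using 1 x 1 "blocks" (so
    p = n³ and the cube root of p equals n) and the same 2 x 2
    example as before. The arrays A and B model what each processor
    p(i,j,k) holds once Phases 1 and 2 are done, and the summation
    models the Phase 3 reduction; this is an illustration, not the
    paper's code.

    #include <stdio.h>
    #define N 2

    int main(void) {
        int a[N][N] = {{1, 2}, {3, 4}};
        int b[N][N] = {{-1, -2}, {-3, -4}};
        int A[N][N][N], B[N][N][N], C[N][N] = {0};
        int i, j, k;

        /* Phases 1 and 2 (the communication) are collapsed into a direct
           assignment of what each processor ends up holding:
           p(i,j,k) has block A_kj and block B_ji. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++) {
                    A[i][j][k] = a[k][j];
                    B[i][j][k] = b[j][i];
                }

        /* Phase 3: each p(i,j,k) forms I_ki = A_kj * B_ji, and the I blocks
           are reduced (added) over j at p(i,i,k), giving C_ki. */
        for (i = 0; i < N; i++)
            for (k = 0; k < N; k++)
                for (j = 0; j < N; j++)
                    C[k][i] += A[i][j][k] * B[i][j][k];

        for (k = 0; k < N; k++) {          /* prints C = A x B: -7 -10 / -15 -22 */
            for (i = 0; i < N; i++) printf("%4d", C[k][i]);
            printf("\n");
        }
        return 0;
    }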

30
Matrix Transposition
  • The same concept is used here as in matrix multiplication.
  • The number of processors used is N = n² = 2^(2q), and processor
    Pr occupies position (i,j), where r = i·n + j and 0 ≤ i, j ≤ n−1.
  • Initially, processor Pr holds element a_{ij} of matrix A, where
    r = i·n + j.
  • Upon termination, processor Ps holds element a_{ij}, where
    s = j·n + i.

31
  • If r is represented by
  •   r_{2q−1} r_{2q−2} … r_q  r_{q−1} … r_1 r_0
  • then the binary representations of i and j are
  •   r_{2q−1} … r_q  and  r_{q−1} … r_0  respectively.
  • And if s is represented by
  •   s_{2q−1} s_{2q−2} … s_q  s_{q−1} … s_1 s_0
  • then the binary representations of j and i are
  •   s_{2q−1} … s_q  and  s_{q−1} … s_0  respectively.
  • Thus it can be seen that
  •   r_{2q−1} … r_q = s_{q−1} … s_0   and
  •   r_{q−1} … r_0 = s_{2q−1} … s_q
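  • In other words, s is r with its two q-bit halves exchanged. A
    small illustrative check (helper code, not from the slides):

    #include <stdio.h>

    int main(void) {
        int q = 2, n = 1 << q;                /* n = 4, N = 16 as in the example below */
        for (int r = 0; r < n * n; r++) {
            int i = r >> q, j = r & (n - 1);  /* r = i*n + j */
            int s = (j << q) | i;             /* s = j*n + i: the two halves swapped */
            printf("a(%d,%d): held by P%d, ends up in P%d\n", i, j, r, s);
        }
        return 0;
    }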

32
The algorithm
  • First, the requirements for the algorithm: each processor Pu
    needs two registers, Au and Bu.
  • The index of Pu is
  •   u = u_{2q−1} u_{2q−2} … u_q  u_{q−1} … u_1 u_0,
  • matching that of r.

33
  • for m = 2q−1 downto q do
  •   for u = 0 to N−1 do in parallel
  •     (1) if u_m ≠ u_{m−q}
  •           then B_{u(m)} ← A_u
  •         endif
  •     (2) if u_m = u_{m−q}
  •           then A_{u(m−q)} ← B_u
  •         endif
  •   endfor
  • endfor

34
Explanation of algorithm
  • This algorithm achieves the transpose of A recursively.
  • Divide the matrix into 4 submatrices of size n/2 x n/2.
  • In the first iteration, when m = 2q−1, swap the elements of the
    top-right submatrix with those of the bottom-left submatrix. The
    other 2 submatrices are not touched.
  • Now repeat this recursively within each submatrix until all of
    the elements are swapped.

35
Example.
  • We want the transpose of the following matrix:
  •       a b c d
  •   A = e f g h
  •       i j k l
  •       m n o p

36
  • We use 16 processors with the following indices
  • 0000 0001 0010 0011
  • 0100 0101 0110 0111
  • 1000 1001 1010 1011
  • 1100 1101 1110 1111

37
Drawing a hypercube for this
  • [Figure: the 16 processors at the vertices of a 4-dimensional
    hypercube, drawn as two cubes:
        6  7 | 14 15
        4  5 | 12 13
        2  3 | 10 11
        0  1 |  8  9 ]

38
  • Processor 0 (binary 0000) holds a_{00}, which is the value a.
  • Processor 1 (binary 0001) holds a_{01}, which is the value b.
  • Processor 2 (binary 0010) holds a_{02}, which is the value c.
    And so on.
  • In the first iteration, m = 2q−1 where q = 2 in this example, so
    m = 3.
  • Step 1: Each Pu with u_3 ≠ u_1 sends its element A_u to P_{u(3)},
    which stores the value in B_{u(3)}, i.e.
  • processors 2, 3, 6 and 7 send to processors 10, 11, 14 and 15
    respectively,
  • and
  • processors 8, 9, 12 and 13 send to processors 0, 1, 4 and 5
    respectively.

39
  • Step 2: Each processor that received data in Step 1 now sends the
    data from B_u to P_{u(1)}, to be stored in A_{u(1)}, i.e.
  • processors 0, 1, 4 and 5 send to 2, 3, 6 and 7 respectively,
  • processors 10, 11, 14 and 15 send to 8, 9, 12 and 13
    respectively.
  • By the end of the first iteration our matrix A looks like
  •       a b i j
  •   A = e f m n
  •       c d k l
  •       g h o p

40
  • In the second iteration, m = q = 2.
  • Step 1: Each Pu with u_2 ≠ u_0 sends A_u to P_{u(2)}, which
    stores it in B_{u(2)}. This is a simultaneous transfer:
  • from processor 4 to processor 0,
  • processor 1 to processor 5,
  • processor 6 to processor 2,
  • processor 3 to processor 7,
  • processor 12 to processor 8,
  • processor 9 to processor 13,
  • processor 14 to processor 10,
  • processor 11 to processor 15.

41
  • Step 2: For u_2 = u_0, each Pu sends B_u to P_{u(0)}, where it is
    stored in A_{u(0)}; this swaps the element in the top-right
    corner processor with that in the bottom-left corner for each of
    the 2 x 2 submatrices:
  • from processor 0 to processor 1,
  • processor 5 to processor 4,
  • processor 2 to processor 3,
  • processor 7 to processor 6,
  • processor 8 to processor 9,
  • processor 13 to processor 12,
  • processor 10 to processor 11,
  • processor 15 to processor 14.

42
  • After the second iteration the A registers hold the transposed
    matrix:
  •       a e i m
  •   A = b f j n
  •       c g k o
  •       d h l p
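  • The whole sequence above can be checked with a small serial
    emulation of the transpose algorithm (a sketch; processor u's
    registers are modelled as A[u] and B[u], and u(m) is computed as
    u XOR 2^m):

    #include <stdio.h>

    #define Q 2                      /* q = 2, so n = 4 and N = n^2 = 16 */
    #define NDIM (1 << Q)
    #define NPROC (NDIM * NDIM)

    static int bit(int u, int m) { return (u >> m) & 1; }

    int main(void) {
        char A[NPROC], B[NPROC] = {0};
        int q = Q, n = NDIM, u, m;

        /* Processor u = i*n + j initially holds a_ij ('a' .. 'p'). */
        for (u = 0; u < NPROC; u++) A[u] = (char)('a' + u);

        for (m = 2*q - 1; m >= q; m--) {
            /* (1) if u_m != u_(m-q): send A_u to P_u(m), stored in B_u(m). */
            for (u = 0; u < NPROC; u++)
                if (bit(u, m) != bit(u, m - q)) B[u ^ (1 << m)] = A[u];
            /* (2) if u_m == u_(m-q): send B_u to P_u(m-q), stored in A_u(m-q). */
            for (u = 0; u < NPROC; u++)
                if (bit(u, m) == bit(u, m - q)) A[u ^ (1 << (m - q))] = B[u];
        }

        /* The processor grid now holds the transpose of A. */
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) printf(" %c", A[i*n + j]);
            printf("\n");
        }
        return 0;
    }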

43
Algorithm Analysis
  • The algorithm takes q constant-time iterations, giving
  •   t(n) = O(log n).
  • But it takes n² processors.
  • Therefore the cost is O(n² log n), which is not optimal.

44
Bibliography
  • Akl, S.G., Parallel Computation: Models and Methods, Prentice
    Hall, 1997.
  • Drake, J.B. and Luo, Q., "A Scalable Parallel Strassen's Matrix
    Multiplication Algorithm for Distributed-Memory Computers,"
    Proceedings of the 1995 ACM Symposium on Applied Computing,
    February 1995, pp. 221-226.
  • Gupta, H. and Sadayappan, P., "Communication Efficient Matrix
    Multiplication on Hypercubes," Proceedings of the Sixth Annual
    ACM Symposium on Parallel Algorithms and Architectures, August
    1994, pp. 320-329.
  • Quinn, M.J., Parallel Computing: Theory and Practice, McGraw-Hill,
    1997.