Title: CSE 160 - Lecture 11
1. CSE 160 - Lecture 11
- Computation/Communication Analysis - Matrix Multiply
2. Granularity
- Machine granularity has been defined as MFLOPS / (MB/sec) = FLOP/Byte
- This tries to indicate the balance between computation and communication.
- For parallel computation it is important to understand how much computation could be accomplished while sending/receiving a message
3. Message Startup Latency
- Granularity as defined only tells part of the story
- If it told the whole story, then message startup latency would not be important.
- Message startup latency - the time it takes to start sending a message of any length
- This latency is approximated by measuring the latency of a zero-byte message
- There are other measures that are important, too.
4. Back of the envelope calculations
- Suppose we have a 733 MHz Pentium, Myrinet (100 MB/sec), and a zero-length message latency of 10 microseconds
- Granularity is 733/100 ≈ 7.3
- ~7 Flops can be computed for every byte of data sent
- A double precision float is 8 bytes
- (8 × 7) = 56 Flops for every DP float sent on the network (hmmm)
- In 10 microseconds we can accomplish ~7,333 Flops
- Every float takes 0.08 microseconds to transmit
- 100 MB/sec = 100 bytes/microsecond
- 125 floats can be transmitted per startup latency (the arithmetic is checked in the sketch below)
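These back-of-envelope numbers can be checked with a few lines of C. This is only a sketch using the figures assumed on this slide (733 MFLOPS, 100 MB/sec, 10 microsecond latency); it keeps full precision rather than rounding 7.3 down to 7 the way the slide does, so the flops-per-float figure comes out slightly above 56.

#include <stdio.h>

int main(void) {
    /* Machine parameters assumed on this slide (illustrative, not measured). */
    double mflops      = 733.0;   /* peak rate, MFLOPS (one flop per cycle assumed) */
    double mb_per_sec  = 100.0;   /* Myrinet bandwidth, MB/sec                      */
    double latency_us  = 10.0;    /* zero-byte message latency, microseconds        */
    double float_bytes = 8.0;     /* double precision float                         */

    double granularity     = mflops / mb_per_sec;        /* flops per byte                    */
    double flops_per_float = granularity * float_bytes;  /* flops per DP float sent           */
    double flops_per_lat   = mflops * latency_us;        /* MFLOPS == flops per microsecond   */
    double us_per_float    = float_bytes / mb_per_sec;   /* MB/sec == bytes per microsecond   */
    double floats_per_lat  = latency_us / us_per_float;  /* floats per startup latency period */

    printf("granularity        : %.1f flop/byte\n", granularity);
    printf("flops per DP float : %.0f\n", flops_per_float);
    printf("flops per latency  : %.0f\n", flops_per_lat);
    printf("us per float       : %.2f\n", us_per_float);
    printf("floats per latency : %.0f\n", floats_per_lat);
    return 0;
}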
5. First Interpretation
- For a 50/50 balance (comp/comm)
- Compute ~7,333 Flops, transmit 1 float
- If done serially (compute -> message -> compute -> ...)
- Throughput of the CPU is cut in 1/2
- Only computing 1/2 of the time, messaging the other half
- For a 90/10 balance (comp/comm)
- Compute (9 × 7,333 ≈ 66,000) Flops, transmit 1 float
- If done serially, 90% of the time computing, 10% messaging
6. One more calculation
- 733 MHz PIII, 100 Mbit Ethernet (~10 MB/sec), 100 microsecond latency
- Granularity is 733/10 ≈ 73.3
- In one latency period we can do ~73,333 Flops
- 90/10 requires (9 × 73,333) ≈ 660,000 Flops
- If latency isn't the constraint, but transmission time is, then we have to balance the computation time with the communication time
7. Matrix-Matrix Multiply
- Given two square (N x N) matrices B and C, we want to compute their product A
- The total number of Flops is O(2N^3)
- There are numerous ways to efficiently parallelize matrix multiply; we'll pick a simple method and analyze its communication and computation costs
8. Matrix Multiply - Review: 4x4 example

    A11 A12 A13 A14     B11 B12 B13 B14     C11 C12 C13 C14
    A21 A22 A23 A24  =  B21 B22 B23 B24  ×  C21 C22 C23 C24
    A31 A32 A33 A34     B31 B32 B33 B34     C31 C32 C33 C34
    A41 A42 A43 A44     B41 B42 B43 B44     C41 C42 C43 C44

In general, entry Akm = dot product of the kth row of B with the mth column of C.

Example: A32 = B31·C12 + B32·C22 + B33·C32 + B34·C42
9. How much computation for N x N?
- Akm = Σj Bkj·Cjm, j = 1, 2, ..., N
- Count multiplies: N
- Count adds: N
- So every element of A takes 2N Flops
- There are N×N elements in the result
- Total is (flops/element) × (elements)
- (2N)(N×N) = O(2N^3)  (see the serial triple loop below)
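For reference, the 2N^3 count corresponds to the familiar triple loop. This is just an illustrative serial sketch written to match the notation above (A = B × C), not the parallel algorithm analyzed later.

/* Serial matrix multiply A = B * C for N x N matrices stored row-major.
 * The innermost statement does 1 multiply + 1 add and executes N*N*N times,
 * giving the 2N^3 flop count quoted above. */
void matmul(int N, const double *B, const double *C, double *A) {
    for (int k = 0; k < N; k++)            /* row of A (and of B)    */
        for (int m = 0; m < N; m++) {      /* column of A (and of C) */
            double sum = 0.0;
            for (int j = 0; j < N; j++)    /* dot product of length N */
                sum += B[k*N + j] * C[j*N + m];
            A[k*N + m] = sum;
        }
}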
10. The matrix elements can be matrices!
- Matrix elements can themselves be matrices.
- E.g., B31·C12 would itself be a matrix multiply
- We can think about matrix blocks of size q×q.
- Total computation is (number of block multiplies) × 2q^3
- Let's formulate two parallel algorithms for MM (a blocked serial sketch follows)
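To make the blocking concrete, here is a minimal serial sketch that multiplies N x N matrices in q x q blocks, assuming q divides N; each pass through the innermost triple loop is one q x q block multiply-accumulate costing 2q^3 flops. The function name and layout are illustrative only.

/* Blocked A = B * C: loop over q x q blocks; the innermost triple loop
 * is a 2q^3-flop block multiply-accumulate. Assumes q divides N. */
void matmul_blocked(int N, int q, const double *B, const double *C, double *A) {
    for (int i = 0; i < N*N; i++) A[i] = 0.0;

    for (int kb = 0; kb < N; kb += q)
        for (int mb = 0; mb < N; mb += q)
            for (int jb = 0; jb < N; jb += q)
                /* A[kb.., mb..] += B[kb.., jb..] * C[jb.., mb..], each block q x q */
                for (int k = kb; k < kb + q; k++)
                    for (int m = mb; m < mb + q; m++) {
                        double sum = 0.0;
                        for (int j = jb; j < jb + q; j++)
                            sum += B[k*N + j] * C[j*N + m];
                        A[k*N + m] += sum;
                    }
}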
11. First question to ask
- What can be computed independently?
- Does the computation of Akm depend at all on the computation of Apq?
- Another way of asking: does the order in which we compute the elements of A matter?
- Not for MM
- In general, calculations will depend on previous results; this might limit the amount of parallelism
12. Simple Parallelism
Divide the computation of A into blocks. Assign each block to a different processor (16 in this case). The computation of each block of A can then be done in parallel. In theory, this should be 16x faster than on one processor. (A sketch of the block-to-processor mapping follows.)

Computation of A mapped onto a 4x4 processor grid
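A minimal sketch of that mapping, assuming the N x N result is split into p x p blocks of size q = N/p and block (k, m) is owned by grid processor (k, m). The function name and the row-major numbering of the grid are assumptions, not from the lecture.

/* Map block row/column (k, m), 0-based, to a rank on a p x p processor grid.
 * With p = 4 this gives the 16-processor assignment described above. */
int block_owner(int k, int m, int p) {
    return k * p + m;   /* row-major numbering of the processor grid */
}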
13. Next Questions to Ask
- How will the three matrices be stored in a parallel program? Two choices:
- Every node gets a complete copy of all the matrices (could run out of memory)
- Distribute the storage to different computers so that each node holds only some part of each matrix
- Can compute on much larger matrices!
14. Every Processor Gets a Copy of the Matrices
- This is a good paradigm for a shared-memory machine
- Every processor can share the one copy of the matrix
- For distributed memory machines, we need p×p times more total memory on a p×p processor grid.
- May not be practical for extremely large matrices: one 1000x1000 DP matrix is 8 MB.
- If we ignore the cost of initially distributing the matrices across the multiple memories, then parallel MM runs p×p times faster than on a single CPU.
15. Case Two
- Matrices A, B, and C are distributed so that each processor has only some part of each matrix.
- We call this a parallel data decomposition
- Examples include block, cyclic, and strip decompositions (see the sketch below)
- Why would you do this?
- If you only need to multiply two matrices, then today's machines can locally store fairly large matrices, but...
- Codes often contain 10s to 100s of similarly sized matrices.
- Local memory starts to become a relatively scarce commodity
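To make "block" and "cyclic" concrete, here is a small sketch of the two most common 1-D row decompositions; the function names are illustrative and not from the lecture.

/* Owner of row i (0-based) among P processors, for an N-row matrix.
 * Block:  rows are assigned in contiguous chunks of q = N/P rows.
 * Cyclic: rows are dealt out round-robin, one at a time. */
int block_row_owner(int i, int N, int P)  { return i / (N / P); }  /* assumes P divides N */
int cyclic_row_owner(int i, int P)        { return i % P; }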
16. Let's Pick a Row Decomposition
- Basic idea:
- Assume P processors and N = qP
- Assign q rows of each of the matrices to each processor (a sketch of the per-process storage follows)

    Proc 0:  block-row 1 of A, B, and C   (A11..A14, B11..B14, C11..C14)
    Proc 1:  block-row 2 of A, B, and C   (A21..A24, B21..B24, C21..C24)
    Proc 2:  block-row 3 of A, B, and C   (A31..A34, B31..B34, C31..C34)
    Proc 3:  block-row 4 of A, B, and C   (A41..A44, B41..B44, C41..C44)

Each block Akm is q×q.
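A sketch of what each process actually stores under this decomposition; the struct and field names are hypothetical, not from the lecture.

/* Per-process storage for the row decomposition: process z (0-based)
 * holds global rows z*q .. z*q + q - 1 of A, B, and C, each stored
 * locally as a q x N row-major strip. */
typedef struct {
    int     z;          /* my process id, 0..P-1         */
    int     q, N;       /* strip height and matrix size  */
    double *A, *B, *C;  /* each points to q*N doubles    */
} RowStrips;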
17. What do we need to compute a block?
- Consider a block Akm
- We need the kth row of B and the mth column of C
- Where are these located?
- All of the kth row of B is on the same processor as the one that holds the kth row of A
- Why? Our chosen decomposition of the data
- The mth column of C is distributed among all processors
- Why? Our chosen decomposition.
18. Let's assume some memory constraints
- Each process has just enough memory to hold 3 q×N matrices
- Enough to get all the data in place to easily compute the row/column dot product
- Can do this on each processor one q×q block at a time.
- How much computation is needed to compute a single q×q entry in a row? 2q^3 Flops
- How much data needs to be transferred? 3q^2 floats: we need q×q blocks of data from 3 neighbors (the column of C)
19. Basic Algorithm - SPMD
- Assume my processor id is z (z = 1, ..., p)
- p is the number of processors
- Blocks are q×q
- for (j = 1; j <= p; j++)
-   for (k = 1; k <= p; k++), k != z
-     send Czj to processor k
-   for (k = 1; k <= p; k++), k != z
-     receive Ckj from processor k
-   compute Azj locally
-   barrier()
- (a hedged MPI version of this loop is sketched below)
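An MPI-flavored sketch of the loop above, with several assumptions not in the slide: ranks are 0-based rather than 1-based, each rank stores its block-row of B and C as arrays of q*q blocks, the function name and layout are made up, and the sends are non-blocking to avoid the deadlock a naive send-everything-then-receive ordering can cause with blocking sends.

/* SPMD row-decomposed A = B * C. Rank z owns block-row z of B and C and
 * computes block-row z of A. For each block-column j, everyone exchanges
 * their C block so each rank holds the full block-column of C. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void spmd_matmul(int p, int q, int z,
                 double **Bblocks,   /* Bblocks[k] = my block B_zk, q*q doubles */
                 double **Cblocks,   /* Cblocks[j] = my block C_zj, q*q doubles */
                 double **Ablocks)   /* Ablocks[j] = my result A_zj, q*q doubles */
{
    double *col = malloc((size_t)p * q * q * sizeof(double));       /* block-column j of C */
    MPI_Request *reqs = malloc((size_t)(p - 1) * sizeof(MPI_Request));

    for (int j = 0; j < p; j++) {
        /* Send my block C_zj to every other rank. */
        int r = 0;
        for (int k = 0; k < p; k++)
            if (k != z)
                MPI_Isend(Cblocks[j], q * q, MPI_DOUBLE, k, j, MPI_COMM_WORLD, &reqs[r++]);

        /* Receive C_kj from every other rank; keep my own block too. */
        memcpy(&col[z * q * q], Cblocks[j], (size_t)q * q * sizeof(double));
        for (int k = 0; k < p; k++)
            if (k != z)
                MPI_Recv(&col[k * q * q], q * q, MPI_DOUBLE, k, j,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Waitall(p - 1, reqs, MPI_STATUSES_IGNORE);

        /* Compute A_zj = sum over k of B_zk * C_kj (each a q x q block multiply). */
        memset(Ablocks[j], 0, (size_t)q * q * sizeof(double));
        for (int k = 0; k < p; k++)
            for (int a = 0; a < q; a++)
                for (int b = 0; b < q; b++) {
                    double sum = 0.0;
                    for (int c = 0; c < q; c++)
                        sum += Bblocks[k][a * q + c] * col[k * q * q + c * q + b];
                    Ablocks[j][a * q + b] += sum;
                }

        MPI_Barrier(MPI_COMM_WORLD);
    }
    free(col);
    free(reqs);
}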
20. Let's compute some work
- For each iteration, each process does the following:
- Computes 2q^3 Flops
- Sends/receives 3q^2 floats
- We've added the transmission of data to the basic algorithm.
- For what size q does the time required for data transmission balance the time for computation?
21. Some Calculations
- Using just a granularity measure, how do we choose a minimum q (for 50/50)?
- On Myrinet, we need to perform 56 Flops for every float transferred
- We need to perform 2q^3 Flops/iteration. Assume a Flop costs 1 unit.
- Each float transferred costs 56 units
- So when is Flop cost > transfer cost?
- 2q^3 > 3×56×q^2  =>  q > 84, i.e. q ≈ 80
- For an 80×80 block at 733 MFLOPS, this computation takes about 1.4 ms (checked in the sketch below)
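A short sketch checking those numbers under the same assumptions (56 flop-units per float transferred, 3q^2 floats moved, 2q^3 flops computed, 733 MFLOPS):

#include <stdio.h>

int main(void) {
    double flops_per_float = 56.0;                  /* Myrinet figure from slide 4          */
    double q  = 3.0 * flops_per_float / 2.0;        /* 2q^3 = 3*56*q^2  =>  q = 84          */
    double fl = 2.0 * 80.0 * 80.0 * 80.0;           /* compute work for a size-80 block     */
    double ms = fl / 733e6 * 1e3;                   /* time at 733 MFLOPS, in milliseconds  */
    printf("break-even q ~ %.0f\n", q);             /* ~84, i.e. roughly 80                 */
    printf("2*80^3 flops at 733 MFLOPS ~ %.1f ms\n", ms);  /* ~1.4 ms                       */
    return 0;
}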
22. What does 50/50 really mean?
- On two processors, it takes the same amount of time as on one
- On four, with scaling, it goes twice as fast
- 50% efficiency
23. Final Thoughts
- This decomposition of the data is not close to optimal
- We could get more parallelism by running on a p×p processor grid and having each processor do just one block multiply
- MM is highly parallelizable, and better algorithms get good efficiencies even on workstations.