Title: CSE 160 - Lecture 11
1. CSE 160 - Lecture 11
- Computation/Communication Analysis - Matrix Multiply
2. Granularity
- Machine granularity has been defined as MFLOPS / (MB/sec) = FLOP/Byte
- This tries to indicate the balance between computation and communication.
- For parallel computation it is important to understand how much computation could be accomplished while sending/receiving a message
3. Message Startup Latency
- Granularity as defined only tells part of the story
- If it told the whole story, then message startup latency would not be important.
- Message startup latency - the time it takes to start sending a message of any length
- This latency is approximated by measuring the latency of a zero-byte message
- There are other measures that are important, too.
4. Back of the envelope calculations
- Suppose we have a 733 MHz Pentium, Myrinet (100 MB/sec), and a zero-length message latency of 10 microseconds
- Granularity is 733/100 ≈ 7.3
- ~7 Flops can be computed for every byte of data sent
- A double precision float is 8 bytes
- (8 × 7) = 56 Flops for every DP float sent on the network (hmmm)
- In 10 microseconds we can accomplish ~7,333 Flops
- Every float takes 0.08 microseconds to transmit
- 100 MB/sec = 100 bytes/microsecond
- 125 floats can be transmitted per startup latency (the arithmetic is checked in the sketch below)
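These back-of-envelope numbers can be checked with a few lines of C. This is only a sketch using the figures assumed on this slide (733 MFLOPS, 100 MB/sec, 10 microsecond latency); it keeps full precision rather than rounding 7.3 down to 7 the way the slide does, so the flops-per-float figure comes out slightly above 56.

#include <stdio.h>

int main(void) {
    /* Machine parameters assumed on this slide (illustrative, not measured). */
    double mflops      = 733.0;   /* peak rate, MFLOPS (one flop per cycle assumed) */
    double mb_per_sec  = 100.0;   /* Myrinet bandwidth, MB/sec                      */
    double latency_us  = 10.0;    /* zero-byte message latency, microseconds        */
    double float_bytes = 8.0;     /* double precision float                         */

    double granularity     = mflops / mb_per_sec;        /* flops per byte                    */
    double flops_per_float = granularity * float_bytes;  /* flops per DP float sent           */
    double flops_per_lat   = mflops * latency_us;        /* MFLOPS == flops per microsecond   */
    double us_per_float    = float_bytes / mb_per_sec;   /* MB/sec == bytes per microsecond   */
    double floats_per_lat  = latency_us / us_per_float;  /* floats per startup latency period */

    printf("granularity        : %.1f flop/byte\n", granularity);
    printf("flops per DP float : %.0f\n", flops_per_float);
    printf("flops per latency  : %.0f\n", flops_per_lat);
    printf("us per float       : %.2f\n", us_per_float);
    printf("floats per latency : %.0f\n", floats_per_lat);
    return 0;
}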
5. First Interpretation
- For a 50/50 balance (comp/comm)
- Compute ~7,333 Flops, transmit 1 float
- If done serially (compute -> message -> compute -> ...)
- Throughput of the CPU is cut in 1/2
- Only computing 1/2 of the time, messaging the other half
- For a 90/10 balance (comp/comm)
- Compute (9 × 7,333 ≈ 66,000) Flops, transmit 1 float
- If done serially, 90% of the time computing, 10% messaging
6. One more calculation
- 733 MHz PIII, 100 Mbit Ethernet (~10 MB/sec), 100 microsecond latency
- Granularity is 733/10 ≈ 73.3
- In one latency period we can do ~73,333 Flops
- 90/10 requires (9 × 73,333) ≈ 660,000 Flops
- If latency isn't the constraint, but transmission time is, then we have to balance the computation time with the communication time
7. Matrix-Matrix Multiply
- Given two square (N x N) matrices B and C, we want to compute their product A
- The total number of Flops is O(2N^3)
- There are numerous ways to efficiently parallelize matrix multiply; we'll pick a simple method and analyze its communication and computation costs
8. Matrix Multiply - Review: 4x4 example

    A11 A12 A13 A14     B11 B12 B13 B14     C11 C12 C13 C14
    A21 A22 A23 A24  =  B21 B22 B23 B24  ×  C21 C22 C23 C24
    A31 A32 A33 A34     B31 B32 B33 B34     C31 C32 C33 C34
    A41 A42 A43 A44     B41 B42 B43 B44     C41 C42 C43 C44

In general, entry Akm = dot product of the kth row of B with the mth column of C.

Example: A32 = B31·C12 + B32·C22 + B33·C32 + B34·C42
9. How much computation for N x N?
- Akm = Σj Bkj·Cjm, j = 1, 2, ..., N
- Count multiplies: N
- Count adds: N
- So every element of A takes 2N Flops
- There are N×N elements in the result
- Total is (flops/element) × (elements)
- (2N)(N×N) = O(2N^3)  (see the serial triple loop below)
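For reference, the 2N^3 count corresponds to the familiar triple loop. This is just an illustrative serial sketch written to match the notation above (A = B × C), not the parallel algorithm analyzed later.

/* Serial matrix multiply A = B * C for N x N matrices stored row-major.
 * The innermost statement does 1 multiply + 1 add and executes N*N*N times,
 * giving the 2N^3 flop count quoted above. */
void matmul(int N, const double *B, const double *C, double *A) {
    for (int k = 0; k < N; k++)            /* row of A (and of B)    */
        for (int m = 0; m < N; m++) {      /* column of A (and of C) */
            double sum = 0.0;
            for (int j = 0; j < N; j++)    /* dot product of length N */
                sum += B[k*N + j] * C[j*N + m];
            A[k*N + m] = sum;
        }
}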
10. The matrix elements can be matrices!
- Matrix elements can themselves be matrices.
- E.g., B31·C12 would itself be a matrix multiply
- We can think about matrix blocks of size q×q.
- Total computation is (number of block multiplies) × 2q^3
- Let's formulate two parallel algorithms for MM (a blocked serial sketch follows)
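To make the blocking concrete, here is a minimal serial sketch that multiplies N x N matrices in q x q blocks, assuming q divides N; each pass through the innermost triple loop is one q x q block multiply-accumulate costing 2q^3 flops. The function name and layout are illustrative only.

/* Blocked A = B * C: loop over q x q blocks; the innermost triple loop
 * is a 2q^3-flop block multiply-accumulate. Assumes q divides N. */
void matmul_blocked(int N, int q, const double *B, const double *C, double *A) {
    for (int i = 0; i < N*N; i++) A[i] = 0.0;

    for (int kb = 0; kb < N; kb += q)
        for (int mb = 0; mb < N; mb += q)
            for (int jb = 0; jb < N; jb += q)
                /* A[kb.., mb..] += B[kb.., jb..] * C[jb.., mb..], each block q x q */
                for (int k = kb; k < kb + q; k++)
                    for (int m = mb; m < mb + q; m++) {
                        double sum = 0.0;
                        for (int j = jb; j < jb + q; j++)
                            sum += B[k*N + j] * C[j*N + m];
                        A[k*N + m] += sum;
                    }
}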
11. First question to ask
- What can be computed independently?
- Does the computation of Akm depend at all on the computation of Apq?
- Another way of asking: does the order in which we compute the elements of A matter?
- Not for MM
- In general, calculations will depend on previous results; this might limit the amount of parallelism
12. Simple Parallelism
Divide the computation of A into blocks. Assign each block to a different processor (16 in this case). The computation of each block of A can then be done in parallel. In theory, this should be 16x faster than on one processor. (A sketch of the block-to-processor mapping follows.)

Computation of A mapped onto a 4x4 processor grid
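A minimal sketch of that mapping, assuming the N x N result is split into p x p blocks of size q = N/p and block (k, m) is owned by grid processor (k, m). The function name and the row-major numbering of the grid are assumptions, not from the lecture.

/* Map block row/column (k, m), 0-based, to a rank on a p x p processor grid.
 * With p = 4 this gives the 16-processor assignment described above. */
int block_owner(int k, int m, int p) {
    return k * p + m;   /* row-major numbering of the processor grid */
}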
13. Next Questions to Ask
- How will the three matrices be stored in a parallel program? Two choices:
- Every node gets a complete copy of all the matrices (could run out of memory)
- Distribute the storage to different computers so that each node holds only some part of each matrix
- Can compute on much larger matrices!
14. Every Processor Gets a Copy of the Matrices
- This is a good paradigm for a shared-memory machine
- Every processor can share the one copy of the matrix
- For distributed memory machines, we need p×p times more total memory on a p×p processor grid.
- May not be practical for extremely large matrices: one 1000x1000 DP matrix is 8 MB.
- If we ignore the cost of initially distributing the matrices across the multiple memories, then parallel MM runs p×p times faster than on a single CPU.
15. Case Two
- Matrices A, B, and C are distributed so that each processor has only some part of each matrix.
- We call this a parallel data decomposition
- Examples include block, cyclic, and strip decompositions (see the sketch below)
- Why would you do this?
- If you only need to multiply two matrices, then today's machines can locally store fairly large matrices, but...
- Codes often contain 10s to 100s of similarly sized matrices.
- Local memory starts to become a relatively scarce commodity
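To make "block" and "cyclic" concrete, here is a small sketch of the two most common 1-D row decompositions; the function names are illustrative and not from the lecture.

/* Owner of row i (0-based) among P processors, for an N-row matrix.
 * Block:  rows are assigned in contiguous chunks of q = N/P rows.
 * Cyclic: rows are dealt out round-robin, one at a time. */
int block_row_owner(int i, int N, int P)  { return i / (N / P); }  /* assumes P divides N */
int cyclic_row_owner(int i, int P)        { return i % P; }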
16. Let's Pick a Row Decomposition
- Basic idea:
- Assume P processors and N = qP
- Assign q rows of each of the matrices to each processor (a sketch of the per-process storage follows)

    Proc 0:  block-row 1 of A, B, and C   (A11..A14, B11..B14, C11..C14)
    Proc 1:  block-row 2 of A, B, and C   (A21..A24, B21..B24, C21..C24)
    Proc 2:  block-row 3 of A, B, and C   (A31..A34, B31..B34, C31..C34)
    Proc 3:  block-row 4 of A, B, and C   (A41..A44, B41..B44, C41..C44)

Each block Akm is q×q.
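A sketch of what each process actually stores under this decomposition; the struct and field names are hypothetical, not from the lecture.

/* Per-process storage for the row decomposition: process z (0-based)
 * holds global rows z*q .. z*q + q - 1 of A, B, and C, each stored
 * locally as a q x N row-major strip. */
typedef struct {
    int     z;          /* my process id, 0..P-1         */
    int     q, N;       /* strip height and matrix size  */
    double *A, *B, *C;  /* each points to q*N doubles    */
} RowStrips;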
17. What do we need to compute a block?
- Consider a block Akm
- We need the kth row of B and the mth column of C
- Where are these located?
- All of the kth row of B is on the same processor as the one that holds the kth row of A
- Why? Our chosen decomposition of the data
- The mth column of C is distributed among all processors
- Why? Our chosen decomposition.
18. Let's assume some memory constraints
- Each process has just enough memory to hold 3 q×N matrices
- Enough to get all the data in place to easily compute the row/column dot product
- Can do this on each processor one q×q block at a time.
- How much computation is needed to compute a single q×q entry in a row? 2q^3 Flops
- How much data needs to be transferred? 3q^2 floats: we need q×q blocks of data from 3 neighbors (the column of C)
19. Basic Algorithm - SPMD
- Assume my processor id is z (z = 1, ..., p)
- p is the number of processors
- Blocks are q×q
- for (j = 1; j <= p; j++)
-   for (k = 1; k <= p; k++), k != z
-     send Czj to processor k
-   for (k = 1; k <= p; k++), k != z
-     receive Ckj from processor k
-   compute Azj locally
-   barrier()
- (a hedged MPI version of this loop is sketched below)
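An MPI-flavored sketch of the loop above, with several assumptions not in the slide: ranks are 0-based rather than 1-based, each rank stores its block-row of B and C as arrays of q*q blocks, the function name and layout are made up, and the sends are non-blocking to avoid the deadlock a naive send-everything-then-receive ordering can cause with blocking sends.

/* SPMD row-decomposed A = B * C. Rank z owns block-row z of B and C and
 * computes block-row z of A. For each block-column j, everyone exchanges
 * their C block so each rank holds the full block-column of C. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void spmd_matmul(int p, int q, int z,
                 double **Bblocks,   /* Bblocks[k] = my block B_zk, q*q doubles */
                 double **Cblocks,   /* Cblocks[j] = my block C_zj, q*q doubles */
                 double **Ablocks)   /* Ablocks[j] = my result A_zj, q*q doubles */
{
    double *col = malloc((size_t)p * q * q * sizeof(double));       /* block-column j of C */
    MPI_Request *reqs = malloc((size_t)(p - 1) * sizeof(MPI_Request));

    for (int j = 0; j < p; j++) {
        /* Send my block C_zj to every other rank. */
        int r = 0;
        for (int k = 0; k < p; k++)
            if (k != z)
                MPI_Isend(Cblocks[j], q * q, MPI_DOUBLE, k, j, MPI_COMM_WORLD, &reqs[r++]);

        /* Receive C_kj from every other rank; keep my own block too. */
        memcpy(&col[z * q * q], Cblocks[j], (size_t)q * q * sizeof(double));
        for (int k = 0; k < p; k++)
            if (k != z)
                MPI_Recv(&col[k * q * q], q * q, MPI_DOUBLE, k, j,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Waitall(p - 1, reqs, MPI_STATUSES_IGNORE);

        /* Compute A_zj = sum over k of B_zk * C_kj (each a q x q block multiply). */
        memset(Ablocks[j], 0, (size_t)q * q * sizeof(double));
        for (int k = 0; k < p; k++)
            for (int a = 0; a < q; a++)
                for (int b = 0; b < q; b++) {
                    double sum = 0.0;
                    for (int c = 0; c < q; c++)
                        sum += Bblocks[k][a * q + c] * col[k * q * q + c * q + b];
                    Ablocks[j][a * q + b] += sum;
                }

        MPI_Barrier(MPI_COMM_WORLD);
    }
    free(col);
    free(reqs);
}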
20. Let's compute some work
- For each iteration, each process does the following:
- Computes 2q^3 Flops
- Sends/receives 3q^2 floats
- We've added the transmission of data to the basic algorithm.
- For what size q does the time required for data transmission balance the time for computation?
21. Some Calculations
- Using just a granularity measure, how do we choose a minimum q (for 50/50)?
- On Myrinet, we need to perform 56 Flops for every float transferred
- We need to perform 2q^3 Flops/iteration. Assume a Flop costs 1 unit.
- Each float transferred costs 56 units
- So when is Flop cost > transfer cost?
- 2q^3 > 3×56×q^2  =>  q > 84, i.e. q ≈ 80
- For an 80×80 block at 733 MFLOPS, this computation takes about 1.4 ms (checked in the sketch below)
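A short sketch checking those numbers under the same assumptions (56 flop-units per float transferred, 3q^2 floats moved, 2q^3 flops computed, 733 MFLOPS):

#include <stdio.h>

int main(void) {
    double flops_per_float = 56.0;                  /* Myrinet figure from slide 4          */
    double q  = 3.0 * flops_per_float / 2.0;        /* 2q^3 = 3*56*q^2  =>  q = 84          */
    double fl = 2.0 * 80.0 * 80.0 * 80.0;           /* compute work for a size-80 block     */
    double ms = fl / 733e6 * 1e3;                   /* time at 733 MFLOPS, in milliseconds  */
    printf("break-even q ~ %.0f\n", q);             /* ~84, i.e. roughly 80                 */
    printf("2*80^3 flops at 733 MFLOPS ~ %.1f ms\n", ms);  /* ~1.4 ms                       */
    return 0;
}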
22. What does 50/50 really mean?
- On two processors, it takes the same amount of time as on one
- On four, with scaling, it goes twice as fast
- 50% efficiency
23. Final Thoughts
- This decomposition of the data is not close to optimal
- We could get more parallelism by running on a p×p processor grid and having each processor do just one block multiply
- MM is highly parallelizable, and better algorithms get good efficiencies even on workstations.