Transcript and Presenter's Notes

Title: CSE 160 - Lecture 11


1
CSE 160 - Lecture 11
  • Computation/Communication Analysis - Matrix
    Multiply

2
Granularity
  • Machine granularity has been defined as
    MFLOPS / (MB/sec) = FLOPs per byte
  • This tries to indicate the balance between
    computation and communication.
  • For parallel computation it is important to
    understand how much computation could be
    accomplished while sending/receiving a message

3
Message Startup Latency
  • Granularity as defined only tells part of the
    story
  • If it told the whole story, then message startup
    latency would not be important.
  • Message startup latency - the time it takes to
    start sending a message of any length
  • This latency is approximated by measuring the
    latency of a zero-byte message
  • There are other measures that are important, too.

4
Back of the envelope calculations
  • Suppose we have a 733 MHz Pentium, Myrinet (100
    MB/sec), and a zero-length message latency of 10
    microseconds
  • Granularity is 733/100 ≈ 7.3
  • About 7 Flops can be computed for every byte of data
    sent
  • A double-precision float is 8 bytes
  • (8 × 7) = 56 Flops for every DP float sent on the
    network (hmmm)
  • In 10 microseconds the CPU can accomplish ≈ 7,333 Flops
  • Every float takes 0.08 microseconds to transmit
  • 100 MB/sec = 100 bytes/microsecond
  • So 125 floats can be transmitted in one startup-latency
    period
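These numbers are easy to reproduce; here is a small sketch (plain C, with the machine parameters above hard-coded as assumptions) that prints each quantity:

```c
#include <stdio.h>

int main(void) {
    /* Assumed machine parameters from the slide */
    double mflops      = 733.0;   /* peak rate, MFLOPS               */
    double bandwidth   = 100.0;   /* network bandwidth, MB/sec       */
    double latency_us  = 10.0;    /* zero-byte message latency, us   */
    double float_bytes = 8.0;     /* double-precision float size, B  */

    double granularity       = mflops / bandwidth;          /* FLOPs per byte      */
    double flops_per_float   = granularity * float_bytes;   /* FLOPs per DP float  */
    double flops_per_latency = mflops * latency_us;         /* 1 MFLOPS = 1 flop/us */
    double float_time_us     = float_bytes / bandwidth;     /* bytes / (bytes/us)  */
    double floats_per_latency = latency_us / float_time_us;

    printf("granularity        = %.1f FLOPs/byte\n", granularity);
    printf("FLOPs per DP float = %.0f\n", flops_per_float);
    printf("FLOPs per latency  = %.0f\n", flops_per_latency);
    printf("us per float       = %.2f\n", float_time_us);
    printf("floats per latency = %.0f\n", floats_per_latency);
    return 0;
}
```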

5
First Interpretation
  • For a 50/50 balance (comp/comm)
  • Compute 7,333 Flops, transmit 1 float
  • If done serially (compute → message → compute → …)
  • Throughput of the CPU is cut in half
  • Only computing 1/2 of the time, messaging the other
    half
  • For a 90/10 balance (comp/comm)
  • Compute (9 × 7,333 ≈ 66,000) Flops per float
    transmitted
  • If done serially, 90% of the time is spent computing,
    10% messaging

6
One more calculation
  • 733 MHz PIII, 100 Mbit Ethernet, 100 microsecond
    latency
  • Granularity is 733/10 ≈ 73.3
  • In one latency period the CPU can do ≈ 73,333 Flops
  • 90/10 requires (9 × 73,333) ≈ 660,000 Flops per float
    transmitted
  • If latency isn't the constraint, but transmission
    time is, then we have to balance the computation
    time with the communication time

7
Matrix - Matrix Multiply
  • Given two square matrices (N x N) A, B, want to
    multiply them together
  • The total number of FLOPs is O(2N³)
  • There are numerous ways to efficiently
    parallelize matrix multiply; we'll pick a simple
    method and analyze its communication and
    computation costs

8
Matrix Multiply - Review: 4x4 example
[Figure: 4x4 block matrices A = B × C, with blocks
A11..A44, B11..B44, and C11..C44.]
In general, entry Akm = dot product of the kth row of B
with the mth column of C.
Example: A32 = B31·C12 + B32·C22 + B33·C32 + B34·C42
9
How much computation for NxN?
  • Akm = Σj Bkj · Cjm,  j = 1, 2, …, N
  • Count multiplies: N
  • Count adds: N
  • So every element of A takes 2N Flops
  • There are N × N elements in the result
  • Total is (flops/element) × (elements)
  • (2N)(N × N) = 2N³ = O(2N³)
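For reference, a minimal serial triple loop (a sketch with hypothetical names, following the deck's A = B · C convention) that performs exactly the N multiplies and N adds per element counted above:

```c
/* Serial N x N matrix multiply: A = B * C (row-major, N*N doubles each).
   The inner loop does N multiplies and N adds, so total work is 2*N^3 flops. */
void matmul(int N, const double *B, const double *C, double *A)
{
    for (int k = 0; k < N; k++) {          /* row of A (and of B)        */
        for (int m = 0; m < N; m++) {      /* column of A (and of C)     */
            double sum = 0.0;
            for (int j = 0; j < N; j++)    /* dot product: N mul + N add */
                sum += B[k * N + j] * C[j * N + m];
            A[k * N + m] = sum;
        }
    }
}
```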

10
The matrix elements can be matrices!
  • Matrix elements can themselves be matrices.
  • E.g., B31 · C12 would itself be a matrix multiply
  • We can think about matrix blocks of size q × q
  • Total computation is (number of blocks) × 2q³
    (sketched below)
  • Let's formulate two parallel algorithms for MM
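A sketch of the blocked version, assuming N is a multiple of the block size q (hypothetical function name): every q × q block of A is accumulated as a sum of q × q block products, each costing 2q³ flops.

```c
/* Blocked N x N multiply, A = B * C, with q x q blocks (assumes N % q == 0).
   Each (K,M) block of A is the sum over J of the block product B(K,J)*C(J,M),
   which is itself a small q x q matrix multiply costing 2*q^3 flops.        */
void matmul_blocked(int N, int q, const double *B, const double *C, double *A)
{
    int nb = N / q;                                /* blocks per dimension */
    for (int K = 0; K < nb; K++)
        for (int M = 0; M < nb; M++)
            for (int J = 0; J < nb; J++)
                for (int i = 0; i < q; i++)        /* q x q block multiply */
                    for (int j = 0; j < q; j++) {
                        double sum = (J == 0) ? 0.0 : A[(K*q+i)*N + (M*q+j)];
                        for (int t = 0; t < q; t++)
                            sum += B[(K*q+i)*N + (J*q+t)] * C[(J*q+t)*N + (M*q+j)];
                        A[(K*q+i)*N + (M*q+j)] = sum;
                    }
}
```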

11
First question to ask
  • What can be computed independently?
  • Does the computation of Akm depend at all on the
    computation of Apq?
  • Another way of asking, does the order in which we
    compute the elements of A matter?
  • Not for MM
  • In general, calculations will depend on previous
    results; this might limit the amount of
    parallelism

12
Simple Parallelism
Divide the computation of A into blocks and assign each
block to a different processor (16 in this case). The
blocks of A can then be computed in parallel. In theory,
this should be 16X faster than on one processor.
Computation of A mapped onto a 4x4 Processor grid
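One possible picture of that assignment (a hypothetical row-major mapping of blocks to processor ranks; the slide does not specify one):

```c
/* Block (K, M) of A on a PROWS x PCOLS processor grid is owned by one rank.
   Hypothetical mapping for the 4x4 grid above: row-major rank order. */
#define PROWS 4
#define PCOLS 4

int block_owner(int K, int M)      /* K, M = block row/column, 0..3 */
{
    return K * PCOLS + M;          /* ranks 0..15 */
}
```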
13
Next Questions to Ask
  • How will the three matrices be stored in a
    parallel program? Two choices:
  • Every node gets a complete copy of all the
    matrices (could run out of memory)
  • Distribute the storage to different computers so
    that each node holds only some part of each
    matrix
  • Can compute on much larger matrices!

14
Every Processor Gets a Copy of the Matrices
  • This is a good paradigm for a shared-memory
    machine
  • Every processor can share its copy of the
    matrix
  • For distributed-memory machines, we need p × p times
    more total memory on a p × p processor grid
  • May not be practical for extremely large matrices:
    one 1000x1000 DP matrix is 8 MB
  • If we ignore the cost of the initial distribution of
    the matrices across the multiple memories, then
    parallel MM runs p × p times faster than on a
    single CPU
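As a concrete check of that memory cost for the 1000x1000 double-precision example:

```latex
1000 \times 1000 \times 8\ \mathrm{bytes} = 8\ \mathrm{MB}
\quad\Rightarrow\quad
\text{full replication of } A, B, C \text{ on a } p \times p \text{ grid needs }
3 \cdot p^{2} \cdot 8\ \mathrm{MB}.
```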

15
Case Two
  • Matrices A, B, and C are distributed so that each
    processor has only some part of the matrix.
  • We call this a parallel data decomposition
  • Examples include Block, Cyclic and Strip
  • Why would you do this?
  • If you only need to multiply two matrices, then
    today's machines can locally store fairly large
    matrices, but...
  • Codes often contain 10s to 100s of similarly
    sized matrices
  • Local memory starts to become a relatively scarce
    commodity

16
Let's Pick a Row Decomposition
  • Basic idea:
  • Assume P processors and N = qP
  • Assign q rows of each matrix to each processor

[Figure: the 4x4 block matrices A = B × C divided by rows
of blocks; Proc 0 holds block row 1 of A, B, and C, Proc 1
holds block row 2, Proc 2 holds block row 3, and Proc 3
holds block row 4.]
Each block Akm is q × q
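A minimal sketch of this decomposition (hypothetical names; assumes N = qP): each process allocates three q × N strips and owns q consecutive global rows of A, B, and C.

```c
#include <stdlib.h>

/* Row-strip storage for one process in a P-process run (assumes N == q*P).
   Process z owns global rows z*q .. z*q + q - 1 of A, B, and C.          */
typedef struct {
    int N, q, z;          /* matrix order, rows per process, my rank */
    double *A, *B, *C;    /* three local q x N strips, row-major     */
} strips_t;

int strips_alloc(strips_t *s, int N, int P, int z)
{
    s->N = N;  s->q = N / P;  s->z = z;
    s->A = calloc((size_t)s->q * N, sizeof(double));
    s->B = calloc((size_t)s->q * N, sizeof(double));
    s->C = calloc((size_t)s->q * N, sizeof(double));
    return (s->A && s->B && s->C) ? 0 : -1;
}

/* Global row g of any matrix lives on process g / q, at local row g % q. */
int row_owner(const strips_t *s, int g) { return g / s->q; }
```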
17
What do we need to compute a block?
  • Consider a block Akm
  • Need the kth row of B, and the mth column of C
  • Where are these located?
  • All of the kth row of B is on the same processor
    as the kth row of A
  • Why? Our chosen decomposition of data
  • The mth column of C is distributed among all
    processors
  • Why? Our chosen decomposition.

18
Let's assume some memory constraints
  • Each process has just enough memory to hold 3 q × N
    matrices
  • enough to get all the data in place to easily compute
    a row/column dot product
  • Can do this on each processor one q × q block at a
    time
  • How much computation is needed to compute a
    single q × q entry in a row?
  • 2q³ Flops
  • How much data needs to be transferred?
  • 3q² floats: we need q × q blocks of data from 3
    neighbors (the column of C)

19
Basic Algorithm - SPMD
  • Assume my processor id is z (z = 1, …, p)
  • p is the number of processors
  • Blocks are q × q
  • for (j = 1; j <= p; j++)
  •   for (k = 1; k <= p, k != z; k++)
  •     send Czj to processor k
  •   for (k = 1; k <= p, k != z; k++)
  •     receive Ckj from processor k
  •   compute Azj locally
  •   barrier()
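The slides leave the message-passing layer abstract; as one possible realization, here is a hedged MPI sketch (assumed library and 0-based indices, with MPI_Allgather standing in for the explicit send/receive loops) of the same per-block-column exchange and local compute:

```c
#include <stddef.h>
#include <mpi.h>

/* One process's share: q x N strips of A, B, C (row-major), P processes,
   N = q*P.  For each block column j, gather the q x q blocks C(k,j) for
   k = 0..P-1, then compute A(z,j) = sum_k B(z,k) * C(k,j) locally.
   This is a sketch, not the slide's exact send/receive formulation.      */
void mm_row_strips(int N, int q, int P, int z,
                   const double *B, const double *C, double *A,
                   double *Cj /* scratch: P*q*q doubles */)
{
    double *myCj = Cj + (size_t)z * q * q;   /* slot z holds my own block */

    for (int j = 0; j < P; j++) {
        /* Pack my local block C(z, j) contiguously. */
        for (int i = 0; i < q; i++)
            for (int t = 0; t < q; t++)
                myCj[i * q + t] = C[i * N + j * q + t];

        /* Exchange: afterwards Cj holds C(0,j), C(1,j), ..., C(P-1,j). */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      Cj, q * q, MPI_DOUBLE, MPI_COMM_WORLD);

        /* A(z,j) = sum over k of B(z,k) * C(k,j). */
        for (int i = 0; i < q; i++)
            for (int t = 0; t < q; t++) {
                double sum = 0.0;
                for (int k = 0; k < P; k++)
                    for (int s = 0; s < q; s++)
                        sum += B[i * N + k * q + s]
                             * Cj[(size_t)k * q * q + s * q + t];
                A[i * N + j * q + t] = sum;
            }
    }
}
```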

20
Let's compute some work
  • For each iteration, each process does the
    following:
  • Computes 2q³ flops
  • Sends/receives 3q² floats
  • We've added the transmission of data to the basic
    algorithm
  • For what size q does the time required for data
    transmission balance the time for computation?

21
Some Calculations
  • Using just a granularity measure, how do we choose
    a minimum q (for 50/50)?
  • On Myrinet, we need to perform 56 flops for every
    float transferred
  • Need to perform 2q³ flops/iteration. Assume a
    flop costs 1 unit.
  • Each float transferred costs 56 units
  • So when is
  • flop cost > transfer cost?
  • 2q³ > 3 × 56 × q²  ⇒  q ≈ 80 (see the check below)
  • For an 80 × 80 block, at 733 MFLOPS this takes
    about 1.4 ms
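A quick check of the threshold and of the 1.4 ms figure (the exact crossover is q = 84, close to the q ≈ 80 used above):

```latex
2q^{3} \ge 3 \cdot 56 \cdot q^{2}
\;\Longrightarrow\;
q \ge \tfrac{3 \cdot 56}{2} = 84,
\qquad
t_{\mathrm{comp}} = \frac{2 \cdot 80^{3}\ \mathrm{flops}}
                         {733 \times 10^{6}\ \mathrm{flops/s}}
\approx 1.4\ \mathrm{ms}.
```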

22
What does 50/50 really mean?
  • On two processors, it takes the same amount of
    time as on one
  • On four, with scaling, it goes twice as fast
  • 50% efficiency
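In other words, using the usual definitions of speedup and parallel efficiency (a sketch, not from the slides):

```latex
S(p) = \frac{T_1}{T_p} = \frac{p}{2}, \qquad
E(p) = \frac{S(p)}{p} = \frac{1}{2} = 50\%
\quad \text{when half of each processor's time goes to messaging.}
```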

23
Final Thoughts
  • This decomposition of data is not close to
    optimal
  • We could get more parallelism by running on a p × p
    processor grid and having each processor do just
    one block multiply
  • MM is highly parallelizable and better algorithms
    get good efficiencies even on workstations.