Title: 18.337 Parallel Computing's Challenges
1 18.337 Parallel Computing's Challenges
2 Old Homework (emphasized for effect)
- Download a parallel program from somewhere.
- Make it work
- Download another parallel program
- Now, make them work together!
3 SIMD
- SIMD (Single Instruction, Multiple Data) refers to parallel hardware that can execute the same instruction on multiple data. (Think of the addition of two vectors: one add instruction applies to every element of the vector. A small sketch follows this list.)
- The term was coined with one element per processor in mind, but with today's deep memories and hefty processors, large chunks of the vectors would be added on one processor.
- The term was coined with a broadcast of an instruction in mind, hence the "single instruction," but today's machines are usually more flexible.
- The term was coined with A+B and elementwise A×B in mind, so nobody really knows for sure whether matmul or fft is SIMD or not, but these operations can certainly be built from SIMD operations.
- Today, it is not unusual to refer to a SIMD operation (sometimes, but not always, historically synonymous with Data Parallel Operations, though this feels wrong to me) when the software appears to run in lock-step, with every processor executing the same instruction.
- Usage: "I hear that machine is particularly fast when the program primarily consists of SIMD operations."
- Graphics processors such as NVIDIA's seem to run fastest on SIMD-type operations, but current research (and old research too) pushes the limits of SIMD.
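A minimal sketch of the vector-addition example above, using NumPy vectorization as a stand-in for SIMD-style execution (the variable names are illustrative): the single vectorized add applies one operation to every element, while the explicit loop spells out the same work one element at a time.

    import numpy as np

    n = 1_000_000
    x = np.random.rand(n)
    y = np.random.rand(n)

    # One add applied to every element: NumPy dispatches this to vectorized
    # (SIMD-style) machine code rather than a Python-level loop.
    z = x + y

    # The explicit loop below does the same arithmetic one element at a time.
    z_loop = np.empty(n)
    for i in range(n):
        z_loop[i] = x[i] + y[i]

    print(np.allclose(z, z_loop))   # True; the vectorized version is far faster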
4 Natural Question (may not be the most important)
- How do I parallelize x?
- First question many students ask
- The answer is often either
- Fairly obvious, or
- Very difficult
- Can miss the true issues of high performance
- These days people are often good at exploiting locality for performance
- People are not very good at hiding communication and anticipating data movement to avoid bottlenecks
- People are not very good at interweaving multiple functions to make the best use of resources
- Usually misses the issue of interoperability
- Will my program play nicely with your program?
- Will my program really run on your machine?
5 Class Notation
- Vectors: small roman letters x, y, ...
- Vectors have length n if possible
- Matrices: large roman (sometimes Greek) letters A, B, X, ?, S
- Matrices are n x n, or maybe m x n, but almost never n x m. Could be p x q.
- Scalars: may be small Greek letters or small roman letters; may not be as consistent
6 Algorithm Example: FFTs
- For now think of an FFT as a black box
- y = FFT(x) takes as input and output a vector of length n, defined (but not computed) as a matrix times a vector, y = F_n x, where (F_n)_jk = e^(-2*pi*i*j*k/n) for j, k = 0, ..., n-1.
- Important Use Cases
- Column fft: fft(X), fft(X,[],1) (MATLAB)
- Row fft: fft(X,[],2) (MATLAB)
- 2d fft (do a row and a column): fft2(X)
- fft2(X) = row_fft(col_fft(X)) = col_fft(row_fft(X)) (verified in the sketch below)
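A small numerical check of the definitions above, using NumPy's FFT in place of MATLAB's (np.fft.fft(X, axis=0) plays the role of fft(X,[],1), and axis=1 the role of fft(X,[],2); the helper names col_fft and row_fft are just for this sketch).

    import numpy as np

    n = 8
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    F = np.exp(-2j * np.pi * j * k / n)        # (F_n)_jk = e^(-2*pi*i*j*k/n)

    x = np.random.rand(n)
    print(np.allclose(F @ x, np.fft.fft(x)))   # True: FFT(x) is F_n times x

    X = np.random.rand(n, n)
    col_fft = lambda A: np.fft.fft(A, axis=0)  # MATLAB's fft(X,[],1)
    row_fft = lambda A: np.fft.fft(A, axis=1)  # MATLAB's fft(X,[],2)
    print(np.allclose(np.fft.fft2(X), row_fft(col_fft(X))))   # True
    print(np.allclose(np.fft.fft2(X), col_fft(row_fft(X))))   # True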
7 How to implement a column FFT?
- Put block columns on each processor
- Do local column FFTs (as sketched below)
- Local column FFTs may be done a column at a time or pipelined. In the case of the FFT there is probably a fast local package available, but that may not be true for other operations. Also, as MIT students have been known to do, you might try to beat the packages.
[Figure: matrix stored as block columns on processors P0, P1, P2]
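A serial sketch of the layout above: the "processors" are just a Python list of block columns, so the point is the data decomposition, not actual parallel speed. Stitching the local results back together matches the global column FFT.

    import numpy as np

    n, p = 12, 3                                  # n-by-n matrix, p "processors"
    X = np.random.rand(n, n)

    # Distribute block columns: blocks[i] lives on hypothetical processor Pi.
    blocks = np.array_split(X, p, axis=1)

    # Each processor performs a local column FFT on its own block;
    # no communication is needed for this step.
    local = [np.fft.fft(B, axis=0) for B in blocks]

    # Reassembling the local results gives the global column FFT.
    print(np.allclose(np.hstack(local), np.fft.fft(X, axis=0)))   # True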
8 A closer look at column fft
- Put block columns on each processor
- Where were the columns? Where are they going?
- Moving the columns can be very expensive in terms of performance. Can we hide the cost somewhere?
[Figure: matrix stored as block columns on processors P0, P1, P2]
9 What about row fft?
- Suppose block columns on each processor
- Many would transpose, then apply a column FFT, and transpose back (see the sketch below)
- This thinking is simple and do-able
- Not only simple, but it encourages the paradigm of
- 1) do whatever, 2) get good parallelism, and 3) do whatever
- Harder to decide whether to do the rows in parallel or to interweave transposing of pieces with starting the computation
- There may be more performance available, but nobody to my knowledge has done a good job of this yet. You could be the first.
[Figure: matrix stored as block columns on processors P0, P1, P2]
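A sketch of the transpose / column-FFT / transpose-back recipe, again simulated serially with NumPy; on a real machine each transpose is an all-to-all communication step, which is exactly the cost the slide is worried about.

    import numpy as np

    n, p = 12, 3
    X = np.random.rand(n, n)
    blocks = np.array_split(X, p, axis=1)        # block columns on P0, P1, P2

    # Transpose (an all-to-all exchange on a real machine), do local column
    # FFTs on the transposed blocks, then transpose back.
    XT = np.hstack(blocks).T                     # global transpose, serial here
    t_blocks = np.array_split(XT, p, axis=1)
    t_local = [np.fft.fft(B, axis=0) for B in t_blocks]
    Y = np.hstack(t_local).T                     # transpose back

    print(np.allclose(Y, np.fft.fft(X, axis=1))) # True: equals the row FFT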
10 Not load balanced column fft?
- Suppose block columns on each processor
- To load balance or not to load balance, that is the question
- Traditional wisdom says this is badly load balanced and parallelism is lost, but there is a cost to moving the data, which may or may not be worth the gain in load balancing (a toy cost model follows this list)
[Figure: matrix stored as block columns on processors P0, P1, P2]
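A toy back-of-the-envelope model of the trade-off above. Every number below (column counts, FFT cost, move cost) is made up purely for illustration, so only the shape of the comparison matters, not the values.

    # All costs and column counts below are made-up illustrative numbers.
    cols_per_proc = [8, 8, 2]        # unbalanced block columns on P0, P1, P2
    t_fft_per_col = 1.0              # cost of one local column FFT (arbitrary units)
    t_move_per_col = 0.4             # cost of shipping one column between processors

    # Option 1: leave the data where it is; finish when the busiest processor does.
    unbalanced = max(c * t_fft_per_col for c in cols_per_proc)

    # Option 2 (crude model): pay to move the excess columns, then compute in balance.
    even = sum(cols_per_proc) / len(cols_per_proc)
    moved = sum(max(c - even, 0) for c in cols_per_proc)
    balanced = even * t_fft_per_col + moved * t_move_per_col

    print(f"stay unbalanced: {unbalanced:.1f}   rebalance first: {balanced:.1f}")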
11 2d fft
- Suppose block columns on each processor
- Can do columns, transpose, rows, transpose
- Can do transpose, rows, transpose, columns
- Can we be fancier? (The two orderings above are checked in the sketch below.)
[Figure: matrix stored as block columns on processors P0, P1, P2]
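Both orderings from the slide, checked against fft2 with NumPy in a serial sketch; the transposes stand in for the communication steps a distributed version would need.

    import numpy as np

    n = 12
    X = np.random.rand(n, n)
    col_fft = lambda A: np.fft.fft(A, axis=0)

    # Columns, transpose, columns again (i.e. the rows), transpose back.
    Y1 = col_fft(col_fft(X).T).T
    # Transpose, "rows" (a column FFT after the transpose), transpose back, columns.
    Y2 = col_fft(col_fft(X.T).T)

    print(np.allclose(Y1, np.fft.fft2(X)))       # True
    print(np.allclose(Y2, np.fft.fft2(X)))       # True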
12 So much has to do with access to memory and data movement
- The conventional wisdom is that it's all about locality. This remains partially true and partially not quite as true as it used to be.
13 http://www.cs.berkeley.edu/samw/research/talks/sc07.pdf
14 A peek inside an FFT (more later in the semester)
[Figure annotation: "Time wasted on the telephone"]
15 Tracing back the data dependency
16 New term for the day: MIMD
- MIMD (Multiple Instruction stream, Multiple Data stream) refers to most current parallel hardware, where each processor can independently execute its own instructions. The importance of MIMD over SIMD emerged in the early 1990s, as commodity processors became the basis of much parallel computing.
- One may also refer to a MIMD operation in an implementation, if one wishes to emphasize non-homogeneous execution. (Often contrasted with SIMD; a small sketch follows.)
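A tiny sketch of MIMD-style (non-homogeneous) execution using Python's multiprocessing; the worker names and problem sizes are illustrative. The two processes run different instruction streams on different data, in contrast with the lock-step SIMD picture earlier.

    from multiprocessing import Process
    import numpy as np

    def worker_fft(n):
        # One instruction stream: transform a long random signal.
        np.fft.fft(np.random.rand(n))

    def worker_qr(n):
        # A different instruction stream: QR-factor a random matrix.
        np.linalg.qr(np.random.rand(n, n))

    if __name__ == "__main__":
        procs = [Process(target=worker_fft, args=(1 << 20,)),
                 Process(target=worker_qr, args=(500,))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print("two independent instruction streams ran concurrently")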
17 Importance of Abstractions
- Ease of use requires that the very notion of a processor really should be buried underneath the user
- Some think that the very requirements of performance demand the opposite
- I am fairly sure the above bullet is more false than true; you can be the ones to figure this all out!