Title: 18.337 Parallel Computing's Challenges
1 18.337 Parallel Computing's Challenges
2 Old Homework (emphasized for effect)
- Download a parallel program from somewhere.
- Make it work
- Download another parallel program
- Now, make them work together!
3 SIMD
- SIMD (Single Instruction, Multiple Data) refers to parallel hardware that can execute the same instruction on multiple data. (Think of the addition of two vectors: one add instruction applies to every element of the vector. A small sketch follows this list.)
- The term was coined with one element per processor in mind, but with today's deep memories and hefty processors, large chunks of the vectors would be added on one processor.
- The term was coined with a broadcast of an instruction in mind, hence the "single instruction," but today's machines are usually more flexible.
- The term was coined with A+B and elementwise A×B in mind, so nobody really knows for sure whether matmul or fft is SIMD or not, but these operations can certainly be built from SIMD operations.
- Today, it is not unusual to refer to a SIMD operation (sometimes, but not always, historically synonymous with Data Parallel Operations, though this feels wrong to me) when the software appears to run in lock-step, with every processor executing the same instruction.
- Usage: "I hear that machine is particularly fast when the program primarily consists of SIMD operations."
- Graphics processors such as NVIDIA's seem to run fastest on SIMD-type operations, but current research (and old research too) pushes the limits of SIMD.
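A minimal sketch of the vector-addition example above, using NumPy vectorization as a stand-in for SIMD-style execution (the variable names are illustrative): the single vectorized add applies one operation to every element, while the explicit loop spells out the same work one element at a time.

    import numpy as np

    n = 1_000_000
    x = np.random.rand(n)
    y = np.random.rand(n)

    # One add applied to every element: NumPy dispatches this to vectorized
    # (SIMD-style) machine code rather than a Python-level loop.
    z = x + y

    # The explicit loop below does the same arithmetic one element at a time.
    z_loop = np.empty(n)
    for i in range(n):
        z_loop[i] = x[i] + y[i]

    print(np.allclose(z, z_loop))   # True; the vectorized version is far faster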
4 Natural Question (may not be the most important)
- How do I parallelize x?
- First question many students ask
- The answer is often either
- Fairly obvious, or
- Very difficult
- Can miss the true issues of high performance
- These days people are often good at exploiting locality for performance
- People are not very good at hiding communication and anticipating data movement to avoid bottlenecks
- People are not very good at interweaving multiple functions to make the best use of resources
- Usually misses the issue of interoperability
- Will my program play nicely with your program?
- Will my program really run on your machine?
5 Class Notation
- Vectors: small roman letters x, y, ...
- Vectors have length n if possible
- Matrices: large roman (sometimes Greek) letters A, B, X, ?, S
- Matrices are n x n, or maybe m x n, but almost never n x m. Could be p x q.
- Scalars: may be small Greek letters or small roman letters; may not be as consistent
6 Algorithm Example: FFTs
- For now think of an FFT as a black box
- y = FFT(x) takes as input and output a vector of length n, defined (but not computed) as a matrix times a vector, y = F_n x, where (F_n)_jk = e^(-2*pi*i*j*k/n) for j, k = 0, ..., n-1.
- Important Use Cases
- Column fft: fft(X), fft(X,[],1) (MATLAB)
- Row fft: fft(X,[],2) (MATLAB)
- 2d fft (do a row and a column): fft2(X)
- fft2(X) = row_fft(col_fft(X)) = col_fft(row_fft(X)) (verified in the sketch below)
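A small numerical check of the definitions above, using NumPy's FFT in place of MATLAB's (np.fft.fft(X, axis=0) plays the role of fft(X,[],1), and axis=1 the role of fft(X,[],2); the helper names col_fft and row_fft are just for this sketch).

    import numpy as np

    n = 8
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    F = np.exp(-2j * np.pi * j * k / n)        # (F_n)_jk = e^(-2*pi*i*j*k/n)

    x = np.random.rand(n)
    print(np.allclose(F @ x, np.fft.fft(x)))   # True: FFT(x) is F_n times x

    X = np.random.rand(n, n)
    col_fft = lambda A: np.fft.fft(A, axis=0)  # MATLAB's fft(X,[],1)
    row_fft = lambda A: np.fft.fft(A, axis=1)  # MATLAB's fft(X,[],2)
    print(np.allclose(np.fft.fft2(X), row_fft(col_fft(X))))   # True
    print(np.allclose(np.fft.fft2(X), col_fft(row_fft(X))))   # True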
7 How to implement a column FFT?
- Put block columns on each processor
- Do local column FFTs (as sketched below)
- Local column FFTs may be done a column at a time or pipelined. In the case of the FFT there is probably a fast local package available, but that may not be true for other operations. Also, as MIT students have been known to do, you might try to beat the packages.
[Figure: matrix stored as block columns on processors P0, P1, P2]
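A serial sketch of the layout above: the "processors" are just a Python list of block columns, so the point is the data decomposition, not actual parallel speed. Stitching the local results back together matches the global column FFT.

    import numpy as np

    n, p = 12, 3                                  # n-by-n matrix, p "processors"
    X = np.random.rand(n, n)

    # Distribute block columns: blocks[i] lives on hypothetical processor Pi.
    blocks = np.array_split(X, p, axis=1)

    # Each processor performs a local column FFT on its own block;
    # no communication is needed for this step.
    local = [np.fft.fft(B, axis=0) for B in blocks]

    # Reassembling the local results gives the global column FFT.
    print(np.allclose(np.hstack(local), np.fft.fft(X, axis=0)))   # True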
8 A closer look at column fft
- Put block columns on each processor
- Where were the columns? Where are they going?
- Moving the columns can be very expensive in terms of performance. Can we hide the cost somewhere?
[Figure: matrix stored as block columns on processors P0, P1, P2]
9 What about row fft?
- Suppose block columns on each processor
- Many would transpose, then apply a column FFT, and transpose back (see the sketch below)
- This thinking is simple and do-able
- Not only simple, but it encourages the paradigm of
- 1) do whatever, 2) get good parallelism, and 3) do whatever
- Harder to decide whether to do the rows in parallel or to interweave transposing of pieces with starting the computation
- There may be more performance available, but nobody to my knowledge has done a good job of this yet. You could be the first.
[Figure: matrix stored as block columns on processors P0, P1, P2]
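A sketch of the transpose / column-FFT / transpose-back recipe, again simulated serially with NumPy; on a real machine each transpose is an all-to-all communication step, which is exactly the cost the slide is worried about.

    import numpy as np

    n, p = 12, 3
    X = np.random.rand(n, n)
    blocks = np.array_split(X, p, axis=1)        # block columns on P0, P1, P2

    # Transpose (an all-to-all exchange on a real machine), do local column
    # FFTs on the transposed blocks, then transpose back.
    XT = np.hstack(blocks).T                     # global transpose, serial here
    t_blocks = np.array_split(XT, p, axis=1)
    t_local = [np.fft.fft(B, axis=0) for B in t_blocks]
    Y = np.hstack(t_local).T                     # transpose back

    print(np.allclose(Y, np.fft.fft(X, axis=1))) # True: equals the row FFT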
10 Not load balanced column fft?
- Suppose block columns on each processor
- To load balance or not to load balance, that is the question
- Traditional wisdom says this is badly load balanced and parallelism is lost, but there is a cost to moving the data, which may or may not be worth the gain in load balancing (a toy cost model follows this list)
[Figure: matrix stored as block columns on processors P0, P1, P2]
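A toy back-of-the-envelope model of the trade-off above. Every number below (column counts, FFT cost, move cost) is made up purely for illustration, so only the shape of the comparison matters, not the values.

    # All costs and column counts below are made-up illustrative numbers.
    cols_per_proc = [8, 8, 2]        # unbalanced block columns on P0, P1, P2
    t_fft_per_col = 1.0              # cost of one local column FFT (arbitrary units)
    t_move_per_col = 0.4             # cost of shipping one column between processors

    # Option 1: leave the data where it is; finish when the busiest processor does.
    unbalanced = max(c * t_fft_per_col for c in cols_per_proc)

    # Option 2 (crude model): pay to move the excess columns, then compute in balance.
    even = sum(cols_per_proc) / len(cols_per_proc)
    moved = sum(max(c - even, 0) for c in cols_per_proc)
    balanced = even * t_fft_per_col + moved * t_move_per_col

    print(f"stay unbalanced: {unbalanced:.1f}   rebalance first: {balanced:.1f}")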
11 2d fft
- Suppose block columns on each processor
- Can do columns, transpose, rows, transpose
- Can do transpose, rows, transpose, columns
- Can we be fancier? (The two orderings above are checked in the sketch below.)
[Figure: matrix stored as block columns on processors P0, P1, P2]
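Both orderings from the slide, checked against fft2 with NumPy in a serial sketch; the transposes stand in for the communication steps a distributed version would need.

    import numpy as np

    n = 12
    X = np.random.rand(n, n)
    col_fft = lambda A: np.fft.fft(A, axis=0)

    # Columns, transpose, columns again (i.e. the rows), transpose back.
    Y1 = col_fft(col_fft(X).T).T
    # Transpose, "rows" (a column FFT after the transpose), transpose back, columns.
    Y2 = col_fft(col_fft(X.T).T)

    print(np.allclose(Y1, np.fft.fft2(X)))       # True
    print(np.allclose(Y2, np.fft.fft2(X)))       # True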
12 So much has to do with access to memory and data movement
- The conventional wisdom is that it's all about locality. This remains partially true and partially not quite as true as it used to be.
13 http://www.cs.berkeley.edu/samw/research/talks/sc07.pdf
14 A peek inside an FFT (more later in the semester)
[Figure annotation: "Time wasted on the telephone"]
15 Tracing back the data dependency
16 New term for the day: MIMD
- MIMD (Multiple Instruction stream, Multiple Data stream) refers to most current parallel hardware, where each processor can independently execute its own instructions. The importance of MIMD over SIMD emerged in the early 1990s, as commodity processors became the basis of much parallel computing.
- One may also refer to a MIMD operation in an implementation, if one wishes to emphasize non-homogeneous execution. (Often contrasted with SIMD; a small sketch follows.)
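A tiny sketch of MIMD-style (non-homogeneous) execution using Python's multiprocessing; the worker names and problem sizes are illustrative. The two processes run different instruction streams on different data, in contrast with the lock-step SIMD picture earlier.

    from multiprocessing import Process
    import numpy as np

    def worker_fft(n):
        # One instruction stream: transform a long random signal.
        np.fft.fft(np.random.rand(n))

    def worker_qr(n):
        # A different instruction stream: QR-factor a random matrix.
        np.linalg.qr(np.random.rand(n, n))

    if __name__ == "__main__":
        procs = [Process(target=worker_fft, args=(1 << 20,)),
                 Process(target=worker_qr, args=(500,))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print("two independent instruction streams ran concurrently")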
17 Importance of Abstractions
- Ease of use requires that the very notion of a processor really should be buried underneath the user
- Some think that the very requirements of performance demand the opposite
- I am fairly sure the above bullet is more false than true; you can be the ones to figure this all out!