1
18.337 Parallel Computing's Challenges
2
Old Homework (emphasized for effect)
  • Download a parallel program from somewhere.
  • Make it work.
  • Download another parallel program.
  • Now, make them work together!

3
SIMD
  • SIMD (Single Instruction, Multiple Data) refers
    to parallel hardware that can execute the same
    instruction on multiple data. (Think of the
    addition of two vectors: one add instruction
    applies to every element of the vector.)
  • The term was coined with one element per processor
    in mind, but with today's deep memories and hefty
    processors, large chunks of the vectors would be
    added on one processor.
  • The term was coined with a broadcasting of an
    instruction in mind, hence the "single
    instruction," but today's machines are usually
    more flexible.
  • The term was coined with A+B and elementwise A×B in
    mind, so nobody really knows for sure whether
    matmul or fft is SIMD or not, but these
    operations can certainly be built from SIMD
    operations.
  • Today, it is not unusual to refer to a SIMD
    operation (sometimes, but not always, historically
    synonymous with data parallel operations, though
    this feels wrong to me) when the software appears
    to run in lock-step with every processor executing
    the same instruction.
  • Usage: "I hear that machine is particularly fast
    when the program primarily consists of SIMD
    operations."
  • Graphics processors such as NVIDIA's seem to run
    fastest on SIMD-type operations, but current
    research (and old research too) pushes the limits
    of SIMD.
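
A rough illustration (my own sketch, not from the slides): a vectorized
MATLAB expression is the kind of thing usually called a SIMD operation,
one conceptual add applied to every element of the vectors.

    n = 8;
    x = 1:n;          % a length-n vector
    y = n:-1:1;       % another length-n vector
    z = x + y;        % one add applied elementwise to all n elements at once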

4
The natural question may not be the most important one
  • How do I parallelize x?
  • This is the first question many students ask
  • The answer is often either fairly obvious or very
    difficult
  • Either way, it can miss the true issues of high
    performance
  • These days people are often good at exploiting
    locality for performance
  • People are not very good at hiding communication
    and anticipating data movement to avoid
    bottlenecks
  • People are not very good at interweaving multiple
    functions to make the best use of resources
  • It also usually misses the issue of
    interoperability
  • Will my program play nicely with your program?
  • Will my program really run on your machine?

5
Class Notation
  • Vectors: small roman letters x, y, ...
  • Vectors have length n if possible
  • Matrices: large roman (sometimes Greek) letters
    such as A, B, X, S
  • Matrices are n x n, or maybe m x n, but almost
    never n x m. They could be p x q.
  • Scalars: may be small Greek letters or small roman
    letters; this may not be as consistent

6
Algorithm Example: FFTs
  • For now, think of an FFT as a black box
  • y = FFT(x) takes as input and output a vector of
    length n, defined (but not computed) as a matrix
    times vector, y = F_n x, where (F_n)_jk =
    e^(-2πijk/n) for j,k = 0,...,n-1 (checked in the
    sketch below).
  • Important use cases
  • Column fft: fft(X) or fft(X,[],1)  (MATLAB)
  • Row fft: fft(X,[],2)  (MATLAB)
  • 2d fft (do a row and a column fft): fft2(X)
  • fft2(X) = row_fft(col_fft(X)) = col_fft(row_fft(X))
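
A minimal MATLAB sketch (my addition, not from the slides) checking the
matrix definition above against the built-in fft:

    n = 8;  x = rand(n,1);
    [j,k] = ndgrid(0:n-1, 0:n-1);   % row index j, column index k
    Fn = exp(-2*pi*1i*j.*k/n);      % (F_n)_jk = e^(-2*pi*i*j*k/n)
    norm(Fn*x - fft(x))             % essentially zero: the definition matches fft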

7
How to implement a column FFT?
  • Put block columns on each processor
  • Do local column FFTs

Local column FFTs may be done a column at a time or
pipelined. In the case of the FFT there is probably a
fast local package available, but that may not be true
for other operations. Also, as MIT students have been
known to do, you might try to beat the packages.
(Figure: block columns distributed across processors P0, P1, P2)
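
A serial MATLAB sketch of the idea (my addition; the three "processors"
and the block sizes are hypothetical, and no real parallelism is shown):

    n = 12;  X = rand(n);
    blocks = {1:4, 5:8, 9:12};                 % block columns owned by P0, P1, P2
    Y = zeros(n);
    for p = 1:3
        Y(:, blocks{p}) = fft(X(:, blocks{p}), [], 1);   % purely local column FFTs
    end
    norm(Y - fft(X, [], 1))                    % same answer as the global column fft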
8
A closer look at column fft
  • Put block columns on each processor
  • Where were the columns? Where are they going?
  • The cost of this data movement can be very high.
    Can we hide it somewhere?

(Figure: block columns distributed across processors P0, P1, P2)
9
What about row fft
  • Suppose block columns on each processor
  • A common approach: transpose, apply the column FFT,
    and transpose back (sketched below)
  • This thinking is simple and do-able
  • Not only is it simple, it encourages the paradigm of
    1) do whatever, 2) get good parallelism, and 3) do
    whatever
  • It is harder to decide whether to do the rows in
    parallel or to interweave the transposing of pieces
    with the start of the computation
  • There may be more performance available, but nobody
    to my knowledge has done a good job of this yet. You
    could be the first.

(Figure: block columns distributed across processors P0, P1, P2)
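
A MATLAB sketch of the transpose-based row fft (my addition; shapes and
correctness only, no actual data distribution):

    X = rand(8);
    Yrow = fft(X, [], 2);        % direct row fft
    Yvia = fft(X.', [], 1).';    % transpose, column fft, transpose back (.' = no conjugate)
    norm(Yrow - Yvia)            % agrees up to roundoff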
10
Not load balanced column fft?
  • Suppose block columns on each processor
  • To load balance or not to load balance, that is the
    question
  • Traditional wisdom says this is badly load balanced
    and parallelism is lost, but there is a cost to
    moving the data, which may or may not be worth the
    gain in load balancing

(Figure: block columns distributed across processors P0, P1, P2)
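
A back-of-the-envelope MATLAB sketch of that tradeoff (my addition; the
per-column fft and per-column move costs are hypothetical numbers):

    unbalanced = [12 0 0];      % all 12 columns already sit on P0
    balanced   = [4 4 4];       % after moving 8 columns to P1 and P2
    t_fft  = 1.0;               % hypothetical cost of one local column fft
    t_move = 0.5;               % hypothetical cost of moving one column
    time_unbalanced = max(unbalanced) * t_fft             % no movement, no parallelism
    time_balanced   = 8 * t_move + max(balanced) * t_fft  % pay to move, then go parallel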
11
2d fft
  • Suppose block columns on each processor
  • Can do columns, transpose, rows, transpose
  • Can do transpose, rows, transpose, columns
  • Can we be fancier?

(Figure: block columns distributed across processors P0, P1, P2)
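
A MATLAB sketch checking both orderings against fft2 (my addition):

    X = rand(8);
    norm(fft2(X) - fft(fft(X, [], 1), [], 2))   % columns, then rows
    norm(fft2(X) - fft(fft(X, [], 2), [], 1))   % rows, then columns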
12
So much has to do with access to memory and data
movement
  • The conventional wisdom is that it's all about
    locality. This remains partially true and
    partially not quite as true as it used to be.

13
http://www.cs.berkeley.edu/samw/research/talks/sc07.pdf
14
A peek inside an FFT (more later in the semester)
Time wasted on the telephone
15
Tracing back the data dependency
16
New term for the day: MIMD
  • MIMD (Multiple Instruction stream, Multiple Data
    stream) refers to most current parallel hardware,
    where each processor can independently execute
    its own instructions. The importance of MIMD
    over SIMD emerged in the early 1990s, as
    commodity processors became the basis of much
    parallel computing.
  •  One may also refer to a MIMD operation in an
    implementation, if one wishes to emphasize
    non-homogeneous execution. (Often contrasted to
    SIMD.)

17
Importance of Abstractions
  • Ease of use requires that the very notion of a
    processor really be buried underneath the user
  • Some think that the very requirements of
    performance require the opposite
  • I am fairly sure the above bullet is more false
    than true; you can be the ones to figure this
    all out!