EECE 571L: Parallel Programming - PowerPoint PPT Presentation

About This Presentation

Title:

EECE 571L: Parallel Programming

Slides: 33

Provided by: chris475

Transcript and Presenter's Notes

Title: EECE 571L: Parallel Programming

1
EECE 571L: Parallel Programming, Reconfigurable Computing

Lecture 1
2009-01-07

UBC EECE571L Prof. Guy Lemieux
2
Goal

Flynn's Taxonomy
Types of Parallelism
Limits: Dependence
Limits: Amdahl's Law

3
Classes of Parallel Architecture

Flynn's Taxonomy
SISD
SIMD
MISD
MIMD
New one?
SPMD

4
SISD

Single Instruction Single Data

Source: El-Rewini and Abd-El-Barr, Advanced
Computer Architecture and Parallel Processing
5
SIMD

Single Instruction Multiple Data

Source: El-Rewini and Abd-El-Barr, Advanced
Computer Architecture and Parallel Processing
6
MISD

Multiple Instruction Single Data
Exists in concept, not implemented

7
MIMD

Multiple Instruction Multiple Data

Source: El-Rewini and Abd-El-Barr, Advanced
Computer Architecture and Parallel Processing
8
MIMD

Can be either tightly coupled or loosely coupled
Tightly-Coupled
Share Address Space
Symmetric Multiprocessors (SMPs)
Uniform Memory Access (UMA)
Nonuniform Memory Access (NUMA)
E.g. Intel Quad Core
Loosely-Coupled
Disjoint Address Space
Distributed SISDs
Message Passing
E.g. Network Clusters

9
SPMD

Single Program, Multiple Data
A special case of MIMD
MIMD compute nodes can run completely different
programs
3D physics on node 1
graphics rendering on node 2
SPMD compute nodes run identical programs
Free-running, out-of-sync programs
At any point in time, each node may run a
different instruction

10
Parallelism Levels

Task
Thread
Data
Loop
Instruction
Bit

11
Task-level Parallelism

Function Level of a Program
Example
Given data A and B, find func1(A,B) and
func2(A,B)
Two tasks
Assume two processors available
func1(A,B) on CPU1
func2(A,B) on CPU2
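
A minimal sketch of the two-task example, assuming Python and its standard concurrent.futures module. The bodies of func1 and func2 are placeholders invented for illustration; the slide does not say what they compute. Threads are used to keep the sketch short; for CPU-bound work in CPython, a process pool would be the usual way to actually occupy two CPUs.

```python
from concurrent.futures import ThreadPoolExecutor


def func1(a, b):
    # Placeholder task body (assumption): any function of A and B.
    return a + b


def func2(a, b):
    # Placeholder task body (assumption): a second, independent function.
    return a * b


def run_tasks(a, b):
    """Run func1 and func2 as two independent tasks on the same inputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future1 = pool.submit(func1, a, b)  # task 1
        future2 = pool.submit(func2, a, b)  # task 2
        # Neither task needs the other's result, so they may overlap.
        return future1.result(), future2.result()


if __name__ == "__main__":
    print(run_tasks(3, 4))
```

The two tasks share their inputs but not their outputs, which is what makes them safe to run at the same time.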

12
Thread-level Parallelism

Similar to task-level, but finer grained
Threads can be independent, or can cooperate to
achieve a larger goal

13
Data-level Parallelism

Distribution of Data among Processors
Example
Given an array of n elements,
multiply each element by 2
Divide n by the number of processors p
Each processor performs the multiplication on n/p elements
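
The array example can be sketched as follows, assuming Python with a thread pool standing in for the p processors. The helper names split_chunks, double_chunk and double_all are hypothetical, chosen only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_chunks(data, p):
    """Split data into at most p contiguous chunks of about n/p elements."""
    if p < 1:
        raise ValueError("p must be at least 1")
    size = max(1, -(-len(data) // p))  # ceil(n / p), but never 0
    return [data[i:i + size] for i in range(0, len(data), size)]


def double_chunk(chunk):
    # The per-processor work: multiply each element of the chunk by 2.
    return [2 * x for x in chunk]


def double_all(data, p):
    """Apply the same operation to every chunk, one chunk per worker."""
    chunks = split_chunks(data, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        results = list(pool.map(double_chunk, chunks))  # keeps chunk order
    return [x for part in results for x in part]


if __name__ == "__main__":
    print(double_all(list(range(10)), 4))
```

Every worker runs the same operation on a different slice of the data, and no element depends on any other, so the chunks need no coordination beyond the final concatenation.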

14
Loop-level Parallelism

Exploit Concurrency in Loops
Possible examples
For-loop to calculate dot product of array A and
B
Is this really data parallelism?
Overlap loop iteration i with iteration i+1 by
starting the next iteration as early as possible (but
no earlier than any loop-carried dependence allows)
Is this pipeline parallelism?
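
For the dot-product loop, one possible sketch (assuming Python and a thread pool; the function names are hypothetical) splits the iteration space into p ranges. The multiplications are independent across iterations; only the running sum is loop-carried, so each worker keeps a private partial sum and the partial sums are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_dot(a, b, start, stop):
    """Sequential loop over one slice of the iteration space."""
    total = 0
    for i in range(start, stop):
        total += a[i] * b[i]  # accumulation is local to this worker
    return total


def dot_parallel(a, b, p=4):
    """Dot product with the loop iterations split across p workers."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(a)
    # Iteration ranges [lo, hi) that together cover 0..n exactly once.
    bounds = [(k * n // p, (k + 1) * n // p) for k in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        futures = [pool.submit(partial_dot, a, b, lo, hi) for lo, hi in bounds]
        # Final reduction: combine the private partial sums.
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    print(dot_parallel([1, 2, 3], [4, 5, 6], p=2))
```

Seen this way, the loop is data parallel in its multiplications plus a reduction for the sum, which is one way to answer the question the slide raises.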

15
Instruction-level Parallelism

Machine Instruction Level
Identify independent instructions within an
instruction window
Superscalar: done at run time by the CPU
VLIW: done at compile time
Dynamic optimizations by the run-time software
system are also possible (e.g., JIT)
Example
ADD R1, R2, R3
LOAD R4, R2

16
Bit-level Parallelism

Example
16-bit addition
Two instructions on an 8-bit ALU
One instruction on a 16-bit ALU
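
The 16-bit addition can be sketched in Python as two 8-bit additions chained through a carry. The add8 helper is a hypothetical model of an 8-bit ALU add-with-carry, not a real instruction set.

```python
def add8(x, y, carry_in=0):
    """Model of one 8-bit ALU add: returns (8-bit sum, carry out)."""
    total = (x & 0xFF) + (y & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def add16_via_8bit(a, b):
    """16-bit addition done as two dependent 8-bit additions."""
    low, carry = add8(a & 0xFF, b & 0xFF)                        # step 1: low bytes
    high, carry_out = add8((a >> 8) & 0xFF, (b >> 8) & 0xFF, carry)  # step 2: high bytes + carry
    return (high << 8) | low, carry_out


if __name__ == "__main__":
    result, carry = add16_via_8bit(0x12FF, 0x0001)
    print(hex(result), carry)
```

The second 8-bit add has to wait for the carry from the first, so the narrow ALU needs two steps in sequence, while a 16-bit ALU handles all 16 bit positions in one instruction.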

17
Dependence

Does the result of the current instruction depend
on the previous result?
Yes: the previous result must be computed first
No: the instructions can be computed in parallel

18
Types of Dependencies

19
RAR: no dependence

Read after Read
No dependency
Example
R2 <- R1 + 1
R3 <- R1 + 2

20
RAW: true dependence

Read after Write
Producer/consumer relationship
Example
R2 <- R1 + 1
R3 <- R2 + 2

21
WAR: false dependence

Write after Read
Also known as anti-dependence
Example
R2 <- R1 + 1
R1 <- R3 + 2
Can these be avoided?

22
Avoid WAR (false dependence)

Avoid by allocating new storage
Register renaming
Separate memory locations
Example
R2 <- R1 + 1
R1' <- R3 + 2

23
WAW: output dependence

Write after Write
What happens if you reorder the output going to a
printer?
Example
R2 <- R1 + 1
R2 <- R3 + 2

24
Avoid WAW (output dependence)

Avoid by optimizing away the earlier computation?
Avoid by allocating new storage?
Register renaming
Separate memory locations
Example
R2 <- R1 + 1
R2' <- R3 + 2
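
The dependence cases from the preceding slides can be checked mechanically. The sketch below, in Python, treats each slide example as "destination register <- source registers"; the Instr type and classify function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str              # register written
    srcs: Tuple[str, ...]  # registers read


def classify(first, second):
    """Return the set of dependences of `second` on the earlier `first`."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("RAW")  # second reads what first wrote (true dependence)
    if second.dest in first.srcs:
        kinds.add("WAR")  # second overwrites what first read (anti-dependence)
    if second.dest == first.dest:
        kinds.add("WAW")  # both write the same register (output dependence)
    return kinds          # empty set: no dependence (this includes RAR)


if __name__ == "__main__":
    war = classify(Instr("R2", ("R1",)), Instr("R1", ("R3",)))
    renamed = classify(Instr("R2", ("R1",)), Instr("R1'", ("R3",)))
    print(war, renamed)
```

Writing the second result to a fresh name (R1' or R2') makes the WAR and WAW cases come back empty, which is the effect of register renaming. A RAW dependence cannot be renamed away because the second instruction genuinely needs the first one's value.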

25
The Ultimate Speed Limit

beep, beep!

26
Amdahl's Law

Question
If you improve part of the system, how much
faster does the entire system run?

Gene Amdahl: famous computer systems architect
at IBM in the 1960s and 1970s.

Amdahl's Law gives us the speed limit!
Given enhancement E, define
Speedup(E) = PerformanceAfter(E) / PerformanceBefore(E)
           = ExecutionTimeBefore(E) / ExecutionTimeAfter(E)

27
Amdahl's Law

More detail:
Enhancement E
results in a speedup of S
to only some fraction F of the program
ExecutionTimeAfter(E) = [(1-F) + F/S] × ExecutionTimeBefore(E)
(derivation on next slide)
Usually expressed as a speedup
Speedup(E) = ExecutionTimeBefore(E) / ExecutionTimeAfter(E)
           = 1 / [(1-F) + F/S]

28
Amdahl's Law: Derivation

Before E

(1-F) portion untouched
ExecutionTimeBefore(E) = (1-F) + F = 1
F portion improved by S times, to F/S
ExecutionTimeAfter(E) = (1-F) + F/S
Therefore
Speedup(E) = Before/After = 1 / [(1-F) + F/S]
Lesson: when speeding up a computer system,
work on the part with the biggest F

29
Amdahl's Law: Speed Limits!
F is the portion of the program that can be sped up.
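
A small Python sketch of the speed limit, using the formula from the derivation, Speedup = 1 / ((1-F) + F/S). The function names are hypothetical. As S grows without bound the F/S term vanishes, so the speedup can never exceed 1 / (1-F).

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the program is sped up s times."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


def amdahl_limit(f):
    """Upper bound on speedup as s goes to infinity: 1 / (1 - f)."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if f == 1.0:
        return float("inf")  # the whole program is sped up: no bound
    return 1.0 / (1.0 - f)


if __name__ == "__main__":
    for f in (0.5, 0.9, 0.99):
        print(f, amdahl_speedup(f, 10), amdahl_speedup(f, 1000), amdahl_limit(f))
```

For example, with F = 0.5 and S = 2 the formula gives 1 / (0.5 + 0.25), about 1.33, and even an infinitely fast enhancement cannot do better than 2 overall.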
30
Amdahl's Law: Summary

Amdahl's Law
Designer's Mantra: Make the common case fast
Applies to all engineering optimizations!
Corollary
Rare cases don't matter
Student's Corollary
On a test, do the easy stuff for the most marks
first

31
Amdahl's Law: Rebuttal?

Does Amdahl always win?
Gustafson's Law
As the number of processors increases, you can
scale the problem size
As the problem size grows, ideally the sequential
part will shrink as a fraction of the total work
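
A rough Python comparison of the two laws. The slides do not give a formula for Gustafson's Law; the sketch uses its usual statement, scaled speedup = s + (1 - s) × N, where s is the sequential fraction of the run time on the N-processor machine. The function names are hypothetical.

```python
def gustafson_speedup(serial_fraction, n):
    """Scaled speedup on n processors when the problem grows with n."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return serial_fraction + (1.0 - serial_fraction) * n


def amdahl_speedup(f, s):
    """Fixed-size speedup when a fraction f is sped up s times."""
    return 1.0 / ((1.0 - f) + f / s)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # Same 10% sequential share, read two ways: fixed vs. scaled problem.
        print(n, amdahl_speedup(0.9, n), gustafson_speedup(0.1, n))
```

With a 10% sequential share, the fixed-size speedup stays below 10 however many processors are added, while the scaled speedup keeps growing with N, because the extra processors are spent on a larger problem rather than on finishing the same one sooner.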

32
Summary

Concurrency: try to identify independent elements
that can be performed in parallel
Only parallelize the common case, and make sure
it is frequent enough to matter

