Parallel Computing - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Parallel Computing

1
  • Parallel Computing
  • Overview of Parallel Architecture
  • MPI and Cluster Computing

2
Definitions
  • Multicomputers: Interconnected computers with
    separate address spaces. Also known as
    message-passing or distributed address-space
    computers.
  • Multiprocessors: Multiple processors having
    access to the same memory (shared address space,
    or single address-space computers).
  • Note: Some use the term multiprocessor to include
    multicomputers. The definitions above have not
    gained universal acceptance.

3
Flynn's Taxonomy
  • SISD (single instruction, single data stream): Uniprocessors
  • SIMD (single instruction, multiple data streams): Processor Arrays, Pipelined Vector Processors
  • MISD (multiple instruction, single data stream): Systolic Arrays
  • MIMD (multiple instruction, multiple data streams): Multiprocessors, Multicomputers
4
SISD
  • SISD
  • Model of serial von Neumann machine
  • Single control processor

5
SIMD
  • Multiple processors executing the same program in
    lockstep
  • Data that each processor sees may be different
  • Single control processor issues each instruction
  • Individual processors can be turned on/off at
    each cycle (masking)
  • Each processor executes the same instruction

[Diagram: a control processor issuing instructions to an array of processors over an interconnect]
6
MIMD
  • All processors execute their own set of
    instructions
  • Processors operate on separate data streams
  • May have separate clocks
  • Examples: IBM SPs, TMC's CM-5, Cray T3D/T3E, SGI
    Origin, Tera MTA, clusters, etc.
  • MIMD machines are divided into:
  • Shared Memory: sometimes called Multiprocessors
  • Distributed Memory: sometimes called Multicomputers

7
A generic parallel architecture
[Diagram: processors (P) and memory modules (M) connected by an interconnection network]
  • Where is the memory physically located?

8
Shared Memory MIMD and Clusters of SMPs
  • Since small shared memory machines (SMPs) are the
    fastest commodity machines, build a larger machine
    by connecting many of them with a network.
  • Shared memory within one SMP, but message passing
    outside of an SMP.
  • Processors are all connected to a large shared
    memory.
  • Local memory is not (usually) part of the
    hardware.
  • Cost: much cheaper to access data in cache than
    in main memory.

9
Distributed Memory
  • This is what we have!!
  • Each processor is connected to its own memory and
    cache but cannot directly access another
    processor's memory.
  • Each node has a network interface (NI) for all
    communication and synchronization.

10
Parallel Programming Models
  • Control
  • How is parallelism created?
  • What orderings exist between operations?
  • How do different threads of control synchronize?
  • Data
  • What data is private vs. shared?
  • How is logically shared data accessed or
    communicated?
  • Operations
  • What are the atomic operations?
  • Data Parallelism vs. Functional Parallelism

11
Trivial Example
  • Parallel Decomposition
  • Each evaluation and each partial sum is a task.
  • Assign n/p numbers to each of the p processors
  • Each computes independent private results and
    partial sum.
  • One (or all) collects the p partial sums and
    computes the global sum.
  • Two Classes of Data
  • Logically Shared
  • The original n numbers, the global sum.
  • Logically Private
  • The individual function evaluations.
  • What about the individual partial sums?

12
Functional Parallelism
  • Parallel Decomposition
  • Each processor has a different function
  • Find the parallelism in this sequence of
    operations
  • a ← 2
  • b ← 3
  • m ← (a + b)/2
  • s ← (a² + b²)/2
  • v ← s - m²
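A sketch of the dependence structure in C (assuming the last statement computes s - m², i.e. the variance, as reconstructed above):

    /* a and b do not depend on each other: they can be computed in parallel */
    double a = 2.0;
    double b = 3.0;

    /* m and s each depend on a and b, but not on each other: they can also run in parallel */
    double m = (a + b) / 2.0;          /* mean */
    double s = (a * a + b * b) / 2.0;  /* mean of squares */

    /* v depends on both m and s, so it must be computed last */
    double v = s - m * m;              /* variance */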

13
Multicomputer topology
  • Interconnection network should provide
    connectivity, low latency, high bandwidth
  • Many interconnection networks have been developed
    over the last two decades
  • Hypercube
  • Mesh, torus
  • Ring, etc.

14
Lines and Rings
  • Simplest interconnection network
  • Routing becomes an issue
  • No direct connection between non-adjacent nodes

15
Mesh and Torus
  • Generalization of line/ring to multiple
    dimensions
  • 2D mesh used on the Intel Paragon; 3D torus used
    on the Cray T3D and T3E.

16
Mesh and Torus
  • Torus uses wraparound links to increase
    connectivity

17
Hop Count
  • Networks can be measured by their diameter
  • This is the minimum number of hops that a message
    must traverse between the two nodes that are
    furthest apart
  • Line: Diameter = N - 1
  • 2D (N x M) Mesh: Diameter = (N - 1) + (M - 1)
  • 2D (N x M) Torus: Diameter = ⌊N/2⌋ + ⌊M/2⌋
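As a worked example: a 16-node line has diameter 15, a 4x4 mesh has diameter (4-1) + (4-1) = 6, and a 4x4 torus has diameter ⌊4/2⌋ + ⌊4/2⌋ = 4.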

18
Hypercube Networks
  • A dimension-N hypercube is constructed by
    connecting corresponding corners of two
    dimension-(N-1) hypercubes
  • Interconnect for the Cosmic Cube (Caltech, 1985)
    and its offshoots (Intel iPSC, nCUBE), Thinking
    Machines' CM-2, and others.

19
Fat-tree Interconnect
  • Bandwidth is increased towards the root (but
    aggregate bandwidth decreases)
  • Data network for TMC's CM-5 (a MIMD MPP)
  • 4 leaf nodes; internal nodes have 2 or 4 children
  • To route from leaf A to leaf B, pick a random
    switch C in the least common ancestor fat node of
    A and B, then take the unique tree route from A to
    C and from C to B

[Figure: binary fat-tree in which all internal nodes have two children]
20
An Impractical Interconnection Topology
  • Completely connected
  • Every node has a direct wire connection to every
    other node
  • N x (N-1)/2 Wires

21
Shared Address Space Multiprocessors
  • 4 basic types of interconnection media
  • Bus
  • Crossbar switch
  • Multistage network
  • Interconnection network with distributed shared
    memory (DSM)

22
Bus architectures
  • Bus acts as a party line between processors and
    shared memories
  • Bus provides uniform access to shared memory
    (UMA: Uniform Memory Access); such systems are
    called SMPs (symmetric multiprocessors)
  • When the bus saturates, the performance of the
    system degrades
  • Bus-based systems do not scale to more than about
    32 processors (e.g., Sequent Symmetry, Balance)

23
Crossbar Switch
  • Uses O(mn) switches to connect m processors and n
    memories with distinct paths between each
    processor/memory pair
  • UMA
  • Scalable performance but not cost.

24
Some Multistage Networks
  • Butterfly multistage
  • Shuffle multistage

25
Distributed Shared Memory (DSM)
  • Rather than having all processors on one side of
    network and all memory on the other, DSM has some
    memory at each processor (or group of
    processors).
  • NUMA (Non-uniform memory access)
  • Example: HP/Convex Exemplar (late 1990s)
  • 3 cycles to access data in cache
  • 92 cycles for local memory (shared by 16 procs)
  • 450 cycles for non-local memory

26
Message Passing vs. Shared Memory
  • Message Passing
  • Requires software involvement to move data
  • More cumbersome to program
  • More scalable
  • Shared Memory
  • Subtle programming and performance bugs
  • Multi-tiered
  • Best(?) and worst(?) of both worlds

27
Message Passing Interface, Dependencies
28
Single-Program Multiple Data (SPMD)
  • if (my_rank != 0)
  • . . .
  • else
  • . . .
  • We write a single program
  • Portions of that program are not executed by some
    of the processors
  • We can change the data that each of the
    processors uses for computation

29
Single-Program Multiple Data (SPMD)
  • x = p/num;        // p: array size, num: cluster size
  • start = myid * x;
  • if (myid == num-1)
  •   end = p;
  • else
  •   end = start + x;
  • for (i = start; i < end; i = i + 1)
  •   // summation goes here
30
When can two statements execute in parallel?
  • On one processor:
  •   statement 1;
  •   statement 2;
  • On two processors:
  •   Processor 1: statement 1
  •   Processor 2: statement 2

31
Fundamental Assumption
Willy Zwaenepoel This is ok for completely
unsynchronized parallel execution, but it
excludes the possibility of synchronization among
processors.
  • Processors execute independently; there is no
    control over the order of execution between processors

32
When can 2 statements execute in parallel?
  • Possibility 1: Processor 1 executes statement1
    before Processor 2 executes statement2
  • Possibility 2: Processor 2 executes statement2
    before Processor 1 executes statement1

33
When can 2 statements execute in parallel?
Willy Zwaenepoel It is really that if there is a
requirement that they be executed in the
sequential order, then they need to be executed
in the same order here. Maybe that is the same.
  • Their order of execution must not matter!
  • In other words,
  • statement1; statement2
  • must be equivalent to
  • statement2; statement1

34
Example 1
  • a = 1;
  • b = 2;
  • Statements can be executed in parallel.

35
Example 2
  • a = 1;
  • b = a;
  • Statements cannot be executed in parallel
  • Program modifications may make it possible.
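One such modification, sketched here (not from the slides), is copy propagation: because the value written to a is a known constant, the second statement can use the constant directly, leaving two independent statements:

    a = 1;
    b = 1;   /* reads the constant instead of a; no dependence remains between the statements */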

36
Example 3
  • a = f(x);
  • b = a;
  • May not be wise to change the program (sequential
    execution would take longer).

37
Example 4
  • b = a;
  • a = 1;
  • Statements cannot be executed in parallel.

38
Example 5
  • a = 1;
  • a = 2;
  • Statements cannot be executed in parallel.

39
True dependence
  • Statements S1, S2
  • S2 has a true dependence on S1
  • iff
  • S2 reads a value written by S1

40
Anti-dependence
  • Statements S1, S2.
  • S2 has an anti-dependence on S1
  • iff
  • S2 writes a value read by S1.

41
Output Dependence
  • Statements S1, S2.
  • S2 has an output dependence on S1
  • iff
  • S2 writes a variable written by S1.

42
When can 2 statements execute in parallel?
Willy Zwaenepoel At this point generalize to
arbitrary pieces of code, like small procedures,
and explain that dependences hold there too, or
rather use the read-set/write-set formalism from
the start?
  • S1 and S2 can execute in parallel
  • iff
  • there are no dependences between S1 and S2
  • true dependences
  • anti-dependences
  • output dependences
  • Some dependences can be removed.
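A sketch of one common removal technique, renaming (an illustration, not taken from the slides): anti- and output dependences are conflicts over a name rather than a flow of data, so directing the later write to a fresh variable eliminates them:

    /* anti-dependence (as in Example 4): the second statement writes a, which the first reads */
    b = a;
    a = 1;

    /* renaming the written variable removes the dependence; later uses read a_new instead of a */
    b = a;
    a_new = 1;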

43
Example 6
Willy Zwaenepoel Same comment as next slide.
  • Most parallelism occurs in loops.
  • for (i = 0; i < 100; i++)
  •   a[i] = i;
  • No dependences.
  • Iterations can be executed in parallel.

44
Example 7
Willy Zwaenepoel There was an issue of the
parallel processors having to have different
variables for the loop index.
  • for (i = 0; i < 100; i++) {
  •   a[i] = i;
  •   b[i] = 2*i;
  • }
  • Iterations and statements can be executed in
    parallel.

45
Example 8
Willy Zwaenepoel There was some issue here that
there is a dependence caused by i -- addressed by
rewriting the second loop to use j.
  • for (i = 0; i < 100; i++) a[i] = i;
  • for (i = 0; i < 100; i++) b[i] = 2*i;
  • Iterations and loops can be executed in parallel.

46
Example 9
Willy Zwaenepoel Maybe we should introduce the
notion of a dependence on itself earlier (before
a loop is introduced). The notion of a
loop-independent dependence was never extended to
different statements within the body. It should
be.
  • for (i = 0; i < 100; i++)
  •   a[i] = a[i] + 100;
  • There is a dependence on itself!
  • Loop is still parallelizable.

47
Example 10
Willy Zwaenepoel Same comment here on different
statements?
  • for (i = 0; i < 100; i++)
  •   a[i] = f(a[i-1]);
  • Dependence between a[i] and a[i-1].
  • Loop iterations are not parallelizable.

48
Loop-carried dependence
  • A loop carried dependence is a dependence that is
    present only if the statements are part of the
    execution of a loop.
  • Otherwise, we call it a loop-independent
    dependence.
  • Loop-carried dependences prevent loop iteration
    parallelization.

49
Example 11
  • for (i = 0; i < 100; i++)
  •   for (j = 0; j < 100; j++)
  •     a[i][j] = f(a[i][j-1]);
  • Loop-independent dependence on i.
  • Loop-carried dependence on j.
  • Outer loop can be parallelized, inner loop cannot.

50
Example 12
Willy Zwaenepoel Say a little more about loop
interchange, make loops with loop-carried
dependences appear on the inside of the nest, the
others on the outside.
  • for (j = 0; j < 100; j++)
  •   for (i = 0; i < 100; i++)
  •     a[i][j] = f(a[i][j-1]);
  • Inner loop can be parallelized, outer loop
    cannot.
  • Less desirable situation.
  • Loop interchange is sometimes possible.
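A sketch of the interchanged nest for this example: after interchange it is the same nest as Example 11, with the dependence-free i loop outermost (parallelizable) and the dependence-carrying j loop innermost (sequential):

    for (i = 0; i < 100; i++)        /* carries no dependence: iterations can run in parallel */
        for (j = 0; j < 100; j++)    /* carries the dependence: runs sequentially within each i */
            a[i][j] = f(a[i][j-1]);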

51
Level of loop-carried dependence
  • Is the nesting depth of the loop that carries the
    dependence.
  • Indicates which loops can be parallelized.

52
Be careful Example 13
  • printf(a);
  • printf(b);
  • Statements have a hidden output dependence due to
    the output stream.

53
Be careful Example 14
Willy Zwaenepoel Also depends on what f and g
can do to x.
  • a = f(x);
  • b = g(x);
  • Statements could have a hidden dependence if f
    and g update the same variable.

54
Be careful Example 15
Willy Zwaenepoel Is this the business about the
distance of a dependence in Ken's book? It is
also possible to transform this to a nested loop.
  • for (i = 0; i < 100; i++)
  •   a[i+10] = f(a[i]);
  • Dependence between a[10], a[20], ...
  • Dependence between a[11], a[21], ...
  • Some parallel execution is possible.
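A sketch of the nested-loop transformation hinted at in the note above (an assumption, not shown on the slide): the elements form 10 independent dependence chains, one per residue of i modulo 10, so a loop over the chains can run in parallel while each chain is traversed sequentially:

    for (k = 0; k < 10; k++)            /* the 10 chains are mutually independent: parallelizable */
        for (i = k; i < 100; i += 10)   /* within a chain, each step depends on the previous one */
            a[i+10] = f(a[i]);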

55
Be careful Example 16
Willy Zwaenepoel This requires synchronization
between processors for parallel execution. Goes
against the initial assumption.
  • for (i = 1; i < 100; i++) {
  •   a[i] = ...;
  •   ... = a[i-1];
  • }
  • Dependence between a[i] and a[i-1]
  • Complete parallel execution impossible
  • Pipelined parallel execution possible

56
Be careful Example 17
  • for (i = 0; i < 100; i++)
  •   a[i] = f(a[indexa[i]]);
  • Cannot tell for sure.
  • Parallelization depends on user knowledge of
    values in indexa.
  • User can tell, compiler cannot.

57
An aside
  • Parallelizing compilers analyze program
    dependences to decide parallelization.
  • In parallelization by hand, user does the same
    analysis.
  • Compiler: more convenient and more correct
  • User: more powerful, can analyze more patterns

58
To remember
  • Statement order must not matter.
  • Statements must not have dependences.
  • Some dependences can be removed.
  • Some dependences may not be obvious.

59
Parallelism: First Program
  • Ability to execute different parts of a program
    concurrently on different machines
  • Goal: shorten execution time
  • There are p processes executing a program
  • They have ranks 0, 1, 2, ..., p-1
  • Each process prints out a Hello message

60
Hello World
  • #include "mpi.h"
  • #include <stdio.h>
  • #include <math.h>
  • int main(int argc, char *argv[])
  • {
  •   int myid, num, d;
  •   MPI_Init(&argc, &argv);
  •   MPI_Comm_size(MPI_COMM_WORLD, &num);
  •   MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  •   if (myid == 0) {
  •     printf("Hello.\n");
  •     printf("How many processors would you like?\n");
  •     scanf("%d", &d);
  •     if (d > num)
  •       printf("number too big\n");
  •   }
  •   MPI_Bcast(&d, 1, MPI_INT, 0, MPI_COMM_WORLD);
  •   if (myid < d)
  •     printf("Hello from Process %d of %d\n", myid, num);
  •   MPI_Finalize();
  •   return 0;
  • }

61
MPI Function Calls
  • #include "mpi.h"
  • MPI_Init(&argc, &argv);
  • MPI_Comm_size(MPI_COMM_WORLD, &num);
  • MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  • MPI_Bcast(&d, 1, MPI_INT, 0, MPI_COMM_WORLD);  // From example on page 42
  • MPI_Send( . . . )  // What goes here? See Page 47
  • MPI_Recv( . . . )  // What goes here? See Page 48
  • MPI_Finalize();
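As a reference for the two calls left open above, a minimal point-to-point exchange using the standard MPI signatures (the payload variable and the tag value 0 are illustrative):

    int value = 42;                /* illustrative payload */
    MPI_Status status;

    if (myid == 0)
        /* arguments: buffer, count, datatype, destination rank, tag, communicator */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (myid == 1)
        /* arguments: buffer, count, datatype, source rank, tag, communicator, status */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);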

62
MPI Data Types
  • MPI_CHAR
  • MPI_INT
  • MPI_LONG
  • MPI_FLOAT
  • MPI_LONG_DOUBLE