Title: Parallel Computing
1- Overview of Parallel Architecture
- MPI and Cluster Computing
2- Definitions
- Multicomputers: interconnected computers with separate address spaces. Also known as message-passing or distributed address-space computers.
- Multiprocessors: multiple processors having access to the same memory (shared address space, or single address-space computers).
- Note: some use the term multiprocessor to include multicomputers. The definitions above have not gained universal acceptance.
3- Flynn's Taxonomy
- Classifies machines by instruction stream (single or multiple) and data stream (single or multiple):
- SISD: Uniprocessors
- SIMD: Processor Arrays, Pipelined Vector Processors
- MISD: Systolic Arrays
- MIMD: Multiprocessors, Multicomputers
4- SISD
- SISD: model of the serial von Neumann machine
- Single control processor
5- SIMD
- Multiple processors executing the same program in lockstep
- Data that each processor sees may be different
- Single control processor issues each instruction
- Individual processors can be turned on/off at each cycle (masking)
- Each processor executes the same instruction
[Diagram: control processor issuing instructions to the processing elements over an interconnect]
6- MIMD
- All processors execute their own set of instructions
- Processors operate on separate data streams
- May have separate clocks
- Examples: IBM SPs, TMC's CM-5, Cray T3D/T3E, SGI Origin, Tera MTA, clusters, etc.
- MIMD machines are divided into:
- Shared Memory: sometimes called multiprocessors
- Distributed Memory: sometimes called multicomputers
7- A generic parallel architecture
[Diagram: processors (P) and memories (M) connected by an interconnection network]
- Where is the memory physically located?
8- Shared Memory MIMD and Clusters of SMPs
- Since small shared memory machines (SMPs) are the fastest commodity machines, build a larger machine by connecting many of them with a network.
- Shared memory within one SMP, but message passing outside of an SMP.
- Processors are all connected to a large shared memory.
- Local memory is not (usually) part of the hardware.
- Cost: it is much cheaper to access data in cache than in main memory.
9- Distributed Memory
- This is what we have!!
- Each processor is connected to its own memory and cache but cannot directly access another processor's memory.
- Each node has a network interface (NI) for all communication and synchronization.
10- Parallel Programming Models
- Control
- How is parallelism created?
- What orderings exist between operations?
- How do different threads of control synchronize?
- Data
- What data is private vs. shared?
- How is logically shared data accessed or communicated?
- Operations
- What are the atomic operations?
- Data Parallelism vs. Functional Parallelism
11- Trivial Example
- Parallel Decomposition
- Each evaluation and each partial sum is a task.
- Assign n/p numbers to each of p procs
- Each computes independent private results and a partial sum.
- One (or all) collects the p partial sums and computes the global sum (see the MPI sketch after this slide).
- Two Classes of Data
- Logically Shared
- The original n numbers, the global sum.
- Logically Private
- The individual function evaluations.
- What about the individual partial sums?
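A minimal MPI sketch of this decomposition, under assumptions not stated on the slide (the input array x, the placeholder function f, and names such as local_sum are illustrative only):

    #include <stdio.h>
    #include "mpi.h"

    double f(double x) { return x * x; }       /* stand-in for the per-element evaluation */

    int main(int argc, char *argv[])
    {
        double x[1000], local_sum = 0.0, global_sum = 0.0;
        int i, myid, num, n = 1000, chunk, start, end;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &num);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        for (i = 0; i < n; i++) x[i] = i;      /* the logically shared input data */

        chunk = n / num;                       /* assign n/p numbers to each of the p processes */
        start = myid * chunk;
        end   = (myid == num - 1) ? n : start + chunk;

        for (i = start; i < end; i++)          /* private evaluations, private partial sum */
            local_sum += f(x[i]);

        /* one process (rank 0) collects the p partial sums into the global sum */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (myid == 0) printf("global sum = %f\n", global_sum);
        MPI_Finalize();
        return 0;
    }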
12- Functional Parallelism
- Parallel Decomposition
- Each processor has a different function
- Find the parallelism in this sequence of operations (a threaded sketch follows below):
- a ← 2
- b ← 3
- m ← (a + b)/2
- s ← (a² + b²)/2
- v ← s - m²
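One hedged way to exploit this functional parallelism with POSIX threads (the thread layout and names are illustrative, not from the slide): once a and b are set, m and s do not depend on each other and can be computed concurrently, while v must wait for both.

    #include <pthread.h>
    #include <stdio.h>

    static double a, b, m, s, v;

    /* m and s depend only on a and b, not on each other */
    static void *calc_m(void *arg) { m = (a + b) / 2;       return NULL; }
    static void *calc_s(void *arg) { s = (a*a + b*b) / 2;   return NULL; }

    int main(void)
    {
        pthread_t tm, ts;

        a = 2;  b = 3;                    /* the two independent assignments        */
        pthread_create(&tm, NULL, calc_m, NULL);
        pthread_create(&ts, NULL, calc_s, NULL);
        pthread_join(tm, NULL);           /* v needs both m and s, so wait for both */
        pthread_join(ts, NULL);
        v = s - m * m;                    /* final step is inherently sequential    */

        printf("m = %g, s = %g, v = %g\n", m, s, v);
        return 0;
    }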
13- Multicomputer topology
- Interconnection network should provide connectivity, low latency, high bandwidth
- Many interconnection networks developed over the last 2 decades
- Hypercube
- Mesh, torus
- Ring, etc.
14- Lines and Rings
- Simplest interconnection network
- Routing becomes an issue
- No direct connection between non-adjacent nodes
15- Mesh and Torus
- Generalization of the line/ring to multiple dimensions
- 2D mesh used on the Intel Paragon; 3D torus used on the Cray T3D and T3E.
16- Mesh and Torus
- Torus uses wraparound links to increase connectivity
17- Hop Count
- Networks can be measured by their diameter
- This is the minimum number of hops a message must traverse between the two nodes that are furthest apart
- Line: diameter N-1
- 2D (N x M) mesh: diameter (N-1) + (M-1)
- 2D (N x M) torus: diameter ⌊N/2⌋ + ⌊M/2⌋ (worked out in the sketch below)
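As a concrete check of these formulas, a small sketch in C (the helper names are illustrative, not from the slides):

    #include <stdio.h>

    int line_diameter(int n)           { return n - 1; }
    int mesh2d_diameter(int n, int m)  { return (n - 1) + (m - 1); }
    int torus2d_diameter(int n, int m) { return n / 2 + m / 2; }   /* integer division acts as floor */

    int main(void)
    {
        /* Example: an 8 x 8 mesh has diameter 14, but the wraparound links of an 8 x 8 torus cut it to 8. */
        printf("line(8)    = %d\n", line_diameter(8));       /* 7  */
        printf("mesh(8,8)  = %d\n", mesh2d_diameter(8, 8));  /* 14 */
        printf("torus(8,8) = %d\n", torus2d_diameter(8, 8)); /* 8  */
        return 0;
    }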
18- Hypercube Networks
- A dimension-N hypercube is constructed by connecting the corners of two dimension-(N-1) hypercubes (see the routing sketch below)
- Interconnect for the Cosmic Cube (Caltech, 1985) and its offshoots (Intel iPSC, nCUBE), Thinking Machines CM-2, and others.
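The slide does not give the counts, but a dimension-N hypercube has 2^N nodes and diameter N (for example, dimension 4 gives 16 nodes and at most 4 hops). A minimal sketch of the standard dimension-order routing idea, assuming nodes are labeled so that neighbors differ in exactly one bit:

    /* Move one hop toward dst by flipping the lowest bit in which
     * the current node label and the destination label differ.   */
    unsigned next_hop(unsigned current, unsigned dst)
    {
        unsigned diff = current ^ dst;     /* bits where the two labels differ */
        if (diff == 0)
            return current;                /* already at the destination       */
        return current ^ (diff & -diff);   /* flip the lowest differing bit    */
    }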
19- Fat-tree Interconnect
- Bandwidth is increased towards the root (but aggregate bandwidth decreases)
- Data network for TMC's CM-5 (a MIMD MPP)
- 4 leaf nodes; internal nodes have 2 or 4 children
- To route from leaf A to leaf B, pick a random switch C in the least common ancestor fat node of A and B, then take the unique tree route from A to C and from C to B
- Figure: binary fat-tree in which all internal nodes have two children
20- An Impractical Interconnection Topology
- Completely connected
- Every node has a direct wire connection to every other node
- N x (N-1)/2 wires
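For a sense of scale: fully connecting N = 64 nodes would already require 64 x 63 / 2 = 2016 wires, and the wire count grows quadratically with N, which is why this topology is impractical for large machines.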
21- Shared Address Space Multiprocessors
- 4 basic types of interconnection media
- Bus
- Crossbar switch
- Multistage network
- Interconnection network with distributed shared memory (DSM)
22- Bus architectures
- The bus acts as a party line between processors and shared memories
- The bus provides uniform access to shared memory (UMA: Uniform Memory Access; SMP: symmetric multiprocessor)
- When the bus saturates, the performance of the system degrades
- Bus-based systems do not scale to more than 32 processors (e.g., Sequent Symmetry, Balance)
23- Crossbar Switch
- Uses O(mn) switches to connect m processors and n memories with distinct paths between each processor/memory pair
- UMA
- Scalable performance but not scalable cost.
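For example, a crossbar joining m = 16 processors to n = 16 memories needs 16 x 16 = 256 switchpoints; doubling both sides quadruples the switch count, which is why the cost does not scale even though performance does.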
24- Some Multistage Networks
25- Distributed Shared Memory (DSM)
- Rather than having all processors on one side of the network and all memory on the other, DSM has some memory at each processor (or group of processors).
- NUMA (Non-Uniform Memory Access)
- Example: HP/Convex Exemplar (late 90s)
- 3 cycles to access data in cache
- 92 cycles for local memory (shared by 16 procs)
- 450 cycles for non-local memory
26- Message Passing vs. Shared Memory
- Message Passing
- Requires software involvement to move data
- More cumbersome to program
- More scalable
- Shared Memory
- Subtle programming and performance bugs
- Multi-tiered
- Best(?) Worst(?) of both worlds
27- Message Passing Interface, Dependencies
28- Single-Program Multiple Data (SPMD)
- if (my_rank != 0)
-   . . .
- else
-   . . .
- We write a single program
- Portions of that program are not executed by some of the processors
- We can change the data that each of the processors uses for computation
29- Single-Program Multiple Data (SPMD)
- x = p / num;   // p: array size, num: cluster size
- start = myid * x;
- if (myid == num - 1)
-   end = p;
- else
-   end = start + x;
- for (i = start; i < end; i = i + 1)
-   // summation goes here
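As a worked example of these bounds (numbers chosen for illustration): with p = 10 array elements and num = 3 processes, x = 10/3 = 3, so ranks 0 and 1 sum elements [0,3) and [3,6), while the last rank (myid == num-1) takes the remainder and sums [6,10).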
30- When can two statements execute in parallel?
- On one processor
- statement 1
- statement 2
- On two processors
- processor 1: statement 1
- processor 2: statement 2
31- Fundamental Assumption
[Note (Willy Zwaenepoel): This is OK for completely unsynchronized parallel execution, but it excludes the possibility of synchronization among processors.]
- Processors execute independently: no control over the order of execution between processors
32- When can 2 statements execute in parallel?
- Possibility 1: statement 1 (on processor 1) executes before statement 2 (on processor 2)
- Possibility 2: statement 2 (on processor 2) executes before statement 1 (on processor 1)
33- When can 2 statements execute in parallel?
[Note (Willy Zwaenepoel): It is really that if there is a requirement that they be executed in the sequential order, then they need to be executed in the same order here. Maybe that is the same.]
- Their order of execution must not matter!
- In other words,
- statement 1; statement 2;
- must be equivalent to
- statement 2; statement 1;
34- Example 1
- a = 1;
- b = 2;
- Statements can be executed in parallel.
35- Example 2
- a = 1;
- b = a;
- Statements cannot be executed in parallel.
- Program modifications may make it possible (see the sketch below).
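One such modification, as a sketch under the assumption that the value assigned to a is known at this point: forward-substitute that value into the second statement, after which the two statements no longer share data.

    /* Original: b = a must wait for a = 1 (true dependence).    */
    /* After forward substitution the statements are independent */
    /* and may execute in either order or in parallel.           */
    a = 1;
    b = 1;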
36- Example 3
- a = f(x);
- b = a;
- May not be wise to change the program here (evaluating f(x) a second time would make sequential execution take longer).
37- Example 4
- b = a;
- a = 1;
- Statements cannot be executed in parallel.
38- Example 5
- a = 1;
- a = 2;
- Statements cannot be executed in parallel.
39- True dependence
- Statements S1, S2
- S2 has a true dependence on S1
- iff
- S2 reads a value written by S1
40- Anti-dependence
- Statements S1, S2.
- S2 has an anti-dependence on S1
- iff
- S2 writes a value read by S1.
41- Output Dependence
- Statements S1, S2.
- S2 has an output dependence on S1
- iff
- S2 writes a variable written by S1.
42- When can 2 statements execute in parallel?
[Note (Willy Zwaenepoel): At this point, generalize to arbitrary pieces of code, like small procedures, and explain that dependences hold there too, or rather use the read-set/write-set formalism from the start?]
- S1 and S2 can execute in parallel
- iff
- there are no dependences between S1 and S2
- true dependences
- anti-dependences
- output dependences
- Some dependences can be removed (see the renaming sketch below).
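As an illustration of removing a dependence (a sketch using a hypothetical fresh variable, not from the slides): the anti-dependence of Example 4 can be eliminated by renaming, letting the write go to a new location.

    /* Anti-dependence: S2 writes the variable that S1 reads.   */
    /*   S1: b = a;                                             */
    /*   S2: a = 1;                                             */

    /* After renaming the new value into a_new, S1 and S2 touch */
    /* different locations and can execute in parallel.         */
    b     = a;       /* S1: still reads the old a               */
    a_new = 1;       /* S2: writes a fresh variable             */
    /* Later uses of a's new value must refer to a_new instead. */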
43- Example 6
[Note (Willy Zwaenepoel): Same comment as on the next slide.]
- Most parallelism occurs in loops.
- for (i = 0; i < 100; i++)
-   a[i] = i;
- No dependences.
- Iterations can be executed in parallel.
44- Example 7
[Note (Willy Zwaenepoel): There was an issue of the parallel processors having to have different variables for the loop index.]
- for (i = 0; i < 100; i++) {
-   a[i] = i;
-   b[i] = 2*i;
- }
- Iterations and statements can be executed in parallel.
45- Example 8
[Note (Willy Zwaenepoel): There was some issue here that there is a dependence caused by i; addressed by rewriting the second loop as "for j".]
- for (i = 0; i < 100; i++) a[i] = i;
- for (i = 0; i < 100; i++) b[i] = 2*i;
- Iterations and loops can be executed in parallel.
46- Example 9
[Note (Willy Zwaenepoel): Maybe we should introduce the notion of a dependence on itself earlier (before a loop is introduced). The notion of a loop-independent dependence was never extended to different statements within the body. It should be.]
- for (i = 0; i < 100; i++)
-   a[i] = a[i] + 100;
- There is a dependence on itself!
- Loop is still parallelizable.
47- Example 10
[Note (Willy Zwaenepoel): Same comment here on different statements?]
- for (i = 0; i < 100; i++)
-   a[i] = f(a[i-1]);
- Dependence between a[i] and a[i-1].
- Loop iterations are not parallelizable.
48- Loop-carried dependence
- A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop.
- Otherwise, we call it a loop-independent dependence.
- Loop-carried dependences prevent loop iteration parallelization.
49- Example 11
- for (i = 0; i < 100; i++)
-   for (j = 0; j < 100; j++)
-     a[i][j] = f(a[i][j-1]);
- Loop-independent dependence on i.
- Loop-carried dependence on j.
- Outer loop can be parallelized, inner loop cannot.
50- Example 12
[Note (Willy Zwaenepoel): Say a little more about loop interchange; make loops with loop-carried dependences appear on the inside of the nest, the others on the outside.]
- for (j = 0; j < 100; j++)
-   for (i = 0; i < 100; i++)
-     a[i][j] = f(a[i][j-1]);
- Inner loop can be parallelized, outer loop cannot.
- Less desirable situation.
- Loop interchange is sometimes possible (see the sketch below).
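A minimal sketch of the interchange for this nest (starting j at 1 only to keep the subscript in range): swapping the loops moves the dependence-carrying j loop inside, so the outer i loop can be parallelized, which is exactly the nest of Example 11.

    /* Before interchange: the outer j loop carries the dependence.      */
    for (j = 1; j < 100; j++)
        for (i = 0; i < 100; i++)
            a[i][j] = f(a[i][j-1]);

    /* After interchange: the i loop, which carries no dependence, is    */
    /* outermost, so its iterations can run on different processors.     */
    for (i = 0; i < 100; i++)
        for (j = 1; j < 100; j++)
            a[i][j] = f(a[i][j-1]);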
51- Level of loop-carried dependence
- The level is the nesting depth of the loop that carries the dependence.
- It indicates which loops can be parallelized.
52- Be careful: Example 13
- printf(a);
- printf(b);
- Statements have a hidden output dependence due to the shared output stream.
53- Be careful: Example 14
[Note (Willy Zwaenepoel): Also depends on what f and g can do to x.]
- a = f(x);
- b = g(x);
- Statements could have a hidden dependence if f and g update the same variable.
54- Be careful: Example 15
[Note (Willy Zwaenepoel): Is this the business about the distance of a dependence in Ken's book? It is also possible to transform this to a nested loop.]
- for (i = 0; i < 100; i++)
-   a[i+10] = f(a[i]);
- Dependence between a[10], a[20], ...
- Dependence between a[11], a[21], ...
- ...
- Some parallel execution is possible (see the sketch below).
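Following the note above, a sketch of the nested-loop transformation (assuming a has at least 110 elements): indices that agree modulo 10 form a chain that must run in order, but the 10 chains are independent, so the outer loop below could be parallelized.

    /* Same work as: for (i = 0; i < 100; i++) a[i+10] = f(a[i]);   */
    for (r = 0; r < 10; r++)             /* chains: parallelizable  */
        for (i = r; i < 100; i += 10)    /* within a chain: serial  */
            a[i+10] = f(a[i]);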
55- Be careful: Example 16
[Note (Willy Zwaenepoel): This requires synchronization between processors for parallel execution. Goes against the initial assumption.]
- for (i = 1; i < 100; i++) {
-   a[i] = ...;
-   ... = a[i-1];
- }
- Dependence between a[i] and a[i-1]
- Complete parallel execution impossible
- Pipelined parallel execution possible
56- Be careful: Example 17
- for (i = 0; i < 100; i++)
-   a[i] = f(a[indexa[i]]);
- Cannot tell for sure.
- Parallelization depends on user knowledge of the values in indexa.
- The user can tell; the compiler cannot.
57- An aside
- Parallelizing compilers analyze program dependences to decide on parallelization.
- In parallelization by hand, the user does the same analysis.
- Compiler: more convenient and more correct.
- User: more powerful, can analyze more patterns.
58- To remember
- Statement order must not matter.
- Statements must not have dependences.
- Some dependences can be removed.
- Some dependences may not be obvious.
59- Parallelism: First Program
- Ability to execute different parts of a program concurrently on different machines
- Goal: shorten execution time
- There are p processes executing a program
- They have ranks 0, 1, 2, ..., p-1
- Each process prints out a Hello message
60- Hello World
- #include "mpi.h"
- #include <stdio.h>
- #include <math.h>
- int main(int argc, char *argv[])
- {
-   int myid, num, d;
-   MPI_Init(&argc, &argv);
-   MPI_Comm_size(MPI_COMM_WORLD, &num);
-   MPI_Comm_rank(MPI_COMM_WORLD, &myid);
-   if (myid == 0) {
-     printf("Hello.\n");
-     printf("How many processors would you like?\n");
-     scanf("%d", &d);
-     if (d > num)
-       printf("number too big\n");
-   }
-   MPI_Bcast(&d, 1, MPI_INT, 0, MPI_COMM_WORLD);
-   if (myid < d)
-     printf("Hello from Process %d of %d\n", myid, num);
-   MPI_Finalize();
-   return 0;
- }
61- MPI Function Calls
- #include "mpi.h"
- MPI_Init(&argc, &argv)
- MPI_Comm_size(MPI_COMM_WORLD, &num)
- MPI_Comm_rank(MPI_COMM_WORLD, &myid)
- MPI_Bcast(&d, 1, MPI_INT, 0, MPI_COMM_WORLD)
- // From example on page 42
- MPI_Send( . . . ) // What goes here? See Page 47
- MPI_Recv( . . . ) // What goes here? See Page 48
- MPI_Finalize()
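For reference (without filling in the slide's page references), the general forms of the two point-to-point calls in C are:

    MPI_Send(buf, count, datatype, dest,   tag, comm);
    MPI_Recv(buf, count, datatype, source, tag, comm, &status);   /* status is an MPI_Status */

    /* e.g., rank 0 sends one int to rank 1, which receives it:      */
    /*   MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);             */
    /*   MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);    */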
62- MPI Data Types
- MPI_CHAR
- MPI_INT
- MPI_LONG
- MPI_FLOAT
- MPI_LONG_DOUBLE
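These correspond to the C types char, int, long, float, and long double, respectively; the list is not exhaustive (MPI also defines MPI_DOUBLE, MPI_SHORT, MPI_UNSIGNED, and others).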