Parallel Computing - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Parallel Computing

1
  • Parallel Computing
  • Overview of Parallel Architecture
  • MPI and Cluster Computing

2
Definitions
  • Multicomputers: Interconnected computers with
    separate address spaces. Also known as
    message-passing or distributed address-space
    computers.
  • Multiprocessors: Multiple processors having
    access to the same memory (shared address space,
    or single address-space computers).
  • Note: Some use the term multiprocessor to include
    multicomputers. The definitions above have not
    gained universal acceptance.

3
Flynn's Taxonomy
  • SISD (single instruction, single data stream): Uniprocessors
  • SIMD (single instruction, multiple data streams): Processor Arrays, Pipelined Vector Processors
  • MISD (multiple instruction, single data stream): Systolic Arrays
  • MIMD (multiple instruction, multiple data streams): Multiprocessors, Multicomputers
4
SISD
  • SISD
  • Model of serial von Neumann machine
  • Single control processor

5
SIMD
  • Multiple processors executing the same program in
    lockstep
  • Data that each processor sees may be different
  • Single control processor issues each instruction
  • Individual processors can be turned on/off at
    each cycle (masking)
  • Each processor executes the same instruction

[Diagram: a control processor issuing instructions to an array of processors over an interconnect]
6
MIMD
  • All processors execute their own set of
    instructions
  • Processors operate on separate data streams
  • May have separate clocks
  • Examples: IBM SPs, TMC's CM-5, Cray T3D/T3E, SGI
    Origin, Tera MTA, clusters, etc.
  • MIMD machines are divided into:
  • Shared Memory: sometimes called Multiprocessors
  • Distributed Memory: sometimes called Multicomputers

7
A generic parallel architecture
[Diagram: processors (P) and memory modules (M) connected by an interconnection network]
  • Where is the memory physically located?

8
Shared Memory MIMD and Clusters of SMPs
  • Since small shared memory machines (SMPs) are the
    fastest commodity machines, build a larger machine
    by connecting many of them with a network.
  • Shared memory within one SMP, but message passing
    outside of an SMP.
  • Processors are all connected to a large shared
    memory.
  • Local memory is not (usually) part of the
    hardware.
  • Cost: much cheaper to access data in cache than
    in main memory.

9
Distributed Memory
  • This is what we have!!
  • Each processor is connected to its own memory and
    cache but cannot directly access another
    processor's memory.
  • Each node has a network interface (NI) for all
    communication and synchronization.

10
Parallel Programming Models
  • Control
  • How is parallelism created?
  • What orderings exist between operations?
  • How do different threads of control synchronize?
  • Data
  • What data is private vs. shared?
  • How is logically shared data accessed or
    communicated?
  • Operations
  • What are the atomic operations?
  • Data Parallelism vs. Functional Parallelism

11
Trivial Example
  • Parallel Decomposition
  • Each evaluation and each partial sum is a task.
  • Assign n/p numbers to each of the p processors
  • Each computes independent private results and
    partial sum.
  • One (or all) collects the p partial sums and
    computes the global sum.
  • Two Classes of Data
  • Logically Shared
  • The original n numbers, the global sum.
  • Logically Private
  • The individual function evaluations.
  • What about the individual partial sums?

12
Functional Parallelism
  • Parallel Decomposition
  • Each processor has a different function
  • Find the parallelism in this sequence of
    operations
  • a ← 2
  • b ← 3
  • m ← (a + b)/2
  • s ← (a² + b²)/2
  • v ← s - m²
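A sketch of the dependence structure in C (assuming the last statement computes s - m², i.e. the variance, as reconstructed above):

    /* a and b do not depend on each other: they can be computed in parallel */
    double a = 2.0;
    double b = 3.0;

    /* m and s each depend on a and b, but not on each other: they can also run in parallel */
    double m = (a + b) / 2.0;          /* mean */
    double s = (a * a + b * b) / 2.0;  /* mean of squares */

    /* v depends on both m and s, so it must be computed last */
    double v = s - m * m;              /* variance */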

13
Multicomputer topology
  • Interconnection network should provide
    connectivity, low latency, high bandwidth
  • Many interconnection networks have been developed
    over the last two decades
  • Hypercube
  • Mesh, torus
  • Ring, etc.

14
Lines and Rings
  • Simplest interconnection network
  • Routing becomes an issue
  • No direct connection between non-adjacent nodes

15
Mesh and Torus
  • Generalization of line/ring to multiple
    dimensions
  • 2D mesh used on the Intel Paragon; 3D torus used
    on the Cray T3D and T3E.

16
Mesh and Torus
  • Torus uses wraparound links to increase
    connectivity

17
Hop Count
  • Networks can be measured by their diameter
  • This is the minimum number of hops that a message
    must traverse between the two nodes that are
    furthest apart
  • Line: Diameter = N - 1
  • 2D (N x M) Mesh: Diameter = (N - 1) + (M - 1)
  • 2D (N x M) Torus: Diameter = ⌊N/2⌋ + ⌊M/2⌋
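As a worked example: a 16-node line has diameter 15, a 4x4 mesh has diameter (4-1) + (4-1) = 6, and a 4x4 torus has diameter ⌊4/2⌋ + ⌊4/2⌋ = 4.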

18
Hypercube Networks
  • A dimension-N hypercube is constructed by
    connecting corresponding corners of two
    dimension-(N-1) hypercubes
  • Interconnect for the Cosmic Cube (Caltech, 1985)
    and its offshoots (Intel iPSC, nCUBE), Thinking
    Machines' CM-2, and others.

19
Fat-tree Interconnect
  • Bandwidth is increased towards the root (but
    aggregate bandwidth decreases)
  • Data network for TMC's CM-5 (a MIMD MPP)
  • 4 leaf nodes; internal nodes have 2 or 4 children
  • To route from leaf A to leaf B, pick a random
    switch C in the least common ancestor fat node of
    A and B, then take the unique tree route from A to
    C and from C to B

[Figure: binary fat-tree in which all internal nodes have two children]
20
An Impractical Interconnection Topology
  • Completely connected
  • Every node has a direct wire connection to every
    other node
  • N x (N-1)/2 Wires

21
Shared Address Space Multiprocessors
  • 4 basic types of interconnection media
  • Bus
  • Crossbar switch
  • Multistage network
  • Interconnection network with distributed shared
    memory (DSM)

22
Bus architectures
  • Bus acts as a party line between processors and
    shared memories
  • Bus provides uniform access to shared memory
    (UMA: Uniform Memory Access); such systems are
    called SMPs (symmetric multiprocessors)
  • When the bus saturates, the performance of the
    system degrades
  • Bus-based systems do not scale to more than about
    32 processors (e.g., Sequent Symmetry, Balance)

23
Crossbar Switch
  • Uses O(mn) switches to connect m processors and n
    memories with distinct paths between each
    processor/memory pair
  • UMA
  • Scalable performance but not cost.

24
Some Multistage Networks
  • Butterfly multistage
  • Shuffle multistage

25
Distributed Shared Memory (DSM)
  • Rather than having all processors on one side of
    network and all memory on the other, DSM has some
    memory at each processor (or group of
    processors).
  • NUMA (Non-uniform memory access)
  • Example: HP/Convex Exemplar (late 1990s)
  • 3 cycles to access data in cache
  • 92 cycles for local memory (shared by 16 procs)
  • 450 cycles for non-local memory

26
Message Passing vs. Shared Memory
  • Message Passing
  • Requires software involvement to move data
  • More cumbersome to program
  • More scalable
  • Shared Memory
  • Subtle programming and performance bugs
  • Multi-tiered
  • Best(?) and worst(?) of both worlds

27
Message Passing Interface, Dependencies
28
Single-Program Multiple Data (SPMD)
  • if (my_rank != 0)
  • . . .
  • else
  • . . .
  • We write a single program
  • Portions of that program are not executed by some
    of the processors
  • We can change the data that each of the
    processors uses for computation

29
Single-Program Multiple Data (SPMD)
  • x = p/num;        // p: array size, num: cluster size
  • start = myid * x;
  • if (myid == num-1)
  •   end = p;
  • else
  •   end = start + x;
  • for (i = start; i < end; i = i + 1)
  •   // summation goes here
30
When can two statements execute in parallel?
  • On one processor:
  •   statement 1;
  •   statement 2;
  • On two processors:
  •   Processor 1: statement 1
  •   Processor 2: statement 2

31
Fundamental Assumption
Willy Zwaenepoel This is ok for completely
unsynchronized parallel execution, but it
excludes the possibility of synchronization among
processors.
  • Processors execute independently; there is no
    control over the order of execution between processors

32
When can 2 statements execute in parallel?
  • Possibility 1: Processor 1 executes statement1
    before Processor 2 executes statement2
  • Possibility 2: Processor 2 executes statement2
    before Processor 1 executes statement1

33
When can 2 statements execute in parallel?
Willy Zwaenepoel It is really that if there is a
requirement that they be executed in the
sequential order, then they need to be executed
in the same order here. Maybe that is the same.
  • Their order of execution must not matter!
  • In other words,
  • statement1; statement2
  • must be equivalent to
  • statement2; statement1

34
Example 1
  • a = 1;
  • b = 2;
  • Statements can be executed in parallel.

35
Example 2
  • a = 1;
  • b = a;
  • Statements cannot be executed in parallel
  • Program modifications may make it possible.
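One such modification, sketched here (not from the slides), is copy propagation: because the value written to a is a known constant, the second statement can use the constant directly, leaving two independent statements:

    a = 1;
    b = 1;   /* reads the constant instead of a; no dependence remains between the statements */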

36
Example 3
  • a = f(x);
  • b = a;
  • May not be wise to change the program (sequential
    execution would take longer).

37
Example 4
  • b = a;
  • a = 1;
  • Statements cannot be executed in parallel.

38
Example 5
  • a = 1;
  • a = 2;
  • Statements cannot be executed in parallel.

39
True dependence
  • Statements S1, S2
  • S2 has a true dependence on S1
  • iff
  • S2 reads a value written by S1

40
Anti-dependence
  • Statements S1, S2.
  • S2 has an anti-dependence on S1
  • iff
  • S2 writes a value read by S1.

41
Output Dependence
  • Statements S1, S2.
  • S2 has an output dependence on S1
  • iff
  • S2 writes a variable written by S1.

42
When can 2 statements execute in parallel?
Willy Zwaenepoel At this point generalize to
arbitrary pieces of code, like small procedures,
and explain that dependences hold there too, or
rather use the read-set/write-set formalism from
the start?
  • S1 and S2 can execute in parallel
  • iff
  • there are no dependences between S1 and S2
  • true dependences
  • anti-dependences
  • output dependences
  • Some dependences can be removed.
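A sketch of one common removal technique, renaming (an illustration, not taken from the slides): anti- and output dependences are conflicts over a name rather than a flow of data, so directing the later write to a fresh variable eliminates them:

    /* anti-dependence (as in Example 4): the second statement writes a, which the first reads */
    b = a;
    a = 1;

    /* renaming the written variable removes the dependence; later uses read a_new instead of a */
    b = a;
    a_new = 1;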

43
Example 6
Willy Zwaenepoel Same comment as next slide.
  • Most parallelism occurs in loops.
  • for (i = 0; i < 100; i++)
  •   a[i] = i;
  • No dependences.
  • Iterations can be executed in parallel.

44
Example 7
Willy Zwaenepoel There was an issue of the
parallel processors having to have different
variables for the loop index.
  • for (i = 0; i < 100; i++) {
  •   a[i] = i;
  •   b[i] = 2*i;
  • }
  • Iterations and statements can be executed in
    parallel.

45
Example 8
Willy Zwaenepoel There was some issue here that
there is a dependence caused by i -- addressed by
rewriting the second loop to use j.
  • for (i = 0; i < 100; i++) a[i] = i;
  • for (i = 0; i < 100; i++) b[i] = 2*i;
  • Iterations and loops can be executed in parallel.

46
Example 9
Willy Zwaenepoel Maybe we should introduce the
notion of a dependence on itself earlier (before
a loop is introduced). The notion of a
loop-independent dependence was never extended to
different statements within the body. It should
be.
  • for (i = 0; i < 100; i++)
  •   a[i] = a[i] + 100;
  • There is a dependence on itself!
  • Loop is still parallelizable.

47
Example 10
Willy Zwaenepoel Same comment here on different
statements?
  • for (i = 0; i < 100; i++)
  •   a[i] = f(a[i-1]);
  • Dependence between a[i] and a[i-1].
  • Loop iterations are not parallelizable.

48
Loop-carried dependence
  • A loop carried dependence is a dependence that is
    present only if the statements are part of the
    execution of a loop.
  • Otherwise, we call it a loop-independent
    dependence.
  • Loop-carried dependences prevent loop iteration
    parallelization.

49
Example 11
  • for (i = 0; i < 100; i++)
  •   for (j = 0; j < 100; j++)
  •     a[i][j] = f(a[i][j-1]);
  • Loop-independent dependence on i.
  • Loop-carried dependence on j.
  • Outer loop can be parallelized, inner loop cannot.

50
Example 12
Willy Zwaenepoel Say a little more about loop
interchange, make loops with loop-carried
dependences appear on the inside of the nest, the
others on the outside.
  • for (j = 0; j < 100; j++)
  •   for (i = 0; i < 100; i++)
  •     a[i][j] = f(a[i][j-1]);
  • Inner loop can be parallelized, outer loop
    cannot.
  • Less desirable situation.
  • Loop interchange is sometimes possible.
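A sketch of the interchanged nest for this example: after interchange it is the same nest as Example 11, with the dependence-free i loop outermost (parallelizable) and the dependence-carrying j loop innermost (sequential):

    for (i = 0; i < 100; i++)        /* carries no dependence: iterations can run in parallel */
        for (j = 0; j < 100; j++)    /* carries the dependence: runs sequentially within each i */
            a[i][j] = f(a[i][j-1]);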

51
Level of loop-carried dependence
  • Is the nesting depth of the loop that carries the
    dependence.
  • Indicates which loops can be parallelized.

52
Be careful Example 13
  • printf(a);
  • printf(b);
  • Statements have a hidden output dependence due to
    the output stream.

53
Be careful Example 14
Willy Zwaenepoel Also depends on what f and g
can do to x.
  • a = f(x);
  • b = g(x);
  • Statements could have a hidden dependence if f
    and g update the same variable.

54
Be careful Example 15
Willy Zwaenepoel Is this the business about the
distance of a dependence in Ken's book? It is
also possible to transform this to a nested loop.
  • for (i = 0; i < 100; i++)
  •   a[i+10] = f(a[i]);
  • Dependence between a[10], a[20], ...
  • Dependence between a[11], a[21], ...
  • Some parallel execution is possible.
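A sketch of the nested-loop transformation hinted at in the note above (an assumption, not shown on the slide): the elements form 10 independent dependence chains, one per residue of i modulo 10, so a loop over the chains can run in parallel while each chain is traversed sequentially:

    for (k = 0; k < 10; k++)            /* the 10 chains are mutually independent: parallelizable */
        for (i = k; i < 100; i += 10)   /* within a chain, each step depends on the previous one */
            a[i+10] = f(a[i]);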

55
Be careful Example 16
Willy Zwaenepoel This requires synchronization
between processors for parallel execution. Goes
against the initial assumption.
  • for (i = 1; i < 100; i++) {
  •   a[i] = ...;
  •   ... = a[i-1];
  • }
  • Dependence between a[i] and a[i-1]
  • Complete parallel execution impossible
  • Pipelined parallel execution possible

56
Be careful Example 17
  • for (i = 0; i < 100; i++)
  •   a[i] = f(a[indexa[i]]);
  • Cannot tell for sure.
  • Parallelization depends on user knowledge of
    values in indexa.
  • User can tell, compiler cannot.

57
An aside
  • Parallelizing compilers analyze program
    dependences to decide parallelization.
  • In parallelization by hand, user does the same
    analysis.
  • Compiler: more convenient and more correct
  • User: more powerful, can analyze more patterns

58
To remember
  • Statement order must not matter.
  • Statements must not have dependences.
  • Some dependences can be removed.
  • Some dependences may not be obvious.

59
Parallelism: First Program
  • Ability to execute different parts of a program
    concurrently on different machines
  • Goal: shorten execution time
  • There are p processes executing a program
  • They have ranks 0, 1, 2, ..., p-1
  • Each process prints out a Hello message

60
Hello World
  • #include "mpi.h"
  • #include <stdio.h>
  • #include <math.h>
  • int main(int argc, char *argv[])
  • {
  •   int myid, num, d;
  •   MPI_Init(&argc, &argv);
  •   MPI_Comm_size(MPI_COMM_WORLD, &num);
  •   MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  •   if (myid == 0) {
  •     printf("Hello.\n");
  •     printf("How many processors would you like?\n");
  •     scanf("%d", &d);
  •     if (d > num)
  •       printf("number too big\n");
  •   }
  •   MPI_Bcast(&d, 1, MPI_INT, 0, MPI_COMM_WORLD);
  •   if (myid < d)
  •     printf("Hello from Process %d of %d\n", myid, num);
  •   MPI_Finalize();
  •   return 0;
  • }

61
MPI Function Calls
  • #include "mpi.h"
  • MPI_Init(&argc, &argv);
  • MPI_Comm_size(MPI_COMM_WORLD, &num);
  • MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  • MPI_Bcast(&d, 1, MPI_INT, 0, MPI_COMM_WORLD);  // From example on page 42
  • MPI_Send( . . . )  // What goes here? See Page 47
  • MPI_Recv( . . . )  // What goes here? See Page 48
  • MPI_Finalize();
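As a reference for the two calls left open above, a minimal point-to-point exchange using the standard MPI signatures (the payload variable and the tag value 0 are illustrative):

    int value = 42;                /* illustrative payload */
    MPI_Status status;

    if (myid == 0)
        /* arguments: buffer, count, datatype, destination rank, tag, communicator */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (myid == 1)
        /* arguments: buffer, count, datatype, source rank, tag, communicator, status */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);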

62
MPI Data Types
  • MPI_CHAR
  • MPI_INT
  • MPI_LONG
  • MPI_FLOAT
  • MPI_LONG_DOUBLE