1
Homework from last lecture
  • Given an array A of n elements, evenly
    distributed among p processors (you may assume p
    divides n), and an integer k > 0
  • an index i is a k-maximum iff A[i] > A[j] for all
    j ≠ i with i-k ≤ j ≤ i+k
  • Goal: report all k-maxima in the array A


[Figure: the array A is split into p contiguous blocks of n/p elements; the block starting at index 0 is held by p0, the block at n/p by p1, ..., the block at n(p-1)/p by p(p-1).]
2
Homework from last lecture: solution
  • Case 1: k ≤ n/p
  • which processors does processor i have to
    communicate with?
  • what data has to be communicated?
  • how many communication rounds?
  • Answers
  • processors i-1 and i+1 (unless i=0 or i=p-1)
  • k data items on each side, or just the k-maximum
    candidate for each side
  • 1 or 2
  • Question: Which solution is better?
  • depends on the platform and k
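Not on the original slides: a minimal C sketch of the local check for
Case 1, assuming each processor already holds its block of n/p elements
plus up to k "halo" elements received from each neighbour (the names
block, halo_left, halo_right and offset are made up for illustration;
the halo exchange itself is omitted).

  #include <stdio.h>

  /* block: the n/p local elements; halo_left/halo_right: up to k elements
   * from the neighbouring processors (fewer at the ends of the array).
   * Prints the local k-maxima as global indices, assuming the block starts
   * at global index offset. */
  void report_k_maxima(const int *block, int len,
                       const int *halo_left, int left_len,
                       const int *halo_right, int right_len,
                       int k, int offset)
  {
      for (int i = 0; i < len; i++) {
          int is_max = 1;
          for (int d = -k; d <= k && is_max; d++) {
              if (d == 0) continue;
              int j = i + d, other;
              if (j < 0) {                        /* element held by the left neighbour  */
                  if (-j > left_len) continue;    /* outside the array: no competitor    */
                  other = halo_left[left_len + j];
              } else if (j >= len) {              /* element held by the right neighbour */
                  if (j - len >= right_len) continue;
                  other = halo_right[j - len];
              } else {
                  other = block[j];
              }
              if (other >= block[i]) is_max = 0;  /* a k-maximum requires a strict > */
          }
          if (is_max) printf("k-maximum at global index %d\n", offset + i);
      }
  }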

3
Homework from last lecture: solution
  • Case 2: k > n/p
  • processor i has to communicate with about
    k/(n/p) processors on each side
  • since k is rather high, you want to send just the
    candidates
  • several techniques possible, will cover that
    later in the course
  • pipelining
  • broadcasting/multicasting trees

4
Lecture Outline (2 lectures)
  • A bit of Historical Perspective
  • Parallelisation Approaches
  • Parallel Programming Platforms
  • logical organization - the programmer's view of
    the platform
  • physical organization - the hardware view
  • communication costs
  • process-to-processor mappings

5
A bit of historical perspective
Parallel computing has been here since the early
days of computing. Traditionally: custom HW,
custom SW, high cost. The doom of Moore's law:
custom HW has a hard time catching up with the
commodity processors. Current trend: use commodity
HW components, standardize SW. The market size of
High Performance Computing is comparable to the
market size for disposable diapers. (Explicitly)
parallel computing has never been mainstream.
6
A bit of historical perspective (cont.)
  • Parallelism sneaking into commodity computers
  • Instruction Level Parallelism - wide issue,
    pipelining, OOO
  • data level parallelism - SSE, 3DNow!, AltiVec
  • thread level parallelism - Hyper-Threading in
    Pentium 4, multiple cores coming in the not too
    distant future
  • Transistor budgets allow for multiple processor
    cores on a chip.
  • Most applications would benefit from being
    parallelised and executed on a parallel computer.
  • even PC applications, especially the most
    demanding ones
  • games, multimedia

7
A bit of historical perspective III
  • Chicken & Egg Problem
  • Why build parallel computers when the
    applications are sequential?
  • Why parallelize applications when there are no
    parallel commodity computers?
  • Answers
  • What else to do with all those transistors?
  • They already are a bit parallel (wide issue,
    multimedia instructions, hyperthreading), and
    this bit is growing.
  • Yet another reason to study parallel computing
  • Principles of parallel algorithm design (locality
    of data reference) lend themselves to
    cache-friendly sequential algorithms.
  • The same applies for out-of-core computations
    (data servers).

8
Parallelisation Approaches
  • Parallelizing compiler
  • advantage: use your current code
  • disadvantage: very limited abilities
  • Parallel domain-specific libraries
  • e.g. linear algebra, numerical libraries,
    quantum chemistry
  • usually a good choice, use when possible
  • Communication libraries
  • message passing libraries: MPI, PVM
  • shared memory libraries: declare and access
    shared memory variables (on MPP machines done by
    emulation)
  • advantage: use a standard compiler
  • disadvantage: low level programming ("parallel
    assembler")

9
Parallelisation Approaches (cont.)
  • New parallel languages
  • use a language with built-in explicit control
    for parallelism
  • no language is the best in every domain
  • needs new compiler
  • fights against inertia
  • Parallel features in existing languages
  • adding parallel features to an existing language
  • e.g. for expressing loop parallelism (pardo) and
    data placement
  • example: High Performance Fortran
  • Additional possibilities in shared-memory systems
  • use threads
  • preprocessor/compiler directives (OpenMP); see the
    sketch below
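As a concrete illustration of the last bullet (not from the slides), a
minimal OpenMP-style sketch in C: the pragma is the standard OpenMP
work-sharing directive; compiled without OpenMP support it is simply
ignored and the loop runs sequentially.

  /* The directive asks the compiler to split the loop iterations among
   * threads; the loop body itself is ordinary sequential C. */
  void vector_add(const double *a, const double *b, double *c, int n)
  {
      #pragma omp parallel for      /* iterations distributed over threads */
      for (int i = 0; i < n; i++)
          c[i] = a[i] + b[i];
  }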

10
Parallelisation Approaches: Our Focus
  • Communication libraries: MPI, PVM
  • industry standard, available for every platform
  • very general, low level approach
  • perfect match for clusters
  • most likely to be useful for you
  • Shared memory programming
  • also very important
  • likely to be useful in the next iterations of PCs

11
Parallel Programming Platforms
  • Implicit Parallelism in Modern Microprocessors
  • pipelining, superscalar execution, VLIW
  • Limitations of Memory System Performance
  • problem: high latency of memory vs. speed of
    computation
  • solutions: caches, latency hiding using
    multithreading and prefetching
  • Explicit Parallel Programming Platforms
  • logical organization - the programmer's view of
    the platform
  • physical organization - the hardware view
  • communication costs
  • process-to-processor mappings

12
Logical View of a PP Platform
  • Control Structure - how to express parallel
    tasks
  • Single Instruction stream, Multiple Data stream
  • Multiple Instruction stream, Multiple Data
    stream
  • Single Program Multiple Data
  • Communication Model - how to specify interactions
    between these tasks
  • Shared Address Space Platforms
    (multiprocessors)
  • Uniform Memory Access multiprocessors
  • Non-Uniform Memory Access multiprocessors
  • cache coherence issues
  • Message Passing Platforms

13
Control Structure SIMD
Single Instruction stream, Multiple Data stream
Example:
  for (i=0; i<1000; i++) pardo
      c[i] = a[i] + b[i];
Processor k executes c[k] = a[k] + b[k].
14
SIMD (cont.)
  • early parallel machines
  • Illiac IV, MPP, CM-2, MasPar MP-1
  • modern settings
  • multimedia extensions - MMX, SSE
  • DSP chips
  • positives
  • less hardware needed
  • easy to understand and reason about
  • negatives
  • proprietary hardware needed: fast obsolescence,
    high development costs/time
  • rigid structure: suitable only for highly
    structured problems
  • inherent inefficiency due to selective turn-off

15
SIMD inefficiency example
Example:
  for (i=0; i<10; i++)
      if (a[i] < b[i]) c[i] = a[i] + b[i];
      else c[i] = 0;

      p0  p1  p2  p3  p4  p5  p6  p7  p8  p9
  a:   4   1   7   2   9   3   3   0   6   7
  b:   5   3   4   1   4   5   3   1   4   8
  c:   -   -   -   -   -   -   -   -   -   -
16
SIMD inefficiency example
Example:
  for (i=0; i<10; i++) pardo
      if (a[i] < b[i]) c[i] = a[i] + b[i];
      else c[i] = 0;

      p0  p1  p2  p3  p4  p5  p6  p7  p8  p9
  a:   4   1   7   2   9   3   3   0   6   7
  b:   5   3   4   1   4   5   3   1   4   8
  c:   -   -   -   -   -   -   -   -   -   -
17
SIMD inefficiency example
Example:
  for (i=0; i<10; i++) pardo
      if (a[i] < b[i]) c[i] = a[i] + b[i];
      else c[i] = 0;

      p0  p1  p2  p3  p4  p5  p6  p7  p8  p9
  a:   4   1   7   2   9   3   3   0   6   7
  b:   5   3   4   1   4   5   3   1   4   8
  c:   9   4   -   -   -   8   -   1   -  15
(after the if branch: processors p2, p3, p4, p6 and p8 are switched off and idle)
18
SIMD inefficiency example
Example:
  for (i=0; i<10; i++) pardo
      if (a[i] < b[i]) c[i] = a[i] + b[i];
      else c[i] = 0;

      p0  p1  p2  p3  p4  p5  p6  p7  p8  p9
  a:   4   1   7   2   9   3   3   0   6   7
  b:   5   3   4   1   4   5   3   1   4   8
  c:   9   4   0   0   0   8   0   1   0  15
(after the else branch: now p0, p1, p5, p7 and p9 are the idle ones)
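To make the selective turn-off concrete, here is a scalar C emulation of
the example above (an illustration, not real SIMD code and not from the
slides): every processing element walks through both branches in
lock-step, and a mask decides whose result is kept, so part of the work
is always wasted.

  /* i plays the role of the PE index; each PE computes both branches and
   * the mask selects the result, mimicking SIMD branch handling. */
  void simd_conditional(const int *a, const int *b, int *c, int n)
  {
      for (int i = 0; i < n; i++) {
          int mask     = (a[i] < b[i]);        /* 1 if this PE takes the if branch */
          int then_val = a[i] + b[i];          /* all PEs compute the if branch... */
          int else_val = 0;                    /* ...and the else branch           */
          c[i] = mask ? then_val : else_val;   /* masked write keeps the right one */
      }
  }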
19
Control Structure MIMD
Multiple Instruction stream, Multiple Data stream
  • Single Program, Multiple Data
  • popular way to program MIMD computers
  • simplifies code maintenance/program distribution
  • equivalent to MIMD (big switch at the beginning)

20
MIMD (cont.)
  • positives
  • can be easily/quickly/cheaply built from existing
    microprocessors
  • very flexible (suitable for irregular problems)
  • negatives
  • requires more resources (duplicated program, OS,
    ...)
  • more difficult to reason about/design correct
    programs

21
Logical View of PP Platform
  • Control Structure - how to express parallel
    tasks
  • Single Instruction stream, Multiple Data stream
  • Multiple Instruction stream, Multiple Data
    stream
  • Single Program Multiple Data
  • Communication Model - how to specify interactions
    between these tasks
  • Shared Address Space Platforms
    (multiprocessors)
  • Uniform Memory Access multiprocessors
  • Non-Uniform Memory Access multiprocessors
  • cache coherence issues
  • Message Passing Platforms

22
Shared Address Space Platforms
shared memory, UMA
distributed memory, NUMA
23
Shared Address Space Platforms
  • all memory addressable by all processors
  • needs address translation mechanism
  • may or may not provide cache coherence
  • access time
  • uniform: UMA (shared memory)
  • non-uniform: NUMA (distributed memory)
  • principal communication mechanisms
  • put() and get()

24
Message Passing Platforms
  • p processing nodes, each with its own exclusive
    address space
  • each node can be a single processor or a shared
    address space multiprocessor
  • inter-node communication possible only through
    message passing
  • principal functions
  • send(), receive()
  • each processor has a unique ID
  • mechanisms provided for learning your ID, the
    number of nodes, etc. (see the MPI sketch below)
  • several standard APIs available: MPI, PVM
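A minimal MPI sketch in C (illustrative, not from the slides) showing the
standard calls for learning your ID and the number of nodes, plus a basic
send/receive between the nodes and node 0.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my unique ID          */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of nodes */

      if (rank != 0) {                        /* everyone reports to node 0 */
          MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      } else {
          for (int src = 1; src < size; src++) {
              int who;
              MPI_Recv(&who, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              printf("hello from node %d of %d\n", who, size);
          }
      }
      MPI_Finalize();
      return 0;
  }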

25
Physical Organization of Parallel Platforms
  • Ideal Parallel Computer PRAM
  • Interconnection Networks for Parallel Computers
  • static
  • dynamic
  • cost of communication
  • evaluating interconnection networks

26
Parallel Random Access Machine
  • p processors, each has a local memory and is
    connected to unbounded shared memory
  • processors work in lock-step, the access time to
    shared memory costs 1 step
  • 4 main classes, depending on how simultaneous
    accesses are handled
  • Exclusive read, exclusive write - EREW PRAM
  • Concurrent read, exclusive write - CREW PRAM
  • Exclusive read, concurrent write - ERCW PRAM
    (for completeness)
  • Concurrent read, concurrent write - CRCW PRAM
  • Resolving concurrent writes
  • Common: all writes must write the same value
  • Arbitrary: an arbitrary write succeeds
  • Priority: the write with the highest priority
    succeeds
  • Sum: the sum of the written values is stored

27
Parallel Random Access Machine (cont.)
  • abstracts away communication, allows one to focus
    on the parallel tasks
  • an algorithm for PRAM might lead you to a good
    algorithm for a real machine
  • if you prove that something cannot be
    efficiently solved on PRAM, it cannot be
    efficiently done on any practical machine (based
    on current technology)
  • it is not feasible to manufacture PRAM
  • the cost of connecting p processors to m memory
    cells such that their accesses do not interfere
    is Θ(pm), which is huge for any practical values
    of m

28
PRAM Algorithm Example
Problem: use an EREW PRAM to sum the numbers stored at
m[0], m[1], ..., m[n-1], where n = 2^k for some k. The
result should be stored at m[0].
Example for k=3:
Algorithm for processor p_i:
  for (j=0; j<k; j++)
      if (i % 2^(j+1) == 0) {
          a = read(m[i]); b = read(m[i+2^j]);
          write(a+b, m[i]);
      }
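A sequential C simulation of the EREW sum above (a sketch for checking
the logic, assuming n = 2^k; this is not PRAM code, the lock-step rounds
are simply simulated one after another).

  #include <stdio.h>

  /* In round j, every "processor" i with i % 2^(j+1) == 0 adds m[i + 2^j]
   * into m[i]; after k rounds m[0] holds the total. */
  void pram_sum(int *m, int k)
  {
      int n = 1 << k;
      for (int j = 0; j < k; j++)                 /* the k lock-step rounds    */
          for (int i = 0; i < n; i++)             /* all processors of a round */
              if (i % (1 << (j + 1)) == 0)
                  m[i] = m[i] + m[i + (1 << j)];  /* a = m[i]; b = m[i+2^j]    */
  }

  int main(void)
  {
      int m[8] = {3, 1, 4, 1, 5, 9, 2, 6};        /* k = 3, n = 8 */
      pram_sum(m, 3);
      printf("sum = %d\n", m[0]);                 /* prints 31 */
      return 0;
  }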
29
PRAM Example Notes
  • the program is written in SIMD (and SPMD) format
  • the inefficiency caused by idling processors is
    clearly visible
  • can be easily extended to n not a power of 2
  • takes log2(n) rounds to execute
  • using a similar approach (+ some other ideas) it
    can be shown that
  • Any CRCW PRAM can be simulated by EREW PRAM with
    a slowdown factor of O(log n)

30
PRAM Example2
  • Problem: use a Sum-CRCW PRAM with n² processors
    to sort n numbers stored at m[0], m[1], ..., m[n-1].
  • Question: How many steps would it take?
  • O(n log n)
  • O(n)
  • O(log n)
  • O(1)
  • less than (n log n)/n²

31
PRAM Example2
Note: We will mark processors p_i,j for 0 ≤ i,j < n.
Algorithm for processor p_i,j:
  a = read(m[i]); b = read(m[j]);
  if ((a > b) || ((a == b) && (i > j)))
      write(1, m[i]);
  if (j == 0) {
      b = read(m[i]);
      write(a, m[b]);
  }
  m:   1   7   3   9   3   0       (m0 .. m5)

  1s written by the processors p_i,j (the Sum rule adds up
  column i, giving the rank of m[i]):

         i=0  i=1  i=2  i=3  i=4  i=5
   j=0    0    1    1    1    1    0
   j=1    0    0    0    1    0    0
   j=2    0    1    0    1    0    0
   j=3    0    0    0    0    0    0
   j=4    0    1    0    1    1    0
   j=5    1    1    1    1    1    0

  m:   1   4   2   5   3   0       (the ranks after the concurrent writes)
  m:   0   1   3   3   7   9       (after each p_i,0 writes its value to position rank)
O(1) sorting algorithm!
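A sequential C sketch simulating the Sum-CRCW sort above (the demo input
and the fixed array bound are illustrative): the n^2 concurrent writes of
1 that the Sum rule would add up are counted explicitly into rank[i], and
element i is then placed at position rank[i].

  #include <stdio.h>
  #include <string.h>

  void crcw_sum_sort(int *m, int n)
  {
      int rank[64], out[64];                       /* assumes n <= 64 for the demo */
      memset(rank, 0, sizeof rank);
      for (int i = 0; i < n; i++)                  /* processor p_i,j writes 1 ... */
          for (int j = 0; j < n; j++)
              if (m[i] > m[j] || (m[i] == m[j] && i > j))
                  rank[i]++;                       /* ... and Sum-CRCW adds them up */
      for (int i = 0; i < n; i++)                  /* p_i,0: write a to m[rank[i]]  */
          out[rank[i]] = m[i];
      memcpy(m, out, n * sizeof m[0]);
  }

  int main(void)
  {
      int m[6] = {1, 7, 3, 9, 3, 0};               /* the example from the slide */
      crcw_sum_sort(m, 6);
      for (int i = 0; i < 6; i++) printf("%d ", m[i]);   /* 0 1 3 3 7 9 */
      printf("\n");
      return 0;
  }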
32
Think at home
Problem: Adapt the above algorithm to work on an
EREW PRAM. What is the time complexity?
33
Physical Organization of Parallel Platforms
  • Ideal Parallel Computer PRAM
  • Interconnection Networks for Parallel Computers
  • dynamic
  • static
  • graph embeddings
  • cost of communication

34
Interconnection Networks for Parallel Computers
  • static networks
  • point-to-point communication links among
    processing nodes
  • also called direct networks
  • dynamic networks
  • communication links are connected dynamically by
    switches to create paths between processing nodes
    and memory banks/other processing nodes
  • also called indirect networks

35
Interconnection Networks
[Figure: in a static/direct network the processing nodes are connected point-to-point through their network interfaces/switches; in a dynamic/indirect network the processing nodes are connected through a network of separate switching elements.]
36
Dynamic Interconnection Networks
37
Bus-Based Interconnection Networks
  • processors and the memory modules are connected
    to a shared bus
  • Advantages
  • simple, low cost
  • Disadvantages
  • only one processor can access memory at a given
    time
  • bandwidth does not scale with the number of
    processors/memory modules
  • Example
  • quad Pentium Xeon

38
Crossbar
  • Advantages
  • non-blocking network
  • Disadvantages
  • cost: O(pm)
  • Example
  • high-end UMA machines

39
Multistage networks (e.g. the Ω-network)
  • intermediate case between bus and crossbar
  • blocking network (but not always)
  • often used in NUMA computers
  • Ω-network
  • each switch is a 2x2 crossbar
  • log(p) stages
  • cost: p log(p)
  • Simple routing algorithm
  • at each stage, look at the corresponding bit
    (starting with the msb) of the source and
    destination address
  • if the bits are the same, the message passes
    through, otherwise it crosses over

40
Ω-network
41
Ω-network
[Figure: an 8-node Ω-network; inputs 0-7 on the left are connected through three stages of 2x2 switches to outputs 0-7 on the right.]
42
Dynamic network exercises
Question 1: Which of the following pairs of
(processor, memory block) requests will
collide/block?
Question 2 (difficult): For a given
processor/memory request (a,b), how many requests
(x,y), with x ≠ a and y ≠ b, will block with
(a,b) in an 8-node Ω-network? How does this
number depend on the choice of (a,b)? (A routing
sketch you can experiment with follows below.)
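A C sketch you can experiment with for these questions (not from the
slides). It assumes the usual Ω-network wiring, where each stage first
applies a perfect shuffle to the wire label and a cross-over flips its
last bit; two requests block iff they occupy the same wire after the same
stage.

  #include <stdio.h>

  static int shuffle(int wire, int k)              /* rotate the k-bit label left */
  {
      return ((wire << 1) | (wire >> (k - 1))) & ((1 << k) - 1);
  }

  /* wires[s] = wire occupied after stage s when routing from src to dst,
   * using the rule from the slide: same source/destination bit -> pass
   * through, different -> cross over. */
  void route(int src, int dst, int k, int *wires)
  {
      int wire = src;
      for (int s = 0; s < k; s++) {
          wire = shuffle(wire, k);
          int bit = k - 1 - s;                     /* msb first */
          if (((src >> bit) & 1) != ((dst >> bit) & 1))
              wire ^= 1;                           /* cross-over */
          wires[s] = wire;
      }
  }

  int blocks(int s1, int d1, int s2, int d2, int k)
  {
      int w1[32], w2[32];
      route(s1, d1, k, w1);
      route(s2, d2, k, w2);
      for (int s = 0; s < k; s++)
          if (w1[s] == w2[s]) return 1;            /* same link at the same stage */
      return 0;
  }

  int main(void)                                   /* 8-node network: k = 3 */
  {
      printf("%d\n", blocks(0, 5, 6, 3, 3));       /* 0: these two do not block */
      return 0;
  }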
43
Physical Organization of Parallel Platforms
  • Ideal Parallel Computer PRAM
  • Interconnection Networks for Parallel Computers
  • dynamic
  • static
  • graph embeddings
  • cost of communication

44
Static Interconnection Networks
  • Complete network (clique)
  • Star network
  • Linear array
  • Ring
  • Tree
  • 2D & 3D mesh/torus
  • Hypercube
  • Butterfly
  • Fat tree

45
Evaluating Interconnection Networks
  • diameter
  • the longest distance (number of hops) between
    any two nodes
  • gives lower bound on time for algorithms
    communicating only with direct neighbours
  • connectivity
  • multiplicity of paths between any two nodes
  • high connectivity lowers contention for
    communication resources
  • bisection width (bisection bandwidth)
  • the minimal number of links (resp. their
    aggregate bandwidth) that must be removed to
    partition the network into two equal halves
  • provides lower bound on time when the data must
    be shuffled from one half of the network to
    another half
  • VLSI complexity: the layout area grows at least
    as the square of the bisection width in 2D, the
    volume as its 3/2 power in 3D

46
  • For each of the following networks
  • determine the diameter and bisection width
  • give a programming example which would use such
    a communication pattern.

47
Clique, Star, Linear Array, Ring, Tree
  • important logical topologies, as many common
    communication patterns correspond to these
    topologies
  • clique: all-to-all broadcast
  • star: master-slave, broadcast
  • line, ring: pipelined execution
  • tree: hierarchical decomposition
  • none of them is very practical
  • clique: cost
  • star, line, ring, tree: low bisection width
  • line, ring: high diameter
  • actual execution is performed on the embedding
    into the physical network

48
2D & 3D Array / Torus
  • good match for discrete simulation and matrix
    operations
  • easy to manufacture and extend
  • Examples: Cray T3D (3D torus), Intel Paragon (2D
    mesh)

49
Hypercube
  • good graph-theoretic properties (low diameter,
    high bisection width)
  • nice recursive structure
  • good for simulating other topologies (they can
    be efficiently embedded into hypercube)
  • degree log (n), diameter log (n), bisection
    width n/2
  • costly/difficult to manufacture for high n, not
    so popular nowadays

50
Butterfly
  • Hypercube-derived network of log(n) diameter
    and constant degree
  • perfect match for the Fast Fourier Transform
  • there are other hypercube-related networks (Cube-
    Connected Cycles, Shuffle-Exchange, de Bruijn and
    Beneš networks); see Leighton's book for details

51
Fat Tree
  • Observation: trees are nice - low diameter,
    simple structure
  • Problem: low bandwidth
  • Solution: exponentially increase the multiplicity
    of links as the distance from the bottom
    increases
  • keeps the nice properties of the binary tree (low
    diameter)
  • solves the low bisection and the bottleneck at
    the top levels
  • Example: CM-5

52
Evaluating Interconnection Networks
Network               | Diameter        | Bisection Width | Arc Connectivity | Cost (# of links)
clique                | 1               | p²/4            | p-1              | p(p-1)/2
star                  | 2               | 1               | 1                | p-1
complete binary tree  | 2 log((p+1)/2)  | 1               | 1                | p-1
linear array          | p-1             | 1               | 1                | p-1
2D mesh               | 2(√p - 1)       | √p              | 2                | 2(p - √p)
2D torus              | 2⌊√p/2⌋         | 2√p             | 4                | 2p
hypercube             | log p           | p/2             | log p            | (p log p)/2
53
Possible questions
  • Assume point-to-point communication with cost 1.
  • Is it possible to sort on a 2D mesh in time
    O(log n)? What about ... ?
  • Is it possible to sort the leaves of a complete
    binary tree in time O(log n)? What about ... ?
  • Can you find the maximum in a 2D mesh in time
    O(log n)? What about a complete binary tree?

54
Last Lecture
  • PRAM
  • what is PRAM? Why? Strong/weak points.
  • PRAM types
  • binary tree algorithm
  • O(1) sorting algorithm
  • Dynamic Interconnection Networks
  • bus, crossbar, Ω-network
  • blocking/non-blocking access
  • advantages/disadvantages
  • Static Interconnection Networks
  • tree, mesh, hypercube, ...
  • diameter, bisection, connectivity, cost

55
Possible questions
  • Assume point-to-point communication with cost 1.
  • Is it possible to sort on a 2D mesh in time
    O(log n)? What about ... ?
  • Is it possible to sort the leaves of a complete
    binary tree in time O(log n)? What about ... ?
  • Can you find the maximum in a 2D mesh in time
    O(log n)? What about a complete binary tree?

56
Homework
Algorithm for processor p_i,j:
  a = read(m[i]); b = read(m[j]);
  if ((a > b) || ((a == b) && (i > j)))
      write(1, m[i]);
  if (j == 0) {
      b = read(m[i]);
      write(a, m[b]);
  }
57
Homework
Algorithm for processor p_i,j:
  a = read(m[i]); b = read(m[j]);
  if ((a > b) || ((a == b) && (i > j)))
      write(1, m[i]);
  if (j == 0) {
      b = read(m[i]);
      write(a, m[b]);
  }

Remember, n = 2^k:
  if (j == 0) {
      a = read(m[i]);
      write(a, b[0]);
  }
  for (l = k-1; l >= 0; l--)
      if (j % 2^(l+1) == 0) {
          a = read(b[j]);
          write(a, b[j + 2^l]);
      }
  a = read(b[j]);

Do you see any problems? All rows use the same array b!
58
Homework
Algorithm for processor p_i,j:
  a = read(m[i]); b = read(m[j]);
  if ((a > b) || ((a == b) && (i > j)))
      write(1, m[i]);
  if (j == 0) {
      b = read(m[i]);
      write(a, m[b]);
  }

Remember, n = 2^k:
  if (j == 0) {
      a = read(m[i]);
      write(a, b[i,0]);
  }
  for (l = k-1; l >= 0; l--)
      if (j % 2^(l+1) == 0) {
          a = read(b[i,j]);
          write(a, b[i,j + 2^l]);
      }
  a = read(b[i,j]);
61
Physical Organization of Parallel Platforms
  • Ideal Parallel Computer PRAM
  • Interconnection Networks for Parallel Computers
  • dynamic
  • static
  • graph embeddings
  • cost of communication

62
Motivating Graph Embeddings
You want to use this algorithm [figure omitted],
but your computer is connected like this [figure omitted].
How do you map processes to processors? Why?
63
Process-Processor Mappings and Graph Embeddings
Problem 1: You have an algorithm which uses p
logical processes. The communication pattern
between these processes is captured by a
communication graph G. How do you map these
processes to your real machine, which has p
processors interconnected into a network G', so
that the overall communication cost is
minimized/kept reasonably small?
Problem 2: Assume you have an algorithm designed
for a specific topology G. How do you get it to
work on a network of a different topology G'?
Solution: Graph embedding - simulate G on G'.
64
Example: two mappings
[Figure: processes a-p and their interactions (a 4x4 grid) are mapped onto the underlying architecture, a 4x4 mesh of processors 1-16. Left: an intuitive mapping of processes to nodes that keeps neighbouring processes on neighbouring processors. Right: a random mapping that scatters them across the machine.]
65
Embedding
  • Formally: Given networks G(V,E) and G'(V',E'),
    find a mapping f which maps each vertex from V
    into a vertex of V' and each edge from E into a
    path in G'. Several vertices from V may map into
    one vertex from V' (especially if G has more
    vertices than G'). Such a mapping is called an
    embedding of G into G'.
  • Goals
  • balance the number of vertices mapped to each
    node of G'
  • to balance the workload of the simulating
    processors
  • each edge should map to a short path, optimally a
    single link
  • so each communication step can be simulated
    efficiently (small dilation)
  • there should be little overlap between the
    resulting simulating paths
  • to prevent congestion on links

66
Embedding - examples
  • Embedding a ring into a line
  • dilation 2, congestion 2 (see the sketch below)
  • a similar idea can be used to embed a torus into
    a mesh
  • Embedding a ring into a 2D torus
  • dilation 1, congestion 1
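One standard folding map realising the dilation-2, congestion-2 embedding
of a ring into a line (a sketch; the concrete formula is an illustration,
not taken from the slides).

  #include <stdio.h>

  int ring_to_line(int i, int p)        /* line position of ring node i */
  {
      return (2 * i < p) ? 2 * i : 2 * (p - i) - 1;
  }

  int main(void)
  {
      int p = 8;
      for (int i = 0; i < p; i++) {     /* every ring edge maps to a path of length 1 or 2 */
          int a = ring_to_line(i, p);
          int b = ring_to_line((i + 1) % p, p);
          printf("ring edge (%d,%d) -> line path %d..%d, length %d\n",
                 i, (i + 1) % p, a, b, a > b ? a - b : b - a);
      }
      return 0;
  }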

67
Embedding Ring into Hypercube
  • map processor i onto node G(d, i) of the
    d-dimensional hypercube
  • the function G() is called the binary reflected
    Gray code
  • G() can be easily defined recursively
  • G(d+1) = 0·G(d) followed by 1·G(d) in reverse
    order
  • Example
  • 0,1 → 00,01,11,10 → 000,001,011,010,
    110,111,101,100

[Figure: a 3-dimensional hypercube with nodes labelled 000-111; the embedded ring visits 000, 001, 011, 010, 110, 111, 101, 100.]
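A small C sketch: the binary reflected Gray code also has the closed form
G(i) = i XOR (i >> 1), which matches the recursive definition above;
mapping ring node i to hypercube node gray(i) therefore sends ring
neighbours to hypercube neighbours (dilation 1).

  #include <stdio.h>

  unsigned gray(unsigned i) { return i ^ (i >> 1); }

  int main(void)
  {
      int d = 3, n = 1 << d;                   /* 3-dimensional hypercube */
      for (int i = 0; i < n; i++) {
          unsigned g = gray(i);
          /* gray(i) and gray((i+1) mod n) differ in exactly one bit */
          printf("ring node %d -> hypercube node %u%u%u\n",
                 i, (g >> 2) & 1, (g >> 1) & 1, g & 1);
      }
      return 0;
  }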
68
Embedding trees into Hypercube
  • arbitrary binary trees can be (slightly less)
    efficiently embedded as well
  • example below assumes the processors are only at
    the leaves of the tree

69
Embedding mesh into linear array
  • congestion? dilation?
  • is there an embedding of an n x n mesh into a
    linear array of n² nodes with congestion less
    than n? dilation less than n? why?

70
Possible questions
  • Given an embedding of G into H. What is its
    dilation? Congestion?
  • Show how to embed 2D torus into 2D mesh with
    constant dilation. What is the dilation of your
    embedding? What is its congestion?
  • Show how to embed a ring into a 2D mesh. Is it
    always possible to do it with both dilation and
    congestion equal to 1? Constant?
  • Given a number x, what is its predecessor/successor
    in the d-bit binary reflected Gray code?
  • Consider a graph G with diameter d and bisection
    width w and a graph G' with diameter d' and
    bisection width w'. What is the best congestion
    we can hope for in any embedding of G into G'?
    What is the best dilation we can hope for?

71
Physical Organization of Parallel Platforms
  • Ideal Parallel Computer PRAM
  • Interconnection Networks for Parallel Computers
  • dynamic
  • static
  • graph embeddings
  • cost of communication

72
Communication Costs in Parallel Machines
  • Message Passing Costs
  • Startup time (ts)
  • once per message transfer
  • covers message preparation, routing and
    establishing interface between local node and
    router
  • Per-hop time (th)
  • incurred in each internal node of the path
  • the travel time of the header between two direct
    neighbours
  • Per-word transfer time (tw)
  • determined by the bandwidth w of the channel:
    tw = 1/w
  • includes network and buffering overheads

73
Routing and Message Passing Costs
  • Store-and-Forward Routing
  • message of size m traveling l links
  • tcomm = ts + (m·tw + th)·l ≈ ts + m·l·tw
  • Packet Routing
  • split the message into fixed size packets, route
    the packets independently
  • used in highly dynamic settings with high error
    rates (e.g. Internet)
  • tcomm = ts + l·th + m·tw
  • Cut-Through Routing
  • the header establishes the path, the tail (data)
    of the message follows like a snake
  • used in networks with low failure rate, no
    buffers needed
  • tcomm = ts + l·th + m·tw, with smaller constants
    (see the sketch below)
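The two cost formulas as small C helpers (a sketch; the parameter values
in main are made up for illustration).

  #include <stdio.h>

  /* store-and-forward: every intermediate node receives and forwards the
   * whole m-word message: t = ts + l*(m*tw + th) */
  double t_store_forward(double ts, double th, double tw, double m, double l)
  {
      return ts + l * (m * tw + th);
  }

  /* packet and cut-through routing (up to constants): the header pays the
   * per-hop cost, the payload is pipelined: t = ts + l*th + m*tw */
  double t_cut_through(double ts, double th, double tw, double m, double l)
  {
      return ts + l * th + m * tw;
  }

  int main(void)
  {
      double ts = 50.0, th = 2.0, tw = 0.5;     /* microseconds, illustrative */
      double m = 1024.0, l = 4.0;               /* 1024 words over 4 links    */
      printf("store-and-forward: %.1f\n", t_store_forward(ts, th, tw, m, l));
      printf("cut-through:       %.1f\n", t_cut_through(ts, th, tw, m, l));
      return 0;
  }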

74
Message Passing Costs
  • General formula: tcomm = ts + l·th + m·tw
  • In order to communicate efficiently
  • communicate in bulk
  • minimize the volume of the transferred data
  • minimize the traveled distance
  • We will talk about the first two objectives a
    lot, the third is often difficult to achieve in
    practice
  • often little control over mapping processes onto
    processors
  • randomized routing commonly used
  • usually ts dominates for small messages and mtw
    for large ones, while l and th tend to be quite
    small
  • Simplified formula: tcomm = ts + m·tw

75
Message Passing Costs and Congestion
  • the previous discussion applies only if the
    network is not congested
  • effective bandwidth
  • link bandwidth scaled down by the degree of
    congestion
  • difficult to estimate, as it depends on process
    to processor mapping, routing algorithm,
    communication schedule
  • lower bound estimate: scale the link bandwidth
    down by a factor of p/b, where b is the bisection
    width

76
Communication Costs in Shared Address Space
Machines
  • Difficult to properly estimate due to
  • programmer has minimal control over memory
    layout
  • possible cache thrashing
  • cache maintenance overhead (invalidate, update)
    is difficult to quantify
  • difficult to model spatial locality
  • prefetching uncertainties
  • false sharing
  • contention in shared accesses
  • The simplified formula tcomm = ts + m·tw can
    still be used.

77
New Concepts and Terms - Summary
  • SIMD, MIMD, SPMD
  • shared memory, distributed (shared) memory,
    cache coherence
  • (cc)UMA, SMP, (cc)NUMA, MPP
  • interconnection networks static, dynamic
  • bus, crossbar, Ω-network
  • blocking, non-blocking network
  • torus, hypercube, butterfly, fat tree
  • bisection width
  • graph embedding, dilation, congestion
  • Gray codes