Title: Introduction to Distributed-Memory Computing
1. Introduction to Distributed-Memory Computing
2. More Concurrency
- So far we have talked about concurrency within a box
  - Within a processor
    - Pipelining
    - Multiple functional units
    - Instruction-Level Parallelism
    - Hyper-Threading
  - Across processors
    - Multi-proc systems
    - Multi-core systems
    - Multi-proc/core systems
- But this can only get us so far for many applications
  - We're limited by the number of processors we can put in a single box
  - We're limited by the size of the memory we can put in a single box
3. Toward Distributed Memory
- Although for many applications one would rather write simple non-concurrent code, one has to go to concurrency because we need more cycles, and thus we have to work with multi-core/proc systems
  - The number of cycles in a non-concurrent processor is limited by technology and cost
- Although for many applications one would rather write code that runs within a single system, one has to use multiple systems because we need (even) more cycles and/or a larger memory than is available in a single system
  - The size of memory is limited by technology and cost
  - The number of processor cores is limited by technology and cost
- Therefore, we often have to use multiple systems
  - Note that it's because we're forced to do it, not because we want to do it (although it's intellectually challenging and sometimes considered fun and cool)
4. Distributed Memory Computing
- Distributed memory platforms
- so-called supercomputers
- Issues when writing distributed memory programs
5. A host of parallel machines
- There are (have been) many kinds of parallel machines
- For the last 12 years their performance has been measured and recorded with the LINPACK benchmark, as part of the Top500 list
- It is a good source of information about what machines exist and how they have evolved
- Note that it's really about supercomputers
- http://www.top500.org
6. LINPACK Benchmark?
- LINPACK: LINear algebra PACKage
  - A FORTRAN library
  - Matrix multiply, LU/QR/Cholesky factorizations, eigensolvers, SVD, etc.
- LINPACK Benchmark
  - Dense linear system solve with LU factorization
  - 2/3 n^3 + O(n^2) floating-point operations
  - Measures MFlops
  - The problem size can be chosen
  - You have to report the best performance for the best n, and the n that achieves half of the best performance.
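
For a rough sense of what gets reported, the rate is essentially the operation count above divided by the wall-clock time. A minimal sketch (not the official benchmark code; the function name and the example numbers are made up):

    #include <stdio.h>

    /* Gflop/s rate for a dense n x n solve, assuming the usual
       2/3*n^3 + 2*n^2 LINPACK operation count and a measured time in seconds. */
    double linpack_gflops(long n, double seconds) {
        double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
        return flops / seconds / 1e9;
    }

    int main(void) {
        printf("%.2f Gflop/s\n", linpack_gflops(100000, 3600.0));  /* illustrative */
        return 0;
    }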
7. What can we find on the Top500?
8-12. Pies
[Figures: pie-chart breakdowns of Top500 statistics, not reproduced here]
13. Platform Architectures
14. Clusters, Constellations, MPPs
- These are the only 3 categories today in the Top500
- They all belong to the Distributed Memory model (MIMD), with many twists
- Each processor/node has its own memory and cache but cannot directly access another processor's memory
  - Nodes may be SMPs
- Each node has a network interface (NI) for all communication and synchronization
15. Clusters
- 80% of the Top500 machines are labeled as clusters
- Definition: a parallel computer system comprising an integrated collection of independent nodes, each of which is a system in its own right, capable of independent operation and derived from products developed and marketed for other standalone purposes
- A commodity cluster is one in which both the network and the compute nodes are available in the market
- In the Top500, "cluster" means "commodity cluster"
- A well-known type of commodity cluster is the Beowulf-class PC cluster, or "Beowulf"
16. What is Beowulf?
- An experiment in parallel computing systems
- Established a vision of low-cost, high-end computing with public-domain software (and led to software development)
- Tutorials and a book on best practices for building such platforms
- Today, "Beowulf cluster" means a commodity cluster that runs Linux and GNU-type software
- Project initiated by T. Sterling and D. Becker at NASA in 1994
17. MPP????????
- Probably the most imprecise term for describing a machine (isn't a 256-node cluster of 4-way SMPs "massively parallel"?)
- May use proprietary networks, vector processors, etc., as opposed to commodity components
- Basically, everything that's fast and not commodity is an MPP, in terms of today's Top500
- Let's look at these non-commodity things
  - People's definition of "commodity" varies
18. Cray X1: Parallel Vector Architecture
- Cray combines several technologies in the X1
  - 12.8 Gflop/s vector processors (MSPs)
  - Shared caches (unusual on earlier vector machines)
  - 4-processor nodes sharing up to 64 GB of memory
  - Single System Image to 4096 processors
  - Remote put/get between nodes (faster than explicit messaging)
19. Cray X1: the MSP
- The Cray X1 building block is the MSP
  - Multi-Streaming vector Processor
  - 4 SSPs (each a 2-pipe vector processor)
  - The compiler will (try to) vectorize/parallelize across the MSP, achieving "streaming"
[Figure: MSP block diagram: 4 custom SSP blocks (S) with 8 vector pipes (V); 12.8 Gflop/s at 64 bit and 25.6 Gflop/s at 32 bit per MSP; four 0.5 MB shared caches (2 MB Ecache) at 400/800 MHz, with 25-41 GB/s to the SSPs and 25.6 GB/s (12.8-20.5 GB/s) to local memory and network. Figure source: J. Levesque, Cray]
20. Cray X1: A node
- Shared memory
- 32 network links and four I/O links per node
21. Cray X1: 32 nodes
[Figure: 32 nodes connected by a fast switch]
22. Cray X1: 128 nodes
23. Cray X1: Parallelism
- Many levels of parallelism
  - Within a processor: vectorization
  - Within an MSP: streaming
  - Within a node: shared memory
  - Across nodes: message passing
- Some are automated by the compiler, some require work by the programmer
- This is a common theme
  - The more complex the architecture, the more difficult it is for the programmer to exploit it
- Hard to fit this machine into a simple taxonomy
- Similar story for the Earth Simulator
24. The Earth Simulator (NEC)
- Each node
  - Shared memory (16 GB)
  - 8 vector processors + I/O processor
- 640 nodes fully connected by a 640x640 crossbar switch
- Total: 5120 8-Gflop processors -> 40 TFlop peak
25. Blue Gene/L
- 65,536 processors
- Relatively modest clock rates, so that power consumption is low, cooling is easy, and space is small (1024 nodes in the same rack)
  - Besides, processor speed is on par with memory speed, so a faster clock rate would not help
- 2-way SMP nodes (really different from the X1)
- Several networks
  - 64x32x32 3-D torus for point-to-point communication
  - Tree for collective operations and for I/O
  - Plus others: Ethernet, etc.
26. If you like dead Supercomputers
- Lots of old supercomputers w/ pictures
  - http://www.geocities.com/Athens/6270/superp.html
- Dead Supercomputers
  - http://www.paralogos.com/DeadSuper/Projects.html
- e-Bay: a Cray Y-MP/C90 (1993) sold for $45,100.70
  - From the Pittsburgh Supercomputing Center, who wanted to get rid of it to make space in their machine room
  - Original cost: $35,000,000
  - Weight: 30 tons
  - It cost $400,000 to make it work at the buyer's ranch in Northern California
27. Distributed Memory Programming
- So this is all well and good: we can put tons of machines together
- The big question is: how do we write code for something like this?
- The application now consists of multiple processes running on different machines
  - Each process can consist of multiple threads!
- Let's look at a picture
28. Distributed Memory Platform
[Figure: a dual-core system built from dual-core chips, each core hyper-threaded, with one L1 cache per core]
29. Distributed Memory Platform
[Figure: the same dual-core systems connected by an 8-way switch, forming a cluster of dual-core systems]
30. Distributed Memory Program
- 8 processes
- Each process may contain 4 threads
  - 2 threads are running on each core using hyper-threading
  - Each process may contain more or fewer threads
31. Distributed Memory Program
- Each process stores some data in the memory of its own box
- Let's see an example
32. Distributed-Memory Heat Equation
- Say you want to solve the Heat Transfer equation
- This application really looks like an image processing filter
  - You just run it multiple times in a row
[Figure: a 5-point stencil; each new pixel is f(old pixel, its four neighbors)]
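
As a purely illustrative example (the lecture does not specify f), the update could be a Jacobi-style step that moves each pixel toward the average of its four neighbors; the constant k below is an assumption:

    /* Hypothetical 5-point stencil update for a heat-equation-like filter.
       'center' is the old pixel value; the other arguments are its
       up/down/left/right neighbors. */
    double f(double center, double up, double down, double left, double right) {
        const double k = 0.25;  /* assumed diffusion coefficient */
        return center + k * (up + down + left + right - 4.0 * center);
    }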
33. Sample Stencil App Code
- The code could look something like this:

    int a[N][N], a_new[N][N];

    for (i = 1; i < N-1; i++) {
      for (j = 1; j < N-1; j++) {
        a_new[i][j] = f(a[i][j],
                        a[i-1][j], a[i+1][j],
                        a[i][j-1], a[i][j+1]);
      }
    }

- Probably with threads, etc.
34. Too Large?
- This is all well and good, but what if my array requires 8 GB of memory and I only have 1 GB of RAM?
- I could think of just relying on virtual memory
  - This is bound to be very slow
- I could manage the reads and writes to disk myself
  - Could be a bit faster than virtual memory if I am really clever, but it would be complicated and still slow
  - Called an "out-of-core" implementation
- Or, I could use 8 machines with 1 GB of RAM each and run fast without ever really swapping between memory and disk!
  - For instance, we can use a cluster!
35. How do we write the program?
- Of course, the big question is how we write the code
- We cannot declare an NxN array any more, because that would not fit in memory
- Each process (running on a different system) must handle an array of size (N/8) x N
  - Each process allocates memory for 1/8 of the overall array
36-38. Data Distribution
[Figure: the image split into horizontal pieces, one per process (process 1, process 2, process 3, ...)]
- Each piece of the image is stored in the memory of a different system
- A process running on one system can only see (i.e., address) the local image piece, and has no way to address the other pieces
- This is what makes distributed-memory programming MUCH harder than shared-memory programming
39. Boundaries!
- One of the problems now is: what happens at the boundaries/edges of the image tiles?
  - Process 1 needs pixels from process 2
  - Process 2 needs pixels from process 1
- The processes cannot share memory because they're on different systems!
  - We cannot just turn them into threads like in shared-memory programming
[Figure: adjacent tiles owned by process 1 and process 2]
40. Message-Passing
- Since processes cannot share memory, they have to exchange messages
  - "Here are the pixels you need from me; give me the ones I need from you"
- This type of programming is called message-passing
- It uses network communication
  - e.g., sockets and TCP
- So your code will have special function calls
  - Send(...)
  - Receive(...)
- We're getting further away from simple shared-memory programming
41. SPMD Program
- So at this point, we could:
  - implement 8 different programs
  - start them up somehow on different nodes of our cluster (for instance)
  - have them all somehow identify their left and right neighbors, if any
- It turns out that this is really cumbersome
  - And if I want to use 1000 processes, do I have to write 1000 programs?
- Typically one uses/implements the notion of a process rank
42. Process Ranks
- To identify the processes participating in the computation, each process is assigned an index (its rank) from 0 to N-1
- Each process can find out what its rank is and how many processes there are in total

0 1 2 3 4 5 6 7
43. Communication Patterns
0 1 2 3 4 5 6 7
- Process 0 will send to 1 and receive from 1
- Process 1 will send to 0, receive from 0, send to 2, and receive from 2
- ...
- Process 7 will receive from 6 and send to 6
44. SPMD Programming
- If every process can find out its rank and the
total number of processes, then one can write a
Single Program to operate on Multiple pieces of
Data simultaneously (SPMD)
    int main() {
      if (my_rank() == 0) {
        // talk to my below neighbor
      } else if (my_rank() == num_processes() - 1) {
        // talk to my above neighbor
      } else {
        // talk to my above and below neighbors
      }
    }
45. Ranks and Number of Processes
- For now we're going to assume that we have the my_rank() and num_processes() functions, and that all the logistics of starting up the processes are taken care of
- We will see later that there are standard ways to make this happen
  - But this can also be implemented by hand if necessary
- At any rate, the way to write distributed-memory programs is to rely on the process ranks
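
For reference, one standard way to obtain these two functions is MPI. A minimal sketch of how they could be implemented on top of it (the lecture has not introduced MPI; this is only to show that the assumption is realistic):

    #include <mpi.h>

    /* Thin wrappers providing the my_rank() / num_processes() functions
       assumed in the lecture, implemented on top of MPI. MPI_Init() must be
       called at the start of main() and MPI_Finalize() before exiting. */
    int my_rank(void) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        return rank;
    }

    int num_processes(void) {
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        return size;
    }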
46. Writing the SPMD Program
- The pseudo-code of the SPMD program could then look like:

    int main() {
      int M = N / num_processes();   // assumed to be an integer!
      int original_image[M][N];
      int new_image[M][N];
      // load my part of the image from disk
      // compute all the pixels that do not require communication
      // send pixels to my neighbor(s)
      // receive pixels from my neighbor(s)
      // compute the remaining pixels
      // save the new image to file
    }
47. Writing the SPMD Program
- For now, let's ignore the issue of loading/writing files from/to disk
  - There are a lot of options here: simple/slow ones and complex/fast ones
- Let's focus on computation and communication
48. Computing the easy pixels
[Figure: the M x N band of rows owned by each of the processes 0-6; the interior rows of each band can be computed without communication, while the first and last rows require pixels from the neighboring processes. Note that process 0 and the last process can compute one more row than the others without any communication.]
49. Computing the easy pixels
    for (j = 0; j < N; j++) {
      if (my_rank() == 0) {
        // top process can compute an extra row
        new_image[0][j] = f(original_image[0][j],
                            original_image[0][j-1], original_image[0][j+1],
                            original_image[1][j]);
      }
      if (my_rank() == num_processes()-1) {
        // bottom process can compute an extra row
        new_image[M-1][j] = f(original_image[M-1][j],
                              original_image[M-1][j-1], original_image[M-1][j+1],
                              original_image[M-2][j]);
      }
      for (i = 1; i < M-1; i++) {
        // everybody computes the middle M-2 rows
        new_image[i][j] = f(original_image[i][j],
                            original_image[i+1][j], original_image[i-1][j],
                            original_image[i][j-1], original_image[i][j+1]);
      }
    }
50. Global/Local Index
- One of the reasons why distributed-memory programming is difficult is the discrepancy between global and local indices
- When I think globally of the whole image, I know where the pixel at coordinates (100,100) is
- But when I write the code, I will not reference that pixel as image[100][100]!
- Let's look at an example
51. Global/Local Index
[Figure: the image split between Process 0 (top tile) and Process 1 (bottom tile), with one pixel shown in red]
- The red pixel's global coordinates are (5,1)
  - The pixel on the 6th row and the 2nd column of the big array
- But when Process 1 references it, it must use coordinates (1,1)
  - The pixel on the 2nd row and the 2nd column of the tile that's stored in Process 1
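
A small sketch of the bookkeeping this implies, assuming the row-band distribution used here (M = N/num_processes() rows per process); the helper names are illustrative, not from the lecture:

    /* Map a global row index to the process that owns it and to the
       corresponding local row index, assuming each process owns M
       consecutive rows. Column indices are unchanged. */
    int owner_of_row(int global_row, int M)     { return global_row / M; }
    int local_row(int global_row, int M)        { return global_row % M; }
    int global_row(int rank, int local, int M)  { return rank * M + local; }

    /* With M = 4, the global pixel (5,1) is owned by process 5/4 = 1
       and has local coordinates (5 % 4, 1) = (1,1), as on the slide. */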
52. Message Passing
- Let's assume that we have a send() function that takes as arguments:
  - The rank of the destination process
  - An address in local memory
  - A size (in bytes)
- Let's assume that we have a recv() function that takes as arguments:
  - An address in local memory
  - A size (in bytes)
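
As one possible (purely illustrative) realization, these could be thin wrappers around MPI point-to-point calls; real code would also care about message tags and about which source it receives from, and the later slides additionally assume that sends do not block (which would suggest MPI_Isend instead):

    #include <mpi.h>

    /* send()/recv() with the signatures assumed in the lecture,
       sketched on top of blocking MPI calls. */
    void send(int dest_rank, void *buffer, int size_in_bytes) {
        MPI_Send(buffer, size_in_bytes, MPI_BYTE, dest_rank, 0, MPI_COMM_WORLD);
    }

    void recv(void *buffer, int size_in_bytes) {
        MPI_Recv(buffer, size_in_bytes, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }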
53. A Process's Memory
- original_image (M x N): the top row is sent to the above neighbor, the bottom row is sent to the below neighbor, and the rest is not communicated
- buffer_top (1 x N): received from the above neighbor
- buffer_bottom (1 x N): received from the below neighbor
- new_image (M x N): the top and bottom rows are updated with the received data; the rest is updated without using received data
54. Sending/Receiving Pixels
    double buffer_top[N], buffer_bottom[N];

    if (my_rank() != 0) {   // exchange pixels with my above neighbor
      send(my_rank()-1, &(original_image[0][0]), sizeof(double)*N);
      recv(buffer_top, sizeof(double)*N);
    }
    if (my_rank() != num_processes()-1) {   // exchange pixels with my below neighbor
      send(my_rank()+1, &(original_image[M-1][0]), sizeof(double)*N);
      recv(buffer_bottom, sizeof(double)*N);
    }
    // assumes non-blocking sends
55. Computing Remaining Pixels
    if (my_rank() != 0) {   // update the top row using pixels received from above
      for (j = 0; j < N; j++) {
        new_image[0][j] = f(original_image[0][j],
                            original_image[1][j], buffer_top[j],
                            original_image[0][j-1], original_image[0][j+1]);
      }
    }
    if (my_rank() != num_processes()-1) {   // update the bottom row using pixels received from below
      for (j = 0; j < N; j++) {
        new_image[M-1][j] = f(original_image[M-1][j],
                              buffer_bottom[j], original_image[M-2][j],
                              original_image[M-1][j-1], original_image[M-1][j+1]);
      }
    }
56. We're done!
- At this point, we have written the whole code
- What's missing is I/O
  - Read the image in
  - Write the image out
- Dealing with I/O (efficiently) is a difficult problem, and we won't really talk about it in depth
- And of course we need to use a tool that provides the my_rank(), num_processes(), send(), and recv() functions
- Each process allocates 1xN + 1xN + 2x(M/P)xN = (2M/P + 2)xN pixels, where P is the number of processes and M is the total number of rows in the image (so each process holds M/P rows)
- Therefore, the total number of pixels allocated across all processes is 2MN + 2NP
  - That is 2NP more pixels than in the sequential version
  - But it's insignificant when spread across multiple systems
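
To get a rough sense of scale (numbers purely illustrative): for M = N = 8192 and P = 8,

    extra pixels  = 2NP = 2 x 8192 x 8    =     131,072
    image pixels  = 2MN = 2 x 8192 x 8192 = 134,217,728

so the replication overhead is about 0.1%.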
57. The Code
    int main() {
      int i, j, M = N / num_processes();   // assumed to be an integer!
      int original_image[M][N], new_image[M][N];
      double buffer_top[N], buffer_bottom[N];

      for (j = 0; j < N; j++) {
        if (my_rank() == 0) {   // top process can compute an extra row
          new_image[0][j] = f(original_image[0][j], original_image[0][j-1],
                              original_image[0][j+1], original_image[1][j]);
        }
        if (my_rank() == num_processes()-1) {   // bottom process can compute an extra row
          new_image[M-1][j] = f(original_image[M-1][j], original_image[M-1][j-1],
                                original_image[M-1][j+1], original_image[M-2][j]);
        }
        for (i = 1; i < M-1; i++) {   // everybody computes the middle M-2 rows
          new_image[i][j] = f(original_image[i][j], original_image[i+1][j],
                              original_image[i-1][j], original_image[i][j-1],
                              original_image[i][j+1]);
        }
      }

      if (my_rank() != 0) {   // exchange pixels with my above neighbor
        send(my_rank()-1, &(original_image[0][0]), sizeof(double)*N);
        recv(buffer_top, sizeof(double)*N);
58. The OpenMP Code
    int main() {
      int i, j;
      int original_image[N][N], new_image[N][N];

      #pragma omp parallel for private(i,j) shared(original_image, new_image)
      for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
          new_image[i][j] = f(original_image[i][j],
                              original_image[i+1][j], original_image[i-1][j],
                              original_image[i][j-1], original_image[i][j+1]);
        }
      }
    }

- And in the distributed-memory code we made the simplifying assumption that P divides N; removing it would further increase the complexity of the distributed-memory code (but not of the shared-memory code!)
59. Overlapping Comp and Comm
- One of the complexities of writing distributed-memory programs is hiding the cost of communication
  - Again, we'd like to pretend we have a big shared-memory machine without a network at all
- It's all very similar to what we did with the image processing application (HW5)
- In the previous example, as opposed to doing
  - compute easy pixels, send, recv, compute remaining pixels
- one should do
  - send, compute easy pixels, recv, compute remaining pixels
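
A minimal sketch of that reordering, assuming the send() calls do not block (e.g., they are backed by non-blocking communication); compute_easy_pixels() and compute_remaining_pixels() are just placeholder names for the loops on slides 49 and 55:

    // 1. Post the boundary-row sends first
    if (my_rank() != 0)
      send(my_rank()-1, &(original_image[0][0]),   sizeof(double)*N);
    if (my_rank() != num_processes()-1)
      send(my_rank()+1, &(original_image[M-1][0]), sizeof(double)*N);

    // 2. Compute the "easy" pixels while the messages are in flight
    compute_easy_pixels();

    // 3. Receive the neighbor rows (may wait if they have not arrived yet)
    if (my_rank() != 0)                  recv(buffer_top,    sizeof(double)*N);
    if (my_rank() != num_processes()-1)  recv(buffer_bottom, sizeof(double)*N);

    // 4. Finish the rows that needed the received pixels
    compute_remaining_pixels();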
60. Hybrid Parallelism
- In a cluster, individual systems are multi-proc/core
- Therefore, one should use multiple threads within each system
- This can be done by adding a few deft OpenMP pragmas to the distributed-memory code
- For instance:

    #pragma omp parallel for private(i,j) shared(original_image, new_image)
    for (i = 1; i < M-1; i++) {   // everybody computes the middle M-2 rows
      for (j = 0; j < N; j++) {
        new_image[i][j] = f(original_image[i][j],
                            original_image[i+1][j], original_image[i-1][j],
                            original_image[i][j-1], original_image[i][j+1]);
      }
    }
61. Conclusion
- Writing distributed-memory code is much more complex than writing shared-memory code
- One must identify what must be communicated
- One must keep a mental picture of the memory across systems
- All this in addition to all the concerns we have mentioned in class
  - e.g., cache reuse, synchronization among threads
- And the typical problems of shared memory are still there
  - There can be communication deadlocks, for instance
- An in-depth study of distributed-memory programming belongs in a graduate-level class
- But it's likely that you'll end up at some point writing distributed applications with data distributed among disjoint processes