1
Introduction to Distributed-Memory Computing
2
More Concurrency
  • So far we have talked about concurrency within a
    single box
  • Within a processor
  • Pipelining
  • Multiple functional units
  • Instruction Level Parallelism
  • Hyper-Threading
  • Across processors
  • Multi-proc systems
  • Multi-core systems
  • Multi-proc/core systems
  • But this can only get us so far for many
    applications
  • We're limited by the number of processors we can
    put in a single box
  • We're limited by the size of the memory we can
    put in a single box

3
Toward Distributed Memory
  • Although for many applications one would rather
    write simple non-concurrent code, one has to go
    to concurrency because we need more cycles and
    thus have to work with multi-core/proc systems
  • The number of cycles a non-concurrent processor
    can deliver is limited by technology and cost
  • Although for many applications one would rather
    write code that runs within a single system, one
    has to use multiple systems because we need
    (even) more cycles and/or a larger memory than is
    available in a single system
  • The size of memory is limited by technology and
    cost
  • The number of processor cores is limited by
    technology and cost
  • Therefore, we often have to use multiple systems
  • Note that's because we're forced to do it, not
    because we want to do it (although it's
    intellectually challenging and sometimes
    considered fun and cool)

4
Distributed Memory Computing
  • Distributed memory platforms
  • so-called supercomputers
  • Issues when writing distributed memory programs

5
A host of parallel machines
  • There are (have been) many kinds of parallel
    machines
  • For the last 12 years their performance has been
    measured and recorded with the LINPACK benchmark,
    as part of Top500
  • It is a good source of information about what
    machines are and how
    they have evolved
  • Note that it's really about supercomputers
  • http://www.top500.org

6
LINPACK Benchmark?
  • LINPACK = LINear algebra PACKage
  • A FORTRAN library
  • Matrix multiply, LU/QR/Cholesky factorizations,
    eigensolvers, SVD, etc.
  • LINPACK Benchmark
  • Dense linear system solve with LU factorization
  • 2/3 n³ + O(n²) operations
  • Measures MFlop/s
  • The problem size can be chosen
  • You have to report the best performance for the
    best n, and the n that achieves half of the best
    performance (a small sketch of the flop-rate
    arithmetic follows below)
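
As a small illustration of the flop-rate arithmetic above (a sketch, not the official benchmark code; it uses the commonly quoted 2/3 n³ + 2n² flop count for the LU-based solve, and the n and time values are made up):

    #include <stdio.h>

    /* Estimate the sustained MFlop/s of a LINPACK-style run, given the
       chosen problem size n and the measured wall-clock time in seconds. */
    int main(void) {
      double n = 10000.0;       /* example problem size (hypothetical)        */
      double seconds = 250.0;   /* example measured solve time (hypothetical) */
      double flops = (2.0/3.0) * n * n * n + 2.0 * n * n;  /* ~2/3 n^3 + 2 n^2 */
      printf("~%.0f MFlop/s\n", flops / seconds / 1e6);
      return 0;
    }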

7
What can we find on the Top500?
8
Pies
9
Pies
10
Pies
11
Pies
12
Pies
13
Platform Architectures
14
Clusters, Constellations, MPPs
  • These are the only 3 categories today in the
    Top500
  • They all belong to the Distributed Memory model
    (MIMD) (with many twists)
  • Each processor/node has its own memory and cache
    but cannot directly access another processor's
    memory.
  • nodes may be SMPs
  • Each node has a network interface (NI) for all
    communication and synchronization.

15
Clusters
  • 80% of the Top500 machines are labeled as
    clusters
  • Definition: a parallel computer system comprising
    an integrated collection of independent nodes,
    each of which is a system in its own right,
    capable of independent operation and derived from
    products developed and marketed for other
    standalone purposes
  • A commodity cluster is one in which both the
    network and the compute nodes are available in
    the market
  • In the Top500, "cluster" means "commodity
    cluster"
  • A well-known type of commodity cluster is the
    Beowulf-class PC cluster, or "Beowulf"

16
What is Beowulf?
  • An experiment in parallel computing systems
  • Established vision of low cost, high end
    computing, with public domain software (and led
    to software development)
  • Tutorials and book for best practice on how to
    build such platforms
  • Today, by "Beowulf cluster" one means a
    commodity cluster that runs Linux and
    GNU-type software
  • Project initiated by T. Sterling and D.
    Becker at NASA in 1994

17
MPP????????
  • Probably the most imprecise term for describing a
    machine (isn't a 256-node cluster of 4-way SMPs
    massively parallel?)
  • May use proprietary networks, vector processors,
    as opposed to commodity components
  • Basically, everything that's fast and not
    commodity is an MPP, in terms of today's Top500
  • Let's look at these non-commodity things
  • People's definition of "commodity" varies

18
Cray X1 Parallel Vector Architecture
  • Cray combines several technologies in the X1
  • 12.8 Gflop/s Vector processors (MSP)
  • Shared caches (unusual on earlier vector
    machines)
  • 4 processor nodes sharing up to 64 GB of memory
  • Single System Image to 4096 Processors
  • Remote put/get between nodes (faster than
    explicit messaging)

19
Cray X1: the MSP
  • Cray X1 building block is the MSP
  • Multi-Streaming vector Processor
  • 4 SSPs (each a 2-pipe vector processor)
  • Compiler will (try to) vectorize/parallelize
    across the MSP, achieving streaming

(Figure: the MSP's custom blocks: 4 SSPs (S), each with two vector pipes (V); 12.8 Gflop/s at 64-bit, 25.6 Gflop/s at 32-bit; four 0.5 MB shared caches forming a 2 MB Ecache, at a frequency of 400/800 MHz, with 25-41 GB/s cache bandwidth; 25.6 GB/s and 12.8-20.5 GB/s links to local memory and the network. Figure source: J. Levesque, Cray.)
20
Cray X1: a node
  • Shared memory
  • 32 network links and four I/O links per node

21
Cray X1 32 nodes
(Figure: 32 nodes interconnected by a fast switch.)
22
Cray X1 128 nodes
23
Cray X1 Parallelism
  • Many levels of parallelism
  • Within a processor: vectorization
  • Within an MSP: streaming
  • Within a node: shared memory
  • Across nodes: message passing
  • Some are automated by the compiler, some require
    work by the programmer
  • This is a common theme
  • The more complex the architecture, the more
    difficult it is for the programmer to exploit it
  • Hard to fit this machine into a simple taxonomy
  • Similar story for the Earth Simulator

24
The Earth Simulator (NEC)
  • Each node:
  • Shared memory (16 GB)
  • 8 vector processors + an I/O processor
  • 640 nodes fully connected by a 640x640 crossbar
    switch
  • Total: 5120 8-GFlop/s processors → 40 TFlop/s peak

25
Blue Gene/L
  • 65,536 processors
  • Relatively modest clock rates, so that power
    consumption is low, cooling is easy, and space is
    small (1024 nodes in the same rack)
  • Besides, processor speed is on par with the
    memory speed, so a faster clock rate does not help
  • 2-way SMP nodes (really different from the X1)
  • several networks:
  • 64x32x32 3-D torus for point-to-point
    communication
  • tree for collective operations and for I/O
  • plus others: Ethernet, etc.

26
If you like dead Supercomputers
  • Lots of old supercomputers w/ pictures
  • http://www.geocities.com/Athens/6270/superp.html
  • Dead Supercomputers
  • http://www.paralogos.com/DeadSuper/Projects.html
  • e-Bay:
  • Cray Y-MP/C90, 1993
  • $45,100.70
  • From the Pittsburgh Supercomputing Center, who
    wanted to get rid of it to make space in their
    machine room
  • Original cost: $35,000,000
  • Weight: 30 tons
  • Cost $400,000 to make it work at the buyer's
    ranch in Northern California

27
Distributed Memory Programming
  • So this is all well and good, we can put tons of
    machines together
  • The big question is: how do we write code for
    something like this?
  • The application now consists of multiple
    processes running on different machines
  • Each process can consist of multiple threads!
  • Let's look at a picture

28
Distributed Memory Platform
(Figure: a dual-core system built from a dual-core chip; each hyper-threaded processor core has its own L1 cache.)
29
Distributed Memory Platform
(Figure: a cluster of dual-core systems connected by an 8-way switch.)
30
Distributed Memory Program
  • 8 processes
  • Each process may contain 4 threads
  • 2 threads are running on each core using
    hyper-threading
  • Each process may contain more or fewer threads

31
Distributed Memory Program
  • Each process stores some data in the memory of
    its box
  • Let's see an example

32
Distributed-Memory Heat Equation
  • Say you want to solve the Heat Transfer equation
  • This application really looks like an image
    processing filter
  • You just run it multiple times in a row

(Figure: each new pixel is computed as f of the pixel and its four neighbors.)
33
Sample Stencil App Code
  • The code could look something like this:
  • int a[N][N], a_new[N][N];
  • for (i=1; i<N-1; i++)
  •   for (j=1; j<N-1; j++)
  •     a_new[i][j] = f(a[i][j],
  •                     a[i-1][j], a[i+1][j],
  •                     a[i][j-1], a[i][j+1]);
  • Probably with threads, etc. (a possible
    definition of f is sketched below)
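
The slides leave f abstract. As one hypothetical example (an assumption for illustration, not the deck's definition), a heat-diffusion-style f could simply average a pixel with its four neighbors:

    /* One possible definition of the stencil function f used above
       (illustrative assumption only; the slides never define f). */
    int f(int center, int up, int down, int left, int right) {
      /* simple smoothing / heat-diffusion step: average of the 5 pixels */
      return (center + up + down + left + right) / 5;
    }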

34
Too Large?
  • This is all well and good, but what if my array
    requires 8GB of memory and I only have 1GB of
    RAM?
  • I could think of just relying on virtual memory
  • This is bound to be very slow
  • I could manage the reads and writes to disk
    myself
  • Could be a bit faster than virtual memory if I am
    really clever, but would be complicated and still
    slow
  • Called an "out-of-core" implementation
  • Or, I could use 8 machines with 1GB RAMs and run
    fast without really ever swapping between the
    memory and the disk!
  • For instance, we can use a cluster!

35
How do we write the program?
  • Of course, the big question is: how do we write
    the code?
  • We can no longer declare an N×N array,
    because it would not fit in memory
  • Each process (running on a different system) must
    handle an array of size (N/8) × N
  • Each process allocates memory for 1/8 of the
    overall array

36
Data Distribution
37
Data Distribution
38
Data Distribution
(Figure: the image split into horizontal strips stored by different processes, e.g., process 1, process 2, process 3.)
  • Each piece of the image is stored in the memory
    of a different system
  • A process running on one system can only see
    (i.e., address) the local image piece, and has no
    way to address other pieces
  • This is what makes distributed memory programming
    MUCH harder than shared-memory programming

39
Boundaries!
  • One of the problems now is what happens at the
    boundaries/edges of the image tiles?
  • Process 1 needs pixels from process 2
  • Process 2 needs pixels from process 1
  • The two processes cannot share memory because
    they're on different systems!
  • We cannot just turn them into threads as in
    shared-memory programming

process 1
process 2
40
Message-Passing
  • Since processes cannot share memory, they have to
    exchange messages
  • "Here are the pixels you need from me; give me
    the ones I need from you"
  • This type of programming is called
    message-passing
  • Uses network communication
  • e.g., sockets and TCP
  • So your code will have special function calls
  • Send(...)
  • Receive(...)
  • We're getting further away from simple
    shared-memory programming

41
SPMD Program
  • So at this point, we could
  • implement 8 different programs
  • start them up somehow on different nodes of our
    cluster (for instance)
  • have them all somehow identify their left and
    right neighbors, if any
  • Turns out that this is really cumbersome
  • And if I want to use 1000 processes, I have to
    write 1000 programs?
  • Typically one uses/implements the notion of a
    process rank

42
Process Ranks
  • To identify the processes participating in the
    computation, each process is assigned an index
    from 0 to N-1
  • And each process can find out what its rank is
    and how many processes there are in total

(Figure: 8 processes with ranks 0 through 7.)
43
Communication Patterns
(Figure: processes 0 through 7 in a row, each exchanging data with its neighbors.)
  • Process 0 will send to 1 and receive from 1
  • Process 1 will send to 0, receive from 0, send to
    2, and receive from 2
  • ...
  • Process 7 will receive from 6 and send to 6 (a
    sketch of the neighbor computation follows below)
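
A minimal sketch of how each process could work out its neighbor ranks, using the my_rank() and num_processes() functions just introduced (the -1 "no neighbor" marker is an assumption for illustration):

    /* Each process determines its neighbors from its rank;
       the first and last process each have only one neighbor. */
    int prev = (my_rank() == 0) ? -1 : my_rank() - 1;
    int next = (my_rank() == num_processes() - 1) ? -1 : my_rank() + 1;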

44
SPMD Programming
  • If every process can find out its rank and the
    total number of processes, then one can write a
    Single Program to operate on Multiple pieces of
    Data simultaneously (SPMD)

int main() {
  if (my_rank() == 0) {
    // talk to my below neighbor
  } else if (my_rank() == num_processes() - 1) {
    // talk to my above neighbor
  } else {
    // talk to my above and below neighbors
  }
}
45
Ranks and Number of Processes
  • For now were going to assume we have the
    my_rank() and the num_processes() functions, and
    the all the logistics of starting up the
    processes is taken care of
  • We will see later that there are standard ways to
    make this happen
  • But this can also be implemented by hand if
    necessary
  • At any rate, the way to write distributed memory
    programs is to rely on the process ranks
    assumption
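
As a concrete illustration (not prescribed by these slides), here is a minimal sketch of how my_rank() and num_processes() could be provided on top of MPI, one standard tool for this; the wrapper names are simply the ones assumed by the pseudo-code in this deck:

    #include <mpi.h>

    /* Hypothetical wrappers matching the names used in these slides. */
    int my_rank() {
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's index, 0..P-1 */
      return rank;
    }

    int num_processes() {
      int size;
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes P */
      return size;
    }

    /* MPI_Init(&argc, &argv) and MPI_Finalize() must bracket their use in main(). */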

46
Writing the SPMD Program
  • The pseudo-code of the SPMD program could then
    look like this:

int main() {
  int M = N / num_processes();  // assumed to be an integer!
  int original_image[M][N];
  int new_image[M][N];
  // load my part of the image from disk
  // compute all the pixels that do not require communication
  // send pixels to my neighbor(s)
  // receive pixels from my neighbor(s)
  // compute the remaining pixels
  // save the new image to file
}
47
Writing the SPMD Program
  • For now, let's ignore the issue of
    loading/writing files to disk
  • There are a lot of options here, simple/slow
    ones, and complex/fast ones
  • Let's focus on computation and communication

48
Computing the easy pixels
(Figure: the N-column image divided into strips of M rows across processes 0 through 7. Most rows can be computed without communication; the first and last row of each strip require pixels from the neighbors. Note that process 0 and the last process can each compute one more row than the others without any communication.)
49
Computing the easy pixels
for (j=1; j<N-1; j++) {  // interior columns (boundary columns would need special handling)
  if (my_rank() == 0) {
    // top process can compute an extra row
    new_image[0][j] = f( original_image[0][j],
                         original_image[0][j-1], original_image[0][j+1],
                         original_image[1][j] );
  }
  if (my_rank() == num_processes()-1) {
    // bottom process can compute an extra row
    new_image[M-1][j] = f( original_image[M-1][j],
                           original_image[M-1][j-1], original_image[M-1][j+1],
                           original_image[M-2][j] );
  }
  for (i=1; i<M-1; i++) {
    // everybody computes the middle M-2 rows
    new_image[i][j] = f( original_image[i][j],
                         original_image[i+1][j], original_image[i-1][j],
                         original_image[i][j-1], original_image[i][j+1] );
  }
}
50
Global/Local Index
  • One of the reasons why distributed-memory
    programming is difficult is the discrepancy
    between global and local indices
  • When I think globally of the whole image, I
    know where the pixel at coordinates (100,100) is
  • But when I write the code, I will not reference
    that pixel as image[100][100]!
  • Let's look at an example

51
Global/Local Index
(Figure: the image divided between Process 0 and Process 1, with one pixel highlighted in red inside Process 1's tile.)
  • The red pixel's global coordinates are (5,1)
  • The pixel on the 6th row and the 2nd column of
    the big array
  • But when Process 1 references it, it must use
    coordinates (1,1)
  • The pixel on the 2nd row and the 2nd column of
    the tile that's stored in Process 1 (a sketch of
    the conversion follows below)
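
A hedged sketch of that global-to-local conversion, assuming the block (strip) distribution used in these slides with M = N/num_processes() rows per process (the variable names are illustrative):

    /* Map a global row index to (owning process, local row index),
       assuming each process owns a contiguous block of M rows.     */
    int M         = N / num_processes();
    int owner     = global_row / M;   /* rank of the process holding the row  */
    int local_row = global_row % M;   /* row index inside that process's tile */
    /* Column indices are unchanged. E.g., with M = 4, global (5,1)
       maps to process 1, local (1,1), as on the previous slide.     */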

52
Message Passing
  • Let's assume that we have a send() function that
    takes as arguments:
  • The rank of the destination process
  • An address in local memory
  • A size (in bytes)
  • Let's assume that we have a recv() function that
    takes as arguments:
  • An address in local memory
  • A size (in bytes)
  • (A possible sketch of these two functions follows
    below)
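
One hedged way such send()/recv() wrappers could be realized, here sketched on top of MPI (an assumption; the slides do not prescribe an implementation). Note that MPI_Send may block, while the code on the following slides assumes non-blocking sends, which would call for MPI_Isend instead:

    #include <mpi.h>

    /* Hypothetical wrappers with the signatures assumed by these slides. */
    void send(int dest, void *buf, int size) {
      /* send 'size' bytes from 'buf' to process 'dest' (tag 0) */
      MPI_Send(buf, size, MPI_BYTE, dest, 0, MPI_COMM_WORLD);
    }

    void recv(void *buf, int size) {
      /* receive 'size' bytes from any process into 'buf' */
      MPI_Recv(buf, size, MPI_BYTE, MPI_ANY_SOURCE, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }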

53
A Process's Memory
  • original_image (M×N): the top row is sent to the
    above neighbor, the bottom row is sent to the
    below neighbor, the rest is not communicated
  • buffer_top (1×N): received from the above
    neighbor
  • buffer_bottom (1×N): received from the below
    neighbor
  • new_image (M×N): the top and bottom rows are
    updated with received data, the middle rows are
    updated without using received data
54
Sending/Receiving Pixels
  • int buffer_top[N], buffer_bottom[N];
  • if (my_rank() != 0) {  // exchange with above neighbor
  •   send(my_rank()-1, &(original_image[0][0]), sizeof(int)*N);
  •   recv(buffer_top, sizeof(int)*N);
  • }
  • if (my_rank() != num_processes()-1) {  // exchange with below neighbor
  •   send(my_rank()+1, &(original_image[M-1][0]), sizeof(int)*N);
  •   recv(buffer_bottom, sizeof(int)*N);
  • }
  • // assumes non-blocking sends

55
Computing Remaining Pixels
  • if (my_rank() != 0)  // update top row using pixels received from above
  •   for (j=1; j<N-1; j++)
  •     new_image[0][j] = f( original_image[0][j],
  •                          buffer_top[j], original_image[1][j],
  •                          original_image[0][j-1], original_image[0][j+1] );
  • if (my_rank() != num_processes()-1)  // update bottom row using pixels received from below
  •   for (j=1; j<N-1; j++)
  •     new_image[M-1][j] = f( original_image[M-1][j],
  •                            original_image[M-2][j], buffer_bottom[j],
  •                            original_image[M-1][j-1], original_image[M-1][j+1] );

56
We're done!
  • At this point, we have written the whole code
  • What's missing is I/O
  • Read the image in
  • Write the image out
  • Dealing with I/O (efficiently) is a difficult
    problem, and we won't really talk about it in
    depth
  • And of course we need to use a tool that provides
    the my_rank(), num_processes(), send() and recv()
    functions
  • Each process allocates 1×N + 1×N + 2(M/P)×N =
    (2M/P + 2)N pixels, where P is the number of
    processes and M×N here denotes the whole image
    (unlike the per-process M used in the code)
  • Therefore, the total number of pixels allocated
    is 2MN + 2NP
  • That is 2NP more pixels than in the sequential
    version
  • But it's insignificant when spread across
    multiple systems (a small worked example follows
    below)
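
For instance (an illustrative calculation with made-up numbers): with a whole image of M = N = 8192 and P = 8 processes, the extra allocation is 2NP = 131,072 pixels, versus 2MN ≈ 134 million pixels for the two image copies, i.e., roughly 0.1% overhead.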

57
The Code
  • int main() {
  •   int i, j, M = N/num_processes();  // assumed to be an integer!
  •   int original_image[M][N], new_image[M][N];
  •   int buffer_top[N], buffer_bottom[N];
  •   for (j=1; j<N-1; j++) {
  •     if (my_rank() == 0)  // top process can compute an extra row
  •       new_image[0][j] = f( original_image[0][j], original_image[0][j-1], original_image[0][j+1], original_image[1][j] );
  •     if (my_rank() == num_processes()-1)  // bottom process can compute an extra row
  •       new_image[M-1][j] = f( original_image[M-1][j], original_image[M-1][j-1], original_image[M-1][j+1], original_image[M-2][j] );
  •     for (i=1; i<M-1; i++)  // everybody computes the middle M-2 rows
  •       new_image[i][j] = f( original_image[i][j], original_image[i+1][j], original_image[i-1][j], original_image[i][j-1], original_image[i][j+1] );
  •   }
  •   if (my_rank() != 0) {  // exchange with above neighbor
  •     send(my_rank()-1, &(original_image[0][0]), sizeof(int)*N);
  •     recv(buffer_top, sizeof(int)*N);
  •   }
  •   // ... (the exchange with the below neighbor and the remaining-pixel updates follow as on the previous slides)

58
The OpenMP Code
  • int main() {
  •   int i, j;
  •   int original_image[N][N], new_image[N][N];
  •   #pragma omp parallel for private(i,j) shared(original_image, new_image)
  •   for (i=1; i<N-1; i++)
  •     for (j=1; j<N-1; j++)
  •       new_image[i][j] = f( original_image[i][j],
  •                            original_image[i+1][j], original_image[i-1][j],
  •                            original_image[i][j-1], original_image[i][j+1] );
  • }
  • Also, in the distributed-memory code we made the
    simplifying assumption that P divides N; removing
    that assumption would further increase the
    complexity of the distributed-memory code (but
    not of the shared-memory code!)

59
Overlapping Comp and Comm
  • One of the complexities of writing distributed-
    memory programs is hiding the cost of
    communication
  • Again, we'd like to pretend we have a big
    shared-memory machine without a network at all
  • It's all very similar to what we did with the
    image processing application (HW5)
  • In the previous example, as opposed to doing
  • compute easy pixels, send, recv, compute
    remaining pixels
  • one should do
  • send, compute easy pixels, recv, compute
    remaining pixels (a sketch follows below)
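
A hedged sketch of that reordering, reusing the same hypothetical send()/recv() wrappers and buffers as the earlier slides (and still assuming non-blocking sends), so communication is in flight while the easy pixels are computed:

    /* 1. start the boundary-row exchanges first */
    if (my_rank() != 0)
      send(my_rank()-1, &(original_image[0][0]),   sizeof(int)*N);
    if (my_rank() != num_processes()-1)
      send(my_rank()+1, &(original_image[M-1][0]), sizeof(int)*N);

    /* 2. compute the "easy" pixels that need no remote data (as before) */

    /* 3. then wait for the neighbors' rows */
    if (my_rank() != 0)
      recv(buffer_top,    sizeof(int)*N);
    if (my_rank() != num_processes()-1)
      recv(buffer_bottom, sizeof(int)*N);

    /* 4. finally compute the remaining boundary pixels using the received rows */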

60
Hybrid Parallelism
  • In a cluster, individual systems are
    multi-proc/core
  • Therefore, one should use multiple threads within
    each system
  • This can be done by adding a few deft OpenMP
    pragmas to the distributed-memory code
  • For instance
  • #pragma omp parallel for private(i) shared(original_image, new_image)
  •   // (j is the index of the enclosing column loop and stays shared)
  • for (i=1; i<M-1; i++)  // everybody computes the middle M-2 rows
  •   new_image[i][j] = f( original_image[i][j],
  •                        original_image[i+1][j], original_image[i-1][j],
  •                        original_image[i][j-1], original_image[i][j+1] );

61
Conclusion
  • Writing distributed-memory code is much more
    complex than shared-memory code
  • One must identify what must be communicated
  • One must keep a mental picture of the memory
    across systems
  • All this in addition to all the concerns we have
    mentioned in class
  • e.g., cache reuse, synchronization among threads
  • And the typical problems of shared memory are
    still there
  • There can be communication deadlocks, for
    instance
  • An in-depth study of distributed-memory
    programming belongs to a graduate-level class
  • But it's likely that you'll end up at some point
    writing distributed applications with data
    distributed among disjoint processes