Title: Global FFT, Global EP-STREAM Triad, HPL written in MC#
1. Global FFT, Global EP-STREAM Triad, HPL written in MC#
Vadim B. Guzev (vguzev@yandex.ru)
Russian People's Friendship University
October 2006
2. Global FFT
3. In this submission we'll show how to implement Global FFT in the MC# programming language, and we'll try to concentrate on the process of writing parallel distributed programs rather than on performance or line-count issues. We are quite sure that the future belongs to very high-level programming languages, and that one day the productivity of programmers will become more important than the productivity of platforms! That's why MC# was born. In object-oriented languages all programs are composed of objects and their interactions. It is natural that when a programmer starts thinking about a problem, he first wants to describe the object model before writing any logic. In the Global FFT program these classes are Complex (a structure) and GlobalFFT (the algorithm). We will start writing our program by defining the Complex class.
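The Complex listing itself is not reproduced in this transcript; a minimal analogue, written here as a Java record with member names of our own choosing, could look like this:

```java
// A minimal stand-in for the Complex structure from the slides (names and
// operations are our assumptions, not the original MC# code).
public record Complex(double re, double im) {
    // The two operations a DFT needs: complex addition and multiplication.
    public Complex add(Complex other) {
        return new Complex(re + other.re, im + other.im);
    }
    public Complex mul(Complex other) {
        return new Complex(re * other.re - im * other.im,
                           re * other.im + im * other.re);
    }
}
```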
The simple math behind the Global FFT problem is the following:
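The formula image from the slide is not reproduced in this transcript; the Global FFT task computes the one-dimensional discrete Fourier transform, which (up to the sign convention) is

$$ Z_k = \sum_{j=0}^{n-1} z_j \, e^{-2\pi i\, jk/n}, \qquad k = 0, \dots, n-1. $$

Each output value Z_k can be computed independently of the others, which is what the next slide exploits by splitting the work by the index k.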
4. The natural way to distribute this computation is to split the execution by the index k. That's exactly what we will do. In MC#, if you want a method to be executed in a different thread/node/cluster, all you need to do is mark this method as movable (a distributed analogue of void, or of the async keyword of C# 3.0). Where exactly a movable method will be executed is determined by the Runtime system, and the call of a movable method returns on the caller's side almost immediately (i.e. the caller of the method doesn't wait until the method execution is completed). In our case this movable method will receive as parameters (a) the array of complex values z, (b) the current processor number and (c) a special channel into which the result will be sent.
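The MC# listing is an image in the original slides; the following Java sketch (our own analogue, building on the Complex sketch above, with all names assumed) shows such a worker: it computes Z_k for every index k assigned to processor p and sends each result into a queue that stands in for the channel.

```java
import java.util.concurrent.*;

// Hypothetical Java analogue of the movable method described above, building on
// the Complex sketch; all names are our own assumptions, not the original MC# code.
public class GlobalFftWorker {
    // One DFT output value together with its index k.
    public record Result(int k, Complex value) {}

    // Compute Z_k for every index k assigned to processor p (k % np == p) and
    // send each value into the queue that plays the role of the result channel.
    public static void computeSlice(Complex[] z, int p, int np,
                                    BlockingQueue<Result> resultChannel) throws InterruptedException {
        int n = z.length;
        for (int k = p; k < n; k += np) {
            Complex sum = new Complex(0.0, 0.0);
            for (int j = 0; j < n; j++) {
                double angle = -2.0 * Math.PI * j * k / n;
                // sum += z[j] * exp(-2*pi*i*j*k/n)
                sum = sum.add(z[j].mul(new Complex(Math.cos(angle), Math.sin(angle))));
            }
            resultChannel.put(new Result(k, sum));
        }
    }
}
```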
As you can see, in MC# it is possible to use almost any type of parameter for movable methods. When distributed mode is enabled these parameters will be automatically serialized and sent to the remote node. The same applies to channels: it is possible to send values of any .Net type (which supports serialization) through the channels. To get the results out of channels you have to connect them with synchronous methods; such constructs are known as bounds in languages like Polyphonic C#, C# 3.0 or MC#. More information about bounds can be found on the MC# site.
5. And finally, let's write down the main method which will launch the computation:
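The Main listing is also an image in the slides; the following Java sketch (building on the worker sketch above, with names and sizes of our own choosing) mirrors its structure in local mode: launch one worker per processor, then read all n results back from the channel, which is the role the bound plays in the MC# program.

```java
import java.util.concurrent.*;

// Hypothetical stand-in for the slide's Main method. In local mode the movable
// call becomes "run on another thread", and the synchronous reads from the
// queue play the role of the bound connected to the channel.
public class GlobalFftMain {
    public static void main(String[] args) throws Exception {
        int n = 1024;                                        // problem size (assumption)
        int np = Runtime.getRuntime().availableProcessors(); // "processors" in local mode
        Complex[] z = new Complex[n];
        for (int j = 0; j < n; j++) z[j] = new Complex(Math.random(), 0.0);

        BlockingQueue<GlobalFftWorker.Result> channel = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(np);
        for (int p = 0; p < np; p++) {
            final int proc = p;
            // Analogue of calling a movable method: the caller does not wait.
            pool.execute(() -> {
                try { GlobalFftWorker.computeSlice(z, proc, np, channel); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        Complex[] result = new Complex[n];
        for (int i = 0; i < n; i++) {                        // collect all n results
            GlobalFftWorker.Result r = channel.take();
            result[r.k()] = r.value();
        }
        pool.shutdown();
        System.out.println("Z[0] = " + result[0]);
    }
}
```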
6. So, the first version of our Global FFT program is the following:
7. Parallel programs written in the MC# language can be executed either in local mode (i.e. as simple .exe files; in this mode all movable methods will be executed in different threads) or in distributed mode, in which case all movable calls will be distributed across the nodes of the Cluster/MetaCluster/GRID network (depending on the currently used Runtime). That means that a programmer can write and debug his program locally (for example on his Windows machine) and then copy it to a Windows-based or Linux-based cluster and run it in distributed mode. The user can even emulate a cluster environment on his home computer! MC# makes cluster computations accessible to every programmer, even to those who currently do not have access to clusters! Let's try to run this program in local mode on a Windows machine:
Or we can run this program in local mode on a Linux machine:
8. OK, it works. Now let's try to run this program on more serious hardware. We'll use a 16-node cluster with the following configuration:
Let's run our program on 16 processors:
Here is the result graph:
9. Not bad, especially if we take into account that we wrote this program in a modern high-level object-oriented language without thinking about any optimization issues or the physical structure of the computational platform. Now let's try to optimize it a little bit. The main problem in the first version of our program is that we need to move thousands of complex user-defined objects from the frontend to the cluster nodes and back. The serialization/deserialization of such objects takes a lot of resources and time. We can significantly reduce the execution time if we replace the arrays of Complex with arrays of doubles. Here is the modified version of the program (lines 46, NCSL 43, TPtoks 490):
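The modified listing is not reproduced in the transcript; the following Java fragment (our own illustration, building on the Complex sketch above) shows the kind of change described: packing the complex values into a flat double array with interleaved real/imaginary parts, so that plain double arrays, which are much cheaper to serialize, cross the network instead of Complex objects.

```java
// Our own illustration of the optimization described above, not the slide's
// MC# listing: pack Complex values into a flat double[] before sending.
public final class ComplexPacking {
    // Complex[] -> double[2*n]: [re0, im0, re1, im1, ...]
    public static double[] pack(Complex[] z) {
        double[] packed = new double[2 * z.length];
        for (int j = 0; j < z.length; j++) {
            packed[2 * j]     = z[j].re();
            packed[2 * j + 1] = z[j].im();
        }
        return packed;
    }

    // The inverse transformation, used on the receiving side.
    public static Complex[] unpack(double[] packed) {
        Complex[] z = new Complex[packed.length / 2];
        for (int j = 0; j < z.length; j++) {
            z[j] = new Complex(packed[2 * j], packed[2 * j + 1]);
        }
        return z;
    }
}
```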
10. Let's run this version of the program:
Now we can get some performance numbers for the second version of our Global FFT program:
11. Global EP-STREAM Triad
12. There is one more task in the Class 2 Specification, entitled Global EP-STREAM-Triad. Although C# currently doesn't support kernel vector operations, we think that it is still a good example to demonstrate the syntax of MC#. We'll write a simple program which performs the same calculations on several nodes simultaneously and then prints the average time taken across all the nodes. There will be only one movable method in this program, which will accept a special Channel through which only objects of class TimeSpan can be sent.
Movable methods cannot return any values; channels must be used instead to pass information between nodes. To read from semi-directional channels, bounds must be used (special syntax constructs which can synchronize multiple threads). In our case we need only one bound.
When Thread A calls the method GetResult, the Runtime system checks whether any object has been delivered to the result channel and queued in the special channel queue. If no objects have been received, Thread A is suspended until the result channel receives some object. When this object is received, Thread A is resumed and the read from the channel occurs. If an object is sent to the result channel and there are no waiting callers of the method GetResult, the object is put into the special channel queue; it will be read when the corresponding GetResult method is called.
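The MC# listing is not reproduced here. The following Java sketch mimics the pattern just described: a queue stands in for the channel of TimeSpan values, a blocking take() inside getResult() plays the role of the GetResult bound, and the kernel is the STREAM Triad a[i] = b[i] + alpha * c[i]. All names and sizes are our own assumptions.

```java
import java.time.Duration;
import java.util.concurrent.*;

// Rough Java analogue of the EP-STREAM Triad program described above.
public class EpStreamTriad {
    static final BlockingQueue<Duration> resultChannel = new LinkedBlockingQueue<>();

    // Analogue of the single movable method: run the kernel and send the time.
    static void runTriad(int n, double alpha) throws InterruptedException {
        double[] a = new double[n], b = new double[n], c = new double[n];
        for (int i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) a[i] = b[i] + alpha * c[i];  // the Triad kernel
        resultChannel.put(Duration.ofNanos(System.nanoTime() - start));
        // (a real benchmark would also consume 'a' to prevent dead-code elimination)
    }

    // Analogue of the GetResult bound: suspends the caller until a value arrives.
    static Duration getResult() throws InterruptedException {
        return resultChannel.take();
    }

    public static void main(String[] args) throws Exception {
        int np = 4, n = 10_000_000;
        for (int p = 0; p < np; p++) {
            new Thread(() -> {
                try { runTriad(n, 3.0); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }).start();
        }
        long totalNanos = 0;
        for (int p = 0; p < np; p++) totalNanos += getResult().toNanos();
        System.out.printf("Average time per node: %.3f ms%n", totalNanos / (np * 1e6));
    }
}
```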
13. Here is the final version of this EP-STREAM Triad program (lines 36, NCSL 33, TPtoks 311):
Let's run this program:
14. HPL
15. HPL solves a linear system of equations of order n, Ax = b, by first computing the LU factorization A = LU and then solving the equations Ly = b and Ux = y one by one. In this scheme L and U are triangular matrices, so solving these two systems is not a problem. The real problem is the calculation of the matrices L and U themselves.
The simple math behind this problem is the following:
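The formula image from the slide is not reproduced in this transcript. For an LU factorization $A = LU$ with a unit-diagonal lower-triangular $L$ (the standard Doolittle scheme, which is the kind of computation meant here), the elementwise relations are

$$ u_{ij} = a_{ij} - \sum_{k=0}^{i-1} l_{ik}\,u_{kj} \quad (i \le j), \qquad l_{ij} = \frac{1}{u_{jj}}\Bigl(a_{ij} - \sum_{k=0}^{j-1} l_{ik}\,u_{kj}\Bigr) \quad (i > j), \qquad l_{ii} = 1. $$

Each $u_{ij}$ and $l_{ij}$ depends only on entries computed earlier in its row and column, which is exactly what produces the dependency graph shown on the next slide.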
And the calculation dependency graph is the following:
16. [Diagram: the calculation dependency graph; only the node labels 0-3 survive in the transcript.]
17. So, we know that better algorithms do exist, but they are quite complex to understand, and the purpose of our submission is not to get the highest performance results but to show the principles of programming in the MC# language. So, for simplicity we'll use the simplest communication structure, where each process communicates directly with the top, left, bottom and right processes in the process grid. In our case each process will be connected with its neighbors by bi-directional channels (BDChannel). Using these bi-directional channels, processes can communicate with each other by sending and receiving messages.
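The BDChannel type itself is provided by the MC# runtime; as a rough local analogue (our own illustration, not the library's API), a bi-directional link between two neighbors can be modeled with a pair of queues, one per direction:

```java
import java.util.concurrent.*;

// Our own illustration of the idea behind BDChannel: each side gets an Endpoint
// that sends into one queue and receives from the other.
public final class BdChannel {
    public static final class Endpoint<T> {
        private final BlockingQueue<T> outbox, inbox;
        Endpoint(BlockingQueue<T> outbox, BlockingQueue<T> inbox) {
            this.outbox = outbox;
            this.inbox = inbox;
        }
        public void send(T message) throws InterruptedException { outbox.put(message); }
        public T receive() throws InterruptedException { return inbox.take(); }
    }

    // Create the two connected endpoints of one bi-directional channel.
    @SuppressWarnings("unchecked")
    public static <T> Endpoint<T>[] connect() {
        BlockingQueue<T> aToB = new LinkedBlockingQueue<>();
        BlockingQueue<T> bToA = new LinkedBlockingQueue<>();
        return new Endpoint[] { new Endpoint<>(aToB, bToA), new Endpoint<>(bToA, aToB) };
    }
}
```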
18. The Main method of our program is quite simple. Actually, it is written in pure C# code (no MC#-specific syntax is used here). First of all we generate matrix A and vector b, after that we instantiate the HPLAlgorithm object and solve the equation by calling the Solve method, and then we verify the solution. See the comments in the code for a better understanding.
19. First of all, let's look at the accessory methods: Verify, GetSubMatrix and GetSubVector. The Verify method checks the solution against the criteria mentioned in the HPC Challenge Awards Class 2 Specification:
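The Verify listing itself is not in the transcript. As a point of reference (our assumption about its shape, not a quotation of the slide), HPL-style verification is usually expressed through a scaled residual of the form

$$ \frac{\lVert Ax - b\rVert_\infty}{\varepsilon\,\bigl(\lVert A\rVert_\infty\,\lVert x\rVert_\infty + \lVert b\rVert_\infty\bigr)\,n} \;\le\; O(1), $$

where $\varepsilon$ is the machine precision and the acceptance threshold is of order 16.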
20. Now let's have a look at the main Solve method. In this method we create a p-by-q grid of bi-directional channels and then launch p*q movable methods, giving them as parameters the corresponding parts of matrix A (and, where necessary, the corresponding parts of vector b). All movable methods also receive the bi-directional channels pointing to the process's neighbors and to the process itself, as well as the semi-directional channel used to return the result of the computations. Actually, only p processes return values: these are the processes located on the diagonal of the p-by-q process grid. We also have one Get xChannel bound here, which is used to receive the parts of the calculated vector x from the running movable methods and to merge these parts into the resulting vector.
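The Solve listing is not reproduced in the transcript. The following Java sketch (our own analogue, reusing the BdChannel sketch above; all names are assumptions) shows only the structure the text describes: wiring the p-by-q grid of neighbor links, launching p*q workers, and merging the p diagonal results into x. The block elimination itself is deliberately left as a stub.

```java
import java.util.concurrent.*;

// Structural sketch of the Solve method described above (not the MC# code).
public class HplSolveSketch {
    record XPart(int rowBlock, double[] values) {}

    @SuppressWarnings("unchecked")
    static double[] solve(int p, int q, int blockSize) throws InterruptedException {
        // h[i][j] links (i,j) with (i,j+1); v[i][j] links (i,j) with (i+1,j).
        BdChannel.Endpoint<double[]>[][][] h = new BdChannel.Endpoint[p][q - 1][];
        BdChannel.Endpoint<double[]>[][][] v = new BdChannel.Endpoint[p - 1][q][];
        for (int i = 0; i < p; i++) for (int j = 0; j < q - 1; j++) h[i][j] = BdChannel.connect();
        for (int i = 0; i < p - 1; i++) for (int j = 0; j < q; j++) v[i][j] = BdChannel.connect();

        BlockingQueue<XPart> xChannel = new LinkedBlockingQueue<>();  // the "Get xChannel" side
        ExecutorService pool = Executors.newFixedThreadPool(p * q);
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < q; j++) {
                // Channels to the four neighbors (null on the border of the grid).
                BdChannel.Endpoint<double[]> north = i > 0 ? v[i - 1][j][1] : null;
                BdChannel.Endpoint<double[]> south = i < p - 1 ? v[i][j][0] : null;
                BdChannel.Endpoint<double[]> west  = j > 0 ? h[i][j - 1][1] : null;
                BdChannel.Endpoint<double[]> east  = j < q - 1 ? h[i][j][0] : null;
                final int row = i, col = j;
                pool.execute(() -> {
                    // ... the per-process block elimination would go here, exchanging
                    //     pivot data with north/south/west/east as the slides describe ...
                    if (row == col) {              // only the diagonal processes return x parts
                        try { xChannel.put(new XPart(row, new double[blockSize])); }
                        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                    }
                });
            }
        }
        double[] x = new double[p * blockSize];    // merge the p parts into the full vector x
        for (int k = 0; k < p; k++) {
            XPart part = xChannel.take();
            System.arraycopy(part.values(), 0, x, part.rowBlock() * blockSize, blockSize);
        }
        pool.shutdown();
        return x;
    }
}
```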
21. And finally, here is our movable method, hplDistributed:
23. The communication scheme is described in more detail in the next slides.
24. Step 1: Calculating vector y
Calculate y0
25-28. [Diagrams: animation frames showing how the L and U fragments (L00, U00, U01, U02, ...) and the partial sums (ySum) travel between neighboring processes while y1 is calculated.]
29. Calculate y2
31. Calculate y3
33. Calculate y4
35. Calculate y5
36. The final distribution of the L and U matrix fragments.
[Diagram: a 6x6 grid showing, for each process (i,j), the fragments it ends up holding; process (0,0) holds L00 and U00, process (1,1) holds L10, L11 and U01, U11, and in general process (i,j) accumulates the L fragments of block row i and the U fragments of block column j up to min(i,j), e.g. process (5,5) holds L50..L55 and U05..U55.]
37. Step 2: Calculating vector x
Pass x5 to the main method
Calculate x5
38. Calculate x4
39. Pass x4 to the main method
40. Calculate x3
41. Pass x3 to the main method
42. Calculate x2
43. Pass x2 to the main method
44. Calculate x1
45. Pass x1 to the main method
46. Calculate x0 and pass it to the main method
47. Vector x can now be merged from the fragments on the main node!
Calculate x0 and pass it to the main method
48. Here are the measurements for the algorithm described in the previous slides.
hpl_notparallel.mcs — non-parallel version
SKIF cluster, 16x2 nodes, http://skif.botik.ru/
vadim@skif gfft$ uname -a
Linux skif 2.4.27 #1 SMP Thu Apr 14 15:25:11 MSD 2005 i686 athlon i386 GNU/Linux
hpl_7.mcs: P=10, Q=10, NP=32
mono hpl_7.exe N 10 10 /np 32
Note: this version includes the time needed to generate matrix A and vector b.
This implementation has the drawback that all communications go through the cluster's frontend. This happens because all bi-directional channels were initially created on the frontend machine. It is possible to reduce the execution time by using a mutual exchange of bi-directional channels between neighboring processes (see the next slides to understand how this can be done).
49. Step 0: Exchange of bi-directional channels. Each process creates a bi-directional channel locally and passes it to the right process.
50. Each process passes a locally created bi-directional channel to the left process.
51. Each process passes a locally created bi-directional channel to the bottom process.
52. Each process passes a locally created bi-directional channel to the top process.
53. This is how the previous four slides can be written in the MC# language (this code should be inserted at the beginning of the method):
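The MC# listing is not in the transcript; the following Java sketch (our own analogue, building on the BdChannel sketch above) shows the idea of the exchange step: each process creates fresh channel ends locally, hands one end to each neighbor over the initial frontend-created links, and from then on uses the locally created ends for node-to-node traffic.

```java
import java.util.concurrent.*;

// Our own rendering of the exchange step from slides 49-52, not the MC# code.
public class ChannelExchange {
    // Per neighbor, keep the end created locally (used for receiving) and the
    // end the neighbor created for us (used for sending).
    record NeighborLink(BdChannel.Endpoint<double[]> mine, BdChannel.Endpoint<double[]> theirs) {}

    @SuppressWarnings("unchecked")
    static NeighborLink[] exchange(BdChannel.Endpoint<Object>[] frontendLinks)
            throws InterruptedException {
        int dirs = frontendLinks.length;                  // 4: top, bottom, left, right
        NeighborLink[] links = new NeighborLink[dirs];
        BdChannel.Endpoint<double[]>[][] created = new BdChannel.Endpoint[dirs][];
        // Step 0a: create a local channel per neighbor and send one end over
        // the old frontend-created link.
        for (int d = 0; d < dirs; d++) {
            if (frontendLinks[d] == null) continue;       // border processes have fewer neighbors
            created[d] = BdChannel.connect();
            frontendLinks[d].send(created[d][1]);
        }
        // Step 0b: receive the end each neighbor created for us.
        for (int d = 0; d < dirs; d++) {
            if (frontendLinks[d] == null) continue;
            BdChannel.Endpoint<double[]> theirs =
                    (BdChannel.Endpoint<double[]>) frontendLinks[d].receive();
            links[d] = new NeighborLink(created[d][0], theirs);
        }
        return links;
    }
}
```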
And this is the difference in execution time:
54. If we compare these two implementations and have a look at the statistics provided by the MC# runtime, we'll see that in hpl_8.mcs, during the calculation on a 4000x4000 matrix, 974'032'636 bytes of information are transferred between the nodes:

skif gfft$ mono hpl_8.exe 4000 10 10 /np 32
________________________________________________
MC# Statistics
Number of movable calls: 100
Number of channel messages: 10
Number of movable calls (across network): 100
Number of channel messages (across network): 10
Total size of movable calls (across network): 128114740 bytes
Total size of channel messages (across network): 33890 bytes
Total time of movable calls serialization: 00:00:14.0937610
Total time of channel messages serialization: 00:00:00.0106820
Total size of transported messages: 974032636 bytes
Total time of transporting messages: 00:01:09.7663320
Session initialization time: 00:00:00.4988500 / 0.49885 sec. / 498.85 msec.
Total time: 00:20:32.4156470 / 1232.415647 sec. / 1232415.647 msec.
________________________________________________

While in hpl_7.mcs, during the calculation on the same 4000x4000 matrix, 1'819'549'593 bytes of information are transferred between the nodes:

skif gfft$ mono hpl_7.exe 4000 10 10 /np 32
________________________________________________
MC# Statistics
Number of movable calls: 100
Number of channel messages: 10
Number of movable calls (across network): 100
Number of channel messages (across network): 10
Total size of movable calls (across network): 128114740 bytes
Total size of channel messages (across network): 33890 bytes
Total time of movable calls serialization: 00:00:20.9258120
Total time of channel messages serialization: 00:00:00.0719850
Total size of transported messages: 1819549593 bytes
Total time of transporting messages: 00:02:38.6718660
Session initialization time: 00:00:00.4900100 / 0.49001 sec. / 490.01 msec.
Total time: 00:21:12.4946390 / 1272.494639 sec. / 1272494.639 msec.
________________________________________________
55. Explaining the figures / Limitations of the implementation
1) The implemented HPL algorithm was selected by the principle "as simple as possible to understand and to read in the final code". It is possible to significantly improve performance by using more advanced algorithms of Panel Broadcasting and Update, and Look-Ahead heuristics.
2) The MC# Runtime system has not been optimized yet for a really large number of processors; it works quite well for clusters with NP < 16. Note that MC# is still a research project; it is just a matter of time before we get a really effective runtime system.
3) Currently there are no broadcast operations in the MC# syntax. It looks like we'll have to add such a capability to the language in the future.
4) HPL requires intensive usage of network bandwidth. A speedup is possible if the MC# runtime uses SCI network adapters (currently in development). In these measurements we used standard Ethernet adapters.
5) MC# uses the standard .Net Binary Serializer for transferring objects from one node to another. This operation is quite memory-consuming. Improved performance can be achieved by writing custom serializers.
6) The Mono implementation of the .Net platform is not yet as fast as the implementation from Microsoft.
56. Thanks for your time!
MC# Homepage: http://u.pereslavl.ru/vadim/MCSharp/ (the project site may be temporarily down during October/November due to hardware upgrade works)
Special thanks to:
Yury P. Serdyuk, for his great work on the MC# project and his help in preparing this document
Program Systems Institute / University of Pereslavl, for hosting the MC# project homepage