Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters

About This Presentation

Title:

Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters

Description:

The belief is that native executables are optimized by the C compiler and will ... The executables produced by Hyperion are then executed by the respective native ... – PowerPoint PPT presentation

Number of Views:70

Avg rating:3.0/5.0

Slides: 71

Provided by: mathe82

Learn more at: https://www.cs.unh.edu

Category:

more less

Transcript and Presenter's Notes

Title: Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters

1
Comparing The Performance Of Distributed Shared
Memory And Message Passing Programs Using The
Hyperion Java Virtual Machine On Clusters

Mathew Reno

2
Overview

For this thesis we wanted to evaluate the
performance of the Hyperion distributed virtual
machine, designed at UNH, when compared to a
preexisting parallel computing API.
The results would indicate where Hyperions
strength and weaknesses were and possibly
validate Hyperion as a high-performance computing
alternative.

3
What Is A Cluster?

A cluster is a group of lowcost computers
connected with an offtheshelf network.
The clusters network is isolated from WAN data
traffic and the computers on the cluster are
presented to the user as a single resource.

4
Why Use Clusters?

Clusters are cost effective when compared to
traditional parallel systems.
Clusters can be grown as needed.
Software components are based on standards
allowing portable software to be designed for the
cluster.

5
Cluster Computing

Cluster computing takes advantage of the cluster
by distributing computational workload among
nodes of the cluster, thereby reducing total
computation time.
There are many programming models for
distributing data throughout the cluster.

6
Distributed Shared Memory

Distributed Shared Memory (DSM) allows the user
to view the whole cluster as one resource.
Memory is shared among the nodes. Each node has
access to all other nodes memory as if it owns
it.
Data coordination among nodes is generally hidden
from the user.

7
Message Passing

Message Passing (MP) requires explicit messages
to be employed to distribute data throughout the
cluster.
The programmer must coordinate all data exchanges
when designing the application through a language
level MP API.

8
Related Treadmarks Vs. PVM

Treadmarks (Rice, 1995) implements a DSM model
while PVM implements a MP model. The two
approaches were compared with benchmarks.
On average, PVM was found to perform two times
better the Treadmarks.
Treadmarks suffered from excessive messages that
were required for the requestresponse
communication DSM model employed.
Treadmarks was found to be more natural to
program with saving development time.

9
Hyperion

Hyperion is a distributed Java Virtual Machine
(JVM), designed at UNH.
The Java language provides parallelism through
its threading model. Hyperion extends this model
by distributing the threads among the cluster.
Hyperion implements the DSM model via DSM-PM2,
which allows for lightweight thread creation and
data distribution.

10
Hyperion, Continued

Hyperion has a fixed memory size that it shares
with all threads executing across the cluster.
Hyperion uses pagebased data distribution if a
thread accesses memory it does not have locally,
a pagefault occurs and the memory is transmitted
from the node that owns the memory to the
requesting node a page at a time.

11
Hyperion, Continued

Hyperion translates Java bytecodes into native C
code.
A native executable is generated by a native C
compiler.
The belief is that native executables are
optimized by the C compiler and will benefit the
application by executing faster than interpreted
code.

12
Hyperions Threads

Threads are created in a round robin fashion
among the nodes of the cluster.
Data is transmitted between threads via a
request/response mechanism. This approach
requires two messages.
In order to respond to a request message, a
response thread must be scheduled. This thread
handles the request by sending back the requested
data in a response message.

13
mpiJava

mpiJava is a Java wrapper for the Message Passing
Interface (MPI).
The Java Native Interface (JNI) is used to
translate between Java and native code.
We used MPICH for the native MPI implementation.

14
Clusters

The Star cluster (UNH) consists of 16 PIII
667MHz Linux PCs on a 100Mb Fast Ethernet
network. TCP is communication protocol.
The Paraski cluster (France) consists of 16
PIII 1GHz Linux PCs on a 2Gb Myrinet network. BIP
(DSM) and GM (MPI) are the communication
protocols.

15
Clusters, Continued

The implementation of MPICH on BIP was not stable
in time for this thesis. GM had to be used in
place of BIP for MPICH. GM has not been ported to
Hyperion and a port would be unreasonable at this
time.
BIP performs better than GM as the message size
increases.

16
BIP vs. GM Latency (Paraski)
17
DSM MPI In Hyperion

For consistency, mpiJava was ported into
Hyperion.
Both DSM and MPI versions of the benchmarks could
be compiled by Hyperion.
The executables produced by Hyperion are then
executed by the respective native launchers (PM2
and MPICH).

18
Benchmarks

The Java Grande Forum (JGF) developed a suite of
benchmarks to test Java implementations.
We used two of the JGF benchmark suites,
multithreaded and javaMPI.

19
Benchmarks, Continued

Benchmarks used
Fourier coefficient analysis
Lower/upper matrix factorization
Successive over-relaxation
IDEA encryption
Sparse matrix multiplication
Molecular dynamics simulation
3D Ray Tracer
Monte Carlo simulation (only with MPI)

20
Benchmarks And Hyperion

The multi-threaded JGF benchmarks had
unacceptable performance when run out of the
box.
Each benchmark creates all of its data objects on
the root node causing all remote object access to
occur through this one node.
This type of access causes a performance
bottleneck on the root node as it has to service
all the requests while calculating its algorithm
part.
The solution was to modify the benchmarks to be
cluster aware.

21
Hyperion Extensions

Hyperion makes up for Javas limited thread data
management by providing efficient reduce and
broadcast mechanisms.
Hyperion also provides a cluster aware
implementation of arraycopy.

22
Hyperion Extension Reduce

Reduce blocks all enrolled threads until each
thread has the final result of the reduce.
This is done by neighbor threads exchanging their
data for computation, then their neighbors, and
so on until each thread has the same answer.
This operation is faster and scales well as
opposed to performing the calculation serially.
The operation is LogP.

23
Hyperion Extension Broadcast

The broadcast mechanism transmits the same data
to all enrolled threads.
Like reduce, data is distributed to the threads
in a LogP fashion, which scales better than
serial distribution of data.

24
Hyperion Extension Arraycopy

The arraycopy method is part of the Java System
class. The Hyperion version was extended to be
cluster aware.
If data is copied across threads, this version
will send all data as one message instead of
relying on paging mechanisms to access remote
array data.

25
Benchmark Modifications

The multithreaded benchmarks had unacceptable
performance.
The benchmarks were modified in order to reduce
remote object access and root node bottlenecks.
Techniques, such as arraycopy, broadcast and
reduce were employed to improve performance.

26
Experiment

Each benchmark was executed 50 times at each node
size to provide a sample mean.
Node sizes were 1, 2, 4, 8, and 16.
Confidence intervals (95 level) were used to
determine which version, MPI or DSM, performed
better.

27
Results On The Star Cluster
28
Results On The Paraski Cluster
29
Fourier Coefficient Analysis

Calculates the first 10,000 pairs of Fourier
coefficients.
Each node is responsible for calculating its
portion of the coefficient array.
Each node sends back its array portion to the
root node, which accumulates the final array.

30
Fourier DSM Modifications

The original multithreaded version required all
threads to update arrays located on the root
node, causing the root node to be flooded with
requests.
The modified version used arraycopy to distribute
the local arrays back to the root thread arrays.

31
Fourier mpiJava

The mpiJava version is similar to the DSM
version.
Each process is responsible for its portion of
the arrays.
MPI_Ssend and MPI_Recv were called to distribute
the array portions to the root process.

32
Fourier Results
33
Fourier Conclusions

Most of the time in this benchmark is spent in
the computation.
Network communication does not play a significant
role in the overall time.
Both MPI and DSM perform similar on each cluster,
scaling well when more nodes are added.

34
Lower/Upper Factorization

Solves a 500 x 500 linear system with LU
factorization followed with a triangular solve.
The factorization is parallelized while the
triangular solve is computed in serial.

35
LU DSM Modifications

The original version created the matrix on the
root thread and all access was through this
thread, causing performance bottlenecks.
The benchmark was modified to use Hyperions
Broadcast facility to distribute the pivot
information and arraycopy was used to coordinate
the final data for the solve.

36
LU mpiJava

MPI_Bcast is used to distribute the pivot
information.
MPI_Send and MPI_Recv are used so the root
process can acquire the final matrix.

37
LU Results
38
LU Conclusions

While the DSM version uses a similar data
distribution mechanism as the MPI version, there
is significant overhead that is exposed when
executing these methods in large loops.
This overhead is minimized on the Paraski cluster
due to the nature Myrinet and BIP.

39
Successive Over-Relaxation

Performs 100 iterations of SOR on a 1000 x 1000
grid.
A red-black ordering mechanism allows array
rows to be distributed to nodes in blocks.
After initial data distribution, only neighbor
rows need be communicated during the SOR.

40
SOR DSM Modifications

Excessive remote thread object access made it
necessary to modify the benchmark.
Modified version uses arraycopy to update
neighbor rows during the SOR.
When the SOR completes, arraycopy is used to
assemble the final matrix on the root thread.

41
SOR mpiJava

MPI_Sendrecv is used to exchange neighbor rows.
MPI_Ssend and MPI_Recv are used to build the
final matrix on the root process.

42
SOR Results
43
SOR Conclusions

The DSM version requires an extra barrier after
row neighbors are exchanged due to the network
reactivity problem.
A thread must be able to service all requests in
a timely fashion. If the thread is busy
computing, it cannot react quickly enough to
schedule the request thread.
The barrier will block all threads until each
reaches the barrier, which guarantees that all
nodes have their requested data and it is safe to
continue with the computation.

44
IDEA Crypt

Performs IDEA encryption and decryption on a
3,000,000 byte array.
The array is divided among nodes in a block
manner.
Each node encrypts and decrypts its portion.
When complete, the root node collects the
decrypted array for validation.

45
Crypt DSM Modifications

The original created the whole array on the root
thread and required each remote thread to page in
their portions.
The modified version used arraycopy to distribute
each threads portion from the root thread.
When decryption finishes, arraycopy copies back
the decrypted portion to the root thread.

46
Crypt mpiJava

The mpiJava version uses MPI_Ssend to send the
array portions to the remote processes and
MPI_Recv to receive the portions.
When complete, MPI_Ssend is used to send back the
processes portion and MPI_Recv receives each
portion.

47
Crypt Results
48
Crypt Conclusions

Results are similar on both clusters.
There is a slight performance problem with 4 and
8 nodes with the DSM version.
This can be attributed to the placing of a
barrier that causes all threads to block before
computing, in the DSM version, while the MPI
version does not block.

49
Sparse Matrix Multiplication

A 50,000 x 50,000 unstructured matrix stored in
compressed-row format is multiplied over 200
iterations.
Only the final result is communicated as each
node has its own portion of data and initial
distribution is not timed.

50
Sparse DSM Modifications

This benchmark originally produced excessive
network traffic through remote object access.
The modifications involved removing the object
access during the multiplication loop and using
arraycopy to distribute the final result to the
root thread.

51
Sparse mpiJava

This benchmarks only communication is an
MPI_Allreduce, which performs a array reduce
leaving the result on all nodes. This is employed
to obtain the final result of the multiplication.

52
Sparse Results
53
Sparse Conclusions

The results are similar on both clusters.
The DSM version has better performance on both
cluster.
The MPI version uses the MPI_Allreduce method
that places the results on all nodes.
This method adds extra overhead that is not
present in the DSM version, where the results are
just collected on the root node.

54
Molecular Dynamics

This benchmark is a 2048-body code that models
particles interacting under a Lennard-Jones
potential in a cubic spatial volume with periodic
boundary conditions.
Parallelization is provided by dividing the range
of iterations over the particles among nodes.

55
MolDyn DSM Modifications

Significant amount of remote thread object access
necessitated modifications.
Particle forces are updated by remote threads
using arraycopy to send the forces to the root
thread and the root thread serially updates the
forces and sends the new force array back to the
remote threads via arraycopy.

56
MolDyn mpiJava

This version uses six MPI_Allreduce commands to
update various force and movement arrays. This
occurs at every time step.

57
MolDyn Results
58
MolDyn Conclusions

The DSM version must update particle forces
serially on the root thread.
This causes all threads to block while sending
its local forces to the root thread and wait for
the updated forces to be sent back.
The MPI version uses MPI_Allreduce to efficiently
compute the forces among all the processes in a
parallel fashion causing nodes to only block for
its neighbor force updates.

59
Ray Tracer

A scene with 64 spheres is rendered at 150 x 150
pixels.
Parallelization is provided by a cyclic
distribution for load balancing when looping over
the rows of pixels.

60
Ray Tracer DSM Modifications

This benchmark was poorly designed as it created
far too many temporary objects. The DSM version
was heavily modified to eliminate temporary
object creation.
Broadcast is used to transmit the array row
reference to each thread.
Once the rendering is complete, arraycopy is used
to copy the row data to the root thread.
Reduce is used to compute the pixel checksum for
validation purposes.

61
Ray Tracer mpiJava

The mpiJava version was also modified to remove
temporary object creation.
MPI_Reduce is used to generate the pixel
checksum.
MPI_Send and MPI_Recv are used to transmit the
row data to the root process.

62
Ray Tracer Results
63
Ray Tracer Conclusions

The results on both nodes are almost identical.
Very little network communication is required.
Most of the time is spent rendering the scene.

64
Monte Carlo

Uses Monte Carlo techniques to price products
derived from the worth of an underlying asset.
Generates 2,000 samples.
Parallelization is provided by dividing the work
in the principle loop of the Monte Carlo runs in
block fashion and distributing the blocks to
remote nodes.

65
Monte Carlo Lack Of DSM

This benchmark required an unacceptable amount of
memory for Hyperion to handle.
It is embarrassingly parallel and we have other
embarrassingly parallel benchmarks.
We felt that it was unnecessary to develop a
working DSM version from this large set of code.

66
Monte Carlo mpiJava

The mpiJava version provided some insight into
the mpiJava to Hyperion port, specifically in the
object serialization portion.
Monte Carlo relies heavily on object
serialization as Java objects are distributed via
send and receive commands instead of primitive
types.
By creating a working version of Monte Carlo, we
were able to eliminate many subtle mpiJava to
Hyperion port bugs.

67
Monte Carlo Results
68
Monte Carlo Conclusions

The MPI version scales well on both clusters.
Without a DSM version, a comparison is not
possible.

69
Conclusions

The Hyperion system can compete with traditional
parallel programming models.
However, to compete, a Hyperion user cannot
simply write multi-threaded Java code and expect
it to perform well on a cluster.
Users must be aware of how thread creation works
in Hyperion and the effects of remote object
access in order to tune their applications.

70
Conclusions, Continued

With an application developed using Hyperions
thread management and data exchange techniques
(reduce, broadcast, arraycopy), a Hyperion user
can achieve competitive performance.
We feel that methods for performing operations
on groups of threads, such as the above, should
be part of the Java threading API, as they could
be useful even outside a parallel environment.

Write a Comment

User Comments (0)