Title: MPJ Express: An Implementation of Message Passing Interface (MPI) in Java
1 MPJ Express: An Implementation of Message Passing Interface (MPI) in Java
- Aamir Shafi
- http://mpj-express.org
- http://acet.rdg.ac.uk/projects/mpj
2 Writing Parallel Software
- There are two main approaches to writing parallel software, i.e. software that can be executed on parallel hardware to exploit its computational and memory resources
- The first approach is to use messaging libraries (packages) written in existing languages like C, Fortran, and Java
- Message Passing Interface (MPI)
- Parallel Virtual Machine (PVM)
- The second, more radical approach is to provide new languages; HPC has a history of novel parallel languages
- High Performance Fortran (HPF)
- Unified Parallel C (UPC)
- In this talk we present an implementation of MPI in Java called MPJ Express
3 Introduction to Java for HPC
- Java was released by Sun in 1996
- A mainstream language in the software industry
- Attractive features include
- Portability
- Automatic garbage collection
- Type-safety at compile time and runtime
- Built-in support for multi-threading
- A possible option for nested parallelism on multi-core systems
- Performance
- The javac compiler translates source code to byte code
- Modern JVMs use Just-In-Time (JIT) compilers to translate byte code to native machine code on the fly
- But Java has safety features that may limit performance
4 Introduction to Java for HPC
- Three existing approaches to Java messaging
- Pure Java (sockets based)
- Java Native Interface (JNI)
- Remote Method Invocation (RMI)
- mpiJava has been perhaps the most popular Java messaging system
- mpiJava (http://www.hpjava.org/mpiJava.html)
- MPJ/Ibis (http://www.cs.vu.nl/ibis/mpj.html)
- Motivation for a new Java messaging system
- Maintain compatibility with Java threads by providing thread-safety
- Handle the conflicting goals of high performance and portability
5 Distributed Memory Cluster
[Figure: eight processes (Proc 0 to Proc 7) on a distributed memory cluster exchanging messages over a LAN interconnect such as Ethernet, Myrinet, or Infiniband]
6 (No Transcript)
7 Write machines files
8 Bootstrap MPJ Express runtime
9 Write Parallel Program
10 Compile and Execute
11 Introduction to MPJ Express
- MPJ Express is an implementation of a Java messaging system, based on the Java bindings of MPI
- Will eventually supersede mpiJava
- Developed by Aamir Shafi, Bryan Carpenter, and Mark Baker
- Thread-safe communication devices using Java NIO and Myrinet
- Maintains compatibility with Java threads
- The buffering layer provides explicit memory management instead of relying on the garbage collector
- Runtime system for portable bootstrapping
12 James Gosling Says
13 Who is using MPJ Express?
- First released in September 2005 under LGPL (an open-source licence)
- Approximately 1000 users around the world
- Some projects using this software
- CartaBlanca, a simulation package that uses Jacobian-Free Newton-Krylov (JFNK) methods to solve non-linear problems
- The project is carried out at Los Alamos National Laboratory (LANL) in the US
- Researchers at the University of Leeds, UK have used the software in the Modelling and Simulation in e-Social Science (MoSeS) project
- Teaching purposes
- Parallel Programming using Java (PPJ)
- http://www.sc.rwth-aachen.de/Teaching/Labs/PPJ05/
- Parallel Processing SS 2006
- http://tramberend.inform.fh-hannover.de/
14 MPJ Express Design
15 Presentation Outline
- Implementation Details
- Point-to-point communication
- Communicators, groups, and contexts
- Process topologies
- Derived datatypes
- Collective communications
- MPJ Express Buffering Layer
- Runtime System
- Performance Evaluation
16 Java NIO Device
- Uses non-blocking I/O functionality
- Implements two communication protocols
- Eager-send protocol for small messages
- Rendezvous protocol for large messages
- Naive locks around communication methods can result in deadlock
- In Java, the keyword synchronized ensures that only one thread at a time can call a synchronized method on an object
- Example: a process sending a message to itself using a synchronous send would deadlock
- Locks for thread-safety
- Writing messages
- A lock for send-communication-sets
- Locks for destination channels
- One for every destination process
- Obtained one after the other
- Reading messages
- A lock for receive-communication-sets
17 Standard mode with eager-send protocol (small messages)
18 Standard mode with rendezvous protocol (large messages)
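The split between the two protocols on the slides above comes down to a size check at send time. A minimal sketch in Java, assuming an illustrative 128 KB cut-over; the actual MPJ Express threshold and method names are implementation details, not its API:

```java
// Sketch of how a messaging device might pick a wire protocol by message size.
public class ProtocolSelect {
    enum Mode { EAGER, RENDEZVOUS }

    // Assumed threshold for illustration only; the real value is configurable.
    static final int EAGER_LIMIT = 128 * 1024;

    static Mode select(int messageBytes) {
        // Eager: push the bytes immediately and buffer them at the receiver.
        // Rendezvous: exchange ready-to-send/ok-to-send control messages first,
        // so the receiver can post a matching buffer before data flows.
        return messageBytes <= EAGER_LIMIT ? Mode.EAGER : Mode.RENDEZVOUS;
    }

    public static void main(String[] args) {
        System.out.println(select(1024));        // small message -> EAGER
        System.out.println(select(1 << 20));     // 1 MB message -> RENDEZVOUS
    }
}
```

Rendezvous avoids buffering large payloads at the receiver at the cost of an extra control-message round trip, which is why it only pays off for large messages.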
19 MPJ Express Buffering Layer
- MPJ Express requires a buffering layer
- To use Java NIO: SocketChannels use byte buffers for data transfer
- To use proprietary networks like Myrinet efficiently
- To implement derived datatypes
- Various implementations are possible based on the actual storage medium
- Direct or indirect ByteBuffers
- An mpjbuf buffer object consists of
- A static buffer to store primitive datatypes
- A dynamic buffer to store serialized Java objects
- Creating ByteBuffers on the fly is costly
- Memory management is based on Knuth's buddy algorithm
- Two implementations of memory management
20 MPJ Express Buffering Layer
- Frequent creation and destruction of communication buffers hurts performance
- To tackle this, MPJ Express provides a buffering layer
- Provides two implementations of Knuth's buddy algorithm
- Uses direct ByteBuffers to work with Java NIO and proprietary networks
- Implements derived datatypes
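The bookkeeping behind Knuth's buddy algorithm, which the buffering layer uses for its pooled buffer regions, reduces to power-of-two arithmetic. A minimal sketch; the method names are illustrative and not the MPJ Express API:

```java
// Size arithmetic for a buddy allocator over a pooled buffer region.
public class Buddy {
    // Round a request up to the next power of two: the block size actually
    // handed out, so every block has a well-defined buddy.
    static int blockSize(int request) {
        if (request <= 1) return 1;
        return Integer.highestOneBit(request - 1) << 1;
    }

    // The buddy of the block at 'offset' is found by flipping the bit that
    // corresponds to the block size; when both buddies are free they merge
    // back into one block of twice the size.
    static int buddyOffset(int offset, int size) {
        return offset ^ size;
    }

    public static void main(String[] args) {
        System.out.println(blockSize(100));      // 100 bytes -> 128-byte block
        System.out.println(buddyOffset(0, 128)); // buddy of block 0 is at 128
    }
}
```

Pooling power-of-two blocks this way avoids the cost the slide mentions of creating ByteBuffers on the fly, at the price of some internal fragmentation.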
21 Presentation Outline
- Implementation Details
- Point-to-point communication
- Communicators, groups, and contexts
- Process topologies
- Derived datatypes
- Collective communications
- MPJ Express Buffering Layer
- Runtime System
- Performance Evaluation
22 Communicators, groups, and contexts
- MPI provides higher-level abstractions to help create parallel libraries
- Safe communication space
- Group scope for collective operations
- Process naming
- Communicators and groups provide
- Process naming (instead of IP addresses and ports)
- Group scope for collective operations
- Contexts
- Safe communication
23 What is a group?
- A data structure that contains processes
- Main functionality: keeping track of the ranks of processes
- Explanation of figure
- Group A contains eight processes
- Groups B and C are created from Group A
- All group operations are local (no communication with remote processes)
24 Example of a group operation (Union)
- Explanation of the union operation
- Two processes, a and d, are in both groups
- Thus, six processes are executing this operation
- Each group has its own view of this group operation (apply the theory of relativity!)
- Ranks are re-assigned in the new group
- Process 0 in group A is re-assigned rank 0 in Group C
- Process 0 in group B is re-assigned rank 4 in Group C
- Any existing process that does not make it into the new group receives MPI.GROUP_EMPTY
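The rank re-assignment rule above can be mimicked with plain lists: members of A keep their ranks, and members of B not already in A are appended in B's rank order. A sketch in which strings stand in for processes; this is not the actual Group API:

```java
import java.util.*;

// Illustrative re-implementation of the union rank rule described on the slide.
public class GroupUnion {
    static List<String> union(List<String> a, List<String> b) {
        List<String> c = new ArrayList<>(a);  // A's ranks 0..|A|-1 are preserved
        for (String p : b)                    // B's processes follow, in B's order,
            if (!c.contains(p)) c.add(p);     // skipping those already in A
        return c;
    }

    public static void main(String[] args) {
        // a and d belong to both groups, matching the slide's figure.
        List<String> groupA = Arrays.asList("a", "b", "c", "d");
        List<String> groupB = Arrays.asList("e", "f", "a", "d");
        System.out.println(union(groupA, groupB)); // [a, b, c, d, e, f]
    }
}
```

With these example members, group B's process 0 ("e") receives rank 4 in the union, exactly as the slide states.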
25 What are communicators?
- A data structure that contains groups (and thus processes)
- Why is it useful?
- Process naming: ranks are names for application programmers
- Easier than IP addresses and ports
- Supports group communications as well as point-to-point communication
- There are two types of communicators
- Intracommunicators: communication within a group
- Intercommunicators: communication between two groups (which must be disjoint)
26 What are contexts?
- A unique integer, acting as an additional tag on messages
- Each communicator has a distinct context that provides a safe communication universe
- A context is agreed upon by all processes when a communicator is built
- Intracommunicators have two contexts
- One for point-to-point communications
- One for collective communications
- Intercommunicators also have two contexts
- Explained in the coming slides
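The safety a context buys can be sketched as a matching rule: an incoming message matches a posted receive only when both the communicator's context and the user-level tag agree, so identical tags on different communicators never collide. The names below are illustrative, not the internal MPJ Express matching code:

```java
// Sketch of context-qualified message matching inside a messaging library.
public class ContextMatch {
    static boolean matches(int msgContext, int msgTag, int recvContext, int recvTag) {
        // The context acts as an extra, library-controlled tag on every message.
        return msgContext == recvContext && msgTag == recvTag;
    }

    public static void main(String[] args) {
        // Same user tag (7), different communicators (contexts 1 vs 2): no match,
        // so a parallel library's traffic cannot be intercepted by user code.
        System.out.println(matches(1, 7, 2, 7)); // false
        System.out.println(matches(1, 7, 1, 7)); // true
    }
}
```

This is what makes communicators safe building blocks for parallel libraries: the library communicates on its own context, invisible to messages on MPI.COMM_WORLD.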
27 Process topologies
- Used to arrange processes in a geometric shape
- Virtual topologies have no necessary connection with the physical layout of machines
- It is possible, however, to exploit the underlying machine architecture
- These virtual topologies can be assigned to processes in an Intracommunicator
- MPI provides
- Cartesian topology
- Graph topology
28 Cartesian topology: mapping four processes onto a 2x2 grid
- Each process is assigned a coordinate
- Rank 0: (0,0)
- Rank 1: (1,0)
- Rank 2: (0,1)
- Rank 3: (1,1)
- Uses
- Calculate a rank from its grid (not the Globus kind!) position
- Calculate grid positions from ranks
- Easier to locate the ranks of neighbours
- Applications may have matching communication patterns, with lots of messaging between immediate neighbours
29 Periods in Cartesian topology
- Axis 1 (y-axis) is periodic
- Processes in the top and bottom rows have valid neighbours towards the top and bottom respectively
- Axis 0 (x-axis) is non-periodic
- Processes in the right and left columns have undefined neighbours towards the right and left respectively
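The rank/coordinate arithmetic behind the 2x2 grid and its periodicity can be written out directly. A sketch using the slide's row-major coordinates, with -1 standing in for an undefined (MPI.PROC_NULL-style) neighbour; the class and method names are illustrative:

```java
// Rank/coordinate arithmetic for the 2x2 Cartesian topology on the slides.
public class Cart2D {
    static final int NX = 2, NY = 2;

    // Matches the slide: rank 0 -> (0,0), 1 -> (1,0), 2 -> (0,1), 3 -> (1,1).
    static int[] coords(int rank) { return new int[] { rank % NX, rank / NX }; }
    static int rank(int x, int y) { return y * NX + x; }

    // Neighbour of 'rank' shifted by (dx, dy); each axis either wraps around
    // (periodic) or yields -1 for an out-of-range, undefined neighbour.
    static int neighbour(int rank, int dx, int dy, boolean periodicX, boolean periodicY) {
        int x = rank % NX + dx, y = rank / NX + dy;
        if (periodicX) x = Math.floorMod(x, NX); else if (x < 0 || x >= NX) return -1;
        if (periodicY) y = Math.floorMod(y, NY); else if (y < 0 || y >= NY) return -1;
        return rank(x, y);
    }

    public static void main(String[] args) {
        System.out.println(rank(1, 1));                       // 3
        System.out.println(neighbour(0, -1, 0, false, true)); // -1: x-axis non-periodic
        System.out.println(neighbour(0, 0, -1, false, true)); // 2: wraps on periodic y-axis
    }
}
```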
30 Derived datatypes
- Besides basic datatypes, it is possible to communicate heterogeneous, non-contiguous data
- Contiguous
- Indexed
- Vector
- Struct
31 Indexed datatype
- The elements that form this datatype must be
- Of the same type
- At non-contiguous locations
- Displacements add flexibility

  int SIZE = 4;
  int[] blklen = new int[SIZE], displ = new int[SIZE];
  for (int i = 0; i < SIZE; i++) {
      blklen[i] = SIZE - i;        // array of block lengths
      displ[i]  = (i * SIZE) + i;  // array of displacements
  }
  double[] params  = new double[SIZE * SIZE];
  double[] rparams = new double[SIZE * SIZE];
  Datatype ind = Datatype.Indexed(blklen, displ, MPI.DOUBLE);
  MPI.COMM_WORLD.Send(params, 0, 1, ind, dst, tag);  // 0 is offset, 1 is count
  MPI.COMM_WORLD.Recv(rparams, 0, 1, ind, src, tag);
32 (No Transcript)
33 Presentation Outline
- Implementation Details
- Point-to-point communication
- Communicators, groups, and contexts
- Process topologies
- Derived datatypes
- Collective communications
- Runtime System
- Thread-safety in MPJ Express
- Performance Evaluation
34 Collective communications
- Provided as a convenience for application developers
- Save significant development time
- Efficient algorithms may be used
- Stable (well tested)
- Built on top of point-to-point communications
- These operations include Broadcast, Barrier, Reduce, Allreduce, Alltoall, Scatter, Scan, Allgather
- Versions that allow displacements between the data
35 Broadcast, scatter, gather, allgather, alltoall
[Image from the MPI standard document]
36 Reduce collective operations
- MPI.PROD
- MPI.SUM
- MPI.MIN
- MPI.MAX
- MPI.LAND
- MPI.BAND
- MPI.LOR
- MPI.BOR
- MPI.LXOR
- MPI.BXOR
- MPI.MINLOC
- MPI.MAXLOC
37 Barrier with Tree Algorithm
38 Execution of barrier with eight processes
- Eight processes, thus forming only one group
- Each process exchanges an integer 4 times
- Overlaps communications well
39 Intracomm.Bcast( )
- Sends data from one process to all the other processes
- Code from adlib, a communication library for HPJava
- The current implementation is based on an n-ary tree
- Limitation: broadcasts only from rank 0
- Generated dynamically
- Cost: O(log2 N)
- MPICH 1.2.5 uses a linear algorithm
- Cost: O(N)
- MPICH2 has much improved algorithms
- LAM/MPI uses n-ary trees
- Limitation: broadcast only from rank 0
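The O(log2 N) cost can be seen by simulating a tree broadcast schedule. A sketch using a binomial/binary tree for simplicity (the real MPJ Express tree is n-ary and built dynamically): in round r, every rank that already holds the data forwards it to rank + 2^r.

```java
import java.util.*;

// Simulates a binary-tree broadcast schedule from root 0 and counts rounds.
public class TreeBcast {
    // Returns how many communication rounds reach all 'n' ranks from rank 0.
    static int rounds(int n) {
        Set<Integer> have = new HashSet<>(List.of(0)); // ranks holding the data
        int r = 0;
        while (have.size() < n) {
            int step = 1 << r;                          // this round's send distance
            for (int rank : new ArrayList<>(have)) {
                int dest = rank + step;
                if (dest < n) have.add(dest);           // each holder forwards once
            }
            r++;
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(rounds(8)); // 3 rounds = log2(8)
    }
}
```

Every round doubles the number of ranks holding the data, giving ceil(log2 N) rounds versus the N - 1 sends of MPICH 1.2.5's linear algorithm.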
40 Broadcasting algorithm, total processes = 8, root = 0
41 Presentation Outline
- Implementation Details
- Point-to-point communication
- Communicators, groups, and contexts
- Process topologies
- Derived datatypes
- Collective communications
- Runtime System
- Thread-safety in MPJ Express
- Performance Evaluation
42 The Runtime System
43 Thread-safety in MPI
- The MPI 2.0 specification introduced the notion of a thread-compliant MPI implementation
- Four levels of thread-safety
- MPI_THREAD_SINGLE
- MPI_THREAD_FUNNELED
- MPI_THREAD_SERIALIZED
- MPI_THREAD_MULTIPLE
- A blocked thread should not halt the execution of other threads
- See "Issues in Developing Thread-Safe MPI Implementation" by Gropp et al.
44 Presentation Outline
- Implementation Details
- Point-to-point communication
- Communicators, groups, and contexts
- Process topologies
- Derived datatypes
- Collective communications
- Runtime System
- Thread-safety in MPJ Express
- Performance Evaluation
45 Latency on Fast Ethernet
46 Throughput on Fast Ethernet
47 Latency on Gigabit Ethernet
48 Throughput on Gigabit Ethernet
49 Choking experience 1
50 Latency on Myrinet
51 Throughput on Myrinet
52 Questions?