Developing a Scalable Coherent Interface (SCI) device for MPJ Express


1
Developing a Scalable Coherent Interface (SCI)
device for MPJ Express
  • Guillermo López Taboada
  • 14th October, 2005
  • Dept. of Electronics and Systems
  • University of A Coruña (Spain)
  • http://www.des.udc.es
  • Visitor at Distributed Systems Group
  • http://dsg.port.ac.uk

2
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

3
Introduction
  • The interconnection network and its associated
    software libraries play a key role in High
    Performance Clustering Technology
  • Cluster interconnection technologies
  • Gigabit / 10 Gigabit Ethernet
  • Myrinet
  • SCI
  • Infiniband
  • Qsnet
  • Quadrics
  • GSN - HIPPI
  • Giganet
  • Latencies are small (usually under 10 µs)
  • Bandwidths are high (usually above 1 Gbps)

4
Introduction
  • SCI (Scalable Coherent Interface)
  • Latency: 1.42 µs (theoretical)
  • Bandwidth: 5333 Mbps (bidirectional)
  • Usually used without a switch (small clusters)
  • Topologies: 1D (ring) / 2D (torus)

5
Introduction
  • Example of a 2D torus SCI cluster with FE (Fast
    Ethernet, admin network)

6
Introduction
  • Software available from Dolphinics
  • Software available from Scali
  • ScaIP: IP emulation
  • ScaSISCI: SISCI (Software Infrastructure for SCI)
  • ScaMPI: a proprietary MPI implementation

7
Introduction
  • Java's portability means that, for networking,
    only the widely extended TCP/IP is supported by
    the JDK
  • Previously, IP emulations were used (ScaIP,
    SCIP), but their performance is similar to Fast
    Ethernet (FE)
  • Now there is a high-performance socket
    implementation: SCI SOCKETS
  • Similar to other interconnection technologies,
    e.g. Myrinet (IPoGM -> GM Sockets)

8
Introduction
  • Several research projects have been trying to get
    Java support for these System Area Networks,
    mainly for Myrinet
  • KaRMI/GM (JavaParty, Univ. of Karlsruhe)
  • Manta/LFC/Panda/Ibis (Vrije Universiteit, The
    Netherlands)
  • Java GM Sockets
  • RMIX over Myrinet
  • mpiJava/MPICH-GM or MPICH-MX
  • But nothing for SCI

9
Introduction
  • My PhD project
  • Designing Efficient Mechanisms for Java
    Communications on SCI Systems
  • The motivation is to fill the gap between Java
    and this high-speed interconnect, which lacks
    software support for Java
  • SCI Java Fast Sockets
  • An SCI communication device, the base of a
    messaging system
  • An SCI Channel for Java NIO
  • Wrappers for some libraries
  • Optimized RMI for High Speed Networks
  • Low-level Java buffering and communication system

10
Introduction
  • MPJ Express, a reference implementation of the
    MPI bindings for the Java language, has been
    released
  • There are already mature bindings for C, C++, and
    Fortran, but the Java binding is an ongoing effort
    at DSG
  • A good opportunity to provide SCI support to a
    messaging system

11
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

12
Design of scidev
  • Use of the Java Native Interface (JNI) is
    unavoidable (a minimal sketch follows this list)
  • In order to provide support and good performance
    we have to rely on specific low-level libraries
  • In the presence of SCI hardware, it should be used
  • Loss of portability in exchange for higher
    performance
  • Differences between mpiJava and scidev
  • mpiJava: a thin wrapper providing a large number
    of Java MPI primitives
  • scidev: a thicker layer providing a small API
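
A minimal sketch, assuming a hypothetical NativeSci class and library
name, of how the Java side of such a thin native layer could declare
its JNI entry points; the real scidev bindings may differ:

    // Sketch only: hypothetical Java-side declarations of the JNI entry
    // points that wrap the SCI user-level libraries (SCILib/SISCI).
    public final class NativeSci {

        static {
            // Loads e.g. libnativescidev.so from java.library.path
            System.loadLibrary("nativescidev");
        }

        private NativeSci() { }  // static-only holder for native methods

        // Implemented in C on top of SCILib/SISCI
        static native void init(String[] args);
        static native void finish();
        static native int  rank();
        static native void send(java.nio.ByteBuffer buf, int dest,
                                 int tag, int context);
        static native void recv(java.nio.ByteBuffer buf, int src,
                                 int tag, int context);
    }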

13
Design of scidev
  • Implementing the xdev API
  • init()
  • finish()
  • id()
  • iprobe(ProcessID srcID, int tag, int context)
  • irecv(Buffer buf, ProcessID srcID, int tag,
    int context, Status status)
  • isend(Buffer buf, ProcessID destID, int tag,
    int context)
  • and the blocking counterparts of these functions:
    probe, recv, send, issend, ssend (a simplified
    sketch of this interface follows)
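
A simplified sketch of what this device API could look like in Java,
based only on the method list above; Buffer, ProcessID, Status and
Request are placeholders for the real mpjbuf/xdev types:

    // Sketch only: simplified xdev-style device API implemented by scidev.
    class Buffer { }      // placeholder for mpjbuf.Buffer
    class ProcessID { }   // placeholder for the process identifier
    class Status { }      // placeholder for the message status
    class Request { }     // placeholder for a non-blocking request handle

    interface Device {
        void init(String[] args);
        void finish();
        ProcessID id();

        Status iprobe(ProcessID srcID, int tag, int context);

        // Non-blocking point-to-point operations
        Request irecv(Buffer buf, ProcessID srcID, int tag,
                      int context, Status status);
        Request isend(Buffer buf, ProcessID destID, int tag, int context);

        // Blocking/synchronous counterparts
        Status probe(ProcessID srcID, int tag, int context);
        Status recv(Buffer buf, ProcessID srcID, int tag, int context);
        void   send(Buffer buf, ProcessID destID, int tag, int context);
        Request issend(Buffer buf, ProcessID destID, int tag, int context);
        void   ssend(Buffer buf, ProcessID destID, int tag, int context);
    }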

14
Design of scidev
15
Design of scidev
[Layer diagram: JVM → mpjdev → xdev → (scidev, mxdev) → JNI → O.S. →
native libraries]
16
Design of scidev
  • Native libraries: SCILib and SISCI

17
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

18
Implementation Issues
  • Optimizations / initialization process
  • JNI: caching field identifiers and references to
    objects
  • Long protocol: sending 2 messages
  • The first from a 4-byte-aligned address, and the
    second from a 128-byte-aligned address up to a
    128-byte-aligned address (it may go beyond the end
    of the message, since the raw Buffer has a 2^n
    length)
  • Algorithm to initialize the message queues of
    SCILib (sketched after this list)
  • Connect (to the nodes with lower rank)
  • Create (for all nodes, beginning with the
    following rank)
  • Connect (to the remaining nodes)
  • The complexity is O(n)
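
A sketch of this connect/create/connect ordering, with hypothetical
connectQueue()/createQueue() helpers standing in for the JNI calls into
SCILib; each peer is visited a constant number of times, hence O(n)
per node:

    // Sketch only: initialization order of the SCILib message queues.
    public class QueueInit {

        public void initQueues(int rank, int nprocs) {
            // 1) Connect to queues already created by lower-ranked nodes
            for (int peer = 0; peer < rank; peer++) {
                connectQueue(peer);
            }
            // 2) Create local queues for all peers, starting with rank+1
            for (int i = 1; i < nprocs; i++) {
                createQueue((rank + i) % nprocs);
            }
            // 3) Connect to the remaining (higher-ranked) nodes
            for (int peer = rank + 1; peer < nprocs; peer++) {
                connectQueue(peer);
            }
        }

        // Placeholders for the real native calls into SCILib
        private void connectQueue(int peer) { /* connect via JNI */ }
        private void createQueue(int peer)  { /* create via JNI  */ }
    }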

19
Implementation Issues
  • Transport protocols
  • 3 native protocols
  • Inline: 1-113 B
  • Short: 114 B - 64 KB
  • Long: 64 KB - 1 MB
  • scidev fragments messages > 1 MB and uses
  • Inline for control messages and small messages
    (< 113 B)
  • Short with PIO (Programmed Input/Output) for
    messages < 8 KB
  • Short with DMA (Direct Memory Access) for
    messages of 8-64 KB
  • The Long protocol of the user-level libraries does
    not use DMA transfers, so it is replaced by our
    own Long protocol with DMA transfers (a size-based
    selection sketch follows)
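
A sketch of how this size-based protocol choice could look, using the
thresholds listed above; the sendInline/sendShortPio/sendShortDma/
sendLongDma helpers are hypothetical names, not the real scidev calls:

    // Sketch only: choose the transfer protocol by message size.
    public class ProtocolSelect {
        static final int INLINE_MAX   = 113;          // Inline: 1-113 B
        static final int PIO_MAX      = 8 * 1024;     // Short with PIO: < 8 KB
        static final int SHORT_MAX    = 64 * 1024;    // Short with DMA: 8-64 KB
        static final int FRAGMENT_MAX = 1024 * 1024;  // fragment above 1 MB

        void send(byte[] msg) {
            // Messages larger than 1 MB are sent as 1 MB fragments
            for (int off = 0; off < msg.length; off += FRAGMENT_MAX) {
                int len = Math.min(FRAGMENT_MAX, msg.length - off);
                if (len <= INLINE_MAX) {
                    sendInline(msg, off, len);      // control/small messages
                } else if (len < PIO_MAX) {
                    sendShortPio(msg, off, len);    // programmed I/O
                } else if (len <= SHORT_MAX) {
                    sendShortDma(msg, off, len);    // DMA transfer
                } else {
                    sendLongDma(msg, off, len);     // own Long protocol, DMA
                }
            }
        }

        // Hypothetical helpers standing in for the JNI calls
        private void sendInline(byte[] m, int off, int len)   { }
        private void sendShortPio(byte[] m, int off, int len) { }
        private void sendShortDma(byte[] m, int off, int len) { }
        private void sendLongDma(byte[] m, int off, int len)  { }
    }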

20
Implementation Issues
  • Communications
  • scidev is based on non-blocking communications
  • It is coded using niodev as a template
  • Asynchronous sends for message sizes > 1 MB
  • Notification strategy
  • Following the approach of SCI SOCKETS, using the
    mbox interrupt library
  • Created without transferring the references (SCI
    interrupt handlers)
  • Each interrupt (both user_interruptions and
    dma_interruptions) registers a callback method
    (see the sketch below)
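
One possible way to model this callback registration on the Java side;
all names here are hypothetical and do not reflect the real SISCI/mbox
API, which is native:

    // Sketch only: a callback attached to each kind of SCI interrupt
    // (user interrupts and DMA-completion interrupts).
    interface InterruptCallback {
        void onInterrupt(int source);   // invoked when the interrupt fires
    }

    class InterruptRegistry {
        private InterruptCallback userCallback;
        private InterruptCallback dmaCallback;

        void registerUserInterrupt(InterruptCallback cb) { userCallback = cb; }
        void registerDmaInterrupt(InterruptCallback cb)  { dmaCallback = cb; }

        // Called (via JNI) when the native layer delivers an interrupt
        void deliverUser(int src) { if (userCallback != null) userCallback.onInterrupt(src); }
        void deliverDma(int src)  { if (dmaCallback != null) dmaCallback.onInterrupt(src); }
    }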

21
Implementation Issues
  • Sending/Receiving
  • 2 threads, user and selector, synchronized to
    reduce latency
  • 1 message queue in which the control messages of
    pending communications are kept
  • Sending directly from the Buffer's direct
    ByteBuffer
  • If the selector thread receives a message that has
    not been posted -> it creates an intermediate
    buffer for temporary storage
  • If the message has been posted, it copies the
    message directly into the Buffer's direct
    ByteBuffer (a sketch of this receive path follows)
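
A sketch of the selector thread's receive path described above. The
String key standing in for (source, tag, context) and the maps are
simplifications, not the real scidev data structures:

    // Sketch only: deliver into a posted receive, or park the message
    // in an intermediate buffer until the matching irecv() is posted.
    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class SelectorReceive {
        // Receives posted by the user thread, keyed by (src, tag, context)
        private final Map<String, ByteBuffer> posted = new ConcurrentHashMap<>();
        // Unexpected messages kept in intermediate buffers
        private final Map<String, ByteBuffer> unexpected = new ConcurrentHashMap<>();

        void onIncoming(String key, ByteBuffer payload) {
            ByteBuffer dest = posted.remove(key);
            if (dest != null) {
                // Copy directly into the user's direct ByteBuffer
                dest.put(payload);
            } else {
                // Intermediate buffer for temporary storage
                ByteBuffer tmp = ByteBuffer.allocate(payload.remaining());
                tmp.put(payload).flip();
                unexpected.put(key, tmp);
            }
        }
    }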

22
Implementation Issues
This schema applies to each pair of nodes:
[Diagram: user and selector threads with SBUFFER/RBUFFER, an
intermediate buffer, and the user-level library (ULL) Inline/Short/Long
queues over SCI]
23
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

24
Benchmarking
  • JDK 1.5 on holly. Latency (µs).
  • scidev latency is 33 µs!

25
Benchmarking
  • JDK 1.5 on holly. Asymptotic Bandwidths (Mbps).
  • scidev throughput is 1280 Mbps!

26
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

27
Future work
  • Immediately
  • Testing of collective communications (so far only
    point-to-point has been tested)
  • A design with lower interdependence between xdev
    and mpjbuf
  • Reading information from the different formats of
    SCI configuration files
  • Benchmarking with MPJ applications and developing
    MPJ and xdev applications
  • New buffering implementation

28
Future work
Buffering system with SBUFFER and RBUFFER in the ULL (still
intermediate)
[Diagram: as in the previous schema, but with SBUFFER and RBUFFER moved
into the user-level library (ULL), alongside the Inline/Short/Long
queues over SCI]
29
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

30
Conclusions
  • Performance is still a problem
  • Try to avoid control messages, maybe by
    integrating this data in the user-level library
  • Aim: latency 30 µs, bandwidth 1350 Mbps
  • Current development phase: testing
  • Hard to do multiple initializations in a single
    thread (restarting the device)
  • The design is a bit coupled with MPJ (strong
    interdependence)
  • Needs evaluation and an implementation using a
    kernel-level library (handles threads and spawns
    processes natively)

31
Questions?
32
Appendix
  • Visitor at the DSG during summer 05
  • Pursuing PhD at Univ. of A Coruña (Spain)

33
Appendix
  • BS in Computing Tech. in 2002 at A Coruña Univ.
  • Member of the Computer Architecture Group.
  • The group's areas of interest
  • High-performance compilers (automatic detection
    of parallelism)
  • Cluster computing
  • Grid applications
  • Management of Parallel/Distributed systems
  • Fault tolerance in MPI
  • Computer graphics (rendering, radiosity)
  • Geographical Information Systems
  • 12 staff members, 8 PhD students

34
Appendix
  • Computer Architecture Group.
  • CrossGrid (EU project within Gridstart)

35
Appendix
  • The Computer Architecture Group is young, with an
    average age of 32 years
  • Some achievements (2000-2004)
  • Papers in international conferences: 102
  • Papers in journals: 53 (41 in the JCR/SCI list)
  • Regional, national and European funded projects
  • (+/- 1M in 5 years)

36
Gratitudes
  • DSG, for providing full support for my work
  • Especially Aamir and Raz, for late, smoky and
    caffeinated DSG office hours
  • Mark, for hosting the visit and his valuable
    support
  • ICG and UoP, for the facilities and services
  • Bryan Carpenter, for his rare but valuable
    comments, and his help with some JNI problems
  • DXIDI, Xunta de Galicia, for funding the visit

37
A Coruña
  • You will always be welcome in A Coruña!

38
A Coruña
  • You will always be welcome in A Coruña!