Developing a Scalable Coherent Interface (SCI) device for MPJ Express


1
Developing a Scalable Coherent Interface (SCI)
device for MPJ Express
  • Guillermo López Taboada
  • 14th October, 2005
  • Dept. of Electronics and Systems
  • University of A Coruña (Spain)
  • http://www.des.udc.es
  • Visitor at Distributed Systems Group
  • http://dsg.port.ac.uk

2
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

3
Introduction
  • The interconnection network and its associated
    software libraries play a key role in High
    Performance Clustering Technology
  • Cluster interconnection technologies
  • Gigabit / 10 Gigabit Ethernet
  • Myrinet
  • SCI
  • Infiniband
  • Qsnet
  • Quadrics
  • GSN - HIPPI
  • Giganet
  • Latencies are small (usually under 10 µs)
  • Bandwidths are high (usually above 1 Gbps)

4
Introduction
  • SCI (Scalable Coherent Interface)
  • Latency: 1.42 µs (theoretical)
  • Bandwidth: 5333 Mbps (bidirectional)
  • Usually used without a switch (small clusters)
  • Topologies: 1D (ring) / 2D (torus)

5
Introduction
  • Example of a 2D torus SCI cluster with FE (Fast
    Ethernet, admin network)

6
Introduction
  • Software available from Dolphinics
  • Software available from Scali
  • ScaIP: IP emulation
  • ScaSISCI: SISCI (Software Infrastructure for SCI)
  • ScaMPI: a proprietary MPI implementation

7
Introduction
  • Java's portability means that, for networking,
    only the widely extended TCP/IP is supported by
    the JDK
  • Previously, IP emulations were used (ScaIP,
    SCIP), but their performance is similar to Fast
    Ethernet (FE)
  • Now there is a high-performance socket
    implementation: SCI SOCKETS
  • Similar to other interconnection technologies,
    e.g. Myrinet (IPoGM -> GM Sockets)

8
Introduction
  • Several research projects have been trying to get
    Java support for these System Area Networks,
    mainly for Myrinet
  • KaRMI/GM (JavaParty, Univ. of Karlsruhe)
  • Manta/LFC/Panda/Ibis (Vrije Universiteit, The
    Netherlands)
  • Java GM Sockets
  • RMIX over Myrinet
  • mpiJava/MPICH-GM or MPICH-MX
  • But nothing for SCI

9
Introduction
  • My PhD project
  • Designing Efficient Mechanisms for Java
    Communications on SCI Systems
  • The motivation is to fill the gap between Java
    and this high-speed interconnect, which lacks
    software support for Java
  • SCI Java Fast Sockets
  • An SCI communication device, the base of a
    messaging system
  • An SCI Channel for Java NIO
  • Wrappers for some libraries
  • Optimized RMI for High Speed Networks
  • Low-level Java buffering and communication system

10
Introduction
  • MPJ Express, a reference implementation of the
    MPI bindings for the Java language, has been
    released
  • There are already mature bindings for C, C++, and
    Fortran, but the Java binding is an ongoing effort
    at DSG
  • A good opportunity to provide SCI support to a
    messaging system

11
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

12
Design of scidev
  • Use of the Java Native Interface (JNI) is
    unavoidable (a minimal sketch follows this list)
  • In order to provide support and good performance
    we have to rely on specific low-level libraries
  • In the presence of SCI hardware, it should be used
  • Loss of portability in exchange for higher
    performance
  • Differences between mpiJava and scidev
  • mpiJava: a thin wrapper providing a large number
    of Java MPI primitives
  • scidev: a thicker layer providing a small API
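
A minimal sketch, assuming a hypothetical NativeSci class and library
name, of how the Java side of such a thin native layer could declare
its JNI entry points; the real scidev bindings may differ:

    // Sketch only: hypothetical Java-side declarations of the JNI entry
    // points that wrap the SCI user-level libraries (SCILib/SISCI).
    public final class NativeSci {

        static {
            // Loads e.g. libnativescidev.so from java.library.path
            System.loadLibrary("nativescidev");
        }

        private NativeSci() { }  // static-only holder for native methods

        // Implemented in C on top of SCILib/SISCI
        static native void init(String[] args);
        static native void finish();
        static native int  rank();
        static native void send(java.nio.ByteBuffer buf, int dest,
                                 int tag, int context);
        static native void recv(java.nio.ByteBuffer buf, int src,
                                 int tag, int context);
    }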

13
Design of scidev
  • Implementing the xdev API
  • init()
  • finish()
  • id()
  • iprobe(ProcessID srcID, int tag, int context)
  • irecv(Buffer buf, ProcessID srcID, int tag,
    int context, Status status)
  • isend(Buffer buf, ProcessID destID, int tag,
    int context)
  • and the blocking counterparts of these functions:
    probe, recv, send, issend, ssend (a simplified
    sketch of this interface follows)
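
A simplified sketch of what this device API could look like in Java,
based only on the method list above; Buffer, ProcessID, Status and
Request are placeholders for the real mpjbuf/xdev types:

    // Sketch only: simplified xdev-style device API implemented by scidev.
    class Buffer { }      // placeholder for mpjbuf.Buffer
    class ProcessID { }   // placeholder for the process identifier
    class Status { }      // placeholder for the message status
    class Request { }     // placeholder for a non-blocking request handle

    interface Device {
        void init(String[] args);
        void finish();
        ProcessID id();

        Status iprobe(ProcessID srcID, int tag, int context);

        // Non-blocking point-to-point operations
        Request irecv(Buffer buf, ProcessID srcID, int tag,
                      int context, Status status);
        Request isend(Buffer buf, ProcessID destID, int tag, int context);

        // Blocking/synchronous counterparts
        Status probe(ProcessID srcID, int tag, int context);
        Status recv(Buffer buf, ProcessID srcID, int tag, int context);
        void   send(Buffer buf, ProcessID destID, int tag, int context);
        Request issend(Buffer buf, ProcessID destID, int tag, int context);
        void   ssend(Buffer buf, ProcessID destID, int tag, int context);
    }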

14
Design of scidev
15
Design of scidev
[Layer diagram: JVM → mpjdev → xdev → (scidev, mxdev) → JNI → O.S. →
native libraries]
16
Design of scidev
  • Native libraries: SCILib and SISCI

17
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

18
Implementation Issues
  • Optimizations / initialization process
  • JNI: caching field identifiers and references to
    objects
  • Long protocol: sending 2 messages
  • The first from a 4-byte-aligned address, and the
    second from a 128-byte-aligned address up to a
    128-byte-aligned address (it may go beyond the end
    of the message, since the raw Buffer has a 2^n
    length)
  • Algorithm to initialize the message queues of
    SCILib (sketched after this list)
  • Connect (to the nodes with lower rank)
  • Create (for all nodes, beginning with the
    following rank)
  • Connect (to the remaining nodes)
  • The complexity is O(n)
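
A sketch of this connect/create/connect ordering, with hypothetical
connectQueue()/createQueue() helpers standing in for the JNI calls into
SCILib; each peer is visited a constant number of times, hence O(n)
per node:

    // Sketch only: initialization order of the SCILib message queues.
    public class QueueInit {

        public void initQueues(int rank, int nprocs) {
            // 1) Connect to queues already created by lower-ranked nodes
            for (int peer = 0; peer < rank; peer++) {
                connectQueue(peer);
            }
            // 2) Create local queues for all peers, starting with rank+1
            for (int i = 1; i < nprocs; i++) {
                createQueue((rank + i) % nprocs);
            }
            // 3) Connect to the remaining (higher-ranked) nodes
            for (int peer = rank + 1; peer < nprocs; peer++) {
                connectQueue(peer);
            }
        }

        // Placeholders for the real native calls into SCILib
        private void connectQueue(int peer) { /* connect via JNI */ }
        private void createQueue(int peer)  { /* create via JNI  */ }
    }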

19
Implementation Issues
  • Transport protocols
  • 3 native protocols
  • Inline: 1-113 B
  • Short: 114 B - 64 KB
  • Long: 64 KB - 1 MB
  • scidev fragments messages > 1 MB and uses
  • Inline for control messages and small messages
    (< 113 B)
  • Short with PIO (Programmed Input/Output) for
    messages < 8 KB
  • Short with DMA (Direct Memory Access) for
    messages of 8-64 KB
  • The Long protocol of the user-level libraries does
    not use DMA transfers, so it is replaced by our
    own Long protocol with DMA transfers (a size-based
    selection sketch follows)
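
A sketch of how this size-based protocol choice could look, using the
thresholds listed above; the sendInline/sendShortPio/sendShortDma/
sendLongDma helpers are hypothetical names, not the real scidev calls:

    // Sketch only: choose the transfer protocol by message size.
    public class ProtocolSelect {
        static final int INLINE_MAX   = 113;          // Inline: 1-113 B
        static final int PIO_MAX      = 8 * 1024;     // Short with PIO: < 8 KB
        static final int SHORT_MAX    = 64 * 1024;    // Short with DMA: 8-64 KB
        static final int FRAGMENT_MAX = 1024 * 1024;  // fragment above 1 MB

        void send(byte[] msg) {
            // Messages larger than 1 MB are sent as 1 MB fragments
            for (int off = 0; off < msg.length; off += FRAGMENT_MAX) {
                int len = Math.min(FRAGMENT_MAX, msg.length - off);
                if (len <= INLINE_MAX) {
                    sendInline(msg, off, len);      // control/small messages
                } else if (len < PIO_MAX) {
                    sendShortPio(msg, off, len);    // programmed I/O
                } else if (len <= SHORT_MAX) {
                    sendShortDma(msg, off, len);    // DMA transfer
                } else {
                    sendLongDma(msg, off, len);     // own Long protocol, DMA
                }
            }
        }

        // Hypothetical helpers standing in for the JNI calls
        private void sendInline(byte[] m, int off, int len)   { }
        private void sendShortPio(byte[] m, int off, int len) { }
        private void sendShortDma(byte[] m, int off, int len) { }
        private void sendLongDma(byte[] m, int off, int len)  { }
    }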

20
Implementation Issues
  • Communications
  • scidev is based on non-blocking communications
  • It is coded using niodev as a template
  • Asynchronous sends for message sizes > 1 MB
  • Notification strategy
  • Following the approach of SCI SOCKETS, using the
    mbox interrupt library
  • Created without transferring the references (SCI
    interrupt handlers)
  • Each interrupt (both user_interruptions and
    dma_interruptions) registers a callback method
    (see the sketch below)
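
One possible way to model this callback registration on the Java side;
all names here are hypothetical and do not reflect the real SISCI/mbox
API, which is native:

    // Sketch only: a callback attached to each kind of SCI interrupt
    // (user interrupts and DMA-completion interrupts).
    interface InterruptCallback {
        void onInterrupt(int source);   // invoked when the interrupt fires
    }

    class InterruptRegistry {
        private InterruptCallback userCallback;
        private InterruptCallback dmaCallback;

        void registerUserInterrupt(InterruptCallback cb) { userCallback = cb; }
        void registerDmaInterrupt(InterruptCallback cb)  { dmaCallback = cb; }

        // Called (via JNI) when the native layer delivers an interrupt
        void deliverUser(int src) { if (userCallback != null) userCallback.onInterrupt(src); }
        void deliverDma(int src)  { if (dmaCallback != null) dmaCallback.onInterrupt(src); }
    }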

21
Implementation Issues
  • Sending/Receiving
  • 2 threads, user and selector, synchronized to
    reduce latency
  • 1 message queue in which the control messages of
    pending communications are kept
  • Sending directly from the Buffer's direct
    ByteBuffer
  • If the selector thread receives a message that has
    not been posted -> it creates an intermediate
    buffer for temporary storage
  • If the message has been posted, it copies the
    message directly into the Buffer's direct
    ByteBuffer (a sketch of this receive path follows)
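
A sketch of the selector thread's receive path described above. The
String key standing in for (source, tag, context) and the maps are
simplifications, not the real scidev data structures:

    // Sketch only: deliver into a posted receive, or park the message
    // in an intermediate buffer until the matching irecv() is posted.
    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class SelectorReceive {
        // Receives posted by the user thread, keyed by (src, tag, context)
        private final Map<String, ByteBuffer> posted = new ConcurrentHashMap<>();
        // Unexpected messages kept in intermediate buffers
        private final Map<String, ByteBuffer> unexpected = new ConcurrentHashMap<>();

        void onIncoming(String key, ByteBuffer payload) {
            ByteBuffer dest = posted.remove(key);
            if (dest != null) {
                // Copy directly into the user's direct ByteBuffer
                dest.put(payload);
            } else {
                // Intermediate buffer for temporary storage
                ByteBuffer tmp = ByteBuffer.allocate(payload.remaining());
                tmp.put(payload).flip();
                unexpected.put(key, tmp);
            }
        }
    }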

22
Implementation Issues
This schema applies to each pair of nodes:
[Diagram: user and selector threads with SBUFFER/RBUFFER, an
intermediate buffer, and the user-level library (ULL) Inline/Short/Long
queues over SCI]
23
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

24
Benchmarking
  • JDK 1.5 on holly. Latency (µs).
  • scidev latency is 33 µs!

25
Benchmarking
  • JDK 1.5 on holly. Asymptotic Bandwidths (Mbps).
  • scidev throughput is 1280 Mbps!

26
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

27
Future work
  • Immediately
  • Testing of collective communications (so far only
    point-to-point has been tested)
  • A design with lower interdependence between xdev
    and mpjbuf
  • Reading information from the different formats of
    SCI configuration files
  • Benchmarking with MPJ applications and developing
    MPJ and xdev applications
  • New buffering implementation

28
Future work
Buffering system with SBUFFER and RBUFFER in the ULL (still
intermediate)
[Diagram: as in the previous schema, but with SBUFFER and RBUFFER moved
into the user-level library (ULL), alongside the Inline/Short/Long
queues over SCI]
29
Outline
  • Introduction
  • Design of scidev
  • Implementation issues
  • Benchmarking
  • Future work
  • Conclusions

30
Conclusions
  • Performance is still a problem
  • Try to avoid control messages, maybe by
    integrating this data in the user-level library
  • Aim: latency 30 µs, bandwidth 1350 Mbps
  • Current development phase: testing
  • Hard to do multiple initializations in a single
    thread (restarting the device)
  • The design is a bit coupled with MPJ (strong
    interdependence)
  • Needs evaluation and an implementation using a
    kernel-level library (handles threads and spawns
    processes natively)

31
Questions?
32
Appendix
  • Visitor at the DSG during summer 05
  • Pursuing PhD at Univ. of A Coruña (Spain)

33
Appendix
  • BS in Computing Tech. in 2002 at A Coruña Univ.
  • Member of the Computer Architecture Group.
  • The group's areas of interest
  • High-performance compilers (automatic detection
    of parallelism)
  • Cluster computing
  • Grid applications
  • Management of Parallel/Distributed systems
  • Fault tolerance in MPI
  • Computer graphics (rendering, radiosity)
  • Geographical Information Systems
  • 12 staff members, 8 PhD students

34
Appendix
  • Computer Architecture Group.
  • CrossGrid (EU project within Gridstart)

35
Appendix
  • The Computer Architecture Group is young, with an
    average age of 32 years
  • Some achievements (2000-2004)
  • Papers in international conferences: 102
  • Papers in journals: 53 (41 in the JCR/SCI list)
  • Regional, national and European funded projects
  • (+/- 1M in 5 years)

36
Gratitudes
  • DSG, for providing full support for my work
  • Especially Aamir and Raz, for late, smoky and
    caffeinated DSG office hours
  • Mark, for hosting the visit and his valuable
    support
  • ICG and UoP, for the facilities and services
  • Bryan Carpenter, for his rare but valuable
    comments, and his help with some JNI problems
  • DXIDI, Xunta de Galicia, for funding the visit

37
A Coruña
  • You will always be welcome in A Coruña!

38
A Coruña
  • You will always be welcome in A Coruña!