Title: Efficient User-Level Networking in Java
1Efficient User-Level Networking in Java
- Chi-Chao Chang
- Dept. of Computer Science
- Cornell University
- (joint work with Thorsten von Eicken and the Safe
Language Kernel group)
2Goal
- High-performance cluster computing with safe languages
- parallel and distributed applications
- communication support for operating systems
- Use off-the-shelf technologies
- User-level network interfaces (UNIs)
- direct, protected access to network devices
- inexpensive clusters
- U-Net (Cornell), Shrimp (Princeton), FM (UIUC), Hamlyn (HP)
- Virtual Interface Architecture (VIA): emerging UNI standard
- Java
- safe "better C"
- "write once run everywhere"
- growing interest for high-performance applications (Java Grande)
- Make the performance of UNIs available from Java
- JAVIA: a Java interface to VIA
3Why a Java Interface to UNI?
- Different approach for providing communication support for Java
- Traditional "front-end" approach
- pick favorite abstraction (sockets, RMI, MPI) and Java VM
- write a Java front-end to custom or existing native libraries
- good performance, re-use proven code
- magic in native code, no common solution
- Javia: exposes UNI to Java
- minimizes amount of unverified code
- isolates bottlenecks in data transfer
- 1. automatic memory management
- 2. object serialization
(Figure labels: Java, C)
4Contribution I
- PROBLEM: lack of control over object lifetime/location due to GC
- EFFECT: conventional techniques (data copying and buffer pinning) yield a 10% to 40% hit in array throughput
- SOLUTION: jbufs, explicit and safe buffer management in Java
- SUPPORT: modifications to GC
- RESULT: BW within 1% of hardware, independent of xfer size
5Contribution II
- PROBLEM: linked, typed objects
- EFFECT: serialization >> send/recv overheads (1000 cycles)
- SOLUTION: jstreams, in-place object unmarshaling
- SUPPORT: object layout information
- RESULT: serialization ~ send/recv overheads
- unmarshaling overhead independent of object size
6Outline
- Background
- UNI: Virtual Interface Architecture
- Java
- Experimental Setup
- Javia Architecture
- Javia-I: native buffers (baseline)
- Javia-II: jbufs (buffer management) and jstreams (marshaling)
- Summary and Conclusions
7UNI in a Nutshell
- Enabling technology for networks of workstations
- direct, protected access to networking devices
- Traditional
- all communication via OS
- VIA
- connections between virtual interfaces (Vi)
- apps send/recv through Vi, simple mux in NI
- OS only involved in setting up Vis
- Generic Architecture
- implemented in hardware, software or both
8VI Structures
- Key Data Structures
- user buffers
- buffer descriptors <addr, len>: layout exposed to user
- send/recv queues: only through API calls
- Structures are
- pinned to physical memory
- address translation in adapter
- Key Points
- direct DMA access to buffers/descr in user-space
- application must allocate, use, re-use, free all buffers/descr
- alloc&pin, unpin&free are expensive operations, but re-use is cheap
9Java Storage Safety
- class Buffer {
-   byte[] data;
-   Buffer(int n) { data = new byte[n]; }
- }
- No control over object placement
- Buffer buf = new Buffer(1024);
- cannot pin after allocation: GC can move objects
- No control over de-allocation
- buf = null;
- drop all references, call or wait for GC
- Result: additional data copying in communication path
10Java Type Safety
- Cannot forge a reference to a Java object
- e.g. cannot cast between byte arrays and objects
- No control over object layout
- field ordering is up to the Java VM
- objects have runtime metadata
- casting with runtime checks
- Object o = (Object) new Buffer(1024); /* up cast: OK */
- Buffer buf = (Buffer) o; /* down cast: runtime check */
- array bounds check
- for (int i = 0; i < 1024; i++) buf.data[i] = (byte) i;
- Result: expensive object marshaling
11Marmot
- Java System from Microsoft Research
- not a VM
- static compiler: bytecode (.class) to x86 (.asm)
- linker: asm files + runtime libraries -> executable (.exe)
- no dynamic loading of classes
- most Dragon book opts, some OO and Java-specific opts
- Advantages
- source code
- good performance
- two types of non-concurrent GC (copying, conservative)
- native interface "close enough" to JNI
12Example Cluster @ Cornell
- Configuration
- 8 P-II 450MHz, 128MB RAM
- 8 1.25 Gbps Giganet GNN-1000 adapters
- one Giganet switch
- total cost: $30,000 (w/ university discount)
- GNN-1000 Adapter
- mux implemented in hardware
- device driver for VI setup
- VIA interface in user-level library (Win32 dll)
- no support for interrupt-driven reception
- Base-line pt-2-pt Performance
- 14 us r/t latency, 16 us with switch
- over 100 MBytes/s peak, 85 MBytes/s with switch
13Outline
- Background
- Javia Architecture
- Javia-I: native buffers (baseline)
- Javia-II: jbufs and jstreams
- Summary and Conclusions
14Javia General Architecture
- Java classes + C library
- Javia-I
- baseline implementation
- array transfers only
- no modifications to Marmot
- native library: buffer mgmt + wrapper calls to VIA
- Javia-II
- array and object transfers
- buffer mgmt in Java
- special support from Marmot
- native library: wrapper calls to VI
15Javia-I Exploiting Native Buffers
- Basic Asynch Send/Recv
- buffers/descr in native library
- Java send/recv ticket rings mirror VI queues
- # of descr/buffers = # of tickets in ring
- Send Critical Path
- get free ticket from ring
- copy from array to buffer
- free ticket
- Recv Critical Path
- obtain corresponding ticket in ring
- copy data from buffer to array
- free ticket from ring
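
The send critical path above (get a free ticket, copy the array into the buffer, free the ticket) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Javia's code: `Ticket`, `TicketRing`, and the direct `ByteBuffer` standing in for a pinned buffer in the native library are all hypothetical names chosen for the example, and the actual post to the VI is left as a comment.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

/** Hypothetical ticket: pairs a ring slot with a stand-in for a native buffer. */
final class Ticket {
    final ByteBuffer nativeBuf;   // assumption: models a pinned buffer in the C library
    int len;

    Ticket(int capacity) {
        nativeBuf = ByteBuffer.allocateDirect(capacity);
    }
}

/** Hypothetical ring of tickets that mirrors a VI send queue. */
final class TicketRing {
    private final ArrayDeque<Ticket> free = new ArrayDeque<>();

    TicketRing(int tickets, int bufSize) {
        for (int i = 0; i < tickets; i++) {
            free.add(new Ticket(bufSize));
        }
    }

    /** Step 1: get a free ticket (null if all tickets are in flight). */
    synchronized Ticket acquire() {
        return free.poll();
    }

    /** Step 3: return the ticket to the ring. */
    synchronized void release(Ticket t) {
        free.add(t);
    }

    synchronized int available() {
        return free.size();
    }

    /** Send path: acquire a ticket, copy array -> buffer, release the ticket. */
    boolean send(byte[] data, int off, int len) {
        Ticket t = acquire();
        if (t == null) {
            return false;                 // ring exhausted, caller must retry
        }
        t.nativeBuf.clear();
        t.nativeBuf.put(data, off, len);  // step 2: this copy is the per-byte cost
        t.len = len;
        // A real implementation would post the descriptor to the VI here and
        // release the ticket only after the adapter reports completion.
        release(t);
        return true;
    }
}
```

The copy in `send` is the cost that the later jbufs design removes; the synchronized ring access stays in both designs.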
16Javia-I Variants
- Two Send Variants
- Sync Send + Copy
- goal: bypass send ring
- one ticket
- array -> buffer copy
- wait until send completes
- Sync Send + Pin
- goal: bypass send ring, avoid copy
- pin array on the fly
- waits until send completes
- unpins after send
- One Recv Variant
- No-Post Recv + Alloc
- goal: bypass recv ring
- allocate array on the fly, copy data
(Figure labels: GC heap, byte array ref, send/recv ticket ring, Vi on the Java side; descriptor, send/recv queue, buffer on the C side; VIA)
17Javia-I Performance
- Basic Costs
- VIA: pin + unpin = (10 + 10) us
- Marmot: native call = 0.28 us, locks = 0.25 us, array alloc = 0.75 us
- Latency (N = transfer size in bytes)
- 16.5 us + (25 ns) * N: raw
- 38.0 us + (38 ns) * N: pin(s)
- 21.5 us + (42 ns) * N: copy(s)
- 18.0 us + (55 ns) * N: copy(s) + alloc(r)
- BW: 75% to 85% of raw, 6 KByte switch-over between copy and pin
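
The latency figures on this slide read as a fixed cost plus a per-byte cost times N. A short sketch of that arithmetic follows; treating each line as `fixed + perByte * N` is an assumption about the flattened slide, and the class and field names are hypothetical. Only the four pairs of constants come from the slide.

```java
/** Linear latency model: fixed cost (us) + per-byte cost (ns) * N bytes. */
final class LatencyModel {
    final String name;
    final double fixedUs;
    final double perByteNs;

    LatencyModel(String name, double fixedUs, double perByteNs) {
        this.name = name;
        this.fixedUs = fixedUs;
        this.perByteNs = perByteNs;
    }

    /** Estimated latency in microseconds for a transfer of nBytes. */
    double latencyUs(int nBytes) {
        return fixedUs + perByteNs * nBytes / 1000.0;
    }

    // Constants as listed on the slide.
    static final LatencyModel RAW = new LatencyModel("raw", 16.5, 25);
    static final LatencyModel PIN = new LatencyModel("pin(s)", 38.0, 38);
    static final LatencyModel COPY = new LatencyModel("copy(s)", 21.5, 42);
    static final LatencyModel COPY_ALLOC = new LatencyModel("copy(s)+alloc(r)", 18.0, 55);

    public static void main(String[] args) {
        int n = 1024;  // example transfer size in bytes
        for (LatencyModel m : new LatencyModel[] {RAW, PIN, COPY, COPY_ALLOC}) {
            System.out.printf("%-18s %.3f us%n", m.name, m.latencyUs(n));
        }
    }
}
```

For a 1 KByte transfer the model gives 42.1 us for raw VIA and about 64.5 us for the copying send, which shows how the per-byte term dominates as messages grow.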
18jbufs
- Lessons from Javia-I
- managing buffers in C introduces copying and/or pinning overheads
- can be implemented in any off-the-shelf JVM
- Motivation
- eliminate excess per-byte costs in latency
- improve throughput
- jbuf: exposes communication buffers to Java programmers
- 1. lifetime control: explicit allocation and de-allocation of jbufs
- 2. efficient access: direct access to jbuf as primitive-typed arrays
- 3. location control: safe de-allocation and re-use by controlling whether or not a jbuf is part of the GC heap
19jbufs Lifetime Control
public class jbuf {
  public static jbuf alloc(int bytes); /* allocates jbuf outside of GC heap */
  public void free() throws CannotFreeException; /* frees jbuf if it can */
}
(Figure labels: C pointer, jbuf, GC heap)
- 1. jbuf allocation does not result in a Java reference to it
- cannot directly access the jbuf through the wrapper object
- 2. jbuf is not automatically freed if there are no Java references to it
- free has to be explicitly called
20jbufs Efficient Access
public class jbuf {
  /* alloc and free omitted */
  public byte[] toByteArray() throws TypedException; /* hands out byte[] ref */
  public int[] toIntArray() throws TypedException; /* hands out int[] ref */
  . . .
}
(Figure labels: jbuf, Java byte[] ref, GC heap)
- 3. (Memory Safety) jbuf remains allocated as long as there are array references to it
- when can we ever free it?
- 4. (Type Safety) jbuf cannot have two differently typed references to it at any given time
- when can we ever re-use it (e.g. change its reference type)?
21jbufs Location Control
public class jbuf {
  /* alloc, free, toArrays omitted */
  public void unRef(CallBack cb); /* app intends to free/re-use jbuf */
}
- Idea: use GC to track references
- unRef: application claims it has no references into the jbuf
- jbuf is added to the GC heap
- GC verifies the claim and notifies application through callback
- application can now free or re-use the jbuf
- Required GC support: change scope of GC heap dynamically
22jbufs Runtime Checks
(State diagram. States: unref, ref<p>, to-be-unref<p>. Transition labels: alloc, to<p>Array, unRef, GC, free.)
- Type safety: ref and to-be-unref states parameterized by primitive type
- GC transition depends on the type of garbage collector
- non-copying: transition only if all refs to array are dropped before GC
- copying: transition occurs after every GC
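
The runtime checks can be pictured as a small state machine. The sketch below is one reading of the diagram labels, not the Marmot implementation: the class name, the use of `IllegalStateException` in place of `TypedException` and `CannotFreeException`, the explicit `FREED` state, and the choice to reject `unRef` on a jbuf with no references are all assumptions. The GC step models the copying-collector case, where the transition back to unref happens after every collection.

```java
/** Hypothetical model of the jbuf state diagram (not the actual jbuf class). */
final class JbufStateModel {
    enum State { UNREF, REF, TO_BE_UNREF, FREED }

    private State state = State.UNREF;  // state right after alloc
    private Class<?> type;              // primitive array type p while referenced

    State state() {
        return state;
    }

    /** to<p>Array: allowed from unref, or from ref<p>/to-be-unref<p> with the same p. */
    void toArray(Class<?> p) {
        switch (state) {
            case UNREF:
                state = State.REF;
                type = p;
                break;
            case REF:
            case TO_BE_UNREF:
                if (type != p) {
                    throw new IllegalStateException("jbuf already typed as " + type);
                }
                break;
            default:
                throw new IllegalStateException("jbuf was freed");
        }
    }

    /** unRef: the application claims it holds no more array references. */
    void unRef() {
        if (state == State.REF || state == State.TO_BE_UNREF) {
            state = State.TO_BE_UNREF;
        } else {
            throw new IllegalStateException("no references were handed out");
        }
    }

    /** GC finished and verified the claim: to-be-unref<p> goes back to unref. */
    void gcCompleted() {
        if (state == State.TO_BE_UNREF) {
            state = State.UNREF;
            type = null;
        }
        // In every other state a GC leaves the jbuf where it is.
    }

    /** free: only legal once no typed references can exist. */
    void free() {
        if (state != State.UNREF) {
            throw new IllegalStateException("cannot free jbuf in state " + state);
        }
        state = State.FREED;
    }
}
```

Re-typing a buffer (for example from `int[]` to `byte[]`) is therefore only possible after an `unRef` followed by a collection, which is the type-safety rule stated above.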
23Javia-II Exploiting jbufs
- Send/recv with jbufs
- explicit pinning/unpinning of jbufs
- tickets point to pinned jbufs
- critical path: synchronized access to rings, but no copies
- Additional checks
- send posts allowed only if jbuf is in ref<p> state
- recv posts allowed only if jbuf is in unref or ref<p> state
- no outstanding send/recv posts in to-be-unref<p> state
24Javia-II Performance
- Basic Costs
- allocation = 1.2 us, toArray = 0.8 us, unRefs = 2.5 us
- Latency (n = xfer size)
- 16.5 us + (0.025 us) * n: raw
- 20.5 us + (0.025 us) * n: jbufs
- 38.0 us + (0.038 us) * n: pin(s)
- 21.5 us + (0.042 us) * n: copy(s)
- BW: within margin of error (< 1%)
25Parallel Matrix Multiplication
- Goal: validate jbufs flexibility and performance in Java apps
- matrices represented as array of jbufs (each jbuf accessed as array of doubles)
- A, B, C distributed across processors (block columns)
- comm phase: processor sends local portion of A to right neighbor, recv new A from left neighbor
- comp phase: Cloc = Cloc + Aloc * Bloc
- Preliminary Results
- no fancy instruction scheduling in Marmot
- no fancy cache-conscious optimizations
- single processor, 128x128: only 15 Mflops
- cluster, 128x128
- comm time about 10% of total time
- Impact of jbufs will increase as flops increase
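
The alternation of comm and comp phases can be simulated inside a single JVM. This is a sketch under assumptions, not the benchmark code: plain `double[][]` arrays replace the jbufs, the "processors" are loop iterations, and `RingMatMul` and its parameters are hypothetical names. With block-column distribution, processor p multiplies the A column block it currently holds by the matching row block of its own B columns.

```java
/** Simulates the ring-shift matrix multiply on `procs` virtual processors. */
final class RingMatMul {
    /** Returns c = a * b for n x n inputs; assumes procs divides n evenly. */
    static double[][] multiply(double[][] a, double[][] b, int procs) {
        int n = a.length;
        int w = n / procs;                 // width of one block column
        double[][] c = new double[n][n];

        // held[p] = index of the A column block currently stored on processor p
        int[] held = new int[procs];
        for (int p = 0; p < procs; p++) {
            held[p] = p;
        }

        for (int step = 0; step < procs; step++) {
            // comp phase: Cloc = Cloc + Aloc * Bloc on every processor
            for (int p = 0; p < procs; p++) {
                int q = held[p];
                for (int i = 0; i < n; i++) {
                    for (int j = p * w; j < (p + 1) * w; j++) {
                        double sum = 0.0;
                        for (int k = q * w; k < (q + 1) * w; k++) {
                            sum += a[i][k] * b[k][j];
                        }
                        c[i][j] += sum;
                    }
                }
            }
            // comm phase: send local A block to the right neighbor,
            // receive a new A block from the left neighbor
            int[] next = new int[procs];
            for (int p = 0; p < procs; p++) {
                next[(p + 1) % procs] = held[p];
            }
            held = next;
        }
        return c;
    }
}
```

After `procs` steps every processor has seen every block of A exactly once, so each block column of C is complete.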
(Figure: matrices C, A and B, each split into block columns owned by p0, p1, p2, p3)
26Active Messages
- Goal: exercise jbuf mgmt
- Implemented subset of AM-II over Javia + jbufs
- maintains a pool of free recv jbufs
- when msg arrives, jbuf is passed to the handler
- AM calls unRef on jbuf after handler invocation
- if pool is empty, either alloc more jbufs or invoke GC
- no copying in critical path, deferred to GC-time if needed
class First extends AMHandler {
  private int first;
  void handler(AMJbuf buf, ...) {
    int[] tmp = buf.toIntArray();
    first = tmp[0];
  }
}
class Enqueue extends AMHandler {
  private Queue q;
  void handler(AMJbuf buf, ...) {
    int[] tmp = buf.toIntArray();
    q.enq(tmp);
  }
}
27AM Preliminary Numbers
- Summary
- AM latency about 15 us higher than Javia
- synch access to buffer pool, endpoint header, flow control checks, handler id lookup
- room for improvement
- AM BW within 5% of peak for 16 KByte messages
28jstreams
- Goal: efficient transmission of arbitrary objects
- assumption: optimizing for homogeneous hosts and Java systems
- Idea: in-place unmarshaling
- defer copying and allocation to GC-time if needed
- jstream
- R/W access to jbuf through object stream API
- no changes in Javia-II architecture
29jstream Implementation
- writeObject
- deep-copy of object, breadth-first
- deals with cyclic data structures
- replace object metadata (e.g. vtable) with 64-bit class descriptor
- readObject
- depth-first traversal from beginning of stream
- swizzle pointers, type-checking, array-bounds checking
- replace class descriptors with metadata
- Required support
- some object layout information (e.g. per-class pointer-tracking info)
- Minimal changes to existing stub compilers (e.g. rmic)
- jstream implements JDK2.0 ObjectStream API
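
The write side (breadth-first copy that tolerates cycles) and the pointer swizzling on the read side can be illustrated with a toy graph. This is a sketch under stated assumptions, not the jstream implementation: `Node` and `FlatGraph` are hypothetical, references are encoded as array indices rather than addresses, class descriptors are omitted because there is only one class, and the reader makes a linear pass instead of a depth-first traversal. Plain Java cannot reinterpret a buffer as live objects, so `read` allocates fresh nodes, which is exactly the cost that in-place unmarshaling avoids.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;

/** Hypothetical linked object with two references. */
final class Node {
    int value;
    Node left, right;

    Node(int value) {
        this.value = value;
    }
}

final class FlatGraph {
    /** Breadth-first flatten: each node becomes {value, leftIndex, rightIndex}; -1 is null. */
    static int[] write(Node root) {
        IdentityHashMap<Node, Integer> index = new IdentityHashMap<>();
        List<Node> order = new ArrayList<>();
        ArrayDeque<Node> queue = new ArrayDeque<>();
        if (root != null) {
            index.put(root, 0);
            order.add(root);
            queue.add(root);
        }
        while (!queue.isEmpty()) {
            Node n = queue.remove();
            for (Node child : new Node[] {n.left, n.right}) {
                // The identity map is what makes cyclic structures terminate.
                if (child != null && !index.containsKey(child)) {
                    index.put(child, order.size());
                    order.add(child);
                    queue.add(child);
                }
            }
        }
        int[] out = new int[order.size() * 3];
        for (int i = 0; i < order.size(); i++) {
            Node n = order.get(i);
            out[3 * i] = n.value;
            out[3 * i + 1] = n.left == null ? -1 : index.get(n.left);
            out[3 * i + 2] = n.right == null ? -1 : index.get(n.right);
        }
        return out;
    }

    /** Rebuild the graph: turn indices back into references, with bounds checks. */
    static Node read(int[] in) {
        if (in.length % 3 != 0) {
            throw new IllegalArgumentException("truncated stream");
        }
        int count = in.length / 3;
        Node[] nodes = new Node[count];
        for (int i = 0; i < count; i++) {
            nodes[i] = new Node(in[3 * i]);
        }
        for (int i = 0; i < count; i++) {
            nodes[i].left = resolve(nodes, in[3 * i + 1]);
            nodes[i].right = resolve(nodes, in[3 * i + 2]);
        }
        return count == 0 ? null : nodes[0];
    }

    private static Node resolve(Node[] nodes, int idx) {
        if (idx == -1) {
            return null;
        }
        if (idx < 0 || idx >= nodes.length) {
            throw new IllegalArgumentException("bad pointer " + idx);
        }
        return nodes[idx];
    }
}
```

The bounds check in `resolve` plays the role of the safety checks listed under readObject: a receiver must never trust an index (or pointer) that arrived over the network.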
30jstreams Safety
31jstream Performance
32Status
- Implementation Status
- Javia-I and II complete
- jbufs and jstreams integrated with Marmot copying collector
- Current Work
- finish implementation of AM-II
- full implementation of Java RMI
- integrate jbufs and jstreams with conservative collector
- more investigation into deferred copying in higher-level protocols
33Related Work
- Fast Java RMI Implementations
- Manta (Vrije U): compiler support for marshaling, Panda communication system
- 34 us null, 51 MBytes/s (85% of raw) on PII-200/Myrinet, JDK1.4
- KaRMI (Karlsruhe): ground-up implementation
- 117 us null, Alpha 500, Para-station, JDK1.4
- Other front-end approaches
- Java front-end for MPI (IBM), Java-to-PVM interface (GaTech)
- Microsoft J-Direct
- "pinned" arrays defined using source-level annotations
- JIT produces code to redirect array access: expensive
- Comm System Design in Safe Languages (e.g. ML)
- Fox Project (CMU): TCP/IP layer in ML
- Ensemble (Cornell): Horus in ML, buffering strategies, data path optimizations
34Summary
- High-Performance Communication in Java: Two problems
- buffer management in the presence of GC
- object marshaling
- Javia: Java Interface to VIA
- uses native buffers as baseline implementation
- jbufs: safe, explicit control over buffer placement and lifetime, eliminates bottlenecks in critical path
- jstreams: jbuf extension for fast, in-place unmarshaling of objects
- Concluding Remarks
- building blocks for Java apps and communication software
- should be integral part of a high-performance Java system
35Javia-I Interface
package cornell.slk.javia;

public class ViByteArrayTicket {
  private byte[] data; private int len, off, tag;
  /* public methods to set/get fields */
}

public class Vi { /* connection to remote Vi */
  public void sendPost(ViByteArrayTicket t); /* asynch send */
  public ViByteArrayTicket sendWait(int timeout);
  public void recvPost(ViByteArrayTicket t); /* async recv */
  public ViByteArrayTicket recvWait(int timeout);
  public void send(byte[] b, int len, int off, int tag); /* sync send */
  public byte[] recv(int timeout); /* post-less recv */
}
36Javia-II Interface
package cornell.slk.javia;

public class ViJbuf extends jbuf {
  public ViJbufTicket register(Vi vi); /* reg + pin jbuf */
  public void deregister(ViJbufTicket t); /* unreg + unpin jbuf */
}

public class ViJbufTicket {
  private ViJbuf buf; private int len, off, tag;
}

public class Vi {
  public void sendBufPost(ViJbufTicket t); /* asynch send */
  public ViJbufTicket sendBufWait(int usecs);
  public void recvBufPost(ViJbufTicket t); /* async recv */
  public ViJbufTicket recvBufWait(int usecs);
}
37Jbufs Implementation
- alloc/free: Win32 VirtualAlloc, VirtualFree
- to{Byte,Int,...}Array: no alloc/copying
- clearRefs
- modification to stop-and-copy Cheney scan GC
- clearRef adds a jbuf to that list
- after GC, traverse list to invoke callbacks, delete list
(Figure: heap before GC and after GC, with stack and global roots, from-space and to-space, ref'd jbufs and unref'd jbufs)
38State-of-the-Art Matrix Multiplication
Courtesy IBM Research