Efficient User-Level Networking in Java

About This Presentation

Title:

Efficient User-Level Networking in Java

Description:

Efficient User-Level Networking in Java Chi-Chao Chang Dept. of Computer Science Cornell University (joint work with Thorsten von Eicken and the Safe Language Kernel ... – PowerPoint PPT presentation

Slides: 39

Provided by: ChiCha4

Learn more at: http://www.cs.cornell.edu

Transcript and Presenter's Notes

1
Efficient User-Level Networking in Java

Chi-Chao Chang
Dept. of Computer Science
Cornell University
(joint work with Thorsten von Eicken and the Safe
Language Kernel group)

2
Goal

High-performance cluster computing with safe languages
    parallel and distributed applications
    communication support for operating systems
Use off-the-shelf technologies
    User-level network interfaces (UNIs)
        direct, protected access to network devices
        inexpensive clusters
        U-Net (Cornell), Shrimp (Princeton), FM (UIUC), Hamlyn (HP)
        Virtual Interface Architecture (VIA): emerging UNI standard
    Java
        safe "better C++"
        "write once run everywhere"
        growing interest for high-performance applications (Java Grande)
Make the performance of UNIs available from Java
    JAVIA: a Java interface to VIA

3
Why a Java Interface to UNI?

Different approach for providing communication support for Java
Traditional "front-end" approach
    pick favorite abstraction (sockets, RMI, MPI) and Java VM
    write a Java front-end to custom or existing native libraries
    advantages: good performance, re-use of proven code
    drawbacks: magic in native code, no common solution
Javia: exposes UNI to Java
    minimizes amount of unverified code
    isolates bottlenecks in data transfer
        1. automatic memory management
        2. object serialization

4
Contribution I

PROBLEM: lack of control over object lifetime/location due to GC
EFFECT: conventional techniques (data copying and buffer pinning) yield a 10% to 40% hit in array throughput
SOLUTION: jbufs, explicit and safe buffer management in Java
SUPPORT: modifications to GC
RESULT: BW within 1% of hardware, independent of transfer size

5
Contribution II

PROBLEM: linked, typed objects
EFFECT: serialization >> send/recv overheads (1000 cycles)
SOLUTION: jstreams, in-place object unmarshaling
SUPPORT: object layout information
RESULT: serialization comparable to send/recv overheads; unmarshaling overhead independent of object size

6
Outline

Background
    UNI: Virtual Interface Architecture
    Java
    Experimental Setup
Javia Architecture
    Javia-I: native buffers (baseline)
    Javia-II: jbufs (buffer management) and jstreams (marshaling)
Summary and Conclusions

7
UNI in a Nutshell

Enabling technology for networks of workstations
direct, protected access to networking devices

Traditional
all communication via OS
VIA
connections between virtual interfaces (Vi)
apps send/recv through Vi, simple mux in NI
OS only involved in setting up Vis
Generic Architecture
implemented in hardware, software or both

8
VI Structures

Key Data Structures
    user buffers
    buffer descriptors <addr, len>: layout exposed to user
    send/recv queues: only through API calls
Structures are
    pinned to physical memory
    address translation in adapter

Key Points
    direct DMA access to buffers/descr in user-space
    application must allocate, use, re-use, free all buffers/descr
    alloc+pin and unpin+free are expensive operations, but re-use is cheap

9
Java Storage Safety

class Buffer {
    byte[] data;
    Buffer(int n) { data = new byte[n]; }
}
No control over object placement
    Buffer buf = new Buffer(1024);
    cannot pin after allocation: GC can move objects
No control over de-allocation
    buf = null;
    drop all references, call or wait for GC
Result: additional data copying in communication path
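
The extra copy named in the result above can be pictured with a staging buffer. The following is only a sketch: it uses java.nio direct buffers (a standard Java API that is not part of this talk) as a stand-in for a pinned native buffer, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

final class StagingCopy {
    // Stand-in for a pinned communication buffer; a direct buffer's
    // contents may reside outside the garbage-collected heap.
    private final ByteBuffer staging;

    StagingCopy(int capacity) {
        staging = ByteBuffer.allocateDirect(capacity);
    }

    // Send path: the payload is copied out of the movable Java array.
    int stageForSend(byte[] data, int off, int len) {
        staging.clear();
        staging.put(data, off, len);
        staging.flip();
        return staging.remaining();
    }

    // Receive path: the payload is copied back into a fresh heap array.
    byte[] deliver() {
        byte[] out = new byte[staging.remaining()];
        staging.get(out);
        return out;
    }

    public static void main(String[] args) {
        StagingCopy path = new StagingCopy(1024);
        byte[] msg = {1, 2, 3, 4};
        int staged = path.stageForSend(msg, 0, msg.length);
        byte[] got = path.deliver();
        System.out.println(staged + " bytes staged, " + got.length + " bytes delivered");
    }
}
```

Each message crosses the Java/native boundary with one copy per direction, which is the per-byte cost the slides set out to remove.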

10
Java Type Safety

Cannot forge a reference to a Java object
    e.g. cannot cast between byte arrays and objects
No control over object layout
    field ordering is up to the Java VM
    objects have runtime metadata
    casting with runtime checks
        Object o = (Object) new Buffer(1024);  /* up cast: OK */
        Buffer buf = (Buffer) o;               /* down cast: runtime check */
    array bounds check
        for (int i = 0; i < 1024; i++) buf.data[i] = (byte) i;
Result: expensive object marshaling
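
To make the marshaling cost concrete, here is a minimal sketch of what type safety forces on even a flat int array: since an int[] cannot be reinterpreted as a byte[], every element is converted with bounds-checked stores. The class name and the big-endian layout are assumptions chosen for illustration.

```java
final class IntMarshal {
    // Marshal: four bounds-checked byte stores per element (big-endian).
    static byte[] marshal(int[] src) {
        byte[] out = new byte[src.length * 4];
        for (int i = 0; i < src.length; i++) {
            int v = src[i];
            out[4 * i]     = (byte) (v >>> 24);
            out[4 * i + 1] = (byte) (v >>> 16);
            out[4 * i + 2] = (byte) (v >>> 8);
            out[4 * i + 3] = (byte) v;
        }
        return out;
    }

    // Unmarshal: allocate a new array and rebuild each element.
    static int[] unmarshal(byte[] src) {
        if (src.length % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        int[] out = new int[src.length / 4];
        for (int i = 0; i < out.length; i++) {
            out[i] = ((src[4 * i] & 0xFF) << 24)
                   | ((src[4 * i + 1] & 0xFF) << 16)
                   | ((src[4 * i + 2] & 0xFF) << 8)
                   | (src[4 * i + 3] & 0xFF);
        }
        return out;
    }
}
```

The work grows with the payload size on both sides, before any linked objects enter the picture.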

11
Marmot

Java system from Microsoft Research
    not a VM
    static compiler: bytecode (.class) to x86 (.asm)
    linker: asm files + runtime libraries -> executable (.exe)
    no dynamic loading of classes
    most Dragon book opts, some OO and Java-specific opts
Advantages
    source code
    good performance
    two types of non-concurrent GC (copying, conservative)
    native interface "close enough" to JNI

12
Example Cluster @ Cornell

Configuration
    8 P-II 450MHz, 128MB RAM
    8 1.25 Gbps Giganet GNN-1000 adapters
    one Giganet switch
    total cost: $30,000 (w/university discount)
GNN-1000 Adapter
    mux implemented in hardware
    device driver for VI setup
    VIA interface in user-level library (Win32 dll)
    no support for interrupt-driven reception
Base-line pt-2-pt Performance
    14 us r/t latency, 16 us with switch
    over 100 MBytes/s peak, 85 MBytes/s with switch

13
Outline

Background
Javia Architecture
    Javia-I: native buffers (baseline)
    Javia-II: jbufs and jstreams
Summary and Conclusions

14
Javia General Architecture

Java classes + C library
Javia-I
    baseline implementation
    array transfers only
    no modifications to Marmot
    native library: buffer mgmt + wrapper calls to VIA
Javia-II
    array and object transfers
    buffer mgmt in Java
    special support from Marmot
    native library: wrapper calls to VI

15
Javia-I Exploiting Native Buffers

Basic Asynch Send/Recv
    buffers/descr in native library
    Java send/recv ticket rings mirror VI queues
    # of descr/buffers = # of tickets in ring
Send Critical Path
    get free ticket from ring
    copy from array to buffer
    free ticket
Recv Critical Path
    obtain corresponding ticket in ring
    copy data from buffer to array
    free ticket from ring
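
The two critical paths can be sketched as a loopback model. This is not the Javia-I code: a byte[] stands in for each native buffer, a single ring plays both the send and the receive role, and every name is hypothetical.

```java
import java.util.ArrayDeque;

final class TicketRing {
    // One ticket per descriptor/buffer pair; byte[] stands in for a native buffer.
    static final class Ticket {
        final byte[] buffer;
        int len;
        Ticket(int size) { buffer = new byte[size]; }
    }

    private final ArrayDeque<Ticket> free = new ArrayDeque<>();
    private final ArrayDeque<Ticket> posted = new ArrayDeque<>();
    private final int bufSize;

    TicketRing(int tickets, int bufSize) {
        this.bufSize = bufSize;
        for (int i = 0; i < tickets; i++) free.add(new Ticket(bufSize));
    }

    // Send path: get a free ticket, copy the array into its buffer, post it.
    synchronized boolean send(byte[] data, int off, int len) {
        if (len > bufSize || free.isEmpty()) return false;
        Ticket t = free.poll();
        System.arraycopy(data, off, t.buffer, 0, len);
        t.len = len;
        posted.add(t);
        return true;
    }

    // Receive path: take the oldest posted ticket, copy its buffer into a
    // Java array, then return the ticket to the free list.
    synchronized byte[] recv() {
        Ticket t = posted.poll();
        if (t == null) return null;
        byte[] out = new byte[t.len];
        System.arraycopy(t.buffer, 0, out, 0, t.len);
        free.add(t);
        return out;
    }

    synchronized int freeTickets() { return free.size(); }
}
```

Both paths take the ring lock and perform one array copy, which matches the per-message and per-byte costs that Javia-I pays.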

16
Javia-I Variants

Two Send Variants
    Sync Send + Copy
        goal: bypass send ring
        one ticket
        array -> buffer copy
        wait until send completes
    Sync Send + Pin
        goal: bypass send ring, avoid copy
        pin array on the fly
        wait until send completes
        unpin after send
One Recv Variant
    No-Post Recv + Alloc
        goal: bypass recv ring
        allocate array on the fly, copy data

[Figure labels: GC heap, byte array ref, send/recv ticket ring, Vi, Java, C, descriptor, send/recv queue, buffer, VIA]

17
Javia-I Performance
Basic Costs
    VIA: pin + unpin = (10 + 10) us
    Marmot: native call = 0.28 us, locks = 0.25 us, array alloc = 0.75 us
Latency (N = transfer size in bytes)
    16.5 us + (25 ns) * N    raw
    38.0 us + (38 ns) * N    pin(s)
    21.5 us + (42 ns) * N    copy(s)
    18.0 us + (55 ns) * N    copy(s) + alloc(r)
BW: 75% to 85% of raw, 6 KByte switch-over between copy and pin
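
The latency figures on this slide can be evaluated with a small helper. The sketch assumes each entry is a fixed cost in microseconds plus a per-byte cost in nanoseconds times N; the class and method names are hypothetical.

```java
final class JaviaLatencyModel {
    // latency(N) = fixed cost (us) + per-byte cost (ns) * N, returned in us.
    static double latencyUs(double fixedUs, double perByteNs, int nBytes) {
        return fixedUs + perByteNs * nBytes / 1000.0;
    }

    static double raw(int n)       { return latencyUs(16.5, 25, n); }
    static double pin(int n)       { return latencyUs(38.0, 38, n); }
    static double copy(int n)      { return latencyUs(21.5, 42, n); }
    static double copyAlloc(int n) { return latencyUs(18.0, 55, n); }

    public static void main(String[] args) {
        for (int n : new int[] {0, 1024, 8192}) {
            System.out.printf("N=%d raw=%.1f pin=%.1f copy=%.1f copy+alloc=%.1f (us)%n",
                    n, raw(n), pin(n), copy(n), copyAlloc(n));
        }
    }
}
```

Pinning carries the largest fixed cost and copying the larger per-byte cost, so the cheaper variant depends on the message size.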
18
jbufs

Lessons from Javia-I
    managing buffers in C introduces copying and/or pinning overheads
    can be implemented in any off-the-shelf JVM
Motivation
    eliminate excess per-byte costs in latency
    improve throughput
jbuf: exposes communication buffers to Java programmers
    1. lifetime control: explicit allocation and de-allocation of jbufs
    2. efficient access: direct access to jbuf as primitive-typed arrays
    3. location control: safe de-allocation and re-use by controlling whether or not a jbuf is part of the GC heap

19
jbufs Lifetime Control
public class jbuf {
    public static jbuf alloc(int bytes);            /* allocates jbuf outside of GC heap */
    public void free() throws CannotFreeException;  /* frees jbuf if it can */
}

[Figure labels: C pointer, jbuf, GC heap]

1. jbuf allocation does not result in a Java reference to it
    cannot directly access the jbuf through the wrapper object
2. jbuf is not automatically freed if there are no Java references to it
    free has to be explicitly called

20
jbufs Efficient Access
public class jbuf {
    /* alloc and free omitted */
    public byte[] toByteArray() throws TypedException;  /* hands out byte[] ref */
    public int[] toIntArray() throws TypedException;    /* hands out int[] ref */
    . . .
}

[Figure labels: jbuf, Java byte[] ref, GC heap]

3. (Memory Safety) jbuf remains allocated as long as there are array references to it
    when can we ever free it?
4. (Type Safety) jbuf cannot have two differently typed references to it at any given time
    when can we ever re-use it (e.g. change its reference type)?

21
jbufs Location Control
public class jbuf {
    /* alloc, free, toArrays omitted */
    public void unRef(CallBack cb);  /* app intends to free/re-use jbuf */
}

Idea: use GC to track references
unRef: application claims it has no references into the jbuf
    jbuf is added to the GC heap
    GC verifies the claim and notifies application through callback
    application can now free or re-use the jbuf
Required GC support: change scope of GC heap dynamically

22
jbufs Runtime Checks
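[State diagram: jbuf states unref, ref<p>, to-be-unref<p>]
    alloc: new jbuf starts in unref
    unref -- to<p>Array --> ref<p>
    ref<p> -- to<p>Array, GC --> ref<p>
    ref<p> -- unRef --> to-be-unref<p>
    to-be-unref<p> -- to<p>Array, unRef --> to-be-unref<p>
    to-be-unref<p> -- GC --> unref
    unref -- free --> (freed)

Type safety: ref and to-be-unref states are parameterized by primitive type
GC transition depends on the type of garbage collector
    non-copying: transition only if all refs to array are dropped before GC
    copying: transition occurs after every GC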
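
The runtime checks can be sketched as a small state machine. This is a simulation only: it allocates nothing outside the GC heap, onGc() is a hypothetical hook standing in for the collector, and the GC step follows the copying-collector rule (the transition happens on every GC). All names are assumptions.

```java
final class SimJbuf {
    enum State { UNREF, REF, TO_BE_UNREF, FREED }
    enum Prim { BYTE, INT }

    private State state = State.UNREF;
    private Prim type;          // meaningful only in REF and TO_BE_UNREF
    private Runnable callback;  // fired when the GC confirms the unRef claim

    static SimJbuf alloc() { return new SimJbuf(); }

    // to<p>Array: allowed when untyped, or when already typed as p.
    void toArray(Prim p) {
        switch (state) {
            case UNREF:
                state = State.REF;
                type = p;
                break;
            case REF:
            case TO_BE_UNREF:
                if (type != p) throw new IllegalStateException("already typed as " + type);
                break;
            default:
                throw new IllegalStateException("jbuf was freed");
        }
    }

    // unRef: the application claims it holds no more array references.
    void unRef(Runnable cb) {
        if (state != State.REF && state != State.TO_BE_UNREF) {
            throw new IllegalStateException("nothing to unRef in state " + state);
        }
        state = State.TO_BE_UNREF;
        callback = cb;
    }

    // Simulated collection: a copying collector completes the transition.
    void onGc() {
        if (state != State.TO_BE_UNREF) return;
        state = State.UNREF;
        type = null;
        Runnable cb = callback;
        callback = null;
        if (cb != null) cb.run();
    }

    void free() {
        if (state != State.UNREF) throw new IllegalStateException("cannot free in state " + state);
        state = State.FREED;
    }

    State state() { return state; }
}
```

A buffer handed out as int[] cannot be re-typed as byte[] or freed until unRef has been called and a collection has run, which is how the two safety rules are enforced.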

23
Javia-II Exploiting jbufs

Send/recv with jbufs
    explicit pinning/unpinning of jbufs
    tickets point to pinned jbufs
    critical path: synchronized access to rings, but no copies
Additional checks
    send posts allowed only if jbuf is in ref<p> state
    recv posts allowed only if jbuf is in unref or ref<p> state
    no outstanding send/recv posts in to-be-unref<p> state

24
Javia-II Performance
Basic Costs
    allocation = 1.2 us, to*Array = 0.8 us, unRefs = 2.5 us
Latency (n = transfer size in bytes)
    16.5 us + (0.025 us) * n    raw
    20.5 us + (0.025 us) * n    jbufs
    38.0 us + (0.038 us) * n    pin(s)
    21.5 us + (0.042 us) * n    copy(s)
BW: within margin of error (< 1%)
25
Parallel Matrix Multiplication

Goal: validate jbufs flexibility and performance in Java apps
    matrices represented as array of jbufs (each jbuf accessed as array of doubles)
    A, B, C distributed across processors (block columns)
    comm phase: processor sends local portion of A to right neighbor, recv new A from left neighbor
    comp phase: Cloc = Cloc + Aloc * Bloc
Preliminary Results
    no fancy instruction scheduling in Marmot
    no fancy cache-conscious optimizations
    single processor, 128x128: only 15 Mflops
    cluster, 128x128: comm time about 10% of total time
Impact of jbufs will increase as flops increase
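
The comm and comp phases can be simulated in a single process. In this sketch plain double arrays replace jbufs, the processors are loop iterations rather than cluster nodes, and the names are hypothetical; only the block-column layout and the ring shift of A follow the slide.

```java
final class RingMatMul {
    // Returns a * b for n x n matrices using `procs` simulated processors,
    // each owning one block column of A, B and C.
    static double[][] multiply(double[][] a, double[][] b, int procs) {
        int n = a.length;
        if (n % procs != 0) throw new IllegalArgumentException("procs must divide n");
        int w = n / procs;  // width of one block column
        double[][][] aLoc = new double[procs][][];
        double[][][] bLoc = new double[procs][][];
        double[][][] cLoc = new double[procs][n][w];
        int[] owner = new int[procs];  // which block column of A each processor holds
        for (int p = 0; p < procs; p++) {
            aLoc[p] = columns(a, p * w, w);
            bLoc[p] = columns(b, p * w, w);
            owner[p] = p;
        }
        for (int step = 0; step < procs; step++) {
            // comp phase: Cloc = Cloc + Aloc * (matching rows of Bloc)
            for (int p = 0; p < procs; p++) {
                int k = owner[p];
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < w; j++)
                        for (int kk = 0; kk < w; kk++)
                            cLoc[p][i][j] += aLoc[p][i][kk] * bLoc[p][k * w + kk][j];
            }
            // comm phase: every processor passes its A block to the right neighbor
            double[][][] nextA = new double[procs][][];
            int[] nextOwner = new int[procs];
            for (int p = 0; p < procs; p++) {
                int right = (p + 1) % procs;
                nextA[right] = aLoc[p];
                nextOwner[right] = owner[p];
            }
            aLoc = nextA;
            owner = nextOwner;
        }
        double[][] out = new double[n][n];
        for (int p = 0; p < procs; p++)
            for (int i = 0; i < n; i++)
                System.arraycopy(cLoc[p][i], 0, out[i], p * w, w);
        return out;
    }

    // Copies columns first .. first+w-1 of m into a new n x w block.
    private static double[][] columns(double[][] m, int first, int w) {
        double[][] out = new double[m.length][w];
        for (int i = 0; i < m.length; i++) System.arraycopy(m[i], first, out[i], 0, w);
        return out;
    }
}
```

After as many steps as there are processors, every processor has seen every block column of A, so each local C block is complete.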

[Figure: matrices C, A and B, each split into block columns owned by processors p0 p1 p2 p3]

26
Active Messages

Goal: exercise jbuf mgmt
Implemented subset of AM-II over Javia + jbufs
    maintains a pool of free recv jbufs
    when msg arrives, jbuf is passed to the handler
    AM calls unRef on jbuf after handler invocation
    if pool is empty, either alloc more jbufs or invoke GC
    no copying in critical path, deferred to GC-time if needed

class First extends AMHandler {
    private int first;
    void handler(AMJbuf buf, ...) {
        int[] tmp = buf.toIntArray();
        first = tmp[0];
    }
}
class Enqueue extends AMHandler {
    private Queue q;
    void handler(AMJbuf buf, ...) {
        int[] tmp = buf.toIntArray();
        q.enq(tmp);
    }
}
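
The pool management around these handlers might look roughly like the sketch below. It is a simulation with hypothetical names: int[] buffers replace jbufs, simulateGc() stands in for the collector's callback, and it models the non-copying case, where a buffer that a handler still references stays unavailable.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

final class AmPoolSketch {
    interface Handler { void handle(int[] payload); }

    private final ArrayDeque<int[]> freeBufs = new ArrayDeque<>();
    private final List<int[]> pendingUnref = new ArrayList<>();
    private final int words;
    int allocations;

    AmPoolSketch(int bufs, int words) {
        this.words = words;
        for (int i = 0; i < bufs; i++) freeBufs.add(new int[words]);
        allocations = bufs;
    }

    // Message arrival: take a receive buffer, run the handler, then mark the
    // buffer as "unRef requested" instead of reusing it immediately.
    void deliver(int[] wire, Handler h) {
        int[] buf = freeBufs.poll();
        if (buf == null) {          // pool empty: allocate more
            buf = new int[words];
            allocations++;
        }
        System.arraycopy(wire, 0, buf, 0, Math.min(wire.length, words));
        h.handle(buf);
        pendingUnref.add(buf);
    }

    // Simulated GC: buffers no longer reachable go back to the pool.
    void simulateGc(Predicate<int[]> stillReachable) {
        Iterator<int[]> it = pendingUnref.iterator();
        while (it.hasNext()) {
            int[] buf = it.next();
            if (!stillReachable.test(buf)) {
                it.remove();
                freeBufs.add(buf);
            }
        }
    }

    int freeCount() { return freeBufs.size(); }
}
```

A handler like First lets its buffer return to the pool at the next collection, while one like Enqueue keeps its buffer out of circulation for as long as the queue holds the array.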

27
AM Preliminary Numbers

Summary
    AM latency about 15 us higher than Javia
        synch access to buffer pool, endpoint header, flow control checks, handler id lookup
        room for improvement
    AM BW within 5% of peak for 16 KByte messages

28
jstreams

Goal: efficient transmission of arbitrary objects
    assumption: optimizing for homogeneous hosts and Java systems
Idea: in-place unmarshaling
    defer copying and allocation to GC-time if needed
jstream
    R/W access to jbuf through object stream API
    no changes in Javia-II architecture

29
jstream Implementation

writeObject
    deep-copy of object, breadth-first
    deals with cyclic data structures
    replace object metadata (e.g. vtable) with 64-bit class descriptor
readObject
    depth-first traversal from beginning of stream
    swizzle pointers, type-checking, array-bounds checking
    replace class descriptors with metadata
Required support
    some object layout information (e.g. per-class pointer-tracking info)
Minimal changes to existing stub compilers (e.g. rmic)
    jstream implements JDK2.0 ObjectStream API
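
The write and read steps can be illustrated on a tiny object graph. This sketch is not the jstream implementation: it flattens hypothetical Node objects into an int array, replaces references with record indices, and rebuilds new objects on read with a linear pass, whereas jstreams unmarshal in place with a depth-first traversal, which standard Java cannot express.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class SwizzleSketch {
    static final class Node {
        int value;
        Node left, right;
        Node(int value) { this.value = value; }
    }

    // Breadth-first walk; the identity map lets cycles and shared nodes be
    // written once. Each record is {value, leftIndex, rightIndex}, -1 = null.
    static int[] write(Node root) {
        if (root == null) return new int[0];
        Map<Node, Integer> index = new IdentityHashMap<>();
        List<Node> order = new ArrayList<>();
        ArrayDeque<Node> queue = new ArrayDeque<>();
        index.put(root, 0);
        order.add(root);
        queue.add(root);
        while (!queue.isEmpty()) {
            Node n = queue.remove();
            for (Node child : new Node[] {n.left, n.right}) {
                if (child != null && !index.containsKey(child)) {
                    index.put(child, order.size());
                    order.add(child);
                    queue.add(child);
                }
            }
        }
        int[] out = new int[order.size() * 3];
        for (int i = 0; i < order.size(); i++) {
            Node n = order.get(i);
            out[3 * i] = n.value;
            out[3 * i + 1] = n.left == null ? -1 : index.get(n.left);
            out[3 * i + 2] = n.right == null ? -1 : index.get(n.right);
        }
        return out;
    }

    // Rebuild the graph: indices are checked, then turned back into references.
    static Node read(int[] stream) {
        if (stream.length % 3 != 0) throw new IllegalArgumentException("truncated stream");
        int count = stream.length / 3;
        if (count == 0) return null;
        Node[] nodes = new Node[count];
        for (int i = 0; i < count; i++) nodes[i] = new Node(stream[3 * i]);
        for (int i = 0; i < count; i++) {
            nodes[i].left = resolve(nodes, stream[3 * i + 1]);
            nodes[i].right = resolve(nodes, stream[3 * i + 2]);
        }
        return nodes[0];
    }

    private static Node resolve(Node[] nodes, int idx) {
        if (idx == -1) return null;
        if (idx < 0 || idx >= nodes.length) throw new IllegalArgumentException("bad reference " + idx);
        return nodes[idx];
    }
}
```

The index check in resolve plays the role of the pointer validation that keeps a corrupted or malicious stream from forging references.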

30
jstreams Safety
31
jstream Performance
32
Status

Implementation Status
    Javia-I and II complete
    jbufs and jstreams integrated with Marmot copying collector
Current Work
    finish implementation of AM-II
    full implementation of Java RMI
    integrate jbufs and jstreams with conservative collector
    more investigation into deferred copying in higher-level protocols

33
Related Work

Fast Java RMI Implementations
    Manta (Vrije U): compiler support for marshaling, Panda communication system
        34 us null, 51 MBytes/s (85% of raw) on PII-200/Myrinet, JDK1.4
    KaRMI (Karlsruhe): ground-up implementation
        117 us null, Alpha 500, Para-station, JDK1.4
Other front-end approaches
    Java front-end for MPI (IBM), Java-to-PVM interface (GaTech)
Microsoft J-Direct
    "pinned" arrays defined using source-level annotations
    JIT produces code to "redirect" array access: expensive
Comm System Design in Safe Languages (e.g. ML)
    Fox Project (CMU): TCP/IP layer in ML
    Ensemble (Cornell): Horus in ML, buffering strategies, data path optimizations

34
Summary

High-Performance Communication in Java: Two problems
    buffer management in the presence of GC
    object marshaling
Javia: Java Interface to VIA
    uses native buffers as baseline implementation
    jbufs: safe, explicit control over buffer placement and lifetime, eliminates bottlenecks in critical path
    jstreams: jbuf extension for fast, in-place unmarshaling of objects
Concluding Remarks
    building blocks for Java apps and communication software
    should be integral part of a high-performance Java system

35
Javia-I Interface

package cornell.slk.javia;
public class ViByteArrayTicket {
    private byte[] data;
    private int len, off, tag;
    /* public methods to set/get fields */
}
public class Vi {  /* connection to remote Vi */
    public void sendPost(ViByteArrayTicket t);              /* async send */
    public ViByteArrayTicket sendWait(int timeout);
    public void recvPost(ViByteArrayTicket t);              /* async recv */
    public ViByteArrayTicket recvWait(int timeout);
    public void send(byte[] b, int len, int off, int tag);  /* sync send */
    public byte[] recv(int timeout);                        /* post-less recv */
}

36
Javia-II Interface

package cornell.slk.javia;
public class ViJbuf extends jbuf {
    public ViJbufTicket register(Vi vi);     /* reg + pin jbuf */
    public void deregister(ViJbufTicket t);  /* unreg + unpin jbuf */
}
public class ViJbufTicket {
    private ViJbuf buf;
    private int len, off, tag;
}
public class Vi {
    public void sendBufPost(ViJbufTicket t);     /* async send */
    public ViJbufTicket sendBufWait(int usecs);
    public void recvBufPost(ViJbufTicket t);     /* async recv */
    public ViJbufTicket recvBufWait(int usecs);
}

37
Jbufs Implementation

alloc/free: Win32 VirtualAlloc, VirtualFree
to{Byte,Int,...}Array: no alloc/copying
clearRefs
    modification to stop-and-copy Cheney scan GC
    clearRef adds a jbuf to that list
    after GC, traverse list to invoke callbacks, delete list

[Figure: heap layout before GC and after GC. Labels: Stack, Global, from-space, to-space, ref'd jbufs, unref'd jbufs]

38
State-of-the-Art Matrix Multiplication
Courtesy IBM Research