Efficient User-Level Networking in Java

1
Efficient User-Level Networking in Java
  • Chi-Chao Chang
  • Dept. of Computer Science
  • Cornell University
  • (joint work with Thorsten von Eicken and the Safe
    Language Kernel group)

2
Goal
  • High-performance cluster computing with safe
    languages
  • parallel and distributed applications
  • communication support for operating systems
  • Use off-the-shelf technologies
  • User-level network interfaces (UNIs)
  • direct, protected access to network devices
  • inexpensive clusters
  • U-Net (Cornell), Shrimp (Princeton), FM (UIUC),
    Hamlyn (HP)
  • Virtual Interface Architecture (VIA): emerging
    UNI standard
  • Java
  • safe, a "better C"
  • write once, run everywhere
  • growing interest for high-performance
    applications (Java Grande)
  • Make the performance of UNIs available from Java
  • Javia: a Java interface to VIA

3
Why a Java Interface to UNI?
  • Different approach for providing communication
    support for Java
  • Traditional front-end approach
  • pick favorite abstraction (sockets, RMI, MPI) and
    Java VM
  • write a Java front-end to custom or existing
    native libraries
  • good performance, re-use proven code
  • magic in native code, no common solution
  • Javia exposes UNI to Java
  • minimizes amount of unverified code
  • isolates bottlenecks in data transfer
  • 1. automatic memory management
  • 2. object serialization

4
Contribution I
  • PROBLEM: lack of control over object
    lifetime/location due to GC
  • EFFECT: conventional techniques (data copying and
    buffer pinning) yield a 10% to 40% hit in array
    throughput
  • SOLUTION: jbufs, explicit, safe buffer management
    in Java
  • SUPPORT: modifications to GC
  • RESULT: BW within 1% of hardware, independent of
    xfer size

5
Contribution II
  • PROBLEM: linked, typed objects
  • EFFECT: serialization >> send/recv overheads
    (1000 cycles)
  • SOLUTION: jstreams, in-place object unmarshaling
  • SUPPORT: object layout information
  • RESULT: serialization ≈ send/recv overheads
  • unmarshaling overhead independent of object
    size

6
Outline
  • Background
  • UNI: Virtual Interface Architecture
  • Java
  • Experimental Setup
  • Javia Architecture
  • Javia-I: native buffers (baseline)
  • Javia-II: jbufs (buffer management) and jstreams
    (marshaling)
  • Summary and Conclusions

7
UNI in a Nutshell
  • Enabling technology for networks of workstations
  • direct, protected access to networking devices
  • Traditional
  • all communication via OS
  • VIA
  • connections between virtual interfaces (Vi)
  • apps send/recv through Vi, simple mux in NI
  • OS only involved in setting up Vis
  • Generic Architecture
  • implemented in hardware, software or both

8
VI Structures
  • Key Data Structures
  • user buffers
  • buffer descriptors <addr, len>: layout exposed
    to user
  • send/recv queues: accessed only through API calls
  • Structures are
  • pinned to physical memory
  • address translation in adapter
  • Key Points
  • direct DMA access to buffers/descr in user-space
  • application must allocate, use, re-use, free all
    buffers/desc
  • alloc/pin and unpin/free are expensive operations,
    but re-use is cheap

9
Java Storage Safety
class Buffer {
    byte[] data;
    Buffer(int n) { data = new byte[n]; }
}
  • No control over object placement
  • Buffer buf = new Buffer(1024);
  • cannot pin after allocation: GC can move objects
  • No control over de-allocation
  • buf = null;
  • drop all references, call or wait for GC
  • Result: additional data copying in the communication
    path
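To make the copying cost concrete, here is a minimal sketch (not part of Javia) of what a send layer must do when arrays can move: stage the bytes in a region the GC will not touch. The direct ByteBuffer standing in for a pinned buffer, and the class and method names, are assumptions for illustration.

import java.nio.ByteBuffer;

class CopyOnSend {
    // A direct buffer stands in for a pinned, DMA-able region outside the GC heap.
    private final ByteBuffer pinned = ByteBuffer.allocateDirect(4096);

    // Because the collector may move or reclaim 'data' at any time, its bytes
    // must be copied out of the heap before the NIC can DMA them.
    void send(byte[] data, int len) {
        pinned.clear();
        pinned.put(data, 0, len);   // the per-byte copy a wrapper like Javia-I pays on every send
        // ... hand 'pinned' to the native send routine here ...
    }
}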

10
Java Type Safety
  • Cannot forge a reference to a Java object
  • e.g. cannot cast between byte arrays and objects
  • No control over object layout
  • field ordering is up to the Java VM
  • objects have runtime metadata
  • casting with runtime checks
  • Object o = (Object) new Buffer(1024);  /* up cast: OK */
  • Buffer buf = (Buffer) o;  /* down cast: runtime check */
  • array bounds check
  • for (int i = 0; i < 1024; i++) buf.data[i] = i;
  • Result: expensive object marshaling

11
Marmot
  • Java System from Microsoft Research
  • not a VM
  • static compiler: bytecode (.class) to x86 (.asm)
  • linker: asm files + runtime libraries ->
    executable (.exe)
  • no dynamic loading of classes
  • most Dragon book opts, some OO and Java-specific
    opts
  • Advantages
  • source code
  • good performance
  • two types of non-concurrent GC (copying,
    conservative)
  • native interface close enough to JNI

12
Example Cluster @ Cornell
  • Configuration
  • 8 P-II 450MHz, 128MB RAM
  • 8 x 1.25 Gbps Giganet GNN-1000 adapters
  • one Giganet switch
  • total cost: $30,000 (w/ university discount)
  • GNN1000 Adapter
  • mux implemented in hardware
  • device driver for VI setup
  • VIA interface in user-level library (Win32 dll)
  • no support for interrupt-driven reception
  • Base-line pt-2-pt Performance
  • 14 us r/t latency, 16 us with switch
  • over 100 MBytes/s peak, 85 MBytes/s with switch

13
Outline
  • Background
  • Javia Architecture
  • Javia-I: native buffers (baseline)
  • Javia-II: jbufs and jstreams
  • Summary and Conclusions

14
Javia General Architecture
  • Java classes + C library
  • Javia-I
  • baseline implementation
  • array transfers only
  • no modifications to Marmot
  • native library: buffer mgmt + wrapper calls to
    VIA
  • Javia-II
  • array and object transfers
  • buffer mgmt in Java
  • special support from Marmot
  • native library: wrapper calls to VI

15
Javia-I Exploiting Native Buffers
  • Basic Asynch Send/Recv
  • buffers/descr in native library
  • Java send/recv ticket rings mirror VI queues
  • # of descr/buffers = # of tickets in ring
  • Send Critical Path
  • get free ticket from ring
  • copy from array to buffer
  • free ticket
  • Recv Critical Path
  • obtain corresponding ticket in ring
  • copy data from buffer to array
  • free ticket from ring
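A sketch of the send critical path just described, written against the Javia-I interface shown on the later "Javia-I Interface" slide; the ticket constructor and setter names are assumptions, and the array-to-buffer copy happens inside the native code behind sendPost.

import cornell.slk.javia.*;

class JaviaISendPath {
    static void send(Vi vi, byte[] array, int len, int tag) {
        ViByteArrayTicket t = new ViByteArrayTicket();  // 1. get a free ticket from the send ring
        t.setData(array);
        t.setLen(len);
        t.setTag(tag);
        vi.sendPost(t);   // 2. native side copies array -> pinned buffer, posts descriptor
        vi.sendWait(0);   // 3. completion returns the ticket so it can be re-used
    }
}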

16
Javia-I Variants
  • Two Send Variants
  • Sync Send Copy
  • goal: bypass send ring
  • one ticket
  • array -> buffer copy
  • wait until send completes
  • Sync Send Pin
  • goal: bypass send ring, avoid copy
  • pin array on the fly
  • waits until send completes
  • unpins after send
  • One Recv Variant
  • No-Post Recv Alloc
  • goal: bypass recv ring
  • allocate array on the fly, copy data

[Figure: Javia-I organization. Java side: GC heap with byte array refs and a send/recv ticket ring per Vi; C side: descriptors, send/recv queues, and buffers managed by VIA.]
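The "Sync Send Pin" variant trades the copy for pinning the array on the fly. A rough sketch; the native pin/unpin/sendPinned entry points are invented for illustration (the real work is done by VIA memory registration in the Javia-I C library).

class SyncSendPin {
    // Hypothetical natives wrapping VIA memory registration; names are illustrative only.
    private static native long pin(byte[] array, int len);
    private static native void sendPinned(long handle, int len, int tag);
    private static native void unpin(long handle);

    static void send(byte[] array, int len, int tag) {
        long h = pin(array, len);       // pin so the GC cannot move the array during DMA
        try {
            sendPinned(h, len, tag);    // blocking send straight out of the Java array, no copy
        } finally {
            unpin(h);                   // unpin after completion; pin/unpin are the expensive part
        }
    }
}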
17
Javia-I Performance
Basic costs: VIA pin + unpin (10 + 10) us; Marmot native call 0.28 us, locks
0.25 us, array alloc 0.75 us
Latency (N = transfer size in bytes):
  raw:                16.5 us + (25 ns) x N
  pin(s):             38.0 us + (38 ns) x N
  copy(s):            21.5 us + (42 ns) x N
  copy(s) + alloc(r): 18.0 us + (55 ns) x N
BW: 75% to 85% of raw; 6 KByte switch-over between copy and pin
18
jbufs
  • Lessons from Javia-I
  • managing buffers in C introduces copying and/or
    pinning overheads
  • can be implemented in any off-the-shelf JVM
  • Motivation
  • eliminate excess per-byte costs in latency
  • improve throughput
  • jbuf exposes communication buffers to Java
    programmers
  • 1. lifetime control: explicit allocation and
    de-allocation of jbufs
  • 2. efficient access: direct access to a jbuf as
    primitive-typed arrays
  • 3. location control: safe de-allocation and
    re-use by controlling whether or not a jbuf is
    part of the GC heap
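A minimal usage sketch of these three properties, using the jbuf API spelled out on the next slides; the package, buffer size, and the CallBack method name are assumptions.

import cornell.slk.javia.*;

class JbufLifecycle {
    static void demo() throws Exception {
        final jbuf b = jbuf.alloc(8192);   // 1. lifetime control: allocated outside the GC heap
        byte[] view = b.toByteArray();     // 2. efficient access: the jbuf viewed as a byte[]
        view[0] = 42;                      //    ... fill it and hand it to send/recv ...
        view = null;                       // drop the array reference
        b.unRef(new CallBack() {           // 3. location control: let the GC verify the claim
            public void callBack() {
                try { b.free(); } catch (CannotFreeException e) { /* still referenced */ }
            }
        });
    }
}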

19
jbufs Lifetime Control
public class jbuf {
    public static jbuf alloc(int bytes);  /* allocates jbuf outside of GC heap */
    public void free() throws CannotFreeException;  /* frees jbuf if it can */
}
[Figure: a jbuf allocated outside the GC heap, reachable through a C pointer]
  • 1. jbuf allocation does not result in a Java
    reference to it
  • cannot directly access the jbuf through the
    wrapper object
  • 2. jbuf is not automatically freed if there are
    no Java references to it
  • free has to be explicitly called

20
jbufs Efficient Access
public class jbuf {  /* alloc and free omitted */
    public byte[] toByteArray() throws TypedException;  /* hands out byte[] ref */
    public int[] toIntArray() throws TypedException;    /* hands out int[] ref */
    . . .
}
[Figure: the jbuf handed out to Java code as a byte[] reference, alongside the GC heap]
  • 3. (Memory Safety) jbuf remains allocated as long
    as there are array references to it
  • when can we ever free it?
  • 4. (Type Safety) jbuf cannot have two differently
    typed references to it at any given time
  • when can we ever re-use it (e.g. change its
    reference type)?

21
jbufs Location Control
public class jbuf {  /* alloc, free, toArrays omitted */
    public void unRef(CallBack cb);  /* app intends to free/re-use jbuf */
}
  • Idea: use GC to track references
  • unRef: application claims it has no references
    into the jbuf
  • jbuf is added to the GC heap
  • GC verifies the claim and notifies application
    through callback
  • application can now free or re-use the jbuf
  • Required GC support: change scope of GC heap
    dynamically
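A sketch of the re-use the callback enables: once the GC has verified the unRef claim, the same jbuf can be handed out again under a different primitive type. The CallBack method name and the exception handling are assumptions.

import cornell.slk.javia.*;

class JbufReuse {
    static void reuse(final jbuf b) throws Exception {
        byte[] bytes = b.toByteArray();   // jbuf is now typed as byte[] (ref<byte> state)
        // ... use bytes, then drop every reference into the jbuf ...
        bytes = null;
        b.unRef(new CallBack() {          // claim: no refs left; jbuf joins the GC heap
            public void callBack() {      // GC confirmed the claim
                try {
                    int[] ints = b.toIntArray();   // safe to re-type the same buffer now
                    ints[0] = 7;
                } catch (TypedException e) { /* should not happen after the callback */ }
            }
        });
    }
}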

22
jbufs Runtime Checks
[State diagram: states unref, ref<p>, and to-be-unref<p>; transitions labeled alloc, free, to<p>Array, unRef, and GC. alloc yields an unref jbuf, to<p>Array takes it to ref<p>, unRef to to-be-unref<p>, and a GC pass returns it to unref, where free is permitted.]
  • Type safety: the ref and to-be-unref states are
    parameterized by primitive type
  • The GC transition depends on the type of garbage
    collector
  • non-copying: transition only if all refs to the array
    are dropped before GC
  • copying: transition occurs after every GC
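A compact sketch of the checks this state machine implies; the enum, field names, and the behavior left open for the to-be-unref state are illustrative, not the Marmot implementation.

class JbufState {
    enum State { UNREF, REF, TO_BE_UNREF }   // REF and TO_BE_UNREF are per primitive type <p>

    private State s = State.UNREF;   // alloc starts a jbuf here
    private Class<?> refType;        // the <p> currently handed out, if any

    void toArray(Class<?> p) {       // to<p>Array
        if (s == State.REF && refType != p)
            throw new IllegalStateException("already typed as " + refType);   // type safety
        if (s == State.UNREF || s == State.REF) { s = State.REF; refType = p; }
        // behavior in TO_BE_UNREF depends on the collector (see bullets above)
    }
    void unRef() { if (s == State.REF) s = State.TO_BE_UNREF; }
    void onGC(boolean noRefsSurvive) { if (s == State.TO_BE_UNREF && noRefsSurvive) s = State.UNREF; }
    void free() { if (s != State.UNREF) throw new IllegalStateException("still referenced"); }
}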

23
Javia-II Exploiting jbufs
  • Send/recv with jbufs
  • explicit pinning/unpinning of jbufs
  • tickets point to pinned jbufs
  • critical path: synchronized access to rings, but
    no copies
  • Additional checks
  • send posts allowed only if jbuf is in the ref<p>
    state
  • recv posts allowed only if jbuf is in the unref or
    ref<p> state
  • no outstanding send/recv posts in the to-be-unref<p>
    state
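A sketch of a zero-copy send over Javia-II using the ViJbuf interface listed on the "Javia-II Interface" slide; registering once and re-using the pinned jbuf is the intended pattern, and the fill code is illustrative.

import cornell.slk.javia.*;

class JaviaIISendPath {
    static void send(Vi vi, ViJbuf buf) throws Exception {
        ViJbufTicket t = buf.register(vi);   // reg + pin the jbuf for this Vi (do once, re-use)
        byte[] data = buf.toByteArray();     // jbuf must be in the ref<byte> state to post
        data[0] = 1;                         // ... produce the message in place, no copy ...
        vi.sendBufPost(t);                   // post a descriptor pointing at the pinned jbuf
        vi.sendBufWait(0);                   // completion: ticket and jbuf can be re-used
        buf.deregister(t);                   // unreg + unpin when this Vi is done with it
    }
}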

24
Javia-II Performance
Basic costs: allocation 1.2 us, toArray 0.8 us, unRef 2.5 us
Latency (n = xfer size in bytes):
  raw:     16.5 us + (0.025 us) x n
  jbufs:   20.5 us + (0.025 us) x n
  pin(s):  38.0 us + (0.038 us) x n
  copy(s): 21.5 us + (0.042 us) x n
BW within margin of error (< 1%)
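Plugging an 8 KByte transfer into the model above shows why jbufs stay within a few percent of raw while the copying path falls behind; a small calculator for the stated coefficients:

class LatencyModel {
    // Linear latency model from the slide: fixed cost + per-byte cost x n (all in microseconds).
    static double lat(double fixedUs, double perByteUs, int n) { return fixedUs + perByteUs * n; }

    public static void main(String[] args) {
        int n = 8192;  // 8 KByte transfer
        System.out.printf("raw     %.1f us%n", lat(16.5, 0.025, n));   // ~221 us
        System.out.printf("jbufs   %.1f us%n", lat(20.5, 0.025, n));   // ~225 us, ~2% over raw
        System.out.printf("copy(s) %.1f us%n", lat(21.5, 0.042, n));   // ~366 us, per-byte copy dominates
    }
}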
25
Parallel Matrix Multiplication
  • Goal: validate jbufs' flexibility and performance
    in Java apps
  • matrices represented as an array of jbufs (each jbuf
    accessed as an array of doubles)
  • A, B, C distributed across processors (block
    columns)
  • comm phase: processor sends local portion of A to
    right neighbor, recvs new A from left neighbor
  • comp phase: Cloc = Cloc + Aloc x Bloc
  • Preliminary results
  • no fancy instruction scheduling in Marmot
  • no fancy cache-conscious optimizations
  • single processor, 128x128: only 15 Mflops
  • cluster, 128x128:
  • comm time about 10% of total time
  • impact of jbufs will increase as flops increase

[Figure: block-column distribution of C, A, and B across processors p0-p3]
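A sketch of the comm/comp loop described above; the block multiply and the neighbor exchange (which would move jbuf-backed blocks with Javia) are left as stubs, so the names and structure are illustrative only.

class RingMatMul {
    // In each of P steps, multiply the A block currently held into C, then
    // pass it to the right neighbor and receive the next block from the left.
    static void multiply(int P, double[][] Aloc, double[][] Bloc, double[][] Cloc) {
        double[][] Acur = Aloc;
        for (int step = 0; step < P; step++) {
            accumulate(Cloc, Acur, Bloc);   // comp phase: Cloc = Cloc + Acur x Bloc
            Acur = shiftRight(Acur);        // comm phase: send right, recv from left (jbuf-backed)
        }
    }
    static void accumulate(double[][] C, double[][] A, double[][] B) { /* block multiply, elided */ }
    static double[][] shiftRight(double[][] A) { /* Javia send/recv over jbufs, elided */ return A; }
}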
26
Active Messages
  • Goal: exercise jbuf mgmt
  • Implemented subset of AM-II over Javia + jbufs
  • maintains a pool of free recv jbufs
  • when msg arrives, jbuf is passed to the handler
  • AM calls unRef on jbuf after handler invocation
  • if pool is empty, either alloc more jbufs or
    invoke GC
  • no copying in critical path, deferred to GC-time
    if needed

class First extends AMHandler {
    private int first;
    void handler(AMJbuf buf, ...) {
        int[] tmp = buf.toIntArray();
        first = tmp[0];
    }
}

class Enqueue extends AMHandler {
    private Queue q;
    void handler(AMJbuf buf, ...) {
        int[] tmp = buf.toIntArray();
        q.enq(tmp);
    }
}

27
AM Preliminary Numbers
  • Summary
  • AM latency about 15 us higher than Javia
  • synch access to buffer pool, endpoint header,
    flow control checks, handler id lookup
  • room for improvement
  • AM BW within 5% of peak for 16 KByte messages

28
jstreams
  • Goal: efficient transmission of arbitrary objects
  • assumption: optimizing for homogeneous hosts and
    Java systems
  • Idea: in-place unmarshaling
  • defer copying and allocation to GC-time if needed
  • jstream
  • R/W access to jbuf through object stream API
  • no changes in Javia-II architecture

29
jstream Implementation
  • writeObject
  • deep-copy of object, breadth-first
  • deals with cyclic data structures
  • replace object metadata (e.g. vtable) with 64-bit
    class descriptor
  • readObject
  • depth-first traversal from beginning of stream
  • swizzle pointers, type-checking, array-bounds
    checking
  • replace class descriptors with metadata
  • Required support
  • some object layout information (e.g. per-class
    pointer-tracking info)
  • Minimal changes to existing stub compilers (e.g.
    rmic)
  • jstream implements JDK2.0 ObjectStream API
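A hedged sketch of the stream API in use: the jstream class name and writeObject/readObject come from this slide, but how a jstream is attached to a jbuf and how the buffer travels over Javia are assumptions, so both streams are simply taken as parameters here.

import cornell.slk.javia.*;
import java.io.Serializable;

class JstreamExample {
    static class Point implements Serializable { int x, y; }

    static void roundTrip(jstream out, jstream in) throws Exception {
        Point p = new Point();
        p.x = 3; p.y = 4;
        out.writeObject(p);                 // deep copy, breadth-first; vtable -> class descriptor
        // ... send the underlying jbuf with Javia-II and receive it into 'in' ...
        Point q = (Point) in.readObject();  // in-place: pointers swizzled, types and bounds checked
        System.out.println(q.x + "," + q.y);
    }
}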

30
jstreams Safety
31
jstream Performance
32
Status
  • Implementation Status
  • Javia-I and II complete
  • jbufs and jstreams integrated with Marmot copying
    collector
  • Current Work
  • finish implementation of AM-II
  • full implementation of Java RMI
  • integrate jbufs and jstreams with conservative
    collector
  • more investigation into deferred copying in
    higher-level protocols

33
Related Work
  • Fast Java RMI Implementations
  • Manta (Vrije U): compiler support for marshaling,
    Panda communication system
  • 34 us null, 51 Mbytes/s (85% of raw) on
    PII-200/Myrinet, JDK1.4
  • KaRMI (Karlsruhe): ground-up implementation
  • 117 us null, Alpha 500, Para-station, JDK1.4
  • Other front-end approaches
  • Java front-end for MPI (IBM), Java-to-PVM
    interface (GaTech)
  • Microsoft J-Direct
  • pinned arrays defined using source-level
    annotations
  • JIT produces code to redirect array access:
    expensive
  • Comm System Design in Safe Languages (e.g. ML)
  • Fox Project (CMU): TCP/IP layer in ML
  • Ensemble (Cornell): Horus in ML, buffering
    strategies, data path optimizations

34
Summary
  • High-performance communication in Java: two
    problems
  • buffer management in the presence of GC
  • object marshaling
  • Javia: a Java interface to VIA
  • uses native buffers as baseline implementation
  • jbufs: safe, explicit control over buffer
    placement and lifetime, eliminates bottlenecks in
    the critical path
  • jstreams: jbuf extension for fast, in-place
    unmarshaling of objects
  • Concluding remarks
  • building blocks for Java apps and communication
    software
  • should be an integral part of a high-performance
    Java system

35
Javia-I Interface
package cornell.slk.javia;

public class ViByteArrayTicket {
    private byte[] data;  private int len, off, tag;
    /* public methods to set/get fields */
}

public class Vi {  /* connection to remote Vi */
    public void sendPost(ViByteArrayTicket t);  /* asynch send */
    public ViByteArrayTicket sendWait(int timeout);
    public void recvPost(ViByteArrayTicket t);  /* async recv */
    public ViByteArrayTicket recvWait(int timeout);
    public void send(byte[] b, int len, int off, int tag);  /* sync send */
    public byte[] recv(int timeout);  /* post-less recv */
}

36
Javia-II Interface
package cornell.slk.javia;

public class ViJbuf extends jbuf {
    public ViJbufTicket register(Vi vi);     /* reg + pin jbuf */
    public void deregister(ViJbufTicket t);  /* unreg + unpin jbuf */
}

public class ViJbufTicket {
    private ViJbuf buf;  private int len, off, tag;
}

public class Vi {
    public void sendBufPost(ViJbufTicket t);  /* asynch send */
    public ViBufTicket sendBufWait(int usecs);
    public void recvBufPost(ViJbufTicket t);  /* async recv */
    public ViBufTicket recvBufWait(int usecs);
}

37
Jbufs Implementation
  • alloc/free: Win32 VirtualAlloc, VirtualFree
  • toByte/Int/...Array: no alloc/copying
  • clearRefs
  • modification to the stop-and-copy (Cheney scan) GC
  • clearRef adds a jbuf to a list kept by the collector
  • after GC, traverse the list to invoke callbacks,
    then delete the list

[Figure: stack/global roots, from-space and to-space before and after GC, with ref'd and unref'd jbufs]
38
State-of-the-Art Matrix Multiplication
Courtesy IBM Research