Title: U-Net: A User-Level Network Interface
Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995
U-Net: A User-Level Network Interface for Parallel and Distributed Computing
Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels
Cornell University
Presented by JaeMyoung KIM (959027)
Abstract
- U-Net communication architecture
  - provides processes with a virtual view of the network interface to enable user-level access to high-speed communication devices
  - implemented on workstations using COTS ATM communication hardware
  - removes the kernel from the communication path
  - provides full protection
- U-Net model
  - allows the construction of protocols at user level whose performance is limited only by the capabilities of the network
  - flexible in the sense that TCP, UDP, and Active Messages (AM) can be implemented efficiently
- U-Net prototype
  - 8-node ATM cluster of standard workstations
  - 65 µs round-trip latency
  - 15 MB/s bandwidth (120 Mbit/s)
  - TCP performance at maximum network bandwidth
  - performance equivalent to the CS-2 and CM-5 supercomputers on a set of Split-C benchmarks
Introduction
- Bottleneck shift
  - from the limited bandwidth of network fabrics -> to the software path
  - the kernel path involves several copies and crosses multiple levels of abstraction
  - common practice eludes the per-message overhead and concentrates on peak bandwidths of long data streams
    - justifiable for video playback
    - but not for small-message applications: DSM, RPC, remote object-oriented method invocations, and distributed cooperative file caches
- Integrating application-specific information into protocol processing
  - allows for higher efficiency and greater flexibility in protocol cost management
  - moves parts of the protocol processing into user space
  - makes it possible to utilize the network efficiently and to couple communication and computation effectively
- Goals
  - to remove the kernel completely from the critical path
  - to allow the communication layers used by each process to be tailored to its demands
- Key issues
  - multiplexing the network among processes
  - providing protection
  - managing limited communication resources
  - designing an efficient yet versatile programming interface
- U-Net architecture on an off-the-shelf hardware platform
Motivation and related work
- U-Net architecture focus
  - provide low-latency communication in local settings
  - exploit the full network bandwidth with small messages
  - facilitate the use of novel communication protocols
- low communication latencies
  - end-to-end latency = processing overhead + network latency
  - for large messages: dominated by transmission time
  - for small messages: dominated by processing overhead
  - examples
    - OO technology: 100 bytes
    - electronic workplace: requests 20-80 bytes, replies 40-200 bytes
    - caching techniques in modern distributed systems
    - multi-round protocols
    - RPC style of interaction
    - rfs NFS traffic: 200 bytes
    - networks of workstations
- small-message bandwidth
  - provide full network bandwidth with messages as small as possible
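The latency breakdown above can be illustrated with a small model. This is only a sketch and is not taken from the paper: the function names and the 30 µs per-message overhead are illustrative assumptions, and the 15 MB/s link rate is borrowed from the abstract (with 1 MB taken as 10^6 bytes).

```python
def end_to_end_latency_us(size_bytes, overhead_us, bandwidth_mb_s):
    """One-way latency = fixed per-message overhead + transmission time."""
    # 1 MB/s equals 1 byte per microsecond, so bytes / (MB/s) is already in µs.
    transmission_us = size_bytes / bandwidth_mb_s
    return overhead_us + transmission_us


def overhead_fraction(size_bytes, overhead_us, bandwidth_mb_s):
    """Share of the total latency that is per-message processing overhead."""
    total = end_to_end_latency_us(size_bytes, overhead_us, bandwidth_mb_s)
    return overhead_us / total


OVERHEAD_US = 30.0     # assumed per-message overhead, for illustration only
BANDWIDTH_MB_S = 15.0  # link rate quoted in the abstract

for size in (100, 200, 4096, 65536):
    share = overhead_fraction(size, OVERHEAD_US, BANDWIDTH_MB_S)
    print(f"{size:6d} bytes: overhead is {share:.1%} of the latency")
```

Under these assumed numbers the overhead term dominates for messages of a few hundred bytes and becomes negligible for long transfers, which is the reason the slides stress small-message performance.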
- Communication protocol and interface flexibility
  - lack of support for the integration of kernel and application buffer management -> high processing overheads
  - new protocol design techniques
    - Application Level Framing
    - Integrated Layer Processing
    - compiler-assisted protocol development
  - blocking RPC
    - pre-allocate memory for the reply
- Towards a new networking architecture
  - to simply remove the kernel from the critical path of sending and receiving messages
  - eliminates system call overhead and streamlines buffer management at user level
  - approach
    - to mux/demux directly in the network interface (NI)
    - to move all buffer management and protocol processing to user level
  - goals
    - selecting a good virtual NI abstraction to present to processes
    - providing support for legacy protocols
    - enforcing protection without kernel intervention
- Related work
  - message demultiplexor in the microkernel for the Mach 3 OS
  - application device channel abstraction at the Univ. of Arizona
  - HP Bristol
  - CM-5, CS-2, SP-2, T3D
    - custom hardware, somewhat constrained to the controlled environment of a multiprocessor
  - successive simplifications and generalizations of shared memory
  - Shrimp memory-based network access model
    - legacy protocols, long data streams, or RPC
- U-Net design goals
  - focus on low latency and high bandwidth using small messages
  - emphasis on protocol design and integration flexibility
  - the desire to meet the first two goals on widely available standard workstations using off-the-shelf communication hardware
The user-level NI architecture
- Sending and receiving messages
  - three main building blocks
    - endpoints: virtual device interface
    - communication segments
    - message queues: hold descriptors for messages
  - Sending messages
    - user constructs the message in the buffer area and pushes a Tx descriptor onto the send queue
  - Receiving messages
    - message arrives, data is placed in a buffer taken from the free queue, and an Rx descriptor is pushed onto the recv queue
    - polling or blocking wait
    - event-driven upcall registration
      - triggered when the Rx queue is non-empty or almost full
      - all pending messages can be consumed in one upcall; processes can disable upcalls
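The send and receive paths above can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not the U-Net API: the names `Endpoint`, `Descriptor`, and `ni_deliver`, the buffer sizes, and the split of the segment into a send half and a receive half are all hypothetical, and a plain function stands in for the network interface.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    offset: int  # where the message starts in the communication segment
    length: int  # message length in bytes


class Endpoint:
    """Toy endpoint: one communication segment plus send, recv and free queues."""

    def __init__(self, segment_size=1024, buf_size=128):
        self.segment = bytearray(segment_size)
        self.buf_size = buf_size
        self.send_q = deque()
        self.recv_q = deque()
        # Assumption: the upper half of the segment is handed out as receive buffers.
        self.free_q = deque(range(segment_size // 2, segment_size, buf_size))

    def send(self, data, offset=0):
        """Compose the message in the segment, then push a Tx descriptor."""
        if len(data) > self.buf_size or offset + len(data) > len(self.segment) // 2:
            raise ValueError("message does not fit in the send area")
        self.segment[offset:offset + len(data)] = data
        self.send_q.append(Descriptor(offset, len(data)))

    def poll(self):
        """Return the next received message, or None if the recv queue is empty."""
        if not self.recv_q:
            return None
        rx = self.recv_q.popleft()
        data = bytes(self.segment[rx.offset:rx.offset + rx.length])
        self.free_q.append(rx.offset)  # hand the buffer back to the free queue
        return data


def ni_deliver(src, dst):
    """Stand-in for the NI: move every queued message from src to dst."""
    delivered = 0
    while src.send_q:
        tx = src.send_q.popleft()
        if not dst.free_q:
            continue  # no free receive buffer, so the message is dropped
        rx_off = dst.free_q.popleft()
        dst.segment[rx_off:rx_off + tx.length] = src.segment[tx.offset:tx.offset + tx.length]
        dst.recv_q.append(Descriptor(rx_off, tx.length))
        delivered += 1
    return delivered


a, b = Endpoint(), Endpoint()
a.send(b"ping")
ni_deliver(a, b)
print(b.poll())
```

The application never calls into a kernel in this sketch: it only writes into its own segment and manipulates queues, which is the property the architecture is built around.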
- Multiplexing and demultiplexing messages
  - uses a tag that depends on the network substrate (e.g., ATM VCI)
  - parallel-process id tag
  - operating system service
    - determining the correct tag
    - route discovery, switch path setup, etc.
    - necessary authentication and authorization checks
  - endpoints and communication channels
    - protection boundaries among multiple processes
    - extended boundaries across the network
- Zero-copy vs. true zero-copy
  - data can be sent directly out of, and received directly into, the application data structures without intermediate buffering
  - I/O bus addressing, NI functionality
  - base-level U-Net architecture: one copy
    - does not support the memory mapping
    - copy cost is not a dominant factor
  - direct-access U-Net architecture
    - specify an offset in the destination communication segment
- Base-level U-Net architecture
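Tag-based demultiplexing and the protection check it enables can be sketched as a lookup table. This is an illustrative sketch only: the class `Demultiplexer`, its method names, and the tag values are hypothetical, and the registration step merely stands in for the operating system service described above.

```python
class Demultiplexer:
    """Maps a message tag (e.g. an ATM VCI) to the endpoint that owns it."""

    def __init__(self):
        self._channels = {}  # tag -> (owner process id, receive queue)

    def register_channel(self, tag, owner, recv_queue):
        """Stand-in for the OS service that sets up a channel after its checks."""
        if tag in self._channels:
            raise ValueError(f"tag {tag} is already in use")
        self._channels[tag] = (owner, recv_queue)

    def may_send(self, owner, tag):
        """A process may only send on channels registered to it."""
        entry = self._channels.get(tag)
        return entry is not None and entry[0] == owner

    def deliver(self, tag, payload):
        """Route an incoming message by tag; unknown tags are dropped."""
        entry = self._channels.get(tag)
        if entry is None:
            return False
        entry[1].append(payload)
        return True


demux = Demultiplexer()
queue_p1, queue_p2 = [], []
demux.register_channel(32, owner=1, recv_queue=queue_p1)
demux.register_channel(33, owner=2, recv_queue=queue_p2)
demux.deliver(33, b"for process 2")
print(queue_p1, queue_p2, demux.may_send(1, 33))
```

Because the table is filled in only at channel setup, the per-message path needs no kernel involvement and still keeps processes from reading or writing each other's channels.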
Two U-Net implementations
- U-Net using the SBA-100
  - operates using programmed I/O to store cells into a 36-cell deep output FIFO and to retrieve incoming cells from a 292-cell deep input FIFO
  - HW CRC calculation
    - no DMA, no payload CRC calculation
  - implemented in the kernel by providing emulated U-Net endpoints to the applications
    - a loadable device driver
    - user-level library implementing the AAL5 SAR layer
  - evaluation
    - two 60 MHz SPARCstation-20s running SunOS 4.1.3
    - 140 Mbit/s TAXI fibers leading to a Fore ASX-200
    - end-to-end round-trip time of a single-cell message: 66 µs
    - bandwidth: 6.8 MB/s (54.4 Mbit/s) for packets of 1 KB
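The AAL5 segmentation and reassembly (SAR) step mentioned above can be sketched as follows. This is a simplified illustration and not the library from the paper: the function names are hypothetical, and the CRC-32 field of the trailer is left as a zero placeholder instead of being computed. The framing itself (48-byte cell payloads, an 8-byte trailer in the last cell, zero padding in between) follows the standard AAL5 layout.

```python
CELL_PAYLOAD = 48  # payload bytes carried by one ATM cell
TRAILER = 8        # AAL5 trailer: UU (1), CPI (1), Length (2), CRC-32 (4)


def aal5_cell_count(payload_len):
    """Number of cells needed for a payload, including padding and trailer."""
    return -(-(payload_len + TRAILER) // CELL_PAYLOAD)  # ceiling division


def aal5_segment(payload):
    """Split a payload into 48-byte cell payloads (CRC field is a placeholder)."""
    if len(payload) > 0xFFFF:
        raise ValueError("AAL5 length field is 16 bits")
    pad = -(len(payload) + TRAILER) % CELL_PAYLOAD
    # Assumption: CRC-32 is not computed here; four zero bytes stand in for it.
    trailer = bytes(2) + len(payload).to_bytes(2, "big") + bytes(4)
    pdu = payload + bytes(pad) + trailer
    return [pdu[i:i + CELL_PAYLOAD] for i in range(0, len(pdu), CELL_PAYLOAD)]


def aal5_reassemble(cells):
    """Concatenate the cells and cut the payload using the trailer length field."""
    pdu = b"".join(cells)
    length = int.from_bytes(pdu[-6:-4], "big")
    return pdu[:length]


message = bytes(range(100))
cells = aal5_segment(message)
print(len(cells), aal5_reassemble(cells) == message)
```

Under this framing a payload of up to 40 bytes fits in a single cell, while a 48-byte payload already needs two, which is worth keeping in mind when reading the single-cell measurements.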
- U-Net using the SBA-200
  - uses custom firmware to implement the base-level architecture directly on the SBA-200
  - SBA-200
    - a 25 MHz Intel i960 processor, 256 KB of memory
    - a DMA-capable I/O bus (SBus) interface
    - a simple FIFO interface to the ATM fiber, an AAL5 CRC generator
- poor Fore firmware operation and performance
  - measured round-trip time approximately 160 µs
  - max bandwidth 13 MB/s using 4 KB packets (15.2 MB/s peak fiber bandwidth)
  - message data structure too complex
  - off-load backfired
- modification considerations
  - to virtualize the host-i960 interface such that multiple user processes can communicate with the i960 concurrently
  - to minimize the number of host and i960 accesses across the I/O bus
- Performance
  - round-trip time
    - 65 µs for a one-cell message due to the single-cell optimization
    - longer messages start at 120 µs for 48 bytes and cost roughly an extra 6 µs per additional cell (i.e., 48 bytes)
  - bandwidth
    - measured in MB/s for message sizes varying from 4 bytes to 5 KB
    - with packet sizes as low as 800 bytes, the fiber can be saturated
- Dynamic allocation of DMA address space
  - challenging problem (efficiently and simply)
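The round-trip figures above suggest a rough piecewise-linear cost model. The sketch below is only a reading of those numbers, not a formula from the paper: the function name is hypothetical, and the 40-byte limit for the single-cell case is an assumption (48-byte cell payload minus the 8-byte AAL5 trailer).

```python
import math

SINGLE_CELL_MAX = 40  # assumed largest message handled by the one-cell fast path


def raw_unet_rtt_us(size_bytes):
    """Approximate raw U-Net round-trip time in microseconds."""
    if size_bytes <= SINGLE_CELL_MAX:
        return 65  # one-cell message
    # 120 µs for a 48-byte message, plus about 6 µs per additional 48 bytes.
    extra_cells = max(0, math.ceil((size_bytes - 48) / 48))
    return 120 + 6 * extra_cells


for size in (32, 48, 96, 1024):
    print(size, raw_unet_rtt_us(size))
```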
U-Net AM implementation performance
- Generic Active Messages (GAM) 1.1 specification
  - a set of primitives
    - initialize the GAM interface
    - send request and reply messages
    - perform block gets and stores
  - provides reliable message delivery
- AM
  - low-cost RPC mechanism for parallel machines
  - header of the message is the user-space address of the message handler
  - provides reliable message-passing and flow control
- Four different micro-benchmarks
  - single-cell round-trip time
    - starts at 71 µs; the overhead over raw U-Net is about 6 µs
  - block transfer round-trip time
    - 135 µs + N × 0.2 µs
    - per-byte cost is higher than for raw U-Net
  - block store bandwidth
    - 80% of the AAL-5 limit with blocks of about 2 KB
    - dip in performance at 4164 bytes
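The Active Messages idea, where the message header names the handler to run on arrival, can be sketched in a few lines. This is a toy model and not the GAM 1.1 interface: the `Node` class and its methods are hypothetical, an index into a handler table stands in for the user-space handler address, and reliability and flow control are not modelled.

```python
from collections import deque


class Node:
    """Toy node whose incoming messages carry the id of the handler to invoke."""

    def __init__(self, name):
        self.name = name
        self.handlers = []    # handler table; the index stands in for an address
        self.inbox = deque()

    def register(self, fn):
        self.handlers.append(fn)
        return len(self.handlers) - 1

    def request(self, dest, handler_id, *args):
        """Send a message whose header is the handler id at the destination."""
        dest.inbox.append((handler_id, self, args))

    def poll(self):
        """Run the named handler for every pending message."""
        handled = 0
        while self.inbox:
            handler_id, src, args = self.inbox.popleft()
            self.handlers[handler_id](self, src, *args)
            handled += 1
        return handled


results = []


def add_handler(node, src, a, b, reply_id):
    node.request(src, reply_id, a + b)  # reply message back to the requester


def reply_handler(node, src, value):
    results.append(value)


a, b = Node("a"), Node("b")
add_id = b.register(add_handler)
reply_id = a.register(reply_handler)
a.request(b, add_id, 2, 3, reply_id)
b.poll()
a.poll()
print(results)
```

The receiver does no protocol parsing beyond reading the handler id, which is what keeps the mechanism cheap compared with a conventional RPC stack.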
Split-C application benchmarks
- Split-C
  - novel parallel language based on C
  - use of global pointers to access other processors' address spaces
  - assumes homogeneous address space across all nodes
  - implemented over AM
- seven programs
  - matrix multiply
    - uses matrices of 4x4 blocks with 128x128 double floats
    - the main loop multiplies two blocks while it prefetches the two blocks needed in the next iteration
  - CPU, bandwidth
  - small message, bulk transfers
TCP/IP and UDP/IP protocols
- bandwidth and latency as a function of application message size
- poor performance of kernelized UDP and TCP protocols in SunOS with the vendor-supplied ATM driver
  - max UDP bandwidth only for message sizes > 8 KB
  - TCP offers not more than 55% of the max achievable bandwidth
  - round-trip latency
    - small sizes: lower than Ethernet
- TCP and UDP over U-Net implementation
  - implemented for U-Net using the base-level U-Net functionality
  - close to the raw U-Net performance limits
  - goals
    - to support the implementation of traditional protocols
    - to create a test environment for traditional benchmarks
  - no need for a secure U-Net multiplexor
    - a single channel carries all IP traffic between two applications
- protocol execution environment
  - TCP/IP suite of protocols over high-speed networks
  - particular implementation, integration into the OS
  - Fore driver software
    - generic low-performance buffer strategies of the BSD-based kernel
  - use of generalized buffer and timer mechanisms
- Message handling and staging
  - remove all copy operations from the protocol path
  - allows the buffering and staging strategies to depend on the resources of the application instead of the scarce kernel network buffers
  - the restricted size of the socket receive buffer
    - max. 52 KB in SunOS
  - deficiencies in the BSD kernel buffer (mbuf) mechanism
    - saw-tooth behavior
    - alternative kernel buffering scheme
    - removing kernel-application copies ???
    - scatter-gather message mechanism
- Application-controlled flow control and feedback
  - integration of the communication subsystem into the application
  - the state of the transmission queues: back-pressure mechanism
  - retransmission counters, round-trip timers, buffer allocation statistics, etc.
- IP
  - IP over U-Net MTU: 9 KB
- UDP
  - additional layer of demultiplexing over IP based on port ids
  - some protection against corruption with a 16-bit checksum
- TCP
  - reliability: a simple acknowledgment scheme
  - flow control: the use of advertised receive windows
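The 16-bit checksum mentioned for UDP is the standard Internet checksum (one's complement sum of 16-bit words, as described in RFC 1071). The sketch below shows only that core computation; the function name is illustrative, and the pseudo-header that a real UDP implementation includes in the sum is omitted.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


sample = bytes.fromhex("0001f203f4f5f6f7")  # example words from RFC 1071
checksum = internet_checksum(sample)
print(hex(checksum))
# A receiver sums data plus checksum; an undamaged message yields zero.
print(internet_checksum(sample + checksum.to_bytes(2, "big")))
```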
- Bandwidth
  - U-Net TCP: 14-15 MB/s with an 8 KB window
  - kernel TCP/ATM combination: 10 MB/s with a 64 KB window
- TCP tuning
  - TCP over high-speed networks through WANs: high latency
  - tuning factors
    - size of the segments that are transmitted
      - 2048-byte segments
    - growth of the window size
    - bad ratio between the granularity of the protocol timers and the round-trip time estimates
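The link between window size, round-trip time, and bandwidth can be made explicit with the usual window-limited bound (at most one window of data in flight per round trip). This is a generic back-of-the-envelope sketch, not a calculation from the paper: the function names are illustrative, 1 MB is taken as 10^6 bytes, and 1 KB as 1024 bytes.

```python
def window_limited_throughput_mb_s(window_bytes, rtt_us):
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes / rtt_us  # bytes per microsecond equals MB/s


def max_rtt_for_throughput_us(window_bytes, target_mb_s):
    """Largest round-trip time that still allows the target throughput."""
    return window_bytes / target_mb_s


# An 8 KB window can only sustain 15 MB/s if the round trip stays short.
print(round(max_rtt_for_throughput_us(8 * 1024, 15.0), 1), "µs")
# A 64 KB window at 10 MB/s tolerates a much longer round trip.
print(round(max_rtt_for_throughput_us(64 * 1024, 10.0), 1), "µs")
```

Read this way, reaching 14-15 MB/s with an 8 KB window is only possible because the user-level path keeps the round trip to a few hundred microseconds, far below the granularity of coarse protocol timers.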