U-Net: A User-Level Network Interface

1
Proc. of the 15th ACM Symposium on Operating Systems Principles,
Copper Mountain, Colorado, December 3-6, 1995
U-Net: A User-Level Network Interface for Parallel and Distributed Computing
Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels
Cornell University
Presented by JaeMyoung KIM (959027)
2
Abstract

U-Net communication architecture
provides processes with a virtual view of a network interface to enable user-level access to high-speed communication devices
implemented on workstations using COTS ATM communication hardware
removes the kernel from the communication path
provides full protection
U-Net model
allows protocols to be built at user level whose performance is limited only by the capabilities of the network
flexible in the sense that TCP, UDP, and Active Messages (AM) can be implemented efficiently
U-Net prototype
8-node ATM cluster of standard workstations
65 µs round-trip latency
15 MB/s bandwidth (120 Mbps)
TCP performance at max network bandwidth
performance equivalent to the CS-2 and CM-5 supercomputers on a set of Split-C benchmarks

3
Introduction

Bottleneck shift
from the limited bandwidth of network fabrics -> to the software path
the kernel path involves several copies and crosses multiple levels of abstraction
elude the per-message overhead and concentrate on peak bandwidths of long data streams
video playback
DSM, RPC, remote object-oriented method invocations, and distributed cooperative file caches
Integrating application specific information into
protocol processing
allows for higher efficiency and greater
flexibility in protocol cost management.
move parts of the protocol processing into user
space
be able to efficiently utilize the network and to
couple the communication and computation
effectively
Goals
to remove the kernel completely from the critical
path
to allow the communication layers used by each
process to be tailored to its demands
Key issues
multiplexing the network among processes
providing protection
managing limited communication resources
designing an efficient yet versatile programming
interface
U-Net architecture on an off-the-shelf hardware
platform

4
Motivation and related work

U-Net architecture focus
provide low-latency communication in local settings
exploit the full network bandwidth even with small messages
facilitate the use of novel communication protocols
Low communication latencies
end-to-end latency = processing overhead + network latency
for large messages: transmission time dominates
for small messages: processing overhead dominates
examples
OO technology: about 100 bytes
electronic workplace: requests 20-80 bytes, replies 40-200 bytes
caching techniques in modern distributed systems
multi-round protocols
RPC style of interaction
NFS traffic: about 200 bytes
networks of workstations
Small-message bandwidth
provide full network bandwidth with as small messages as possible
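
The latency breakdown on this slide can be illustrated with a small model. This is only a sketch: the 50 µs overhead and 15 MB/s bandwidth used in the example calls are assumed values chosen for illustration, not measurements from the paper, and the function names are hypothetical.

```python
def end_to_end_latency_us(msg_bytes, overhead_us, bandwidth_mb_s):
    """End-to-end latency = processing overhead + transmission time (in µs)."""
    # bytes / (MB/s) gives microseconds when 1 MB = 10**6 bytes
    transmission_us = msg_bytes / bandwidth_mb_s
    return overhead_us + transmission_us


def overhead_fraction(msg_bytes, overhead_us, bandwidth_mb_s):
    """Share of the end-to-end latency spent in per-message processing."""
    total = end_to_end_latency_us(msg_bytes, overhead_us, bandwidth_mb_s)
    return overhead_us / total


# Assumed values: 50 µs overhead per message, 15 MB/s link.
small = overhead_fraction(100, overhead_us=50, bandwidth_mb_s=15)
large = overhead_fraction(10**6, overhead_us=50, bandwidth_mb_s=15)
```

Under these assumptions the overhead dominates the 100-byte message and is negligible for the 1 MB message, which is the point the slide makes about small versus large messages.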

Communication protocol and interface flexibility
lack of support for integrating kernel and application buffer management -> high processing overheads
new protocol design techniques
Application Level Framing
Integrated Layer Processing
compiler-assisted protocol development
blocking RPC
pre-allocate memory for the reply
Towards a new networking architecture
simply remove the kernel from the critical path of sending and receiving messages
eliminates system call overhead and streamlines buffer management at user level
approach
mux/demux directly in the network interface (NI)
move all buffer management and protocol processing to user level
goals
selecting a good virtual NI abstraction to present to processes
providing support for legacy protocols
enforcing protection without kernel intervention

Related work
message demultiplexor in the microkernel for the Mach 3 OS
application device channel abstraction at Univ. of Arizona
HP Bristol
CM-5, CS-2, SP-2, T3D
use custom hardware and are somewhat constrained to the controlled environment of a multiprocessor
Successive simplifications and generalizations of shared memory
Shrimp: memory-based network access model
unclear support for legacy protocols, long data streams, or RPC
U-Net design goals
focus on low latency and high bandwidth using
small messages
the emphasis on protocol design and integration
flexibility
the desire to meet the first two goals on widely
available standard workstations using
off-the-shelf communication hardware.

7
The user-level NI architecture

Sending and receiving messages
three main building blocks
endpoints: virtual device interface
communication segments
message queues: hold descriptors for messages

Sending messages
user constructs the message in the buffer area and pushes a Tx descriptor onto the send queue
Receiving messages
message arrives, data is placed in a buffer taken from the free queue, and an Rx descriptor is pushed onto the recv queue
polling or blocking wait
event-driven: upcall registration
upcall when the Rx queue is non-empty or almost full
a process can consume all pending messages in one upcall and can disable upcalls
Multiplexing and demultiplexing messages
uses a tag that depends on the network substrate (e.g., ATM VCI)
or a process id tag on a parallel machine
operating system service
determines the correct tag
route discovery, switch path setup, etc.
performs the necessary authentication and authorization checks
endpoints and communication channels
protection boundaries among multiple processes
boundaries extended across the network
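
Tag-based demultiplexing with protection can be sketched as a lookup table. This is an illustrative model only: the `Demultiplexer` class, its method names, and the use of integer process ids are assumptions made for the example, with `register` standing in for the operating system service that sets up a channel off the data path.

```python
class Demultiplexer:
    def __init__(self):
        # tag (e.g., an ATM VCI) -> (owning process id, receive queue)
        self.channels = {}

    def register(self, tag, owner_pid, authorized=True):
        """Channel setup: performed once, with authorization checks."""
        if not authorized:
            raise PermissionError("process may not open this channel")
        if tag in self.channels:
            raise ValueError("tag already in use")
        self.channels[tag] = (owner_pid, [])

    def deliver(self, tag, payload):
        """Data path: route an arriving message by its tag alone."""
        entry = self.channels.get(tag)
        if entry is None:
            return False  # unknown tag: drop the message
        entry[1].append(payload)
        return True

    def receive(self, tag, pid):
        """A process can only read from channels it owns."""
        owner, queue = self.channels[tag]
        if owner != pid:
            raise PermissionError("channel belongs to another process")
        return queue.pop(0) if queue else None
```

The design point is that the expensive checks happen in `register`, while `deliver` is a single table lookup.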
Zero-copy vs. true zero-copy
data can be sent from, or delivered directly into, the application data structures without intermediate buffering
depends on I/O bus addressing and NI functionality
base-level U-Net architecture: one copy
does not support the memory mapping
copy cost is not a dominant factor
direct-access U-Net architecture
can specify an offset within the destination communication segment
Base-level U-Net architecture

9
Two U-Net implementations

U-Net using the SBA-100
operates using programmed I/O to store cells into
a 36-cell deep output FIFO and to retrieve
incoming cells from a 292-cell deep input FIFO
hardware CRC calculation for the ATM cell header only
no DMA, no payload CRC calculation
implemented in the kernel by providing emulated
U-Net endpoints to the applications
a loadable device driver
user-level library implementing the AAL5 SAR
layer
evaluation
two 60 MHz SPARCstation-20s running SunOS 4.1.3
140 Mbit/s TAXI fibers leading to a Fore ASX-200 switch

end-to-end round-trip time of a single-cell message: 66 µs
bandwidth: 6.8 MBytes/s (54.4 Mbps) for packets of 1 KByte
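
The AAL5 segmentation that the user-level library performs can be sketched as follows. This is a simplified illustration rather than the library's code: it relies on the standard AAL5 framing (48-byte cell payloads, 8-byte trailer), the function names are hypothetical, and the CRC-32 field is left as zeros instead of being computed.

```python
CELL_PAYLOAD = 48  # payload bytes carried by one ATM cell
AAL5_TRAILER = 8   # UU (1) + CPI (1) + length (2) + CRC-32 (4)


def aal5_cells(payload_len):
    """Cells needed for one packet: payload + trailer, rounded up to whole cells."""
    return -(-(payload_len + AAL5_TRAILER) // CELL_PAYLOAD)


def segment(payload):
    """Split a packet into 48-byte cell payloads (CRC field zeroed in this sketch)."""
    n = aal5_cells(len(payload))
    pad = n * CELL_PAYLOAD - len(payload) - AAL5_TRAILER
    trailer = bytes(2) + len(payload).to_bytes(2, "big") + bytes(4)
    pdu = payload + bytes(pad) + trailer
    return [pdu[i:i + CELL_PAYLOAD] for i in range(0, len(pdu), CELL_PAYLOAD)]


def reassemble(cells):
    """Join cells and strip padding using the length field in the trailer."""
    pdu = b"".join(cells)
    length = int.from_bytes(pdu[-6:-4], "big")
    return pdu[:length]
```

With this framing a message of up to 40 bytes fits in a single cell, which is the case the single-cell round-trip figure refers to.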

U-Net using the SBA-200
uses custom firmware to implement the base-level
architecture directly on the SBA-200
SBA-200
a 25 MHz Intel i960 processor, 256 Kbytes of memory
a DMA-capable I/O bus (SBus) interface
a simple FIFO interface to the ATM fiber, an AAL5 CRC generator

poor Fore firmware operation and performance
measured round-trip time: approximately 160 µs
max bandwidth: 13 Mbytes/sec using 4 Kbyte packets (15.2 Mbytes/sec peak fiber bandwidth)
message data structures too complex
off-loading to the i960 backfired
modification considerations
to virtualize the host-i960 interface such that
multiple user processes can communicate with the
i960 concurrently
to minimize the number of host and i960 accesses
across the I/O bus directly
Performance
round-trip time
65 µs for a one-cell message due to the optimization
longer messages start at 120 µs for 48 bytes and cost roughly an extra 6 µs per additional cell (i.e., 48 bytes)
bandwidth
Mbytes/sec for message sizes varying from 4 bytes to 5 Kbytes
with packet sizes as low as 800 bytes, the fiber can be saturated
Dynamic allocation of DMA address space
challenging problem (efficiently and simply)
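
The round-trip figures for longer messages quoted on this slide amount to a simple linear model, sketched below. The function name is hypothetical, and the model follows the slide in counting 48 bytes per cell; it does not cover the optimized single-cell path.

```python
import math


def raw_rtt_us(nbytes, base_us=120.0, per_cell_us=6.0, cell_bytes=48):
    """Round-trip estimate: 120 µs for the first cell, about 6 µs per extra cell."""
    cells = max(1, math.ceil(nbytes / cell_bytes))
    return base_us + per_cell_us * (cells - 1)


rtt_one_cell = raw_rtt_us(48)
rtt_1k = raw_rtt_us(1024)  # 22 cells under this model
```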

11
U-Net AM implementation performance

Generic Active Messages (GAM) 1.1 specification
a set of primitives
initialize the GAM interface
send request and reply messages
perform block gets and stores.
provide reliable message delivery
AM
low-cost RPC mechanism for parallel machines
header of message is user-space address of msg
handler
provides reliable message-passing and flow
control
Four different micro-benchmarks
single-cell round-trip time
starts at 71 µs; the overhead over raw U-Net is about 6 µs
block transfer round-trip time
135 µs + N × 0.2 µs (N = block size in bytes)
per-byte cost is higher than for raw U-Net
block store bandwidth
80% of the AAL-5 limit with blocks of about 2 Kbytes
dip in performance at 4164 bytes
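
The Active Message idea that each message names its own handler can be sketched as below. This is a toy model, not the GAM 1.1 interface: the `Node` class is hypothetical, an index into a handler table stands in for the handler's user-space address, and reliability and flow control are left out.

```python
class Node:
    def __init__(self):
        self.handlers = []  # index stands in for a handler's address
        self.inbox = []
        self.peer = None

    def register(self, fn):
        self.handlers.append(fn)
        return len(self.handlers) - 1

    def send(self, handler_id, *args):
        """The message header names the handler to run on arrival."""
        self.peer.inbox.append((handler_id, args))

    def poll(self):
        """Run the named handler for every pending message."""
        while self.inbox:
            handler_id, args = self.inbox.pop(0)
            self.handlers[handler_id](self, *args)


results = []


def add_request(node, reply_id, a, b):
    node.send(reply_id, a + b)  # request handler answers with a reply message


def add_reply(node, total):
    results.append(total)


a, b = Node(), Node()
a.peer, b.peer = b, a
for n in (a, b):  # same registration order on both nodes, so ids match
    ADD = n.register(add_request)
    REPLY = n.register(add_reply)

a.send(ADD, REPLY, 2, 3)
b.poll()  # b runs add_request, which sends the reply back to a
a.poll()  # a runs add_reply
```

Registering the handlers in the same order on every node mirrors the assumption that all nodes run the same program, so a handler reference is valid everywhere.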

12
Split-C application benchmarks

Split-C
novel parallel language based on C
uses global pointers to access other processors' address spaces
assumes a homogeneous address space across all nodes
implemented over AM
seven programs
matrix multiply
uses matrices of 4x4 blocks with 128x128 double
floats
The main loop multiplies two blocks while it
prefetches the two blocks needed in the next
iteration
CPU, bandwidth
small message, bulk transfers
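
The matrix multiply main loop, which multiplies two blocks while prefetching the next two, can be sketched as below. This is not Split-C: a thread pool stands in for split-phase remote fetches, the `fetch_a` and `fetch_b` callables are hypothetical, and plain Python lists replace the 128x128 blocks of doubles.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Multiply two small dense blocks given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def block_product(fetch_a, fetch_b, n_blocks):
    """Compute sum over k of A_k * B_k, fetching blocks k+1 while multiplying blocks k."""
    acc = None
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = (pool.submit(fetch_a, 0), pool.submit(fetch_b, 0))
        for k in range(n_blocks):
            a_blk, b_blk = pending[0].result(), pending[1].result()
            if k + 1 < n_blocks:
                # Issue the fetches for the next iteration before computing.
                pending = (pool.submit(fetch_a, k + 1), pool.submit(fetch_b, k + 1))
            prod = matmul(a_blk, b_blk)
            acc = prod if acc is None else add(acc, prod)
    return acc


A = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
B = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
C = block_product(lambda k: A[k], lambda k: B[k], 2)
```

The ordering is the point of the sketch: the fetch for iteration k+1 is issued before the multiplication for iteration k starts, so communication can overlap computation.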

13
TCP/IP and UDP/IP protocols

bandwidth and latency as a function of application message size
poor performance of kernelized UDP and TCP protocols in SunOS with the vendor-supplied ATM driver
max UDP bandwidth reached only for messages > 8 KB
max achievable TCP bandwidth: not more than 55%
round-trip latency
small size lower than Ethernet

TCP and UDP over U-Net implementation
implemented for U-Net using the base-level U-Net
functionality
close to the raw U-Net performance limits
goals
to support the implementation of traditional
protocols
to create a test environment for traditional
benchmarks
no need of secure U-Net multiplexor
a single channel to carry all IP traffic between
two applications
protocol execution environment
TCP/IP suite of protocols over high-speed network
particular implementation integration into OS
Fore driver software
generic low-performance buffer strategies of the
BSD based kernel
use of generalized buffer and timer mechanisms

Message handling and staging
remove all copy operations from the protocol path
allows the buffering and staging strategies to depend on the resources of the application instead of the scarce kernel network buffers

the restricted size of the socket receive buffer
max. 52Kbytes in SunOS
deficiencies in the BSD kernel buffer (mbuf)
mechanism
Saw-tooth behavior
alternative kernel buffering scheme
removing kernel-application copies ???
scatter-gather message mechanism

Application controlled flow-control and feedback
integration of the comm. subsystem into the
application
the state of the transmission queues
back-pressure mechanism
retransmission counters, round-trip timers,
buffer allocation statistics, etc
IP
IP over U-Net: MTU 9 KB
UDP
additional layer of demultiplexing over IP based on port ids
some protection against corruption with a 16-bit checksum
TCP
reliability: a simple acknowledgment scheme
flow control: the use of advertised receive windows
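
The 16-bit checksum mentioned for UDP above is the standard Internet checksum, a one's-complement sum of 16-bit words. The sketch below shows only that sum; a real UDP checksum also covers a pseudo-header, which is omitted here, and the function name is chosen for this example.

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement of the one's-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


sample = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(sample)
```

A receiver can verify a message by running the same function over the data with the checksum appended; an undamaged message yields zero.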

Bandwidth
U-Net TCP: 14-15 MB/s with an 8 Kbyte window
kernel TCP/ATM combination: 10 MB/s with a 64 Kbyte window
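
One way to read these window sizes is through the general bound that a window-based protocol cannot send more than one window per round trip. The sketch below applies that bound; it is a back-of-the-envelope model with hypothetical function names, not an analysis taken from the slides.

```python
def window_limited_mb_s(window_bytes, rtt_us):
    """Upper bound on throughput: one window per round trip (bytes/µs == MB/s)."""
    return window_bytes / rtt_us


def max_rtt_us(window_bytes, target_mb_s):
    """Largest round-trip time that still allows the target throughput."""
    return window_bytes / target_mb_s


rtt_budget_8k = max_rtt_us(8 * 1024, 15)    # 8 Kbyte window at 15 MB/s
rtt_budget_64k = max_rtt_us(64 * 1024, 10)  # 64 Kbyte window at 10 MB/s
```

Under this bound, sustaining about 15 MB/s with an 8 Kbyte window requires a round trip of roughly 546 µs or less, which suggests why low latency lets a small window reach high bandwidth.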

TCP tuning
TCP over high-speed networks thru WAN - high
latency
tuning factors
size of the segments that are transmitted
2048 byte segments
growth of the window size
bad ratio between the granularity of the protocol
timers and the round-trip time estimates.
