U-Net: A User-Level Network Interface (transcript)

1
Proc. of the 15th ACM Symposium on Operating
Systems Principles, Copper Mountain, Colorado,
December 3-6, 1995
U-Net: A User-Level Network Interface for
Parallel and Distributed Computing
Thorsten von Eicken, Anindya Basu, Vineet Buch,
and Werner Vogels
Cornell University
Presented by JaeMyoung KIM (959027)
2
Abstract
  • U-Net communication architecture
  • provides processes with a virtual view of the
    network interface to enable user-level access to
    high-speed communication devices
  • implemented on standard workstations using
    off-the-shelf (COTS) ATM communication hardware
  • remove the kernel from the comm. path
  • provide full protection
  • U-Net model
  • allow for the construction of protocols at
    user level whose performance is only limited by
    the capabilities of network
  • flexible in the sense that TCP, UDP, and AM can
    be implemented efficiently
  • U-Net prototype
  • 8-node ATM cluster of standard workstations
  • 65 µs round-trip latency
  • 15 MB/s (120 Mbps) bandwidth
  • TCP performance at max network bandwidth
  • performance equivalent to CS-2 and CM-5
    supercomputers on a set of Split-C benchmarks

3
Introduction
  • Bottleneck shift
  • the limited bandwidth of network fabrics ->
    the software path
  • the kernel involves several copies and crosses
    multiple levels of abstraction
  • existing systems elude the per-message overhead by
    concentrating on peak bandwidth of long data
    streams (e.g., video playback)
  • DSM, RPC, remote object-oriented method
    invocations, and distributed cooperative file
    cache systems
  • Integrating application specific information into
    protocol processing
  • allows for higher efficiency and greater
    flexibility in protocol cost management.
  • move parts of the protocol processing into user
    space
  • be able to efficiently utilize the network and to
    couple the communication and computation
    effectively
  • Goals
  • to remove the kernel completely from the critical
    path
  • to allow the communication layers used by each
    process to be tailored to its demands
  • Key issues
  • multiplexing the network among processes
  • providing protection
  • managing limited communication resources
  • designing an efficient yet versatile programming
    interface
  • U-Net architecture on an off-the-shelf hardware
    platform

4
Motivation and related work
  • U-Net architecture focusing
  • provide low-latency communication in local
    settings
  • exploit the full network bandwidth with small
    messages
  • facilitate the use of novel communication
    protocols
  • low communication latencies
  • end-to-end latency = processing overhead + network
    latency
  • dominated by transmission time for large messages
  • dominated by processing overhead for small messages
  • examples
  • object-oriented technology: messages on the order
    of 100 bytes
  • electronic workplace: requests of 20-80 bytes,
    replies of 40-200 bytes
  • caching techniques in modern distributed systems
  • multi-round protocols
  • RPC style of interaction
  • RFS/NFS traffic: around 200 bytes
  • networks of workstations
  • small-message bandwidth
  • provide the full network bandwidth with messages
    as small as possible

5
  • Communication protocol and interface flexibility
  • lack of support for the integration of kernel and
    application buffer management -> high processing
    overheads
  • new protocol design technique
  • Application Level Framing
  • Integrated Layer Processing
  • Compiler assisted protocol development
  • blocking RPC
  • pre-allocate memory for the reply
  • Towards a new networking architecture
  • to simply remove the kernel from the critical
    path of sending and receiving messages
  • removes system call overhead and streamlines
    buffer management at user level
  • approach
  • to mux/demux directly into the network interface
    (NI)
  • to move all buffer mgt and protocol processing to
    user-level
  • goals
  • selecting a good virtual NI abstraction to
    present to processes
  • providing support for legacy protocols
  • enforcing protection without kernel intervention

6
  • Related work
  • message demultiplexor in the microkernel for
    Mach3 OS
  • application device channel abstraction at Univ.
    of Arizona
  • HP Bristol
  • CM-5, CS-2, SP-2, T3D
  • custom hardware and are somewhat constrained to
    the controlled environment of a multiprocessor
  • Successive simplifications and generalizations of
    shared memory
  • SHRIMP memory-based network access model
  • legacy protocols, long data streams, or RPC
  • U-Net design goals
  • focus on low latency and high bandwidth using
    small messages
  • the emphasis on protocol design and integration
    flexibility
  • the desire to meet the first two goals on widely
    available standard workstations using
    off-the-shelf communication hardware.

7
The user-level NI architecture
  • Sending and receiving messages
  • three main building blocks
  • endpoints: a process's virtual view of the network
    device interface
  • communication segments: memory regions holding
    message data
  • message queues (send, receive, free): hold
    descriptors for messages
  • Sending messages
  • the user constructs the message in the buffer area
    and pushes a Tx descriptor onto the send queue
  • Receiving messages
  • when a message arrives, the NI places the data in a
    buffer taken from the free queue and pushes an Rx
    descriptor onto the receive queue (sketched below)
  • polling or blocking wait
  • event-driven: upcall registration
  • upcall when the Rx queue becomes non-empty or
    almost full
  • processes disable further upcalls while all pending
    messages are serviced
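
Below is a minimal sketch, in C, of how the three building blocks and the send/receive flow just described might look. The structure layouts and names (unet_endpoint_t, unet_send, etc.) are hypothetical illustrations, not the actual device interface.

#include <stdint.h>

#define SEG_SIZE  (64 * 1024)   /* communication segment: pinned buffer area */
#define QUEUE_LEN 64

/* A descriptor names a buffer inside the communication segment. */
typedef struct {
    uint32_t offset;    /* buffer offset within the segment  */
    uint32_t length;    /* message length in bytes           */
    uint32_t channel;   /* tag, e.g. mapped to an ATM VCI    */
} unet_desc_t;

typedef struct {
    unet_desc_t entries[QUEUE_LEN];
    volatile uint32_t head, tail;   /* consumer / producer indices */
} unet_queue_t;

/* An endpoint = communication segment + send, receive, and free queues. */
typedef struct {
    uint8_t      segment[SEG_SIZE];
    unet_queue_t send_q;   /* user pushes Tx descriptors here          */
    unet_queue_t recv_q;   /* NI pushes Rx descriptors here            */
    unet_queue_t free_q;   /* user pre-posts free buffers for arrivals */
} unet_endpoint_t;

/* Send: build the message in the segment, then push a Tx descriptor
 * onto the send queue (queue-full handling omitted). */
static void unet_send(unet_endpoint_t *ep, uint32_t off, uint32_t len,
                      uint32_t chan)
{
    unet_desc_t d = { off, len, chan };
    uint32_t t = ep->send_q.tail;
    ep->send_q.entries[t % QUEUE_LEN] = d;
    ep->send_q.tail = t + 1;            /* NI consumes from the head */
}

/* Receive (polling): the NI has already copied the data into a buffer
 * taken from the free queue and queued an Rx descriptor. */
static int unet_poll_recv(unet_endpoint_t *ep, unet_desc_t *out)
{
    if (ep->recv_q.head == ep->recv_q.tail)
        return 0;                       /* no message pending */
    *out = ep->recv_q.entries[ep->recv_q.head % QUEUE_LEN];
    ep->recv_q.head++;                  /* caller reads segment[out->offset],
                                           then re-posts the buffer on free_q */
    return 1;
}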

8
  • Multiplexing and demultiplexing messages
  • uses a tag that depends on the network substrate
    (e.g., the ATM VCI)
  • the tag serves as a parallel-process id
  • operating system service
  • determines the correct tag
  • route discovery, switch path setup, etc.
  • performs the necessary authentication and
    authorization checks
  • endpoints and communication channels
  • protection boundaries among multiple processes
  • extended boundaries across the network
  • Zero-copy vs. true zero-copy
  • true zero-copy: data is sent from and received into
    application data structures directly, without
    intermediate buffering
  • requires the NI to address all of host memory over
    the I/O bus, plus additional NI functionality
  • base-level U-Net architecture: one copy
  • does not require memory mapping support
  • the copy into/out of the communication segment is
    not a dominant cost factor
  • direct-access U-Net architecture: true zero-copy
  • sender specifies an offset into the destination
    communication segment
  (Figure: base-level U-Net architecture)
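
Continuing the hypothetical sketch from the previous slide, the snippet below illustrates two of the points above: a direct-access send descriptor that also names a destination offset (true zero-copy), and demultiplexing of arriving messages by channel tag so that protection is enforced without the kernel. All names and layouts remain illustrative assumptions.

/* Direct-access send descriptor (hypothetical layout): besides the data
 * location in the sender's segment, it names an offset in the receiver's
 * communication segment, so the NI can deposit the data in place. */
typedef struct {
    uint32_t src_offset;   /* data location in the sender's segment         */
    uint32_t length;
    uint32_t channel;      /* tag, e.g. the ATM VCI, checked for protection */
    uint32_t dst_offset;   /* target offset in the receiver's segment       */
} unet_da_desc_t;

/* Demultiplexing: the tag selects an endpoint that the OS authorized when
 * the channel was created; an unknown tag means the message is dropped. */
static unet_endpoint_t *unet_demux(unet_endpoint_t *channels[],
                                   uint32_t ntags, uint32_t tag)
{
    return (tag < ntags) ? channels[tag] : 0;
}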

9
Two U-Net implementations
  • U-Net using the SBA-100
  • operates using programmed I/O to store cells into
    a 36-cell deep output FIFO and to retrieve
    incoming cells from a 292-cell deep input FIFO
  • HW CRC calculation
  • no DMA, no payload CRC calculation
  • implemented in the kernel by providing emulated
    U-Net endpoints to the applications
  • a loadable device driver
  • user-level library implementing the AAL5 SAR
    layer
  • evaluation
  • two 60 MHz SPARCstation-20s running SunOS 4.1.3
  • 140 Mbit/s TAXI fibers leading to a Fore ASX-200
    switch
  • end-to-end round-trip time of a single-cell
    message: 66 µs
  • bandwidth: 6.8 MBytes/s (54.4 Mbps) for packets of
    1 KByte
  • U-Net using the SBA-200
  • uses custom firmware to implement the base-level
    architecture directly on the SBA-200
  • SBA-200
  • a 25 MHz Intel i960 processor, 256 KBytes of memory
  • a DMA-capable I/O bus (SBus) interface
  • a simple FIFO interface to the ATM fiber, and an
    AAL5 CRC generator

10
  • Fore firmware: poor operation and performance
  • measured round-trip time of approximately 160 µs
  • max bandwidth of 13 MBytes/sec using 4 KByte
    packets (peak fiber bandwidth is 15.2 MBytes/sec)
  • message data structures too complex
  • protocol off-load to the i960 backfired
  • modification considerations
  • to virtualize the host-i960 interface such that
    multiple user processes can communicate with the
    i960 concurrently
  • to minimize the number of host and i960 accesses
    across the I/O bus
  • Performance
  • round-trip time
  • 65 µs for a single-cell message, thanks to the
    optimizations
  • longer messages start at 120 µs for 48 bytes and
    cost roughly an extra 6 µs per additional cell
    (i.e., per 48 bytes); see the cost model below
  • bandwidth
  • measured (in MBytes/sec) for message sizes varying
    from 4 bytes to 5 KBytes
  • with packet sizes as low as 800 bytes, the fiber
    can be saturated
  • Dynamic allocation of DMA address space
  • challenging problem (efficiently and simply)
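
The round-trip numbers above can be summarized by a rough cost model. This is a back-of-envelope fit to the figures on this slide, not a formula stated in the paper; n is the message size in bytes and each cell carries 48 bytes of payload:

T_{rtt}(n) \approx 120\,\mu\mathrm{s} + 6\,\mu\mathrm{s}\cdot\left(\left\lceil n/48 \right\rceil - 1\right), \qquad n \ge 48

For example, a 1 KByte message occupies ceil(1024/48) = 22 cells, so the model predicts roughly 120 + 6 × 21 = 246 µs round trip.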

11
U-Net AM implementation performance
  • Generic Active Messages (GAM) 1.1 specification
  • a set of primitives
  • initialize the GAM interface
  • send request and reply messages
  • perform block gets and stores.
  • provide reliable message delivery
  • AM
  • low-cost RPC mechanism for parallel machines
  • the message header carries the user-space address
    of the message handler (see the sketch below)
  • provides reliable message passing and flow control
  • Four different micro-benchmarks
  • single-cell round-trip time
  • starts at 71 µs; the overhead over raw U-Net is
    about 6 µs
  • block transfer round-trip time
  • roughly 135 µs + N × 0.2 µs for N-byte blocks
  • per-byte cost is higher than for raw U-Net
  • block store bandwidth
  • 80% of the AAL-5 limit with blocks of about
    2 KBytes
  • dip in performance at 4164 bytes
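
To make the AM mechanism above concrete, here is a hedged C sketch: the message header carries the address of a user-level handler, and delivery is just an indirect call through that header. The type and function names mimic the GAM style but are assumptions, not the exact GAM 1.1 signatures.

#include <stdint.h>
#include <stddef.h>

/* Handler type: invoked at the receiver with the sender's node id and
 * the (small) argument block carried in the message. */
typedef void (*am_handler_t)(int src_node, void *args, size_t len);

/* An Active Message: the header is essentially the handler address,
 * which is what keeps dispatch so cheap. */
typedef struct {
    am_handler_t handler;
    uint8_t      args[32];   /* small inline arguments */
    size_t       len;
} am_msg_t;

/* Receiver side: dispatch is a single indirect call through the header. */
static void am_dispatch(int src_node, am_msg_t *m)
{
    m->handler(src_node, m->args, m->len);
}

/* Example request handler: a remote counter increment.  In GAM, a request
 * handler would also issue the matching reply message from here. */
static int counter;
static void incr_handler(int src_node, void *args, size_t len)
{
    (void)src_node; (void)args; (void)len;
    counter++;
}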

12
Split-C application benchmarks
  • Split-C
  • novel parallel language based on C
  • uses global pointers to access other processors'
    address spaces
  • assumes homogeneous address space across all
    nodes
  • implemented over AM
  • seven benchmark programs
  • matrix multiply
  • uses matrices of 4x4 blocks with 128x128 double
    floats
  • the main loop multiplies two blocks while it
    prefetches the two blocks needed in the next
    iteration (see the sketch below)
  • benchmarks differ in CPU vs. bandwidth demands
  • and in small-message vs. bulk-transfer communication
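
The sketch below illustrates, in C, the overlap pattern the matrix multiply relies on: multiply the current pair of blocks while the blocks for the next iteration are being fetched. get_block/wait_block stand in for Split-C's split-phase global-pointer operations and are hypothetical stubs, not the benchmark's actual code.

#define B 4   /* 4x4 blocks of double floats */

typedef struct { double v[B][B]; } block_t;

/* Hypothetical split-phase remote fetch, standing in for a Split-C
 * global-pointer get: starts the transfer and returns immediately.
 * (No-op stubs here, so the sketch compiles stand-alone.) */
static void get_block(block_t *dst, int node, int row, int col)
{ (void)dst; (void)node; (void)row; (void)col; }
static void wait_block(block_t *dst) { (void)dst; }

/* Multiply-accumulate one pair of blocks: C += A * B. */
static void block_mul(block_t *c, const block_t *a, const block_t *b)
{
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++)
            for (int k = 0; k < B; k++)
                c->v[i][j] += a->v[i][k] * b->v[k][j];
}

/* Compute output block (i,j): work on the current pair of blocks while
 * the pair needed in the next iteration is prefetched, overlapping
 * communication with computation.  *c is assumed zero-initialized. */
static void mm_block(block_t *c, int nblocks, int anode, int bnode,
                     int i, int j)
{
    block_t a[2], b[2];
    int cur = 0;
    get_block(&a[cur], anode, i, 0);
    get_block(&b[cur], bnode, 0, j);
    for (int k = 0; k < nblocks; k++) {
        int nxt = cur ^ 1;
        if (k + 1 < nblocks) {           /* start the next fetch early */
            get_block(&a[nxt], anode, i, k + 1);
            get_block(&b[nxt], bnode, k + 1, j);
        }
        wait_block(&a[cur]);
        wait_block(&b[cur]);
        block_mul(c, &a[cur], &b[cur]);  /* compute while fetches run */
        cur = nxt;
    }
}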

13
TCP/IP and UDP/IP protocols
  • bandwidth and latency as a function of appl.
    message size.
  • poor performance of kernelized UDP and TCP
    protocols in SunOS with the vendor-supplied ATM
    driver
  • max bandwidth of UDP reached only for messages
    larger than 8 KB
  • max achievable bandwidth of TCP not more than
    55%
  • round-trip latency
  • small size lower than Ethernet
  • TCP and UDP over U-Net implementation
  • implemented for U-Net using the base-level U-Net
    functionality
  • close to the raw U-Net performance limits
  • goals
  • to support the implementation of traditional
    protocols
  • to create a test environment for traditional
    benchmarks
  • no need of secure U-Net multiplexor
  • a single channel to carry all IP traffic between
    two applications
  • protocol execution environment matters
  • performance of the TCP/IP suite over a high-speed
    network depends on the particular implementation
    and its integration into the OS
  • Fore driver software
  • inherits the generic, low-performance buffer
    strategies of the BSD-based kernel
  • uses generalized buffer and timer mechanisms

14
  • Message handling and staging
  • remove all copy operations from the protocol path
  • allows buffering and staging strategies to depend
    on the needs of the application instead of on the
    scarce kernel network buffers
  • the restricted size of the socket receive buffer
    (max. 52 KBytes in SunOS)
  • deficiencies in the BSD kernel buffer (mbuf)
    mechanism
  • Saw-tooth behavior
  • alternative kernel buffering scheme
  • removing kernel-application copies ???
  • scatter-gather message mechanism
  • Application controlled flow-control and feedback
  • integration of the comm. subsystem into the
    application
  • the state of the transmission queues serves as a
    back-pressure mechanism
  • retransmission counters, round-trip timers,
    buffer allocation statistics, etc
  • IP
  • IP over U-Net uses an MTU of 9 KB
  • UDP
  • adds a layer of demultiplexing over IP based on
    port ids (see the sketch below)
  • provides some protection against corruption with a
    16-bit checksum
  • TCP
  • reliability via a simple acknowledgment scheme
  • flow control via advertised receive windows
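
A minimal sketch of the extra demultiplexing step UDP adds on top of IP, as listed above: dispatch by destination port, with the 16-bit checksum giving some protection against corruption. The header layout is standard UDP; the socket table and function names are hypothetical.

#include <stdint.h>
#include <stddef.h>

/* Standard UDP header: ports, length, 16-bit checksum. */
typedef struct {
    uint16_t src_port;
    uint16_t dst_port;
    uint16_t length;     /* header + payload, in bytes */
    uint16_t checksum;   /* some protection against corruption */
} udp_hdr_t;

/* One delivery function per bound port (hypothetical socket table). */
typedef void (*udp_deliver_t)(const uint8_t *payload, uint16_t len);
static udp_deliver_t port_table[65536];

/* Demultiplex an arriving datagram to the socket bound to its destination
 * port; checksum verification is omitted for brevity. */
static int udp_demux(const udp_hdr_t *h, const uint8_t *payload)
{
    udp_deliver_t deliver = port_table[h->dst_port];
    if (!deliver)
        return -1;   /* no socket bound to this port: drop the datagram */
    deliver(payload, (uint16_t)(h->length - sizeof(udp_hdr_t)));
    return 0;
}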

15
  • Bandwidth
  • U-Net TCP: 14-15 MB/s with an 8 KByte window
  • kernel TCP/ATM combination: 10 MB/s with a 64 KB
    window
  • TCP tuning
  • TCP over high-speed networks has mostly been tuned
    for WANs with high latency
  • tuning factors
  • size of the segments that are transmitted
  • 2048-byte segments
  • growth of the window size
  • bad ratio between the granularity of the protocol
    timers and the round-trip time estimates