Title: Internet Scale Overlay Hosting
1. Internet Scale Overlay Hosting
Jon Turner, with Patrick Crowley, John DeHart, Brandon Heller, Fred Kuhns, Sailesh Kumar, John Lockwood, Jing Lu, Mike Wilson, Charlie Wiseman and Dave Zar
2. Overview
- Overlay networks are a key tool for overcoming Internet limitations
  - CDNs use overlay methods to enhance performance
  - also useful for voice, video streaming, multi-player games
  - myriad other capabilities demonstrated using overlays
- Overlay hosting services can enable more widespread use of overlays
  - PlanetLab has demonstrated the potential in the research space
  - nothing exactly comparable yet in the commercial space
  - although utility computing is similar in spirit
- Need more integrated and scalable platforms
  - internet-scale traffic volumes
  - low latency for delay-sensitive traffic
  - flexible resource allocation and traffic isolation
3. Overlay Hosting Service
[Figure: overlay nodes hosted on shared hosting platforms form overlay networks over a provisioned backbone, with access via the Internet]
- Flexible platforms shared by multiple overlays
- Provisioned backbone; Internet used for access
4. Overlay Hosting Platform
- Processing Engines (PEs) implement overlay nodes
  - GPE: conventional server blade
  - NPE: network processor blade
    - nearly 4 Mp/s per NP vs. 50 Kp/s
    - about 100 µs latency vs. 1-300 ms
  - shared or dedicated
- IO cards terminate external links and mux/demux streams (see the sketch after this list)
- Shared PEs are managed by the substrate
- Dedicated PEs may be fully controlled by the overlay
  - switch and IO cards provide protection and isolation
- PEs in larger overlay nodes are linked by a logical switch
  - allows scaling up for higher throughput
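To make the IO card's mux/demux role concrete, here is a minimal sketch, assuming UDP-tunnel encapsulation on external links (as used later in the evaluation) and a hypothetical demux table keyed by destination UDP port; the structure and function names are illustrative, not the SPP line-card code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical demux entry: maps an external (UDP tunnel) flow to the
 * processing engine and per-slice queue that host the corresponding overlay. */
struct demux_entry {
    uint16_t udp_dst_port;   /* tunnel port assigned to the slice */
    uint8_t  pe_id;          /* GPE or NPE hosting the slice      */
    uint8_t  queue_id;       /* per-slice queue on that PE        */
};

#define DEMUX_TABLE_SIZE 256
static struct demux_entry demux_table[DEMUX_TABLE_SIZE];
static size_t demux_entries;

/* Find the PE/queue for an arriving tunnel packet.  A real line card would
 * use a TCAM or hash lookup; linear search keeps the sketch short. */
static const struct demux_entry *demux_lookup(uint16_t udp_dst_port)
{
    for (size_t i = 0; i < demux_entries; i++)
        if (demux_table[i].udp_dst_port == udp_dst_port)
            return &demux_table[i];
    return NULL;   /* unknown flow: drop, or hand to a default slow path */
}
```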
5. PlanetLab
- Canonical overlay hosting service, built on a PC platform
- Applications run as user-space processes in virtual machines
- Effective and important research testbed
- But low throughput and widely variable latency limit its potential as a service deployment platform
[Figure notes: slice descriptions are obtained from the PlanetLab database; applications are standard socket programs; VMs are scheduled using token buckets (sketched below)]
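The token-bucket VM scheduling noted above is a standard mechanism; a minimal, generic sketch of it (not PlanetLab's actual scheduler code) follows. Each VM or slice would get its own bucket and is eligible to run only while the consume operation succeeds.

```c
#include <stdint.h>
#include <stdbool.h>

/* Generic token bucket: 'rate' tokens are added per second up to 'depth';
 * a VM (or packet) may run/send only if enough tokens remain. */
struct token_bucket {
    double   tokens;       /* current fill level       */
    double   rate;         /* tokens added per second  */
    double   depth;        /* maximum number of tokens */
    uint64_t last_ns;      /* time of the last refill  */
};

static bool tb_consume(struct token_bucket *tb, uint64_t now_ns, double cost)
{
    double elapsed = (now_ns - tb->last_ns) / 1e9;
    tb->last_ns = now_ns;

    tb->tokens += elapsed * tb->rate;          /* refill */
    if (tb->tokens > tb->depth)
        tb->tokens = tb->depth;                /* cap at bucket depth */

    if (tb->tokens < cost)
        return false;                          /* not eligible yet */
    tb->tokens -= cost;
    return true;                               /* schedule the VM / send */
}
```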
6Supercharging PlanetLab
slow-path runs in a standard PlanetLab environment
exceptional packets forwarded to slow-path
existing PlanetLab applications can run unchanged
on GPE
fast-path handles most traffic
fast-path runs on a network processor
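A rough sketch of the per-packet dispatch implied above: traffic the network processor can handle stays on the fast-path, while anything exceptional is handed up to the slice's unchanged slow-path on the GPE. The hook names are hypothetical; the real fast-path is slice-specific NP code.

```c
#include <stdbool.h>

struct packet;   /* opaque packet handle on the NP */

/* Hypothetical hooks standing in for slice-specific fast-path code. */
bool fastpath_classify(struct packet *p);   /* true if handled here      */
void fastpath_forward(struct packet *p);    /* rewrite + enqueue output  */
void send_to_slowpath(struct packet *p);    /* deliver to GPE process    */

/* Per-packet dispatch: most traffic stays on the fast path; anything the
 * fast path cannot handle (unknown flow, control message, unusual header)
 * is forwarded to the unchanged PlanetLab application on the GPE. */
void npe_dispatch(struct packet *p)
{
    if (fastpath_classify(p))
        fastpath_forward(p);
    else
        send_to_slowpath(p);
}
```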
7SPP Components
conventional server which coordinates system
components and synchronizes with PlanetLab
conventional server blades supporting standard
PlanetLab environment
blade containing 10GE data switch and 1GE control
switch
dual Intel IXP 2850 blade which forwards packets
to correct PEs
dual Intel IXP 2850 blades supporting application
fast-paths
8. ATCA Boards
- Radisys switch blade
  - up to 16-slot chassis
  - 10 GbE fabric switch
  - 1 GbE control switch
  - full VLAN support
  - scaling up: 5x10 GbE to front, 2 more to back
- Radisys NP blades (for LC and NPE)
  - dual IXP 2850 NPs
  - 3x RDRAM, 4x SRAM
  - shared TCAM
  - 2x10 GbE to backplane
  - 10x1 GbE external IO (or 1x10 GbE)
- Intel server blades (for CP and GPE)
  - dual Xeons (2 GHz)
  - 4x1 GbE
  - on-board disk
  - Advanced Mezzanine Card slot
9. What You Need to Build Your Own
10. IXP 2850 Overview
- 16 multi-threaded MicroEngines (MEs)
  - 8 thread contexts with rapid switching capability
  - fast nearest-neighbor connections for pipelined apps
- 3 SDRAM and 4 SRAM channels (optional TCAM)
- Management Processor (MP) for control
11. Pipelining and Multi-threading
- Limited program store per ME
  - parallelize by dividing the program among pipeline stages
- Use multi-threading to hide memory latency (sketched below)
  - high latency to off-chip memory (>100 cycles)
  - modest locality of reference in network workloads
  - interleave memory accesses to keep the processor busy
  - sequenced hand-offs between threads maintain packet order
  - works well when processing time variation is limited
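A hedged sketch of the latency-hiding pattern, written in plain C with stand-in primitives rather than real IXP microengine intrinsics: each thread of a pipeline stage issues an asynchronous memory read, swaps out while it completes, and hands off to the next thread in a fixed order so packets stay in sequence.

```c
#include <stdint.h>

struct packet;

/* Illustrative primitives standing in for microengine hardware support:
 * an asynchronous memory read, a context swap while it completes, and
 * inter-thread signals used to serialize hand-offs between threads. */
void mem_read_async(void *dst, uint64_t addr, unsigned len);
void swap_out_until_read_done(void);     /* yield; another thread runs   */
void wait_for_signal(int sig);           /* wait for the previous thread */
void send_signal(int sig);               /* release the next thread      */

struct packet *next_packet(void);
uint64_t lookup_addr(const struct packet *p);
void pass_to_next_stage(struct packet *p, uint32_t result);

/* One thread of one pipeline stage.  While this thread waits 100+ cycles
 * for off-chip memory, the other thread contexts of the same MicroEngine
 * run, keeping the processor busy.  The ordered signal chain preserves
 * packet order across the stage. */
void stage_thread(int my_sig, int next_sig)
{
    for (;;) {
        struct packet *p = next_packet();
        uint32_t table_entry;

        mem_read_async(&table_entry, lookup_addr(p), sizeof table_entry);
        swap_out_until_read_done();        /* hide the memory latency       */

        wait_for_signal(my_sig);           /* my turn in the hand-off order */
        pass_to_next_stage(p, table_entry);
        send_signal(next_sig);             /* let the next thread proceed   */
    }
}
```

This also shows why the technique wants limited processing-time variation: a thread that runs long holds up every thread behind it in the signal chain.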
12. NPE Hosting Multiple Apps
- Parse and Header Format blocks include slice-specific code (dispatch sketched below)
  - Parse extracts header fields to form the lookup key
  - Header Format makes required changes to header fields
- Lookup uses the opaque key for a TCAM lookup
- Multiple static code options can be supported
  - multiple slices per code option
  - each slice has its own filters, queues and block of private memory
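A simplified sketch of the per-slice dispatch described above, assuming a hypothetical table of static code options: Parse builds an opaque key, a shared Lookup stage consults the TCAM, and Header Format rewrites the outgoing header. All names are illustrative, not NPE source.

```c
#include <stdint.h>
#include <string.h>

struct packet;

/* Opaque lookup key: the shared Lookup stage never interprets it, so each
 * code option can pack whatever header fields it needs. */
struct lookup_key { uint8_t bytes[16]; };

struct lookup_result { uint32_t out_queue; uint32_t next_hop; };

/* Slice-specific entry points supplied by each static code option. */
struct code_option {
    void (*parse)(const struct packet *p, struct lookup_key *key);
    void (*hdr_format)(struct packet *p, const struct lookup_result *r);
};

/* Shared TCAM lookup; entries for different slices are kept disjoint by
 * qualifying the key with the slice id (illustrative). */
int tcam_lookup(uint16_t slice_id, const struct lookup_key *key,
                struct lookup_result *result);

void enqueue(struct packet *p, uint32_t queue);
void drop(struct packet *p);

void npe_process(struct packet *p, uint16_t slice_id,
                 const struct code_option *opt)
{
    struct lookup_key key;
    struct lookup_result res;

    memset(&key, 0, sizeof key);
    opt->parse(p, &key);                       /* slice-specific parse     */

    if (tcam_lookup(slice_id, &key, &res) != 0) {
        drop(p);                               /* or hand to the slow path */
        return;
    }
    opt->hdr_format(p, &res);                  /* slice-specific rewrite   */
    enqueue(p, res.out_queue);                 /* per-slice private queue  */
}
```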
13Sharing the NPE
each application has private queues
each application has private lookup entries
forms key for lookup
formats outgoing packet headers
14. System Control
[Diagram: the CP runs the Global Node Manager and Global Resource Manager, with a control interface to PLC and to users over the Internet; each GPE runs a Local Node Manager, a Local Resource Manager and slice VMs; the NPE's Fast-path Manager configures fast-paths and their filters; the LC's Line Card Manager configures the data interfaces; all SPP components communicate over the control switch]
- Example control operations: instantiate a new application, open a socket, instantiate a fast-path (one possible request flow is sketched below)
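One plausible reading of the "instantiate fast-path" request flow through the managers named in the diagram; every function here is hypothetical and stands in for a control message over the 1 GbE control switch, not the actual SPP control interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical control-plane calls, one per component on the slide. */
bool grm_reserve_npe_resources(uint16_t slice_id); /* Global Resource Manager (CP) */
bool fpm_create_fastpath(uint16_t slice_id);       /* Fast-path Manager (NPE)      */
bool lcm_install_filters(uint16_t slice_id);       /* Line Card Manager (LC)       */

/* Local Resource Manager on a GPE handling a slice's request to add a
 * fast path to its existing slow-path VM. */
bool lrm_instantiate_fastpath(uint16_t slice_id)
{
    if (!grm_reserve_npe_resources(slice_id))  /* CP checks global capacity   */
        return false;
    if (!fpm_create_fastpath(slice_id))        /* NPE sets up queues, filters */
        return false;
    return lcm_install_filters(slice_id);      /* LC steers traffic to NPE    */
}
```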
15. Evaluation
- Slice 1: IPv4
  - packets arrive/depart in UDP tunnels
- Slice 2: Internet Indirection Infrastructure (i3)
  - packet identifiers are matched against triggers, which map them to IP addresses
  - no match at the local node results in Chord forwarding (sketched below)
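To make the i3 slice concrete: a receiver inserts a trigger (identifier, address); a packet carries an identifier, and a node holding a matching trigger sends the packet to the trigger's address, while any other node forwards it along the Chord ring toward the identifier's home node. A compact sketch under those assumptions (not the actual i3 fast-path code):

```c
#include <stdint.h>

#define ID_BYTES 32   /* i3 identifiers are 256 bits */

struct i3_id { uint8_t b[ID_BYTES]; };

struct trigger {               /* receiver-inserted (identifier -> address) */
    struct i3_id id;
    uint32_t     ip_addr;
    uint16_t     udp_port;
};

struct packet;

const struct trigger *trigger_match(const struct i3_id *id);  /* local trigger store  */
void send_to_endpoint(struct packet *p, uint32_t ip, uint16_t port);
void chord_forward(struct packet *p, const struct i3_id *id); /* next hop on the ring */

void i3_handle(struct packet *p, const struct i3_id *dst_id)
{
    const struct trigger *t = trigger_match(dst_id);

    if (t != NULL)
        send_to_endpoint(p, t->ip_addr, t->udp_port);  /* trigger matched here   */
    else
        chord_forward(p, dst_id);                      /* not ours: keep routing */
}
```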
16. IPv4 Throughput Comparison
- 10x improvement for 1400-byte payloads
- 80x improvement for 0-byte payloads
- NPE almost keeps up with the full line rate for 0-byte payloads
17. So, what this means is...
- price-performance advantage of >15x
- also a big power and space advantage
18. IPv4 Latency Comparison
- 8 IPv4 instances
- Measured ping delay against background traffic
19. IPv4/i3 Fast-Path Throughput Comparison
[Chart: fast-path throughput for IPv4 and i3 with 0 B and 40 B payloads]
- constant input rate of 5 Gb/s
20. Scaling Up
- 14-slot chassis
  - 3 Line Cards
  - 2 switch blades
  - 9 processing blades (NP or server)
- Multi-chassis systems
  - direct connection using expansion ports: up to 7 chassis
  - indirect connection using separate 10 GbE switches: up to 24 chassis
21. Other ATCA Components
22. Open Network Lab
- Internet-accessible networking lab (onl.wustl.edu)
  - built around a set of extensible gigabit routers
  - intuitive Remote Lab Interface makes it easy to get started
  - extensive facilities for performance monitoring
- Expansion underway
  - 14 new Network Processor (NP) based routers
    - packet processing implemented in software for greater flexibility
    - high-performance plugin subsystem for user-added features
  - supports larger experiments and more concurrent users
  - 70 new rack-mount computers to serve as end systems
  - 4 stackable 48-port GbE switches for configuring experiments
23. Sample ONL Session
[Screenshot of a Remote Lab Interface session showing: network configuration, routing table, bandwidth usage, queue lengths, queue parameters, packet losses, router plugin commands, and an ssh window to a host showing ping delays]
24. ONL NP Router
25. Expanded ONL Configuration
26. Equipment Photos
27. Summary
- Next step: add NetFPGA to SPP and ONL
- Interesting time for networking research
  - highly capable subsystem components readily available
  - many vendors, variety of products
  - greater opportunity for network service innovation
- Growing role of multi-core processors
  - to use them effectively, must design for parallelism
  - requires deeper understanding of performance
- Conventional servers have dreadful performance on IO-intensive applications
  - partly hardware, but mostly software
  - to fix this, need to push the fast-path down into drivers and program for multi-core parallelism