Title: Cluster Based Scalable Routing CBSR
1Cluster Based Scalable Routing(CBSR)
- Yi Yang
- Jiuliu Lu
- Jing Li
- Dan Fowlkes
2Presentation Outline
- Introduction
- Objectives
- Motivation
- Related Work: SPINE
- Myrinet and GM
3Presentation Outline (cont.)
- Cluster-Based Scalable Router (CBSR)
- System Architecture
- System Set-up
- Implementation and test
- Summary and Future Work
4Introduction
- Network transmission speeds continue to improve.
- Demand for network usage is also increasing as more and more people (and their toasters and, pretty soon, we imagine, their pets) get online.
5Introduction (cont.)
- Network performance depends on more than transmission speed alone; to maximize overall performance, other components of the network must be fine-tuned as well.
- The bottleneck used to be routers.
- Specially designed high-speed routers are now available, but they are EXPENSIVE.
6Introduction (cont.)
- With PC prices currently falling, there may be a cost-effective alternative to shelling out the big bucks for one of these routers: perform high-speed routing using clusters of workstations!
7Objectives
- Demonstrate the feasibility of using a cluster of workstations to do routing.
- To keep things simple, we concentrate on creating an implementation that works rather than one that could compete with a commercial, specially designed high-speed router.
8Objectives (cont.)
- Ensure that the system is scalable (non-scalable routers are of rather limited use).
- Gauge the performance of the network.
9Motivation
- Yi Yang: Get a good grade.
- Jiuliu Lu: Get a good grade.
- Jing Li: Curiosity (auditing the class).
- Dan: Ooooh... gummy bears..... err, I mean: Get a good grade.
10Motivation
- To keep from having to shell out the big money
for a specially designed high-speed router by
using clusters of workstations to achieve the
same functionality.
11Related Work SPINE
- Guiding principle: move application-specific functionality directly onto the network interface.
- This should improve overall system performance by reducing the I/O-related data and control transfers to the host system.
12Related Work SPINE (cont.)
- They migrate an application's I/O-specific functionality into device extensions.
- Extension: "code that is logically part of the application, but runs directly on the network interface."
13Related Work SPINE (cont.)
- Defines interfaces that enable OSs to compute directly on an intelligent network interface.
- Aim: efficiently implement three methods, crucial to efficient I/O, in order to offer developers an architecture geared towards I/O-intensive applications.
14RW SPINE, Method 1
- Device-to-device transfers.
- Avoid extra copies of data, so bandwidth needs in and out of host memory and over a shared bus are significantly reduced.
- Intelligent devices can process data prior to transferring it to a peer device in order to avoid unnecessary control transfers to the host system.
15RW SPINE, Method 2
- Host/Device protocol partitioning.
- System performance can be improved through quality of service, packet filtering, and low-level protocol support for application-specific multicast.
16RW SPINE, Method 3
- Device-level memory management.
- allow direct transfers between network interface
and application buffers
17Myrinet
- "Myrinet is a cost-effective, high-performance,
packet-communication and switching technology
that is widely used to interconnect clusters of
workstations, PCs, or single-board computers."
- Two-fold benefit
18Myrinet, Benefit 1
- high performance
- Distribute demanding computational tasks across an array of cost-effective hosts.
- Given a good-sized array, benefits are competitive with high-speed routers.
- Provide both high data-rate and low-latency communication between host processes in order to support tightly coupled distributed computations.
19Myrinet, Benefit 2
- high availability
- achieved by allowing each computation to proceed
with a subset of the hosts.
20Myrinet (cont.)
- A router can be constructed out of a cluster of workstations using a conventional network such as Ethernet, but this "router" would provide neither the performance nor the features necessary for high-performance / high-availability clustering.
21Myrinet (cont.)
- Packets used by Myrinet are not of fixed length.
- They may be used to encapsulate other types of packets (including IP packets) without need for an adaptation layer.
22Myrinet (cont.)
- Can carry packets of many types and protocols concurrently because each of these packets is identified by type.
- In this way, Myrinet has support for several software interfaces.
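The idea of carrying variable-length, type-tagged packets can be sketched as follows. This is a minimal illustration only: the type codes, the 6-byte header layout, and the function names are assumptions made up for this example and are not the real Myrinet wire format.

```python
import struct

# Hypothetical type codes and header layout, for illustration only;
# these are NOT the real Myrinet packet-type values or wire format.
TYPE_IP = 0x0001
TYPE_GM = 0x0002
HEADER = struct.Struct("!HI")  # 2-byte type, 4-byte payload length


def encapsulate(ptype, payload):
    """Prefix a payload of any length with its type and length."""
    return HEADER.pack(ptype, len(payload)) + payload


def decapsulate(frame):
    """Return (type, payload) after checking the declared length."""
    ptype, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if len(payload) != length:
        raise ValueError("payload length does not match header")
    return ptype, payload


def dispatch(frame, handlers):
    """Hand the payload to whichever handler is registered for its type."""
    ptype, payload = decapsulate(frame)
    return handlers[ptype](payload)
```

Because the payload is copied in unchanged, an IP datagram needs no adaptation step; the type field alone tells the receiver which software interface should handle it.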
23GM
- message-based communication system for Myrinet
- designed to keep the CPU overhead and latency
low, the bandwidth high, and to be portable
24GM Advantages over other Messaging Sys.
- Extremely low overhead (approximately 1 µs per packet) on all architectures.
- On systems supporting memory protection, provides memory-protected, user-level, OS-bypass network interface access to several user-level applications simultaneously.
25GM Advantages over other Messaging Sys. (cont.)
- Provides hosts with reliable, ordered delivery despite possible faults in the network.
- Able to detect and retransmit both lost and corrupted packets.
26GM Advantages over other Messaging Sys. (cont.)
- Reroutes packets around any network faults when an alternate route exists.
- Catastrophic network errors are nonfatal: undeliverable packets are returned to the client with an error indication.
27GM Advantages over other Messaging Sys. (cont.)
- able to support clusters of over 10,000 nodes
- Allows efficient, deadlock-free, bounded-memory forwarding through two levels of message priority.
- Automatically maps Myrinet networks.
28System Architecture
29System Architecture (cont.)
- Scalability
- A CBSR may have a variable number of workstations (routing machines).
- A workstation may have a variable number of NICs (network interfaces).
30System Architecture (cont.)
31System Architecture (cont.)
- IP reads the IP headers of datagrams.
- The SAN doesn't extract information from IP datagrams; IP tells it to which interface a datagram is sent.
- The IP datagram is transmitted over the SAN intact.
- The Myrinet SAN forwards messages with DMA.
32System Architecture (cont.)
33System Architecture (cont.)
- IP treats datagrams from NICs and the SAN equally. This approach needs to look up the routing table twice.
- The SAN informs IP which interface a datagram should go to. The routing table is involved only once.
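The difference between the two approaches above can be sketched with a toy routing table and a lookup counter. Everything named here (the prefixes, `node-a`/`node-b`, the interface names, and the function names) is a hypothetical stand-in for illustration, not the CBSR code.

```python
import ipaddress

# Hypothetical routing table: prefix -> (egress node, interface).
ROUTES = {
    "10.1.0.0/16": ("node-a", "eth0"),
    "10.2.0.0/16": ("node-b", "eth0"),
    "10.2.5.0/24": ("node-b", "eth1"),
}
TABLE = [(ipaddress.ip_network(p), hop) for p, hop in ROUTES.items()]

stats = {"lookups": 0}  # routing-table lookups across the whole cluster


def lookup(dst):
    """Longest-prefix match; returns (egress node, interface)."""
    stats["lookups"] += 1
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in TABLE if addr in net]
    if not matches:
        raise LookupError(f"no route to {dst}")
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def forward_two_lookups(dst):
    """IP treats SAN traffic like NIC traffic, so both nodes look up."""
    egress_node, _ = lookup(dst)   # ingress node: which node to reach over the SAN
    _, interface = lookup(dst)     # egress node: repeats the lookup for the interface
    return f"{egress_node}:{interface}"


def forward_one_lookup(dst):
    """Ingress looks up once; the interface choice travels with the SAN message."""
    egress_node, interface = lookup(dst)
    san_message = {"interface": interface, "dst": dst}  # datagram itself is untouched
    return f"{egress_node}:{san_message['interface']}"
```

Both functions pick the same egress node and interface; the second simply carries the result of the first lookup across the SAN instead of recomputing it.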
34System Architecture (cont.)
35System Architecture (cont.)
36System Set-up
37System Set-up (cont.)
- FreeBSD 4.0 current
- GM API 1.1.1
- Assume the availability of IP and the IP routing table.
- Implement forwarding over the Myrinet SAN.
38More about GM
- How does GM transmit packets?
39More about GM(cont.)
40More about GM(cont.)
41Implementation of CBSR
(Diagram labels: Myrinet, Keys, Receiving Event Processing, Token Based, Host, Interface)
42Flow of Code
- GM setup: initialize, open port
- Prepare buffers
- sending buffer, receiving buffer
- sending content (for test)
- Receiving event processing
- Sending and sending event processing
- GM shutdown
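The flow above can be sketched as a small simulation. The `FakePort` class below is a made-up stand-in and is NOT the GM API; it only models the pieces the slides mention (a port, buffers, receive and send events, and token-based sending), assuming that a completed send returns its token.

```python
from collections import deque


class FakePort:
    """Stand-in for a messaging port; NOT the GM API."""

    def __init__(self, send_tokens):
        self.send_tokens = send_tokens  # a send is allowed only while tokens remain
        self.events = deque()           # pending events for the host to process
        self.wire = []                  # buffers that have left the port
        self.open = True

    def send(self, buffer):
        if self.send_tokens == 0:
            return False                # caller must wait for a send-completion event
        self.send_tokens -= 1
        self.wire.append(buffer)
        self.events.append(("sent", buffer))
        return True

    def deliver(self, buffer):
        self.events.append(("received", buffer))

    def close(self):
        self.open = False


def run(incoming, send_tokens=2):
    if send_tokens < 1:
        raise ValueError("need at least one send token")
    port = FakePort(send_tokens)                  # setup: initialize, open port
    for payload in incoming:                      # test content arriving at the port
        port.deliver(payload)
    backlog = deque()                             # received buffers waiting to be sent
    while port.events or backlog:
        if port.events:
            kind, buf = port.events.popleft()
            if kind == "received":                # receiving event processing
                backlog.append(buf)
            else:                                 # sending event processing
                port.send_tokens += 1             # completion returns the token
        while backlog and port.send(backlog[0]):  # send while tokens remain
            backlog.popleft()
    port.close()                                  # shutdown
    return port
```

For example, `run([b"a", b"b", b"c"])` forwards all three buffers in order even though only two sends may be outstanding at a time.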
43Result of Test
44Result of Test (cont.)
45Result of Test (cont.)
- Small Size
- 20,000 packets/s
- 2.5 MB/s (20 Mb/s)
- Middle Size
- 15,000 packets/s
- 15 MB/s (120 Mb/s)
- Big Size
- 8,000 packets/s
- 32 MB/s (256 Mb/s)
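As a cross-check of the figures above, the short calculation below derives the packet size implied by each row and reproduces the Mb/s values. It assumes decimal megabytes (1 MB = 1,000,000 bytes); the packet sizes are derived here and are not stated on the slide.

```python
# Figures from the slide: (label, packets per second, MB per second)
RESULTS = [
    ("small", 20_000, 2.5),
    ("middle", 15_000, 15.0),
    ("big", 8_000, 32.0),
]

BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def bytes_per_packet(packets_per_s, mb_per_s):
    """Average packet size implied by a throughput and a packet rate."""
    return mb_per_s * BYTES_PER_MB / packets_per_s


def megabits_per_s(mb_per_s):
    """Convert megabytes per second to megabits per second."""
    return mb_per_s * 8


for label, pps, mbs in RESULTS:
    print(f"{label:>6}: {bytes_per_packet(pps, mbs):.0f} bytes/packet, "
          f"{megabits_per_s(mbs):.0f} Mb/s")
```

Under that assumption the three rows correspond to packets of about 125, 1000, and 4000 bytes, and the Mb/s values (20, 120, 256) match the slide.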
46Result of Test (cont.)
- Comparison to SPINE
- 11,800 packets/s
- 0 load on CPU
- including routing
47Next step
- Integration of routing and forwarding
- Multiple ports
- Threads?
- Test of scalability
48Issue challenge
- Low efficiency: sending side
(Diagram labels: CPU, Myrinet, Internet, NIC, LANai, Memory, Memory)
49Issue challenge (cont.)
- Low efficiency: receiving side
(Diagram labels: CPU, Internet, Myrinet, NIC, LANai, Memory, Memory)
50Issue challenge (cont.)
(Diagram labels: CPU, CPU, Internet, Myrinet, Internet)
51Summary
- General idea of scalable routing and related work
- Myrinet and GM
- CBSR
- Architecture
- Implementation
- Test Result
- Future work
52Thanks
- Prof. Vahdat
- Andrew Gallatin
- Prachi, Marty, Kisley
- Marc Fiuczynski (U. of Washington)