Title: Clusters Networks II: Protection and Performance
1 Clusters Networks II: Protection and Performance
- Andrew Chien
- High Performance Distributed Computing (CSE225)
- January 14, 1999
2 Announcements/Review
- Class on Tuesday, January 19th is cancelled.
- Catch up on reading :-)
- Next class, Thursday, January 21
- Last Time
- Efficient aggregation in Clusters
- High Performance Communication
- Coordinated Scheduling (coarse and fine grained)
- Uniform Resource Access
- High Performance Communication
3 Today's Outline
- Multi-process Protection in High Performance Networks
- U-Net: User-level Network Interface
- Virtual Interface Architecture
- Delivering Performance to Applications
- FM 2.0 API and Layer Composition
4 U-Net: Multiprocess Protection
[Figure: traditional networking path - Application -> Kernel -> NIC -> Network]
- Traditional networking view: OS-mediated communication
- Problem: system call overhead limits performance
- 10-50 µs fast trap in current-day systems
- 100s of µs in some cases
- 50 µs => what kind of bandwidth limits in a Gbps network?
- 100 bytes => 20 Mbps; 250 bytes => ??; 500 bytes => 100 Mbps; 1 KB => 200 Mbps; 2 KB => ??; full bandwidth => 5 KB or 6 KB messages (worked model below)
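A minimal sketch of the arithmetic above, under my reading that the slide treats the 50 µs trap cost as a per-message overhead: with overhead o, at most one message can be issued every o seconds, so delivered bandwidth is bounded by size/o until that bound reaches the link rate (around 5-6 KB on a 1 Gbps link). The function names and the exact model are illustrative assumptions, not from the slide.

#include <stdio.h>

/* Overhead-limited bandwidth: one message per 'o_sec' seconds, capped by the link rate. */
static double max_mbps(double n_bytes, double o_sec, double link_mbps)
{
    double rate = (n_bytes * 8.0) / o_sec / 1e6;   /* Mbit/s if overhead were the only limit */
    return rate < link_mbps ? rate : link_mbps;
}

int main(void)
{
    const double sizes[] = { 100, 250, 500, 1024, 2048, 5120, 6144 };
    for (int i = 0; i < 7; i++)
        printf("%5.0f bytes -> %6.0f Mbps\n",
               sizes[i], max_mbps(sizes[i], 50e-6, 1000.0));
    return 0;   /* prints roughly 16, 40, 80, 164, 328, 819, 983 Mbps */
}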
5 U-Net: OS Bypass and Virtualization
[Figure: several applications, each with its own virtualized NIC mapped directly onto the network; the kernel's NIC sits off the data path]
- Each process gets a virtual network interface (memory-mapped, for protection)
- Runs protocols, buffer management, etc. in user space
- What's hard about this?
6 Providing Network Protection
- How to avoid interference, preserve network data integrity, and avoid spoofing?
- Traditional model depends on the kernel to authenticate and route each packet
- Idea: division of effort
- Kernel sets up routes and connections between virtual interfaces
- Packet tagging (done by the interfaces) and mux/demultiplexing ensure that users can only (sketch below)
- send packets to authorized connections
- receive packets from the places authorized to do so
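A minimal sketch of this division of effort; the structures and field names are illustrative assumptions of mine, not U-Net's actual data layout. The kernel assigns a channel tag at connection setup; the interface stamps the tag on outgoing packets and demultiplexes incoming ones by it, so a user process can neither forge a destination nor see another process's traffic.

#include <stdint.h>
#include <stddef.h>

struct packet { uint16_t tag; uint16_t len; uint8_t payload[1500]; };

struct endpoint {
    uint16_t tx_tag;   /* tag the kernel authorized this endpoint to send with */
    uint16_t rx_tag;   /* tag identifying traffic this endpoint may receive    */
    /* ... receive buffer pool, notification state, etc. ... */
};

/* Send path: the interface, not the user, supplies the tag, so packets can
 * only go to connections the kernel set up. */
static void nic_send(const struct endpoint *ep, struct packet *p)
{
    p->tag = ep->tx_tag;
    /* ... DMA the packet onto the wire ... */
}

/* Receive path: demultiplex by tag; a packet with no authorized receiver is
 * dropped rather than delivered. */
static struct endpoint *nic_demux(struct endpoint *eps, size_t n,
                                  const struct packet *p)
{
    for (size_t i = 0; i < n; i++)
        if (eps[i].rx_tag == p->tag)
            return &eps[i];
    return NULL;
}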
7 U-Net Endpoints (virtual interfaces)
- Each endpoint is a virtualized NI buffer pool (sketch below)
- Connected by the kernel agent to another endpoint (bidirectional connections)
- Communication segments are pinned DMA regions; buffer pool management is done by the network interface
- Notification done by polling or event-driven (upcall)
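An illustrative sketch of such an endpoint, with field names of my own choosing rather than U-Net's: queues of descriptors live in a pinned communication segment the NIC can DMA into directly, and the application either polls a head pointer or asks for an upcall.

#include <stdint.h>

#define QLEN 64

struct descriptor {
    uint32_t offset;   /* buffer location inside the pinned segment            */
    uint32_t length;   /* bytes received (recv queue) or to send (send queue)  */
};

struct unet_endpoint {
    void *segment;                    /* pinned, DMA-able communication segment */
    struct descriptor sendq[QLEN];    /* descriptors the host posts for sending */
    struct descriptor recvq[QLEN];    /* descriptors the NIC fills on arrival   */
    struct descriptor freeq[QLEN];    /* free buffers the NIC may consume       */
    volatile uint32_t recv_head;      /* advanced by the NIC as messages arrive */
    uint32_t recv_tail;               /* advanced by the application            */
    int use_upcall;                   /* 0 = poll recv_head, 1 = event-driven   */
};

/* Polling-style notification: the application watches recv_head; no kernel
 * involvement on the data path. */
static int endpoint_poll(struct unet_endpoint *ep, struct descriptor *out)
{
    if (ep->recv_tail == ep->recv_head)
        return 0;                            /* nothing new */
    *out = ep->recvq[ep->recv_tail % QLEN];
    ep->recv_tail++;
    return 1;
}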
8 Network Communication
[Figure: two applications communicating directly through their endpoints; each host's kernel is shown beside, not on, the path]
- Applications communicate through endpoints
- Kernel operations are NOT on the data movement path
9 U-Net Performance
- Raw U-Net
- Benefits of OS bypass
- 65 µs round-trip latency, 120 µs for 32 bytes, then amortizing to link speed
- => reduced overhead to ~30 µs
10 U-Net vs. Fore Firmware
- Lesson: commercial products are not always well-designed / mature
- Snapshot of the state of the art, and often very constrained by other circumstances
11 U-Net with Active Messages and IP Protocols
- Overheads for IP are significantly higher (2-2.5x)
- Reduces the deliverable bandwidth fraction for short messages
- Peak bandwidth achievable (for this network)
12 U-Net Summary
- Demonstrated the partition into kernel-managed connection setup and user-level communication
- User-level buffer management and protocols
- Demonstrated reasonable performance
13 Virtual Interface Architecture
- VIA project: Intel/Compaq/Microsoft
- Catalyze an industry standard for high-performance cluster communication
- Capitalize on the technical advances in academic research to reduce communication overhead and deliver the performance
- Started Dec 1996, standard in Dec 1997
- Technical work: paper design, emulator, lots of political wrangling amongst the companies (billions at stake)
- Designed to provide user-level communication on multiple operating systems -- WinNT, Novell Netware, Unix
- Designed to provide this user-level interface independent of the underlying interconnection fabric (ATM, GigE, Myrinet, Giganet, etc.)
14 VIA Basic Ideas
- Endpoints a la U-Net
- Hardware-supported NIC virtualization
- Send and receive buffer pools (memory registered with the interface)
- Doorbells for notification between host and NIC (sketch below)
- Polling and interrupt-based notification (user selectable)
- Network reliability attributes (failure semantics)
- Read and write RDMA operations (and ordering)
- Memory protection attributes
- Group notification (shared completion queues)
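A hedged sketch of the doorbell/descriptor idea; the structures below are illustrative and are not the VIPL API. The application links a descriptor pointing into registered memory onto the interface's send queue, then rings a memory-mapped doorbell so the NIC picks up the work without a system call; the NIC later marks the descriptor (or a shared completion queue entry) done.

#include <stdint.h>
#include <stddef.h>

struct via_descriptor {
    uint64_t addr;                 /* address within registered (pinned) memory */
    uint32_t length;               /* bytes to transfer                         */
    uint32_t mem_handle;           /* handle from memory registration           */
    volatile uint32_t status;      /* written by the NIC on completion          */
    struct via_descriptor *next;   /* next descriptor on the work queue         */
};

struct virtual_interface {
    struct via_descriptor *send_head, *send_tail;  /* posted send descriptors    */
    volatile uint32_t *send_doorbell;              /* memory-mapped NIC register */
};

/* Post a send: queue the descriptor, then ring the doorbell to tell the NIC
 * new work is waiting -- entirely from user space. */
static void vi_post_send(struct virtual_interface *vi, struct via_descriptor *d)
{
    d->status = 0;
    d->next = NULL;
    if (vi->send_tail)
        vi->send_tail->next = d;
    else
        vi->send_head = d;
    vi->send_tail = d;
    *vi->send_doorbell = 1;   /* doorbell write; value/encoding is NIC-specific */
}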
16 Delivering Gigabit Performance: FM 2.0
17 FM 1.x Evaluation
- MPI on FM (Fall 1995)
- BSD Sockets on FM (December 1995)
18 MPI-FM Efficiency
- Problems: excessive copies (pacing API), hard to program (interleaving)
19 MPI-FM Performance (initial)
- Problem: the FM 1.x (and AM) API is a poor design for composition
- How can we design an API that makes it easy to deliver performance?
- Key issues:
- Eliminate copying for header attach/remove (gather-scatter)
- Eliminate copying from network overrun (receiver flow control)
- Ease programming effort for interleaved PTUs (handler multithreading)
- All needed to deliver performance to the application layers
- Partial changes enabled:
- MPI on FM 1.1 (19 µs, 17.5 MB/s) -- JPDC '97
- Sockets on FM 1.1 (35 µs, 17.5 MB/s)
20 Illinois Fast Messages 2.x
- Gather-scatter interface enables efficient layering and data movement without copies (packetization invisible)
- Multithreading provides a sequential view of message reception (packetization and interleaving invisible)
- Bonuses: multiprocess support, dynamic network namespaces
21 Receiver Flow Control
- Receiver determines data pacing from the network subsystem (see the extract-loop sketch after the handler example)
- Lower levels provide communication/computation overlap
- Provides a simple composition model (examples)
- Leverages reliable delivery and flow control at the lower level
22 FM 2.x API
- Sending
- FM_begin_message(NodeID,Handler,size)
- FM_send_piece(stream,buffer,size) // gather
- FM_end_message()
- Receiving
- FM_receive(buffer,size) // scatter
- FM_extract(total_bytes) // rcvr flow control
- Implementation
- C parser rewrites code
- Logical thread for each message receive
- OS thread safe
23 Send Example (List Send)

extern FM_handler myhandler;

void sendlist(unsigned int dest, Node *nodep, unsigned int elts)
{
  FM_stream *mystream;
  unsigned int databytes = elts * sizeof(int);

  /* retry until the messaging layer hands back a stream */
  while (!(mystream = FM_begin_message(dest, databytes, myhandler)))
    ;
  while (nodep) {
    FM_send_piece(mystream, &nodep->data, sizeof(int));  /* gather: one piece per list element */
    nodep = nodep->next;
  }
  FM_end_message(mystream);
}
24 Handler Example (MPI)

#pragma FM_declare_handler
int myhandler(FM_stream *str, unsigned int sender)
{
  struct header myheader;
  int msglen;

  FM_receive(&myheader, str, sizeof(struct header));  /* scatter: pull the header off the stream */
  msglen = myheader.length;
  if (myheader.littlemsg)
    FM_receive(littlebuf, str, msglen);
  else
    FM_receive(findbigbuffer(msglen), str, msglen);
  return FM_CONTINUE;
}
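To round out the send and handler examples, here is a minimal sketch (mine, not from the slides) of the receive side: the application calls FM_extract from its own loop, so the receiver, not the network, paces how much pending data is pulled in and which handlers run. do_some_work and the header name are hypothetical placeholders.

#include "fm.h"              /* FM 2.x API declarations (header name assumed) */

extern void do_some_work(void);   /* hypothetical application computation */

void compute_loop(void)
{
  for (;;) {
    do_some_work();      /* application computation between polls */
    FM_extract(4096);    /* receiver flow control: process at most ~4 KB of
                            pending messages; their handlers (e.g. myhandler)
                            run during this call */
  }
}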
25 Platform Upgrade: PCs
[Figure: Pentium Pro/II node with the NIC on the PCI bus and the P6 host bus; 1280 Mbps link; Sparc -> x86 transition; roughly 2x faster cards and links]
- PCs exploit cost advantages and eliminate the PIO problem (graphics driven)
- Faster network cards and links (2x)
- OS: Windows NT and Linux (goal: widespread use)
26 FM 2.x Performance
- Latency 11 µs, BW 77 MB/s, N1/2 < 200 bytes
- Fast in absolute terms (compares to MPPs, internal memory BW)
- Delivers a large fraction of hardware performance for short messages
- The performance bottleneck has moved inside the system!
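For context, N1/2 is the message size at which delivered bandwidth reaches half the asymptotic peak. Under a simple first-order cost model (my assumption, not a measurement from these slides) with per-message overhead o and peak bandwidth B:

$$T(n) = o + \frac{n}{B}, \qquad \mathrm{BW}(n) = \frac{n}{T(n)}, \qquad \mathrm{BW}(N_{1/2}) = \frac{B}{2} \;\Rightarrow\; N_{1/2} = o \cdot B$$

Plugging in the FM 2.x numbers above (o of roughly 4.1 µs, B of roughly 77 MB/s) gives N1/2 on the order of 300 bytes, the same order as the measured N1/2 < 200 bytes.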
27 Performance Implications
- Typical packet distributions: 80-90% of packets are < 200 bytes
- => FM 2.x delivers 40 MB/s (320 Mbps) @ 256 bytes
- A fast UDP delivers 2 MB/s @ 256 bytes
- 20x superior bandwidth/overhead
- => Of course, these are not directly comparable.
28 FM 2.x Evaluation (MPI)
- MPI-FM: 70 MB/s, 17 µs latency, 5.1 µs overhead
- Peak BW comparable to the IBM SP2; short messages much better
- Latency comparable to the SGI O2K
- FM: 77 MB/s, 11 µs latency, 4.1 µs overhead
29 FM 2.x Evaluation (MPI), cont.
- High transfer efficiency, approaches 100%
- Other systems are much lower even at 1 KB (100 Mbit: ~40%, 1 Gbit: ~5%)
30 FM 2.0 Summary
- APIs and guarantees matter for delivering performance
- Layer composition is a critical issue in software communication architectures
- What are the equivalent concepts for other types of Grid performance (usable computation, memory, etc.)?
- What are the right metrics to drive this? N1/2 for parallelism?
31 Overall Summary
- User-level network interfaces
- Separation of connection setup
- User protocol processing and buffer management
- Embodied in U-Net and VIA (and FM)
- VIA: fault tolerance and RDMA operations
- Delivering communication performance
- Depends on APIs and guarantees
- Usable performance is the critical question
- Generalizations to Grid resource abstractions?
32 Next Time (January 21st)
- Reading Assignments
- Grid Book, Chapters 11 (Globus Toolkit) and 9.4-9.6 (Legion)
- Globus High-Level Vision
- The Globus Project: A Status Report. I. Foster, C. Kesselman, Proc. IPPS/SPDP '98 Heterogeneous Computing Workshop, pp. 4-18, 1998.
- Globus: A Metacomputing Infrastructure Toolkit. I. Foster, C. Kesselman, Intl. J. Supercomputer Applications, 11(2):115-128, 1997.
- Globus papers are available from http://www.globus.org/documentation/papers.html
33 Further reading (will be assigned next)
- A Directory Service for Configuring High-Performance Distributed Computations. S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, S. Tuecke. Proc. 6th IEEE Symp. on High-Performance Distributed Computing, pp. 365-375, 1997.
- Usage of LDAP in Globus. I. Foster, G. von Laszewski.
- A Fault Detection Service for Wide Area Distributed Computations. P. Stelling, I. Foster, C. Kesselman, C. Lee, G. von Laszewski. Proc. 7th IEEE Symp. on High Performance Distributed Computing, 1998.