Title: Clusters Networks II: Protection and Performance
1 Clusters Networks II: Protection and Performance
- Andrew Chien
- High Performance Distributed Computing (CSE225)
- January 14, 1999
2 Announcements/Review
- Class on Tuesday, January 19th is cancelled.
- Catch up on reading :-)
- Next class, Thursday, January 21
- Last Time
- Efficient aggregation in Clusters
- High Performance Communication
- Coordinated Scheduling (coarse and fine grained)
- Uniform Resource Access
- High Performance Communication
3 Today's Outline
- Multi-process Protection in High Performance Networks
- U-Net: User-level Network Interface
- Virtual Interface Architecture
- Delivering Performance to Applications
- FM 2.0 API and Layer Composition
4 U-Net: Multiprocess Protection
[Figure: traditional networking path - Application -> Kernel -> NIC -> Network]
- Traditional networking view: OS-mediated communication
- Problem: system call overhead limits performance
- 10-50 µs fast trap in current-day systems
- 100s of µs in some cases
- 50 µs => what kind of bandwidth limits in a Gbps network?
- 100 bytes => 20 Mbps; 250 bytes => ??; 500 bytes => 100 Mbps; 1 KB => 200 Mbps; 2 KB => ??; full bandwidth => 5 KB or 6 KB messages (worked model below)
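A minimal sketch of the arithmetic above, under my reading that the slide treats the 50 µs trap cost as a per-message overhead: with overhead o, at most one message can be issued every o seconds, so delivered bandwidth is bounded by size/o until that bound reaches the link rate (around 5-6 KB on a 1 Gbps link). The function names and the exact model are illustrative assumptions, not from the slide.

#include <stdio.h>

/* Overhead-limited bandwidth: one message per 'o_sec' seconds, capped by the link rate. */
static double max_mbps(double n_bytes, double o_sec, double link_mbps)
{
    double rate = (n_bytes * 8.0) / o_sec / 1e6;   /* Mbit/s if overhead were the only limit */
    return rate < link_mbps ? rate : link_mbps;
}

int main(void)
{
    const double sizes[] = { 100, 250, 500, 1024, 2048, 5120, 6144 };
    for (int i = 0; i < 7; i++)
        printf("%5.0f bytes -> %6.0f Mbps\n",
               sizes[i], max_mbps(sizes[i], 50e-6, 1000.0));
    return 0;   /* prints roughly 16, 40, 80, 164, 328, 819, 983 Mbps */
}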
5 U-Net: OS Bypass and Virtualization
[Figure: several applications, each with its own virtualized NIC mapped directly onto the network; the kernel's NIC sits off the data path]
- Each process gets a virtual network interface (memory-mapped, for protection)
- Runs protocols, buffer management, etc. in user space
- What's hard about this?
6 Providing Network Protection
- How to avoid interference, preserve network data integrity, and avoid spoofing?
- Traditional model depends on the kernel to authenticate and route each packet
- Idea: division of effort
- Kernel sets up routes and connections between virtual interfaces
- Packet tagging (done by the interfaces) and mux/demultiplexing ensure that users can only (sketch below)
- send packets to authorized connections
- receive packets from the places authorized to do so
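A minimal sketch of this division of effort; the structures and field names are illustrative assumptions of mine, not U-Net's actual data layout. The kernel assigns a channel tag at connection setup; the interface stamps the tag on outgoing packets and demultiplexes incoming ones by it, so a user process can neither forge a destination nor see another process's traffic.

#include <stdint.h>
#include <stddef.h>

struct packet { uint16_t tag; uint16_t len; uint8_t payload[1500]; };

struct endpoint {
    uint16_t tx_tag;   /* tag the kernel authorized this endpoint to send with */
    uint16_t rx_tag;   /* tag identifying traffic this endpoint may receive    */
    /* ... receive buffer pool, notification state, etc. ... */
};

/* Send path: the interface, not the user, supplies the tag, so packets can
 * only go to connections the kernel set up. */
static void nic_send(const struct endpoint *ep, struct packet *p)
{
    p->tag = ep->tx_tag;
    /* ... DMA the packet onto the wire ... */
}

/* Receive path: demultiplex by tag; a packet with no authorized receiver is
 * dropped rather than delivered. */
static struct endpoint *nic_demux(struct endpoint *eps, size_t n,
                                  const struct packet *p)
{
    for (size_t i = 0; i < n; i++)
        if (eps[i].rx_tag == p->tag)
            return &eps[i];
    return NULL;
}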
7 U-Net Endpoints (virtual interfaces)
- Each endpoint is a virtualized NI buffer pool (sketch below)
- Connected by the kernel agent to another endpoint (bidirectional connections)
- Communication segments are pinned DMA regions; buffer pool management is done by the network interface
- Notification done by polling or event-driven (upcall)
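An illustrative sketch of such an endpoint, with field names of my own choosing rather than U-Net's: queues of descriptors live in a pinned communication segment the NIC can DMA into directly, and the application either polls a head pointer or asks for an upcall.

#include <stdint.h>

#define QLEN 64

struct descriptor {
    uint32_t offset;   /* buffer location inside the pinned segment            */
    uint32_t length;   /* bytes received (recv queue) or to send (send queue)  */
};

struct unet_endpoint {
    void *segment;                    /* pinned, DMA-able communication segment */
    struct descriptor sendq[QLEN];    /* descriptors the host posts for sending */
    struct descriptor recvq[QLEN];    /* descriptors the NIC fills on arrival   */
    struct descriptor freeq[QLEN];    /* free buffers the NIC may consume       */
    volatile uint32_t recv_head;      /* advanced by the NIC as messages arrive */
    uint32_t recv_tail;               /* advanced by the application            */
    int use_upcall;                   /* 0 = poll recv_head, 1 = event-driven   */
};

/* Polling-style notification: the application watches recv_head; no kernel
 * involvement on the data path. */
static int endpoint_poll(struct unet_endpoint *ep, struct descriptor *out)
{
    if (ep->recv_tail == ep->recv_head)
        return 0;                            /* nothing new */
    *out = ep->recvq[ep->recv_tail % QLEN];
    ep->recv_tail++;
    return 1;
}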
8 Network Communication
[Figure: two applications communicating directly through their endpoints; each host's kernel is shown beside, not on, the path]
- Applications communicate through endpoints
- Kernel operations are NOT on the data movement path
9 U-Net Performance
- Raw U-Net
- Benefits of OS bypass
- 65 µs round-trip latency, 120 µs for 32 bytes, then amortizing to link speed
- => reduced overhead to ~30 µs
10 U-Net vs. Fore Firmware
- Lesson: commercial products are not always well-designed / mature
- Snapshot of the state of the art, and often very constrained by other circumstances
11 U-Net with Active Messages and IP Protocols
- Overheads for IP are significantly higher (2-2.5x)
- Reduces the deliverable bandwidth fraction for short messages
- Peak bandwidth achievable (for this network)
12 U-Net Summary
- Demonstrated the partition into kernel-managed connection setup and user-level communication
- User-level buffer management and protocols
- Demonstrated reasonable performance
13 Virtual Interface Architecture
- VIA project: Intel/Compaq/Microsoft
- Catalyze an industry standard for high-performance cluster communication
- Capitalize on the technical advances in academic research to reduce communication overhead and deliver the performance
- Started Dec 1996, standard in Dec 1997
- Technical work: paper design, emulator, lots of political wrangling amongst the companies (billions at stake)
- Designed to provide user-level communication on multiple operating systems -- WinNT, Novell Netware, Unix
- Designed to provide this user-level interface independent of the underlying interconnection fabric (ATM, GigE, Myrinet, Giganet, etc.)
14 VIA Basic Ideas
- Endpoints a la U-Net
- Hardware-supported NIC virtualization
- Send and receive buffer pools (memory registered with the interface)
- Doorbells for notification between host and NIC (sketch below)
- Polling and interrupt-based notification (user selectable)
- Network reliability attributes (failure semantics)
- Read and write RDMA operations (and ordering)
- Memory protection attributes
- Group notification (shared completion queues)
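A hedged sketch of the doorbell/descriptor idea; the structures below are illustrative and are not the VIPL API. The application links a descriptor pointing into registered memory onto the interface's send queue, then rings a memory-mapped doorbell so the NIC picks up the work without a system call; the NIC later marks the descriptor (or a shared completion queue entry) done.

#include <stdint.h>
#include <stddef.h>

struct via_descriptor {
    uint64_t addr;                 /* address within registered (pinned) memory */
    uint32_t length;               /* bytes to transfer                         */
    uint32_t mem_handle;           /* handle from memory registration           */
    volatile uint32_t status;      /* written by the NIC on completion          */
    struct via_descriptor *next;   /* next descriptor on the work queue         */
};

struct virtual_interface {
    struct via_descriptor *send_head, *send_tail;  /* posted send descriptors    */
    volatile uint32_t *send_doorbell;              /* memory-mapped NIC register */
};

/* Post a send: queue the descriptor, then ring the doorbell to tell the NIC
 * new work is waiting -- entirely from user space. */
static void vi_post_send(struct virtual_interface *vi, struct via_descriptor *d)
{
    d->status = 0;
    d->next = NULL;
    if (vi->send_tail)
        vi->send_tail->next = d;
    else
        vi->send_head = d;
    vi->send_tail = d;
    *vi->send_doorbell = 1;   /* doorbell write; value/encoding is NIC-specific */
}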
16 Delivering Gigabit Performance: FM 2.0
17 FM 1.x Evaluation
- MPI on FM (Fall 1995)
- BSD Sockets on FM (December 1995)
18 MPI-FM Efficiency
- Problems: excessive copies (pacing API), hard to program (interleaving)
19 MPI-FM Performance (initial)
- Problem: the FM 1.x (and AM) API is a poor design for composition
- How can we design an API that makes it easy to deliver performance?
- Key issues:
- Eliminate copying for header attach/remove (gather-scatter)
- Eliminate copying from network overrun (receiver flow control)
- Ease programming effort for interleaved PTUs (handler multithreading)
- All needed to deliver performance to the application layers
- Partial changes enabled:
- MPI on FM 1.1 (19 µs, 17.5 MB/s) -- JPDC '97
- Sockets on FM 1.1 (35 µs, 17.5 MB/s)
20 Illinois Fast Messages 2.x
- Gather-scatter interface enables efficient layering and data movement without copies (packetization invisible)
- Multithreading provides a sequential view of message reception (packetization and interleaving invisible)
- Bonuses: multiprocess support, dynamic network namespaces
21 Receiver Flow Control
- Receiver determines data pacing from the network subsystem (see the extract-loop sketch after the handler example)
- Lower levels provide communication/computation overlap
- Provides a simple composition model (examples)
- Leverages reliable delivery and flow control at the lower level
22 FM 2.x API
- Sending
- FM_begin_message(NodeID,Handler,size)
- FM_send_piece(stream,buffer,size) // gather
- FM_end_message()
- Receiving
- FM_receive(buffer,size) // scatter
- FM_extract(total_bytes) // rcvr flow control
- Implementation
- C parser rewrites code
- Logical thread for each message receive
- OS thread safe
23 Send Example (List Send)

extern FM_handler myhandler;

void sendlist(unsigned int dest, Node *nodep, unsigned int elts)
{
  FM_stream *mystream;
  unsigned int databytes = elts * sizeof(int);

  /* retry until the messaging layer hands back a stream */
  while (!(mystream = FM_begin_message(dest, databytes, myhandler)))
    ;
  while (nodep) {
    FM_send_piece(mystream, &nodep->data, sizeof(int));  /* gather: one piece per list element */
    nodep = nodep->next;
  }
  FM_end_message(mystream);
}
24 Handler Example (MPI)

#pragma FM_declare_handler
int myhandler(FM_stream *str, unsigned int sender)
{
  struct header myheader;
  int msglen;

  FM_receive(&myheader, str, sizeof(struct header));  /* scatter: pull the header off the stream */
  msglen = myheader.length;
  if (myheader.littlemsg)
    FM_receive(littlebuf, str, msglen);
  else
    FM_receive(findbigbuffer(msglen), str, msglen);
  return FM_CONTINUE;
}
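To round out the send and handler examples, here is a minimal sketch (mine, not from the slides) of the receive side: the application calls FM_extract from its own loop, so the receiver, not the network, paces how much pending data is pulled in and which handlers run. do_some_work and the header name are hypothetical placeholders.

#include "fm.h"              /* FM 2.x API declarations (header name assumed) */

extern void do_some_work(void);   /* hypothetical application computation */

void compute_loop(void)
{
  for (;;) {
    do_some_work();      /* application computation between polls */
    FM_extract(4096);    /* receiver flow control: process at most ~4 KB of
                            pending messages; their handlers (e.g. myhandler)
                            run during this call */
  }
}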
25 Platform Upgrade: PCs
[Figure: Pentium Pro/II node with the NIC on the PCI bus and the P6 host bus; 1280 Mbps link; Sparc -> x86 transition; roughly 2x faster cards and links]
- PCs exploit cost advantages and eliminate the PIO problem (graphics driven)
- Faster network cards and links (2x)
- OS: Windows NT and Linux (goal: widespread use)
26 FM 2.x Performance
- Latency 11 µs, BW 77 MB/s, N1/2 < 200 bytes
- Fast in absolute terms (compares to MPPs, internal memory BW)
- Delivers a large fraction of hardware performance for short messages
- The performance bottleneck has moved inside the system!
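For context, N1/2 is the message size at which delivered bandwidth reaches half the asymptotic peak. Under a simple first-order cost model (my assumption, not a measurement from these slides) with per-message overhead o and peak bandwidth B:

$$T(n) = o + \frac{n}{B}, \qquad \mathrm{BW}(n) = \frac{n}{T(n)}, \qquad \mathrm{BW}(N_{1/2}) = \frac{B}{2} \;\Rightarrow\; N_{1/2} = o \cdot B$$

Plugging in the FM 2.x numbers above (o of roughly 4.1 µs, B of roughly 77 MB/s) gives N1/2 on the order of 300 bytes, the same order as the measured N1/2 < 200 bytes.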
27 Performance Implications
- Typical packet distributions: 80-90% of packets are < 200 bytes
- => FM 2.x delivers 40 MB/s (320 Mbps) @ 256 bytes
- A fast UDP delivers 2 MB/s @ 256 bytes
- 20x superior bandwidth/overhead
- => Of course, these are not directly comparable.
28 FM 2.x Evaluation (MPI)
- MPI-FM: 70 MB/s, 17 µs latency, 5.1 µs overhead
- Peak BW comparable to the IBM SP2; short messages much better
- Latency comparable to the SGI O2K
- FM: 77 MB/s, 11 µs latency, 4.1 µs overhead
29 FM 2.x Evaluation (MPI), cont.
- High transfer efficiency, approaches 100%
- Other systems are much lower even at 1 KB (100 Mbit: ~40%, 1 Gbit: ~5%)
30 FM 2.0 Summary
- APIs and guarantees matter for delivering performance
- Layer composition is a critical issue in software communication architectures
- What are the equivalent concepts for other types of Grid performance (usable computation, memory, etc.)?
- What are the right metrics to drive this? N1/2 for parallelism?
31 Overall Summary
- User-level network interfaces
- Separation of connection setup
- User protocol processing and buffer management
- Embodied in U-Net and VIA (and FM)
- VIA: fault tolerance and RDMA operations
- Delivering communication performance
- Depends on APIs and guarantees
- Usable performance is the critical question
- Generalizations to Grid resource abstractions?
32 Next Time (January 21st)
- Reading Assignments
- Grid Book, Chapters 11 (Globus Toolkit) and 9.4-9.6 (Legion)
- Globus High-Level Vision
- The Globus Project: A Status Report. I. Foster, C. Kesselman, Proc. IPPS/SPDP '98 Heterogeneous Computing Workshop, pp. 4-18, 1998.
- Globus: A Metacomputing Infrastructure Toolkit. I. Foster, C. Kesselman, Intl. J. Supercomputer Applications, 11(2):115-128, 1997.
- Globus papers are available from http://www.globus.org/documentation/papers.html
33 Further reading (will be assigned next)
- A Directory Service for Configuring High-Performance Distributed Computations. S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, S. Tuecke. Proc. 6th IEEE Symp. on High-Performance Distributed Computing, pp. 365-375, 1997.
- Usage of LDAP in Globus. I. Foster, G. von Laszewski.
- A Fault Detection Service for Wide Area Distributed Computations. P. Stelling, I. Foster, C. Kesselman, C. Lee, G. von Laszewski. Proc. 7th IEEE Symp. on High Performance Distributed Computing, 1998.