Title: SCTP versus TCP for MPI
1SCTP versus TCP for MPI
- Brad Penoff, Humaira Kamal, Alan Wagner
- Department of Computer Science
- University of British Columbia
2Outline
- Self Introduction
- Research background
- Research presentation
- SCTP MPI background
- MPI over SCTP design
- Design features
- Results
- Conclusions
3Who am I?
- Born and raised in Columbus area
- OSU alumnus
- Europa alumnus
- Worked a few years
- Grad student finishing my MSc at UBC
4UBC
5Who do I work with?
- Alan Wagner (Prof, UBC)
- Humaira Kamal (PhD, UBC)
- Mike Yao Chen Tsai (MSc, UBC)
- Edith Vong (BSc, UBC)
- Randall Stewart (Cisco)
8What field do we work in?
- Parallel computing
- Concurrently utilize multiple resources
- 1 cook vs. 8 cooks
10What field do we work in?
- Message passing programming model
- Message Passing Interface (MPI)
- Standardized API for applications
11What field do we work in?
- Middleware for MPI
- Glues the necessary components together for a parallel environment
13What field do we work in?
- Parallel library component
- Implements MPI API for various interconnects
- Shared memory
- Myrinet
- Infiniband
- Specialized hardware (BlueGene/L, ASCI Red, etc)
14What field do we work in?
- TCP/IP protocol stack interconnect
- Stream Control Transmission Protocol
15SCTP versus TCP for MPI
- Brad Penoff, Humaira Kamal, Alan Wagner
- Department of Computer Science
- University of British Columbia
- Supercomputing 2005, Seattle, Washington USA
17What are MPI and SCTP?
- Message Passing Interface (MPI)
- Library that is widely used to parallelize scientific and compute-intensive programs
- Stream Control Transmission Protocol (SCTP)
- General purpose unicast transport protocol for IP network data communications
- Recently standardized by IETF
- Can be used anywhere TCP is used
- Question
- Can we take advantage of SCTP features to better support parallel applications using MPI?
18Communicating MPI Processes
TCP is often used as the transport protocol for MPI
[Figure: communicating MPI processes, each endpoint labeled SCTP]
19SCTP Key Features
- Reliable in-order delivery, flow control, full-duplex transfer
- Selective ACK is built into the protocol
- TCP-like congestion control
20SCTP Key Features
- Message oriented
- Use of associations
- Multihoming
- Multiple streams within an association
21Associations and Multihoming
- Primary address
- Heartbeats
- Retransmissions
- Failover
- User adjustable controls
- CMT
22Logical View of Multiple Streams in an Association
23Partially Ordered User Messages Sent on Different Streams
Messages can be received in the same order in which they were sent, which is the only order TCP allows.
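The ordering rule on these slides can be sketched as a small check: messages on the same stream must be delivered in the order sent, while messages on different streams may interleave. This is only an illustration of partial ordering; `is_valid_delivery` and the `(stream_id, label)` tuples are hypothetical names, not part of any SCTP API.

```python
from collections import defaultdict


def is_valid_delivery(sent, delivered):
    """Return True if `delivered` respects the per-stream order of `sent`.

    Each message is a (stream_id, label) tuple. Only the relative order of
    messages that share a stream_id is constrained.
    """
    if sorted(sent) != sorted(delivered):
        return False  # something was lost or duplicated

    def per_stream(messages):
        order = defaultdict(list)
        for stream_id, label in messages:
            order[stream_id].append(label)
        return order

    return per_stream(sent) == per_stream(delivered)


sent = [(0, "A1"), (1, "B1"), (0, "A2"), (2, "C1")]

# Same order as sent: the only order a single TCP byte stream would allow.
tcp_like = list(sent)

# Streams 1 and 2 overtake stream 0: still valid with multiple streams.
reordered = [(1, "B1"), (2, "C1"), (0, "A1"), (0, "A2")]

# A2 before A1 breaks the ordering inside stream 0.
broken = [(0, "A2"), (0, "A1"), (1, "B1"), (2, "C1")]
```

Calling `is_valid_delivery(sent, reordered)` accepts the interleaving, while `is_valid_delivery(sent, broken)` rejects it because stream 0 is out of order.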
39MPI API Implementation
MPI_Send(msg,count,type,dest-rank,tag,context)
MPI_Recv(msg,count,type,source-rank,tag,context)
- Message matching is done based on Tag, Rank and Context (TRC).
- Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered.
- Use of wildcards for receive
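A minimal sketch of TRC matching with receive wildcards, assuming posted receives are checked in the order they were posted. The `Envelope` class and the `ANY_SOURCE` / `ANY_TAG` values are stand-ins chosen for this example; real MPI libraries define their own constants and data structures.

```python
from dataclasses import dataclass

ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE (value is an assumption)
ANY_TAG = -1     # stand-in for MPI_ANY_TAG (value is an assumption)


@dataclass(frozen=True)
class Envelope:
    tag: int
    rank: int      # source rank of the message
    context: int   # communicator context


def matches(posted, incoming):
    """True if a posted receive accepts an incoming message envelope."""
    if posted.context != incoming.context:
        return False  # the context has no wildcard
    if posted.rank != ANY_SOURCE and posted.rank != incoming.rank:
        return False
    if posted.tag != ANY_TAG and posted.tag != incoming.tag:
        return False
    return True


def first_match(posted_receives, incoming):
    """Return the index of the earliest posted receive that matches, or None."""
    for index, posted in enumerate(posted_receives):
        if matches(posted, incoming):
            return index
    return None


posted = [
    Envelope(tag=7, rank=2, context=0),
    Envelope(tag=ANY_TAG, rank=ANY_SOURCE, context=0),
]
```

Here a message with tag 7 from rank 2 matches the first posted receive, any other message in context 0 falls through to the wildcard receive, and a message in a different context matches nothing.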
41MPI Messages Using Same Context, Two Processes
Out-of-order messages with the same tag violate MPI semantics
42MPI API Implementation
- Request Progression Layer
- Short Messages vs. Long Messages
43MPI over SCTP Design and Implementation
- LAM (Local Area Multicomputer) is an open source implementation of the MPI library.
- Origins at Ohio Supercomputer Center
- We redesigned the LAM TCP RPI module to use SCTP.
- The RPI module is responsible for maintaining state information of all requests.
44MPI over SCTP Design and Implementation
- Challenges
- Lack of documentation
- Code examination
- Our document is linked from the LAM/MPI website
- Extensive instrumentation
- Diagnostic traces
- Identification of problems in SCTP protocol
45Using SCTP for MPI
- Striking similarities between SCTP and MPI
46Implementation Issues
- Maintaining State Information
- Maintain state appropriately for each request function to work with the one-to-many style.
- Message Demultiplexing
- Extend RPI initialization to map associations to rank.
- Demultiplex each incoming message to direct it to the proper receive function.
- Concurrency and SCTP Streams
- Consistently map MPI tag-rank-context to SCTP streams, maintaining proper MPI semantics.
- Resource Management
- Make RPI more message-driven.
- Eliminate the use of the select() system call, making the implementation more scalable.
- Eliminate the need to maintain a large number of socket descriptors.
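A minimal sketch of a consistent tag-rank-context mapping, assuming the rank selects the association and the tag and context are folded onto a fixed pool of streams. The pool size, the multiplier and the name `stream_for` are assumptions made for illustration; the slides do not give the actual mapping.

```python
NUM_STREAMS = 10  # assumed size of the stream pool per association


def stream_for(tag, context, num_streams=NUM_STREAMS):
    """Map an MPI (tag, context) pair to an SCTP stream number.

    The rank is assumed to select the association, so it is not part of
    the stream computation. The mapping is deterministic, so every message
    with the same tag and context lands on the same stream and keeps its
    order, which MPI matching requires.
    """
    if tag < 0 or context < 0:
        raise ValueError("tag and context must be non-negative")
    return (tag + context * 31) % num_streams
```

Different tags may share a stream (for example, tags 3 and 13 with this pool size). That is safe, since it only adds ordering between them, but it reduces the concurrency that separate streams would give.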
47Implementation Issues
- Eliminating Race Conditions
- Finding solutions for race conditions due to added concurrency.
- Use of barrier after association setup phase.
- Reliability
- Modify out-of-band daemons and request progression interface (RPI) to use a common transport layer protocol to allow for all components of LAM to multihome successfully.
- Support for large messages
- Devised a long-message protocol to handle messages larger than socket send buffer.
- Experiments with different SCTP stacks
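One way to handle a message larger than the socket send buffer is to split it into pieces that each fit and reassemble them at the receiver. The sketch below shows only that splitting idea; it is not the long-message protocol devised for the SCTP RPI module, whose details are not given on the slide. `SEND_BUFFER_SIZE`, `split_message` and `reassemble` are hypothetical names.

```python
SEND_BUFFER_SIZE = 8  # tiny on purpose so the example is easy to follow


def split_message(payload, limit=SEND_BUFFER_SIZE):
    """Split `payload` (bytes) into pieces of at most `limit` bytes."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [payload[i:i + limit] for i in range(0, len(payload), limit)]


def reassemble(pieces):
    """Concatenate pieces that arrived in order on one stream."""
    return b"".join(pieces)


message = b"a long MPI message body"
pieces = split_message(message)
```

Reassembly by plain concatenation relies on all pieces of one message travelling on the same stream, so that they arrive in the order sent.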
48Features of Design
- Scalability
- Head-of-Line Blocking
49Scalability
TCP
50Scalability
SCTP
51Head-of-Line Blocking
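Head-of-line blocking can be illustrated with a toy model: a message is handed to the application only once it and every earlier message in the same ordered sequence have arrived. With one ordered sequence (TCP-like) a single late message delays everything behind it; with per-stream ordering only its own stream waits. The function name, message names and arrival times are made up for this sketch.

```python
def delivery_times(messages, arrival, per_stream):
    """Compute when each message can be handed to the application.

    messages:   list of (name, stream_id) in send order
    arrival:    dict name -> time the data reaches the receiver
    per_stream: False models one ordered byte stream (TCP-like);
                True orders messages only within their own stream.
    """
    ready = {}
    latest = {}  # ordering key -> latest arrival seen so far in that order
    for name, stream_id in messages:
        key = stream_id if per_stream else "single"
        latest[key] = max(latest.get(key, 0), arrival[name])
        ready[name] = latest[key]
    return ready


messages = [("m1", 0), ("m2", 1), ("m3", 2)]
# m1 is lost once and its retransmission arrives at t=10.
arrival = {"m1": 10, "m2": 2, "m3": 3}

tcp_like = delivery_times(messages, arrival, per_stream=False)
sctp_like = delivery_times(messages, arrival, per_stream=True)
```

In the TCP-like case all three messages become available at t=10; with per-stream ordering m2 and m3 are available at t=2 and t=3 and only m1 waits for its retransmission.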
59Limitations
- Comprehensive CRC32c checksum offload to NIC not yet commonly available
- SCTP bundles messages together so it might not always be able to pack a full MTU
- SCTP stack is in early stages and will improve over time
- Performance is stack dependent (Linux lksctp stack << FreeBSD KAME stack)
60Experiments
- Controlled environment
- Eight nodes
- Dummynet
- Used standard benchmarks as well as real world programs
- Fair comparison
- Buffer sizes, Nagle disabled, SACK ON, No multihoming, CRC32c OFF
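As an illustration of the socket-level settings listed above, the sketch below disables Nagle's algorithm and requests explicit buffer sizes on a TCP socket using Python's standard `socket` module. The buffer value is an arbitrary placeholder, not the one used in the experiments; SCTP sockets expose an analogous `SCTP_NODELAY` option, which is not shown here.

```python
import socket

BUFFER_BYTES = 220 * 1024  # illustrative value, not taken from the experiments

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Disable Nagle's algorithm so small messages are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Request the same send and receive buffer size.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUFFER_BYTES)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUFFER_BYTES)

    # Read the settings back; the kernel may adjust the buffer sizes.
    nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
finally:
    sock.close()
```

Reading the values back matters for a fair comparison, because the operating system can round or cap the requested buffer sizes.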
61Experiments Benchmarks
MPBench Ping Pong Test under No Loss
62NAS Benchmarks
- The NAS benchmarks approximate real world parallel scientific applications
- We experimented with a suite of 7 benchmarks, 4 data set sizes
- SCTP performance comparable to TCP for large datasets.
63Latency Tolerant Programs
- Bulk Farm Processor program
- Real-world application
- Non-blocking communication
- Overlap computation with communication
- Use of multiple tags
64Farm Program - Short Messages
65Head-of-line blocking Short messages
66Conclusions
- SCTP is better suited for MPI
- Avoids unnecessary head-of-line blocking due to use of streams
- Increased fault tolerance in presence of multihomed hosts
- Built-in security features
- Robust under loss
- SCTP might be key to moving MPI programs from LANs to WANs.
67Future Work
- Release LAM SCTP RPI module at SC05
- Incorporate our work into Open MPI and/or MPICH2
- Modify real applications to use tags as streams
68Thank you!
- More information about our work is at
- http://www.cs.ubc.ca/labs/dsg/mpi-sctp/
69Extra Slides
71Added Security
User data can be piggybacked on the third and fourth legs of the handshake
SCTP's Use of Signed Cookie
72Added Security
- 32-bit Verification Tag protects against reset attacks
- Autoclose feature
- No half-closed state
73Farm Program - Long Messages
74Head-of-line blocking Long messages
75Experiments Benchmarks
- SCTP outperformed TCP under loss for ping pong
test.