Title: SCTP versus TCP for MPI

1. SCTP versus TCP for MPI
- Brad Penoff, Humaira Kamal, Alan Wagner
- Department of Computer Science
- University of British Columbia
- Distributed Systems Group
SC 2005, Nov 16
2-3. What is SCTP?
- Stream Control Transmission Protocol (SCTP)
- General-purpose unicast transport protocol for IP network data communications
- Recently standardized by the IETF
- Can be used anywhere TCP is used
- Question: can we take advantage of SCTP features to better support parallel applications using MPI?
4. Communicating MPI Processes
- TCP is often used as the transport protocol for MPI; here SCTP takes its place.
[Figure: MPI processes communicating over SCTP]
5. Overview of SCTP
6. SCTP Key Features
- Reliable in-order delivery, flow control, full-duplex transfer
- TCP-like congestion control
- Selective ACK (SACK) is built into the protocol
7. SCTP Key Features
- Message oriented
- Use of associations
- Multihoming
- Multiple streams within an association (socket-level sketch below)
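
To make these features concrete, here is a minimal socket-level sketch, not taken from the talk, of setting up such an endpoint with the Linux lksctp API: a one-to-many SOCK_SEQPACKET socket is message oriented and can carry many associations, sctp_bindx() binds several local addresses for multihoming, and SCTP_INITMSG requests multiple streams per association. The port, stream counts, and addresses (borrowed from the slide 8 figure) are illustrative.

    /* Sketch: a message-oriented, multihomed, multi-stream SCTP endpoint. */
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <netinet/sctp.h>

    int open_sctp_endpoint(void)
    {
        /* One-to-many style: one socket, many associations,
         * message boundaries preserved. */
        int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);

        struct sockaddr_in addrs[2];
        memset(addrs, 0, sizeof(addrs));
        addrs[0].sin_family = AF_INET;
        addrs[0].sin_port   = htons(5000);
        inet_pton(AF_INET, "207.10.3.20", &addrs[0].sin_addr);   /* NIC 1 */
        addrs[1].sin_family = AF_INET;
        addrs[1].sin_port   = htons(5000);
        inet_pton(AF_INET, "168.1.140.10", &addrs[1].sin_addr);  /* NIC 2 */

        /* Multihoming: bind both interfaces so an association can fail over. */
        sctp_bindx(sd, (struct sockaddr *)addrs, 2, SCTP_BINDX_ADD_ADDR);

        /* Ask for multiple streams on each association. */
        struct sctp_initmsg init = { .sinit_num_ostreams  = 10,
                                     .sinit_max_instreams = 10 };
        setsockopt(sd, IPPROTO_SCTP, SCTP_INITMSG, &init, sizeof(init));

        listen(sd, 5);
        return sd;
    }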
8. Associations and Multihoming
[Figure: a single association between multihomed Endpoint X (NIC 1, NIC 2) and Endpoint Y (NIC 3, NIC 4). Network 207.10.x.x connects NIC 1 (IP 207.10.3.20) to NIC 3 (IP 207.10.40.1); network 168.1.x.x connects NIC 2 (IP 168.1.140.10) to NIC 4 (IP 168.1.10.30).]
9. Logical View of Multiple Streams in an Association
10-28. Partially Ordered User Messages Sent on Different Streams
[Animation across slides 10-28: user messages sent on different streams of one association.]
- The messages can be received in the same order as they were sent (a total order, as required in TCP).
- But only a partial order is enforced; the delivery constraints here are that A must be before C, and C must be before D (see the sketch below).
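
A hedged sketch of how a sender expresses such constraints with lksctp's sctp_sendmsg(): messages that must stay ordered (A, then C, then D) share one stream, while an unrelated message travels on another stream and may be delivered around them. The extra message B, the stream numbers, and the peer arguments are illustrative.

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/sctp.h>

    void send_partially_ordered(int sd, struct sockaddr *peer, socklen_t peerlen)
    {
        const char *a = "A", *b = "B", *c = "C", *d = "D";

        /* A -> C -> D must stay ordered: send all three on stream 0. */
        sctp_sendmsg(sd, a, strlen(a), peer, peerlen, 0, 0, /* stream */ 0, 0, 0);
        sctp_sendmsg(sd, c, strlen(c), peer, peerlen, 0, 0, /* stream */ 0, 0, 0);
        sctp_sendmsg(sd, d, strlen(d), peer, peerlen, 0, 0, /* stream */ 0, 0, 0);

        /* B is unordered relative to the others: give it its own stream,
         * so a loss on stream 0 cannot delay it (no head-of-line blocking). */
        sctp_sendmsg(sd, b, strlen(b), peer, peerlen, 0, 0, /* stream */ 1, 0, 0);
    }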
29. MPI Point-to-Point Overview
30. MPI Point-to-Point
MPI_Send(msg, count, type, dest-rank, tag, context)
MPI_Recv(msg, count, type, source-rank, tag, context)
- Message matching is done based on Tag, Rank, and Context (TRC).
- Send/receive variants combine blocking, non-blocking, synchronous, asynchronous, buffered, and unbuffered modes.
- Wildcards may be used on the receive side (example below).
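
For reference, a minimal, self-contained MPI program exercising the matching rules above, including a wildcard receive. This is plain MPI, not code from the LAM module.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, msg = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Matched by (tag, source rank, communicator/context). */
            MPI_Send(&msg, 1, MPI_INT, /* dest */ 1, /* tag */ 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status st;
            /* Wildcards relax the match: any source, any tag. */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            printf("rank 1 got %d from rank %d with tag %d\n",
                   msg, st.MPI_SOURCE, st.MPI_TAG);
        }
        MPI_Finalize();
        return 0;
    }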
31-32. MPI Messages Using Same Context, Two Processes
- Out-of-order delivery of messages with the same tags violates MPI semantics.
33. Using SCTP for MPI
- Striking similarities between SCTP and MPI
34. SCTP-based MPI
35. MPI over SCTP: Design and Implementation
- LAM (Local Area Multicomputer) is an open-source implementation of the MPI library.
- We redesigned LAM's TCP RPI module to use SCTP.
- The RPI module is responsible for maintaining state information for all requests.
36. Implementation Issues
- Maintaining state information
- Maintain state appropriately for each request function to work with the one-to-many style.
- Message demultiplexing
- Extend RPI initialization to map associations to ranks.
- Demultiplex each incoming message to direct it to the proper receive function (sketch below).
- Concurrency and SCTP streams
- Consistently map MPI tag-rank-context to SCTP streams while maintaining proper MPI semantics.
- Resource management
- Make the RPI more message-driven.
- Eliminate the use of the select() system call, making the implementation more scalable.
- Eliminate the need to maintain a large number of socket descriptors.
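
A sketch of that demultiplexing step, assuming lksctp's sctp_recvmsg() on the single one-to-many socket: each received message carries a sctp_sndrcvinfo with the association id and stream number, which the RPI can map back to an MPI rank and tag. The three helper functions are hypothetical placeholders, not LAM functions.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>

    /* Hypothetical helpers assumed to live elsewhere in the RPI. */
    int  assoc_to_rank(sctp_assoc_t assoc_id);
    int  stream_to_tag(unsigned short stream);
    void deliver_to_matching_recv(int rank, int tag, const char *buf, int len);

    void progress_one_message(int sd)
    {
        char buf[65536];
        struct sctp_sndrcvinfo info;
        struct sockaddr_in from;
        socklen_t fromlen = sizeof(from);
        int flags = 0;

        int n = sctp_recvmsg(sd, buf, sizeof(buf),
                             (struct sockaddr *)&from, &fromlen, &info, &flags);
        if (n <= 0)
            return;

        /* Map the transport-level identifiers back to MPI-level ones. */
        int rank = assoc_to_rank(info.sinfo_assoc_id);
        int tag  = stream_to_tag(info.sinfo_stream);
        deliver_to_matching_recv(rank, tag, buf, n);
    }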
37. Implementation Issues
- Eliminating race conditions
- Find solutions for race conditions caused by the added concurrency.
- Use a barrier after the association setup phase.
- Reliability
- Modify the out-of-band daemons and the request progression interface (RPI) to use a common transport protocol so that all components of LAM can multihome successfully.
- Support for large messages
- Devised a long-message protocol to handle messages larger than the socket send buffer (illustrated below).
- Experiments with different SCTP stacks
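
The talk does not spell out the long-message protocol itself, so the following is only an illustrative fragmentation loop: a message larger than the socket send buffer is split into bounded chunks, each sent as its own SCTP message on the same stream so the fragments arrive in order. The chunk size is an assumption, as is the absence of the flow-control handshaking a real rendezvous protocol would add.

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/sctp.h>

    #define CHUNK (64 * 1024)  /* assumed to fit within the socket send buffer */

    void send_long(int sd, struct sockaddr *peer, socklen_t peerlen,
                   const char *msg, size_t len, unsigned short stream)
    {
        size_t off = 0;
        while (off < len) {
            size_t n = (len - off > CHUNK) ? CHUNK : (len - off);
            /* Same stream keeps the fragments ordered at the receiver. */
            sctp_sendmsg(sd, msg + off, n, peer, peerlen, 0, 0, stream, 0, 0);
            off += n;
        }
    }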
38. Features of the Design
- Head-of-Line Blocking Avoidance
- Scalability (one socket per process)
- Multihoming
- Added Security
39. Head-of-Line Blocking
40-46. (No Transcript)
47. Performance
48. SCTP Performance
- The SCTP stack is in its early stages and will improve over time.
- Performance is stack dependent (Linux lksctp stack << FreeBSD KAME stack).
- SCTP bundles messages together, so it might not always be able to pack a full MTU.
- Comprehensive CRC32c checksum offload to the NIC is not yet commonly available.
49. Experiments
- MPBench ping-pong comparison
- NAS parallel benchmarks
- Task farm program
Setup: 8 nodes, Dummynet, fair comparison; same socket buffer sizes, Nagle disabled, SACK on, no multihoming, CRC32c off (settings sketch below).
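
A sketch of how the per-socket settings named above might look on the SCTP side; SCTP_NODELAY is SCTP's analogue of TCP_NODELAY, and SO_SNDBUF/SO_RCVBUF pin matching buffer sizes on both transports. The 64 KB size is illustrative, not necessarily the value used in the experiments.

    #include <sys/socket.h>
    #include <netinet/sctp.h>

    void configure_socket(int sd)
    {
        int one = 1, bufsz = 64 * 1024;

        /* Nagle disabled: send small messages immediately. */
        setsockopt(sd, IPPROTO_SCTP, SCTP_NODELAY, &one, sizeof(one));

        /* Same socket buffer sizes for a fair TCP/SCTP comparison. */
        setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz));
        setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz));
    }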
50. Experiments: Ping-Pong
[Figure: MPBench ping-pong test under no loss]
51. Experiments: NAS
52. Experiments: Task Farm
- Non-blocking communication
- Overlap computation with communication
- Use of multiple tags
53. Task Farm - Short Messages
54. Task Farm - Head-of-Line Blocking
55. Conclusions
- SCTP is a better match for MPI.
- Avoids unnecessary head-of-line blocking through the use of streams.
- Increased fault tolerance in the presence of multihomed hosts.
- Built-in security features.
- Improved congestion control.
- SCTP may enable more MPI programs to execute in LAN and WAN environments.
56. Future Work
- Release our LAM SCTP RPI module
- Modify real applications to use tags as streams
- Continue to look for opportunities to take advantage of standard IP transport protocols for MPI
57. Thank you!
- More information about our work is at
- http://www.cs.ubc.ca/labs/dsg/mpi-sctp/
- Or Google "sctp mpi"
58. Extra Slides
59. Associations and Multihoming
[Figure: repeat of the association/multihoming diagram from slide 8.]
60. MPI over SCTP: Design and Implementation
- Challenges
- Lack of documentation
- Code examination
- Our document is linked off the LAM/MPI website
- Extensive instrumentation
- Diagnostic traces
- Identification of problems in the SCTP protocol
61. MPI API Implementation
- Request Progression Layer
- Short Messages vs. Long Messages
62. Partially Ordered User Messages Sent on Different Streams
63. Added Security
[Figure: SCTP's use of a signed cookie in its four-way handshake]
- User data can be piggybacked on the third and fourth legs of the handshake.
64. Added Security
- The 32-bit verification tag guards against blind reset attacks.
- Autoclose feature (sketch below)
- No half-closed state
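
The autoclose feature can be enabled per socket with the standard SCTP_AUTOCLOSE option: on a one-to-many socket, associations idle for the given number of seconds are shut down automatically, so no half-open state lingers. The 120-second value is illustrative.

    #include <sys/socket.h>
    #include <netinet/sctp.h>

    void enable_autoclose(int sd)
    {
        int seconds = 120;  /* 0 disables autoclose */
        setsockopt(sd, IPPROTO_SCTP, SCTP_AUTOCLOSE, &seconds, sizeof(seconds));
    }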
65. NAS Benchmarks
- The NAS benchmarks approximate real-world parallel scientific applications.
- We experimented with a suite of 7 benchmarks at 4 data-set sizes.
- SCTP performance was comparable to TCP for large data sets.
66. Farm Program - Long Messages
67. Head-of-Line Blocking - Long Messages
68-70. Experiments: Benchmarks
- SCTP outperformed TCP under loss for the ping-pong test.
71. (No Transcript)