SCTP versus TCP for MPI

1
SCTP versus TCP for MPI
  • Brad Penoff, Humaira Kamal, Alan Wagner
  • Department of Computer Science
  • University of British Columbia

Distributed Research Group
SC-2005 Nov 16
2
What is SCTP?
  • Stream Control Transmission Protocol (SCTP)
  • General purpose unicast transport protocol for IP
    network data communications
  • Recently standardized by IETF
  • Can be used anywhere TCP is used

3
What is SCTP?
  • Question: Can we take advantage of SCTP features to
    better support parallel applications using MPI?

4
Communicating MPI Processes
TCP is often used as the transport protocol for MPI.
[Figure: communicating MPI processes, with links labelled SCTP]
5
Overview of SCTP
6
SCTP Key Features
  • Reliable in-order delivery, flow control, full
    duplex transfer.
  • TCP-like congestion control
  • Selective ACK (SACK) is built into the protocol

7
SCTP Key Features
  • Message oriented
  • Use of associations
  • Multihoming
  • Multiple streams within an association

8
Associations and Multihoming
[Figure: an association between Endpoint X and Endpoint Y. Labels: NIC 1, NIC 2, NIC 3, NIC 4; Network 207.10.x.x with IP 207.10.3.20 and IP 207.10.40.1; Network 168.1.x.x with IP 168.1.140.10 and IP 168.1.10.30.]
9
Logical View of Multiple Streams in an Association
10
Partially Ordered User Messages Sent on Different
Streams
21
Partially Ordered User Messages Sent on Different
Streams
Messages can be received in the same order as they
were sent (required in TCP).
28
Partially Ordered User Messages Sent on Different
Streams
Delivery constraints: A must be before C, and C
must be before D.
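The partial ordering pictured on these slides can be sketched in Python. This is a toy model, not an SCTP API: the class `StreamReceiver`, its `arrive` method, and the assignment of message B to a second stream are all assumptions made for illustration. Each message carries a stream number and a per-stream sequence number; delivery is ordered within a stream, and a gap on one stream does not hold back another.

```python
from collections import defaultdict


class StreamReceiver:
    """Toy model of per-stream ordered delivery (illustrative, not a real SCTP API)."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # next sequence number expected on each stream
        self.pending = defaultdict(dict)   # out-of-order messages held back, per stream
        self.delivered = []                # payloads handed to the application, in order

    def arrive(self, stream, seq, payload):
        """Accept one message from the network and deliver whatever is now in order."""
        self.pending[stream][seq] = payload
        # Release consecutive messages on this stream only; other streams are unaffected.
        while self.next_seq[stream] in self.pending[stream]:
            self.delivered.append(self.pending[stream].pop(self.next_seq[stream]))
            self.next_seq[stream] += 1


# Hypothetical arrival order: A (stream 0, seq 0) is delayed, for example by a loss.
rx = StreamReceiver()
rx.arrive(1, 0, "B")   # stream 1 is independent of stream 0: delivered at once
rx.arrive(0, 1, "C")   # held back: A must be before C
rx.arrive(0, 2, "D")   # held back: C must be before D
rx.arrive(0, 0, "A")   # gap filled: A, C and D are released in order
print(rx.delivered)    # ['B', 'A', 'C', 'D']
```

With a single ordered byte stream, as in TCP, B would also have waited for A.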
29
MPI Point-to-Point Overview
30
MPI Point-to-Point
MPI_Send(msg,count,type,dest-rank,tag,context)
MPI_Recv(msg,count,type,source-rank,tag,context)
  • Message matching is done based on Tag, Rank and
    Context (TRC).
  • Combinations such as blocking, non-blocking,
    synchronous, asynchronous, buffered, unbuffered.
  • Use of wildcards for receive
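
Matching on tag, rank and context with wildcard receives can be sketched as follows. This is a minimal Python model, not MPI library code: the constants `ANY_SOURCE` and `ANY_TAG` stand in for `MPI_ANY_SOURCE` and `MPI_ANY_TAG` with illustrative values, and the helper names are hypothetical.

```python
ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE; the value is illustrative
ANY_TAG = -1     # stand-in for MPI_ANY_TAG; the value is illustrative


def matches(recv, msg):
    """Return True if a posted receive (rank, tag, context) matches an incoming message."""
    r_rank, r_tag, r_ctx = recv
    m_rank, m_tag, m_ctx = msg
    return (
        r_ctx == m_ctx                      # the context has no wildcard
        and r_rank in (ANY_SOURCE, m_rank)
        and r_tag in (ANY_TAG, m_tag)
    )


def match_first(posted_receives, msg):
    """Remove and return the earliest posted receive that matches, or None."""
    for i, recv in enumerate(posted_receives):
        if matches(recv, msg):
            return posted_receives.pop(i)
    return None  # no match: the message would be queued as unexpected


posted = [(2, 7, 0), (ANY_SOURCE, 5, 0), (1, ANY_TAG, 0)]
print(match_first(posted, (1, 5, 0)))   # (-1, 5, 0)
print(match_first(posted, (1, 5, 0)))   # (1, -1, 0)
print(match_first(posted, (1, 5, 1)))   # None (different context)
```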

31
MPI Messages Using Same Context, Two Processes
32
MPI Messages Using Same Context, Two Processes
Out-of-order messages with the same tag violate MPI
semantics
33
Using SCTP for MPI
  • Striking similarities between SCTP and MPI

34
SCTP-based MPI
35
MPI over SCTP: Design and Implementation
  • LAM (Local Area Multi-computer) is an open-source
    implementation of the MPI library.
  • We redesigned the LAM TCP RPI module to use SCTP.
  • The RPI module is responsible for maintaining state
    information for all requests.

36
Implementation Issues
  • Maintaining State Information
      • Maintain state appropriately for each request
        function to work with the one-to-many style.
  • Message Demultiplexing
      • Extend RPI initialization to map associations to
        rank.
      • Demultiplexing of each incoming message to direct
        it to the proper receive function.
  • Concurrency and SCTP Streams
      • Consistently map MPI tag-rank-context to SCTP
        streams, maintaining proper MPI semantics.
  • Resource Management
      • Make RPI more message-driven.
      • Eliminate the use of the select() system call,
        making the implementation more scalable.
      • Eliminating the need to maintain a large number
        of socket descriptors.
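
One way to map tag-rank-context to streams consistently is sketched below in Python. The slides do not give the mapping the RPI module uses, so everything here is an assumption: the constant `NUM_STREAMS`, the function `stream_for`, the arithmetic hash, and the choice to let the rank select the association so that only tag and context pick the stream within it.

```python
NUM_STREAMS = 10  # hypothetical number of streams available in each association


def stream_for(tag, context, num_streams=NUM_STREAMS):
    """Deterministically map an MPI (tag, context) pair to a stream number.

    The same pair always lands on the same stream, so messages with identical
    tag, rank and context keep their send order, as MPI requires.
    """
    return (context * 31 + tag) % num_streams


print(stream_for(tag=5, context=0))    # 5
print(stream_for(tag=5, context=0))    # 5 again: same stream, so order is preserved
print(stream_for(tag=6, context=0))    # 6: a different tag can use a different stream
print(stream_for(tag=15, context=0))   # 5: a collision is safe, only less concurrent
```

A collision only puts two independent message classes on one ordered stream; it cannot reorder messages that MPI requires to stay in order.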

37
Implementation Issues
  • Eliminating Race Conditions
      • Finding solutions for race conditions due to
        added concurrency.
      • Use of barrier after association setup phase.
  • Reliability
      • Modify out-of-band daemons and request
        progression interface (RPI) to use a common
        transport layer protocol to allow for all
        components of LAM to multihome successfully.
  • Support for large messages
      • Devised a long-message protocol to handle
        messages larger than socket send buffer.
  • Experiments with different SCTP stacks
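
The slides do not describe how the long-message protocol works. As a rough illustration of the underlying problem only, the Python sketch below splits a message that exceeds the send buffer into pieces and reassembles them. The functions `fragment` and `reassemble`, the piece layout, and the `SEND_BUFFER` value are all hypothetical and are not taken from the LAM SCTP module.

```python
def fragment(payload: bytes, max_chunk: int):
    """Split a message into (offset, total_length, chunk) pieces of at most max_chunk bytes."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    total = len(payload)
    return [(off, total, payload[off:off + max_chunk])
            for off in range(0, total, max_chunk)]


def reassemble(pieces):
    """Rebuild the original message from a non-empty list of pieces, in any arrival order."""
    total = pieces[0][1]
    buf = bytearray(total)
    for off, _, chunk in pieces:
        buf[off:off + len(chunk)] = chunk
    return bytes(buf)


SEND_BUFFER = 8  # hypothetical send buffer size in bytes, kept tiny for illustration
message = b"a long MPI message body"
pieces = fragment(message, SEND_BUFFER)
print(len(pieces))                                     # 3
print(reassemble(list(reversed(pieces))) == message)   # True
```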

38
Features of Design
  • Head-of-Line Blocking Avoidance
  • Scalability, 1 socket per process
  • Multihoming
  • Added Security

39
Head-of-Line Blocking
47
Performance
48
SCTP Performance
  • SCTP stack is in early stages and will improve
    over time
  • Performance is stack dependent (Linux lksctp
    stack << FreeBSD KAME stack)
  • SCTP bundles messages together, so it might not
    always be able to pack a full MTU
  • Comprehensive CRC32c checksum offload to NIC is not
    yet commonly available
49
Experiments
  • MPBench Ping-pong comparison
  • NAS Parallel benchmarks
  • Task Farm Program

8 nodes, Dummynet. Fair comparison: same socket
buffer sizes, Nagle disabled, SACK on, no
multihoming, CRC32c off.
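Two of these settings, disabling Nagle and fixing the socket buffer sizes, can be shown on the TCP side with Python's standard socket module. This is only a sketch: the slides do not give the buffer size used, so `BUF_SIZE` is a hypothetical value, and the kernel may round or cap whatever is requested. The corresponding SCTP settings are not shown.

```python
import socket

BUF_SIZE = 220 * 1024  # hypothetical buffer size; the slides do not state the value used

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Disable Nagle's algorithm so that small messages are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Request equal send and receive buffer sizes for the comparison.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)
    # Read the option back to confirm that it took effect.
    nagle_disabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
    print(nagle_disabled)
finally:
    sock.close()
```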
50
Experiments: Ping-pong
[Figure: MPBench Ping Pong Test under No Loss]
51
Experiments: NAS
52
Experiments: Task Farm
  • Non-blocking communication
  • Overlap computation with communication
  • Use of multiple tags

53
Task Farm - Short Messages
54
Task Farm - Head-of-line blocking
55
Conclusions
  • SCTP is a better match for MPI
      • Avoids unnecessary head-of-line blocking due to
        use of streams
      • Increased fault tolerance in presence of
        multihomed hosts
      • Built-in security features
      • Improved congestion control
  • SCTP may enable more MPI programs to execute in
    LAN and WAN environments.

56
Future Work
  • Release our LAM SCTP RPI module
  • Modify real applications to use tags as streams
  • Continue to look for opportunities to take
    advantage of standard IP transport protocols for
    MPI

57
Thank you!
  • More information about our work is at
    http://www.cs.ubc.ca/labs/dsg/mpi-sctp/

Or Google "sctp mpi"
58
Extra Slides
59
Associations and Multihoming
[Figure: Endpoint X and Endpoint Y with NIC 1, NIC 2, NIC 3, NIC 4; Network 207.10.x.x with IP 207.10.3.20 and IP 207.10.40.1; Network 168.1.x.x with IP 168.1.140.10 and IP 168.1.10.30. Same diagram as slide 8.]
60
MPI over SCTP: Design and Implementation
  • Challenges
  • Lack of documentation
  • Code examination
  • Our document is linked from the LAM/MPI website
  • Extensive instrumentation
  • Diagnostic traces
  • Identification of problems in the SCTP protocol

61
MPI API Implementation
  • Request Progression Layer
  • Short Messages vs. Long Messages

62
Partially Ordered User Messages Sent on Different
Streams
63
Added Security
User data can be piggy-backed on the third and fourth
legs of the handshake
SCTP's Use of Signed Cookie
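The idea behind a signed cookie can be sketched in Python: the server signs the association state and hands it to the peer instead of storing it, then checks the signature when the cookie comes back. This is a simplified illustration, not the SCTP wire format. `SECRET`, the state string, the two function names, and the use of HMAC-SHA-256 are assumptions, and the timestamp and lifetime a real cookie carries are left out.

```python
import hashlib
import hmac

SECRET = b"server-private-key"  # hypothetical; a real stack would use a random secret
MAC_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def make_cookie(state: bytes) -> bytes:
    """Second leg: sign the association state and send it, rather than storing it."""
    mac = hmac.new(SECRET, state, hashlib.sha256).digest()
    return mac + state


def check_cookie(cookie: bytes):
    """Third leg: return the state if the echoed cookie's signature verifies, else None."""
    mac, state = cookie[:MAC_LEN], cookie[MAC_LEN:]
    expected = hmac.new(SECRET, state, hashlib.sha256).digest()
    return state if hmac.compare_digest(mac, expected) else None


cookie = make_cookie(b"peer=10.0.0.7:5000;tag=1234")
print(check_cookie(cookie))    # b'peer=10.0.0.7:5000;tag=1234'
forged = cookie[:-1] + b"5"    # tamper with the last byte of the state
print(check_cookie(forged))    # None
```

Because the server allocates no resources until a valid cookie is echoed, a flood of initial requests from spoofed addresses costs it only the signing work.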
64
Added Security
  • 32-bit Verification Tag guards against reset attacks
  • Autoclose feature
  • No half-closed state

65
NAS Benchmarks
  • The NAS benchmarks approximate real world
    parallel scientific applications
  • We experimented with a suite of 7 benchmarks, 4
    data set sizes
  • SCTP performance comparable to TCP for large
    datasets.

66
Farm Program - Long Messages
67
Head-of-line blocking: Long messages
68
Experiments: Benchmarks
  • SCTP outperformed TCP under loss for the ping-pong
    test.
