Title: MPI
1 MPI _at_
- Brad Penoff, Camilo Rostoker, Alan Wagner,
- Mike Tsai, Humaira Kamal, Edith Vong
- Department of Computer Science
- University of British Columbia
March 15, 2006
2 Overview
- What is MPI and its role within HPC?
- What is SCTP and how can it help MPI?
- MPI middleware: the good and the bad.
- How do we use MPI?
- What is the future for IP protocols in HPC?
4 Some HPC goals
- To solve large problems involving large computations on large datasets
- To enable new types of analysis
- To utilize all available resources
  - Processors
  - Networks
  - I/O means
- To scale
5 One approach within HPC
- Parallel programming
- Models for explicitly expressing a task whose parts can effectively be run simultaneously
- The most well-known use of a model
- MPI (Message Passing Interface)
- API designed 10 years ago by committee
- Sometimes called the assembly language of parallel processing
6 Middleware for MPI
- Glues necessary components together for a parallel environment
- Attempts to allow for portability with maximal performance
7 Communication component
- Implements MPI API for various interconnects
  - Shared memory
  - Myrinet
  - InfiniBand
  - Specialized hardware (BlueGene/L, ASCI Red, XD1, etc.)
  - Standard TCP/IP transport protocols
8 TCP/IP protocol stack
- About 40% of machines in the Top500 use TCP
- SCTP had not yet been used for MPI
10 What is SCTP?
- Stream Control Transmission Protocol
- General-purpose unicast transport protocol for IP network data communications
- Recently standardized by IETF
- Can be used anywhere TCP is used
11 SCTP Key Similarities
- Reliable in-order delivery, flow control, full-duplex transfer
- TCP-like congestion control
- Selective ACK is built into the protocol
12 SCTP Key Differences
- Message oriented
- Added security
- Multihoming, use of associations
- Multiple streams within an association
13 Associations and Multihoming
[Figure: Endpoint X and Endpoint Y joined by a single association. The endpoints have four NICs between them (NIC 1 to NIC 4) and are connected by two networks: 207.10.x.x (IP addresses 207.10.3.20 and 207.10.40.1) and 168.1.x.x (IP addresses 168.1.140.10 and 168.1.10.30).]
14 Logical View of Multiple Streams in an Association
15 Partially Ordered User Messages Sent on Different Streams
Messages can be received in the same order as they were sent (as TCP requires).
Delivery constraints: A must be before C, and C must be before D.
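The delivery constraints above can be illustrated with a small model. The sketch below is a toy simulation, not SCTP itself: the `StreamReceiver` class, the placement of A, C and D on stream 0 and B on stream 1, and the arrival order are all assumptions made for illustration.

```python
from collections import defaultdict


class StreamReceiver:
    """Toy model of per-stream in-order delivery (hypothetical, not a real SCTP API)."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # next sequence number expected on each stream
        self.pending = defaultdict(dict)   # stream -> {seq: message} held back for ordering
        self.delivered = []                # messages handed to the application, in order

    def arrive(self, stream, seq, message):
        """Record an arriving message and deliver everything that is now in order."""
        self.pending[stream][seq] = message
        # Deliver consecutive messages on this stream only; other streams are unaffected.
        while self.next_seq[stream] in self.pending[stream]:
            self.delivered.append(self.pending[stream].pop(self.next_seq[stream]))
            self.next_seq[stream] += 1


# Assumed layout: A, C, D travel on stream 0 (seq 0, 1, 2); B travels on stream 1.
receiver = StreamReceiver()
receiver.arrive(0, 1, "C")   # C arrives early and waits for A
receiver.arrive(1, 0, "B")   # B is on another stream, so it is delivered at once
receiver.arrive(0, 2, "D")   # D waits behind C
receiver.arrive(0, 0, "A")   # A arrives; A, C, D are released in order
print(receiver.delivered)
```

In this model B is never held up by the missing A. Had all four messages shared one ordered stream, with B sent after A, B would have waited as well.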
34 Available SCTP stacks
- BSD / Mac OS X
- LKSCTP: Linux Kernel 2.4.23 and later
- Solaris 10
- HP OpenCall SS7
- OpenSS7
- Other implementations listed on sctp.org for Windows, AIX, VxWorks, etc.
35 Upcoming annual SCTP Interop
- July 30 to Aug 4, 2006, to be held at UBC
- Vendors and implementers test their stacks
  - Performance
  - Interoperability
36 MPI over SCTP
- LAM and MPICH2 are two popular open source implementations of the MPI library.
- We redesigned LAM to use SCTP and take advantage of its additional features.
- Future plans include SCTP support within MPICH2.
37 How can SCTP help MPI?
- A redesign for SCTP thins the MPI middleware's communication component.
- Use of the one-to-many socket style scales well.
- SCTP adds resilience to MPI programs.
  - Avoids unnecessary head-of-line blocking with streams
  - Increased fault tolerance in presence of multihomed hosts
  - Built-in security features
  - Improved congestion control
Full Results Presented _at_
39 Good of IP-based MPI Middleware
- Ubiquitous
  - it's EVERYWHERE
- Cheap
  - popularity drives down costs
- Well-known
  - leverage network research
- Portable
  - heterogeneous environments
- Seamlessly connects across networks
  - SMP, cluster, LAN, WAN
40 Bad of IP-based MPI Middleware
- Control-driven, Event/Interrupt Mismatch
- NIC/OS interrupt driven
- User-space usually control-driven
- Flow control
- Stuck with transport level flow control
- Must multiplex incoming message flows
- How to handle unexpected messages?
- Excess system calls
- Context switch for crossing kernel boundary
41 Ugly of MPI Middleware
- Generalizing the parallel environment
- Trade-offs with portability and performance
- Byzantine agreement
- Has a remote process died or is it just busy?
- Parallel debugging across a network
43 MPI Applications
44 Forget the Grid? Let's just use MPI
- Can utilize heterogeneous resources and networks by focusing on IP-based protocols (Grid-lite).
- Result: Need to design applications to be more flexible in high-latency/high-loss environments.
45 Latency-Tolerant Applications
- Processor Farm Applications
- mpiBLAST
- Parallel workflow environment
- Computational Finance
- Gene expression network analysis
46 mpiBLAST
- MPI version of the popular BLAST bioinformatics search tool
- Conforms to parallel farm model
47 Modifying mpiBLAST for WAN (1)
- Progress multiple independent tasks at once
- Buffer separate state in case of message loss
- Each task has its own tag (i.e. SCTP stream)
- Batch initial work REQuest messages
48 Modifying mpiBLAST for WAN (2)
- Avoid synchrony (worsened with latency)
- Additional use of asynchronous MPI calls
- Use a message size small enough for eager send within the library implementation (i.e. no rendezvous)
- Have application protocol do less handshaking
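Why eager sends matter on a WAN can be seen with back-of-the-envelope arithmetic. The sketch below assumes a simplified cost model that is not taken from the slides: an eager send costs one one-way network trip, while a rendezvous send costs three (a request, a reply, then the data). The function name and the 50 ms one-way latency are hypothetical, and the transfer time of the payload itself is ignored.

```python
def delivery_time_ms(one_way_latency_ms, protocol):
    """Time until the payload reaches the receiver under the simplified model."""
    trips = {"eager": 1, "rendezvous": 3}   # one-way network trips before data arrives
    return trips[protocol] * one_way_latency_ms


wan_latency_ms = 50   # hypothetical one-way WAN latency
eager = delivery_time_ms(wan_latency_ms, "eager")
rendezvous = delivery_time_ms(wan_latency_ms, "rendezvous")
print(eager, rendezvous)
```

Under these assumptions the handshake triples the delivery time, and the gap grows linearly with latency.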
50 The Problem
X
last mile
51 The Problem
last inch
52 Memory copying
Zero-copy and RDMA?
53 Copying - Don't Do It!
Hennessy and Patterson, 1996
54 Protocol processing
Rule of thumb: 1 Hz of processing for 1 bps of throughput.
May be 5-6 times more for smaller messages, and does not seem to be scaling well as processor speeds increase.
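As a worked example of the rule of thumb, the sketch below estimates the clock rate consumed by protocol processing at a given line rate. The function name and the 1 Gbps link are illustrative assumptions; the factor of 5 is the lower end of the 5-6 times range quoted above for smaller messages.

```python
def cpu_hz_needed(link_bps, small_message_factor=1):
    """Apply the 1 Hz per 1 bps rule of thumb, optionally scaled for small messages."""
    return link_bps * small_message_factor


gigabit = 1_000_000_000                  # 1 Gbps link (illustrative)
print(cpu_hz_needed(gigabit) / 1e9)      # GHz of CPU for large messages
print(cpu_hz_needed(gigabit, 5) / 1e9)   # GHz of CPU at 5 times the cost for small messages
```

By this estimate a saturated 1 Gbps link needs roughly 1 GHz of processing for large messages and around 5 GHz for small ones.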
55 Where to do the processing?
- On-chip
- Separate processor core in the chip
- Kernel / User space
- On the NIC (TOE, TOERDMA)
56 iWARP
- IETF initiative to support zero-copy and TCP off-load
- A richer interface (like zero-copy, RDMA)
- Maintains compatibility with existing TCP/IP
- In 2005, 42.4% of TOP500 machines used Ethernet, with most using regular Gigabit Ethernet adaptors.
57 Other solutions
- InfiniBand
  - Specialized
  - Designed from the start to support RDMA
- Level 5
  - User-space memory mapped, Ethernet to NIC
  - Provides protection on the board
- Trying to speed up and integrate the I/O onto the memory bus or a faster interface
58 RDMA-TCP-SCTP and NIC
SCTP better suited
- message based
- does framing
- multistreaming
- multihoming
59 Convergence?
- Everything over IP
- IP/IB: 10 Gb InfiniBand only .7x or 2x that of standard 1 Gb Ethernet (Egenera white paper)
- Latency difference: 7 microseconds (InfiniBand) versus 65 microseconds (regular Gigabit Ethernet) (dual 3 GHz Xeons)
- Level 5: sub-10 microsecond (user-level stacks)
60 Laptop clustering event
- Live-linux clustering party
- Bring your laptop, with the recommended live-linux distribution
- We will provide the application
- Hoping for April 11th.
61 Thank you!
- More information about our work is at
- http://www.cs.ubc.ca/labs/dsg/mpi-sctp/
Or Google "sctp mpi"
62 Extra slides
63 MPI Point-to-Point
MPI_Send(msg,cnt,type,dst-rank,tag,context)
MPI_Recv(msg,cnt,type,src-rank,tag,context)
- Message matching is done based on Tag, Rank and Context (TRC).
- Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered.
- Use of wildcards for receive
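The matching rule can be sketched as a small model. This is a toy illustration, not an MPI library: `ANY` stands in for the wildcard values (MPI_ANY_SOURCE and MPI_ANY_TAG in real MPI), and the `matches` and `receive` helpers are hypothetical names.

```python
ANY = object()   # stand-in for the MPI wildcards; the context can never be a wildcard


def matches(posted, message):
    """Return True if a posted receive (tag, rank, context) accepts a message's TRC."""
    tag, rank, context = posted
    m_tag, m_rank, m_context = message
    return ((tag is ANY or tag == m_tag)
            and (rank is ANY or rank == m_rank)
            and context == m_context)


def receive(posted, queue):
    """Remove and return the earliest queued payload that matches the posted receive."""
    for i, (trc, _payload) in enumerate(queue):
        if matches(posted, trc):
            return queue.pop(i)[1]
    return None


# Arrived messages as ((tag, rank, context), payload), oldest first.
queue = [((7, 1, 0), "x"), ((9, 2, 0), "y"), ((7, 2, 0), "z")]
first = receive((7, ANY, 0), queue)    # tag 7 from any sender
second = receive((ANY, 2, 0), queue)   # any tag from rank 2
print(first, second)
```

Each receive takes the oldest message that matches, which is why delivery order between two processes matters for messages with the same tag and context.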
64 MPI Messages Using Same Context, Two Processes
Out-of-order messages with the same tags violate MPI semantics
66 Using SCTP for MPI
- TRC-to-stream map matches MPI semantics
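One way to picture such a map is sketched below. It is an assumption for illustration, not the mapping used in the LAM redesign: the function name, the modulo scheme, and the pool of 10 streams are hypothetical. Rank is left out on the further assumption that each pair of processes already has its own association. The property that matters is that messages with the same tag and context always land on the same stream and so stay ordered, while messages with different tags may use different streams.

```python
NUM_STREAMS = 10   # hypothetical number of streams per association


def stream_for(tag, context, num_streams=NUM_STREAMS):
    """Map a (tag, context) pair to a stream number deterministically."""
    return (context * 31 + tag) % num_streams


same_a = stream_for(tag=3, context=0)
same_b = stream_for(tag=3, context=0)
other = stream_for(tag=4, context=0)
print(same_a, same_b, other)
```

Because the pool of streams is finite, distinct tags can still collide on one stream in this sketch. That costs some concurrency but never breaks ordering.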