Title: Optimizing Threaded MPI Execution on SMP Clusters
1. Optimizing Threaded MPI Execution on SMP Clusters
- Hong Tang and Tao Yang
- Department of Computer Science
- University of California, Santa Barbara
2. Parallel Computation on SMP Clusters
- Massively Parallel Machines → SMP Clusters
- Commodity Components: Off-the-shelf Processors, Fast Network (Myrinet, Fast/Gigabit Ethernet)
- Parallel Programming Model for SMP Clusters
- MPI: Portability, Performance, Legacy Programs
- MPI Variations: MPI+Multithreading, MPI+OpenMP
3. Threaded MPI Execution
- MPI Paradigm: Separated Address Spaces for Different MPI Nodes
- Natural Solution: MPI Nodes → Processes
- What if we map MPI nodes to threads?
- Faster synchronization among MPI nodes running on the same machine.
- Demonstrated in previous work (PPoPP '99) for a single shared-memory machine. (Developed techniques to safely execute MPI programs using threads.)
- Threaded MPI Execution on SMP Clusters (see the sketch below)
- Intra-Machine Comm. through Shared Memory
- Inter-Machine Comm. through Network
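A minimal sketch of this execution model, assuming POSIX threads; the mailbox type and helper names below are illustrative, not TMPI's actual implementation. Each MPI node runs as a thread in one process per machine, so an intra-machine send reduces to a pointer handoff through an in-process mailbox.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_NODES 4   /* MPI nodes mapped to threads on this machine */

/* One mailbox per MPI node; an intra-machine send is just an enqueue. */
typedef struct { const void *buf; int len; int src; } msg_t;
typedef struct {
    msg_t slot; int full;
    pthread_mutex_t mu; pthread_cond_t cv;
} mailbox_t;

static mailbox_t mbox[NUM_NODES];

static void intra_send(int dst, int src, const void *buf, int len) {
    pthread_mutex_lock(&mbox[dst].mu);
    while (mbox[dst].full) pthread_cond_wait(&mbox[dst].cv, &mbox[dst].mu);
    mbox[dst].slot = (msg_t){ buf, len, src };  /* pointer handoff, no copy */
    mbox[dst].full = 1;
    pthread_cond_broadcast(&mbox[dst].cv);
    pthread_mutex_unlock(&mbox[dst].mu);
}

static msg_t intra_recv(int me) {
    pthread_mutex_lock(&mbox[me].mu);
    while (!mbox[me].full) pthread_cond_wait(&mbox[me].cv, &mbox[me].mu);
    msg_t m = mbox[me].slot;
    mbox[me].full = 0;
    pthread_cond_broadcast(&mbox[me].cv);
    pthread_mutex_unlock(&mbox[me].mu);
    return m;
}

/* Each "MPI node" is a thread sharing one address space. */
static void *mpi_node(void *arg) {
    int rank = (int)(long)arg;
    if (rank == 0) intra_send(1, 0, "hello", 6);
    if (rank == 1) {
        msg_t m = intra_recv(1);
        printf("rank 1 got \"%s\" from rank %d\n", (const char *)m.buf, m.src);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NUM_NODES];
    for (int i = 0; i < NUM_NODES; i++) {
        pthread_mutex_init(&mbox[i].mu, NULL);
        pthread_cond_init(&mbox[i].cv, NULL);
    }
    for (int i = 0; i < NUM_NODES; i++)
        pthread_create(&t[i], NULL, mpi_node, (void *)(long)i);
    for (int i = 0; i < NUM_NODES; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Inter-machine traffic would still go over the network; the point of the sketch is that same-machine traffic never leaves the process.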
4. Threaded MPI Execution Benefits Inter-Machine Communication
- Common Intuition: Inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
- Our Findings: Using threads can significantly reduce the buffering and orchestration overhead for inter-machine communication.
5. Related Work
- MPI on Network Clusters
- MPICH: a portable MPI implementation.
- LAM/MPI: communication through a standalone RPI server.
- Collective Communication Optimization
- SUN-MPI and MPI-StarT: modify the MPICH ADI layer; target SMP clusters.
- MagPIe: targets SMP clusters connected through a WAN.
- Lower Communication Layer Optimization
- MPI-FM and MPI-AM.
- Threaded Execution of Message Passing Programs
- MPI-Lite, LPVM, TPVM.
6. Background: MPICH Design
7. MPICH Communication Structure
[Figure: MPICH communication structure with shared memory]
8. TMPI Communication Structure
9. Comparison of TMPI and MPICH
- Drawbacks of MPICH w/ Shared Memory
- Intra-node communication limited by shared memory size.
- Busy polling to check messages from either the daemon or a local peer.
- Cannot do automatic resource clean-up.
- Drawbacks of MPICH w/o Shared Memory
- Big overhead for intra-node communication.
- Too many daemon processes and open connections.
- Drawbacks of Both MPICH Systems
- Extra data copying for inter-machine communication.
10. TMPI Communication Design
11. Separation of Point-to-Point and Collective Communication Channels
- Observation: MPI point-to-point communication and collective communication have different semantics (see the table and sketch below).
- Separate channels for point-to-point and collective communication.
- Eliminates daemon intervention for collective communication.
- Less effective for MPICH: no sharing of ports among processes.
Point-to-point                       | Collective
Unknown source (MPI_ANY_SOURCE)      | Determined source (ancestor in the spanning tree)
Out-of-order (message tag)           | In-order delivery
Asynchronous (non-blocking receive)  | Synchronous
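These semantic differences are what make separate channels attractive. Below is a minimal sketch of the two channel shapes, assuming POSIX threads; the struct names and fields are illustrative, not TMPI's actual types.

```c
#include <pthread.h>

/* Point-to-point channel: the source may be MPI_ANY_SOURCE, messages are
 * matched by tag and may be consumed out of order, and receives can be
 * non-blocking, so the channel is a searchable queue per MPI node.     */
typedef struct p2p_msg {
    int src, tag;
    void *data; int len;
    struct p2p_msg *next;
} p2p_msg;

typedef struct {
    p2p_msg *head, *tail;        /* searched by (src, tag) on receive */
    pthread_mutex_t mu;
    pthread_cond_t  cv;
} p2p_channel;

/* Collective channel: the sender is always known (the ancestor in the
 * spanning tree), delivery is in order and synchronous, so one sequenced
 * slot per communicator suffices and no daemon needs to intervene.     */
typedef struct {
    void *data; int len;
    int seq;                     /* expected collective-operation number */
    pthread_mutex_t mu;
    pthread_cond_t  cv;
} coll_channel;
```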
12. Hierarchy-Aware Collective Communication
- Observation: two-level communication hierarchy.
- Inside an SMP node: shared memory (10^-8 sec)
- Between SMP nodes: network (10^-6 sec)
- Idea: build the communication spanning tree in two steps (a sketch follows this list).
- First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
- Second, all other MPI nodes connect to the local root node.
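A minimal sketch of the two-step construction, assuming consecutive MPI ranks share an SMP node and a binary tree among the local roots; the layout and function name are illustrative assumptions, not TMPI's actual code.

```c
#include <stdio.h>

/* Returns the parent of `rank` in the two-level spanning tree,
 * or -1 for the global root.                                   */
int tree_parent(int rank, int ranks_per_smp) {
    int smp      = rank / ranks_per_smp;   /* which cluster node  */
    int local_id = rank % ranks_per_smp;   /* position inside it  */
    int root     = smp * ranks_per_smp;    /* local root MPI node */

    if (local_id != 0)
        return root;                       /* step 2: attach to the local root */
    if (smp == 0)
        return -1;                         /* global root of the whole tree    */
    /* Step 1: a binary tree among the local roots of all cluster nodes. */
    return ((smp - 1) / 2) * ranks_per_smp;
}

int main(void) {
    /* 4 MPI nodes per SMP, 4 SMPs: ranks 0, 4, 8, 12 are local roots. */
    for (int r = 0; r < 16; r++)
        printf("rank %2d -> parent %2d\n", r, tree_parent(r, 4));
    return 0;
}
```

Intra-node edges then cost shared-memory latency, while only the edges among local roots cross the network.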
13. Adaptive Buffer Management
- Question: how do we manage temporary buffering of message data when the remote receiver is not ready to accept it?
- Choices:
- Send the data with the request (eager push).
- Send the request only, and send the data when the receiver is ready (three-phase protocol).
- TMPI adapts between the two methods (a sketch of the decision follows).
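A minimal sketch of the adaptive choice; the size cutoff and the notion of known receiver buffer space are illustrative assumptions, not TMPI's actual policy.

```c
#include <stdio.h>

#define EAGER_LIMIT 16384            /* assumed cutoff, in bytes */

typedef enum { EAGER_PUSH, THREE_PHASE } protocol_t;

/* Small messages that fit in the receiver's spare buffer space are pushed
 * eagerly together with the request (the receiver buffers them if it has
 * not posted a matching receive yet). Otherwise only the request is sent,
 * and the data follows once the receiver signals it is ready.            */
protocol_t choose_protocol(int msg_len, int receiver_free_bytes) {
    if (msg_len <= EAGER_LIMIT && msg_len <= receiver_free_bytes)
        return EAGER_PUSH;
    return THREE_PHASE;
}

int main(void) {
    printf("%s\n", choose_protocol(1024,    1 << 20) == EAGER_PUSH ? "eager" : "three-phase");
    printf("%s\n", choose_protocol(1 << 20, 1 << 20) == EAGER_PUSH ? "eager" : "three-phase");
    return 0;
}
```

Eager push saves a round trip for small messages; the three-phase protocol avoids large temporary buffers for big ones, which is why adapting between them pays off.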
14. Experimental Study
- Goal: illustrate the advantage of threaded MPI execution on SMP clusters.
- Hardware Setting
- A cluster of 6 quad-Xeon 500 MHz SMPs, with 1 GB main memory and 2 Fast Ethernet cards per machine.
- Software Setting
- OS: RedHat Linux 6.0, kernel version 2.2.15, with channel bonding enabled.
- Process-based MPI System: MPICH 1.2
- Thread-based MPI System: TMPI (45 functions in the MPI 1.1 standard)
15. Inter-Cluster-Node Point-to-Point
- Ping-pong, TMPI vs MPICH w/ shared memory
[Figure (a): Ping-Pong Short Message, round trip time (ms) vs message size (bytes), TMPI vs MPICH]
[Figure (b): Ping-Pong Long Message, transfer rate (MB/s) vs message size (KB), TMPI vs MPICH]
16. Intra-Cluster-Node Point-to-Point
- Ping-pong, TMPI vs MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory)
[Figure (a): Ping-Pong Short Message, round trip time (ms) vs message size (bytes), TMPI vs MPICH1 vs MPICH2]
[Figure (b): Ping-Pong Long Message, transfer rate (MB/s) vs message size (KB), TMPI vs MPICH1 vs MPICH2]
17. Collective Communication
- Reduce, Bcast, Allreduce.
- Each table cell gives TMPI / MPICH_SHM / MPICH_NOSHM times in microseconds.
- Three node distributions, three root-node settings.
(µs)  root    Reduce          Bcast           Allreduce
4x1   same    9/121/4384      10/137/7913     160/175/627
4x1   rotate  33/81/3699      129/91/4238     160/175/627
4x1   combo   25/102/3436     17/32/966       160/175/627
1x4   same    28/1999/1844    21/1610/1551    571/675/775
1x4   rotate  146/1944/1878   164/1774/1834   571/675/775
1x4   combo   167/1977/1854   43/409/392      571/675/775
4x4   same    39/2532/4809    56/2792/10246   736/1412/19914
4x4   rotate  161/1718/8566   216/2204/8036   736/1412/19914
4x4   combo   141/2242/8515   62/489/2054     736/1412/19914
18. Macro-Benchmark Performance
19. Conclusions
- Great Advantage of Threaded MPI Execution on SMP Clusters
- Micro-benchmark: 70 times faster than MPICH.
- Macro-benchmark: 100% faster than MPICH.
- Optimization Techniques
- Separated Collective and Point-to-Point Communication Channels
- Adaptive Buffer Management
- Hierarchy-Aware Communications
http://www.cs.ucsb.edu/projects/tmpi/
20. Background: Safe Execution of MPI Programs Using Threads
- Program Transformation: eliminate global and static variables (called permanent variables).
- Thread-Specific Data (TSD)
- Each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copies of the data variable.
- TSD-based Transformation: each permanent variable declaration is replaced with a key declaration. Each node associates its private copy of the permanent variable with the corresponding key. Where global variables are referenced, the global keys are used to retrieve the per-thread copies of the variables.
21. Program Transformation: An Example
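The original slide shows the transformation on a concrete code fragment. Below is a minimal sketch of what a TSD-based transformation can look like, assuming POSIX thread-specific data; the permanent variable `counter` and the helper `counter_get` are illustrative, not taken from TMPI.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Before the transformation: a permanent (global) variable shared by all
 * MPI nodes once they become threads.
 *     int counter = 0;
 *     void bump(void) { counter++; }
 *
 * After: the declaration becomes a key, each node (thread) binds its own
 * copy to the key, and every reference goes through the key.            */

static pthread_key_t  counter_key;                /* replaces "int counter" */
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_make_key(void) {
    pthread_key_create(&counter_key, free);       /* per-thread cleanup */
}

/* Each MPI node lazily associates its private copy with the key. */
static int *counter_get(void) {
    pthread_once(&counter_once, counter_make_key);
    int *p = pthread_getspecific(counter_key);
    if (p == NULL) {
        p = calloc(1, sizeof *p);                 /* this node's own counter */
        pthread_setspecific(counter_key, p);
    }
    return p;
}

/* References to the permanent variable are rewritten to use the key. */
static void bump(void) {
    (*counter_get())++;
}

static void *node(void *arg) {
    (void)arg;
    bump(); bump();
    printf("this node's counter = %d\n", *counter_get());  /* prints 2 */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, node, NULL);
    pthread_create(&b, NULL, node, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;   /* each thread keeps an independent copy of "counter" */
}
```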