Title: Optimizing Threaded MPI Execution on SMP Clusters
1. Optimizing Threaded MPI Execution on SMP Clusters
- Hong Tang and Tao Yang
- Department of Computer Science
- University of California, Santa Barbara
2. Parallel Computation on SMP Clusters
- Massively Parallel Machines → SMP Clusters
- Commodity Components: Off-the-shelf Processors, Fast Network (Myrinet, Fast/Gigabit Ethernet)
- Parallel Programming Model for SMP Clusters
- MPI: Portability, Performance, Legacy Programs
- MPI Variations: MPI+Multithreading, MPI+OpenMP
3. Threaded MPI Execution
- MPI Paradigm: Separated Address Spaces for Different MPI Nodes
- Natural Solution: MPI Nodes → Processes
- What if we map MPI nodes to threads?
- Faster synchronization among MPI nodes running on the same machine.
- Demonstrated in previous work (PPoPP '99) for a single shared-memory machine. (Developed techniques to safely execute MPI programs using threads.)
- Threaded MPI Execution on SMP Clusters (see the sketch below)
- Intra-Machine Comm. through Shared Memory
- Inter-Machine Comm. through Network
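A minimal sketch of this execution model, assuming POSIX threads; the mailbox type and helper names below are illustrative, not TMPI's actual implementation. Each MPI node runs as a thread in one process per machine, so an intra-machine send reduces to a pointer handoff through an in-process mailbox.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_NODES 4   /* MPI nodes mapped to threads on this machine */

/* One mailbox per MPI node; an intra-machine send is just an enqueue. */
typedef struct { const void *buf; int len; int src; } msg_t;
typedef struct {
    msg_t slot; int full;
    pthread_mutex_t mu; pthread_cond_t cv;
} mailbox_t;

static mailbox_t mbox[NUM_NODES];

static void intra_send(int dst, int src, const void *buf, int len) {
    pthread_mutex_lock(&mbox[dst].mu);
    while (mbox[dst].full) pthread_cond_wait(&mbox[dst].cv, &mbox[dst].mu);
    mbox[dst].slot = (msg_t){ buf, len, src };  /* pointer handoff, no copy */
    mbox[dst].full = 1;
    pthread_cond_broadcast(&mbox[dst].cv);
    pthread_mutex_unlock(&mbox[dst].mu);
}

static msg_t intra_recv(int me) {
    pthread_mutex_lock(&mbox[me].mu);
    while (!mbox[me].full) pthread_cond_wait(&mbox[me].cv, &mbox[me].mu);
    msg_t m = mbox[me].slot;
    mbox[me].full = 0;
    pthread_cond_broadcast(&mbox[me].cv);
    pthread_mutex_unlock(&mbox[me].mu);
    return m;
}

/* Each "MPI node" is a thread sharing one address space. */
static void *mpi_node(void *arg) {
    int rank = (int)(long)arg;
    if (rank == 0) intra_send(1, 0, "hello", 6);
    if (rank == 1) {
        msg_t m = intra_recv(1);
        printf("rank 1 got \"%s\" from rank %d\n", (const char *)m.buf, m.src);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NUM_NODES];
    for (int i = 0; i < NUM_NODES; i++) {
        pthread_mutex_init(&mbox[i].mu, NULL);
        pthread_cond_init(&mbox[i].cv, NULL);
    }
    for (int i = 0; i < NUM_NODES; i++)
        pthread_create(&t[i], NULL, mpi_node, (void *)(long)i);
    for (int i = 0; i < NUM_NODES; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Inter-machine traffic would still go over the network; the point of the sketch is that same-machine traffic never leaves the process.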
4. Threaded MPI Execution Benefits Inter-Machine Communication
- Common Intuition: Inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
- Our Findings: Using threads can significantly reduce the buffering and orchestration overhead for inter-machine communication.
5. Related Work
- MPI on Network Clusters
- MPICH: a portable MPI implementation.
- LAM/MPI: communication through a standalone RPI server.
- Collective Communication Optimization
- SUN-MPI and MPI-StarT: modify the MPICH ADI layer; target SMP clusters.
- MagPIe: targets SMP clusters connected through a WAN.
- Lower Communication Layer Optimization
- MPI-FM and MPI-AM.
- Threaded Execution of Message Passing Programs
- MPI-Lite, LPVM, TPVM.
6. Background: MPICH Design
7. MPICH Communication Structure
[Figure: MPICH communication structure with shared memory]
8. TMPI Communication Structure
9. Comparison of TMPI and MPICH
- Drawbacks of MPICH w/ Shared Memory
- Intra-node communication limited by shared memory size.
- Busy polling to check messages from either the daemon or a local peer.
- Cannot do automatic resource clean-up.
- Drawbacks of MPICH w/o Shared Memory
- Big overhead for intra-node communication.
- Too many daemon processes and open connections.
- Drawbacks of Both MPICH Systems
- Extra data copying for inter-machine communication.
10. TMPI Communication Design
11. Separation of Point-to-Point and Collective Communication Channels
- Observation: MPI point-to-point communication and collective communication have different semantics (see the table and sketch below).
- Separate channels for point-to-point and collective communication.
- Eliminates daemon intervention for collective communication.
- Less effective for MPICH: no sharing of ports among processes.
Point-to-point                       | Collective
Unknown source (MPI_ANY_SOURCE)      | Determined source (ancestor in the spanning tree)
Out-of-order (message tag)           | In-order delivery
Asynchronous (non-blocking receive)  | Synchronous
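These semantic differences are what make separate channels attractive. Below is a minimal sketch of the two channel shapes, assuming POSIX threads; the struct names and fields are illustrative, not TMPI's actual types.

```c
#include <pthread.h>

/* Point-to-point channel: the source may be MPI_ANY_SOURCE, messages are
 * matched by tag and may be consumed out of order, and receives can be
 * non-blocking, so the channel is a searchable queue per MPI node.     */
typedef struct p2p_msg {
    int src, tag;
    void *data; int len;
    struct p2p_msg *next;
} p2p_msg;

typedef struct {
    p2p_msg *head, *tail;        /* searched by (src, tag) on receive */
    pthread_mutex_t mu;
    pthread_cond_t  cv;
} p2p_channel;

/* Collective channel: the sender is always known (the ancestor in the
 * spanning tree), delivery is in order and synchronous, so one sequenced
 * slot per communicator suffices and no daemon needs to intervene.     */
typedef struct {
    void *data; int len;
    int seq;                     /* expected collective-operation number */
    pthread_mutex_t mu;
    pthread_cond_t  cv;
} coll_channel;
```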
12. Hierarchy-Aware Collective Communication
- Observation: two-level communication hierarchy.
- Inside an SMP node: shared memory (10^-8 sec)
- Between SMP nodes: network (10^-6 sec)
- Idea: build the communication spanning tree in two steps (a sketch follows this list).
- First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
- Second, all other MPI nodes connect to the local root node.
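A minimal sketch of the two-step construction, assuming consecutive MPI ranks share an SMP node and a binary tree among the local roots; the layout and function name are illustrative assumptions, not TMPI's actual code.

```c
#include <stdio.h>

/* Returns the parent of `rank` in the two-level spanning tree,
 * or -1 for the global root.                                   */
int tree_parent(int rank, int ranks_per_smp) {
    int smp      = rank / ranks_per_smp;   /* which cluster node  */
    int local_id = rank % ranks_per_smp;   /* position inside it  */
    int root     = smp * ranks_per_smp;    /* local root MPI node */

    if (local_id != 0)
        return root;                       /* step 2: attach to the local root */
    if (smp == 0)
        return -1;                         /* global root of the whole tree    */
    /* Step 1: a binary tree among the local roots of all cluster nodes. */
    return ((smp - 1) / 2) * ranks_per_smp;
}

int main(void) {
    /* 4 MPI nodes per SMP, 4 SMPs: ranks 0, 4, 8, 12 are local roots. */
    for (int r = 0; r < 16; r++)
        printf("rank %2d -> parent %2d\n", r, tree_parent(r, 4));
    return 0;
}
```

Intra-node edges then cost shared-memory latency, while only the edges among local roots cross the network.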
13. Adaptive Buffer Management
- Question: how do we manage temporary buffering of message data when the remote receiver is not ready to accept it?
- Choices:
- Send the data with the request (eager push).
- Send the request only, and send the data when the receiver is ready (three-phase protocol).
- TMPI adapts between the two methods (a sketch of the decision follows).
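A minimal sketch of the adaptive choice; the size cutoff and the notion of known receiver buffer space are illustrative assumptions, not TMPI's actual policy.

```c
#include <stdio.h>

#define EAGER_LIMIT 16384            /* assumed cutoff, in bytes */

typedef enum { EAGER_PUSH, THREE_PHASE } protocol_t;

/* Small messages that fit in the receiver's spare buffer space are pushed
 * eagerly together with the request (the receiver buffers them if it has
 * not posted a matching receive yet). Otherwise only the request is sent,
 * and the data follows once the receiver signals it is ready.            */
protocol_t choose_protocol(int msg_len, int receiver_free_bytes) {
    if (msg_len <= EAGER_LIMIT && msg_len <= receiver_free_bytes)
        return EAGER_PUSH;
    return THREE_PHASE;
}

int main(void) {
    printf("%s\n", choose_protocol(1024,    1 << 20) == EAGER_PUSH ? "eager" : "three-phase");
    printf("%s\n", choose_protocol(1 << 20, 1 << 20) == EAGER_PUSH ? "eager" : "three-phase");
    return 0;
}
```

Eager push saves a round trip for small messages; the three-phase protocol avoids large temporary buffers for big ones, which is why adapting between them pays off.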
14. Experimental Study
- Goal: illustrate the advantage of threaded MPI execution on SMP clusters.
- Hardware Setting
- A cluster of 6 quad-Xeon 500 MHz SMPs, with 1 GB main memory and 2 Fast Ethernet cards per machine.
- Software Setting
- OS: RedHat Linux 6.0, kernel version 2.2.15, with channel bonding enabled.
- Process-based MPI System: MPICH 1.2
- Thread-based MPI System: TMPI (45 functions in the MPI 1.1 standard)
15. Inter-Cluster-Node Point-to-Point
- Ping-pong, TMPI vs MPICH w/ shared memory
[Figure (a): Ping-Pong Short Message, round trip time (ms) vs message size (bytes), TMPI vs MPICH]
[Figure (b): Ping-Pong Long Message, transfer rate (MB/s) vs message size (KB), TMPI vs MPICH]
16. Intra-Cluster-Node Point-to-Point
- Ping-pong, TMPI vs MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory)
[Figure (a): Ping-Pong Short Message, round trip time (ms) vs message size (bytes), TMPI vs MPICH1 vs MPICH2]
[Figure (b): Ping-Pong Long Message, transfer rate (MB/s) vs message size (KB), TMPI vs MPICH1 vs MPICH2]
17. Collective Communication
- Reduce, Bcast, Allreduce.
- Each table cell gives TMPI / MPICH_SHM / MPICH_NOSHM times in microseconds.
- Three node distributions, three root-node settings.
(µs)  root    Reduce          Bcast           Allreduce
4x1   same    9/121/4384      10/137/7913     160/175/627
4x1   rotate  33/81/3699      129/91/4238     160/175/627
4x1   combo   25/102/3436     17/32/966       160/175/627
1x4   same    28/1999/1844    21/1610/1551    571/675/775
1x4   rotate  146/1944/1878   164/1774/1834   571/675/775
1x4   combo   167/1977/1854   43/409/392      571/675/775
4x4   same    39/2532/4809    56/2792/10246   736/1412/19914
4x4   rotate  161/1718/8566   216/2204/8036   736/1412/19914
4x4   combo   141/2242/8515   62/489/2054     736/1412/19914
18. Macro-Benchmark Performance
19. Conclusions
- Great Advantage of Threaded MPI Execution on SMP Clusters
- Micro-benchmark: 70 times faster than MPICH.
- Macro-benchmark: 100% faster than MPICH.
- Optimization Techniques
- Separated Collective and Point-to-Point Communication Channels
- Adaptive Buffer Management
- Hierarchy-Aware Communications
http://www.cs.ucsb.edu/projects/tmpi/
20. Background: Safe Execution of MPI Programs Using Threads
- Program Transformation: eliminate global and static variables (called permanent variables).
- Thread-Specific Data (TSD)
- Each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copies of the data variable.
- TSD-based Transformation: each permanent variable declaration is replaced with a key declaration. Each node associates its private copy of the permanent variable with the corresponding key. Where global variables are referenced, the global keys are used to retrieve the per-thread copies of the variables.
21. Program Transformation: An Example
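The original slide shows the transformation on a concrete code fragment. Below is a minimal sketch of what a TSD-based transformation can look like, assuming POSIX thread-specific data; the permanent variable `counter` and the helper `counter_get` are illustrative, not taken from TMPI.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Before the transformation: a permanent (global) variable shared by all
 * MPI nodes once they become threads.
 *     int counter = 0;
 *     void bump(void) { counter++; }
 *
 * After: the declaration becomes a key, each node (thread) binds its own
 * copy to the key, and every reference goes through the key.            */

static pthread_key_t  counter_key;                /* replaces "int counter" */
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_make_key(void) {
    pthread_key_create(&counter_key, free);       /* per-thread cleanup */
}

/* Each MPI node lazily associates its private copy with the key. */
static int *counter_get(void) {
    pthread_once(&counter_once, counter_make_key);
    int *p = pthread_getspecific(counter_key);
    if (p == NULL) {
        p = calloc(1, sizeof *p);                 /* this node's own counter */
        pthread_setspecific(counter_key, p);
    }
    return p;
}

/* References to the permanent variable are rewritten to use the key. */
static void bump(void) {
    (*counter_get())++;
}

static void *node(void *arg) {
    (void)arg;
    bump(); bump();
    printf("this node's counter = %d\n", *counter_get());  /* prints 2 */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, node, NULL);
    pthread_create(&b, NULL, node, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;   /* each thread keeps an independent copy of "counter" */
}
```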