
Transcript and Presenter's Notes

Title: Optimizing Threaded MPI Execution on SMP Clusters


1
Optimizing Threaded MPI Execution on SMP Clusters
  • Hong Tang and Tao Yang
  • Department of Computer Science
  • University of California, Santa Barbara

2
Parallel Computation on SMP Clusters
  • Massively Parallel Machines → SMP Clusters
  • Commodity Components: Off-the-shelf Processors +
    Fast Networks (Myrinet, Fast/Gigabit Ethernet)
  • Parallel Programming Model for SMP Clusters
  • MPI: Portability, Performance, Legacy Programs
  • MPI Variations: MPI+Multithreading, MPI+OpenMP

3
Threaded MPI Execution
  • MPI Paradigm: Separate Address Spaces for
    Different MPI Nodes
  • Natural Solution: MPI Nodes → Processes
  • What if we map MPI nodes to threads?
  • Faster synchronization among MPI nodes running on
    the same machine.
  • Demonstrated in previous work (PPoPP '99) for a
    single shared-memory machine. (Developed
    techniques to safely execute MPI programs using
    threads.)
  • Threaded MPI Execution on SMP Clusters (sketched
    after this list):
  • Intra-Machine Comm. through Shared Memory
  • Inter-Machine Comm. through Network
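
To illustrate the thread mapping, here is a minimal sketch of running several MPI nodes as threads inside one process on a cluster node. The names mpi_node_main and NODES_PER_MACHINE are hypothetical; this is not TMPI's actual runtime, just the idea it builds on.

/* Hypothetical sketch: run N "MPI nodes" as threads in one process. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NODES_PER_MACHINE 4   /* assumed number of MPI nodes per machine */

/* The user's (transformed) MPI main, one instance per thread. */
static void *mpi_node_main(void *arg)
{
    int rank = (int)(long)arg;
    printf("MPI node %d running as a thread\n", rank);
    /* ... user computation; intra-machine messages travel through
     * shared memory, inter-machine messages over the network ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NODES_PER_MACHINE];

    for (int i = 0; i < NODES_PER_MACHINE; i++)
        if (pthread_create(&tid[i], NULL, mpi_node_main, (void *)(long)i))
            exit(1);
    for (int i = 0; i < NODES_PER_MACHINE; i++)
        pthread_join(tid[i], NULL);
    return 0;
}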

4
Threaded MPI Execution Benefits Inter-Machine
Communication
  • Common Intuition: Inter-machine communication cost
    is dominated by network delay, so the advantage of
    executing MPI nodes as threads diminishes.
  • Our Findings: Using threads can significantly
    reduce the buffering and orchestration overhead of
    inter-machine communication.

5
Related Work
  • MPI on Network Clusters
  • MPICH: a portable MPI implementation.
  • LAM/MPI: communication through a standalone RPI
    server.
  • Collective Communication Optimization
  • SUN-MPI and MPI-StarT: modify the MPICH ADI layer;
    targeted at SMP clusters.
  • MagPIe: targeted at SMP clusters connected through
    a WAN.
  • Lower Communication Layer Optimization
  • MPI-FM and MPI-AM.
  • Threaded Execution of Message Passing Programs
  • MPI-Lite, LPVM, TPVM.

6
Background: MPICH Design
7
MPICH Communication Structure
MPICH with shared memory
8
TMPI Communication Structure
9
Comparison of TMPI and MPICH
  • Drawbacks of MPICH w/ Shared Memory
  • Intra-node communication limited by shared memory
    size.
  • Busy polling to check messages from either daemon
    or local peer.
  • Cannot do automatic resource clean-up.
  • Drawbacks of MPICH w/o Shared Memory
  • High overhead for intra-node communication.
  • Too many daemon processes and open connections.
  • Drawbacks of both MPICH Systems
  • Extra data copying for inter-machine
    communication.

10
TMPI Communication Design
11
Separation of Point-to-Point and Collective
Communication Channels
  • Observation: MPI point-to-point communication and
    collective communication have different semantics
    (contrasted in the table and snippet below).
  • Separate channels for point-to-point and
    collective communication.
  • Eliminate daemon intervention for collective
    communication.
  • Less effective for MPICH: no sharing of ports
    among processes.

Point-to-Point                        Collective
Unknown source (MPI_ANY_SOURCE)       Determined source (ancestor in the spanning tree)
Out-of-order (message tag)            In-order delivery
Asynchronous (non-blocking receive)   Synchronous
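
To make the contrast concrete, here is a small generic MPI example (standard MPI calls, not TMPI internals) showing an any-source, tagged, non-blocking receive versus a fixed-root broadcast:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, val = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: unknown source, tag matching, non-blocking. */
    if (rank == 0 && size > 1) {
        MPI_Request req;
        MPI_Status  st;
        MPI_Irecv(&val, 1, MPI_INT, MPI_ANY_SOURCE, 42,
                  MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &st);          /* st.MPI_SOURCE: actual sender */
    } else if (rank == 1) {
        val = 7;
        MPI_Send(&val, 1, MPI_INT, 0, 42, MPI_COMM_WORLD);
    }

    /* Collective: fixed root, all ranks participate, in-order delivery. */
    MPI_Bcast(&val, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d\n", rank, val);

    MPI_Finalize();
    return 0;
}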
12
Hierarchy-Aware Collective Communication
  • Observation: Two-level communication hierarchy.
  • Inside an SMP node: shared memory (~10^-8 s).
  • Between SMP nodes: network (~10^-6 s).
  • Idea: Build the communication spanning tree in two
    steps (see the sketch after this list).
  • First, choose a root MPI node on each cluster node
    and build a spanning tree among all the cluster
    nodes.
  • Second, all other MPI nodes connect to the local
    root node.
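
As a concrete illustration of the two-step idea, here is a sketch using standard MPI communicator operations. TMPI builds the tree inside its runtime; MPI_Comm_split_type is an MPI-3 call that postdates this work, so this is only a conceptual analogue, assuming the broadcast root is world rank 0.

#include <mpi.h>

/* Conceptual two-level broadcast (assumes the broadcast root is world
 * rank 0).  The communicator splitting groups co-located MPI nodes. */
void two_level_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, roots_comm;
    int world_rank, node_rank;

    MPI_Comm_rank(comm, &world_rank);

    /* Group MPI nodes that share a cluster node (shared memory). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Step 1: the local roots (node rank 0) form an inter-machine
     * communicator and broadcast over the network among themselves. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &roots_comm);
    if (roots_comm != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, roots_comm);
        MPI_Comm_free(&roots_comm);
    }

    /* Step 2: every other MPI node receives from its local root
     * through shared memory. */
    MPI_Bcast(buf, count, type, 0, node_comm);
    MPI_Comm_free(&node_comm);
}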

13
Adaptive Buffer Management
  • Question: How do we manage temporary buffering of
    message data when the remote receiver is not ready
    to accept it?
  • Choices (compared in the sketch after this list):
  • Send the data with the request (eager push).
  • Send the request only, and send the data once the
    receiver is ready (three-phase protocol).
  • TMPI adapts between the two methods.
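
A schematic sketch of what the adaptive choice could look like. The eager-size threshold and the receiver-buffer feedback are assumptions of this sketch, not TMPI's documented policy.

#include <stddef.h>

#define EAGER_LIMIT (16 * 1024)   /* assumed threshold, in bytes */

enum protocol { EAGER_PUSH, THREE_PHASE };

/* Pick a protocol for one outgoing message.  'receiver_buffer_left'
 * would come from flow-control feedback piggybacked on earlier
 * traffic (an assumption of this sketch). */
enum protocol choose_protocol(size_t msg_bytes, size_t receiver_buffer_left)
{
    if (msg_bytes <= EAGER_LIMIT && msg_bytes <= receiver_buffer_left)
        return EAGER_PUSH;    /* send the request and the data together */
    return THREE_PHASE;       /* send the request only; the data follows
                                 after the receiver posts its receive */
}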

14
Experimental Study
  • Goal: Illustrate the advantage of threaded MPI
    execution on SMP clusters.
  • Hardware Setting
  • A cluster of 6 quad-Xeon 500MHz SMPs, each with
    1GB main memory and 2 fast Ethernet cards.
  • Software Setting
  • OS: RedHat Linux 6.0, kernel version 2.2.15 w/
    channel bonding enabled.
  • Process-based MPI system: MPICH 1.2.
  • Thread-based MPI system: TMPI (45 functions of
    the MPI 1.1 standard).

15
Inter-Cluster-Node Point-to-Point
  • Ping-pong, TMPI vs MPICH w/ shared memory (a
    generic ping-pong benchmark sketch follows the
    figure).

[Figure: (a) Ping-Pong Short Message — round-trip time vs. message size (bytes); (b) Ping-Pong Long Message — transfer rate (MB/s) vs. message size (KB); curves for TMPI and MPICH.]
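
For reference, here is a generic ping-pong micro-benchmark of the kind commonly used to produce such curves (not the authors' exact benchmark code); run it with at least two MPI ranks and pass the message size in bytes as an argument.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, size_bytes = argc > 1 ? atoi(argv[1]) : 1024;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size_bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {                       /* ping */
            MPI_Send(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {                /* pong */
            MPI_Recv(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f us round trip\n",
               size_bytes, (t1 - t0) / REPS * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}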
16
Intra-Cluster-Node Point-to-Point
  • Ping-pong, TMPI vs MPICH1 (MPICH w/ shared
    memory) and MPICH2 (MPICH w/o shared memory)

[Figure: (a) Ping-Pong Short Message — round-trip time vs. message size (bytes); (b) Ping-Pong Long Message — transfer rate (MB/s) vs. message size (KB); curves for TMPI, MPICH1, and MPICH2.]
17
Collective Communication
  • Reduce, Bcast, Allreduce.
  • Each table entry reports TMPI / MPICH_SHM /
    MPICH_NOSHM times in μs.
  • Three node distributions, three root node
    settings.

Nodes  Root    Reduce          Bcast           Allreduce
4x1    same    9/121/4384      10/137/7913     160/175/627
4x1    rotate  33/81/3699      129/91/4238     160/175/627
4x1    combo   25/102/3436     17/32/966       160/175/627
1x4    same    28/1999/1844    21/1610/1551    571/675/775
1x4    rotate  146/1944/1878   164/1774/1834   571/675/775
1x4    combo   167/1977/1854   43/409/392      571/675/775
4x4    same    39/2532/4809    56/2792/10246   736/1412/19914
4x4    rotate  161/1718/8566   216/2204/8036   736/1412/19914
4x4    combo   141/2242/8515   62/489/2054     736/1412/19914
18
Macro-Benchmark Performance
19
Conclusions
  • Great Advantage of Threaded MPI Execution on SMP
    Clusters
  • Micro-benchmark: 70 times faster than MPICH.
  • Macro-benchmark: 100% faster than MPICH.
  • Optimization Techniques
  • Separated Collective and Point-to-Point
    Communication Channels
  • Adaptive Buffer Management
  • Hierarchy-Aware Communications

http://www.cs.ucsb.edu/projects/tmpi/
20
Background: Safe Execution of MPI Programs Using
Threads
  • Program Transformation: Eliminate global and
    static variables (called permanent variables).
  • Thread-Specific Data (TSD)
  • Each thread can associate a pointer-sized data
    variable with a commonly defined key value (an
    integer). With the same key, different threads
    can set/get the values of their own copies of
    the data variable.
  • TSD-based Transformation: Each permanent variable
    declaration is replaced with a KEY declaration.
    Each node associates its private copy of the
    permanent variable with the corresponding key.
    Wherever a permanent variable is referenced, the
    key is used to retrieve the per-thread copy of
    the variable.

21
Program Transformation: An Example
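
The example figure from this slide is not preserved in the transcript. Below is a minimal hypothetical sketch of the TSD-based transformation described on the previous slide, using POSIX thread-specific data; the variable counter and the helper counter_ptr are illustrative, not from the original slide.

#include <pthread.h>
#include <stdlib.h>

/* Before the transformation, the user program had a permanent variable:
 *
 *     int counter = 0;
 *
 * After the transformation, the declaration becomes a key, and every
 * MPI-node thread keeps its own copy of 'counter' behind that key. */

static pthread_key_t  counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_key_init(void)
{
    pthread_key_create(&counter_key, free);  /* free per-thread copy on exit */
}

/* References to 'counter' are rewritten to go through this accessor. */
static int *counter_ptr(void)
{
    int *p;
    pthread_once(&counter_once, counter_key_init);
    p = pthread_getspecific(counter_key);
    if (p == NULL) {                 /* first use in this thread */
        p = calloc(1, sizeof *p);    /* private copy, initialized to 0 */
        pthread_setspecific(counter_key, p);
    }
    return p;
}

/* Original reference:     counter++;
 * Transformed reference:  (*counter_ptr())++;                          */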