Optimizing Bandwidth Limited Problems Using One-Sided Communication and Overlap


Provided by: christianb8

Learn more at: https://upc.lbl.gov


1
Optimizing Bandwidth Limited Problems Using
One-Sided Communication and Overlap

Christian Bell (1,2), Dan Bonachea (1),
Rajesh Nishtala (1), and Katherine Yelick (1,2)
(1) UC Berkeley, Computer Science Division
(2) Lawrence Berkeley National Laboratory

2
Conventional Wisdom

Send few, large messages
Allows the network to deliver the most effective
bandwidth
Isolate computation and communication phases
Uses bulk-synchronous programming
Allows for packing to maximize message size
Message passing is preferred paradigm for
clusters
Global Address Space (GAS) Languages are
primarily useful for latency sensitive
applications
GAS Languages mainly help productivity
However, not well known for their performance
advantages

3
Our Contributions

Increasingly, cost of HPC machines is in the
network
One-sided communication model is a better match
to modern networks
GAS Languages simplify programming for this model
How to use these communication advantages
Case study with NAS Fourier Transform (FT)
Algorithms designed to relieve communication
bottlenecks
Overlap communication and computation
Send messages early and often to maximize overlap

4
UPC Programming Model

Global address space: any thread/process may
directly read/write data allocated by another
Partitioned: data is designated as local (near)
or global (possibly far); programmer controls
layout

Global arrays: allow any processor to directly
access data on any other processor
[Figure: a shared region holding global data (g) and private regions holding local data (l) for Proc 0, Proc 1, ..., Proc n-1]

Three of the current languages: UPC, CAF, and
Titanium
Emphasis in this talk on UPC (based on C)
However, the programming paradigms presented in
this work are not limited to UPC

5
Advantages of GAS Languages

Productivity
GAS supports construction of complex shared data
structures
High level constructs simplify parallel
programming
Related work has already focused on these
advantages
Performance (the main focus of this talk)
GAS Languages can be faster than two-sided MPI
One-sided communication paradigm for GAS
languages more natural fit to modern cluster
networks
Enables novel algorithms to leverage the power of
these networks
GASNet, the communication system in the Berkeley
UPC Project, is designed to take advantage of
this communication paradigm

6
One-Sided vs Two-Sided

[Figure: a one-sided put (e.g., GASNet) carries a dest. addr. and data payload; a two-sided message (e.g., MPI) carries a message id and data payload. Diagram components: host CPU, network interface, memory]

A one-sided put/get can be entirely handled by
network interface with RDMA support
CPU can dedicate more time to computation rather
than handling communication
A two-sided message can employ RDMA for only part
of the communication
Each message requires the target to provide the
destination address
Offloaded to network interface in networks like
Quadrics
RDMA makes it apparent that MPI has added costs
associated with ordering to make it usable as an
end-user programming model
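
The difference between the two models can be sketched with a toy simulation. This is not GASNet or MPI code: the `Node` class, its method names, and the `cpu_events` counter are hypothetical, introduced only to show who supplies the destination address in each case.

```python
class Node:
    """Toy node: 'memory' is a flat list that a simulated NIC can write into."""

    def __init__(self, size):
        self.memory = [0] * size
        self.posted_recvs = {}  # message id -> destination address (two-sided only)
        self.unexpected = {}    # message id -> payload that arrived before its receive
        self.cpu_events = 0     # times the host CPU had to take part in a transfer

    # One-sided: the initiator names the destination address itself,
    # so the payload can land in memory without the target CPU acting.
    def rdma_put(self, dest_addr, payload):
        self.memory[dest_addr:dest_addr + len(payload)] = payload

    # Two-sided: the message carries only an id; the target supplies the address.
    def post_recv(self, msg_id, dest_addr):
        self.cpu_events += 1
        if msg_id in self.unexpected:
            # Data arrived early and was buffered: copy it to its final place now.
            payload = self.unexpected.pop(msg_id)
            self.memory[dest_addr:dest_addr + len(payload)] = payload
        else:
            self.posted_recvs[msg_id] = dest_addr

    def deliver_message(self, msg_id, payload):
        if msg_id in self.posted_recvs:
            # A matching receive was posted, so the address is already known.
            dest_addr = self.posted_recvs.pop(msg_id)
            self.memory[dest_addr:dest_addr + len(payload)] = payload
        else:
            # No address yet: the payload has to wait in a temporary buffer.
            self.unexpected[msg_id] = list(payload)
```

In this model a put completes with `cpu_events` still at zero, while a two-sided message cannot reach its final location until the target has called `post_recv`.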

7
Latency Advantages

Comparison
One-sided
Initiator can always transmit remote address
Close semantic match to high bandwidth, zero-copy
RDMA
Two-sided
Receiver must provide destination address
Latency measurement correlates to software
overhead
Much of the small-message latency is due to time
spent in software/firmware processing

[Chart: latency comparison; down is good]
8
Bandwidth Advantages

One-sided semantics better match to RDMA
supported networks
Relaxing point-to-point ordering constraint can
allow for higher performance on some networks
GASNet saturates to hardware peak at lower
message sizes
Synchronization decoupled from data transfer
MPI semantics designed for end user
Comparison against good MPI implementation
Semantic requirements hinder MPI performance
Synchronization and data transfer are coupled
together in message passing

[Chart: bandwidth comparison; up is good]
Over a factor of 2 improvement for 1 kB messages
9
Bandwidth Advantages (cont)

GASNet and MPI saturate to roughly the same
bandwidth for large messages
GASNet consistently outperforms MPI for
mid-range message sizes

[Chart: bandwidth comparison; up is good]
10
A Case Study: NAS FT

How to use the potential that the microbenchmarks
reveal?
Perform a large 3-dimensional Fourier Transform
Used in many areas of computational sciences
Molecular dynamics, computational fluid dynamics,
image processing, signal processing, nanoscience,
astrophysics, etc.
Representative of a class of communication
intensive algorithms
Sorting algorithms rely on a similar intensive
communication pattern
Requires every processor to communicate with
every other processor
Limited by bandwidth

11
Performing a 3D FFT

NX x NY x NZ elements spread across P processors
Will use a 1-dimensional layout in the Z dimension
Each processor gets NZ / P planes of NX x NY
elements per plane

Example: P = 4
[Figure: 1D partition of the NX x NY x NZ cube along Z; processors p0 through p3 each hold NZ/P planes]
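
As a small illustration of the 1D layout, the sketch below maps planes to processors. It assumes each processor holds a contiguous block of planes and that P divides NZ evenly; the function names are hypothetical.

```python
def planes_owned(rank, nz, p):
    """Return the range of global Z indices held by `rank` in a 1D layout."""
    if nz % p != 0:
        raise ValueError("this sketch assumes P divides NZ evenly")
    per_rank = nz // p  # NZ / P planes per processor
    return range(rank * per_rank, (rank + 1) * per_rank)


def owner_of_plane(z, nz, p):
    """Return the rank that holds global plane z."""
    return z // (nz // p)
```

With NZ = 16 and P = 4, `planes_owned(1, 16, 4)` covers planes 4 through 7.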
12
Performing a 3D FFT (part 2)

Perform an FFT in all three dimensions
With 1D layout, 2 out of the 3 dimensions are
local while the last Z dimension is distributed

Step 1: FFTs on the columns (all elements local)
Step 2: FFTs on the rows (all elements local)
Step 3: FFTs in the Z-dimension (requires
communication)
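
The three steps work because a 3D transform separates into 1D transforms along each axis. The sketch below checks this on a tiny cube, using a naive O(n^2) DFT as a stand-in for a real FFT library. The indexing `cube[z][x][y]`, with columns running along X and rows along Y, is an assumption made for the example.

```python
import cmath


def dft(xs):
    """Naive 1D discrete Fourier transform (stand-in for a real FFT)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * j * k / n) for j, x in enumerate(xs))
            for k in range(n)]


def fft3d_by_steps(cube):
    """cube[z][x][y]: transform along X (columns), then Y (rows), then Z."""
    nz, nx, ny = len(cube), len(cube[0]), len(cube[0][0])
    out = [[row[:] for row in plane] for plane in cube]
    # Step 1: columns of each plane (vary x); local in a 1D layout along Z.
    for z in range(nz):
        for y in range(ny):
            col = dft([out[z][x][y] for x in range(nx)])
            for x in range(nx):
                out[z][x][y] = col[x]
    # Step 2: rows of each plane (vary y); also local.
    for z in range(nz):
        for x in range(nx):
            out[z][x] = dft(out[z][x])
    # Step 3: along Z (vary z); on a real machine this needs remote data.
    for x in range(nx):
        for y in range(ny):
            line = dft([out[z][x][y] for z in range(nz)])
            for z in range(nz):
                out[z][x][y] = line[z]
    return out


def dft3d_direct(cube):
    """Direct triple-sum definition, used only to check the stepwise version."""
    nz, nx, ny = len(cube), len(cube[0]), len(cube[0][0])

    def term(kz, kx, ky):
        return sum(cube[z][x][y]
                   * cmath.exp(-2j * cmath.pi * (kz * z / nz + kx * x / nx + ky * y / ny))
                   for z in range(nz) for x in range(nx) for y in range(ny))

    return [[[term(kz, kx, ky) for ky in range(ny)] for kx in range(nx)]
            for kz in range(nz)]
```

Both functions should agree to within floating-point error on any small cube.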
13
Performing the 3D FFT (part 3)

Can perform Steps 1 and 2 since all the data is
available without communication
Perform a Global Transpose of the cube
Allows step 3 to continue

[Figure: global transpose of the cube]
14
The Transpose

Each processor has to scatter input domain to
other processors
Every processor divides its portion of the domain
into P pieces
Send each of the P pieces to a different
processor
Three different ways to break up the messages
Packed Slabs (i.e. a single packed Alltoall in
MPI parlance)
Slabs
Pencils
An order of magnitude increase in the number of
messages
An order of magnitude decrease in the size of
each message
Slabs and Pencils allow overlapping
communication and computation and leverage RDMA
support in modern networks

15
Algorithm 1: Packed Slabs

Example with P = 4, NX = NY = NZ = 16
Perform all row and column FFTs
Perform local transpose
data destined to a remote processor are grouped
together
Perform P puts of the data

[Figure: after the local transpose, one put each to proc 0, proc 1, proc 2, and proc 3]

For a 512^3 grid across 64 processors
Send 64 messages of 512kB each
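
A minimal sketch of the packed-slabs exchange, simulated for all P processors inside one Python process. The nested list `recv[dest][src]` stands in for remote receive buffers, and the function name and data layout `local_planes[rank][k][x][y]` are assumptions made for the example, not the authors' code.

```python
def packed_slabs_transpose(local_planes, p):
    """Simulate the packed-slabs exchange for all P ranks at once.

    local_planes[rank][k][x][y] is the k-th local plane of `rank`.
    Returns recv[dest][src]: the single packed message that src puts to dest,
    i.e. the slab of NX/P rows owned by dest taken from every local plane of src.
    """
    nx = len(local_planes[0][0])
    rows = nx // p  # rows per slab (assumes P divides NX)
    recv = [[None] * p for _ in range(p)]
    for src in range(p):
        for dest in range(p):
            # Local transpose / packing: gather dest's rows from every local plane.
            message = [plane[dest * rows:(dest + 1) * rows]
                       for plane in local_planes[src]]
            # One put per destination processor.
            recv[dest][src] = message
    return recv
```

After the exchange, `recv[dest][src][k][r]` holds row `dest * rows + r` of the plane with global index `src * (NZ/P) + k`, so each processor has the full Z extent for its own rows.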

16
Bandwidth Utilization

NAS FT (Class D) with 256 processors on
Opteron/InfiniBand
Each processor sends 256 messages of 512 kB
Global Transpose (i.e. all-to-all exchange) only
achieves 67% of peak point-to-point bidirectional
bandwidth
Many factors could cause this slowdown
Network contention
Number of processors that each processor
communicates with
Can we do better?

17
Algorithm 2: Slabs

Waiting to send all data in one phase bunches up
communication events
Algorithm Sketch
for each of the NZ/P planes
  Perform all column FFTs
  for each of the P slabs
  (a slab is NX/P rows)
    Perform FFTs on the rows in the slab
    Initiate 1-sided put of the slab
Wait for all puts to finish
Barrier
Non-blocking RDMA puts allow data movement to be
overlapped with computation.
Puts are spaced apart by the amount of time to
perform FFTs on NX/P rows
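
The initiate-then-keep-computing pattern can be sketched as follows. This is only a single-process illustration: a one-worker `ThreadPoolExecutor` stands in for the network interface, `put` is a hypothetical stand-in for a non-blocking one-sided put, and a naive DFT replaces the FFT library. Python threads will not reproduce the performance effect; the sketch shows the ordering of operations only.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor, wait


def dft(xs):
    """Naive 1D discrete Fourier transform (stand-in for a real FFT)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * j * k / n) for j, x in enumerate(xs))
            for k in range(n)]


def put(remote, z, slab):
    """Stand-in for a one-sided put: write the slab into the target's buffer."""
    remote[z] = slab


def slabs_rank(planes, first_z, p, remotes, nic):
    """Run the slabs loop for one rank.

    planes[k][x][y] are this rank's local planes; first_z is the global Z index
    of planes[0]. remotes[dest] is a dict standing in for dest's receive buffer.
    nic is an executor standing in for the network interface.
    """
    nx, ny = len(planes[0]), len(planes[0][0])
    rows = nx // p  # rows per slab (assumes P divides NX)
    handles = []
    for k, plane in enumerate(planes):
        # Column FFTs for the whole plane (along x): cols[y][x].
        cols = [dft([plane[x][y] for x in range(nx)]) for y in range(ny)]
        for s in range(p):
            # Row FFTs (along y) for the NX/P rows of slab s.
            slab = [dft([cols[y][x] for y in range(ny)])
                    for x in range(s * rows, (s + 1) * rows)]
            # Initiate the put, then go straight back to computing.
            handles.append(nic.submit(put, remotes[s], first_z + k, slab))
    wait(handles)  # wait for all puts to finish; a barrier would follow
    return len(handles)
```

A caller would create the executor with `ThreadPoolExecutor(max_workers=1)` and pass it as `nic`; the returned count is NZ/P times P puts for the rank.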

[Figure: puts for plane 0 in flight while computation for the next plane starts]

For a 512^3 grid across 64 processors
Send 512 messages of 64kB each

18
Algorithm 3: Pencils

Further reduce the granularity of communication
Send a row (pencil) as soon as it is ready
Algorithm Sketch
for each of the NZ/P planes
  Perform all 16 column FFTs
  for r = 0; r < NX/P; r++
    for each slab s in the plane
      Perform FFT on row r of slab s
      Initiate 1-sided put of row r
Wait for all puts to finish
Barrier
Large increase in message count
Communication events finely diffused through
computation
Maximum amount of overlap
Communication starts early

[Figure: puts for plane 0 in flight while computation for the next plane starts]

For a 512^3 grid across 64 processors
Send 4096 messages of 8kB each

19
Communication Requirements
With Slabs GASNet is slightly faster than MPI

512^3 across 64 processors
Alg 1 Packed Slabs
Send 64 messages of 512kB
Alg 2 Slabs
Send 512 messages of 64kB
Alg 3 Pencils
Send 4096 messages of 8kB
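
The message counts and sizes above follow from the grid dimensions. The sketch below reproduces them, assuming 16 bytes per element (a double-precision complex value) and 1 kB = 1024 bytes; both assumptions are chosen because they match the figures on the slide. The function name is hypothetical.

```python
def message_plan(nx, ny, nz, p, bytes_per_element=16):
    """Per-processor (message count, message size in bytes) for each algorithm."""
    planes = nz // p                   # local planes per processor
    rows = nx // p                     # rows per slab
    row_bytes = ny * bytes_per_element  # one pencil
    return {
        "packed_slabs": (p, planes * rows * row_bytes),
        "slabs": (planes * p, rows * row_bytes),
        "pencils": (planes * p * rows, row_bytes),
    }
```

For `message_plan(512, 512, 512, 64)` the sizes come out to 524288, 65536, and 8192 bytes, i.e. 512 kB, 64 kB, and 8 kB, with 64, 512, and 4096 messages respectively.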

20
Platforms
21
Comparison of Algorithms

Compare 3 algorithms against original NAS FT
All versions including Fortran use FFTW for local
1D FFTs
Largest class that fit in the memory (usually
class D)
All UPC flavors outperform original Fortran/MPI
implementation by at least 20%
One-sided semantics allow even exchange-based
implementations to improve over MPI
implementations
Overlap algorithms spread the messages out,
easing the bottlenecks
1.9x speedup in the best case

[Chart: comparison of algorithms; up is good]
22
Time Spent in Communication

Implemented the 3 algorithms in MPI using Irecvs
and Isends
Compare time spent initiating or waiting for
communication to finish
UPC consistently spends less time in
communication than its MPI counterpart
MPI unable to handle pencils algorithm in some
cases

[Chart: time spent in communication; down is good. Surviving labels: 28.6, 312.8, 34.1, MPI Crash (Pencils)]
23
Performance Summary
[Chart: MFLOPS / Proc; up is good]
24
Conclusions

One-sided semantics used in GAS languages, such
as UPC, provide a more natural fit to modern
networks
Benchmarks demonstrate these advantages
Use these advantages to alleviate communication
bottlenecks in bandwidth limited applications
Paradoxically, it helps to send more, smaller
messages
Both two-sided and one-sided implementations can
see advantages of overlap
One-sided implementations consistently outperform
two-sided counterparts because the communication
model is a more natural fit
Send early and often to avoid communication
bottlenecks