Title: An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications
1. An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications
- Daniel Chavarría-Miranda
- John Mellor-Crummey
- Dept. of Computer Science
- Rice University
2. High-Performance Fortran (HPF)
- Industry-standard data-parallel language
- Partitioning of data drives partitioning of computation (see the sketch below)
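For reference, a minimal sketch of standard HPF data-layout directives; the arrays a and b, the processor arrangement p, and the BLOCK layout are illustrative, not taken from the benchmarks:

      real a(100, 100), b(100, 100)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(BLOCK, *) ONTO p
!HPF$ ALIGN b(i, j) WITH a(i, j)
      ! The layout of a fixes where each element of b lives and, via the
      ! owner-computes rule, which processor executes each assignment.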
3. Motivation
- Obtaining high performance from applications written using high-level parallel languages has been elusive
- Tightly-coupled applications are particularly hard
  - Data dependences serialize computation
  - induces tradeoffs between parallelism, communication granularity and frequency
  - traditional HPF partitionings limit scalability and performance
  - communication might be needed inside loops
4. Contributions
- A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications
- An analysis of their performance impact
5. dHPF Compiler
- Based on an abstract equational framework
  - manipulates sets of processors, array elements, iterations, and pairwise mappings between these sets
  - optimizations and code generation are implemented as operations on these sets and mappings
- Sophisticated computation-partitioning model
  - enables partial replication of computation to reduce communication
- Support for the multipartitioning distribution
  - MULTI distribution specifier (sketched below)
  - suited for line-sweep computations
- Innovative optimizations
  - reduce communication
  - improve locality
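A sketch of what the MULTI specifier looks like in source code; the exact dHPF directive spelling and the array shape are assumptions for illustration, not taken from the benchmark sources:

      real u(102, 102, 102)
!HPF$ DISTRIBUTE u(MULTI, MULTI, MULTI)
      ! Each processor owns a diagonal collection of tiles, so a sweep
      ! along any single dimension keeps all processors busy at once.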
6. Overview
- Introduction
- Line Sweep Computations
- Performance Comparison
- Optimization Evaluation
- Partially Replicated Computation
- Interprocedural Communication Elimination
- Communication Coalescing
- Direct Access Buffers
- Conclusions
7. Line-Sweep Computations
- 1D recurrences on a multidimensional domain (see the sketch below)
- Recurrences order the computation along each dimension
- Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
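A minimal sketch of such a recurrence; the arrays u, c and the extents nx, ny, nz are illustrative, not the benchmark code:

      ! Forward sweep along k: each (i, j) line carries an independent
      ! first-order recurrence, but iterations along k must be ordered.
      do k = 2, nz
        do j = 1, ny
          do i = 1, nx
            u(i, j, k) = u(i, j, k) - c(i, j, k) * u(i, j, k-1)
          enddo
        enddo
      enddo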
8. Partitioning Choices (transpose)
9. Partitioning Choices (block + CGP)
- Partial wavefront-type parallelism
[Figure: block partitioning across Processors 0-3]
10. Partitioning Choices (multipartitioning)
- Full parallelism for sweeping along any partitioned dimension
[Figure: multipartitioned tiles assigned to Processors 0-3]
11. NAS SP & BT Benchmarks
- NAS SP & BT benchmarks from NASA Ames
  - use ADI to solve the Navier-Stokes equation in 3D
  - forward & backward line sweeps along each dimension, for each time step
- SP solves scalar penta-diagonal systems
- BT solves block-tridiagonal systems
- SP has double the communication volume and frequency of BT
12. Experimental Setup
- 2 versions from NASA, each written in Fortran 77
  - parallel MPI hand-coded version
  - sequential version (3500 lines)
- dHPF input: sequential version + HPF directives (including MULTI; ~2% increase in line count)
- Inlined several procedures manually
  - enables dHPF to overlap local computation with communication without interprocedural tiling
- Platform: SGI Origin 2000 (128 250MHz processors), SGI's MPI implementation, SGI's compilers
13. Performance Comparison
- Compare four versions of NAS SP & BT
  - Multipartitioned MPI hand-coded version from NASA
    - different executables for each number of processors
  - Multipartitioned dHPF-generated version
    - single executable for all numbers of processors
  - Block-partitioned dHPF-generated version (with coarse-grain pipelining, using a 2D partition)
    - single executable for all numbers of processors
  - Block-partitioned pghpf-compiled version from PGI's source code (using a full transpose with a 1D partition)
    - single executable for all numbers of processors
14. Efficiency for NAS SP (102³, class B size)
[Efficiency plot; annotations: "similar comm. volume, more serialization" and "> 2x multipartitioning comm. volume"]
15. Efficiency for NAS BT (102³, class B size)
[Efficiency plot; annotation: "> 2x multipartitioning comm. volume"]
16. Overview
- Introduction
- Line Sweep Computations
- Performance Comparison
- Optimization Evaluation
- Partially Replicated Computation
- Interprocedural Communication Elimination
- Communication Coalescing
- Direct Access Buffers
- Conclusions
17. Evaluation Methodology
- All versions are dHPF-generated using multipartitioning
- Turn off a particular optimization (n - 1 approach)
  - determine overhead without it (% over fully optimized)
- Measure its contribution to overall performance
  - total execution time
  - total communication volume
  - L2 data cache misses (where appropriate)
- Class A (64³) and class B (102³) problem sizes on two different processor counts (16 & 64 processors)
18. Partially Replicated Computation
SHADOW a(2, 2)
ON_HOME a(i-2, j) ∪ ON_HOME a(i+2, j) ∪ ON_HOME a(i, j-2) ∪ ON_HOME a(i-1, j+1) ∪ ON_HOME a(i, j)
ON_EXT_HOME a(i, j)
- Partial computation replication is used to reduce communication
19. Impact of Partial Replication
- BT: eliminates comm. for 5D arrays fjac and njac in lhs<xyz>
- Both: eliminates comm. for six 3D arrays in compute_rhs
20. Impact of Partial Replication (cont.)
21. Interprocedural Communication Reduction
Extensions to HPF/JA Directives
- REFLECT: placement of near-neighbor communication
- LOCAL: communication not needed for a scope
- extended ON HOME: partial computation replication
- Compiler doesn't need full interprocedural communication and availability analyses to determine whether data in overlap regions and comm. buffers is fresh
22. Interprocedural Communication Reduction (cont.)
[Figure annotations: "From top neighbor", "From left neighbor"]
SHADOW a(2, 1)   REFLECT (a(0:0, 1:0), a(1:0, 0:0))
SHADOW a(2, 1)   REFLECT (a)
- The combination of REFLECT, extended ON HOME and LOCAL reduces communication volume by 13%, resulting in a 9% reduction in execution time
23. Normalizing Communication

do i = 1, n
  do j = 2, n - 2
    a(i, j)     = a(i, j - 2)   ! ON_HOME a(i, j)
    a(i, j + 2) = a(i, j)       ! ON_HOME a(i, j + 2)
  enddo
enddo

Same non-local data needed
[Figure: two P0/P1 partitionings of a; the pair a(i, j) / a(i, j - 2) and the pair a(i, j + 2) / a(i, j) straddle the same partition boundary, so both statements need the same non-local data]
24. Coalescing Communication
[Figure: two shadow-region messages for array A combined into a single coalesced message]
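A rough sketch of the effect on the generated message traffic; the buffer, bounds, and MPI details below are illustrative assumptions, not dHPF output:

      ! Instead of sending the boundary strips a(1:nx, jhi-1) and
      ! a(1:nx, jhi) as two separate messages, pack both into one
      ! buffer and send a single coalesced message to the neighbor.
      cnt = 0
      do i = 1, nx
        cnt = cnt + 1
        buf(cnt) = a(i, jhi - 1)
      enddo
      do i = 1, nx
        cnt = cnt + 1
        buf(cnt) = a(i, jhi)
      enddo
      call MPI_SEND(buf, cnt, MPI_DOUBLE_PRECISION, right, tag,
     &              MPI_COMM_WORLD, ierr)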
25. Impact of Normalized Coalescing
26. Impact of Normalized Coalescing
- Key optimization for scalability
27. Direct Access Buffers
- Choices for receiving complex coalesced messages
  - Unpack them into the shadow regions
    - two simultaneous live copies in cache
    - unpacking can be costly
    - uniform access to non-local and local data
  - Reference them directly out of the receive buffer
    - introduces two modes of access for data (non-local & interior)
    - overhead of having a single loop with these two modes is high
    - loops should be split into non-local & interior portions, according to the data they reference (see the sketch below)
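A rough sketch of the split-loop idea, assuming a left-neighbor dependence on u; the names a, u, recv_buf, nx, ny are illustrative:

      ! Interior portion: every operand is locally owned.
      do j = 2, ny
        do i = 1, nx
          a(i, j) = a(i, j) + u(i, j - 1)
        enddo
      enddo
      ! Boundary portion: j = 1 needs u(i, 0) from the neighbor; read it
      ! directly out of the receive buffer instead of first unpacking it
      ! into a shadow region.
      do i = 1, nx
        a(i, 1) = a(i, 1) + recv_buf(i)
      enddo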
28. Impact of Direct Access Buffers
- Use direct access buffers for the main swept arrays
- Direct access buffers + loop splitting reduce L2 data cache misses by 11%, resulting in an 11% reduction in execution time
29. Conclusions
- Compiler-generated code can match the performance of sophisticated hand-coded parallelizations
- High performance comes from the aggregate benefit of multiple optimizations
- Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is needed
- Data-parallel compilers should target each potential source of inefficiency in the generated code if they want to deliver the performance scientific users demand
30. Efficiency for NAS SP (class A)
31. Efficiency for NAS BT (class A)
32. Data Partitioning
33. Data Partitioning (cont.)
34. Partially Replicated Computation
[Figure: local portions of A, U, and B with shadow regions on processors p and p+1, showing the replicated computation and the remaining communication]
do i = 1, n
  do j = 2, n
    a(i,j) = u(i,j-1) + 1.0          ! ON_HOME a(i,j) ∪ ON_HOME a(i,j+1)
    b(i,j) = u(i,j-1) + a(i,j-1)     ! ON_HOME a(i,j)
  enddo
enddo
35. Using HPF/JA for Comm. Elimination
36. Using HPF/JA for Comm. Elimination
37. Normalized Comm. Coalescing (cont.)
do timestep = 1, T
  do j = 1, n
    do i = 3, n
      a(i, j) = a(i + 1, j) + b(i + 1, j)       ! ON_HOME a(i, j)
    enddo
  enddo
  do j = 1, n
    do i = 1, n - 2
      a(i + 2, j) = a(i + 3, j) + b(i + 1, j)   ! ON_HOME a(i + 2, j)
    enddo
  enddo
  do j = 1, n
    do i = 1, n - 1
      a(i + 1, j) = a(i + 2, j) + b(i + 1, j)   ! ON_HOME b(i + 1, j)
    enddo
  enddo
enddo
Coalesce communication at this point
38. Impact of Direct Access Buffers
39. Impact of Direct Access Buffers
40. Direct Access Buffers
[Figure: Pack, Send, Receive, Unpack between Processor 0 and Processor 1]
41. Direct Access Buffers
[Figure: Pack, Send, Receive, then Use directly from the receive buffer, between Processor 0 and Processor 1]