Title: An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications
1. An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications
- Daniel Chavarría-Miranda
- John Mellor-Crummey
- Dept. of Computer Science
- Rice University
2. High-Performance Fortran (HPF)
- Industry-standard data-parallel language
- Partitioning of data drives partitioning of computation (see the sketch below)
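For reference, a minimal sketch of standard HPF data-layout directives; the arrays a and b, the processor arrangement p, and the BLOCK layout are illustrative, not taken from the benchmarks:

      real a(100, 100), b(100, 100)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(BLOCK, *) ONTO p
!HPF$ ALIGN b(i, j) WITH a(i, j)
      ! The layout of a fixes where each element of b lives and, via the
      ! owner-computes rule, which processor executes each assignment.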
3. Motivation
- Obtaining high performance from applications written using high-level parallel languages has been elusive
- Tightly-coupled applications are particularly hard
  - Data dependences serialize computation
  - induces tradeoffs between parallelism, communication granularity and frequency
  - traditional HPF partitionings limit scalability and performance
  - communication might be needed inside loops
4. Contributions
- A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications
- An analysis of their performance impact
5. dHPF Compiler
- Based on an abstract equational framework
  - manipulates sets of processors, array elements, iterations, and pairwise mappings between these sets
  - optimizations and code generation are implemented as operations on these sets and mappings
- Sophisticated computation-partitioning model
  - enables partial replication of computation to reduce communication
- Support for the multipartitioning distribution
  - MULTI distribution specifier (sketched below)
  - suited for line-sweep computations
- Innovative optimizations
  - reduce communication
  - improve locality
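A sketch of what the MULTI specifier looks like in source code; the exact dHPF directive spelling and the array shape are assumptions for illustration, not taken from the benchmark sources:

      real u(102, 102, 102)
!HPF$ DISTRIBUTE u(MULTI, MULTI, MULTI)
      ! Each processor owns a diagonal collection of tiles, so a sweep
      ! along any single dimension keeps all processors busy at once.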
6. Overview
- Introduction
- Line Sweep Computations
- Performance Comparison
- Optimization Evaluation
- Partially Replicated Computation
- Interprocedural Communication Elimination
- Communication Coalescing
- Direct Access Buffers
- Conclusions
7. Line-Sweep Computations
- 1D recurrences on a multidimensional domain (see the sketch below)
- Recurrences order the computation along each dimension
- Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
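A minimal sketch of such a recurrence; the arrays u, c and the extents nx, ny, nz are illustrative, not the benchmark code:

      ! Forward sweep along k: each (i, j) line carries an independent
      ! first-order recurrence, but iterations along k must be ordered.
      do k = 2, nz
        do j = 1, ny
          do i = 1, nx
            u(i, j, k) = u(i, j, k) - c(i, j, k) * u(i, j, k-1)
          enddo
        enddo
      enddo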
8. Partitioning Choices (transpose)
9. Partitioning Choices (block + CGP)
- Partial wavefront-type parallelism
[Figure: block partitioning across Processors 0-3]
10. Partitioning Choices (multipartitioning)
- Full parallelism for sweeping along any partitioned dimension
[Figure: multipartitioned tiles assigned to Processors 0-3]
11. NAS SP & BT Benchmarks
- NAS SP & BT benchmarks from NASA Ames
  - use ADI to solve the Navier-Stokes equation in 3D
  - forward & backward line sweeps along each dimension, for each time step
- SP solves scalar penta-diagonal systems
- BT solves block-tridiagonal systems
- SP has double the communication volume and frequency of BT
12. Experimental Setup
- 2 versions from NASA, each written in Fortran 77
  - parallel MPI hand-coded version
  - sequential version (3500 lines)
- dHPF input: sequential version + HPF directives (including MULTI; ~2% increase in line count)
- Inlined several procedures manually
  - enables dHPF to overlap local computation with communication without interprocedural tiling
- Platform: SGI Origin 2000 (128 250MHz processors), SGI's MPI implementation, SGI's compilers
13. Performance Comparison
- Compare four versions of NAS SP & BT
  - Multipartitioned MPI hand-coded version from NASA
    - different executables for each number of processors
  - Multipartitioned dHPF-generated version
    - single executable for all numbers of processors
  - Block-partitioned dHPF-generated version (with coarse-grain pipelining, using a 2D partition)
    - single executable for all numbers of processors
  - Block-partitioned pghpf-compiled version from PGI's source code (using a full transpose with a 1D partition)
    - single executable for all numbers of processors
14. Efficiency for NAS SP (102³, class B size)
[Efficiency plot; annotations: "similar comm. volume, more serialization" and "> 2x multipartitioning comm. volume"]
15. Efficiency for NAS BT (102³, class B size)
[Efficiency plot; annotation: "> 2x multipartitioning comm. volume"]
16. Overview
- Introduction
- Line Sweep Computations
- Performance Comparison
- Optimization Evaluation
- Partially Replicated Computation
- Interprocedural Communication Elimination
- Communication Coalescing
- Direct Access Buffers
- Conclusions
17. Evaluation Methodology
- All versions are dHPF-generated using multipartitioning
- Turn off a particular optimization (n - 1 approach)
  - determine overhead without it (% over fully optimized)
- Measure its contribution to overall performance
  - total execution time
  - total communication volume
  - L2 data cache misses (where appropriate)
- Class A (64³) and class B (102³) problem sizes on two different processor counts (16 & 64 processors)
18. Partially Replicated Computation
SHADOW a(2, 2)
ON_HOME a(i-2, j) ∪ ON_HOME a(i+2, j) ∪ ON_HOME a(i, j-2) ∪ ON_HOME a(i-1, j+1) ∪ ON_HOME a(i, j)
ON_EXT_HOME a(i, j)
- Partial computation replication is used to reduce communication
19. Impact of Partial Replication
- BT: eliminates comm. for 5D arrays fjac and njac in lhs<xyz>
- Both: eliminates comm. for six 3D arrays in compute_rhs
20. Impact of Partial Replication (cont.)
21. Interprocedural Communication Reduction
Extensions to HPF/JA Directives
- REFLECT: placement of near-neighbor communication
- LOCAL: communication not needed for a scope
- extended ON HOME: partial computation replication
- Compiler doesn't need full interprocedural communication and availability analyses to determine whether data in overlap regions and comm. buffers is fresh
22. Interprocedural Communication Reduction (cont.)
[Figure annotations: "From top neighbor", "From left neighbor"]
SHADOW a(2, 1)   REFLECT (a(0:0, 1:0), a(1:0, 0:0))
SHADOW a(2, 1)   REFLECT (a)
- The combination of REFLECT, extended ON HOME and LOCAL reduces communication volume by 13%, resulting in a 9% reduction in execution time
23. Normalizing Communication

do i = 1, n
  do j = 2, n - 2
    a(i, j)     = a(i, j - 2)   ! ON_HOME a(i, j)
    a(i, j + 2) = a(i, j)       ! ON_HOME a(i, j + 2)
  enddo
enddo

Same non-local data needed
[Figure: two P0/P1 partitionings of a; the pair a(i, j) / a(i, j - 2) and the pair a(i, j + 2) / a(i, j) straddle the same partition boundary, so both statements need the same non-local data]
24. Coalescing Communication
[Figure: two shadow-region messages for array A combined into a single coalesced message]
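A rough sketch of the effect on the generated message traffic; the buffer, bounds, and MPI details below are illustrative assumptions, not dHPF output:

      ! Instead of sending the boundary strips a(1:nx, jhi-1) and
      ! a(1:nx, jhi) as two separate messages, pack both into one
      ! buffer and send a single coalesced message to the neighbor.
      cnt = 0
      do i = 1, nx
        cnt = cnt + 1
        buf(cnt) = a(i, jhi - 1)
      enddo
      do i = 1, nx
        cnt = cnt + 1
        buf(cnt) = a(i, jhi)
      enddo
      call MPI_SEND(buf, cnt, MPI_DOUBLE_PRECISION, right, tag,
     &              MPI_COMM_WORLD, ierr)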
25. Impact of Normalized Coalescing
26. Impact of Normalized Coalescing
- Key optimization for scalability
27. Direct Access Buffers
- Choices for receiving complex coalesced messages
  - Unpack them into the shadow regions
    - two simultaneous live copies in cache
    - unpacking can be costly
    - uniform access to non-local and local data
  - Reference them directly out of the receive buffer
    - introduces two modes of access for data (non-local & interior)
    - overhead of having a single loop with these two modes is high
    - loops should be split into non-local & interior portions, according to the data they reference (see the sketch below)
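A rough sketch of the split-loop idea, assuming a left-neighbor dependence on u; the names a, u, recv_buf, nx, ny are illustrative:

      ! Interior portion: every operand is locally owned.
      do j = 2, ny
        do i = 1, nx
          a(i, j) = a(i, j) + u(i, j - 1)
        enddo
      enddo
      ! Boundary portion: j = 1 needs u(i, 0) from the neighbor; read it
      ! directly out of the receive buffer instead of first unpacking it
      ! into a shadow region.
      do i = 1, nx
        a(i, 1) = a(i, 1) + recv_buf(i)
      enddo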
28. Impact of Direct Access Buffers
- Use direct access buffers for the main swept arrays
- Direct access buffers + loop splitting reduce L2 data cache misses by 11%, resulting in an 11% reduction in execution time
29. Conclusions
- Compiler-generated code can match the performance of sophisticated hand-coded parallelizations
- High performance comes from the aggregate benefit of multiple optimizations
- Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is needed
- Data-parallel compilers should target each potential source of inefficiency in the generated code if they want to deliver the performance scientific users demand
30. Efficiency for NAS SP (class A)
31. Efficiency for NAS BT (class A)
32. Data Partitioning
33. Data Partitioning (cont.)
34. Partially Replicated Computation
[Figure: local portions of A, U, and B with shadow regions on processors p and p+1, showing the replicated computation and the remaining communication]
do i = 1, n
  do j = 2, n
    a(i,j) = u(i,j-1) + 1.0          ! ON_HOME a(i,j) ∪ ON_HOME a(i,j+1)
    b(i,j) = u(i,j-1) + a(i,j-1)     ! ON_HOME a(i,j)
  enddo
enddo
35. Using HPF/JA for Comm. Elimination
36. Using HPF/JA for Comm. Elimination
37. Normalized Comm. Coalescing (cont.)
do timestep = 1, T
  do j = 1, n
    do i = 3, n
      a(i, j) = a(i + 1, j) + b(i + 1, j)       ! ON_HOME a(i, j)
    enddo
  enddo
  do j = 1, n
    do i = 1, n - 2
      a(i + 2, j) = a(i + 3, j) + b(i + 1, j)   ! ON_HOME a(i + 2, j)
    enddo
  enddo
  do j = 1, n
    do i = 1, n - 1
      a(i + 1, j) = a(i + 2, j) + b(i + 1, j)   ! ON_HOME b(i + 1, j)
    enddo
  enddo
enddo
Coalesce communication at this point
38. Impact of Direct Access Buffers
39. Impact of Direct Access Buffers
40. Direct Access Buffers
[Figure: Pack, Send, Receive, Unpack between Processor 0 and Processor 1]
41. Direct Access Buffers
[Figure: Pack, Send, Receive, then Use directly from the receive buffer, between Processor 0 and Processor 1]