An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications

1
An Evaluation of Data-Parallel Compiler Support
for Line-Sweep Applications
  • Daniel Chavarría-Miranda
  • John Mellor-Crummey
  • Dept. of Computer Science
  • Rice University

2
High-Performance Fortran (HPF)
  • Industry-standard data parallel language
  • Partitioning of data drives partitioning of
    computation (a minimal directive sketch follows)
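For illustration, a hedged sketch of that idea in standard HPF (array names, sizes, and the particular distribution are assumptions, not taken from the slides): the DISTRIBUTE and ALIGN directives partition the data, and the compiler derives the partitioning of the loop iterations from the data layout (owner computes).

program hpf_sketch
  integer, parameter :: n = 1024
  real :: a(n, n), b(n, n)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE a(BLOCK, *) ONTO p    ! block-partition the rows of a
!HPF$ ALIGN b(i, j) WITH a(i, j)       ! b follows a's distribution
  integer :: i, j
  b = 1.0
  do j = 1, n
    do i = 1, n
      ! each processor executes the iterations that write its block of a
      a(i, j) = 2.0 * b(i, j)
    end do
  end do
end program hpf_sketch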

3
Motivation
  • Obtaining high performance from applications
    written using high-level parallel languages has
    been elusive
  • Tightly-coupled applications are particularly
    hard
  • data dependences serialize computation,
    inducing tradeoffs between parallelism and
    communication granularity and frequency
  • traditional HPF partitionings limit scalability
    and performance
  • communication might be needed inside loops

4
Contributions
  • A set of compilation techniques that enable us to
    match hand-coded performance for tightly-coupled
    applications
  • An analysis of their performance impact

5
dHPF Compiler
  • Based on an abstract equational framework
  • manipulates sets of processors, array elements,
    iterations and pairwise mappings between these
    sets
  • optimizations and code generation are implemented
    as operations on these sets and mappings
  • Sophisticated computation partitioning model
  • enables partial replication of computation to
    reduce communication
  • Support for the multipartitioning distribution
  • MULTI distribution specifier (a directive sketch
    follows this list)
  • suited for line-sweep computations
  • Innovative optimizations
  • reduce communication
  • improve locality
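For reference, a hedged sketch of how the MULTI specifier named above might appear in dHPF input (array name, size, and the exact directive spelling are assumptions, not taken from the slides):

program multi_sketch
  integer, parameter :: n = 102
  real :: u(n, n, n)
  ! MULTI is dHPF's extension of DISTRIBUTE for the multipartitioning
  ! distribution: each processor owns one tile in every slab of each
  ! partitioned dimension, so sweeps along any partitioned dimension
  ! keep all processors busy
!HPF$ DISTRIBUTE u(MULTI, MULTI, *)
  u = 0.0
end program multi_sketch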

6
Overview
  • Introduction
  • Line Sweep Computations
  • Performance Comparison
  • Optimization Evaluation
  • Partially Replicated Computation
  • Interprocedural Communication Elimination
  • Communication Coalescing
  • Direct Access Buffers
  • Conclusions

7
Line-Sweep Computations
  • 1D recurrences on a multidimensional domain
  • Recurrences order computation along each
    dimension
  • Compiler-based parallelization is hard:
    loop-carried dependences, fine-grained
    parallelism (a sweep sketch follows)
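As a concrete illustration (array name, size, and coefficients are assumed), a forward line sweep along the first dimension of a 3D domain looks like the loop nest below; the recurrence on i serializes computation along each line.

program sweep_sketch
  integer, parameter :: n = 64
  real :: x(n, n, n)
  integer :: i, j, k
  x = 1.0
  ! forward sweep: a 1D recurrence along i, repeated for every (j, k)
  ! line; the loop-carried dependence on i serializes that dimension
  do k = 1, n
    do j = 1, n
      do i = 2, n
        x(i, j, k) = x(i, j, k) + 0.5 * x(i-1, j, k)
      end do
    end do
  end do
end program sweep_sketch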

8
Partitioning Choices (Transpose)
9
Partitioning Choices (block CGP)
  • Partial wavefront-type parallelism

(Figure: 2D block partitioning of the domain across Processors 0-3)
10
Partitioning Choices (multipartitioning)
  • Full parallelism for sweeping along any
    partitioned dimension

(Figure: multipartitioning of the domain across Processors 0-3 — each
processor owns one tile in every slab of each partitioned dimension)
11
NAS SP & BT Benchmarks
  • NAS SP & BT benchmarks from NASA Ames
  • use ADI to solve the Navier-Stokes equations in 3D
  • forward & backward line sweeps along each
    dimension, for each time step
  • SP solves scalar penta-diagonal systems
  • BT solves block-tridiagonal systems
  • SP has double the communication volume and
    frequency

12
Experimental Setup
  • 2 versions from NASA, each written in Fortran 77
  • parallel MPI hand-coded version
  • sequential version (3500 lines)
  • dHPF input: sequential version + HPF directives
    (including MULTI; 2% line count increase)
  • Inlined several procedures manually
  • enables dHPF to overlap local computation with
    communication without interprocedural tiling
  • Platform: SGI Origin 2000 (128 250-MHz procs.),
    SGI's MPI implementation, SGI's compilers

13
Performance Comparison
  • Compare four versions of NAS SP & BT
  • Multipartitioned MPI hand-coded version from NASA
  • different executables for each number of
    processors
  • Multipartitioned dHPF-generated version
  • single executable for all numbers of processors
  • Block-partitioned dHPF-generated version (with
    coarse-grain pipelining, using a 2D partition)
  • single executable for all numbers of processors
  • Block-partitioned pghpf-compiled version from
    PGI's source code (using a full transpose with a
    1D partition)
  • single executable for all numbers of processors

14
Efficiency for NAS SP (102³, class B size)
(chart annotations: "similar comm. volume, more serialization";
"> 2x multipartitioning comm. volume")
15
Efficiency for NAS BT (102³, class B size)
(chart annotation: "> 2x multipartitioning comm. volume")
16
Overview
  • Introduction
  • Line Sweep Computations
  • Performance Comparison
  • Optimization Evaluation
  • Partially Replicated Computation
  • Interprocedural Communication Elimination
  • Communication Coalescing
  • Direct Access Buffers
  • Conclusions

17
Evaluation Methodology
  • All versions are dHPF-generated using
    multipartitioning
  • Turn off a particular optimization (n - 1
    approach)
  • determine overhead without it (% over fully
    optimized)
  • Measure its contribution to overall performance
  • total execution time
  • total communication volume
  • L2 data cache misses (where appropriate)
  • Class A (64³) and class B (102³) problem sizes on
    two different processor counts (16 & 64
    processors)

18
Partially Replicated Computation
(Figure annotations)
SHADOW a(2, 2)
ON_HOME a(i-2, j) ∪ ON_HOME a(i+2, j) ∪ ON_HOME
a(i, j-2) ∪ ON_HOME a(i-1, j+1) ∪ ON_HOME a(i, j)
ON_EXT_HOME a(i, j)
  • Partial computation replication is used to reduce
    communication (a hedged sketch follows)
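For concreteness, a minimal sketch in the spirit of the backup-slide example (slide 34), with the loop body and constants assumed: the extended ON_HOME assigns a statement to the union of owners, so each processor redundantly computes a few boundary values into its shadow region instead of receiving them.

program replication_sketch
  integer, parameter :: n = 64
  real :: a(n, n), b(n, n), u(n, n)
  integer :: i, j
  a = 0.0
  u = 1.0
  do i = 1, n
    do j = 2, n
      ! executed ON_HOME a(i,j) ∪ ON_HOME a(i,j+1): each processor also
      ! computes the a values just below its partition boundary ...
      a(i, j) = u(i, j-1) + 1.0
      ! ... so a(i,j-1), computed redundantly in the previous j
      ! iteration, is already local here (ON_HOME a(i,j)) and no
      ! message is needed for it
      b(i, j) = u(i, j-1) + a(i, j-1)
    end do
  end do
end program replication_sketch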

19
Impact of Partial Replication
  • BT: eliminate comm. for 5D arrays fjac and njac
    in lhs<x,y,z>
  • Both: eliminate comm. for six 3D arrays in
    compute_rhs

20
Impact of Partial Replication (cont.)
21
Interprocedural Communication Reduction
Extensions to HPF/JA Directives
  • REFLECT: placement of near-neighbor communication
  • LOCAL: communication not needed within a scope
  • extended ON_HOME: partial computation replication
  • Compiler doesn't need full interprocedural
    communication and availability analyses to
    determine whether data in overlap regions / comm.
    buffers is fresh

22
Interprocedural Communication Reduction (cont.)
(Figure annotations: shadow data arriving from the top and left
neighbors)
SHADOW a(2, 1)   REFLECT (a(0:0, 1:0), a(1:0, 0:0))
SHADOW a(2, 1)   REFLECT (a)
  • The combination of REFLECT, extended ON_HOME and
    LOCAL reduces communication volume by 13%,
    resulting in a 9% reduction in execution time

23
Normalizing Communication
do i = 1, n
  do j = 2, n - 2
    a(i, j)     = a(i, j - 2)    ! ON_HOME a(i, j)
    a(i, j + 2) = a(i, j)        ! ON_HOME a(i, j + 2)
  enddo
enddo
Same non-local data needed
(Figure: for both references, processor P1 needs the same boundary
columns of a from its neighbor P0)
24
Coalescing Communication
(Figure: the shadow-region updates of array A required by several
references are combined into one coalesced message per neighbor; a
hedged MPI sketch follows)
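To make this concrete, a hedged MPI sketch in Fortran (array names, sizes, and the neighbor topology are assumptions, not taken from the slides): the boundary columns of two arrays are packed into one buffer and exchanged as a single message per neighbor instead of one message per array or per reference.

program coalesce_sketch
  use mpi
  implicit none
  integer, parameter :: n = 64
  double precision :: a(n, n), b(n, n), sendbuf(2*n), recvbuf(2*n)
  integer :: rank, nprocs, left, right, ierr, status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  right = mod(rank + 1, nprocs)
  left  = mod(rank + nprocs - 1, nprocs)
  a = rank
  b = 2 * rank

  ! pack the boundary column of both arrays into one buffer ...
  sendbuf(1:n)     = a(:, n)
  sendbuf(n+1:2*n) = b(:, n)
  ! ... and exchange a single coalesced message with each neighbor
  call MPI_Sendrecv(sendbuf, 2*n, MPI_DOUBLE_PRECISION, right, 0, &
                    recvbuf, 2*n, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)

  call MPI_Finalize(ierr)
end program coalesce_sketch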
25
Impact of Normalized Coalescing
26
Impact of Normalized Coalescing
Key optimization for scalability
27
Direct Access Buffers
  • Choices for receiving complex coalesced messages
  • Unpack them into the shadow regions
  • two simultaneous live copies in cache
  • unpacking can be costly
  • uniform access to non-local & local data
  • Reference them directly out of the receive buffer
  • introduces two modes of access for data
    (non-local & interior)
  • overhead of having a single loop with these two
    modes is high
  • loops should be split into non-local & interior
    portions, according to the data they reference
    (see the sketch below)
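A minimal sketch of the split (array names, bounds, and the sweep body are assumed): the boundary iterations read their non-local operand directly out of the receive buffer, while the interior iterations touch only local data, so nothing is unpacked into a shadow region.

program split_sketch
  integer, parameter :: n = 64, nlocal = 16
  real :: x(nlocal, n), rbuf(n)
  integer :: i, j
  x = 1.0
  rbuf = 2.0         ! stands in for a received coalesced message
  ! boundary portion: the i-1 operand is non-local; read it directly
  ! from the receive buffer instead of unpacking into a shadow region
  do j = 1, n
    x(1, j) = x(1, j) + 0.5 * rbuf(j)
  end do
  ! interior portion: all operands are local
  do j = 1, n
    do i = 2, nlocal
      x(i, j) = x(i, j) + 0.5 * x(i-1, j)
    end do
  end do
end program split_sketch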

28
Impact of Direct Access Buffers
  • Use direct access buffers for the main swept
    arrays
  • Direct access buffers + loop splitting reduce L2
    data cache misses by 11%, resulting in an 11%
    reduction in execution time

29
Conclusions
  • Compiler-generated code can match the performance
    of sophisticated hand-coded parallelizations
  • High performance comes from the aggregate benefit
    of multiple optimizations
  • Everything affects scalability: good parallel
    algorithms are only the starting point; excellent
    resource utilization on the target machine is
    also needed
  • Data-parallel compilers should target each
    potential source of inefficiency in the generated
    code, if they want to deliver the performance
    scientific users demand

30
Efficiency for NAS SP (64³, class A size)
31
Efficiency for NAS BT (64³, class A size)
32
Data Partitioning
33
Data Partitioning (cont.)
34
Partially Replicated Computation
(Figure: local portions and shadow regions of arrays A, U, and B on
processors p and p+1, showing which shadow values are filled by
replicated computation and which still require communication)
do i = 1, n
  do j = 2, n
    a(i,j) = u(i,j-1) + 1.0        ! ON_HOME a(i,j) ∪ ON_HOME a(i,j+1)
    b(i,j) = u(i,j-1) + a(i,j-1)   ! ON_HOME a(i,j)
  enddo
enddo
35
Using HPF/JA for Comm. Elimination
36
Using HPF/JA for Comm. Elimination
37
Normalized Comm. Coalescing (cont.)
do timestep = 1, T
  do j = 1, n
  do i = 3, n
    a(i, j) = a(i - 1, j) + b(i - 1, j)        ! ON_HOME a(i, j)
  enddo ; enddo
  do j = 1, n
  do i = 1, n - 2
    a(i + 2, j) = a(i + 3, j) + b(i + 1, j)    ! ON_HOME a(i + 2, j)
  enddo ; enddo
  do j = 1, n
  do i = 1, n - 1
    a(i + 1, j) = a(i + 2, j) + b(i + 1, j)    ! ON_HOME b(i + 1, j)
  enddo ; enddo
enddo
Coalesce communication at this point
38
Impact of Direct Access Buffers
39
Impact of Direct Access Buffers
40
Direct Access Buffers
(Figure: pack, send, receive & unpack between Processor 0 and
Processor 1 — the message is unpacked into the shadow region)
41
Direct Access Buffers
(Figure: pack, send & receive between Processor 0 and Processor 1 —
the data is then used directly from the receive buffer, no unpack)