Title: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C
1. An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C
- Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey - Rice University
- Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao - George Washington University
- Daniel Chavarria-Miranda - Pacific Northwest National Laboratory
2. GAS Languages
- Global address space programming model
  - one-sided communication (GET/PUT)
- Programmer has control over performance-critical factors
  - data distribution and locality control
  - computation partitioning
  - communication placement
- Data movement and synchronization as language primitives
  - amenable to compiler-based communication optimization
  - simpler than message passing
3. Questions
- Can GAS languages match the performance of hand-tuned message passing programs?
- What are the obstacles to obtaining performance with GAS languages?
- What should be done to ameliorate them?
  - by language modifications or extensions
  - by compilers
  - by run-time systems
- How easy is it to develop high performance programs in GAS languages?
4. Approach
- Evaluate CAF and UPC using the NAS Parallel Benchmarks
- Compare performance to that of the MPI versions
  - use hardware performance counters to pinpoint differences
- Determine optimization techniques common to both languages as well as language-specific optimizations
  - language features
  - program implementation strategies
  - compiler optimizations
  - runtime optimizations
- Assess programmability of the CAF and UPC variants
5. Outline
- Questions and approach
- CAF and UPC
  - Features
  - Compilers
  - Performance considerations
- Experimental evaluation
- Conclusions
6. CAF and UPC Common Features
- SPMD programming model
- Both private and shared data
- Language-level one-sided shared-memory communication
- Synchronization intrinsic functions (barrier, fence)
- Pointers and dynamic allocation
7. CAF and UPC Differences I
- Multidimensional arrays
  - CAF: multidimensional arrays, procedure argument reshaping
  - UPC: linearization, typically using macros
- Local accesses to shared data
  - CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N)
  - UPC: shared array reference using MYTHREAD, or a C pointer
8. CAF and UPC Differences II
- Scalar/element-wise remote accesses
  - CAF: multidimensional subscripts, bracket syntax for the image
    - a(1,1) = a(1,M)[this_image()-1]
  - UPC: shared (flat) array access with linearized subscripts
    - a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N]
- Bulk and strided remote accesses
  - CAF: use the natural syntax of Fortran 90 array sections and operations on remote co-array sections (fewer temporaries on SMPs)
  - UPC: use library functions (and temporary storage to hold a copy)
9. Bulk Communication
- CAF:
  integer :: a(N,M)[*]
  a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]
- UPC:
  shared int a[N*M*THREADS];
  upc_memget(&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));
10. CAF and UPC Differences III
- Synchronization
  - CAF: team synchronization
  - UPC: split-phase barrier, locks
- UPC: worksharing construct upc_forall
- UPC: richer set of pointer types
11. Outline
- Questions and approach
- CAF and UPC
  - Features
  - Compilers
  - Performance considerations
- Experimental evaluation
- Conclusions
12. CAF Compilers
- Rice Co-Array Fortran Compiler (cafc)
  - Multi-platform compiler
  - Implements the core of the language
    - core sufficient for non-trivial codes
    - currently lacks support for derived-type and dynamic co-arrays
  - Source-to-source translator
    - translates CAF into Fortran 90 plus communication code
    - uses ARMCI or GASNet as communication substrate
    - can generate load/store code for remote data accesses on SMPs
  - Performance comparable to that of hand-tuned MPI codes
  - Open source
- Vendor compilers: Cray
13. UPC Compilers
- Berkeley UPC Compiler
  - Multi-platform compiler
  - Implements the full UPC 1.1 specification
  - Source-to-source translator
    - converts UPC into ANSI C plus calls to the UPC runtime library and GASNet
    - tailors generated code to a specific architecture: cluster or SMP
  - Open source
- Intrepid UPC compiler
  - Based on the GCC compiler
  - Works on SGI Origin, Cray T3E, and Linux SMPs
- Other vendor compilers: Cray, HP
14. Outline
- Motivation and goals
- CAF and UPC
  - Features
  - Compilers
  - Performance considerations
- Experimental evaluation
- Conclusions
15. Scalar Performance
- Generate code amenable to back-end compiler optimizations
- Quality of back-end compilers matters
  - poor reduction recognition in the Intel C compiler
- Local access to shared data
  - CAF: use F90 pointers and procedure arguments
  - UPC: use C pointers instead of UPC shared pointers
- Alias and dependence analysis
  - Fortran vs. C language semantics
  - multidimensional arrays in Fortran
  - procedure argument reshaping
- Convey the lack of aliasing for (non-aliased) shared variables
  - CAF: use procedure splitting so co-arrays are referenced as procedure arguments
  - UPC: use the C99 restrict keyword for C pointers used to access shared data
16. Communication
- Communication vectorization is essential for high performance on cluster architectures in both languages
- CAF: use F90 array sections (the compiler translates them into appropriate library calls)
- UPC: use library functions for contiguous transfers
  - use UPC extensions for strided transfers in the Berkeley UPC compiler
- Increase the efficiency of strided transfers by packing/unpacking data at the language level
17. Synchronization
- Barrier-based synchronization
  - can lead to over-synchronized code
- Use point-to-point synchronization instead
  - CAF: proposed language extension (sync_notify, sync_wait)
  - UPC: language-level implementation
18. Outline
- Questions and approach
- CAF and UPC
- Experimental evaluation
- Conclusions
19. Platforms and Benchmarks
- Platforms
  - Itanium2 + Myrinet 2000 (900 MHz Itanium2)
  - Alpha + Quadrics QSNetI (1 GHz Alpha EV6.8CB)
  - SGI Altix 3000 (1.5 GHz Itanium2)
  - SGI Origin 2000 (R10000)
- Codes
  - NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
  - MG, CG, SP, BT
  - CAF and UPC versions were derived from the Fortran 77 + MPI versions
20. MG class A (256^3) on Itanium2 + Myrinet 2000 (higher is better)
21. MG class C (512^3) on SGI Altix 3000 (higher is better)
- Fortran compiler: linearized array subscripts give a 30% slowdown compared to multidimensional subscripts
22. MG class B (256^3) on SGI Origin 2000 (higher is better)
23. CG class C (150000) on SGI Altix 3000 (higher is better)
24. CG class B (75000) on SGI Origin 2000 (higher is better)
25. SP class C (162^3) on Itanium2 + Myrinet 2000 (higher is better)
26. SP class C (162^3) on Alpha + Quadrics (higher is better)
27. BT class C (162^3) on Itanium2 + Myrinet 2000 (higher is better)
28. BT class B (102^3) on SGI Altix 3000 (higher is better)
29. Conclusions
- Matching MPI performance required using bulk communication
  - library-based bulk primitives are cumbersome in UPC
  - communicating multi-dimensional array sections is natural in CAF
  - lack of efficient run-time support for strided communication is a problem
- With CAF, performance comparable to MPI is achievable
- With UPC, matching MPI performance can be difficult
  - CG: able to match MPI on all platforms
  - SP, BT, MG: a substantial gap remains
30. Why the Gap?
- The communication layer is not the problem
  - CAF with ARMCI or GASNet yields equivalent performance
- Scalar code optimization of scientific code is the key!
  - SP, BT: SGI Fortran applies unroll-and-jam and software pipelining
  - MG: SGI Fortran applies loop alignment and fusion
  - CG: Intel Fortran generates an optimized sum reduction
- Linearized subscripts for multidimensional arrays hurt!
  - measured a 30% performance gap with Intel Fortran
31. Programming for Performance
- In the absence of effective optimizing compilers for CAF and UPC, achieving high performance is difficult
- To make codes efficient across the full range of architectures, we need:
  - better language support for synchronization
    - point-to-point synchronization is an important common case!
  - better CAF and UPC compiler support
    - communication vectorization
    - synchronization strength reduction
    - better compiler optimization of loops with complex dependence patterns
  - better run-time library support
    - efficient communication of strided array sections