Title: An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C
1. An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C
- Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey - Rice University
- Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao - George Washington University
- Daniel Chavarria-Miranda - Pacific Northwest National Laboratory
2. GAS Languages
- Global address space programming model
  - one-sided communication (GET/PUT)
- Programmer has control over performance-critical factors
  - data distribution and locality control
  - computation partitioning
  - communication placement
- Data movement and synchronization as language primitives
  - amenable to compiler-based communication optimization
  - simpler than message passing
3. Questions
- Can GAS languages match the performance of hand-tuned message passing programs?
- What are the obstacles to obtaining performance with GAS languages?
- What should be done to ameliorate them?
  - by language modifications or extensions
  - by compilers
  - by run-time systems
- How easy is it to develop high performance programs in GAS languages?
4. Approach
- Evaluate CAF and UPC using the NAS Parallel Benchmarks
- Compare performance to that of the MPI versions
  - use hardware performance counters to pinpoint differences
- Determine optimization techniques common to both languages as well as language-specific optimizations
  - language features
  - program implementation strategies
  - compiler optimizations
  - runtime optimizations
- Assess programmability of the CAF and UPC variants
5. Outline
- Questions and approach
- CAF and UPC
  - Features
  - Compilers
  - Performance considerations
- Experimental evaluation
- Conclusions
6. CAF and UPC Common Features
- SPMD programming model
- Both private and shared data
- Language-level one-sided shared-memory communication
- Synchronization intrinsic functions (barrier, fence)
- Pointers and dynamic allocation
7. CAF and UPC Differences I
- Multidimensional arrays
  - CAF: multidimensional arrays, procedure argument reshaping
  - UPC: linearization, typically using macros
- Local accesses to shared data
  - CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N)
  - UPC: shared array reference using MYTHREAD, or a C pointer
8. CAF and UPC Differences II
- Scalar/element-wise remote accesses
  - CAF: multidimensional subscripts, bracket syntax for the image
    - a(1,1) = a(1,M)[this_image()-1]
  - UPC: shared (flat) array access with linearized subscripts
    - a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N]
- Bulk and strided remote accesses
  - CAF: use the natural syntax of Fortran 90 array sections and operations on remote co-array sections (fewer temporaries on SMPs)
  - UPC: use library functions (and temporary storage to hold a copy)
9. Bulk Communication
- CAF:
  integer :: a(N,M)[*]
  a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]
- UPC:
  shared int a[N*M*THREADS];
  upc_memget(&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));
10. CAF and UPC Differences III
- Synchronization
  - CAF: team synchronization
  - UPC: split-phase barrier, locks
- UPC: worksharing construct upc_forall
- UPC: richer set of pointer types
11. Outline
- Questions and approach
- CAF and UPC
  - Features
  - Compilers
  - Performance considerations
- Experimental evaluation
- Conclusions
12. CAF Compilers
- Rice Co-Array Fortran Compiler (cafc)
  - Multi-platform compiler
  - Implements the core of the language
    - core sufficient for non-trivial codes
    - currently lacks support for derived-type and dynamic co-arrays
  - Source-to-source translator
    - translates CAF into Fortran 90 plus communication code
    - uses ARMCI or GASNet as communication substrate
    - can generate load/store code for remote data accesses on SMPs
  - Performance comparable to that of hand-tuned MPI codes
  - Open source
- Vendor compilers: Cray
13. UPC Compilers
- Berkeley UPC Compiler
  - Multi-platform compiler
  - Implements the full UPC 1.1 specification
  - Source-to-source translator
    - converts UPC into ANSI C plus calls to the UPC runtime library and GASNet
    - tailors generated code to a specific architecture: cluster or SMP
  - Open source
- Intrepid UPC compiler
  - Based on the GCC compiler
  - Works on SGI Origin, Cray T3E, and Linux SMPs
- Other vendor compilers: Cray, HP
14. Outline
- Motivation and goals
- CAF and UPC
  - Features
  - Compilers
  - Performance considerations
- Experimental evaluation
- Conclusions
15. Scalar Performance
- Generate code amenable to back-end compiler optimizations
- Quality of back-end compilers matters
  - poor reduction recognition in the Intel C compiler
- Local access to shared data
  - CAF: use F90 pointers and procedure arguments
  - UPC: use C pointers instead of UPC shared pointers
- Alias and dependence analysis
  - Fortran vs. C language semantics
  - multidimensional arrays in Fortran
  - procedure argument reshaping
- Convey the lack of aliasing for (non-aliased) shared variables
  - CAF: use procedure splitting so co-arrays are referenced as procedure arguments
  - UPC: use the C99 restrict keyword for C pointers used to access shared data
16. Communication
- Communication vectorization is essential for high performance on cluster architectures in both languages
- CAF: use F90 array sections (the compiler translates them into appropriate library calls)
- UPC: use library functions for contiguous transfers
  - use UPC extensions for strided transfers in the Berkeley UPC compiler
- Increase the efficiency of strided transfers by packing/unpacking data at the language level
17. Synchronization
- Barrier-based synchronization
  - can lead to over-synchronized code
- Use point-to-point synchronization instead
  - CAF: proposed language extension (sync_notify, sync_wait)
  - UPC: language-level implementation
18. Outline
- Questions and approach
- CAF and UPC
- Experimental evaluation
- Conclusions
19. Platforms and Benchmarks
- Platforms
  - Itanium2 + Myrinet 2000 (900 MHz Itanium2)
  - Alpha + Quadrics QSNetI (1 GHz Alpha EV6.8CB)
  - SGI Altix 3000 (1.5 GHz Itanium2)
  - SGI Origin 2000 (R10000)
- Codes
  - NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
  - MG, CG, SP, BT
  - CAF and UPC versions were derived from the Fortran 77 + MPI versions
20. MG class A (256^3) on Itanium2 + Myrinet 2000 (higher is better)
21. MG class C (512^3) on SGI Altix 3000 (higher is better)
- Fortran compiler: linearized array subscripts give a 30% slowdown compared to multidimensional subscripts
22. MG class B (256^3) on SGI Origin 2000 (higher is better)
23. CG class C (150000) on SGI Altix 3000 (higher is better)
24. CG class B (75000) on SGI Origin 2000 (higher is better)
25. SP class C (162^3) on Itanium2 + Myrinet 2000 (higher is better)
26. SP class C (162^3) on Alpha + Quadrics (higher is better)
27. BT class C (162^3) on Itanium2 + Myrinet 2000 (higher is better)
28. BT class B (102^3) on SGI Altix 3000 (higher is better)
29. Conclusions
- Matching MPI performance required using bulk communication
  - library-based bulk primitives are cumbersome in UPC
  - communicating multi-dimensional array sections is natural in CAF
  - lack of efficient run-time support for strided communication is a problem
- With CAF, performance comparable to MPI is achievable
- With UPC, matching MPI performance can be difficult
  - CG: able to match MPI on all platforms
  - SP, BT, MG: a substantial gap remains
30. Why the Gap?
- The communication layer is not the problem
  - CAF with ARMCI or GASNet yields equivalent performance
- Scalar code optimization of scientific code is the key!
  - SP, BT: SGI Fortran applies unroll-and-jam and software pipelining
  - MG: SGI Fortran applies loop alignment and fusion
  - CG: Intel Fortran generates an optimized sum reduction
- Linearized subscripts for multidimensional arrays hurt!
  - measured a 30% performance gap with Intel Fortran
31. Programming for Performance
- In the absence of effective optimizing compilers for CAF and UPC, achieving high performance is difficult
- To make codes efficient across the full range of architectures, we need:
  - better language support for synchronization
    - point-to-point synchronization is an important common case!
  - better CAF and UPC compiler support
    - communication vectorization
    - synchronization strength reduction
    - better compiler optimization of loops with complex dependence patterns
  - better run-time library support
    - efficient communication of strided array sections