Title: Experiences Building a Multi-platform Compiler for Co-array Fortran
1. Experiences Building a Multi-platform Compiler for Co-array Fortran
- John Mellor-Crummey
- Cristian Coarfa, Yuri Dotsenko
- Department of Computer Science
- Rice University
AHPCRC PGAS Workshop, September 2005
2. Goals for HPC Languages
- Expressiveness
- Ease of programming
- Portable performance
- Ubiquitous availability
3. PGAS Languages
- Global address space programming model
  - one-sided communication (GET/PUT)
- Programmer has control over performance-critical factors
  - data distribution and locality control
  - computation partitioning
  - communication placement
- Data movement and synchronization as language primitives
  - amenable to compiler-based communication optimization
  - simpler than msg passing
4. Co-array Fortran Programming Model
- SPMD process images
  - fixed number of images during execution
  - images operate asynchronously
- Both private and shared data
  - real x(20,20)      a private 20x20 array in each image
  - real y(20,20)[*]   a shared 20x20 array in each image
- Simple one-sided shared-memory communication (see the sketch after this list)
  - x(:,j:j+2) = y(:,p:p+2)[r]   copies columns from image r into local columns
- Synchronization intrinsic functions
  - sync_all: a barrier and a memory fence
  - sync_mem: a memory fence
  - sync_team(team members to notify, team members to wait for)
- Pointers and (perhaps asymmetric) dynamic allocation
- Parallel I/O
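A minimal compilable sketch of this model, written with Fortran 2008 coarray syntax (so the sync all statement stands in for the sync_all intrinsic above); the image index r and the column ranges are illustrative choices, not from the slide:

    program caf_model_sketch
      implicit none
      real    :: x(20,20)       ! private: an independent 20x20 array in each image
      real    :: y(20,20)[*]    ! co-array: remotely accessible from every image
      integer :: r, j, p

      y = real(this_image())    ! each image fills its local part of y
      sync all                  ! barrier + memory fence before any remote reads

      r = 1; j = 1; p = 1       ! illustrative image index and column offsets
      if (this_image() /= r) then
        x(:, j:j+2) = y(:, p:p+2)[r]   ! one-sided GET of three columns from image r
      end if
      sync all
    end program caf_model_sketch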
5. One-sided Communication with Co-Arrays
[Figure: one-sided GET and PUT transfers among images 1 through N]
6. CAF Compilers
- Cray compilers for the X1 and T3E architectures
- Rice Co-Array Fortran Compiler (cafc)
7. Rice cafc Compiler
- Source-to-source compiler
  - source-to-source yields multi-platform portability
- Implements core language features
  - core sufficient for non-trivial codes
  - preliminary support for derived types
  - soon: support for allocatable components
- Open source
- Performance comparable to that of hand-tuned MPI codes
8. Implementation Strategy
- Goals
  - portability
  - high performance on a wide range of platforms
- Approach
  - source-to-source compilation of CAF codes
    - use the Open64/SL Fortran 90 infrastructure
    - CAF → Fortran 90 + communication operations
  - communication
    - ARMCI and GASNet one-sided communication libraries for portability
    - load/store communication on shared-memory platforms
9. Key Implementation Concerns
- Fast access to local co-array data
- Fast communication
- Overlap of communication and computation
10. Accessing Co-Array Data
- Two representations
  - SAVE and COMMON co-arrays as Fortran 90 pointers
    - F90 pointers to memory allocated outside the Fortran run-time system (see the sketch after the code below)
    - original reference accessing local co-array data:
      rhs(1,i,j,k,c) = u(1,i-1,j,k,c) - ...
    - transformed reference:
      rhs%ptr(1,i,j,k,c) = u%ptr(1,i-1,j,k,c) - ...
  - Procedure co-array arguments as F90 explicit-shape arrays
    - the CAF language requires explicit shape for co-array arguments
    ! CAF declaration:
    real :: a(10,10,10)[*]

    ! cafc representation:
    type CAFDesc_real_3
       real, pointer :: ptr(:,:,:)   ! F90 pointer to local co-array data
    end type CAFDesc_real_3
    type(CAFDesc_real_3) :: a
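The fragment below is a compilable sketch, not cafc's actual generated code, of how such a descriptor's F90 pointer can be associated with memory allocated outside the Fortran run-time system (for example by ARMCI or GASNet); the module name, the init routine, and the raw argument are illustrative assumptions.

    module caf_desc_sketch
      use iso_c_binding
      implicit none
      type CAFDesc_real_3
         real, pointer :: ptr(:,:,:)   ! F90 pointer to the local co-array data
      end type CAFDesc_real_3
      type(CAFDesc_real_3) :: a        ! stands in for:  real, save :: a(10,10,10)[*]
    contains
      subroutine init_a(raw)
        ! raw is assumed to come from the one-sided communication runtime
        type(c_ptr), intent(in) :: raw
        call c_f_pointer(raw, a%ptr, [10, 10, 10])  ! give it Fortran shape and bounds
      end subroutine init_a

      subroutine touch_a(i, j, k)
        integer, intent(in) :: i, j, k
        a%ptr(i, j, k) = 0.0           ! local accesses go through the pointer
      end subroutine touch_a
    end module caf_desc_sketch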
11. Performance Challenges
- Problem
  - the Fortran 90 pointer-based representation does not convey
    - the lack of co-array aliasing
    - contiguity of co-array data
    - co-array bounds information
  - lack of this knowledge inhibits important code optimizations
- Approach: procedure splitting
12. Procedure Splitting

Original code:

    subroutine f()
      real, save :: c(100)[*]
      ...
      ... c(50) ...
    end subroutine f

After the CAF-to-CAF procedure-splitting optimization:

    subroutine f()
      real, save :: c(100)[*]
      interface
        subroutine f_inner(..., c_arg)
          real :: c_arg[*]
        end subroutine f_inner
      end interface
      call f_inner(..., c(1))
    end subroutine f

    subroutine f_inner(..., c_arg)
      real :: c_arg(100)[*]
      ...
      ... c_arg(50) ...
    end subroutine f_inner
- Benefits
  - better alias analysis
  - contiguity of co-array data
  - co-array bounds information
  - better dependence analysis
- Result: the back-end compiler can generate better code
13. Implementing Communication
- x(1:n) = a(1:n)[p]
- General approach: use a buffer to hold off-processor data (see the sketch after this list)
  - allocate buffer
  - perform GET to fill buffer
  - perform computation: x(1:n) = buffer(1:n)
  - deallocate buffer
- Optimizations
  - no buffer for co-array to co-array copies
  - unbuffered load/store on shared-memory systems
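A source-level sketch of the buffered translation for x(1:n) = a(1:n)[p], written with standard coarray syntax; in cafc the GET would be a call into ARMCI or GASNet rather than a coarray assignment.

    subroutine buffered_get(x, a, n, p)
      integer, intent(in)  :: n, p
      real,    intent(out) :: x(n)
      real,    intent(in)  :: a(n)[*]      ! co-array dummy argument
      real, allocatable    :: buffer(:)

      allocate(buffer(n))                  ! temporary for the off-processor data
      buffer(1:n) = a(1:n)[p]              ! GET: fill the buffer from image p
      x(1:n) = buffer(1:n)                 ! perform the computation on the buffer
      deallocate(buffer)                   ! release the temporary
    end subroutine buffered_get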
14. Strided vs. Contiguous Transfers
- Problem
  - a CAF remote reference might induce many small data transfers
  - a(i,1:n)[p] = b(j,1:n)
- Solution
  - pack strided data on the source and unpack it on the destination
- Constraints
  - can't express both source-level packing and unpacking for a one-sided transfer (see the sketch after this list)
  - two-sided packing/unpacking is awkward for users
- Preferred approach
  - have the communication layer perform packing/unpacking
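A sketch of what source-level packing would look like for a(i,1:n)[p] = b(j,1:n): the source image (called q here, an illustrative name) packs the strided row and PUTs it contiguously, but the destination image must then unpack it itself, which turns the one-sided PUT into a two-sided exchange. The staging co-array is also an assumption of the sketch.

    subroutine packed_put(a, b, stage, n, i, j, p, q)
      integer, intent(in)    :: n, i, j, p, q
      real,    intent(inout) :: a(n,n)[*]
      real,    intent(in)    :: b(n,n)
      real,    intent(inout) :: stage(n)[*]   ! contiguous staging co-array

      if (this_image() == q) then
        stage(1:n)[p] = b(j,1:n)   ! pack the strided row and PUT it contiguously
      end if
      sync all                     ! the destination must participate ...
      if (this_image() == p) then
        a(i,1:n) = stage(1:n)      ! ... and unpack into its strided section
      end if
    end subroutine packed_put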
15. Pragmatics of Packing
- Who should implement packing?
  - CAF programmer
    - difficult to program
  - CAF compiler
    - must convert PUTs into two-sided communication to unpack
    - difficult whole-program transformation
  - Communication library
    - most natural place
    - ARMCI currently performs packing on Myrinet (at least)
16. Synchronization
- Original CAF specification: team synchronization only
  - sync_all, sync_team
  - limits performance on loosely-coupled architectures
- Point-to-point extensions
  - sync_notify(q)
  - sync_wait(p)
- Point-to-point synchronization semantics (see the sketch after this list)
  - delivery of a notify to q from p implies that all communication from p to q issued before the notify has been delivered to q
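A sketch of the intended producer/consumer use of these extensions; sync_notify and sync_wait are the Rice extensions named above (not standard Fortran), and the data arrays and image indices p and q are illustrative.

    subroutine handoff(x, a, b, n, p, q)
      integer, intent(in)    :: n, p, q
      real,    intent(inout) :: x(n)[*]   ! communication buffer co-array
      real,    intent(in)    :: a(n)      ! data produced on image p
      real,    intent(out)   :: b(n)      ! data consumed on image q

      if (this_image() == p) then
        x(1:n)[q] = a(1:n)        ! PUT the data to image q ...
        call sync_notify(q)       ! ... then notify q
      else if (this_image() == q) then
        call sync_wait(p)         ! notify delivered => p's PUT has completed
        b(1:n) = x(1:n)           ! safe to read the freshly delivered data
      end if
    end subroutine handoff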
17. Hiding Communication Latency
- Goal: enable communication/computation overlap
- Impediments to generating non-blocking communication
  - use of indexed subscripts in co-dimensions
  - lack of whole-program analysis
- Approach: support hints for non-blocking communication
  - overcome conservative compiler analysis
  - enable sophisticated programmers to achieve good performance today
18. Questions about PGAS Languages
- Performance
  - can performance match hand-tuned msg passing programs?
  - what are the obstacles to top performance?
  - what should be done to overcome them?
    - language modifications or extensions?
    - program implementation strategies?
    - compiler technology?
    - run-time system enhancements?
- Programmability
  - how easy is it to develop high-performance programs?
19. Investigating these Issues
- Evaluate CAF, UPC, and MPI versions of the NAS benchmarks
- Performance
  - compare CAF and UPC performance to that of the MPI versions
  - use hardware performance counters to pinpoint differences
  - determine optimization techniques common to both languages as well as language-specific optimizations
    - language features
    - program implementation strategies
    - compiler optimizations
    - runtime optimizations
- Programmability
  - assess programmability of the CAF and UPC variants
20. Platforms and Benchmarks
- Platforms
  - Itanium2 + Myrinet 2000 (900 MHz Itanium2)
  - Alpha + Quadrics QSNetI (1 GHz Alpha EV68CB)
  - SGI Altix 3000 (1.5 GHz Itanium2)
  - SGI Origin 2000 (R10000)
- Codes
  - NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
  - MG, CG, SP, BT
  - CAF and UPC versions were derived from the Fortran77+MPI versions
21. MG class A (256³) on Itanium2 + Myrinet 2000
Higher is better
22. MG class C (512³) on SGI Altix 3000
Fortran compiler linearized array subscripts: 30% slowdown compared to multidimensional subscripts
Higher is better
23. MG class B (256³) on SGI Origin 2000
Higher is better
24. CG class C (150000) on SGI Altix 3000
Higher is better
25. CG class B (75000) on SGI Origin 2000
Higher is better
26. SP class C (162³) on Itanium2 + Myrinet 2000
Higher is better
27. SP class C (162³) on Alpha + Quadrics
Higher is better
28. BT class C (162³) on Itanium2 + Myrinet 2000
Higher is better
29. BT class B (102³) on SGI Altix 3000
Higher is better
30. Performance Observations
- Achieving the highest performance can be difficult
  - need effective optimizing compilers for PGAS languages
- The communication layer is not the problem
  - CAF with ARMCI or GASNet yields equivalent performance
- Scalar code optimization of scientific code is the key!
  - SP, BT: SGI Fortran: unroll-and-jam, software pipelining (SWP)
  - MG: SGI Fortran: loop alignment, fusion
  - CG: Intel Fortran: optimized sum reduction
- Linearized subscripts for multidimensional arrays hurt!
  - measured a 30% performance gap with Intel Fortran
31. Performance Prescriptions
- For portable high performance, we need:
  - Better language support for CAF synchronization
    - point-to-point synchronization is an important common case!
    - currently only a Rice extension outside the CAF standard
  - Better CAF and UPC compiler support
    - communication vectorization (see the sketch after this list)
    - synchronization strength reduction: important for programmability
    - compiler optimization of loops with complex dependences
  - Better run-time library support
    - efficient communication support for strided array sections
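As an illustration of communication vectorization (a sketch in coarray syntax, not cafc output): a loop that issues one fine-grain remote read per iteration can be rewritten as a single bulk array-section transfer. Both forms below compute the same result; the subroutine performs them back to back purely for comparison.

    subroutine vectorize_get(x, y, n, p)
      integer, intent(in)  :: n, p
      real,    intent(out) :: x(n)
      real,    intent(in)  :: y(n)[*]
      integer :: k

      ! Fine-grain version: n separate one-sided GETs, one element each
      do k = 1, n
        x(k) = y(k)[p]
      end do

      ! Vectorized version: one bulk GET of the whole section
      x(1:n) = y(1:n)[p]
    end subroutine vectorize_get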
32. Programmability Observations
- Matching MPI performance required using bulk communication
  - communicating multi-dimensional array sections is natural in CAF
  - library-based primitives are cumbersome in UPC
- Strided communication is problematic for performance
  - tedious programming of packing/unpacking at the source level
- Wavefront computations
  - MPI buffered communication easily decouples sender and receiver
  - in PGAS models, buffering is explicitly managed by the programmer