An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C (PowerPoint transcript)

Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey, Rice University
Learn more at: http://caf.rice.edu

1
An Evaluation of Global Address Space Languages
Co-Array Fortran and Unified Parallel C
  • Cristian Coarfa, Yuri Dotsenko, John
    Mellor-Crummey
  • Rice University
  • Francois Cantonnet, Tarek El-Ghazawi, Ashrujit
    Mohanti, Yiyi Yao
  • George Washington University
  • Daniel Chavarria-Miranda
  • Pacific Northwest National Laboratory

2
GAS Languages
  • Global address space programming model
  • one-sided communication (GET/PUT)
  • Programmer has control over performance-critical
    factors
  • data distribution and locality control
  • computation partitioning
  • communication placement
  • Data movement and synchronization as language
    primitives
  • amenable to compiler-based communication
    optimization

Simpler than message passing
3
Questions
  • Can GAS languages match the performance of
    hand-tuned message passing programs?
  • What are the obstacles to obtaining performance
    with GAS languages?
  • What should be done to ameliorate them?
  • by language modifications or extensions
  • by compilers
  • by run-time systems
  • How easy is it to develop high performance
    programs in GAS languages?

4
Approach
  • Evaluate CAF and UPC using NAS Parallel
    Benchmarks
  • Compare performance to that of MPI versions
  • use hardware performance counters to pinpoint
    differences
  • Determine optimization techniques common for both
    languages as well as language specific
    optimizations
  • language features
  • program implementation strategies
  • compiler optimizations
  • runtime optimizations
  • Assess programmability of the CAF and UPC variants

5
Outline
  • Questions and approach
  • CAF & UPC
  • Features
  • Compilers
  • Performance considerations
  • Experimental evaluation
  • Conclusions

6
CAF & UPC Common Features
  • SPMD programming model
  • Both private and shared data
  • Language-level one-sided shared-memory
    communication
  • Synchronization intrinsic functions (barrier,
    fence)
  • Pointers and dynamic allocation

7
CAF & UPC Differences I
  • Multidimensional arrays
  • CAF: multidimensional arrays, procedure argument
    reshaping
  • UPC: linearization, typically using macros
  • Local accesses to shared data
  • CAF: Fortran 90 array syntax without brackets,
    e.g. a(1:M,N)
  • UPC: shared array reference using MYTHREAD or a C
    pointer
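The linearization that UPC programmers write by hand can be sketched in plain C. This is an illustrative example, not code from the slides: a small inline function (often a macro in practice) maps a 2-D index (i,j) onto a flat array with row length N, where N and the names are assumptions.

```c
#include <assert.h>

/* Sketch of hand-written linearization for a 2-D array stored
 * as a flat (shared) array; N is an assumed row length. */
enum { N = 8 };

static inline int idx(int i, int j) {
    return i * N + j;   /* row-major: element (i,j) of an N-column array */
}
```

A UPC program would typically wrap this in a macro such as `#define IDX(i,j) ((i)*N+(j))` and apply it to every shared-array subscript, which is exactly the clutter the CAF multidimensional syntax avoids.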

8
CAF and UPC Differences II
  • Scalar/element-wise remote accesses
  • CAF: multidimensional subscripts and bracket syntax
  • a(1,1), a(1,M)[this_image()-1]
  • UPC: shared (flat) array access with linearized
    subscripts
  • a[N*M*MYTHREAD], a[N*M*MYTHREAD-N]
  • Bulk and strided remote accesses
  • CAF: use the natural syntax of Fortran 90 array
    sections and operations on remote co-array
    sections (fewer temporaries on SMPs)
  • UPC: use library functions (and temporary storage
    to hold a copy)

9
Bulk Communication
CAF:  integer :: a(N,M)[*]
      a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]
UPC:  shared int a[N*M*THREADS];
      upc_memget(&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));
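The data movement performed by the bulk transfer above can be sketched in plain C. This is an assumption-laden illustration, not slide code: in a column-major N x M block the last two columns occupy 2*N contiguous elements, so the whole transfer reduces to one contiguous copy; the sizes and the function name are made up for the example.

```c
#include <string.h>

/* Illustrative sizes: a column-major block of NROWS x MCOLS ints. */
enum { NROWS = 4, MCOLS = 6 };

/* Copy the neighbor's last two columns (2*NROWS contiguous ints in
 * column-major layout) into the first two columns of our block --
 * the same shape of transfer as the upc_memget call above. */
void get_last_two_cols(int *dst, const int *src_block) {
    memcpy(dst, &src_block[(MCOLS - 2) * NROWS],
           2 * NROWS * sizeof(int));
}
```

Because columns are contiguous, both the CAF array-section assignment and the upc_memget call can be serviced by a single contiguous transfer; it is strided sections that force packing or per-element transfers.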
10
CAF & UPC Differences III
  • Synchronization
  • CAF: team synchronization
  • UPC: split-phase barrier, locks
  • UPC: worksharing construct upc_forall
  • UPC: richer set of pointer types

11
Outline
  • Questions and approach
  • CAF & UPC
  • Features
  • Compilers
  • Performance considerations
  • Experimental evaluation
  • Conclusions

12
CAF Compilers
  • Rice Co-Array Fortran Compiler (cafc)
  • Multi-platform compiler
  • Implements core of the language
  • core sufficient for non-trivial codes
  • currently lacks support for derived type and
    dynamic co-arrays
  • Source-to-source translator
  • translates CAF into Fortran 90 and communication
    code
  • uses ARMCI or GASNet as communication substrate
  • can generate load/store for remote data accesses
    on SMPs
  • Performance comparable to that of hand-tuned MPI
    codes
  • Open source
  • Vendor compilers: Cray

13
UPC Compilers
  • Berkeley UPC Compiler
  • Multi-platform compiler
  • Implements full UPC 1.1 specification
  • Source-to-source translator
  • converts UPC into ANSI C plus calls to the UPC
    runtime library and GASNet
  • tailors code to a specific architecture (cluster
    or SMP)
  • Open source
  • Intrepid UPC compiler
  • Based on GCC compiler
  • Works on SGI Origin, Cray T3E and Linux SMP
  • Other vendor compilers: Cray, HP

14
Outline
  • Motivation and Goals
  • CAF & UPC
  • Features
  • Compilers
  • Performance considerations
  • Experimental evaluation
  • Conclusions

15
Scalar Performance
  • Generate code amenable to backend compiler
    optimizations
  • Quality of back end compilers
  • poor reduction recognition in the Intel C
    compiler
  • Local access to shared data
  • CAF: use F90 pointers and procedure arguments
  • UPC: use C pointers instead of UPC shared
    pointers
  • Alias and dependence analysis
  • Fortran vs. C language semantics
  • multidimensional arrays in Fortran
  • procedure argument reshaping
  • Convey lack of aliasing for (non-aliased) shared
    variables
  • CAF: use procedure splitting so co-arrays are
    referenced as arguments
  • UPC: use the C99 restrict keyword for C pointers
    used to access shared data
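The restrict idiom the slide recommends can be shown in a minimal sketch (names and sizes are illustrative, not from the benchmarks): the local part of shared data is accessed through restrict-qualified C pointers, so the backend compiler may assume the arrays do not overlap and can vectorize the loop.

```c
/* Scale src into dst; restrict asserts dst and src never alias,
 * removing the per-iteration dependence the compiler would
 * otherwise have to assume for plain C pointers. */
void scale(double *restrict dst, const double *restrict src,
           double alpha, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = alpha * src[i];
}
```

Without restrict (or the analogous procedure-splitting trick in CAF), the C compiler must allow for dst and src overlapping and generates more conservative, slower code.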

16
Communication
  • Communication vectorization is essential for high
    performance on cluster architectures for both
    languages
  • CAF
  • use F90 array sections (compiler translates to
    appropriate library calls)
  • UPC
  • use library functions for contiguous transfers
  • use UPC extensions for strided transfer in
    Berkeley UPC compiler
  • Increase efficiency of strided transfers by
    packing/unpacking data at the language level
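The language-level packing mentioned above can be sketched in plain C (function names, the leading-dimension parameter, and sizes are illustrative assumptions): a strided row of a column-major array is gathered into a contiguous buffer so it can move as one bulk transfer, then scattered back on the other side.

```c
/* Gather row `row` of a column-major array with leading dimension
 * ld into a contiguous buffer (stride-ld gather). */
void pack_row(double *buf, const double *a, int row, int ncols, int ld) {
    for (int j = 0; j < ncols; j++)
        buf[j] = a[row + j * ld];
}

/* Scatter a contiguous buffer back into row `row` (stride-ld scatter). */
void unpack_row(double *a, const double *buf, int row, int ncols, int ld) {
    for (int j = 0; j < ncols; j++)
        a[row + j * ld] = buf[j];
}
```

Packing trades two extra local copies for replacing ncols small strided transfers with one contiguous transfer, which is almost always a win on cluster interconnects.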

17
Synchronization
  • Barrier-based synchronization
  • Can lead to over-synchronized code
  • Use point-to-point synchronization
  • CAF proposed language extension (sync_notify,
    sync_wait)
  • UPC language-level implementation
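The notify/wait pairing can be sketched in shared memory with C11 atomics. This is an illustration of the producer/consumer shape behind the proposed CAF sync_notify/sync_wait extension and a language-level UPC version, not either implementation; a real version keeps one flag per partner image/thread.

```c
#include <stdatomic.h>

static atomic_int ready;   /* one flag, for a single partner */

/* Producer: release-store makes prior writes visible before the flag. */
void notify(void) {
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: acquire-load spin; returns once the partner has notified,
 * then resets the flag for the next round. */
void wait_for_notify(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;   /* spin */
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
}
```

Pairwise flags like this synchronize only the two communicating parties, which is why replacing barriers with point-to-point synchronization removes the over-synchronization the slide describes.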

18
Outline
  • Questions and approach
  • CAF & UPC
  • Experimental evaluation
  • Conclusions

19
Platforms and Benchmarks
  • Platforms
  • Itanium2 + Myrinet 2000 (900 MHz Itanium2)
  • Alpha + Quadrics QSNet I (1 GHz Alpha EV68CB)
  • SGI Altix 3000 (1.5 GHz Itanium2)
  • SGI Origin 2000 (R10000)
  • Codes
  • NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
  • MG, CG, SP, BT
  • CAF and UPC versions were derived from
    Fortran77+MPI versions

20
MG class A (256³) on Itanium2 + Myrinet 2000
Higher is better
21
MG class C (512³) on SGI Altix 3000
Fortran compiler: linearized array subscripts cause a 30%
slowdown compared to multidimensional subscripts at 64 CPUs
Higher is better
22
MG class B (256³) on SGI Origin 2000
Higher is better
23
CG class C (150000) on SGI Altix 3000
Higher is better
24
CG class B (75000) on SGI Origin 2000
Higher is better
25
SP class C (162³) on Itanium2 + Myrinet 2000
Higher is better
26
SP class C (162³) on Alpha + Quadrics
Higher is better
27
BT class C (162³) on Itanium2 + Myrinet 2000
Higher is better
28
BT class B (102³) on SGI Altix 3000
Higher is better
29
Conclusions
  • Matching MPI performance required using bulk
    communication
  • library-based primitives are cumbersome in UPC
  • communicating multi-dimensional array sections is
    natural in CAF
  • lack of efficient run-time support for strided
    communication is a problem
  • With CAF, can achieve performance comparable to
    MPI
  • With UPC, matching MPI performance can be
    difficult
  • CG: able to match MPI on all platforms
  • SP, BT, MG: a substantial gap remains

30
Why the Gap?
  • Communication layer is not the problem
  • CAF with ARMCI or GASNet yields equivalent
    performance
  • Scalar code optimization of scientific code is
    the key!
  • SP & BT: SGI Fortran unroll-and-jam, software
    pipelining (SWP)
  • MG: SGI Fortran loop alignment, fusion
  • CG: Intel Fortran optimized sum reduction
  • Linearized subscripts for multidimensional arrays
    hurt!
    measured 30% performance gap with Intel Fortran

31
Programming for Performance
  • In the absence of effective optimizing compilers
    for CAF and UPC, achieving high performance is
    difficult
  • To make codes efficient across the full range of
    architectures, we need
  • better language support for synchronization
  • point-to-point synchronization is an important
    common case!
  • better CAF & UPC compiler support
  • communication vectorization
  • synchronization strength reduction
  • better compiler optimization of loops with
    complex dependence patterns
  • better run-time library support
  • efficient communication of strided array sections