Title: UPC and Titanium
1. UPC and Titanium
- Open-source compilers and tools for scalable global address space computing
- Kathy Yelick
- University of California, Berkeley and Lawrence Berkeley National Laboratory
2. Outline
- Global Address Languages in General
- Distinction between languages and libraries
- UPC
- Language overview
- Berkeley UPC compiler status and microbenchmarks
- Application benchmarks and plans
- Titanium
- Language overview
- Berkeley Titanium compiler status
- Application benchmarks and plans
3. Global Address Space Languages
- Explicitly-parallel programming model with SPMD parallelism
- Fixed at program start-up, typically 1 thread per processor
- Global address space model of memory
- Allows programmer to directly represent distributed data structures
- Address space is logically partitioned
- Local vs. remote memory (two-level hierarchy)
- Programmer control over performance-critical decisions
- Data layout and communication
- Performance transparency and tunability are goals
- Initial implementation can use fine-grained shared memory
- Suitable for current and future architectures
- Either shared memory or lightweight messaging is key
- Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)
4. Global Address Space
[Figure: partitioned global address space; each processor owns a shared section (X0, X1, ..., XP) plus a private area holding local pointers (ptr).]
- The languages share the global address space abstraction
- Shared memory is partitioned by processors
- Remote memory may stay remote: no automatic caching implied
- One-sided communication through reads/writes of shared variables (see the sketch below)
- Both individual and bulk memory copies
- Differ on details
- Some models have a separate private memory area
- Distributed arrays: generality and how they are constructed
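A minimal UPC sketch of this one-sided style (the flag array and neighbor pattern are illustrative, not from the slides): each thread writes directly into another thread's portion of shared memory, then reads what its neighbor wrote.

    #include <upc_relaxed.h>

    shared int flag[THREADS];   /* one element with affinity to each thread */

    int main(void) {
        /* one-sided write: store into the element owned by the next thread */
        flag[(MYTHREAD + 1) % THREADS] = MYTHREAD;
        upc_barrier;
        /* one-sided read: fetch the element owned by this thread, written remotely */
        int got = flag[MYTHREAD];
        return got == (MYTHREAD + THREADS - 1) % THREADS ? 0 : 1;
    }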
5. UPC Programming Model Features
- SPMD parallelism
- fixed number of images during execution
- images operate asynchronously
- Several kinds of array distributions (see the combined sketch after this list)
- double a[n]: a private n-element array on each processor
- shared double a[n]: an n-element shared array, with cyclic mapping
- shared [4] double a[n]: a block-cyclic array with 4-element blocks
- shared [0] double *a = (shared [0] double *) upc_alloc(n): a shared array with all elements local
- Pointers for irregular data structures
- shared double *sp: a pointer to shared data
- double *lp: a pointer to private data
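A compilable sketch (not from the slides) that puts these declarations together; the size N and the initialization are illustrative.

    #include <upc_relaxed.h>
    #define N 1024

    /* assumes a static THREADS compile environment, as in the slide's examples */
    double a_priv[N];               /* private: one N-element array per thread */
    shared double a_cyc[N];         /* shared, cyclic element layout across threads */
    shared [4] double a_blk[N];     /* shared, block-cyclic with 4-element blocks */

    int main(void) {
        /* shared array with all elements local to the calling thread */
        shared [0] double *a_loc = (shared [0] double *) upc_alloc(N * sizeof(double));

        shared double *sp = &a_cyc[MYTHREAD];   /* pointer to shared data */
        double *lp = &a_priv[0];                /* pointer to private data */

        *sp = (double) MYTHREAD;    /* shared store (local here, remote in general) */
        *lp = 0.0;                  /* always a local store */

        upc_barrier;
        upc_free(a_loc);
        return 0;
    }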
6. UPC Programming Model Features
- Global synchronization
- upc_barrier: traditional barrier
- upc_notify/upc_wait: split-phase global synchronization
- Pair-wise synchronization
- upc_lock/upc_unlock: traditional locks
- Memory consistency has two types of accesses (see the sketch after this list)
- Strict: must be performed immediately and atomically; typically a blocking round-trip message if remote
- Relaxed: still must preserve dependencies, but other processors may view these as happening out of order
- Parallel I/O
- Based on ideas in MPI I/O
- Specification for UPC by Thakur, El-Ghazawi, et al.
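A short UPC sketch (illustrative, not from the slides; the counter, flag, and lock names are assumptions) showing the synchronization features above in one program.

    #include <upc_relaxed.h>

    strict shared int ready;        /* strict: accesses are globally ordered */
    relaxed shared int counter;     /* relaxed: accesses may be reordered by the implementation */
    upc_lock_t *lock;

    int main(void) {
        lock = upc_all_lock_alloc();    /* collective lock allocation */

        /* pair-wise synchronization around a shared update */
        upc_lock(lock);
        counter += 1;
        upc_unlock(lock);

        /* split-phase barrier: independent local work can overlap the synchronization */
        upc_notify;
        /* ... local computation independent of other threads' updates ... */
        upc_wait;

        if (MYTHREAD == 0) ready = 1;   /* strict write, seen in order by all threads */
        upc_barrier;                    /* traditional barrier */
        return 0;
    }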
7. Berkeley UPC Compiler
- Compiler based on Open64
- Recently merged Rice sources
- Multiple front-ends, including gcc
- Intermediate form called WHIRL
- Current focus on C backend
- IA64 possible in future
- UPC Runtime
- Pointer representation
- Shared/distributed memory
- Communication in GASNet
- Portable
- Language-independent
[Diagram: UPC source is lowered to Higher WHIRL, optimizing transformations produce Lower WHIRL, which is emitted either as C plus runtime calls or as assembly (IA64, MIPS) plus runtime.]
8. Design for Portability & Performance
- UPC-to-C translator
- Translates UPC to C; inserts runtime calls for parallel features (see the sketch after this list)
- UPC runtime
- Allocates shared data; implements pointers-to-shared
- GASNet
- A uniform interface for low-level communication primitives
- Portability
- C is our intermediate language
- GASNet is itself layered, with a small core as the essential part
- High performance
- Native C compiler optimizes serial code
- Translator can perform communication optimizations
- GASNet can access the network directly
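As a self-contained illustration of this layering (all type and function names here, such as net_put and rt_shared_ptr_t, are invented stand-ins, not the actual Berkeley UPC runtime or GASNet APIs), a shared store like a[i] = x conceptually passes through a runtime descriptor and then a network put.

    #include <stdio.h>
    #include <string.h>

    /* "GASNet"-like layer: a uniform put primitive (stubbed as a local memcpy here) */
    static void net_put(int node, void *dst, const void *src, size_t nbytes) {
        (void)node;                 /* a real conduit would move the data over the network */
        memcpy(dst, src, nbytes);
    }

    /* "UPC runtime"-like layer: pointer-to-shared descriptor plus a shared store */
    typedef struct { int thread; void *addr; } rt_shared_ptr_t;

    static void rt_put_double(rt_shared_ptr_t p, double x) {
        net_put(p.thread, p.addr, &x, sizeof x);
    }

    /* "Translated code" layer: roughly what the translator could emit for a[i] = x */
    int main(void) {
        double chunk[4] = {0};                   /* stand-in for one thread's share of a[] */
        rt_shared_ptr_t a_i = { 0, &chunk[2] };  /* descriptor for element a[i] */
        rt_put_double(a_i, 3.14);                /* the generated runtime call */
        printf("%g\n", chunk[2]);
        return 0;
    }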
9. Berkeley UPC Compiler Status
- UPC extensions added to front-end
- Code generation complete
- Some issues related to code quality (hints to backend compilers)
- GASNet communication layer
- Running on Quadrics/Elan, IBM/LAPI, Myrinet/GM, and MPI
- Optimized for small non-blocking messages and compiled code (see the sketch after this list)
- Next step: strided and indexed put/get, leveraging ARMCI work
- UPC runtime layer
- Developed and tested on all GASNet implementations
- Supports multiple pointer representations
- Next step: direct shared memory support
- Release scheduled for later this month
- Glitch related to include files and usability to iron out
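A sketch of the kind of small non-blocking put the runtime issues, following the GASNet 1.x specification as published; the segment size, peer choice, and lack of error checking are illustrative simplifications.

    #include <stdlib.h>
    #include <gasnet.h>

    int main(int argc, char **argv) {
        gasnet_init(&argc, &argv);
        gasnet_attach(NULL, 0, GASNET_PAGESIZE, 0);     /* no AM handlers, minimal segment */

        gasnet_seginfo_t *seg = malloc(gasnet_nodes() * sizeof(*seg));
        gasnet_getSegmentInfo(seg, gasnet_nodes());     /* learn every node's segment base */

        gasnet_node_t peer = (gasnet_mynode() + 1) % gasnet_nodes();
        double value = 3.14;

        /* initiate a small non-blocking put, overlap independent work, then sync */
        gasnet_handle_t h = gasnet_put_nb(peer, seg[peer].addr, &value, sizeof value);
        /* ... independent computation could be overlapped here ... */
        gasnet_wait_syncnb(h);

        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_exit(0);
        return 0;
    }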
10. Pointer-to-Shared Representation
- UPC has three different kinds of pointers
- Block-cyclic, cyclic, and indefinite (always local)
- A pointer needs a phase to keep track of where it is in a block
- Source of overhead for updating and dereferencing
- Consumes space in the pointer
- Our runtime has special cases for
- Phaseless (cyclic and indefinite): skip phase update
- Indefinite: skip thread id update
- Pointer size/representation easily reconfigured (a struct sketch follows this list)
- 64 bits on small machines, 128 on large; word or struct
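For concreteness, one way the two representations could look in C; the field names and bit widths are illustrative assumptions, not the actual Berkeley UPC layout.

    #include <stdint.h>

    /* "Struct" representation: explicit fields, easy to manipulate, 128 bits here */
    typedef struct {
        uint64_t addr;      /* local address within the owning thread's region */
        uint32_t thread;    /* owning thread id */
        uint32_t phase;     /* position within the current block (block-cyclic only) */
    } shared_ptr_struct_t;

    /* "Word" representation: the same three fields packed into 64 bits for small
       machines; the bit split below is an assumption */
    typedef uint64_t shared_ptr_word_t;   /* e.g. 42-bit addr | 12-bit thread | 10-bit phase */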
11. Preliminary Performance
- Testbed
- Compaq AlphaServer, with Quadrics GASNet conduit
- Compaq C compiler for the translated C code
- Microbenchmarks
- Measure the cost of UPC language features and constructs
- Shared pointer arithmetic, barrier, allocation, etc.
- Vector addition: no remote communication
- NAS Parallel Benchmarks
- EP: no communication
- IS: large bulk memory operations
- MG: bulk memputs
- CG: fine-grained vs. bulk memputs
12. Performance of Shared Pointer Arithmetic
- Phaseless pointers are an important optimization
- Indefinite pointers almost as fast as regular C pointers
- General block-cyclic pointer: 7x slower for addition
- Competitive with the HP compiler, which generates native code
- Both compilers have known opportunities for improvement
13. Cost of Shared Memory Access
- Local shared accesses somewhat slower than private ones
- HP has improved local performance in newer versions
- Remote accesses worse than local, as expected
- Runtime/GASNet layering for portability is not a problem
14. NAS PB: EP
- EP (Embarrassingly Parallel) has no communication
- Serial performance via C code generation is not a problem
15. NAS PB: IS
- IS (Integer Sort) is dominated by bulk communication
- GASNet bulk communication adds no measurable overhead
16. NAS PB: MG
- MG (Multigrid) involves medium-sized bulk copies
- Berkeley reveals a slight serial performance degradation due to casts
- Berkeley-C uses the original C code for the inner loops
17. Scaling MG on the T3E
- Scalability of the language shown here for the T3E compiler
- Direct shared memory support is probably needed to be competitive on most current machines
18. Mesh Generation in UPC
- Parallel mesh generation in UPC
- 2D Delaunay triangulation
- Based on the Triangle software by Shewchuk (UCB)
- Parallel version from NERSC uses dynamic load balancing, software caching, and parallel sorting
19. Research in Optimizations
- Privatizing accesses for local memory
- In conjunction with elimination of forall-loop affinity tests (see the sketch after this list)
- Communication optimizations
- Separate get/put from sync: exploit split-phase barrier
- Message aggregation (fine-grained to bulk)
- Software caching
- Research problems
- Optimization selection based on a performance model
- Language research on the UPC memory consistency model
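A hand-written illustration of the privatization idea (not from the slides): when a upc_forall iteration is guaranteed to touch only elements with affinity to the executing thread, the shared access can be replaced by an ordinary local pointer and the affinity test disappears from the inner loop.

    #include <upc_relaxed.h>
    #define N 1024

    /* assumes a static THREADS compile environment */
    shared double a[N];

    /* original loop: every access goes through a pointer-to-shared */
    void scale_shared(double s) {
        upc_forall (int i = 0; i < N; i++; &a[i])
            a[i] *= s;
    }

    /* privatized form (sketch): iterate only over locally owned elements through
       a private pointer, so no affinity test or shared-pointer arithmetic remains */
    void scale_private(double s) {
        double *la = (double *) &a[MYTHREAD];      /* cast to private: local elements only */
        for (int i = MYTHREAD; i < N; i += THREADS)
            la[(i - MYTHREAD) / THREADS] *= s;     /* local index of global element i */
    }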
20. Preliminary Performance Results
- UPC communication optimizations
- Performed by hand
- Remote fetch-and-increment (not random data)
21. UPC Interactions
- UPC consortium
- Tarek El-Ghazawi is coordinator: semi-annual meetings, daily e-mail
- Revised UPC language specification (IDA, GWU, ...)
- UPC collectives (MTU)
- UPC I/O specification (GWU, ANL-PModels)
- Other implementations
- HP (Alpha cluster and C+MPI compiler, with MTU)
- MTU (C+MPI compiler based on the HP compiler; memory model)
- Cray (X1 implementation)
- Intrepid (SGI implementation based on gcc)
- Etnus (debugging)
- UPC book: T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
- Goal is to have proofs by SC03
- HP HPCS effort
- Recent interest from Sandia
22. Titanium
- Based on Java, a cleaner C++
- classes, automatic memory management, etc.
- compiled to C and then to native binary (no JVM)
- Same parallelism model as UPC and CAF
- SPMD with a global address space
- Dynamic Java threads are not supported
- Optimizing compiler
- static (compile-time) optimizer, not a JIT
- communication and memory optimizations
- synchronization analysis (e.g., static barrier analysis)
- cache and other uniprocessor optimizations
23. Summary of Features Added to Java
- Scalable parallelism (Java threads replaced)
- Immutable (value) classes
- Multidimensional arrays with unordered iteration
- Checked Synchronization
- Operator overloading
- Templates
- Zone-based memory management (regions)
- Libraries for collective communication, distributed arrays, and bulk I/O
24. Immutable Classes in Titanium
- For small objects, would sometimes prefer
- to avoid a level of indirection
- pass by value (copy entire object)
- especially when immutable (fields never modified)
- Example:
- immutable class Complex {
-   Complex() { real = 0; imag = 0; }
-   Complex operator+(Complex c) { ... }
- }
- Complex c1 = new Complex(7.1, 4.3);
- c1 = c1 + c1;
- Addresses performance and programmability
- Similar to structs in C (not C++ classes) in terms of performance
- Adds support for complex types
25. Multidimensional Arrays
- Arrays in Java are objects
- Array bounds are checked
- Multidimensional arrays are arrays-of-arrays
- Safe and general, but potentially slow
- New kind of multidimensional array added to Titanium
- Sub-arrays are supported (interior, boundary, etc.)
- Indexed by Points (tuples of ints)
- Combined with unordered iteration to enable optimizations (a fuller sketch follows this list)
- foreach (p in A.domain()) {
-   A[p] = ...;
- }
- A could be multidimensional, an interior region, etc.
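A small Titanium-style fragment (illustrative, not from the slides; the bounds are arbitrary, and shrink(1) is the interior view as recalled from the Titanium array interface):

    int n = 8;
    RectDomain<2> d = [[1,1] : [n,n]];      // 2-D index set from [1,1] to [n,n]
    double [2d] a = new double[d];          // Titanium multidimensional array over d

    foreach (p in a.domain()) {             // unordered iteration over Points p
        a[p] = 1.0;
    }

    double [2d] interior = a.shrink(1);     // sub-array view of the interior, no copying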
26. Communication
- Titanium has explicit global communication
- Broadcast, reduction, etc.
- Primarily used to set up distributed data structures (see the sketch after this list)
- Most communication is implicit through the shared address space
- Dereferencing a global reference, g.x, can generate communication
- Arrays have copy operations, which generate bulk communication: A1.copy(A2)
- Automatically computes the intersection of A1's and A2's index sets (domains)
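A short Titanium-style sketch (illustrative, not from the slides; numLocal and the array sizes are assumptions) of the two styles: an explicit broadcast for setup, then implicit bulk communication through an array copy.

    int numLocal = 100;                     // per-process value (illustrative)
    int n = broadcast numLocal from 0;      // explicit collective: everyone gets process 0's value

    RectDomain<1> d = [0 : n-1];
    double [1d] mine  = new double[d];
    double [1d] other = new double[d];      // in a real code this could refer to a remote array

    mine.copy(other);                       // bulk copy over the intersection of the two domains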
27. Distributed Data Structures
- Building distributed arrays:
- Particle [1d] single [1d] allParticle =
-   new Particle [0 : Ti.numProcs()-1][1d];
- Particle [1d] myParticle =
-   new Particle [0 : myParticleCount-1];
- allParticle.exchange(myParticle);
- Now each processor has an array of pointers, one to each processor's chunk of particles
[Figure: all-to-all broadcast; after the exchange, P0, P1, and P2 each hold pointers to every processor's particle array.]
28. Titanium Compiler Status
- Titanium compiler runs on almost any machine
- Requires a C compiler (and a decent C++ compiler to build the translator)
- Pthreads for shared memory
- Communication layer for distributed memory (or hybrid)
- Recently moved to live on GASNet: obtained GM, Elan, and improved LAPI implementations
- Leverages other PModels work for maintenance
- Recent language extensions
- Indexed array copy (scatter/gather style)
- Non-blocking array copy under development
- Compiler optimizations
- Cache optimizations, loop optimizations
- Communication optimizations for overlap, pipelining, and scatter/gather under development
29. Applications in Titanium
- Several benchmarks
- Fluid solvers with Adaptive Mesh Refinement (AMR)
- Conjugate Gradient
- 3D Multigrid
- Unstructured mesh kernel: EM3D
- Dense linear algebra: LU, MatMul
- Tree-structured n-body code
- Finite element benchmark
- Genetics: micro-array selection
- SciMark serial benchmarks
- Larger applications
- Heart simulation
- Ocean modeling with AMR (in progress)
30. Serial Performance (Pure Java)
- Several optimizations in the Titanium compiler (tc) over the past year
- These codes are all written in pure Java without performance extensions
31. AMR for Ocean Modeling
- Ocean modeling (Wen, Colella)
- Requires embedded boundaries to model the ocean floor/coastline
- Line vs. point relaxation to handle the aspect ratio (1000 km x 10 km)
- Results in irregular data structures and array accesses
- Goal for this year
- Basin-scale AMR circulation model
- Currently a non-adaptive implementation
- Compiler and language support design
Graphics from Titanium AMR Gas Dynamics (McCorquodale, Colella)
32. Heart Simulation
- Immersed Boundary Method (Peskin/MacQueen)
- Fibers (e.g., heart muscles) modeled by lists of fiber points
- Fluid space modeled by a regular lattice
- Irregular fiber lists need to interact with the regular fluid lattice
- Trade-off between load balancing of fibers and minimizing communication
- Memory and communication intensive
- Random array access is the key performance problem
- Developed compiler optimizations to improve its performance
- Application effort funded by NSF/NPACI
33. Parallel Performance and Scalability
- Poisson solver using the Method of Local Corrections (Balls, Colella)
- Communication < 5%; scaled speedup nearly ideal (flat)
- IBM SP
- Cray T3E
34. Titanium Interactions
- GASNet interactions
- In addition to the
- Application collaborators
- Charles Peskin and Dave McQueen, Courant Institute
- Phil Colella and Tong Wen, LBNL
- Scott Baden and Greg Balls, UCSD
- Involved in Sun HPCS effort
- The GASNet work is common to UPC and Titanium
- Joint effort between U.C. Berkeley and LBNL
- (UPC project is primarily at LBNL; Titanium is at U.C. Berkeley)
- Collaboration with Nieplocha on the communication runtime
- Participation in Global Address Space tutorials
35. The End
- http://upc.nersc.gov
- http://titanium.cs.berkeley.edu/
36. NAS PB: CG
- CG (Conjugate Gradient) can be written naturally with fine-grained communication in the sparse matrix-vector product
- Worked well on the T3E (and hopefully will on the X1)
- For other machines, a bulk version is required
37. NAS MG in Titanium
- Preliminary performance for the MG code on the IBM SP