Title: Compiling for Parallel Machines
1. Compiling for Parallel Machines
Kathy Yelick
2. Two General Research Goals
- Correctness: help programmers eliminate bugs
  - Analysis to detect bugs statically (and conservatively)
  - Tools such as debuggers to help detect bugs dynamically
- Performance: help make programs run faster
  - Static compiler optimizations
    - May use analyses similar to the above to ensure the compiler is correctly transforming code
    - In many areas, the open problem is determining which transformations should be applied when
  - Link- or load-time optimizations, including object code translation
  - Feedback-directed optimization
  - Runtime optimization
- For parallel machines, if you can't get good performance, what's the point?
3. A Little History
- Most research on compiling for parallel machines is
  - automatic parallelization of serial code
  - loop-level parallelization (usually Fortran)
- Most parallel programs are written using explicit parallelism, either:
  a) Message passing with a single program, multiple data (SPMD) model
     - usually MPI with either Fortran or mixed C and Fortran for scientific applications
  b) Shared memory with a thread and synchronization library in C or Java for non-scientific applications
- Option B is easier to program, but requires hardware support that is still unproven for more than 200 processors
4. Titanium Overview
- Give programmers a global address space
  - Useful for building large, complex data structures that are spread over the machine
  - But don't pretend it will have uniform access time (i.e., not quite shared memory)
- Use an explicit parallelism model
  - SPMD for simplicity
- Extend a standard language with data structures for a specific problem domain: grid-based scientific applications
  - Small amount of syntax added for ease of programming
- General idea: build domain-specific features into the language and optimization framework
5. Titanium Goals
- Performance
  - close to C/FORTRAN + MPI, or better
- Portability
  - develop on a uniprocessor, then an SMP, then an MPP/cluster
- Safety
  - as safe as Java, extended to a parallel framework
- Expressiveness
  - close to the usability of threads
  - add a minimal set of features
- Compatibility, interoperability, etc.
  - no gratuitous departures from the Java standard
7. Titanium
- Take the best features of threads and MPI
  - global address space like threads (eases programming)
  - SPMD parallelism like MPI (for performance)
  - local/global distinction, i.e., layout matters (for performance)
- Based on Java, a cleaner C
  - classes, memory management
- Language is extensible through classes
  - domain-specific language extensions
  - current support for grid-based computations, including AMR
- Optimizing compiler
  - communication and memory optimizations
  - synchronization analysis
  - cache and other uniprocessor optimizations
8. New Language Features
- Scalable parallelism
  - SPMD model of execution with a global address space
- Multidimensional arrays
  - points and index sets as first-class values to simplify programs
  - iterators for performance
- Checked synchronization
  - single-valued variables and globally executed methods
- Global communication library
- Immutable classes
  - user-definable non-reference types for performance
- Operator overloading
  - by demand from our user community
- Semi-automated zone-based memory management
  - as safe as a garbage-collected language
  - better parallel performance and scalability
9. Lecture Outline
- Language and compiler support for uniprocessor performance
  - Immutable classes
  - Multidimensional arrays
  - foreach
- Language support for parallel computation
- Analysis of parallel code
- Summary and future directions
10. Java: A Cleaner C
- Java is an object-oriented language
  - classes (no standalone functions) with methods
  - inheritance between classes; multiple interface inheritance only
- Documentation on the web at java.sun.com
- Syntax similar to C:

    class Hello {
        public static void main(String[] argv) {
            System.out.println("Hello, world!");
        }
    }

- Safe
  - Strongly typed, checked at compile time, no unsafe casts
  - Automatic memory management
- Titanium is an (almost) strict superset
11. Java Objects
- Primitive scalar types: boolean, double, int, etc.
  - implementations will store these on the program stack
  - access is fast -- comparable to other languages
- Objects: user-defined and from the standard library
  - passed by pointer value (object sharing) into functions
  - have an implicit level of indirection (pointer to)
  - simple model, but inefficient for small objects

[Figure: primitives such as 2.6, 3, true stored directly on the stack; an object with fields r = 7.1, i = 4.3 reached through a pointer]
12. Java Object Example

    class Complex {
        private double real;
        private double imag;
        public Complex(double r, double i) {
            real = r; imag = i;
        }
        public Complex add(Complex c) {
            return new Complex(c.real + real, c.imag + imag);
        }
        public double getReal() { return real; }
        public double getImag() { return imag; }
    }

    Complex c = new Complex(7.1, 4.3);
    c = c.add(c);
    class VisComplex extends Complex { ... }
13. Immutable Classes in Titanium
- For small objects, would sometimes prefer
  - to avoid the level of indirection
  - pass by value (copying of the entire object)
  - especially when objects are immutable -- fields are unchangeable
    - extends the idea of primitive values (1, 4.2, etc.) to user-defined values
- Titanium introduces immutable classes (see the sketch below)
  - all fields are final (implicitly)
  - cannot inherit from (extend) or be inherited by other classes
  - need to have a 0-argument constructor, e.g., Complex()

    immutable class Complex { ... }
    Complex c = new Complex(7.1, 4.3);
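For concreteness, here is a minimal sketch of the Complex class from the previous slide written as a Titanium immutable class, assuming the same constructor and accessors; the 0-argument constructor is added only because the rules above require one.

    immutable class Complex {
        private double real;   // implicitly final
        private double imag;   // implicitly final
        public Complex() { real = 0.0; imag = 0.0; }  // required 0-argument constructor
        public Complex(double r, double i) { real = r; imag = i; }
        public Complex add(Complex c) {
            return new Complex(c.real + real, c.imag + imag);
        }
        public double getReal() { return real; }
        public double getImag() { return imag; }
    }

    Complex c = new Complex(7.1, 4.3);  // passed and stored by value, like a primitive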
14. Arrays in Java
- Arrays in Java are objects
- Only 1D arrays are directly supported
- Array bounds are checked
- Multidimensional arrays as arrays-of-arrays are slow (see the sketch below)
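A short plain-Java sketch of the array-of-arrays style referred to above: each row is a separate object, so every 2-D access pays for an extra pointer dereference and two bounds checks.

    // Plain Java: a "2-D" array is really an array of row objects.
    double[][] grid = new double[10][20];
    for (int i = 0; i < grid.length; i++) {
        for (int j = 0; j < grid[i].length; j++) {
            // grid[i][j] first loads the row object grid[i] (bounds-checking i),
            // then indexes into that row (bounds-checking j).
            grid[i][j] = i + 0.1 * j;
        }
    }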
15. Multidimensional Arrays in Titanium
- New kind of multidimensional array added
  - Two arrays may overlap (unlike Java arrays)
  - Indexed by Points (tuples of ints)
  - Constructed over a set of Points, called a Domain
  - RectDomains are a special case of Domains
- Points, Domains and RectDomains are built-in immutable classes
- Support for adaptive meshes and other mesh/grid operations

    RectDomain<2> d = [0:n, 0:n];
    Point<2> p = [1, 2];
    double [2d] a = new double[d];
    a[0,0] = a[9,9];
16. Naïve MatMul with Titanium Arrays

    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
        int n = c.domain().max()[1];  // assumes square
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    c[i,j] += a[i,k] * b[k,j];
                }
            }
        }
    }
17. Two Performance Issues
- In any language, uniprocessor performance is often dominated by memory hierarchy costs
  - algorithms that are blocked for the memory hierarchy (caches and registers) can be much faster
- In Titanium, the representation of arrays is fast, but the access methods are expensive
  - need optimizations on Titanium arrays
    - common subexpression elimination
    - eliminate (or hoist) bounds checking
    - strength reduction: e.g., naïve code has 1 divide per dimension for each array access
  - see Geoff Pike's work
  - goal: competitive with C/Fortran performance, or better
18. Matrix Multiply (blocked, or tiled)
- Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize (a plain Java version is sketched after the pseudocode)

    for i = 1 to N
        for j = 1 to N
            read block C(i,j) into fast memory
            for k = 1 to N
                read block A(i,k) into fast memory
                read block B(k,j) into fast memory
                C(i,j) = C(i,j) + A(i,k) * B(k,j)   // matrix multiply on blocks
            write block C(i,j) back to slow memory

[Figure: the blocks A(i,k), B(k,j), and C(i,j) touched in one block update]
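The following is a minimal serial Java sketch of the tiled loop structure above (not the Titanium implementation); the block size BS is a hypothetical tuning parameter chosen so that three blocks fit in cache.

    // Tiled matrix multiply: C += A * B, all n-by-n, block size BS.
    static void blockedMatMul(double[][] a, double[][] b, double[][] c,
                              int n, int BS) {
        for (int ii = 0; ii < n; ii += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int kk = 0; kk < n; kk += BS)
                    // Multiply the (ii,kk) block of A by the (kk,jj) block of B
                    // into the (ii,jj) block of C; the three blocks stay in cache.
                    for (int i = ii; i < Math.min(ii + BS, n); i++)
                        for (int j = jj; j < Math.min(jj + BS, n); j++) {
                            double s = c[i][j];
                            for (int k = kk; k < Math.min(kk + BS, n); k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }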
19. Memory Hierarchy Optimizations: MatMul
[Figure: speed of n-by-n matrix multiply on a Sun Ultra-1/170, peak 330 MFlops]
20. Unordered Iteration
- Often useful to reorder iterations for caches
- Compilers can do this for simple operations, e.g., matrix multiply, but it is hard in general
- Titanium adds unordered iteration over rectangular domains:
    foreach (p within r) { ... }
  - p is a Point, newly declared and scoped only within the foreach body
  - r is a previously declared RectDomain
- foreach simplifies bounds checking as well
- Additional operations on domains and arrays to subset and transform
21. Better MatMul with Titanium Arrays

    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
        foreach (ij within c.domain()) {
            double [1d] aRowi = a.slice(1, ij[1]);
            double [1d] bColj = b.slice(2, ij[2]);
            foreach (k within aRowi.domain()) {
                c[ij] += aRowi[k] * bColj[k];
            }
        }
    }

- Current compiler eliminates array overhead, making it comparable to C performance for 3 nested loops
- Automatic tiling still TBD
22. Sequential Performance
[Figure: sequential performance results from '98; a new IR and optimization framework is almost complete]
23. Lecture Outline
- Language and compiler support for uniprocessor performance
- Language support for parallel computation
  - SPMD execution
  - Global and local references
  - Communication
  - Barriers and single
  - Synchronized methods and blocks (as in Java)
- Analysis of parallel code
- Summary and future directions
24. SPMD Execution Model
- Java programs can be run as Titanium, but the result will be that all processors do all the work
- E.g., parallel hello world:

    class HelloWorld {
        public static void main(String[] argv) {
            System.out.println("Hello from proc " + Ti.thisProc());
        }
    }

- Any non-trivial program will have communication and synchronization between processors
25. SPMD Execution Model
- A common style is compute/communicate
- E.g., in each timestep within a fish simulation with gravitational attraction:

    read all fish and compute forces on mine
    Ti.barrier();
    write to my fish using new forces
    Ti.barrier();
26. SPMD Model
- All processors start together and execute the same code, but not in lock-step
- Sometimes they take different branches:
    if (Ti.thisProc() == 0) { /* do setup */ }
    for (/* all data I own */) { /* compute on data */ }
- A common source of bugs is barriers or other global operations inside branches or loops (see the sketch below)
  - barrier, broadcast, reduction, exchange
- A single method is one called by all procs:
    public single static void allStep()
- A single variable has the same value on all procs:
    int single timestep = 0;
27. SPMD Execution Model
- Barriers and single in FishSimulation (n-body):

    class FishSim {
        public static void main(String[] argv) {
            int allTimestep = 0;                              // single
            int allEndTime = 100;                             // single
            for (; allTimestep < allEndTime; allTimestep++) { // single
                // read all fish and compute forces on mine
                Ti.barrier();
                // write to my fish using new forces
                Ti.barrier();
            }
        }
    }

- Single methods inferred: see David Gay's work
28. Global Address Space
- Processes allocate locally
- References can be passed to other processes

[Figure: process 0 and the other processes each have a local heap; gv on every process refers to the object allocated on process 0's heap, while lv is valid only on process 0]

    class C { int val; ... }
    C gv;          // global pointer
    C local lv;    // local pointer
    if (Ti.thisProc() == 0) lv = new C();
    gv = broadcast lv from 0;
    gv.val = ...;  // gv has full
    ... = gv.val;  // functionality
29. Use of Global / Local
- Default is global
  - easier to port shared-memory programs
  - performance bugs are common: global pointers are more expensive
  - harder to use sequential kernels
- Use local declarations in critical sections (see the sketch below)
- Compiler can infer many instances of local
  - see Liblit's work on LQI (Local Qualification Inference)
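A rough sketch of what a local declaration buys, with hypothetical variable names, under the assumption that the array was allocated on this processor; accesses through the local-qualified pointer avoid the wide global-pointer representation.

    // Allocation is on this processor, so a local-qualified pointer may refer to it:
    double [1d] local myFish = new double[myFishDomain];  // hypothetical domain
    foreach (p within myFish.domain()) {
        myFish[p] = 0.0;   // compiled as cheap local loads/stores
    }

    // The same data seen through an unqualified (global) pointer is still legal,
    // but each access pays the global-pointer overhead:
    double [1d] anyFish = myFish;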
30. Local Pointer Analysis [Liblit, Aiken]
- Global references simplify programming, but incur overhead even when the data is local
- Split-C therefore requires global pointers to be declared explicitly
- Titanium pointers are global by default: easier, better portability
- Automatic local qualification inference
31. Parallel Performance
- Speedup on an Ultrasparc SMP
- AMR largely limited by
  - current algorithm
  - problem size
  - 2 levels, with the top one serial
- Not yet optimized with local for distributed memory
32. Lecture Outline
- Language and compiler support for uniprocessor performance
- Language support for parallel computation
- Analysis and optimization of parallel code
  - Tolerating network latency: Split-C experience
  - Hardware trends and reordering
  - Semantics: sequential consistency
  - Cycle detection: parallel dependence analysis
  - Synchronization analysis: parallel flow analysis
- Summary and future directions
33. Split-C Experience: Latency Overlap
- Titanium borrowed ideas from Split-C
  - global address space
  - SPMD parallelism
- But Split-C had non-blocking accesses built in to tolerate network latency on remote read/write
- Also one-way communication
- Conclusion: useful, but complicated

    int *global p;
    x := *p;            /* get */
    *p := 3;            /* put */
    sync();             /* wait for my puts/gets */
    *p :- x;            /* store */
    all_store_sync();   /* wait globally */
34. Other Sources of Overlap
- Would like the compiler to introduce put/get/store
- Hardware also reorders
  - out-of-order execution
  - write buffered with read by-pass
  - non-FIFO write buffers
  - weak memory models in general
- Software already reorders too
  - register allocation
  - any code motion
- System provides enforcement primitives
  - e.g., memory fence, volatile, etc.
  - tend to be heavyweight, with unpredictable performance
- Can the compiler hide all this?
35. Semantics: Sequential Consistency
- When compiling sequential programs, reordering

    x = expr1;             y = expr2;
    y = expr2;    into     x = expr1;

  is valid if y is not in expr1 and x is not in expr2 (roughly)
- When compiling parallel code, this test is not sufficient:

    Initially flag = data = 0
    Proc A                   Proc B
    data = 1;                while (flag != 1) { }
    flag = 1;                ... = ...data...;
36. Cycle Detection: Dependence Analog
- Processors define a program order on accesses from the same thread
  - P is the union of these total orders
- The memory system defines an access order on accesses to the same variable
  - A is the access order (read/write and write/write pairs)
- A violation of sequential consistency is a cycle in P ∪ A
- Intuition: time cannot flow backwards
37. Cycle Detection
- Generalizes to arbitrary numbers of variables and processors
- Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Shasha & Snir]

[Figure: example cycle between two processors over the accesses write x, write y, read y, read y, write x]
38. Static Analysis for Cycle Detection
- Approximate P by the control flow graph
- Approximate A by undirected dependence edges
- Let the delay set D be all edges from P that are part of a minimal cycle
- The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code)
- Synchronization analysis is also critical [Krishnamurthy]

[Figure: example with the accesses write z, read x / write y, read x / read y, write z distributed over three processors]
39. Automatic Communication Optimization
- Implemented in a subset of C with limited pointers [Krishnamurthy, Yelick]
- Experiments on the NOW; 3 synchronization styles
- Future: pointer analysis and optimizations for AMR [Jeh, Yelick]
40. Other Language Extensions
- Java extensions for expressiveness and performance
  - Operator overloading
  - Zone-based memory management
  - Foreign function interface
- The following is not yet implemented in the compiler:
  - Parameterized types (a.k.a. templates)
41. Implementation
- Strategy
  - compile Titanium into C
  - Solaris or POSIX threads for SMPs
  - Active Messages (Split-C library) for communication
- Status
  - runs on a Sun Enterprise 8-way SMP
  - runs on the Berkeley NOW
  - runs on the Tera (not fully tested)
  - T3E port partially working
  - SP2 port under way
42. Titanium Status
- Titanium language definition complete
- Titanium compiler running
- Compiles for uniprocessors, the NOW, the Tera, the T3E, SMPs, and the SP2 (under way)
- Application development ongoing
- Lots of research opportunities
43. Future Directions
- Super-optimizers for targeted kernels
  - e.g., PHiPAC, Sparsity, FFTW, and ATLAS
  - include feedback and some runtime information
- New application domains
  - unstructured grids (a.k.a. graphs and sparse matrices)
  - I/O-intensive applications such as information retrieval
- Optimizing I/O as well as communication
  - uniform treatment of memory hierarchy optimizations
- Performance heterogeneity from the hardware
  - related to dynamic load balancing in software
- Reasoning about parallel code
  - correctness analysis: race condition and synchronization analysis
  - better analysis: aliases and threads
  - Java memory model and hiding the hardware model
44. Backup Slides
45. Point, RectDomain, Arrays in General
- Points specified by a tuple of ints
- RectDomains given by
  - a lower-bound point
  - an upper-bound point
  - a stride point
- Array given by a RectDomain and an element type

    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> r = [lb : ub : [2, 2]];
    double [2d] A = new double[r];
    ...
    foreach (p in A.domain()) {
        A[p] = B[2 * p + [1, 1]];
    }
46. AMR Poisson
- Poisson solver [Semenzato, Pike, Colella]
  - 3D AMR
  - finite domain
  - variable coefficients
  - multigrid across levels
- Performance of the Titanium implementation
  - sequential multigrid performance within +/- 20% of Fortran
  - on a fixed, well-balanced problem of 8 patches, each 72^3: parallel speedups of 5.5 on 8 processors
47. Distributed Data Structures
- Build distributed data structures with
  - broadcast or exchange

    RectDomain<1> single allProcs = [0 : Ti.numProcs() - 1];
    RectDomain<1> myFishDomain = [0 : myFishCount - 1];
    Fish [1d] single [1d] allFish =
        new Fish [allProcs][1d];
    Fish [1d] myFish = new Fish[myFishDomain];
    allFish.exchange(myFish);

- Now each processor has an array of global pointers, one to each processor's chunk of fish
48. Consistency Model
- Titanium adopts the Java memory consistency model
- Roughly: accesses to shared variables that are not synchronized have undefined behavior
- Use synchronization to control access to shared variables
  - barriers
  - synchronized methods and blocks (sketched below)
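A small Java-style sketch (hypothetical class) of the synchronized methods and blocks mentioned above; these constructs carry over from Java.

    class SharedCounter {
        private int count = 0;

        // synchronized method: at most one thread updates the counter at a time
        public synchronized void increment() {
            count++;
        }

        public int read() {
            // a synchronized block on the same lock guards the read as well
            synchronized (this) {
                return count;
            }
        }
    }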
49. Example Domain
- Domains in general are not rectangular
- Built using set operations
  - union (+)
  - intersection (*)
  - difference (-)
- Example is a red-black algorithm

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> r = [lb : ub : [2, 2]];
    Domain<2> red = r + (r + [1, 1]);
    foreach (p in red) { ... }

[Figure: the strided domain r with corners (0, 0) and (6, 4), its copy shifted by [1, 1] with corners (1, 1) and (7, 5), and their union red]
50. Example Using Domains and foreach
- Gauss-Seidel red-black computation in multigrid:

    void gsrb() {
        boundary(phi);
        for (Domain<2> d = red; d != null;
             d = (d == red ? black : null)) {
            foreach (q in d) {                     // unordered iteration
                res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
                          + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                          - 20.0 * phi[q] - k * rhs[q]) * 0.05;
            }
            foreach (q in d) phi[q] += res[q];
        }
    }
51. Applications
- Three-D AMR Poisson Solver (AMR3D)
  - block-structured grids
  - 2000-line program
  - algorithm not yet fully implemented in other languages
  - tests performance and effectiveness of language features
- Other 2D Poisson solvers (under development)
  - infinite domains
  - based on the method of local corrections
- Three-D Electromagnetic Waves (EM3D)
  - unstructured grids
- Several smaller benchmarks