Title: Compiling for Parallel Machines
1. Compiling for Parallel Machines
Kathy Yelick
2. Two General Research Goals
- Correctness: help programmers eliminate bugs
  - Analysis to detect bugs statically (and conservatively)
  - Tools such as debuggers to help detect bugs dynamically
- Performance: help make programs run faster
  - Static compiler optimizations
    - May use analyses similar to the above to ensure the compiler is correctly transforming code
    - In many areas, the open problem is determining which transformations should be applied when
  - Link- or load-time optimizations, including object code translation
  - Feedback-directed optimization
  - Runtime optimization
- For parallel machines, if you can't get good performance, what's the point?
3. A Little History
- Most research on compiling for parallel machines is
  - automatic parallelization of serial code
  - loop-level parallelization (usually Fortran)
- Most parallel programs are written using explicit parallelism, either:
  a) Message passing with a single program, multiple data (SPMD) model
     - usually MPI with either Fortran or mixed C and Fortran for scientific applications
  b) Shared memory with a thread and synchronization library in C or Java for non-scientific applications
- Option B is easier to program, but requires hardware support that is still unproven for more than 200 processors
4. Titanium Overview
- Give programmers a global address space
  - Useful for building large, complex data structures that are spread over the machine
  - But don't pretend it will have uniform access time (i.e., not quite shared memory)
- Use an explicit parallelism model
  - SPMD for simplicity
- Extend a standard language with data structures for a specific problem domain: grid-based scientific applications
  - Small amount of syntax added for ease of programming
- General idea: build domain-specific features into the language and optimization framework
5. Titanium Goals
- Performance
  - close to C/FORTRAN + MPI, or better
- Portability
  - develop on a uniprocessor, then an SMP, then an MPP/cluster
- Safety
  - as safe as Java, extended to a parallel framework
- Expressiveness
  - close to the usability of threads
  - add a minimal set of features
- Compatibility, interoperability, etc.
  - no gratuitous departures from the Java standard
7. Titanium
- Take the best features of threads and MPI
  - global address space like threads (eases programming)
  - SPMD parallelism like MPI (for performance)
  - local/global distinction, i.e., layout matters (for performance)
- Based on Java, a cleaner C
  - classes, memory management
- Language is extensible through classes
  - domain-specific language extensions
  - current support for grid-based computations, including AMR
- Optimizing compiler
  - communication and memory optimizations
  - synchronization analysis
  - cache and other uniprocessor optimizations
8. New Language Features
- Scalable parallelism
  - SPMD model of execution with a global address space
- Multidimensional arrays
  - points and index sets as first-class values to simplify programs
  - iterators for performance
- Checked synchronization
  - single-valued variables and globally executed methods
- Global communication library
- Immutable classes
  - user-definable non-reference types for performance
- Operator overloading
  - by demand from our user community
- Semi-automated zone-based memory management
  - as safe as a garbage-collected language
  - better parallel performance and scalability
9. Lecture Outline
- Language and compiler support for uniprocessor performance
  - Immutable classes
  - Multidimensional arrays
  - foreach
- Language support for parallel computation
- Analysis of parallel code
- Summary and future directions
10. Java: A Cleaner C
- Java is an object-oriented language
  - classes (no standalone functions) with methods
  - inheritance between classes; multiple interface inheritance only
- Documentation on the web at java.sun.com
- Syntax similar to C:

    class Hello {
        public static void main(String[] argv) {
            System.out.println("Hello, world!");
        }
    }

- Safe
  - Strongly typed, checked at compile time, no unsafe casts
  - Automatic memory management
- Titanium is an (almost) strict superset
11. Java Objects
- Primitive scalar types: boolean, double, int, etc.
  - implementations will store these on the program stack
  - access is fast -- comparable to other languages
- Objects: user-defined and from the standard library
  - passed by pointer value (object sharing) into functions
  - have an implicit level of indirection (pointer to)
  - simple model, but inefficient for small objects

[Figure: primitives such as 2.6, 3, true stored directly on the stack; an object with fields r = 7.1, i = 4.3 reached through a pointer]
12. Java Object Example

    class Complex {
        private double real;
        private double imag;
        public Complex(double r, double i) {
            real = r; imag = i;
        }
        public Complex add(Complex c) {
            return new Complex(c.real + real, c.imag + imag);
        }
        public double getReal() { return real; }
        public double getImag() { return imag; }
    }

    Complex c = new Complex(7.1, 4.3);
    c = c.add(c);
    class VisComplex extends Complex { ... }
13. Immutable Classes in Titanium
- For small objects, would sometimes prefer
  - to avoid the level of indirection
  - pass by value (copying of the entire object)
  - especially when objects are immutable -- fields are unchangeable
    - extends the idea of primitive values (1, 4.2, etc.) to user-defined values
- Titanium introduces immutable classes (see the sketch below)
  - all fields are final (implicitly)
  - cannot inherit from (extend) or be inherited by other classes
  - need to have a 0-argument constructor, e.g., Complex()

    immutable class Complex { ... }
    Complex c = new Complex(7.1, 4.3);
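For concreteness, here is a minimal sketch of the Complex class from the previous slide written as a Titanium immutable class, assuming the same constructor and accessors; the 0-argument constructor is added only because the rules above require one.

    immutable class Complex {
        private double real;   // implicitly final
        private double imag;   // implicitly final
        public Complex() { real = 0.0; imag = 0.0; }  // required 0-argument constructor
        public Complex(double r, double i) { real = r; imag = i; }
        public Complex add(Complex c) {
            return new Complex(c.real + real, c.imag + imag);
        }
        public double getReal() { return real; }
        public double getImag() { return imag; }
    }

    Complex c = new Complex(7.1, 4.3);  // passed and stored by value, like a primitive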
14. Arrays in Java
- Arrays in Java are objects
- Only 1D arrays are directly supported
- Array bounds are checked
- Multidimensional arrays as arrays-of-arrays are slow (see the sketch below)
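A short plain-Java sketch of the array-of-arrays style referred to above: each row is a separate object, so every 2-D access pays for an extra pointer dereference and two bounds checks.

    // Plain Java: a "2-D" array is really an array of row objects.
    double[][] grid = new double[10][20];
    for (int i = 0; i < grid.length; i++) {
        for (int j = 0; j < grid[i].length; j++) {
            // grid[i][j] first loads the row object grid[i] (bounds-checking i),
            // then indexes into that row (bounds-checking j).
            grid[i][j] = i + 0.1 * j;
        }
    }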
15. Multidimensional Arrays in Titanium
- New kind of multidimensional array added
  - Two arrays may overlap (unlike Java arrays)
  - Indexed by Points (tuples of ints)
  - Constructed over a set of Points, called a Domain
  - RectDomains are a special case of Domains
- Points, Domains and RectDomains are built-in immutable classes
- Support for adaptive meshes and other mesh/grid operations

    RectDomain<2> d = [0:n, 0:n];
    Point<2> p = [1, 2];
    double [2d] a = new double[d];
    a[0,0] = a[9,9];
16. Naïve MatMul with Titanium Arrays

    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
        int n = c.domain().max()[1];  // assumes square
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    c[i,j] += a[i,k] * b[k,j];
                }
            }
        }
    }
17. Two Performance Issues
- In any language, uniprocessor performance is often dominated by memory hierarchy costs
  - algorithms that are blocked for the memory hierarchy (caches and registers) can be much faster
- In Titanium, the representation of arrays is fast, but the access methods are expensive
  - need optimizations on Titanium arrays
    - common subexpression elimination
    - eliminate (or hoist) bounds checking
    - strength reduction: e.g., naïve code has 1 divide per dimension for each array access
  - see Geoff Pike's work
  - goal: competitive with C/Fortran performance, or better
18. Matrix Multiply (blocked, or tiled)
- Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize (a plain Java version is sketched after the pseudocode)

    for i = 1 to N
        for j = 1 to N
            read block C(i,j) into fast memory
            for k = 1 to N
                read block A(i,k) into fast memory
                read block B(k,j) into fast memory
                C(i,j) = C(i,j) + A(i,k) * B(k,j)   // matrix multiply on blocks
            write block C(i,j) back to slow memory

[Figure: the blocks A(i,k), B(k,j), and C(i,j) touched in one block update]
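The following is a minimal serial Java sketch of the tiled loop structure above (not the Titanium implementation); the block size BS is a hypothetical tuning parameter chosen so that three blocks fit in cache.

    // Tiled matrix multiply: C += A * B, all n-by-n, block size BS.
    static void blockedMatMul(double[][] a, double[][] b, double[][] c,
                              int n, int BS) {
        for (int ii = 0; ii < n; ii += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int kk = 0; kk < n; kk += BS)
                    // Multiply the (ii,kk) block of A by the (kk,jj) block of B
                    // into the (ii,jj) block of C; the three blocks stay in cache.
                    for (int i = ii; i < Math.min(ii + BS, n); i++)
                        for (int j = jj; j < Math.min(jj + BS, n); j++) {
                            double s = c[i][j];
                            for (int k = kk; k < Math.min(kk + BS, n); k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }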
19. Memory Hierarchy Optimizations: MatMul
[Figure: speed of n-by-n matrix multiply on a Sun Ultra-1/170, peak 330 MFlops]
20. Unordered Iteration
- Often useful to reorder iterations for caches
- Compilers can do this for simple operations, e.g., matrix multiply, but it is hard in general
- Titanium adds unordered iteration over rectangular domains:
    foreach (p within r) { ... }
  - p is a Point, newly declared and scoped only within the foreach body
  - r is a previously declared RectDomain
- foreach simplifies bounds checking as well
- Additional operations on domains and arrays to subset and transform
21. Better MatMul with Titanium Arrays

    public static void matMul(double [2d] a, double [2d] b,
                              double [2d] c) {
        foreach (ij within c.domain()) {
            double [1d] aRowi = a.slice(1, ij[1]);
            double [1d] bColj = b.slice(2, ij[2]);
            foreach (k within aRowi.domain()) {
                c[ij] += aRowi[k] * bColj[k];
            }
        }
    }

- Current compiler eliminates array overhead, making it comparable to C performance for 3 nested loops
- Automatic tiling still TBD
22. Sequential Performance
[Figure: sequential performance results from '98; a new IR and optimization framework is almost complete]
23. Lecture Outline
- Language and compiler support for uniprocessor performance
- Language support for parallel computation
  - SPMD execution
  - Global and local references
  - Communication
  - Barriers and single
  - Synchronized methods and blocks (as in Java)
- Analysis of parallel code
- Summary and future directions
24. SPMD Execution Model
- Java programs can be run as Titanium, but the result will be that all processors do all the work
- E.g., parallel hello world:

    class HelloWorld {
        public static void main(String[] argv) {
            System.out.println("Hello from proc " + Ti.thisProc());
        }
    }

- Any non-trivial program will have communication and synchronization between processors
25. SPMD Execution Model
- A common style is compute/communicate
- E.g., in each timestep within a fish simulation with gravitational attraction:

    read all fish and compute forces on mine
    Ti.barrier();
    write to my fish using new forces
    Ti.barrier();
26. SPMD Model
- All processors start together and execute the same code, but not in lock-step
- Sometimes they take different branches:
    if (Ti.thisProc() == 0) { /* do setup */ }
    for (/* all data I own */) { /* compute on data */ }
- A common source of bugs is barriers or other global operations inside branches or loops (see the sketch below)
  - barrier, broadcast, reduction, exchange
- A single method is one called by all procs:
    public single static void allStep()
- A single variable has the same value on all procs:
    int single timestep = 0;
27. SPMD Execution Model
- Barriers and single in FishSimulation (n-body):

    class FishSim {
        public static void main(String[] argv) {
            int allTimestep = 0;                              // single
            int allEndTime = 100;                             // single
            for (; allTimestep < allEndTime; allTimestep++) { // single
                // read all fish and compute forces on mine
                Ti.barrier();
                // write to my fish using new forces
                Ti.barrier();
            }
        }
    }

- Single methods inferred: see David Gay's work
28. Global Address Space
- Processes allocate locally
- References can be passed to other processes

[Figure: process 0 and the other processes each have a local heap; gv on every process refers to the object allocated on process 0's heap, while lv is valid only on process 0]

    class C { int val; ... }
    C gv;          // global pointer
    C local lv;    // local pointer
    if (Ti.thisProc() == 0) lv = new C();
    gv = broadcast lv from 0;
    gv.val = ...;  // gv has full
    ... = gv.val;  // functionality
29. Use of Global / Local
- Default is global
  - easier to port shared-memory programs
  - performance bugs are common: global pointers are more expensive
  - harder to use sequential kernels
- Use local declarations in critical sections (see the sketch below)
- Compiler can infer many instances of local
  - see Liblit's work on LQI (Local Qualification Inference)
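A rough sketch of what a local declaration buys, with hypothetical variable names, under the assumption that the array was allocated on this processor; accesses through the local-qualified pointer avoid the wide global-pointer representation.

    // Allocation is on this processor, so a local-qualified pointer may refer to it:
    double [1d] local myFish = new double[myFishDomain];  // hypothetical domain
    foreach (p within myFish.domain()) {
        myFish[p] = 0.0;   // compiled as cheap local loads/stores
    }

    // The same data seen through an unqualified (global) pointer is still legal,
    // but each access pays the global-pointer overhead:
    double [1d] anyFish = myFish;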
30. Local Pointer Analysis [Liblit, Aiken]
- Global references simplify programming, but incur overhead even when the data is local
- Split-C therefore requires global pointers to be declared explicitly
- Titanium pointers are global by default: easier, better portability
- Automatic local qualification inference
31. Parallel Performance
- Speedup on an Ultrasparc SMP
- AMR largely limited by
  - current algorithm
  - problem size
  - 2 levels, with the top one serial
- Not yet optimized with local for distributed memory
32. Lecture Outline
- Language and compiler support for uniprocessor performance
- Language support for parallel computation
- Analysis and optimization of parallel code
  - Tolerating network latency: Split-C experience
  - Hardware trends and reordering
  - Semantics: sequential consistency
  - Cycle detection: parallel dependence analysis
  - Synchronization analysis: parallel flow analysis
- Summary and future directions
33. Split-C Experience: Latency Overlap
- Titanium borrowed ideas from Split-C
  - global address space
  - SPMD parallelism
- But Split-C had non-blocking accesses built in to tolerate network latency on remote read/write
- Also one-way communication
- Conclusion: useful, but complicated

    int *global p;
    x := *p;            /* get */
    *p := 3;            /* put */
    sync();             /* wait for my puts/gets */
    *p :- x;            /* store */
    all_store_sync();   /* wait globally */
34. Other Sources of Overlap
- Would like the compiler to introduce put/get/store
- Hardware also reorders
  - out-of-order execution
  - write buffered with read by-pass
  - non-FIFO write buffers
  - weak memory models in general
- Software already reorders too
  - register allocation
  - any code motion
- System provides enforcement primitives
  - e.g., memory fence, volatile, etc.
  - tend to be heavyweight, with unpredictable performance
- Can the compiler hide all this?
35. Semantics: Sequential Consistency
- When compiling sequential programs, reordering

    x = expr1;             y = expr2;
    y = expr2;    into     x = expr1;

  is valid if y is not in expr1 and x is not in expr2 (roughly)
- When compiling parallel code, this test is not sufficient:

    Initially flag = data = 0
    Proc A                   Proc B
    data = 1;                while (flag != 1) { }
    flag = 1;                ... = ...data...;
36. Cycle Detection: Dependence Analog
- Processors define a program order on accesses from the same thread
  - P is the union of these total orders
- The memory system defines an access order on accesses to the same variable
  - A is the access order (read/write and write/write pairs)
- A violation of sequential consistency is a cycle in P ∪ A
- Intuition: time cannot flow backwards
37. Cycle Detection
- Generalizes to arbitrary numbers of variables and processors
- Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Shasha & Snir]

[Figure: example cycle between two processors over the accesses write x, write y, read y, read y, write x]
38. Static Analysis for Cycle Detection
- Approximate P by the control flow graph
- Approximate A by undirected dependence edges
- Let the delay set D be all edges from P that are part of a minimal cycle
- The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code)
- Synchronization analysis is also critical [Krishnamurthy]

[Figure: example with the accesses write z, read x / write y, read x / read y, write z distributed over three processors]
39. Automatic Communication Optimization
- Implemented in a subset of C with limited pointers [Krishnamurthy, Yelick]
- Experiments on the NOW; 3 synchronization styles
- Future: pointer analysis and optimizations for AMR [Jeh, Yelick]
40. Other Language Extensions
- Java extensions for expressiveness and performance
  - Operator overloading
  - Zone-based memory management
  - Foreign function interface
- The following is not yet implemented in the compiler:
  - Parameterized types (a.k.a. templates)
41. Implementation
- Strategy
  - compile Titanium into C
  - Solaris or POSIX threads for SMPs
  - Active Messages (Split-C library) for communication
- Status
  - runs on a Sun Enterprise 8-way SMP
  - runs on the Berkeley NOW
  - runs on the Tera (not fully tested)
  - T3E port partially working
  - SP2 port under way
42. Titanium Status
- Titanium language definition complete
- Titanium compiler running
- Compiles for uniprocessors, the NOW, the Tera, the T3E, SMPs, and the SP2 (under way)
- Application development ongoing
- Lots of research opportunities
43. Future Directions
- Super-optimizers for targeted kernels
  - e.g., PHiPAC, Sparsity, FFTW, and ATLAS
  - include feedback and some runtime information
- New application domains
  - unstructured grids (a.k.a. graphs and sparse matrices)
  - I/O-intensive applications such as information retrieval
- Optimizing I/O as well as communication
  - uniform treatment of memory hierarchy optimizations
- Performance heterogeneity from the hardware
  - related to dynamic load balancing in software
- Reasoning about parallel code
  - correctness analysis: race condition and synchronization analysis
  - better analysis: aliases and threads
  - Java memory model and hiding the hardware model
44. Backup Slides
45. Point, RectDomain, Arrays in General
- Points specified by a tuple of ints
- RectDomains given by
  - a lower-bound point
  - an upper-bound point
  - a stride point
- Array given by a RectDomain and an element type

    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> r = [lb : ub : [2, 2]];
    double [2d] A = new double[r];
    ...
    foreach (p in A.domain()) {
        A[p] = B[2 * p + [1, 1]];
    }
46. AMR Poisson
- Poisson solver [Semenzato, Pike, Colella]
  - 3D AMR
  - finite domain
  - variable coefficients
  - multigrid across levels
- Performance of the Titanium implementation
  - sequential multigrid performance within +/- 20% of Fortran
  - on a fixed, well-balanced problem of 8 patches, each 72^3: parallel speedups of 5.5 on 8 processors
47. Distributed Data Structures
- Build distributed data structures with
  - broadcast or exchange

    RectDomain<1> single allProcs = [0 : Ti.numProcs() - 1];
    RectDomain<1> myFishDomain = [0 : myFishCount - 1];
    Fish [1d] single [1d] allFish =
        new Fish [allProcs][1d];
    Fish [1d] myFish = new Fish[myFishDomain];
    allFish.exchange(myFish);

- Now each processor has an array of global pointers, one to each processor's chunk of fish
48. Consistency Model
- Titanium adopts the Java memory consistency model
- Roughly: accesses to shared variables that are not synchronized have undefined behavior
- Use synchronization to control access to shared variables
  - barriers
  - synchronized methods and blocks (sketched below)
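A small Java-style sketch (hypothetical class) of the synchronized methods and blocks mentioned above; these constructs carry over from Java.

    class SharedCounter {
        private int count = 0;

        // synchronized method: at most one thread updates the counter at a time
        public synchronized void increment() {
            count++;
        }

        public int read() {
            // a synchronized block on the same lock guards the read as well
            synchronized (this) {
                return count;
            }
        }
    }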
49. Example Domain
- Domains in general are not rectangular
- Built using set operations
  - union (+)
  - intersection (*)
  - difference (-)
- Example is a red-black algorithm

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> r = [lb : ub : [2, 2]];
    Domain<2> red = r + (r + [1, 1]);
    foreach (p in red) { ... }

[Figure: the strided domain r with corners (0, 0) and (6, 4), its copy shifted by [1, 1] with corners (1, 1) and (7, 5), and their union red]
50. Example Using Domains and foreach
- Gauss-Seidel red-black computation in multigrid:

    void gsrb() {
        boundary(phi);
        for (Domain<2> d = red; d != null;
             d = (d == red ? black : null)) {
            foreach (q in d) {                     // unordered iteration
                res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
                          + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                          - 20.0 * phi[q] - k * rhs[q]) * 0.05;
            }
            foreach (q in d) phi[q] += res[q];
        }
    }
51. Applications
- Three-D AMR Poisson Solver (AMR3D)
  - block-structured grids
  - 2000-line program
  - algorithm not yet fully implemented in other languages
  - tests performance and effectiveness of language features
- Other 2D Poisson solvers (under development)
  - infinite domains
  - based on the method of local corrections
- Three-D Electromagnetic Waves (EM3D)
  - unstructured grids
- Several smaller benchmarks