Clusters and their Applications - PowerPoint PPT Presentation

About This Presentation
Title:

Clusters and their Applications

Description:

Some slides by Jim Demmel, David Culler, Horst Simon, and Erich Strohmaier – PowerPoint PPT presentation

Transcript and Presenter's Notes

1
Clusters and their Applications
  • Kathy Yelick
  • yelick@cs.berkeley.edu
  • http://www.cs.berkeley.edu/yelick/
  • http://upc.lbl.gov
  • http://titanium.cs.berkeley.edu

2
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • The UPC language
  • The Titanium language
  • An application study: heart simulation in Titanium

3
Outline
  • Overview of parallel programming models
  • Shared memory threads
  • Message passing
  • Partitioned global address space (PGAS)
  • Data parallel
  • Hybrids
  • Trends in large-scale parallel machines
  • The UPC language
  • The Titanium language
  • An application study: heart simulation in Titanium

4
A generic parallel architecture
[Diagram: multiple processors and memories connected by an interconnection network]
  • Where is the memory physically located?
  • Is it connected directly to processors?
  • What is the connectivity of the network?

5
Parallel Programming Models
  • Programming model is made up of the languages and
    libraries that create an abstract view of the
    machine
  • Control
  • How is parallelism created?
  • What orderings exist between operations?
  • Data
  • What data is private vs. shared?
  • How is logically shared data accessed or
    communicated?
  • Synchronization
  • What operations can be used to coordinate
    parallelism?
  • What are the atomic (indivisible) operations?
  • Cost
  • How do we account for the cost of each of the
    above?

6
Simple Example
  • Consider applying a function f to the elements of
    an array A and then computing its sum
  • Questions
  • Where does A live? All in single memory?
    Partitioned?
  • What work will be done by each processor?
  • They need to coordinate to get a single result,
    how?

A = array of all data
fA = f(A)
s = sum(fA)
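As a point of reference, a serial C version of this running example might look like the sketch below (an assumption, not from the slides; f() and the array contents are supplied by the user). The rest of the talk shows how different models parallelize it.

    /* Serial baseline; f() and the data are assumed to be defined elsewhere. */
    double sum_of_f(const double *A, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += f(A[i]);      /* apply f to each element and accumulate */
        return s;
    }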
7
Programming Model 1: Shared Memory
  • Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in
    some languages
  • Each thread has a set of private variables, e.g.,
    local stack variables
  • Also a set of shared variables, e.g., static
    variables, shared common blocks, or global heap.
  • Threads communicate implicitly by writing and
    reading shared variables.
  • Threads coordinate by synchronizing on shared
    variables

[Diagram: threads P0..Pn, each with private memory, reading and writing a shared variable s in shared memory]
8
Shared Memory Code for Computing a Sum
static int s = 0;

Thread 0:
  local_s1 = 0;
  for (i = 0; i < n/2; i++)
    local_s1 = local_s1 + f(A[i]);
  s = s + local_s1;

Thread 1:
  local_s2 = 0;
  for (i = n/2; i < n; i++)
    local_s2 = local_s2 + f(A[i]);
  s = s + local_s2;
  • Since addition is associative, it's OK to
    rearrange the order
  • Most computation is on private variables
  • Sharing frequency is also reduced, which might
    improve speed
  • But there is still a race condition on the update
    of shared s
  • The race condition can be fixed by adding locks
    (only one thread can hold a lock at a time
    others wait for it)
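As a concrete illustration of the lock fix mentioned in the last bullet, here is a minimal sketch using POSIX threads; it is not from the slides, and A, f(), and n are assumed to be defined elsewhere.

    #include <pthread.h>

    static int s = 0;
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each of two threads sums its half privately, then updates the
       shared total inside the critical section. */
    void *partial_sum(void *arg) {
        int tid = *(int *)arg;                 /* 0 or 1 */
        int lo = (tid == 0) ? 0 : n / 2;
        int hi = (tid == 0) ? n / 2 : n;
        int local_s = 0;
        for (int i = lo; i < hi; i++)
            local_s += f(A[i]);
        pthread_mutex_lock(&s_lock);
        s += local_s;            /* only one thread at a time: no race */
        pthread_mutex_unlock(&s_lock);
        return NULL;
    }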

9
Programming Model 2: Message Passing
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO
    shared data.
  • Logically shared data is partitioned over local
    processes.
  • Processes communicate by explicit send/receive
    pairs
  • Coordination is implicit in every communication
    event.
  • MPI (Message Passing Interface) is the most
    commonly used SW

[Diagram: processes P0..Pn, each with only private memory, connected by a network]
10
Message Passing: Computing s = A[1] + A[2]
  • First possible solution: what could go wrong?

Processor 1:
  xlocal = A[1];
  send xlocal, proc2;
  receive xremote, proc2;
  s = xlocal + xremote;

Processor 2:
  xlocal = A[2];
  send xlocal, proc1;
  receive xremote, proc1;
  s = xlocal + xremote;
  • If send/receive acts like the telephone system?
    The post office?
  • What if there are more than 2 processors?
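One safe answer, sketched below as an assumption rather than the slide's own code, uses MPI's combined send/receive so the exchange cannot deadlock even if plain sends behave like the telephone system (synchronous):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, other;
        double A[3] = {0.0, 1.0, 2.0};      /* placeholder data */
        double xlocal, xremote, s;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other  = 1 - rank;                  /* assumes exactly 2 ranks */
        xlocal = A[rank + 1];

        /* Combined send + receive cannot deadlock regardless of buffering. */
        MPI_Sendrecv(&xlocal,  1, MPI_DOUBLE, other, 0,
                     &xremote, 1, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        s = xlocal + xremote;
        printf("rank %d: s = %f\n", rank, s);
        MPI_Finalize();
        return 0;
    }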

11
MPI The de facto standard
  • MPI has become the de facto standard for parallel
    computing using message passing
  • Pros and Cons of standards
  • MPI finally created a standard for application
    development in the HPC community -> portability
  • The MPI standard is a least common denominator
    building on mid-80s technology, so may discourage
    innovation
  • Programming Model reflects hardware!

"I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere." (HDS, 2001)
12
Programming Model 3: Global Address Space
  • Partitioned Global Address Space (PGAS)
    programming
  • Program consists of a collection of named
    threads.
  • Usually fixed at program startup time
  • Private and shared data, as in shared memory
    model
  • Mostly access local data (private or shared)
  • Examples: UPC, Titanium, Co-Array Fortran

[Diagram: threads P0..Pn; each owns one element of the shared array s (e.g., s[0]=27, s[1]=34, ..., s[n]=18) plus its own private memory, and sums over s[i]]
13
PGAS (UPC) Code for Computing a Sum
shared int A[n];  shared int s[THREADS];

Thread 0, ..., THREADS-1:
  sum = s[MY_THREAD] = 0;
  for (i = MY_THREAD; i < n; i += THREADS)
    s[MY_THREAD] += f(A[i]);
  barrier;
  if (MY_THREAD == 0)
    for (i = 0; i < THREADS; i++) sum += s[i];
  • Array s is distributed with 1 element per thread
    starting at 0
  • Most accesses are to local variables, i.e.,
    private variables or the local elements of shared
    s
  • Thread 0 accesses remote values of s to compute
    global sum
  • Final sum could be done more efficiently as a
    tree calculation
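A sketch of that more efficient reduction, using the Berkeley UPC value-based collective introduced later in this talk (an assumption here, not the slide's code; A, n, and f() are defined elsewhere):

    #include <upc.h>
    #include <bupc_collectivev.h>

    int compute_sum(void) {
        int i, my_sum = 0;
        /* each thread sums its cyclic share of A privately */
        for (i = MYTHREAD; i < n; i += THREADS)
            my_sum += f(A[i]);
        /* the library reduces the partial sums (typically as a tree) to thread 0 */
        return bupc_allv_reduce(int, my_sum, 0, UPC_ADD);
    }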

14
Programming Model 4: Data Parallel
  • Single thread of control consisting of parallel
    operations.
  • Parallel operations applied to all (or a defined
    subset) of a data structure, usually an array
  • Communication is implicit in parallel operators
  • Elegant and easy to understand and reason about
  • Coordination is implicit: statements are executed
    synchronously
  • Similar to the Matlab language for array operations
  • Drawbacks
  • Not all problems fit this model (irregular
    parallelism)
  • Difficult to map onto coarse-grained machines

A = array of all data
fA = f(A)
s = sum(fA)
15
Programming Model 5: Hybrids
  • Hybrid hardware, clusters of SMPs are common
  • These programming models can be mixed
  • Message passing (MPI) at the top level with
    shared memory (OpenMP) within a node is used
  • MPI everywhere is more common today
  • Can we have a single programming model?
  • New DARPA HPCS languages mix data parallelism and
    threads in a global address space
  • Partitioned Global Address Space (PGAS) models
    can call message passing libraries or vice versa
  • PGAS models can be used in a hybrid mode
  • Shared memory when it exists in hardware
  • Communication (done by the runtime system)
    otherwise

16
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • Top500 list
  • Observations and predictions
  • The UPC language
  • The Titanium language
  • An application study: heart simulation in Titanium

17
TOP500
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK (Ax = b, dense problem)
- Updated twice a year: ISCxy in Germany (June) and SCxy in the USA (November)
- All data available from www.top500.org
18
TOP500 list - Data shown
  • Manufacturer: manufacturer or vendor
  • Computer: type indicated by manufacturer or vendor
  • Installation Site: customer
  • Location: location and country
  • Year: year of installation / last major update
  • Customer Segment: Academic, Research, Industry, Vendor, Classified
  • Processors: number of processors
  • Rmax: maximal LINPACK performance achieved
  • Rpeak: theoretical peak performance
  • Nmax: problem size for achieving Rmax
  • N1/2: problem size for achieving half of Rmax
  • Nworld: position within the TOP500 ranking

19
(No Transcript)
20
(No Transcript)
21
Petaflop with 1M Cores By 2008
Common by 2015?
[Chart: projected Top500 performance growth, from 10 MFlop/s up to 1 Eflop/s]
1 PFlop system in 2008
Data from top500.org
Slide source: Horst Simon, LBNL
22
Petaflop with 1M Cores in your PC by 2025?
23
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • The UPC language
  • PGAS language motivation and availability
  • Execution model
  • Shared vs. private data
  • Synchronization
  • Collectives
  • Distributed Arrays
  • Performance
  • The Titanium language
  • An application study: heart simulation in Titanium

24
Current Implementations of PGAS Languages
  • A successful language/library must run everywhere
  • UPC
  • Commercial compilers available on Cray, SGI, HP
    machines
  • Open source compiler from LBNL/UCB
    (source-to-source)
  • Open source gcc-based compiler from Intrepid
  • CAF
  • Commercial compiler available on Cray machines
  • Open source compiler available from Rice
  • Titanium
  • Open source compiler from UCB runs on most
    machines
  • DARPA HPCS Languages
  • Cray Chapel, IBM X10, Sun Fortress
  • Use PGAS memory abstraction, but have dynamic
    threading
  • Recent additions to the parallel language landscape ->
    no mature compilers for clusters yet

25
Unified Parallel C (UPC)
  • Overview and Design Philosophy
  • Unified Parallel C (UPC) is
  • An explicit parallel extension of ANSI C
  • A partitioned global address space language
  • Sometimes called a GAS language
  • Similar to the C language philosophy
  • Programmers are clever and careful, and may need
    to get close to hardware
  • to get performance, but
  • can get in trouble
  • Concise and efficient syntax
  • Common and familiar syntax and semantics for
    parallel C with simple extensions to ANSI C
  • Based on ideas in Split-C, AC, and PCP

26
UPC Execution Model
27
UPC Execution Model
  • Threads work independently in an SPMD fashion
  • Number of threads is specified at compile time or
    run time; available as the program variable THREADS
  • MYTHREAD specifies the thread index (0..THREADS-1)
  • upc_barrier is a global synchronization: all wait
  • There is a form of parallel loop that we will see
    later
  • There are two compilation modes
  • Static Threads mode
  • THREADS is specified at compile time by the user
  • The program may use THREADS as a compile-time
    constant
  • Dynamic threads mode
  • Compiled code may be run with varying numbers of
    threads

28
Hello World in UPC
  • Any legal C program is also a legal UPC program
  • If you compile and run it as UPC with P threads,
    it will run P copies of the program.
  • Using this fact, plus the identifiers from the
    previous slides, we can write a parallel hello world:

#include <upc.h>    /* needed for UPC extensions */
#include <stdio.h>

main() {
  printf("Thread %d of %d hello UPC world\n",
         MYTHREAD, THREADS);
}

29
Example Monte Carlo Pi Calculation
  • Estimate Pi by throwing darts at a unit square
  • Calculate percentage that fall in the unit circle
  • Area of square = r² = 1
  • Area of circle quadrant = ¼ π r² = π/4
  • Randomly throw darts at (x, y) positions
  • If x² + y² < 1, then the point is inside the circle
  • Compute the ratio:
  • # points inside / # points total
  • π = 4 × ratio

30
Pi in UPC
  • Independent estimates of pi
main(int argc, char **argv) {
  int i, hits = 0, trials = 0;
  double pi;

  if (argc != 2) trials = 1000000;
  else trials = atoi(argv[1]);

  srand(MYTHREAD*17);

  for (i=0; i < trials; i++) hits += hit();
  pi = 4.0*hits/trials;
  printf("PI estimated to %f.", pi);
}

31
Helper Code for Pi in UPC
  • Required includes
#include <stdio.h>
#include <math.h>
#include <upc.h>
  • Function to throw dart and calculate where it
    hits
int hit() {
  int const rand_max = 0xFFFFFF;
  double x = ((double) rand()) / RAND_MAX;
  double y = ((double) rand()) / RAND_MAX;
  if ((x*x + y*y) <= 1.0)
    return 1;
  else
    return 0;
}

32
Shared vs. Private Variables
33
Private vs. Shared Variables in UPC
  • Normal C variables and objects are allocated in
    the private memory space for each thread.
  • Shared variables are allocated only once, with
    thread 0
    shared int ours;    // use sparingly: performance
    int mine;
  • Shared variables may not have dynamic lifetime:
    they may not occur in a function definition,
    except as static. Why?
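A small illustration (an assumption, not from the slides) of the rule: shared objects need a single, program-lifetime home, while each thread's stack is private, so automatic variables cannot be shared.

    shared int ours;                /* OK: file scope, static lifetime          */

    void work(void) {
        static shared int counter;  /* OK: static inside a function             */
        /* shared int tmp; */       /* NOT allowed: automatic (stack) lifetime  */
        int mine = 0;               /* ordinary private variable                */
    }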

[Diagram: ours lives in the shared space with affinity to thread 0; each of threads 0..n has its own private mine]
34
Pi in UPC Shared Memory Style
  • Parallel computing of pi, but with a bug
shared int hits;

main(int argc, char **argv) {
  int i, my_trials = 0;
  int trials = atoi(argv[1]);
  my_trials = (trials + THREADS - 1)/THREADS;
  srand(MYTHREAD*17);
  for (i=0; i < my_trials; i++)
    hits += hit();
  upc_barrier;
  if (MYTHREAD == 0)
    printf("PI estimated to %f.", 4.0*hits/trials);
}

(Slide callouts: shared variable to record hits; divide work up evenly; accumulate hits)
What is the problem with this program?
35
Shared Arrays Are Cyclic By Default
  • Shared scalars always live in thread 0
  • Shared arrays are spread over the threads
  • Shared array elements are spread across the
    threads
    shared int x[THREADS];      /* 1 element per thread */
    shared int y[3][THREADS];   /* 3 elements per thread */
    shared int z[3][3];         /* 2 or 3 elements per thread */
  • In the pictures below, assume THREADS = 4
  • Red elements have affinity to thread 0

Think of a linearized C array, then map it round-robin.
[Figure: x has one element per thread; as a 2D array, y is logically blocked by columns; z is not]
36
Pi in UPC Shared Array Version
  • Alternative fix to the race condition
  • Have each thread update a separate counter
  • But do it in a shared array
  • Have one thread compute sum
shared int all_hits[THREADS];

main(int argc, char **argv) {
  // declarations and initialization code omitted
  for (i=0; i < my_trials; i++)
    all_hits[MYTHREAD] += hit();
  upc_barrier;
  if (MYTHREAD == 0) {
    for (i=0; i < THREADS; i++) hits += all_hits[i];
    printf("PI estimated to %f.", 4.0*hits/trials);
  }
}

all_hits is shared by all processors, just as
hits was
update element with local affinity
37
UPC Synchronization
38
UPC Global Synchronization
  • UPC has two basic forms of barriers
  • Barrier: block until all other threads arrive
    upc_barrier
  • Split-phase barriers
  • upc_notify: this thread is ready for the barrier
  • do computation unrelated to barrier
  • upc_wait: wait for others to be ready
  • Optional labels allow for debugging:

#define MERGE_BARRIER 12
if (MYTHREAD%2 == 0) {
  ...
  upc_barrier MERGE_BARRIER;
} else {
  ...
  upc_barrier MERGE_BARRIER;
}
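A minimal sketch (an assumption, not from the slides) of how the split-phase barrier overlaps independent work; do_local_work() is a hypothetical placeholder:

    #include <upc.h>

    void exchange_step(void) {
        upc_notify;        /* announce: this thread is ready for the barrier */
        do_local_work();   /* computation unrelated to the barrier           */
        upc_wait;          /* block until every thread has notified          */
    }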

39
Synchronization - Locks
  • UPC Locks are an opaque type
  • upc_lock_t
  • Locks must be allocated before use
  • upc_lock_t *upc_all_lock_alloc(void);
  • allocates 1 lock, pointer to all threads
  • upc_lock_t *upc_global_lock_alloc(void);
  • allocates 1 lock, pointer to one thread
  • To use a lock:
  • void upc_lock(upc_lock_t *l);
  • void upc_unlock(upc_lock_t *l);
  • use at start and end of a critical region
  • Locks can be freed when not in use:
  • void upc_lock_free(upc_lock_t *ptr);

40
Pi in UPC Shared Memory Style
  • Parallel computing of pi, without the bug
shared int hits;

main(int argc, char **argv) {
  int i, my_hits = 0, my_trials = 0;
  upc_lock_t *hit_lock = upc_all_lock_alloc();
  int trials = atoi(argv[1]);
  my_trials = (trials + THREADS - 1)/THREADS;
  srand(MYTHREAD*17);
  for (i=0; i < my_trials; i++)
    my_hits += hit();
  upc_lock(hit_lock);
  hits += my_hits;
  upc_unlock(hit_lock);
  upc_barrier;
  if (MYTHREAD == 0)
    printf("PI %f", 4.0*hits/trials);
}

create a lock
accumulate hits locally
accumulate across threads
41
Recap Private vs. Shared Variables in UPC
  • We saw several kinds of variables in the pi
    example
  • Private scalars (my_hits)
  • Shared scalars (hits)
  • Shared arrays (all_hits)
  • Shared locks (hit_lock)

[Diagram: hits, hit_lock, and all_hits[0..n] (where n = THREADS-1) live in the shared space; each thread has a private my_hits]
42
UPC Collectives
43
UPC Collectives in General
  • The UPC collectives interface is in the language spec:
  • http://upc.lbl.gov/docs/user/upc_spec_1.2.pdf
  • It contains typical functions:
  • Data movement: broadcast, scatter, gather, ...
  • Computational: reduce, prefix, ...
  • General interface has synchronization modes
  • Avoid over-synchronizing (barrier before/after)
  • Data being collected may be read/written by any
    thread simultaneously
  • Simple interface for scalar values (int, double, ...)
  • Berkeley UPC value-based collectives
  • Works with any compiler
  • http://upc.lbl.gov/docs/user/README-collectivev.txt

44
Pi in UPC Data Parallel Style
  • The previous version of Pi works, but is not
    scalable
  • On a large number of threads, the locked region will
    be a bottleneck
  • Use a reduction for better scalability

#include <bupc_collectivev.h>
// shared int hits;
main(int argc, char **argv) {
  ...
  for (i=0; i < my_trials; i++)
    my_hits += hit();
  // type, input, thread, op
  my_hits = bupc_allv_reduce(int, my_hits, 0, UPC_ADD);
  // upc_barrier;
  if (MYTHREAD == 0)
    printf("PI %f", 4.0*my_hits/trials);
}

Berkeley collectives
no shared variables
barrier implied by collective
45
Work Distribution Using upc_forall
46
Example Vector Addition
  • Questions about parallel vector additions
  • How to lay out the data (here it is cyclic)
  • Which processor does what (here it is owner computes)

/* vadd.c */
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
  int i;
  for (i=0; i<N; i++)
    if (MYTHREAD == i%THREADS)
      sum[i] = v1[i]+v2[i];
}

cyclic layout
owner computes
47
Work Sharing with upc_forall()
  • The idiom in the previous slide is very common
  • Loop over all; work on those owned by this proc
  • UPC adds a special type of loop:
    upc_forall(init; test; loop; affinity)
      statement;
  • Programmer indicates the iterations are
    independent
  • Undefined if there are dependencies across
    threads
  • Affinity expression indicates which iterations to
    run on each thread. It may have one of two types:
  • Integer: affinity%THREADS is MYTHREAD
  • Pointer: upc_threadof(affinity) is MYTHREAD
  • Syntactic sugar for the loop on the previous slide
  • Some compilers may do better than this, e.g.,
    for(i=MYTHREAD; i<N; i+=THREADS)
  • Rather than having all threads iterate N times:
    for(i=0; i<N; i++) if (MYTHREAD == i%THREADS)

48
Vector Addition with upc_forall
  • The vadd example can be rewritten as follows
  • Equivalent code could use &sum[i] for affinity
  • The code would be correct but slow if the
    affinity expression were i+1 rather than i.

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
  int i;
  upc_forall(i=0; i<N; i++; i)
    sum[i] = v1[i]+v2[i];
}

The cyclic data distribution may perform poorly
on some machines
49
Distributed Arrays in UPC
50
Blocked Layouts in UPC
  • If this code were doing nearest neighbor
    averaging (3pt stencil) the cyclic layout would
    be the worst possible layout.
  • Instead, want a blocked layout
  • The vector addition example can be rewritten as
    follows using a blocked layout:

#define N 100*THREADS
shared [*] int v1[N], v2[N], sum[N];   /* [*] gives the blocked layout */

void main() {
  int i;
  upc_forall(i=0; i<N; i++; &sum[i])
    sum[i] = v1[i]+v2[i];
}

blocked layout
51
Layouts in General
  • All non-array objects have affinity with thread
    zero.
  • Array layouts are controlled by layout
    specifiers
  • Empty (cyclic layout)
  • [*] (blocked layout)
  • [0] or [] (indefinite layout, all on 1 thread)
  • [b] or [b1][b2]...[bn] = [b1*b2*...*bn] (fixed block size)
  • The affinity of an array element is defined in
    terms of:
  • the block size, a compile-time constant
  • and THREADS
  • Element i has affinity with thread
    (i / block_size) % THREADS
  • In 2D and higher, linearize the elements as in a
    C representation, and then use the above mapping
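For a concrete (assumed) example with THREADS = 4 and a declaration such as shared [3] int a[16], element i has affinity with thread (i/3) % 4: elements 0-2 land on thread 0, 3-5 on thread 1, 6-8 on thread 2, 9-11 on thread 3, 12-14 back on thread 0, and 15 on thread 1. The rule can be written directly as:

    /* Thread that owns element i of an array with the given block size. */
    int affinity(int i, int block_size, int threads) {
        return (i / block_size) % threads;
    }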

52
Pointers to Shared vs. Arrays
  • In the C tradition, arrays can be accessed through
    pointers
  • Here is the vector addition example using pointers:

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
  int i;
  shared int *p1, *p2;
  p1 = v1; p2 = v2;
  for (i=0; i<N; i++, p1++, p2++)
    if (i%THREADS == MYTHREAD)
      sum[i] = *p1 + *p2;
}

[Figure: p1 points into the shared array v1]
53
UPC Pointers
[Diagram: p1 and p2 are private, with one copy per thread; p3 and p4 live in the shared space]
int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */

Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.
54
Dynamic Memory Allocation in UPC
  • Dynamic memory allocation of shared memory is
    available in UPC
  • Non-collective (called independently):
    shared void *upc_global_alloc(size_t nblocks, size_t nbytes);
      nblocks: number of blocks
      nbytes: block size
  • Collective (called together; all threads get the same pointer):
    shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
  • Freeing dynamically allocated memory in shared space:
    void upc_free(shared void *ptr);
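A minimal sketch (an assumption, not from the slides) of the common idiom for building a distributed array at run time with the collective allocator:

    #include <stddef.h>
    #include <upc.h>

    /* Collective: every thread calls this and gets the same pointer back.
       One block of count ints per thread, blocked across the threads. */
    shared int *make_dist_array(size_t count) {
        return (shared int *) upc_all_alloc(THREADS, count * sizeof(int));
    }

    /* Later, exactly one thread releases the shared space. */
    void destroy_dist_array(shared int *a) {
        if (MYTHREAD == 0) upc_free(a);
    }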

55
Performance of UPC
56
PGAS Languages have Performance Advantages
  • Strategy for acceptance of a new language
  • Make it run faster than anything else
  • Keys to high performance
  • Parallelism
  • Scaling the number of processors
  • Maximize single node performance
  • Generate friendly code or use tuned libraries
    (BLAS, FFTW, etc.)
  • Avoid (unnecessary) communication cost
  • Latency, bandwidth, overhead
  • Berkeley UPC and Titanium use GASNet
    communication layer
  • Avoid unnecessary delays due to dependencies
  • Load balance; pipeline algorithmic dependencies

57
One-Sided vs Two-Sided
[Diagram: a one-sided put message carries a destination address plus the data payload and can be handled by the network interface; a two-sided message carries a message id plus the data payload and must be matched against a receive by the host CPU before the data can land in memory]
  • A one-sided put/get message can be handled
    directly by a network interface with RDMA support
  • Avoid interrupting the CPU or storing data from
    CPU (preposts)
  • A two-sided message needs to be matched with a
    receive to identify the memory address for the data
  • Offloaded to Network Interface in networks like
    Quadrics
  • Need to download match tables to interface (from
    host)
  • Ordering requirements on messages can also hinder
    bandwidth

58
One-Sided vs. Two-Sided Practice
NERSC Jacquard machine with Opteron processors
  • InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
  • The half-power point (N½) differs by one order of
    magnitude
  • This is not a criticism of the implementation!

Joint work with Paul Hargrove and Dan Bonachea
59
GASNet Portability and High-Performance
GASNet better for latency across machines
Joint work with UPC Group GASNet design by Dan
Bonachea
60
GASNet Portability and High-Performance
GASNet at least as high (comparable) for large
messages
Joint work with UPC Group GASNet design by Dan
Bonachea
61
GASNet Portability and High-Performance
GASNet excels at mid-range sizes important for
overlap
Joint work with UPC Group GASNet design by Dan
Bonachea
62
Case Study NAS FT in UPC
  • Perform FFT on a 3D Grid
  • 1D FFTs in each dimension, 3 phases
  • Transpose after first 2 for locality
  • Bisection bandwidth-limited
  • A problem as the number of processors grows
  • Three approaches
  • Exchange
  • wait for 2nd dim FFTs to finish, send 1 message
    per processor pair
  • Slab
  • wait for chunk of rows destined for 1 proc, send
    when ready
  • Pencil
  • send each row as it completes

Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
63
NAS FT Variants Performance Summary
0.5 Tflops
  • Slab is always best for MPI: small message cost
    too high
  • Pencil is always best for UPC: more overlap

Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
64
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • The UPC language
  • The Titanium language
  • Titanium Execution and Memory Model
  • Semi-automatic memory management
  • Support for Serial Programming
  • Performance and Applications
  • Compiler/Language Status
  • An application study: heart simulation in Titanium

65
Titanium
  • UPC has advantages over message passing, but it
    is still a relatively low-level language
  • Titanium uses the PGAS concept in a high level
    language
  • Based on Java, a cleaner C++
  • Classes, automatic memory management, etc.
  • Compiled to C and then machine code, no JVM
  • Same parallelism model as UPC and Co-Array
    Fortran
  • SPMD parallelism
  • Dynamic Java threads are not supported
  • Optimizing compiler
  • Analyzes global synchronization
  • Optimizes pointers, communication, memory

66
Summary of Features Added to Java
  • Multidimensional arrays: iterators, subarrays,
    copying
  • Immutable (value) classes for Complex, etc.
  • Templates
  • Operator overloading
  • Scalable SPMD parallelism replaces threads
  • Global address space with local/global reference
    distinction
  • Checked global synchronization
  • Zone-based memory management (regions)
  • Libraries for collective communication,
    distributed arrays, bulk I/O, performance
    profiling

67
SPMD Execution Model
  • Titanium has the same execution model as UPC and
    CAF
  • Basic Java programs may be run as Titanium
    programs, but all processors do all the work.
  • E.g., parallel hello world:

class HelloWorld {
  public static void main (String[] argv) {
    System.out.println("Hello from proc " +
                       Ti.thisProc() +
                       " out of " +
                       Ti.numProcs());
  }
}
  • Global synchronization done using Ti.barrier()

68
Avoiding Errors Checked Barriers and Single
  • To put a barrier (or equivalent) inside a method,
    you need to make the method single.
  • A single method is one called by all procs
    public single static void allStep(...)
  • These single annotations on methods are optional,
    and inferred by the compiler if you omit them
  • To put a barrier (or single method) in a branch
    or loop, you need to use a single variable for the
    branch
  • A single variable has the same value on all procs:
    int single timestep = 0;
  • Compiler proves that all processors call barriers
    together [Gay and Aiken]

69
Global Address Space
  • Globally shared address space is partitioned
  • References (pointers) are either local or global
    (meaning possibly remote)
  • References are global by default (unlike UPC, but
    like the HPCS languages)

Object heaps are shared by default; program stacks are private.
[Diagram: each process p0..pn holds local (l) and global (g) references into the shared object heaps]
70
Global Address Space
  • Processes allocate locally
  • References can be passed to other processes

class C { public int val; ... }

if (Ti.thisProc() == 0) lv = new C();
gv = broadcast lv from 0;
// data race:
gv.val = Ti.thisProc() + 1;
71
Distributed Data Structures
  • Building distributed arrays
  • Now each processor has an array of pointers, one to
    each processor's chunk of particles

[Diagram: each of P0, P1, P2 holds a pointer array referencing every processor's chunk]
72
Arrays in Java
  • Arrays in Java are objects
  • Only 1D arrays are directly supported
  • Multidimensional arrays are arrays of arrays
  • General, but slow

[Figure: a Java 2D array as an array of arrays]
  • Subarrays are important in AMR (e.g., interior of
    a grid)
  • Even C and C++ don't support these well
  • Hand-coding (array libraries) can confuse
    optimizer
  • Can build multidimensional arrays, but we want
  • Compiler optimizations and nice syntax

73
Multidimensional Arrays in Titanium
  • New multidimensional array added
  • Supports subarrays without copies
  • can refer to rows, columns, slabs, interior,
    boundary, even elements
  • Indexed by Points (tuples of ints)
  • Built on a rectangular set of Points, RectDomain
  • Points, Domains and RectDomains are built-in
    immutable classes, with useful literal syntax
  • Support for AMR and other grid computations
  • domain operations: intersection, shrink, border
  • bounds-checking can be disabled after debugging

74
Titanium Points, RectDomains, Arrays
  • Points specified by a tuple of ints
  • RectDomains given by 3 points
  • lower bound, upper bound (and optional stride)
  • Array declared by num dimensions and type
  • Array created by passing RectDomain

75
More Array Operations
  • Titanium arrays have a rich set of operations
  • None of these modify the original array, they
    just create another view of the data in that
    array
  • You create arrays with a RectDomain and get it
    back later using A.domain() for array A
  • A Domain is a set of points in space
  • A RectDomain is a rectangular one
  • Operations on Domains include +, -, * (union,
    difference, intersection)

[Figure: array views produced by translate, restrict, and slice (n dims to n-1)]
76
Are these features expressive?
  • Compared line counts of timed, uncommented
    portion of each program
  • Multigrid (MG) and FFT (FT) disparities mostly
    due to Ti domain calculus and array copy
  • Conjugate Gradient (CG) line counts are similar
    since Fortran version is already compact

77
Java Compiled by Titanium Compiler
  • Sun JDK 1.4.1_01 (HotSpot(TM) Client VM) for
    Linux
  • IBM J2SE 1.4.0 (Classic VM cxia32140-20020917a,
    jitc JIT) for 32-bit Linux
  • Titaniumc v2.87 for Linux, gcc 3.2 as backend
    compiler, -O3, no bounds check
  • gcc 3.2, -O3 (ANSI-C version of the SciMark2
    benchmark)

78
Java Compiled by Titanium Compiler
  • Same as previous slide, but using a larger data
    set
  • More cache misses, etc.

79
Applications in Titanium
  • Benchmarks and Kernels
  • Scalable Poisson solver for infinite domains
  • NAS PB MG, FT, IS, CG
  • Unstructured mesh kernel EM3D
  • Dense linear algebra LU, MatMul
  • Tree-structured n-body code
  • Finite element benchmark
  • Larger applications
  • Poisson Solver Adaptive Mesh Refinement (AMR)
  • Heart simulation
  • Gas Dynamics with AMR
  • Genetics micro-array selection

80
Case Study Block-Structured AMR
  • Adaptive Mesh Refinement (AMR) is challenging
  • Irregular data accesses and control from
    boundaries
  • Mixed global/local view is useful

Titanium AMR benchmark available
AMR Titanium work by Tong Wen and Philip Colella
81
Languages Support Helps Productivity
  • C++/Fortran/MPI AMR
  • Chombo package from LBNL
  • Bulk-synchronous comm
  • Pack boundary data between procs
  • All optimizations done by programmer
  • Titanium AMR
  • Entirely in Titanium
  • Finer-grained communication
  • No explicit pack/unpack code
  • Automated in runtime system
  • General approach
  • Language allows programmer optimizations
  • Compiler/runtime does some automatically

Work by Tong Wen and Philip Colella
Communication optimizations joint with Jimmy Su
82
Titanium AMR Performance
  • Performance is comparable with much less
    programming work
  • Compiler/runtime perform some tedious (SMP-aware)
    optimizations

83
Titanium Compiler Status
  • Titanium runs on almost any machine
  • Requires a C compiler and C++ for the translator
  • Pthreads for shared memory
  • GASNet for distributed memory, which exists on
  • Quadrics (Elan), IBM/SP (LAPI), Myrinet (GM),
    Infiniband, UDP, Shmem (Altix and X1), Dolphin
    (SCI), and MPI
  • Shared with Berkeley UPC compiler
  • Recent language and compiler work
  • Indexed (scatter/gather) array copy
  • Non-blocking array copy (experimental)
  • Inspector/Executor (in progress)

84
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • The UPC language
  • The Titanium language
  • An application study: heart simulation in Titanium

85
Heart Simulation
  • Method and Fortran code developed by Peskin and
    McQueen at NYU
  • Ran on vector and shared memory machines
  • 100 CPU hours on a Cray C90
  • Models blood flow in the heart
  • Immersed boundaries are individual muscle fibers
  • Rules for contraction, valves, etc. included
  • Applications
  • Understanding structural abnormalities
  • Evaluating artificial heart valves
  • Eventually, artificial hearts

Source: www.psc.org
86
Other Applications
  • The immersed boundary method is a general
    technique
  • Simulating immersed elastic boundaries in an
    incompressible fluid
  • Other examples that have been explored
  • Inner ear (cochlea) (Givelberg, Bunn)
  • Blood clotting (platelet coagulation) (Aronson)
  • Flags and parachutes
  • Flagella
  • Embryo growth
  • Valveless pumping (E. Jung)
  • Paper making
  • Whirling instability of an elastic filament (S.
    Lim)
  • Flow in collapsible tubes (M. Rozar)
  • Flapping of a flexible filament in a flowing soap
    film (L. Zhu)
  • Deformation of red blood cells in shear flow
    (Eggleton and Popel)

87
Immersed Boundary Simulation Framework
  • Model Builder: C (workstation)
  • Immersed Boundary Simulation: Titanium (vector machines, shared- and distributed-memory parallel machines, PC clusters)
  • Visualization and Data Analysis: C/OpenGL, Java3D (workstation, PC)
88
Old Heart Model
  • Full structure shows cone shape
  • Includes atria, ventricles, valves, and some
    arteries
  • The rest of the circulatory system is modeled by
  • sources (inflow)
  • sinks (outflow)

89
New Heart Model
  • New model replaces the geodesics with
    triangulated surfaces
  • Based on CT scans from a healthy human.
  • Triangulated surface of left ventricle is shown
  • Work by:
  • Peskin and McQueen, NYU
  • Paragios and O'Donnell, Siemens
  • Setser, Cleveland Clinic

90
(No Transcript)
91
(No Transcript)
92
(No Transcript)
93
(No Transcript)
94
(No Transcript)
95
(No Transcript)
96
(No Transcript)
97
(No Transcript)
98
(No Transcript)
99
(No Transcript)
100
(No Transcript)
101
(No Transcript)
102
(No Transcript)
103
(No Transcript)
104
(No Transcript)
105
Immersed Boundary Equations
Navier-Stokes
Force on fluid from material
Movement of material (with force from fluid)

Variables:
  u, p: fluid velocity and pressure
  ρ, μ: fluid density and viscosity
  F: force applied to the fluid by the immersed matter
  t, q: time and material coordinate
  x: position of a fluid particle
  X: position of an immersed material particle
  δ: Dirac delta function (fluid/material interactions)
  f(q, t): application-specific material force
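The equations themselves appear only as images in the original slides; a standard statement of the immersed boundary formulation, consistent with the variable list above, is:

Navier-Stokes (incompressible fluid):
    \rho \left( \frac{\partial u}{\partial t} + u \cdot \nabla u \right) = -\nabla p + \mu \, \Delta u + F, \qquad \nabla \cdot u = 0
Force on fluid from material:
    F(x,t) = \int f(q,t) \, \delta\big(x - X(q,t)\big) \, dq
Movement of material (with force from fluid):
    \frac{\partial X(q,t)}{\partial t} = u\big(X(q,t),t\big) = \int u(x,t) \, \delta\big(x - X(q,t)\big) \, dx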
106
Immersed Boundary Method Structure
  • 4 steps in each timestep

  1. Material activation and force calculation (material points)
  2. Spread force from material points onto the fluid lattice (interaction)
  3. Navier-Stokes solver (fluid lattice)
  4. Interpolate fluid velocity and move the material (interaction)
[Figure: the four phases between material points and the fluid lattice, with a plot of the 2D Dirac delta function used for spreading and interpolation]
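A schematic sketch (an assumption, not from the slides) of the four phases above as a time-step loop; every function named here is a hypothetical placeholder for the corresponding phase.

    void timestep(double t) {
        compute_material_forces(t);   /* 1. material activation & force calc   */
        spread_forces_to_fluid();     /* 2. spread force onto the fluid grid
                                            using the discrete delta function  */
        solve_navier_stokes();        /* 3. FFT-based fluid solve               */
        interpolate_and_move();       /* 4. interpolate fluid velocity back to
                                            material points and move them      */
    }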
107
Challenges to Parallelization
  • Irregular material points need to interact with
    regular fluid lattice
  • Efficient scatter-gather across processors
  • Material points need to interact with each other
  • Spring force law between points on muscle
  • Placement of materials across processors
  • Locality: store material points with the underlying
    fluid and with nearby material points
  • Load balance: distribute points evenly
  • Need a scalable fluid solver
  • Currently based on 3D FFT
  • Multigrid and AMR explored by others

108
Material Interaction
  • Communication within a material can be high
  • E.g., spring force law in heart fibers to
    contract
  • Instead, replicate points; this uses linearity in the
    spread operation
  • Use graph partitioning (Metis) on materials
  • Improve locality in interaction within material
    and to fluid

[Diagram: replicating boundary points between P1 and P2 trades redundant work for less communication]
Joint work with A. Solar, J. Su
109
Data Structures for Interaction
  • Metadata and indexing overhead can be high
  • Old method: send the entire bounding box (faster
    than sending the exact set of points)
  • Newer method: logical grid of 4x4x4 cubes
  • Recent change: logical grid of k1 x k2 x k3 boxes
  • Communication aggregation is also performed

110
Fluid Solver
  • Incompressible fluid needs an elliptic solver
  • High communication demand
  • Information propagates across domain
  • FFT-based solver divides domain into slabs
  • Transposes before last direction of FFT
  • Limits parallelism to n processors for an n³ problem

1D FFTs
111
Immersed Boundary Simulation in Titanium
Code size in lines: Fortran 8000, Titanium 4000
  • Using Seaborg (Power3) at NERSC and DataStar
    (Power4) at SDSC

Joint work with Ed Givelberg, Armando Solar-Lezama
112
(No Transcript)
113
Building a Performance Model
  • Based on measurements/scaling of components
  • FFT time is
    (5 n log n flops) / (flops/sec measured for the FFT)
  • Other costs are linear in either material or
    fluid points
  • Measure constants:
  • flops/point (independent of machine and problem
    size)
  • flops/sec (measured per machine, per phase)
  • Time = a + b × points
  • Communication is modeled similarly:
  • Find a formula for message size as a function of
    problem size
  • Check the formula using tracing of some kind
  • Use an α/β model to predict running time: α + β ×
    size
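A small sketch (an assumption, not from the slides) of this kind of performance model; all constants are hypothetical placeholders that would be measured per machine and per phase.

    #include <math.h>

    double fft_time(double n, double flops_per_sec) {
        return 5.0 * n * log2(n) / flops_per_sec;      /* 5 n log n flops      */
    }

    double phase_time(double a, double b, double points) {
        return a + b * points;                         /* linear in points      */
    }

    double comm_time(double alpha, double beta, double msg_bytes) {
        return alpha + beta * msg_bytes;               /* latency + bandwidth   */
    }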

114
A Performance Model
  • 512³ in < 1 second per timestep: not possible
  • Primarily limited by bisection bandwidth

115
Summary
  • Heading towards an era of Petascale systems with
    100K-1M processor cores
  • High-end architectures are dominated by clusters
    with physically distributed memory
  • Message passing (MPI) is the de facto programming
    standard
  • PGAS languages offer advantages
  • Ease of programming, especially for higher level
    base language
  • Performance due to one-sided communication
  • Heart simulation
  • Complex computation required algorithm
    experimentation
  • Classic locality and load balancing trade-offs
  • Strategy for language adoption
  • Allow for hand-tuned optimizations; some of these
    can be automated
  • Provide performance, portability, and
    interoperability

116
Titanium Group (Past and Present)
  • Susan Graham
  • Katherine Yelick
  • Paul Hilfinger
  • Phillip Colella (LBNL)
  • Alex Aiken
  • Greg Balls
  • Andrew Begel
  • Dan Bonachea
  • Kaushik Datta
  • David Gay
  • Ed Givelberg
  • Amir Kamil
  • Arvind Krishnamurthy
  • Ben Liblit
  • Peter McQuorquodale (LBNL)
  • Sabrina Merchant
  • Carleton Miyamoto
  • Chang Sun Lin
  • Geoff Pike
  • Luigi Semenzato (LBNL)
  • Armando Solar-Lezama
  • Jimmy Su
  • Tong Wen (LBNL)
  • Siu Man Yau
  • and many undergraduate researchers

http://titanium.cs.berkeley.edu
117
UPC Group (Past and Present)
  • Katherine Yelick
  • Dan Bonachea
  • Wei Chen
  • Jason Duell
  • Paul Hargrove
  • Parry Husbands
  • Costin Iancu
  • Rajesh Nishtala
  • Michael Welcome
  • Former
  • Christian Bell

http://upc.lbl.gov