Clusters and their Applications - PowerPoint PPT Presentation

About This Presentation
Title:

Clusters and their Applications

Description:

Some slides by Jim Demmel, David Culler, Horst Simon, and Erich Strohmaier – PowerPoint PPT presentation

Transcript and Presenter's Notes

1
Clusters and their Applications
  • Kathy Yelick
  • yelick@cs.berkeley.edu
  • http://www.cs.berkeley.edu/yelick/
  • http://upc.lbl.gov
  • http://titanium.cs.berkeley.edu

2
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • The UPC language
  • The Titanium language
  • An application study: heart simulation in Titanium

3
Outline
  • Overview of parallel programming models
  • Shared memory threads
  • Message passing
  • Partitioned global address space (PGAS)
  • Data parallel
  • Hybrids
  • Trends in large-scale parallel machines
  • The UPC language
  • The Titanium language
  • An application study: heart simulation in Titanium

4
A generic parallel architecture
[Diagram: multiple processors and memories connected by an interconnection network]
  • Where is the memory physically located?
  • Is it connected directly to processors?
  • What is the connectivity of the network?

5
Parallel Programming Models
  • Programming model is made up of the languages and
    libraries that create an abstract view of the
    machine
  • Control
  • How is parallelism created?
  • What orderings exist between operations?
  • Data
  • What data is private vs. shared?
  • How is logically shared data accessed or
    communicated?
  • Synchronization
  • What operations can be used to coordinate
    parallelism?
  • What are the atomic (indivisible) operations?
  • Cost
  • How do we account for the cost of each of the
    above?

6
Simple Example
  • Consider applying a function f to the elements of
    an array A and then computing its sum
  • Questions
  • Where does A live? All in single memory?
    Partitioned?
  • What work will be done by each processor?
  • They need to coordinate to get a single result,
    how?

A = array of all data
fA = f(A)
s = sum(fA)
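As a point of reference, a serial C version of this running example might look like the sketch below (an assumption, not from the slides; f() and the array contents are supplied by the user). The rest of the talk shows how different models parallelize it.

    /* Serial baseline; f() and the data are assumed to be defined elsewhere. */
    double sum_of_f(const double *A, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += f(A[i]);      /* apply f to each element and accumulate */
        return s;
    }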
7
Programming Model 1: Shared Memory
  • Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in
    some languages
  • Each thread has a set of private variables, e.g.,
    local stack variables
  • Also a set of shared variables, e.g., static
    variables, shared common blocks, or global heap.
  • Threads communicate implicitly by writing and
    reading shared variables.
  • Threads coordinate by synchronizing on shared
    variables

[Diagram: threads P0..Pn, each with private memory, reading and writing a shared variable s in shared memory]
8
Shared Memory Code for Computing a Sum
static int s = 0;

Thread 0:
  local_s1 = 0;
  for (i = 0; i < n/2; i++)
    local_s1 = local_s1 + f(A[i]);
  s = s + local_s1;

Thread 1:
  local_s2 = 0;
  for (i = n/2; i < n; i++)
    local_s2 = local_s2 + f(A[i]);
  s = s + local_s2;
  • Since addition is associative, it's OK to
    rearrange the order
  • Most computation is on private variables
  • Sharing frequency is also reduced, which might
    improve speed
  • But there is still a race condition on the update
    of shared s
  • The race condition can be fixed by adding locks
    (only one thread can hold a lock at a time
    others wait for it)
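As a concrete illustration of the lock fix mentioned in the last bullet, here is a minimal sketch using POSIX threads; it is not from the slides, and A, f(), and n are assumed to be defined elsewhere.

    #include <pthread.h>

    static int s = 0;
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each of two threads sums its half privately, then updates the
       shared total inside the critical section. */
    void *partial_sum(void *arg) {
        int tid = *(int *)arg;                 /* 0 or 1 */
        int lo = (tid == 0) ? 0 : n / 2;
        int hi = (tid == 0) ? n / 2 : n;
        int local_s = 0;
        for (int i = lo; i < hi; i++)
            local_s += f(A[i]);
        pthread_mutex_lock(&s_lock);
        s += local_s;            /* only one thread at a time: no race */
        pthread_mutex_unlock(&s_lock);
        return NULL;
    }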

9
Programming Model 2: Message Passing
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO
    shared data.
  • Logically shared data is partitioned over local
    processes.
  • Processes communicate by explicit send/receive
    pairs
  • Coordination is implicit in every communication
    event.
  • MPI (Message Passing Interface) is the most
    commonly used SW

[Diagram: processes P0..Pn, each with only private memory, connected by a network]
10
Message Passing: Computing s = A[1] + A[2]
  • First possible solution: what could go wrong?

Processor 1:
  xlocal = A[1];
  send xlocal, proc2;
  receive xremote, proc2;
  s = xlocal + xremote;

Processor 2:
  xlocal = A[2];
  send xlocal, proc1;
  receive xremote, proc1;
  s = xlocal + xremote;
  • If send/receive acts like the telephone system?
    The post office?
  • What if there are more than 2 processors?
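One safe answer, sketched below as an assumption rather than the slide's own code, uses MPI's combined send/receive so the exchange cannot deadlock even if plain sends behave like the telephone system (synchronous):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, other;
        double A[3] = {0.0, 1.0, 2.0};      /* placeholder data */
        double xlocal, xremote, s;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other  = 1 - rank;                  /* assumes exactly 2 ranks */
        xlocal = A[rank + 1];

        /* Combined send + receive cannot deadlock regardless of buffering. */
        MPI_Sendrecv(&xlocal,  1, MPI_DOUBLE, other, 0,
                     &xremote, 1, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        s = xlocal + xremote;
        printf("rank %d: s = %f\n", rank, s);
        MPI_Finalize();
        return 0;
    }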

11
MPI The de facto standard
  • MPI has become the de facto standard for parallel
    computing using message passing
  • Pros and Cons of standards
  • MPI finally created a standard for application
    development in the HPC community -> portability
  • The MPI standard is a least common denominator
    building on mid-80s technology, so may discourage
    innovation
  • Programming Model reflects hardware!

"I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere." (HDS, 2001)
12
Programming Model 3: Global Address Space
  • Partitioned Global Address Space (PGAS)
    programming
  • Program consists of a collection of named
    threads.
  • Usually fixed at program startup time
  • Private and shared data, as in shared memory
    model
  • Mostly access local data (private or shared)
  • Examples: UPC, Titanium, Co-Array Fortran

[Diagram: threads P0..Pn; each owns one element of the shared array s (e.g., s[0]=27, s[1]=34, ..., s[n]=18) plus its own private memory, and sums over s[i]]
13
PGAS (UPC) Code for Computing a Sum
shared int A[n];  shared int s[THREADS];

Thread 0, ..., THREADS-1:
  sum = s[MY_THREAD] = 0;
  for (i = MY_THREAD; i < n; i += THREADS)
    s[MY_THREAD] += f(A[i]);
  barrier;
  if (MY_THREAD == 0)
    for (i = 0; i < THREADS; i++) sum += s[i];
  • Array s is distributed with 1 element per thread
    starting at 0
  • Most accesses are to local variables, i.e.,
    private variables or the local elements of shared
    s
  • Thread 0 accesses remote values of s to compute
    global sum
  • Final sum could be done more efficiently as a
    tree calculation
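A sketch of that more efficient reduction, using the Berkeley UPC value-based collective introduced later in this talk (an assumption here, not the slide's code; A, n, and f() are defined elsewhere):

    #include <upc.h>
    #include <bupc_collectivev.h>

    int compute_sum(void) {
        int i, my_sum = 0;
        /* each thread sums its cyclic share of A privately */
        for (i = MYTHREAD; i < n; i += THREADS)
            my_sum += f(A[i]);
        /* the library reduces the partial sums (typically as a tree) to thread 0 */
        return bupc_allv_reduce(int, my_sum, 0, UPC_ADD);
    }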

14
Programming Model 4: Data Parallel
  • Single thread of control consisting of parallel
    operations.
  • Parallel operations applied to all (or a defined
    subset) of a data structure, usually an array
  • Communication is implicit in parallel operators
  • Elegant and easy to understand and reason about
  • Coordination is implicit: statements are executed
    synchronously
  • Similar to the Matlab language for array operations
  • Drawbacks
  • Not all problems fit this model (irregular
    parallelism)
  • Difficult to map onto coarse-grained machines

A = array of all data
fA = f(A)
s = sum(fA)
15
Programming Model 5: Hybrids
  • Hybrid hardware, clusters of SMPs are common
  • These programming models can be mixed
  • Message passing (MPI) at the top level with
    shared memory (OpenMP) within a node is used
  • MPI everywhere is more common today
  • Can we have a single programming model?
  • New DARPA HPCS languages mix data parallelism and
    threads in a global address space
  • Partitioned Global Address Space (PGAS) models
    can call message passing libraries or vice versa
  • PGAS models can be used in a hybrid mode
  • Shared memory when it exists in hardware
  • Communication (done by the runtime system)
    otherwise

16
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • Top500 list
  • Observations and predictions
  • The UPC language
  • The Titanium language
  • An application study: heart simulation in Titanium

17
TOP500
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK (Ax = b, dense problem)
- Updated twice a year: ISCxy in Germany (June) and SCxy in the USA (November)
- All data available from www.top500.org
18
TOP500 list - Data shown
  • Manufacturer: manufacturer or vendor
  • Computer: type indicated by manufacturer or vendor
  • Installation Site: customer
  • Location: location and country
  • Year: year of installation / last major update
  • Customer Segment: Academic, Research, Industry, Vendor, Classified
  • Processors: number of processors
  • Rmax: maximal LINPACK performance achieved
  • Rpeak: theoretical peak performance
  • Nmax: problem size for achieving Rmax
  • N1/2: problem size for achieving half of Rmax
  • Nworld: position within the TOP500 ranking

19
(No Transcript)
20
(No Transcript)
21
Petaflop with 1M Cores By 2008
Common by 2015?
[Chart: projected Top500 performance growth, from 10 MFlop/s up to 1 Eflop/s]
1 PFlop system in 2008
Data from top500.org
Slide source: Horst Simon, LBNL
22
Petaflop with 1M Cores in your PC by 2025?
23
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • The UPC language
  • PGAS language motivation and availability
  • Execution model
  • Shared vs. private data
  • Synchronization
  • Collectives
  • Distributed Arrays
  • Performance
  • The Titanium language
  • An application study: heart simulation in Titanium

24
Current Implementations of PGAS Languages
  • A successful language/library must run everywhere
  • UPC
  • Commercial compilers available on Cray, SGI, HP
    machines
  • Open source compiler from LBNL/UCB
    (source-to-source)
  • Open source gcc-based compiler from Intrepid
  • CAF
  • Commercial compiler available on Cray machines
  • Open source compiler available from Rice
  • Titanium
  • Open source compiler from UCB runs on most
    machines
  • DARPA HPCS Languages
  • Cray Chapel, IBM X10, Sun Fortress
  • Use PGAS memory abstraction, but have dynamic
    threading
  • Recent additions to the parallel language landscape ->
    no mature compilers for clusters yet

25
Unified Parallel C (UPC)
  • Overview and Design Philosophy
  • Unified Parallel C (UPC) is
  • An explicit parallel extension of ANSI C
  • A partitioned global address space language
  • Sometimes called a GAS language
  • Similar to the C language philosophy
  • Programmers are clever and careful, and may need
    to get close to hardware
  • to get performance, but
  • can get in trouble
  • Concise and efficient syntax
  • Common and familiar syntax and semantics for
    parallel C with simple extensions to ANSI C
  • Based on ideas in Split-C, AC, and PCP

26
UPC Execution Model
27
UPC Execution Model
  • Threads work independently in an SPMD fashion
  • Number of threads is specified at compile time or
    run time; available as the program variable THREADS
  • MYTHREAD specifies the thread index (0..THREADS-1)
  • upc_barrier is a global synchronization: all wait
  • There is a form of parallel loop that we will see
    later
  • There are two compilation modes
  • Static Threads mode
  • THREADS is specified at compile time by the user
  • The program may use THREADS as a compile-time
    constant
  • Dynamic threads mode
  • Compiled code may be run with varying numbers of
    threads

28
Hello World in UPC
  • Any legal C program is also a legal UPC program
  • If you compile and run it as UPC with P threads,
    it will run P copies of the program.
  • Using this fact, plus the identifiers from the
    previous slides, we can write a parallel hello world:

#include <upc.h>    /* needed for UPC extensions */
#include <stdio.h>

main() {
  printf("Thread %d of %d hello UPC world\n",
         MYTHREAD, THREADS);
}

29
Example Monte Carlo Pi Calculation
  • Estimate Pi by throwing darts at a unit square
  • Calculate percentage that fall in the unit circle
  • Area of square = r² = 1
  • Area of circle quadrant = ¼ π r² = π/4
  • Randomly throw darts at (x, y) positions
  • If x² + y² < 1, then the point is inside the circle
  • Compute the ratio:
  • # points inside / # points total
  • π = 4 × ratio

30
Pi in UPC
  • Independent estimates of pi
main(int argc, char **argv) {
  int i, hits = 0, trials = 0;
  double pi;

  if (argc != 2) trials = 1000000;
  else trials = atoi(argv[1]);

  srand(MYTHREAD*17);

  for (i=0; i < trials; i++) hits += hit();
  pi = 4.0*hits/trials;
  printf("PI estimated to %f.", pi);
}

31
Helper Code for Pi in UPC
  • Required includes
#include <stdio.h>
#include <math.h>
#include <upc.h>
  • Function to throw dart and calculate where it
    hits
int hit() {
  int const rand_max = 0xFFFFFF;
  double x = ((double) rand()) / RAND_MAX;
  double y = ((double) rand()) / RAND_MAX;
  if ((x*x + y*y) <= 1.0)
    return 1;
  else
    return 0;
}

32
Shared vs. Private Variables
33
Private vs. Shared Variables in UPC
  • Normal C variables and objects are allocated in
    the private memory space for each thread.
  • Shared variables are allocated only once, with
    thread 0
    shared int ours;    // use sparingly: performance
    int mine;
  • Shared variables may not have dynamic lifetime:
    they may not occur in a function definition,
    except as static. Why?
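A small illustration (an assumption, not from the slides) of the rule: shared objects need a single, program-lifetime home, while each thread's stack is private, so automatic variables cannot be shared.

    shared int ours;                /* OK: file scope, static lifetime          */

    void work(void) {
        static shared int counter;  /* OK: static inside a function             */
        /* shared int tmp; */       /* NOT allowed: automatic (stack) lifetime  */
        int mine = 0;               /* ordinary private variable                */
    }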

[Diagram: ours lives in the shared space with affinity to thread 0; each of threads 0..n has its own private mine]
34
Pi in UPC Shared Memory Style
  • Parallel computing of pi, but with a bug
shared int hits;

main(int argc, char **argv) {
  int i, my_trials = 0;
  int trials = atoi(argv[1]);
  my_trials = (trials + THREADS - 1)/THREADS;
  srand(MYTHREAD*17);
  for (i=0; i < my_trials; i++)
    hits += hit();
  upc_barrier;
  if (MYTHREAD == 0)
    printf("PI estimated to %f.", 4.0*hits/trials);
}

(Slide callouts: shared variable to record hits; divide work up evenly; accumulate hits)
What is the problem with this program?
35
Shared Arrays Are Cyclic By Default
  • Shared scalars always live in thread 0
  • Shared arrays are spread over the threads
  • Shared array elements are spread across the
    threads
    shared int x[THREADS];      /* 1 element per thread */
    shared int y[3][THREADS];   /* 3 elements per thread */
    shared int z[3][3];         /* 2 or 3 elements per thread */
  • In the pictures below, assume THREADS = 4
  • Red elements have affinity to thread 0

Think of a linearized C array, then map it round-robin.
[Figure: x has one element per thread; as a 2D array, y is logically blocked by columns; z is not]
36
Pi in UPC Shared Array Version
  • Alternative fix to the race condition
  • Have each thread update a separate counter
  • But do it in a shared array
  • Have one thread compute sum
shared int all_hits[THREADS];

main(int argc, char **argv) {
  // declarations and initialization code omitted
  for (i=0; i < my_trials; i++)
    all_hits[MYTHREAD] += hit();
  upc_barrier;
  if (MYTHREAD == 0) {
    for (i=0; i < THREADS; i++) hits += all_hits[i];
    printf("PI estimated to %f.", 4.0*hits/trials);
  }
}

all_hits is shared by all processors, just as
hits was
update element with local affinity
37
UPC Synchronization
38
UPC Global Synchronization
  • UPC has two basic forms of barriers
  • Barrier: block until all other threads arrive
    upc_barrier
  • Split-phase barriers
  • upc_notify: this thread is ready for the barrier
  • do computation unrelated to barrier
  • upc_wait: wait for others to be ready
  • Optional labels allow for debugging:

#define MERGE_BARRIER 12
if (MYTHREAD%2 == 0) {
  ...
  upc_barrier MERGE_BARRIER;
} else {
  ...
  upc_barrier MERGE_BARRIER;
}
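A minimal sketch (an assumption, not from the slides) of how the split-phase barrier overlaps independent work; do_local_work() is a hypothetical placeholder:

    #include <upc.h>

    void exchange_step(void) {
        upc_notify;        /* announce: this thread is ready for the barrier */
        do_local_work();   /* computation unrelated to the barrier           */
        upc_wait;          /* block until every thread has notified          */
    }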

39
Synchronization - Locks
  • UPC Locks are an opaque type
  • upc_lock_t
  • Locks must be allocated before use
  • upc_lock_t *upc_all_lock_alloc(void);
  • allocates 1 lock, pointer to all threads
  • upc_lock_t *upc_global_lock_alloc(void);
  • allocates 1 lock, pointer to one thread
  • To use a lock:
  • void upc_lock(upc_lock_t *l);
  • void upc_unlock(upc_lock_t *l);
  • use at start and end of a critical region
  • Locks can be freed when not in use:
  • void upc_lock_free(upc_lock_t *ptr);

40
Pi in UPC Shared Memory Style
  • Parallel computing of pi, without the bug
shared int hits;

main(int argc, char **argv) {
  int i, my_hits = 0, my_trials = 0;
  upc_lock_t *hit_lock = upc_all_lock_alloc();
  int trials = atoi(argv[1]);
  my_trials = (trials + THREADS - 1)/THREADS;
  srand(MYTHREAD*17);
  for (i=0; i < my_trials; i++)
    my_hits += hit();
  upc_lock(hit_lock);
  hits += my_hits;
  upc_unlock(hit_lock);
  upc_barrier;
  if (MYTHREAD == 0)
    printf("PI %f", 4.0*hits/trials);
}

create a lock
accumulate hits locally
accumulate across threads
41
Recap Private vs. Shared Variables in UPC
  • We saw several kinds of variables in the pi
    example
  • Private scalars (my_hits)
  • Shared scalars (hits)
  • Shared arrays (all_hits)
  • Shared locks (hit_lock)

[Diagram: hits, hit_lock, and all_hits[0..n] (where n = THREADS-1) live in the shared space; each thread has a private my_hits]
42
UPC Collectives
43
UPC Collectives in General
  • The UPC collectives interface is in the language spec:
  • http://upc.lbl.gov/docs/user/upc_spec_1.2.pdf
  • It contains typical functions:
  • Data movement: broadcast, scatter, gather, ...
  • Computational: reduce, prefix, ...
  • General interface has synchronization modes
  • Avoid over-synchronizing (barrier before/after)
  • Data being collected may be read/written by any
    thread simultaneously
  • Simple interface for scalar values (int, double, ...)
  • Berkeley UPC value-based collectives
  • Works with any compiler
  • http://upc.lbl.gov/docs/user/README-collectivev.txt

44
Pi in UPC Data Parallel Style
  • The previous version of Pi works, but is not
    scalable
  • On a large number of threads, the locked region will
    be a bottleneck
  • Use a reduction for better scalability

#include <bupc_collectivev.h>
// shared int hits;
main(int argc, char **argv) {
  ...
  for (i=0; i < my_trials; i++)
    my_hits += hit();
  // type, input, thread, op
  my_hits = bupc_allv_reduce(int, my_hits, 0, UPC_ADD);
  // upc_barrier;
  if (MYTHREAD == 0)
    printf("PI %f", 4.0*my_hits/trials);
}

Berkeley collectives
no shared variables
barrier implied by collective
45
Work Distribution Using upc_forall
46
Example Vector Addition
  • Questions about parallel vector additions
  • How to lay out the data (here it is cyclic)
  • Which processor does what (here it is owner computes)

/* vadd.c */
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
  int i;
  for (i=0; i<N; i++)
    if (MYTHREAD == i%THREADS)
      sum[i] = v1[i]+v2[i];
}

cyclic layout
owner computes
47
Work Sharing with upc_forall()
  • The idiom in the previous slide is very common
  • Loop over all; work on those owned by this proc
  • UPC adds a special type of loop:
    upc_forall(init; test; loop; affinity)
      statement;
  • Programmer indicates the iterations are
    independent
  • Undefined if there are dependencies across
    threads
  • Affinity expression indicates which iterations to
    run on each thread. It may have one of two types:
  • Integer: affinity%THREADS is MYTHREAD
  • Pointer: upc_threadof(affinity) is MYTHREAD
  • Syntactic sugar for the loop on the previous slide
  • Some compilers may do better than this, e.g.,
    for(i=MYTHREAD; i<N; i+=THREADS)
  • Rather than having all threads iterate N times:
    for(i=0; i<N; i++) if (MYTHREAD == i%THREADS)

48
Vector Addition with upc_forall
  • The vadd example can be rewritten as follows
  • Equivalent code could use &sum[i] for affinity
  • The code would be correct but slow if the
    affinity expression were i+1 rather than i.

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
  int i;
  upc_forall(i=0; i<N; i++; i)
    sum[i] = v1[i]+v2[i];
}

The cyclic data distribution may perform poorly
on some machines
49
Distributed Arrays in UPC
50
Blocked Layouts in UPC
  • If this code were doing nearest neighbor
    averaging (3pt stencil) the cyclic layout would
    be the worst possible layout.
  • Instead, want a blocked layout
  • The vector addition example can be rewritten as
    follows using a blocked layout:

#define N 100*THREADS
shared [*] int v1[N], v2[N], sum[N];   /* [*] gives the blocked layout */

void main() {
  int i;
  upc_forall(i=0; i<N; i++; &sum[i])
    sum[i] = v1[i]+v2[i];
}

blocked layout
51
Layouts in General
  • All non-array objects have affinity with thread
    zero.
  • Array layouts are controlled by layout
    specifiers
  • Empty (cyclic layout)
  • [*] (blocked layout)
  • [0] or [] (indefinite layout, all on 1 thread)
  • [b] or [b1][b2]...[bn] = [b1*b2*...*bn] (fixed block size)
  • The affinity of an array element is defined in
    terms of:
  • the block size, a compile-time constant
  • and THREADS
  • Element i has affinity with thread
    (i / block_size) % THREADS
  • In 2D and higher, linearize the elements as in a
    C representation, and then use the above mapping
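For a concrete (assumed) example with THREADS = 4 and a declaration such as shared [3] int a[16], element i has affinity with thread (i/3) % 4: elements 0-2 land on thread 0, 3-5 on thread 1, 6-8 on thread 2, 9-11 on thread 3, 12-14 back on thread 0, and 15 on thread 1. The rule can be written directly as:

    /* Thread that owns element i of an array with the given block size. */
    int affinity(int i, int block_size, int threads) {
        return (i / block_size) % threads;
    }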

52
Pointers to Shared vs. Arrays
  • In the C tradition, arrays can be accessed through
    pointers
  • Here is the vector addition example using pointers:

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
  int i;
  shared int *p1, *p2;
  p1 = v1; p2 = v2;
  for (i=0; i<N; i++, p1++, p2++)
    if (i%THREADS == MYTHREAD)
      sum[i] = *p1 + *p2;
}

[Figure: p1 points into the shared array v1]
53
UPC Pointers
[Diagram: p1 and p2 are private, with one copy per thread; p3 and p4 live in the shared space]
int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */

Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.
54
Dynamic Memory Allocation in UPC
  • Dynamic memory allocation of shared memory is
    available in UPC
  • Non-collective (called independently):
    shared void *upc_global_alloc(size_t nblocks, size_t nbytes);
      nblocks: number of blocks
      nbytes: block size
  • Collective (called together; all threads get the same pointer):
    shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
  • Freeing dynamically allocated memory in shared space:
    void upc_free(shared void *ptr);
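A minimal sketch (an assumption, not from the slides) of the common idiom for building a distributed array at run time with the collective allocator:

    #include <stddef.h>
    #include <upc.h>

    /* Collective: every thread calls this and gets the same pointer back.
       One block of count ints per thread, blocked across the threads. */
    shared int *make_dist_array(size_t count) {
        return (shared int *) upc_all_alloc(THREADS, count * sizeof(int));
    }

    /* Later, exactly one thread releases the shared space. */
    void destroy_dist_array(shared int *a) {
        if (MYTHREAD == 0) upc_free(a);
    }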

55
Performance of UPC
56
PGAS Languages have Performance Advantages
  • Strategy for acceptance of a new language
  • Make it run faster than anything else
  • Keys to high performance
  • Parallelism
  • Scaling the number of processors
  • Maximize single node performance
  • Generate friendly code or use tuned libraries
    (BLAS, FFTW, etc.)
  • Avoid (unnecessary) communication cost
  • Latency, bandwidth, overhead
  • Berkeley UPC and Titanium use GASNet
    communication layer
  • Avoid unnecessary delays due to dependencies
  • Load balance; pipeline algorithmic dependencies

57
One-Sided vs Two-Sided
[Diagram: a one-sided put message carries a destination address plus the data payload and can be handled by the network interface; a two-sided message carries a message id plus the data payload and must be matched against a receive by the host CPU before the data can land in memory]
  • A one-sided put/get message can be handled
    directly by a network interface with RDMA support
  • Avoid interrupting the CPU or storing data from
    CPU (preposts)
  • A two-sided message needs to be matched with a
    receive to identify the memory address for the data
  • Offloaded to Network Interface in networks like
    Quadrics
  • Need to download match tables to interface (from
    host)
  • Ordering requirements on messages can also hinder
    bandwidth

58
One-Sided vs. Two-Sided Practice
NERSC Jacquard machine with Opteron processors
  • InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
  • The half-power point (N½) differs by one order of
    magnitude
  • This is not a criticism of the implementation!

Joint work with Paul Hargrove and Dan Bonachea
59
GASNet Portability and High-Performance
GASNet better for latency across machines
Joint work with UPC Group GASNet design by Dan
Bonachea
60
GASNet Portability and High-Performance
GASNet at least as high (comparable) for large
messages
Joint work with UPC Group GASNet design by Dan
Bonachea
61
GASNet Portability and High-Performance
GASNet excels at mid-range sizes important for
overlap
Joint work with UPC Group GASNet design by Dan
Bonachea
62
Case Study NAS FT in UPC
  • Perform FFT on a 3D Grid
  • 1D FFTs in each dimension, 3 phases
  • Transpose after first 2 for locality
  • Bisection bandwidth-limited
  • A problem as the number of processors grows
  • Three approaches
  • Exchange
  • wait for 2nd dim FFTs to finish, send 1 message
    per processor pair
  • Slab
  • wait for chunk of rows destined for 1 proc, send
    when ready
  • Pencil
  • send each row as it completes

Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
63
NAS FT Variants Performance Summary
0.5 Tflops
  • Slab is always best for MPI: small message cost
    too high
  • Pencil is always best for UPC: more overlap

Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
64
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • The UPC language
  • The Titanium language
  • Titanium Execution and Memory Model
  • Semi-automatic memory management
  • Support for Serial Programming
  • Performance and Applications
  • Compiler/Language Status
  • An application study: heart simulation in Titanium

65
Titanium
  • UPC has advantages over message passing, but it
    is still a relatively low-level language
  • Titanium uses the PGAS concept in a high level
    language
  • Based on Java, a cleaner C++
  • Classes, automatic memory management, etc.
  • Compiled to C and then machine code, no JVM
  • Same parallelism model as UPC and Co-Array
    Fortran
  • SPMD parallelism
  • Dynamic Java threads are not supported
  • Optimizing compiler
  • Analyzes global synchronization
  • Optimizes pointers, communication, memory

66
Summary of Features Added to Java
  • Multidimensional arrays: iterators, subarrays,
    copying
  • Immutable (value) classes for Complex, etc.
  • Templates
  • Operator overloading
  • Scalable SPMD parallelism replaces threads
  • Global address space with local/global reference
    distinction
  • Checked global synchronization
  • Zone-based memory management (regions)
  • Libraries for collective communication,
    distributed arrays, bulk I/O, performance
    profiling

67
SPMD Execution Model
  • Titanium has the same execution model as UPC and
    CAF
  • Basic Java programs may be run as Titanium
    programs, but all processors do all the work.
  • E.g., parallel hello world:

class HelloWorld {
  public static void main (String[] argv) {
    System.out.println("Hello from proc " +
                       Ti.thisProc() +
                       " out of " +
                       Ti.numProcs());
  }
}
  • Global synchronization done using Ti.barrier()

68
Avoiding Errors Checked Barriers and Single
  • To put a barrier (or equivalent) inside a method,
    you need to make the method single.
  • A single method is one called by all procs
    public single static void allStep(...)
  • These single annotations on methods are optional,
    and inferred by the compiler if you omit them
  • To put a barrier (or single method) in a branch
    or loop, you need to use a single variable for the
    branch
  • A single variable has the same value on all procs:
    int single timestep = 0;
  • Compiler proves that all processors call barriers
    together [Gay and Aiken]

69
Global Address Space
  • Globally shared address space is partitioned
  • References (pointers) are either local or global
    (meaning possibly remote)
  • References are global by default (unlike UPC, but
    like the HPCS languages)

Object heaps are shared by default; program stacks are private.
[Diagram: each process p0..pn holds local (l) and global (g) references into the shared object heaps]
70
Global Address Space
  • Processes allocate locally
  • References can be passed to other processes

class C { public int val; ... }

if (Ti.thisProc() == 0) lv = new C();
gv = broadcast lv from 0;
// data race:
gv.val = Ti.thisProc() + 1;
71
Distributed Data Structures
  • Building distributed arrays
  • Now each processor has an array of pointers, one to
    each processor's chunk of particles

[Diagram: each of P0, P1, P2 holds a pointer array referencing every processor's chunk]
72
Arrays in Java
  • Arrays in Java are objects
  • Only 1D arrays are directly supported
  • Multidimensional arrays are arrays of arrays
  • General, but slow

[Figure: a Java 2D array as an array of arrays]
  • Subarrays are important in AMR (e.g., interior of
    a grid)
  • Even C and C++ don't support these well
  • Hand-coding (array libraries) can confuse
    optimizer
  • Can build multidimensional arrays, but we want
  • Compiler optimizations and nice syntax

73
Multidimensional Arrays in Titanium
  • New multidimensional array added
  • Supports subarrays without copies
  • can refer to rows, columns, slabs, interior,
    boundary, even elements
  • Indexed by Points (tuples of ints)
  • Built on a rectangular set of Points, RectDomain
  • Points, Domains and RectDomains are built-in
    immutable classes, with useful literal syntax
  • Support for AMR and other grid computations
  • domain operations: intersection, shrink, border
  • bounds-checking can be disabled after debugging

74
Titanium Points, RectDomains, Arrays
  • Points specified by a tuple of ints
  • RectDomains given by 3 points
  • lower bound, upper bound (and optional stride)
  • Array declared by num dimensions and type
  • Array created by passing RectDomain

75
More Array Operations
  • Titanium arrays have a rich set of operations
  • None of these modify the original array, they
    just create another view of the data in that
    array
  • You create arrays with a RectDomain and get it
    back later using A.domain() for array A
  • A Domain is a set of points in space
  • A RectDomain is a rectangular one
  • Operations on Domains include +, -, * (union,
    difference, intersection)

[Figure: array views produced by translate, restrict, and slice (n dims to n-1)]
76
Are these features expressive?
  • Compared line counts of timed, uncommented
    portion of each program
  • Multigrid (MG) and FFT (FT) disparities mostly
    due to Ti domain calculus and array copy
  • Conjugate Gradient (CG) line counts are similar
    since Fortran version is already compact

77
Java Compiled by Titanium Compiler
  • Sun JDK 1.4.1_01 (HotSpot(TM) Client VM) for
    Linux
  • IBM J2SE 1.4.0 (Classic VM cxia32140-20020917a,
    jitc JIT) for 32-bit Linux
  • Titaniumc v2.87 for Linux, gcc 3.2 as backend
    compiler, -O3, no bounds check
  • gcc 3.2, -O3 (ANSI-C version of the SciMark2
    benchmark)

78
Java Compiled by Titanium Compiler
  • Same as previous slide, but using a larger data
    set
  • More cache misses, etc.

79
Applications in Titanium
  • Benchmarks and Kernels
  • Scalable Poisson solver for infinite domains
  • NAS PB MG, FT, IS, CG
  • Unstructured mesh kernel EM3D
  • Dense linear algebra LU, MatMul
  • Tree-structured n-body code
  • Finite element benchmark
  • Larger applications
  • Poisson Solver Adaptive Mesh Refinement (AMR)
  • Heart simulation
  • Gas Dynamics with AMR
  • Genetics micro-array selection

80
Case Study Block-Structured AMR
  • Adaptive Mesh Refinement (AMR) is challenging
  • Irregular data accesses and control from
    boundaries
  • Mixed global/local view is useful

Titanium AMR benchmark available
AMR Titanium work by Tong Wen and Philip Colella
81
Languages Support Helps Productivity
  • C++/Fortran/MPI AMR
  • Chombo package from LBNL
  • Bulk-synchronous comm
  • Pack boundary data between procs
  • All optimizations done by programmer
  • Titanium AMR
  • Entirely in Titanium
  • Finer-grained communication
  • No explicit pack/unpack code
  • Automated in runtime system
  • General approach
  • Language allows programmer optimizations
  • Compiler/runtime does some automatically

Work by Tong Wen and Philip Colella
Communication optimizations joint with Jimmy Su
82
Titanium AMR Performance
  • Performance is comparable with much less
    programming work
  • Compiler/runtime perform some tedious (SMP-aware)
    optimizations

83
Titanium Compiler Status
  • Titanium runs on almost any machine
  • Requires a C compiler and C++ for the translator
  • Pthreads for shared memory
  • GASNet for distributed memory, which exists on
  • Quadrics (Elan), IBM/SP (LAPI), Myrinet (GM),
    Infiniband, UDP, Shmem (Altix and X1), Dolphin
    (SCI), and MPI
  • Shared with Berkeley UPC compiler
  • Recent language and compiler work
  • Indexed (scatter/gather) array copy
  • Non-blocking array copy (experimental)
  • Inspector/Executor (in progress)

84
Outline
  • Overview of parallel programming models
  • Trends in large-scale parallel machines
  • The UPC language
  • The Titanium language
  • An application study: heart simulation in Titanium

85
Heart Simulation
  • Method and Fortran code developed by Peskin and
    McQueen at NYU
  • Ran on vector and shared memory machines
  • 100 CPU hours on a Cray C90
  • Models blood flow in the heart
  • Immersed boundaries are individual muscle fibers
  • Rules for contraction, valves, etc. included
  • Applications
  • Understanding structural abnormalities
  • Evaluating artificial heart valves
  • Eventually, artificial hearts

Source: www.psc.org
86
Other Applications
  • The immersed boundary method is a general
    technique
  • Simulating immersed elastic boundaries in an
    incompressible fluid
  • Other examples that have been explored
  • Inner ear (cochlea) (Givelberg, Bunn)
  • Blood clotting (platelet coagulation) (Aronson)
  • Flags and parachutes
  • Flagella
  • Embryo growth
  • Valveless pumping (E. Jung)
  • Paper making
  • Whirling instability of an elastic filament (S.
    Lim)
  • Flow in collapsible tubes (M. Rozar)
  • Flapping of a flexible filament in a flowing soap
    film (L. Zhu)
  • Deformation of red blood cells in shear flow
    (Eggleton and Popel)

87
Immersed Boundary Simulation Framework
  • Model Builder: C (workstation)
  • Immersed Boundary Simulation: Titanium (vector machines, shared- and distributed-memory parallel machines, PC clusters)
  • Visualization and Data Analysis: C/OpenGL, Java3D (workstation, PC)
88
Old Heart Model
  • Full structure shows cone shape
  • Includes atria, ventricles, valves, and some
    arteries
  • The rest of the circulatory system is modeled by
  • sources (inflow)
  • sinks (outflow)

89
New Heart Model
  • New model replaces the geodesics with
    triangulated surfaces
  • Based on CT scans from a healthy human.
  • Triangulated surface of left ventricle is shown
  • Work by:
  • Peskin and McQueen, NYU
  • Paragios and O'Donnell, Siemens
  • Setser, Cleveland Clinic

90
(No Transcript)
91
(No Transcript)
92
(No Transcript)
93
(No Transcript)
94
(No Transcript)
95
(No Transcript)
96
(No Transcript)
97
(No Transcript)
98
(No Transcript)
99
(No Transcript)
100
(No Transcript)
101
(No Transcript)
102
(No Transcript)
103
(No Transcript)
104
(No Transcript)
105
Immersed Boundary Equations
Navier-Stokes
Force on fluid from material
Movement of material (with force from fluid)

Variables:
  u, p: fluid velocity and pressure
  ρ, μ: fluid density and viscosity
  F: force applied to the fluid by the immersed matter
  t, q: time and material coordinate
  x: position of a fluid particle
  X: position of an immersed material particle
  δ: Dirac delta function (fluid/material interactions)
  f(q, t): application-specific material force
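The equations themselves appear only as images in the original slides; a standard statement of the immersed boundary formulation, consistent with the variable list above, is:

Navier-Stokes (incompressible fluid):
    \rho \left( \frac{\partial u}{\partial t} + u \cdot \nabla u \right) = -\nabla p + \mu \, \Delta u + F, \qquad \nabla \cdot u = 0
Force on fluid from material:
    F(x,t) = \int f(q,t) \, \delta\big(x - X(q,t)\big) \, dq
Movement of material (with force from fluid):
    \frac{\partial X(q,t)}{\partial t} = u\big(X(q,t),t\big) = \int u(x,t) \, \delta\big(x - X(q,t)\big) \, dx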
106
Immersed Boundary Method Structure
  • 4 steps in each timestep

  1. Material activation and force calculation (material points)
  2. Spread force from material points onto the fluid lattice (interaction)
  3. Navier-Stokes solver (fluid lattice)
  4. Interpolate fluid velocity and move the material (interaction)
[Figure: the four phases between material points and the fluid lattice, with a plot of the 2D Dirac delta function used for spreading and interpolation]
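A schematic sketch (an assumption, not from the slides) of the four phases above as a time-step loop; every function named here is a hypothetical placeholder for the corresponding phase.

    void timestep(double t) {
        compute_material_forces(t);   /* 1. material activation & force calc   */
        spread_forces_to_fluid();     /* 2. spread force onto the fluid grid
                                            using the discrete delta function  */
        solve_navier_stokes();        /* 3. FFT-based fluid solve               */
        interpolate_and_move();       /* 4. interpolate fluid velocity back to
                                            material points and move them      */
    }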
107
Challenges to Parallelization
  • Irregular material points need to interact with
    regular fluid lattice
  • Efficient scatter-gather across processors
  • Material points need to interact with each other
  • Spring force law between points on muscle
  • Placement of materials across processors
  • Locality: store material points with the underlying
    fluid and with nearby material points
  • Load balance: distribute points evenly
  • Need a scalable fluid solver
  • Currently based on 3D FFT
  • Multigrid and AMR explored by others

108
Material Interaction
  • Communication within a material can be high
  • E.g., spring force law in heart fibers to
    contract
  • Instead, replicate points; this uses linearity in the
    spread operation
  • Use graph partitioning (Metis) on materials
  • Improve locality in interaction within material
    and to fluid

[Diagram: replicating boundary points between P1 and P2 trades redundant work for less communication]
Joint work with A. Solar, J. Su
109
Data Structures for Interaction
  • Metadata and indexing overhead can be high
  • Old method: send the entire bounding box (faster
    than sending the exact set of points)
  • Newer method: logical grid of 4x4x4 cubes
  • Recent change: logical grid of k1 x k2 x k3 boxes
  • Communication aggregation is also performed

110
Fluid Solver
  • Incompressible fluid needs an elliptic solver
  • High communication demand
  • Information propagates across domain
  • FFT-based solver divides domain into slabs
  • Transposes before last direction of FFT
  • Limits parallelism to n processors for an n³ problem

1D FFTs
111
Immersed Boundary Simulation in Titanium
Code size in lines: Fortran 8000, Titanium 4000
  • Using Seaborg (Power3) at NERSC and DataStar
    (Power4) at SDSC

Joint work with Ed Givelberg, Armando Solar-Lezama
112
(No Transcript)
113
Building a Performance Model
  • Based on measurements/scaling of components
  • FFT time is
    (5 n log n flops) / (flops/sec measured for the FFT)
  • Other costs are linear in either material or
    fluid points
  • Measure constants:
  • flops/point (independent of machine and problem
    size)
  • flops/sec (measured per machine, per phase)
  • Time = a + b × points
  • Communication is modeled similarly:
  • Find a formula for message size as a function of
    problem size
  • Check the formula using tracing of some kind
  • Use an α/β model to predict running time: α + β ×
    size
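A small sketch (an assumption, not from the slides) of this kind of performance model; all constants are hypothetical placeholders that would be measured per machine and per phase.

    #include <math.h>

    double fft_time(double n, double flops_per_sec) {
        return 5.0 * n * log2(n) / flops_per_sec;      /* 5 n log n flops      */
    }

    double phase_time(double a, double b, double points) {
        return a + b * points;                         /* linear in points      */
    }

    double comm_time(double alpha, double beta, double msg_bytes) {
        return alpha + beta * msg_bytes;               /* latency + bandwidth   */
    }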

114
A Performance Model
  • 512³ in < 1 second per timestep: not possible
  • Primarily limited by bisection bandwidth

115
Summary
  • Heading towards an era of Petascale systems with
    100K-1M processor cores
  • High-end architectures are dominated by clusters
    with physically distributed memory
  • Message passing (MPI) is the de facto programming
    standard
  • PGAS languages offer advantages
  • Ease of programming, especially for higher level
    base language
  • Performance due to one-sided communication
  • Heart simulation
  • Complex computation required algorithm
    experimentation
  • Classic locality and load balancing trade-offs
  • Strategy for language adoption
  • Allow for hand-tuned optimizations; some of these
    can be automated
  • Provide performance, portability, and
    interoperability

116
Titanium Group (Past and Present)
  • Susan Graham
  • Katherine Yelick
  • Paul Hilfinger
  • Phillip Colella (LBNL)
  • Alex Aiken
  • Greg Balls
  • Andrew Begel
  • Dan Bonachea
  • Kaushik Datta
  • David Gay
  • Ed Givelberg
  • Amir Kamil
  • Arvind Krishnamurthy
  • Ben Liblit
  • Peter McQuorquodale (LBNL)
  • Sabrina Merchant
  • Carleton Miyamoto
  • Chang Sun Lin
  • Geoff Pike
  • Luigi Semenzato (LBNL)
  • Armando Solar-Lezama
  • Jimmy Su
  • Tong Wen (LBNL)
  • Siu Man Yau
  • and many undergraduate researchers

http://titanium.cs.berkeley.edu
117
UPC Group (Past and Present)
  • Katherine Yelick
  • Dan Bonachea
  • Wei Chen
  • Jason Duell
  • Paul Hargrove
  • Parry Husbands
  • Costin Iancu
  • Rajesh Nishtala
  • Michael Welcome
  • Former
  • Christian Bell

http://upc.lbl.gov