Unified Parallel C (UPC)

Slides: 65

Provided by: cost83

Learn more at: https://cseweb.ucsd.edu

1
Unified Parallel C (UPC)

Costin Iancu
The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome, K. Yelick
http://upc.lbl.gov

Slides edited by K. Yelick, T. El-Ghazawi, P. Husbands, C. Iancu
2
Context

Most parallel programs are written using either:
Message passing with an SPMD model (MPI)
Usually for scientific applications with C/Fortran
Scales easily
Shared memory with threads in OpenMP, Threads+C/C++/F, or Java
Usually for non-scientific applications
Easier to program, but less scalable performance
Partitioned Global Address Space (PGAS) languages take the best of both:
SPMD parallelism like MPI (performance)
Local/global distinction, i.e., layout matters (performance)
Global address space like threads (programmability)
3 current languages: UPC (C), CAF (Fortran), and Titanium (Java)
3 new languages: Chapel, Fortress, X10

3
Partitioned Global Address Space

Shared memory is logically partitioned by processors
Remote memory may stay remote: no automatic caching implied
One-sided communication: reads/writes of shared variables
Both individual and bulk memory copies
Some models have a separate private memory area
Distributed array generality and how they are constructed

4
Partitioned Global Address Space Languages

Explicitly parallel programming model with SPMD parallelism
Fixed at program start-up, typically 1 thread per processor
Global address space model of memory
Allows programmer to directly represent distributed data structures
Address space is logically partitioned
Local vs. remote memory (two-level hierarchy)
Programmer control over performance-critical decisions
Data layout and communication
Performance transparency and tunability are goals
Initial implementation can use fine-grained shared memory

5
Current Implementations

A successful language/library must run everywhere
UPC
Commercial compilers: Cray, SGI, HP, IBM
Open source compilers: LBNL/UCB (source-to-source), Intrepid (gcc)
CAF
Commercial compilers: Cray
Open source compilers: Rice (source-to-source)
Titanium
Open source compilers: UCB (source-to-source)
Common tools
Open64: open source research compiler infrastructure
ARMCI, GASNet for distributed memory implementations
Pthreads, System V shared memory

6
Talk Overview

UPC Language Design
Data Distribution (layout, memory management)
Work Distribution (data parallelism)
Communication (implicit, explicit, collective operations)
Synchronization (memory model, locks)
Programming in UPC
Performance (one-sided communication)
Application examples: FFT, PC
Productivity (compiler support)
Performance tuning and modeling

7
UPC Overview and Design

Unified Parallel C (UPC) is:
An explicit parallel extension of ANSI C, with common and familiar syntax and semantics for parallel C and simple extensions to ANSI C
A partitioned global address space (PGAS) language
Based on ideas in Split-C, AC, and PCP
Similar to the C language philosophy:
Programmers are clever and careful, and may need to get close to hardware to get performance, but can get in trouble
SPMD execution model (THREADS, MYTHREAD), static vs. dynamic threads

8
Data Distribution
9
Data Distribution

Distinction between memory spaces through extensions of the type system (shared qualifier):
shared int ours;
shared int X[THREADS];
shared int *ptr;
int mine;
Data in shared address space:
Static: scalars (on thread 0), distributed arrays
Dynamic: dynamic memory management (upc_alloc, upc_global_alloc, upc_all_alloc)

10
Data Layout

Data layout is controlled through extensions of the type system (layout specifiers):
[0] or [] (indefinite layout, all on 1 thread):
shared [] int *p;
Empty (cyclic layout):
shared int array[THREADS*M];
[*] (blocked layout):
shared [*] int array[THREADS*M];
[b] or [b1][b2]...[bn] = [b1*b2*...*bn] (block cyclic):
shared [B] int array[THREADS*M];
Element array[i] has affinity with thread (i / B) % THREADS
Layout determines pointer arithmetic rules
Introspection (upc_threadof, upc_phaseof, upc_blocksizeof)
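
The affinity rule above can be tried out in plain C. The sketch below is only an illustration: `shared_pos` and `locate` are hypothetical names, not part of UPC, and the local index assumes that each thread stores its blocks contiguously in the order it receives them.

```c
#include <assert.h>
#include <stddef.h>

/* Position of element i of a block-cyclic array `shared [B] T array[...]`.
   Names are illustrative; they are not UPC library calls. */
typedef struct {
    size_t thread;  /* owning thread: (i / B) % THREADS             */
    size_t phase;   /* offset inside the block: i % B               */
    size_t local;   /* index within the owner's local storage       */
} shared_pos;

shared_pos locate(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    shared_pos p;
    p.thread = (i / block) % threads;
    p.phase  = i % block;
    /* Blocks are dealt round-robin, so the owner already holds
       i / (block * threads) complete blocks before this one. */
    p.local  = (i / (block * threads)) * block + p.phase;
    return p;
}
```

With B = 1 the function reproduces the cyclic layout, and with B equal to the number of elements per thread it reproduces the blocked layout.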

11
UPC Pointers Implementation

In UPC, pointers to shared objects have three fields:
thread number
local address of block
phase (specifies position in the block)
Example: Cray T3E implementation
Pointer arithmetic can be expensive in UPC

[Figure: two bit layouts of a pointer-to-shared, one ordered Virtual Address | Thread | Phase and one ordered Phase | Thread | Virtual Address]
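
To see why pointer arithmetic can be expensive, here is a minimal C sketch of incrementing a three-field pointer by one element. The `sptr` struct and `sptr_next` are hypothetical; real implementations pack the fields into machine words and differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-field representation of a pointer-to-shared. */
typedef struct {
    size_t thread;  /* thread that owns the target element           */
    size_t phase;   /* position of the element inside its block      */
    size_t addr;    /* element offset in the owner's local memory    */
} sptr;

/* Advance p by one element of a `shared [block]` array. */
sptr sptr_next(sptr p, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    assert(p.phase < block && p.thread < threads);
    if (p.phase + 1 < block) {      /* stay inside the current block */
        p.phase++;
        p.addr++;
        return p;
    }
    p.phase = 0;                    /* block exhausted: change thread */
    if (p.thread + 1 < threads) {
        p.thread++;
        p.addr -= block - 1;        /* same block row, first slot */
    } else {
        p.thread = 0;               /* wrap around to the next block row */
        p.addr++;
    }
    return p;
}
```

A single increment needs two comparisons and may update all three fields, whereas a private C pointer needs one addition.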
12
UPC Pointers
Where does the pointer reside, and where does it point?
Pointer resides in    Points to local memory    Points to shared space
Private               PP (p1)                   SP (p2)
Shared                PS (p3)                   SS (p4)
int *p1;               /* private pointer to local memory */
shared int *p2;        /* private pointer to shared space */
int *shared p3;        /* shared pointer to local memory */
shared int *shared p4; /* shared pointer to shared space */
13
UPC Pointers
int *p1;               /* private pointer to local memory */
shared int *p2;        /* private pointer to shared space */
int *shared p3;        /* shared pointer to local memory */
shared int *shared p4; /* shared pointer to shared space */
Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.
14
Common Uses for UPC Pointer Types

int *p1
These pointers are fast (just like C pointers)
Use to access local data in part of code performing local work
Often cast a pointer-to-shared to one of these to get faster access to shared data that is local
shared int *p2
Use to refer to remote data
Larger and slower due to test-for-local + possible communication
int *shared p3
Not recommended
shared int *shared p4
Use to build shared linked structures, e.g., a linked list
typedef is the UPC programmer's best friend

15
UPC Pointers Usage Rules

Pointer arithmetic supports blocked and non-blocked array distributions
Casting of shared to private pointers is allowed, but not vice versa!
When casting a pointer-to-shared to a pointer-to-local, the thread number of the pointer-to-shared may be lost
Casting of shared to local is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast

16
Work Distribution
17
Work Distribution upc_forall()

Owner computes rule: loop over all, work on those owned by you
UPC adds a special type of loop:
upc_forall(init; test; step; affinity)
    statement;
Programmer indicates the iterations are independent
Undefined if there are dependencies across threads
Affinity expression indicates which iterations to run on each thread. It may have one of two types:
Integer: affinity % THREADS == MYTHREAD
Pointer: upc_threadof(affinity) == MYTHREAD
Syntactic sugar for
for (i = MYTHREAD; i < N; i += THREADS)
or
for (i = 0; i < N; i++)
    if (MYTHREAD == i % THREADS)
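
The two loop forms above select the same iterations for each thread. The following plain C sketch emulates one thread at a time; `run_strided` and `run_filtered` are hypothetical names, and `mythread` and `threads` stand in for MYTHREAD and THREADS.

```c
#include <stddef.h>

/* Form 1: each thread strides directly through its own iterations. */
size_t run_strided(size_t mythread, size_t threads, size_t n, int *hit)
{
    size_t count = 0;
    for (size_t i = mythread; i < n; i += threads) {
        hit[i] += 1;            /* stand-in for the loop body */
        count++;
    }
    return count;
}

/* Form 2: each thread scans every iteration and filters by affinity. */
size_t run_filtered(size_t mythread, size_t threads, size_t n, int *hit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (mythread == i % threads) {
            hit[i] += 1;        /* stand-in for the loop body */
            count++;
        }
    }
    return count;
}
```

Form 1 does less work per thread; form 2 is closer to how a general affinity expression has to be evaluated when the compiler cannot simplify it.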

18
Inter-Processor Communication
19
Data Communication

Implicit (assignments):
shared int *p;
*p = 7;
Explicit (bulk synchronous) point-to-point (upc_memget, upc_memput, upc_memcpy, upc_memset)
Collective operations: http://www.gwu.edu/upc/docs/
Data movement: broadcast, scatter, gather, ...
Computational: reduce, prefix, ...
Interface has synchronization modes (??)
Avoid over-synchronizing (barrier before/after is simplest semantics, but may be unnecessary)
Data being collected may be read/written by any thread simultaneously

20
Data Communication

The UPC Language Specification V1.2 does not contain non-blocking communication primitives
Extensions for non-blocking communication are available in the BUPC implementation
UPC V1.2 does not have higher-level communication primitives for point-to-point communication
See BUPC extensions for:
scatter, gather
VIS
Should non-blocking communication be a first-class language citizen?

21
Synchronization
22
Synchronization

Point-to-point synchronization: locks
Opaque type: upc_lock_t
Dynamically managed: upc_all_lock_alloc, upc_global_lock_alloc
Global synchronization
Barriers (unaligned): upc_barrier
Split-phase barriers:
upc_notify;   // this thread is ready for barrier
// do computation unrelated to barrier
upc_wait;     // wait for others to be ready

23
Memory Consistency in UPC

The consistency model defines the order in which one thread may see another thread's accesses to memory
If you write a program with unsynchronized accesses, what happens?
Does this work?
Thread 1: data = ...;  flag = 1;
Thread 2: while (!flag) { };  ... = data;  // use the data
UPC has two types of accesses:
Strict: will always appear in order
Relaxed: may appear out of order to other threads
Consistency is associated either with a program scope (file, statement):
#pragma upc strict
flag = 1;
or with a type:
shared strict int flag;

24
Sample UPC Code
25
Matrix Multiplication in UPC

Given two integer matrices A (N x P) and B (P x M), we want to compute C = A x B.
Entries c_{ij} in C are computed by the formula
$$c_{ij} = \sum_{l=0}^{P-1} a_{il} \, b_{lj}$$

26
Serial C code

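#define N 4
#define P 4
#define M 4
int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}, c[N][M];
int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
int main(void)
{
    int i, j, l;
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++) {
            c[i][j] = 0;                      /* clear the accumulator */
            for (l = 0; l < P; l++)
                c[i][j] += a[i][l] * b[l][j]; /* row of a times column of b */
        }
    return 0;
}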

27
Domain Decomposition

Exploits locality in matrix multiplication

A (N × P) is decomposed row-wise into blocks of size (N × P) / THREADS:
Thread 0: elements 0 .. (N*P / THREADS) - 1
Thread 1: elements (N*P / THREADS) .. (2*N*P / THREADS) - 1
...
Thread THREADS-1: elements ((THREADS-1)*N*P) / THREADS .. (THREADS*N*P / THREADS) - 1

B (P × M) is decomposed column-wise into blocks of M / THREADS columns:
Thread 0: columns 0 .. (M/THREADS) - 1
...
Thread THREADS-1: columns ((THREADS-1) × M)/THREADS .. (M-1)

Note: N and M are assumed to be multiples of THREADS
28
UPC Matrix Multiplication Code
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
shared [N*P/THREADS] int a[N][P] = {1,..,16}, c[N][M];
// data distribution: a and c are blocked shared matrices
shared [M/THREADS] int b[P][M] = {0,1,0,1, ... ,0,1};
void main (void) {
    int i, j, l; // private variables
    upc_forall(i = 0; i < N; i++; &c[i][0]) { // work distribution
        for (j = 0; j < M; j++) {
            c[i][j] = 0;
            for (l = 0; l < P; l++)
                // implicit communication
                c[i][j] += a[i][l]*b[l][j];
        }
    }
}
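
The work distribution in the UPC version can be emulated in plain C by running the loop once per thread and skipping rows the thread does not own. This is a sketch under assumptions: `row_owner` and `matmul_spmd` are hypothetical helpers, threads run one after another rather than in parallel, and c is assumed to use the block size N*M/THREADS.

```c
#include <assert.h>

#define N 4
#define P 4
#define M 4

/* Owner of row i of c for a blocked layout with N*M/threads elements
   per thread, mirroring the affinity expression &c[i][0]. */
int row_owner(int i, int threads)
{
    assert(threads > 0 && (N * M) % threads == 0);
    int block = N * M / threads;        /* elements per thread */
    return (i * M / block) % threads;
}

/* Run the loop body once per emulated thread; each thread only
   computes the rows of c that it owns. */
void matmul_spmd(int a[N][P], int b[P][M], int c[N][M], int threads)
{
    for (int mythread = 0; mythread < threads; mythread++)
        for (int i = 0; i < N; i++) {
            if (row_owner(i, threads) != mythread)
                continue;               /* not my row: skip it */
            for (int j = 0; j < M; j++) {
                c[i][j] = 0;
                for (int l = 0; l < P; l++)
                    c[i][j] += a[i][l] * b[l][j];
            }
        }
}
```

Because every row has exactly one owner, the result matches the serial code for any thread count that divides N*M.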
29
UPC Matrix Multiplication With Block Copy
#include <upc_relaxed.h>
shared [N*P/THREADS] int a[N][P], c[N][M];
// a and c are blocked shared matrices
shared [M/THREADS] int b[P][M];
int b_local[P][M];
void main (void) {
    int i, j, l; // private variables
    // explicit bulk communication
    upc_memget(b_local, b, P*M*sizeof(int));
    // work distribution (c aligned with a??)
    upc_forall(i = 0; i < N; i++; &c[i][0])
        for (j = 0; j < M; j++) {
            c[i][j] = 0;
            for (l = 0; l < P; l++)
                c[i][j] += a[i][l]*b_local[l][j];
        }
}
30
Programming in UPC

Don't ask yourself what your compiler can do for you; ask yourself what you can do for your compiler!

31
Principles of Performance Software

To minimize the cost of communication:
Use the best available communication mechanism on a given machine
Hide communication by overlapping (programmer or compiler or runtime)
Avoid synchronization using data-driven execution (programmer or runtime)
Tune communication using performance models when they work (??), search when they don't (programmer or compiler/runtime)

32
Best Available Communication Mechanism

Performance is determined by overhead, latency and bandwidth
Data transfer (one-sided communication) is often faster than (two-sided) message passing
Semantics limit performance:
In-order message delivery
Message and tag matching
Need to acquire information from remote host processor
Synchronization (message receipt) tied to data transfer

33
One-Sided vs Two-Sided Theory
[Figure: two-sided message (message id, data payload) vs. one-sided put message (address, data payload), shown with the network interface, host CPU, and memory]

A two-sided message needs to be matched with a receive to identify the memory address to put data
Offloaded to network interface in networks like Quadrics
Need to download match tables to interface (from host)
A one-sided put/get message can be handled directly by a network interface with RDMA support
Avoid interrupting the CPU or storing data from CPU (preposts)

34
GASNet Portability and High-Performance
GASNet: better for overhead and latency across machines
UPC Group; GASNet design by Dan Bonachea
35
GASNet Portability and High-Performance
GASNet excels at mid-range sizes: important for overlap
GASNet at least as high (comparable) for large messages
Joint work with UPC Group; GASNet design by Dan Bonachea
36
One-Sided vs. Two-Sided Practice
NERSC Jacquard machine with Opteron processors

InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
Half power point (N½) differs by one order of magnitude
This is not a criticism of the implementation!

Yelick, Hargrove, Bonachea
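
For readers unfamiliar with the half power point, here is a small C sketch under a simple linear cost model, T(n) = t0 + n / peak. This model is an assumption made for illustration and is not taken from the slides; `effective_bw` and `half_power_point` are hypothetical names. Under that model, the message size at which half the peak bandwidth is reached is N½ = t0 * peak.

```c
#include <assert.h>

/* Effective bandwidth of an n-byte message under the assumed linear
   model T(n) = t0 + n / peak (t0 in seconds, peak in bytes/second). */
double effective_bw(double n, double t0, double peak)
{
    assert(n >= 0.0 && t0 > 0.0 && peak > 0.0);
    return n / (t0 + n / peak);
}

/* Message size at which effective bandwidth reaches peak / 2. */
double half_power_point(double t0, double peak)
{
    assert(t0 > 0.0 && peak > 0.0);
    return t0 * peak;
}
```

In this model, cutting the per-message startup cost t0 by a factor of ten at the same peak bandwidth shrinks N½ by the same factor, which is why a lower-overhead layer reaches good bandwidth at much smaller message sizes.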
37
Overlap
38
Hide Communication by Overlapping

A programming model that decouples data transfer and synchronization (init, sync)
BUPC has several extensions (programmer):
explicit handle based
region based
implicit handle based
Examples:
3D FFT (programmer)
split-phase optimizations (compiler)
automatic overlap (runtime)

39
Performing a 3D FFT

NX x NY x NZ elements spread across P processors
Will use a 1-dimensional layout in the Z dimension
Each processor gets NZ / P planes of NX x NY elements per plane

[Figure: 1D partition of the NX x NY x NZ cube for P = 4; processors p0 .. p3 each hold NZ/P planes]
Bell, Nishtala, Bonachea, Yelick
40
Performing a 3D FFT (part 2)

Perform an FFT in all three dimensions
With 1D layout, 2 out of the 3 dimensions are local while the last Z dimension is distributed

Step 1: FFTs on the columns (all elements local)
Step 2: FFTs on the rows (all elements local)
Step 3: FFTs in the Z-dimension (requires communication)
Bell, Nishtala, Bonachea, Yelick
41
Performing the 3D FFT (part 3)

Can perform Steps 1 and 2 since all the data is available without communication
Perform a global transpose of the cube
Allows step 3 to continue

[Figure: transpose of the cube]
Bell, Nishtala, Bonachea, Yelick
42
Communication Strategies for 3D FFT

Three approaches:
Chunk (all rows with the same destination):
Wait for 2nd dim FFTs to finish
Minimize messages
Slab (all rows in a single plane with the same destination):
Wait for chunk of rows destined for 1 proc to finish
Overlap with computation
Pencil (1 row):
Send each row as it completes
Maximize overlap and match natural layout

Bell, Nishtala, Bonachea, Yelick
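
The trade-off between the three strategies can be made concrete by counting messages. The C sketch below rests on assumptions not stated on the slide: each row of NX elements has exactly one destination thread, NY and NZ are divisible by P, and the block a thread keeps for itself is counted as a message. `msg_plan` and the `plan_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t messages;  /* messages sent by one thread in the transpose */
    size_t elements;  /* elements carried by each message             */
} msg_plan;

/* Chunk: one message per destination, sent after all local FFTs. */
msg_plan plan_chunk(size_t nx, size_t ny, size_t nz, size_t p)
{
    assert(p > 0 && ny % p == 0 && nz % p == 0);
    msg_plan m = { p, nx * (ny / p) * (nz / p) };
    return m;
}

/* Slab: one message per destination for every local plane. */
msg_plan plan_slab(size_t nx, size_t ny, size_t nz, size_t p)
{
    assert(p > 0 && ny % p == 0 && nz % p == 0);
    msg_plan m = { (nz / p) * p, nx * (ny / p) };
    return m;
}

/* Pencil: one message per row, sent as soon as the row is done. */
msg_plan plan_pencil(size_t nx, size_t ny, size_t nz, size_t p)
{
    assert(p > 0 && ny % p == 0 && nz % p == 0);
    msg_plan m = { (nz / p) * ny, nx };
    return m;
}
```

All three plans move the same total volume per thread; they differ only in how many messages carry it and how early the first message can leave.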
43
NAS FT Variants Performance Summary
[Chart: MFlops per thread for Chunk (NAS FT with FFTW), Best MPI (always slabs), and Best UPC (always pencils), with a .5 Tflops annotation; platforms and processor counts: Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, Elan4 512]

Slab is always best for MPI: small message cost too high
Pencil is always best for UPC: more overlap
44
Bisection Bandwidth Limits

Full bisection bandwidth is (too) expensive
During an all-to-all communication phase:
Effective (per-thread) bandwidth is a fractional share
Significantly lower than link bandwidth
Use smaller messages mixed with computation to avoid swamping the network

Bell, Nishtala, Bonachea, Yelick
45
Compiler Optimizations

Naïve scheme (blocking call for each load/store) is not good enough
PRE (partial redundancy elimination) on shared expressions
Reduce the amount of unnecessary communication
Apply also to UPC shared pointer arithmetic
Split-phase communication
Hide communication latency through overlapping
Message coalescing
Reduce number of messages to save startup overhead and achieve better bandwidth

Chen, Iancu, Yelick
46
Benchmarks

Gups
Random access (read/modify/write) to distributed
array
Mcop
Parallel dynamic programming algorithm
Sobel
Image filter
Psearch
Dynamic load balancing/work stealing
Barnes Hut
Shared memory style code from SPLASH2
NAS FT/IS
Bulk communication

47
Performance Improvements
improvement over unoptimized
Chen, Iancu, Yelick
48
Data Driven Execution
49
Data-Driven Execution

Many algorithms require synchronization with a remote processor
Mechanisms (BUPC extensions):
Signaling store: raise a semaphore upon transfer
Remote enqueue: put a task in a remote queue
Remote execution: floating functions (X10 activities)
Many algorithms have irregular data dependencies (LU)
Mechanisms (BUPC extensions):
Cooperative multithreading

50
Matrix Factorization
[Figure: blocked LU factorization showing the completed part of U, the completed part of L, the panel being factored, the trailing matrix to be updated, and blocks A(i,j), A(i,k), A(j,i), A(j,k)]
Panel factorizations involve communication for pivoting
Husbands,Yelick
51
Three Strategies for LU Factorization

Organize in bulk-synchronous phases (ScaLAPACK)
Factor a block column, then perform updates
Relatively easy to understand/debug, but extra synchronization
Overlapping phases (HPL)
Work associated with one block column factorization can be overlapped
Parameter to determine how many (need temp space accordingly)
Event-driven multithreaded (UPC Linpack)
Each thread runs an event handler loop
Tasks: factorization (w/ pivoting), update trailing, update upper
Tasks may suspend (voluntarily) to wait for data, synchronization, etc.
Data moved with remote gets (synchronization built-in)
Must gang together for factorizations
Scheduling priorities are key to performance and deadlock avoidance

Husbands,Yelick
52
UPC-HP Linpack Performance

Comparable to HPL (numbers from HPCC database)
Faster than ScaLAPACK due to less synchronization
Large scaling of UPC code on Itanium/Quadrics (Thunder):
2.2 TFlops on 512p and 4.4 TFlops on 1024p

Husbands, Yelick
53
Performance Tuning
Iancu, Strohmaier
54
Efficient Use of One-Sided

Implementations need to be efficient and have scalable performance
Application-level use of non-blocking (NB) communication benefits from new design techniques: finer-grained decompositions and overlap
Overlap exercises the system in unexpected ways
Prototyping of implementations for large-scale systems is a hard problem: non-linear behavior of networks, communication scheduling is NP-hard
Need methodology for fast prototyping: understand interaction network/CPU at large scale

55
Performance Tuning

Performance is determined by overhead, latency and bandwidth, computational characteristics and communication topology
It's all relative: performance characteristics are determined by system load
Basic principles:
Minimize communication overhead
Avoid congestion:
control injection rate (end-point)
avoid hotspots (end-point, network routes)
Have to use models.
What kind of questions can a model answer?

56
Example Vector-Add

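shared double *rdata;
double *ldata, *buf;

/* Version 1: one blocking get, then compute */
upc_memget(buf, rdata, N);
for (i = 0; i < N; i++)
    ldata[i] += buf[i];

/* Version 2: N/B non-blocking gets, then sync and compute per block */
for (i = 0; i < N/B; i++)
    h[i] = upc_memget_nb(buf + i*B, rdata + i*B, B);
for (i = 0; i < N/B; i++) {
    sync(h[i]);
    for (j = 0; j < B; j++)
        ldata[i*B + j] += buf[i*B + j];
}

[Figure: timeline in which gets are issued in groups of b (GET_nb(B0) ... GET_nb(Bb); GET_nb(Bb+1) ... GET_nb(B2b); GET_nb(B2b+1) ... GET_nb(B3b)) and interleaved with sync(B0) compute(B0) ... sync(Bb) compute(Bb) ... sync(BN) compute(BN)]
Which implementation is faster? What is B, b?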
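
The two vector-add variants can be compared for correctness in plain C. In this sketch the non-blocking get is a stand-in that copies immediately, so it shows the loop structure only and says nothing about timing; `nb_handle`, `get_nb`, `sync_handle`, and `MAX_BLOCKS` are hypothetical and are not BUPC calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a non-blocking get: copies at once and returns a dummy
   handle. A real runtime would return before the data has arrived. */
typedef int nb_handle;
nb_handle get_nb(double *dst, const double *src, size_t count)
{
    memcpy(dst, src, count * sizeof *dst);
    return 0;
}
void sync_handle(nb_handle h) { (void)h; }

/* Version 1: one blocking transfer, then all of the computation. */
void vadd_blocking(double *ldata, const double *rdata, double *buf, size_t n)
{
    memcpy(buf, rdata, n * sizeof *buf);
    for (size_t i = 0; i < n; i++)
        ldata[i] += buf[i];
}

/* Version 2: issue n/b transfers of b elements, then sync and compute
   block by block so later transfers can overlap earlier computation. */
#define MAX_BLOCKS 64
void vadd_pipelined(double *ldata, const double *rdata, double *buf,
                    size_t n, size_t b)
{
    assert(b > 0 && n % b == 0 && n / b <= MAX_BLOCKS);
    nb_handle h[MAX_BLOCKS];
    size_t blocks = n / b;
    for (size_t i = 0; i < blocks; i++)
        h[i] = get_nb(buf + i * b, rdata + i * b, b);
    for (size_t i = 0; i < blocks; i++) {
        sync_handle(h[i]);              /* wait only for block i */
        for (size_t j = 0; j < b; j++)
            ldata[i * b + j] += buf[i * b + j];
    }
}
```

Both versions produce the same ldata; which one is faster depends on the block size and on how many transfers the network can keep in flight, which is the tuning question the slide poses.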
57
Prototyping

Usual approach: use a time-accurate performance model (applications, automatically tuned collectives)
Models (LogP, ...) don't capture important behavior (parallelism, congestion, resource constraints, non-linear behavior)
Exhaustive search of the optimization space
Validated only at low concurrency (tens of procs), might break at high concurrency, might break for torus networks
Our approach:
Use performance model for ideal implementation
Understand hardware resource constraints and the variation of performance parameters (understand trends, not absolute values)
Derive implementation constraints to satisfy both optimal implementation and hardware constraints
Force implementation parameters to converge towards optimal

58
Performance

Network: bandwidth and overhead
Application: communication pattern and schedule (congestion), computation

Overhead is determined by message size, communication schedule, hardware flow of control
Bandwidth is determined by message size, communication schedule, fairness of allocation
Iancu, Strohmaier
59
Validation

Understand network behavior in the presence of non-blocking communication (InfiniBand, Elan)
Develop performance model for scenarios widely encountered in applications (p2p, scatter, gather, all-to-all) and a variety of aggressive optimization techniques (strip mining, pipelining, communication schedule skewing)
Use both micro-benchmarks and application kernels to validate approach

Iancu, Strohmaier
60
Findings

Can choose optimal values for implementation parameters
Time-accurate model for an implementation: hard to develop, inaccurate at high concurrency
Methodology does not require exhaustive search of the optimization space (only p2p and qualitative behavior of gather)
In practice one can produce templatized implementations for an algorithm and use our approach to determine optimal values: code generation (UPC), automatic tuning of collective operations, application development
Need to further understand the mathematical and statistical properties

61
End
62
Avoid synchronization Data-driven Execution

Many algorithms require synchronization with a remote processor
What is the right mechanism in a PGAS model for doing this?
Is it still one-sided?
Part 3: Event-Driven Execution Models

63
Mechanisms for Event-Driven Execution

Put operation does a send-side notification
Needed for memory consistency model ordering
Need to signal remote side on completion
Strategies:
Have remote side do a get (works in some algorithms)
Put + strict flag write: do a put, wait for completion, then do another (strict) put
Pipelined put: put-flag works only on ordered networks
Signaling put: add new store operation that embeds signal (2nd remote address) into single message