Unified Parallel C at NERSC

Slides: 43

Provided by: yel3

Learn more at: https://people.eecs.berkeley.edu

1
Unified Parallel C at NERSC

Kathy Yelick
EECS, U.C. Berkeley and NERSC/LBNL
UPC Team: Dan Bonachea, Jason Duell, Paul Hargrove,
Parry Husbands, Costin Iancu, Mike Welcome,
Christian Bell

2
Outline

Motivation for a new class of languages
Programming models
Architectural trends
Overview of Unified Parallel C (UPC)
Programmability advantage
Performance opportunity
Status
Next step
Related projects

3
Programming Model 1: Shared Memory

Program is a collection of threads of control.
Many languages allow threads to be created
dynamically.
Each thread has a set of private variables, e.g.
local variables on the stack.
Collectively, the threads have a set of shared
variables, e.g., static variables, shared common
blocks, global heap.
Threads communicate implicitly by writing/reading
shared variables.
Threads coordinate using synchronization
operations on shared variables

[Figure: threads P0 ... Pn, each with private variables (y), reading and writing x in the shared region]
4
Programming Model 2: Message Passing

Program consists of a collection of named
processes.
Usually fixed at program startup time
Thread of control plus local address space -- NO
shared data.
Logically shared data is partitioned over local
processes.
Processes communicate by explicit send/receive
pairs
Coordination is implicit in every communication
event.
MPI is the most common example

[Figure: processes P0 ... Pn with local data X and Y; one process executes "send P0,X" and another executes "recv Pn,Y"]
5
Advantages/Disadvantages of Each Model

Shared memory
Programming is easier
Can build large shared data structures
Machines don't scale
SMPs typically < 16 processors (Sun, DEC, Intel,
IBM)
Distributed shared memory < 128 (SGI)
Performance is hard to predict and control
Message passing
Machines easier to build from commodity parts
Can scale (given sufficient network)
Programming is harder
Distributed data structures exist only in the
programmer's mind
Tedious packing/unpacking of irregular data
structures

6
Global Address Space Programming

Intermediate point between message passing and
shared memory
Program consists of a collection of processes.
Fixed at program startup time, like MPI
Local and shared data, as in shared memory model
But, shared data is partitioned over local
processes
Remote data stays remote on distributed memory
machines
Processes communicate by reads/writes to shared
variables
Examples are UPC, Titanium, CAF, Split-C
Note: These are not data-parallel languages;
heroic compilers are not required

7
GAS Languages on Clusters of SMPs

SMPs are the fastest commodity machines, so they
are used as nodes in large-scale clusters
Common names
CLUMP: Cluster of SMPs
Hierarchical machines, constellations
Most modern machines look like this
Millennium, IBM SPs (not the T3E), ...
What is an appropriate programming model?
Use message passing throughout
Unnecessary packing/unpacking overhead
Hybrid models
Write 2 parallel programs (MPI + OpenMP or
Threads)
Global address space
Only adds a test (on/off node) before local
read/write

8
Top 500 Supercomputers

Listing of the 500 most powerful computers in the
world
- Yardstick: Rmax from the LINPACK MPP benchmark
Ax = b, dense problem
- Dense LU factorization (dominated by matrix
multiply)
Updated twice a year: at SC'xy in the States in
November
Meeting in Mannheim, Germany in June
All data (and slides) available from
www.top500.org
Also measures N_1/2 (problem size required to get
half speed)

[Figure: performance curve, rate versus problem size]
9
(No Transcript)
10
(No Transcript)
11
Outline

Motivation for a new class of languages
Programming models
Architectural trends
Overview of Unified Parallel C (UPC)
Programmability advantage
Performance opportunity
Status
Next step
Related projects

12
Parallelism Model in UPC

UPC uses an SPMD model of parallelism
A set of THREADS threads working independently
Two compilation models
THREADS may be fixed at compile time or
Dynamically set at program startup time
MYTHREAD specifies thread index (0..THREADS-1)
Basic synchronization mechanisms
Barriers (normal and split-phase), locks
What UPC does not do automatically
Determine data layout
Load balance: move computations
Caching: move data
These are intentionally left to the programmer

13
Shared and Private Variables in UPC

A shared variable has one instance, shared by all
threads.
Affinity to thread 0 by default (allocated in
processor 0's memory)
A private variable has an instance per thread
Example
    int x;          // private copy for each processor
    shared int y;   // one copy on P0, shared by all others
    x = 0; y = 0;
    x += 1; y += 1;
After executing this code
x will be 1 in all threads; y will be between 1
and THREADS
Shared scalar variables are somewhat rare because
they cannot be automatic (declared in a function)
(Why not?)
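
One way to see why y can end anywhere between 1 and THREADS is to replay the unsynchronized update under two hand-picked schedules. The sketch below is plain C, not UPC: the function names, the MAX_THREADS bound, and the two schedules are illustrative assumptions, and the threads are simulated sequentially.

```c
#include <assert.h>

#define MAX_THREADS 64   /* arbitrary bound for this illustration */

/* Schedule 1: each thread finishes its read-modify-write of y
   before the next thread starts. */
int final_y_serialized(int threads) {
    assert(threads > 0 && threads <= MAX_THREADS);
    int x[MAX_THREADS];   /* one private x per simulated thread */
    int y = 0;            /* the single shared y */
    for (int t = 0; t < threads; t++) {
        x[t] = 0;
        x[t] += 1;        /* private update */
        assert(x[t] == 1);
        y += 1;           /* sees the previous thread's write */
    }
    return y;
}

/* Schedule 2: every thread reads y before any thread writes it
   back, so all but one update is lost. */
int final_y_all_read_first(int threads) {
    assert(threads > 0 && threads <= MAX_THREADS);
    int seen[MAX_THREADS];
    int y = 0;
    for (int t = 0; t < threads; t++) seen[t] = y;      /* all read 0 */
    for (int t = 0; t < threads; t++) y = seen[t] + 1;  /* all write 1 */
    return y;
}
```

The first schedule gives the upper bound (THREADS) and the second the lower bound (1); real executions can land anywhere in between.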

14
UPC Pointers

Pointers may point to shared or private variables
Same syntax for use, just add qualifier
    shared int *sp;
    int *lp;
sp is a pointer to an integer residing in the
shared memory space.
sp is called a shared pointer (somewhat sloppy).

[Figure: global address space; x = 3 sits in the shared region, and the private region holds one sp per thread]
15
UPC Pointers

May also have a pointer variable that is shared.
    shared int *shared sps;
    int *shared spl;   // does this make sense?
The most common case is a private variable that
points to a shared object (called a shared
pointer)

[Figure: global address space; sps sits in the shared region]
16
Shared and Private Rules

Default: Types that are neither shared-qualified
nor private-qualified are considered private.
This makes porting uniprocessor libraries easy
Makes porting shared memory code somewhat harder
Casting pointers
A pointer to a private variable may not be cast
to a shared type.
If a pointer to a shared variable is cast to a
pointer to a private object:
If the object has affinity with the casting
thread, this is fine.
If not, attempts to dereference that private
pointer are undefined. (Some compilers may give
better errors than others.)
Why?

17
Shared Arrays in UPC

Shared array elements are spread across the
threads
    shared int x[THREADS];      /* One element per thread */
    shared int y[3][THREADS];   /* 3 elements per thread */
    shared int z[3*THREADS];    /* 3 elements per thread, cyclic */
In the pictures below
Assume THREADS = 4
Elements with affinity to processor 0 are red

[Figure: layouts of x, y (blocked; of course, this is really a 2D array), and z (cyclic)]
18
Example Vector Addition

Questions about parallel vector additions
How to lay out data (here it is cyclic)
Which processor does what (here it is owner
computes)

    /* vadd.c */
    #include <upc_relaxed.h>
    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];     /* cyclic layout */
    void main() {
        int i;
        for (i = 0; i < N; i++)
            if (MYTHREAD == i%THREADS)   /* owner computes */
                sum[i] = v1[i] + v2[i];
    }
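
The owner-computes rule can be checked without a UPC compiler by replaying the loop once per simulated thread. This is a plain C sketch, not the UPC program itself: the function name and the `writes` counter are hypothetical additions used only to show that each element is computed by exactly one thread.

```c
#include <assert.h>

/* Runs the vadd loop once per simulated thread. writes[i] counts how
   many threads stored into sum[i]; owner-computes means exactly one. */
void vadd_owner_computes(const int *v1, const int *v2, int *sum,
                         int *writes, int n, int threads) {
    assert(n >= 0 && threads > 0);
    for (int i = 0; i < n; i++) writes[i] = 0;
    for (int mythread = 0; mythread < threads; mythread++) {
        for (int i = 0; i < n; i++) {
            if (mythread == i % threads) {   /* cyclic layout: owner of i */
                sum[i] = v1[i] + v2[i];
                writes[i]++;
            }
        }
    }
}
```

Every simulated thread still walks all n iterations and skips the ones it does not own, which is the overhead a work-sharing loop construct is meant to hide.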
19
Shared Pointers

In the C tradition, arrays can be accessed through
pointers
Here is the vector addition example using pointers

    #include <upc_relaxed.h>
    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];
    void main() {
        int i;
        shared int *p1, *p2;
        p1 = v1; p2 = v2;
        for (i = 0; i < N; i++, p1++, p2++)
            if (i%THREADS == MYTHREAD)
                sum[i] = *p1 + *p2;
    }

[Figure: pointer p1 stepping through shared array v1]
20
Work Sharing with upc_forall()

Iterations are independent
Each thread gets a bunch of iterations
Simple C-like syntax and semantics
    upc_forall(init; test; loop; affinity)
        statement
Affinity field to distribute the work
Round robin
Chunks of iterations
Semantics are undefined if there are dependencies
between iterations
Programmer has indicated iterations are
independent
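
The two distribution styles named above can be written as integer affinity expressions. The sketch below is plain C that only computes which thread would run iteration i, under the rule that an iteration runs on the thread equal to the affinity value modulo THREADS. The function names are hypothetical, and `(i * threads) / n` is one common way to obtain chunks, not something taken from the slides.

```c
#include <assert.h>

/* Thread that runs iteration i when the affinity expression is i:
   iterations are dealt out round robin. */
int round_robin_owner(int i, int threads) {
    assert(i >= 0 && threads > 0);
    return i % threads;
}

/* Thread that runs iteration i when the affinity expression is
   (i * threads) / n: each thread gets a contiguous chunk. */
int chunked_owner(int i, int n, int threads) {
    assert(n > 0 && threads > 0 && i >= 0 && i < n);
    return ((i * threads) / n) % threads;
}
```

With 8 iterations and 4 threads, round robin assigns iterations 0 and 4 to thread 0, while chunking assigns iterations 0 and 1 to thread 0.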

21
Vector Addition with upc_forall

The loop in vadd is common, so there is
upc_forall
4th argument is an int expression that gives
affinity
Iteration executes when
affinity%THREADS is MYTHREAD

    /* vadd.c */
    #include <upc_relaxed.h>
    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];
    void main() {
        int i;
        upc_forall(i = 0; i < N; i++; i)
            sum[i] = v1[i] + v2[i];
    }

22
UPC Vector Matrix Multiplication Code

Here is one possible matrix-vector multiplication

    // vect_mat_mult.c
    #include <upc_relaxed.h>
    shared int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];
    void main (void) {
        int i, j, l;
        upc_forall(i = 0; i < THREADS; i++; i) {
            c[i] = 0;
            for (l = 0; l < THREADS; l++)
                c[i] += a[i][l]*b[l];
        }
    }
23
Data Distribution

[Figure: distribution of A, B, and C across Thread 0, Thread 1, and Thread 2]
24
A Better Data Distribution

[Figure: improved distribution of A, B, and C across Thread 0, Thread 1, and Thread 2]
25
Layouts in General

All non-array objects have affinity with thread
zero.
Array layouts are controlled by layout
specifiers.
    layout_specifier ::
        null
        layout_specifier [ integer_expression ]
The affinity of an array element is defined in
terms of the block size, a compile-time constant,
and THREADS, a runtime constant.
Element i has affinity with thread
    (i / block_size) % THREADS.
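
Reading the rule above as integer division by the block size followed by a modulo over the thread count, the owner of any element can be computed directly. A minimal plain C sketch follows; the function name is hypothetical.

```c
#include <assert.h>

/* Thread with affinity to element i of a 1D shared array:
   (i / block_size) % threads. */
int element_affinity(int i, int block_size, int threads) {
    assert(i >= 0 && block_size > 0 && threads > 0);
    return (i / block_size) % threads;
}
```

A block size of 1 gives the cyclic layout: with 4 threads, element 5 lands on thread 1. With a block size of 3, elements 3 through 5 share thread 1 and element 12 wraps back to thread 0.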

26
Layout Terminology

Notation is HPF, but terminology is
language-independent
Assume there are 4 processors

(Block, *)
(*, Block)
(Block, Block)
(Cyclic, *)
(Cyclic, Block)
(Cyclic, Cyclic)
27
2D Array Layouts in UPC

Array a1 has a row layout and array a2 has a
block row layout.
    shared [m] int a1[n][m];
    shared [k*m] int a2[n][m];
If (k + m) % THREADS == 0 then a3 has a row
layout
    shared int a3[n][m+k];
To get more general HPF and ScaLAPACK style 2D
blocked layouts, one needs to add dimensions.
Assume r*c = THREADS
    shared [b1][b2] int a5[m][n][r][c][b1][b2];
or equivalently
    shared [b1*b2] int a5[m][n][r][c][b1][b2];
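
To check the row and block-row claims for a1 and a2, the 1D affinity rule can be applied to the row-major linear index of a 2D element. This is a plain C sketch assuming block sizes of m for a1 and k*m for a2; the function name is hypothetical.

```c
#include <assert.h>

/* Owner of a[i][j] in an array with ncols columns, using the row-major
   linear index and the rule (index / block) % threads. */
int owner_2d(int i, int j, int ncols, int block, int threads) {
    assert(i >= 0 && j >= 0 && j < ncols && block > 0 && threads > 0);
    int linear = i * ncols + j;
    return (linear / block) % threads;
}
```

With block equal to ncols, every element of row i maps to thread i % threads (a row layout). With block equal to k times ncols, k consecutive rows share a thread (a block row layout).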

28
UPC Vector Matrix Multiplication Code

Matrix-vector multiplication with better layout

    // vect_mat_mult.c
    #include <upc_relaxed.h>
    shared [THREADS] int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];
    void main (void) {
        int i, j, l;
        upc_forall(i = 0; i < THREADS; i++; i) {
            c[i] = 0;
            for (l = 0; l < THREADS; l++)
                c[i] += a[i][l]*b[l];
        }
    }
29
Example Matrix Multiplication in UPC

Given two integer matrices A (N×P) and B (P×M),
compute C = A × B.
Entries c_ij in C are computed by the formula
$c_{ij} = \sum_{l=0}^{P-1} a_{il}\, b_{lj}$

30
Matrix Multiply in C

    #include <stdlib.h>
    #include <time.h>
    #define N 4
    #define P 4
    #define M 4
    int a[N][P], c[N][M];
    int b[P][M];
    void main (void) {
        int i, j, l;
        for (i = 0; i < N; i++) {
            for (j = 0; j < M; j++) {
                c[i][j] = 0;
                for (l = 0; l < P; l++) c[i][j] += a[i][l]*b[l][j];
            }
        }
    }

31
Domain Decomposition for UPC

Exploits locality in matrix multiplication

A (N × P) is decomposed row-wise into blocks of
size (N × P) / THREADS:
Thread 0: elements 0 .. (N*P / THREADS) - 1
Thread 1: elements (N*P / THREADS) .. (2*N*P / THREADS) - 1
...
Thread THREADS-1: elements ((THREADS-1)*N*P) / THREADS .. (THREADS*N*P / THREADS) - 1

B (P × M) is decomposed column-wise into blocks of
M / THREADS columns:
Thread 0: columns 0 : (M/THREADS) - 1
...
Thread THREADS-1: columns ((THREADS-1) × M) / THREADS : (M-1)

Note: N and M are assumed to be multiples of
THREADS
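
The decomposition can be stated as two small owner functions: which thread holds a given row of A and which holds a given column of B. This plain C sketch uses hypothetical function names and asserts the slide's assumption that N and M are multiples of THREADS.

```c
#include <assert.h>

/* Thread holding row i of A: N rows split into THREADS contiguous bands. */
int row_owner(int i, int n, int threads) {
    assert(threads > 0 && n % threads == 0 && i >= 0 && i < n);
    return i / (n / threads);
}

/* Thread holding column j of B: M columns split into THREADS bands. */
int col_owner(int j, int m, int threads) {
    assert(threads > 0 && m % threads == 0 && j >= 0 && j < m);
    return j / (m / threads);
}
```

Computing c[i][j] on the thread that owns row i keeps every access to A and C local; accesses to B are remote unless that thread also owns column j.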
32
UPC Matrix Multiplication Code
    /* mat_mult_1.c */
    #include <upc_relaxed.h>
    shared [N*P/THREADS] int a[N][P], c[N][M];
    // a and c are row-wise blocked shared matrices
    shared [M/THREADS] int b[P][M];   // column-wise blocking
    void main (void) {
        int i, j, l;   // private variables
        upc_forall(i = 0; i < N; i++; &c[i][0]) {
            for (j = 0; j < M; j++) {
                c[i][j] = 0;
                for (l = 0; l < P; l++) c[i][j] += a[i][l]*b[l][j];
            }
        }
    }
33
Notes on the Matrix Multiplication Example

The UPC code for the matrix multiplication is
almost the same size as the sequential code
Shared variable declarations include the keyword
shared
Making a private copy of matrix B in each thread
might result in better performance since many
remote memory operations can be avoided
Can be done with the help of upc_memget

34
Overlapping Communication in UPC

Programs with fine-grained communication require
overlap for performance
UPC compiler does this automatically for
relaxed accesses.
Accesses may be designated as strict, relaxed, or
unqualified (the default).
There are several ways of designating the
ordering type.
A type qualifier, strict or relaxed, can be used
to affect all variables of that type.
Labels strict or relaxed can be used to control
the accesses within a statement.
    strict : { x = y; z = y+1; }
A strict or relaxed cast can be used to override
the current label or type qualifier.

35
Performance of UPC

Reasons why UPC may be slower than MPI
Shared array indexing is expensive
Small messages encouraged by model
Reasons why UPC may be faster than MPI
MPI encourages synchrony
Buffering required for many MPI calls
Remote read/write of a single word may require
very little overhead
Cray T3E, Quadrics interconnect (next version)
Assuming overlapped communication, the real
issue is overhead: how much time does it take to
issue a remote read/write?

36
UPC versus MPI for Edge detection
[Figure: (a) execution time, (b) scalability]

Performance from Cray T3E
Benchmark developed by El-Ghazawi's group at GWU

37
UPC versus MPI for Matrix Multiplication
[Figure: (a) execution time, (b) scalability]

Performance from Cray T3E
Benchmark developed by El-Ghazawi's group at GWU

38
UPC vs. MPI for Sparse Matrix-Vector Multiply

Short term goal
Evaluate language and compilers using small
applications
Longer term, identify large application

Show advantage of t3e network model and UPC
Performance on Compaq machine worse
Serial code
Communication performance
New compiler just released

39
Particle/Grid Methods in UPC?

Experience so far in a related language
Titanium, Java-based GAS language
Immersed boundary method

Most time in communication between mesh and
particles
Currently uses bulk communication
May benefit from SPMV trick

40
EM3D Performance in Split-C Language on CM-5

Maxwell's Equations on an Unstructured 3D Mesh:
Explicit Method
Irregular bipartite graph of varying degree
(about 20) with weighted edges
Basic operation is to subtract the weighted sum
of neighboring values: for all E nodes, then for
all H nodes

[Figure: bipartite graph of E and H nodes (field labels E, H, B, D); values v1, v2 are combined through edge weights w1, w2]
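
The basic EM3D operation can be sketched as one sweep over one side of the bipartite graph. The plain C code below is an illustration, not the Split-C code from the talk: the compressed adjacency arrays (row_ptr, col_idx, weights) and the function name are assumptions chosen to keep the example short.

```c
#include <assert.h>

/* One half-step: for each node i, subtract the weighted sum of its
   neighbours' values. Edges of node i are row_ptr[i] .. row_ptr[i+1]-1,
   and col_idx[e] indexes into neighbor_values. */
void em3d_update(double *values, const double *neighbor_values,
                 const int *row_ptr, const int *col_idx,
                 const double *weights, int n_nodes) {
    assert(n_nodes >= 0);
    for (int i = 0; i < n_nodes; i++) {
        double acc = 0.0;
        for (int e = row_ptr[i]; e < row_ptr[i + 1]; e++)
            acc += weights[e] * neighbor_values[col_idx[e]];
        values[i] -= acc;
    }
}
```

A full time step would call this once to update the E nodes from their H neighbors and once more, with the H side's adjacency arrays, to update the H nodes from the new E values. On a distributed machine the reads of neighbor_values are where remote accesses occur.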
41
Split-C Performance Tuning on the CM5

Tuning affects application performance

42
Outline

Motivation for a new class of languages
Programming models
Architectural trends
Overview of Unified Parallel C (UPC)
Programmability advantage
Performance opportunity
Status
Next step
Related projects

43
UPC Implementation Effort

UPC efforts elsewhere
IDA: T3E implementation based on old gcc
GMU (documentation) and UMC (benchmarking)
Compaq (Alpha cluster and CMPI compiler (with
MTU))
Cray, Sun, and HP (implementations)
Intrepid (SGI compiler and t3e compiler)
UPC Book
T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
Three components of NERSC effort
Compilers (SP and PC clusters) and optimization (DOE)
Runtime systems for multiple compilers (DOE and NSA)
Applications and benchmarks (DOE)

44
Compiler Status

NERSC compiler (Costin Iancu)
Based on Open64 compiler for C
Parses and type-checks UPC
Code generation for SMPs underway
Generate C on most machines, possibly IA64 later
Investigating optimization opportunities
Focus of this compiler is high level
optimizations
Intrepid compiler
Based on gcc (3.x)
Will target our runtime layer on most machines
Initial focus is t3e, then Pentium clusters

45
Runtime System

Characterizing network performance
Low latency (low overhead) -> programmability
Optimizations depend on network characteristics
T3E was ideal
Quadrics reports very low overhead coming
Difficult to access low level SP and Myrinet

46
Next Step

Undertake larger application effort
What type of application?
Challenging to write in MPI (e.g., sparse direct
solvers)
Irregular communication (e.g., PIC)
Well-understood algorithm

47
Outline

Motivation for a new class of languages
Programming models
Architectural trends
Overview of Unified Parallel C (UPC)
Programmability advantage
Performance opportunity
Status
Next step
Related projects

48
3 Related Projects on Campus

Titanium
High performance Java dialect
Collaboration with Phil Colella and Charlie
Peskin
BeBOP: Berkeley Benchmarking and Optimization
Self-tuning numerical kernels
Sparse matrix operations
Pyramid mesh generator (Jonathan Shewchuk)

49
Locality and Parallelism

Large memories are slow, fast memories are small.
Storage hierarchies are large and fast on
average.
Parallel processors, collectively, have large,
fast memories -- the slow accesses to remote
data we call communication.
Algorithm should do most work on local data.

50
Tuning pays off: ATLAS (Dongarra, Whaley)
Extends applicability of PHiPAC; incorporated in
Matlab (with rest of LAPACK)
51
Speedups on SPMV from Sparsity on Sun Ultra 1/170:
1 RHS
52
Speedups on SPMV from Sparsity on Sun Ultra 1/170:
9 RHS
53
Future Work