Title: Overview
1 Overview
Unified Parallel C (UPC) is an extension to ANSI C.
UPC is a global address space language for parallel programming.
UPC extends C by providing shared arrays, data affinity to processors, a parallel loop construct, locks, and split-barrier synchronization primitives.
The first UPC compiler was written for the Cray T3E.
UPC compilers are now available for AlphaServer and SGI platforms.
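2 Example UPC Program
Memory layout before the program runs (two threads):
              Thread 0    Thread 1
    Shared    a[0] = 0    a[1] = 0
              b = 0
    Local

Program (every thread runs the same code):
    shared int a[THREADS];
    shared int b;

    int main(void)
    {
        if (MYTHREAD == 0) {    /* only thread 0 writes the array */
            a[0] = 4;           /* local write */
            a[1] = 2;           /* remote write, a[1] has affinity to thread 1 */
        }
        upc_barrier;            /* thread 1 waits here until the writes are done */
        if (MYTHREAD == 1)
            b = a[0];           /* remote read of a[0], remote write of b */
        return 0;
    }

Memory layout after the program runs:
              Thread 0    Thread 1
    Shared    a[0] = 4    a[1] = 2
              b = 4
    Local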
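The placement of a[0] and a[1] in the layout above follows the UPC distribution rule: element i of a shared array declared with block size B has affinity to thread (i / B) mod THREADS, and the default block size is 1. A minimal sketch of that rule in plain C follows; the function name owner_thread is hypothetical and is not part of UPC.

```c
#include <assert.h>

/* Thread that element `index` of a shared array has affinity to, for a
   block size of `block` elements dealt out round-robin over `threads`.
   owner_thread is a hypothetical helper name, not a UPC library call. */
int owner_thread(int index, int block, int threads)
{
    assert(index >= 0 && block > 0 && threads > 0);
    return (index / block) % threads;
}
```

With two threads and the default block size of 1, owner_thread(0, 1, 2) is 0 and owner_thread(1, 1, 2) is 1, which matches the diagram.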
3 The Big Picture
UPC Code -> EDG UPC to C Translator -> UPC Intermediate code in C -> C Compiler -> UPC Executable Code
Also feeding into the UPC Executable Code: MuPC RTS Object Code, MPI Library
4 The Run Time System Interface
The run time system interface is divided into six parts:
  Initialization and finalization.
  Gets and puts to implement one-sided remote references.
  Synchronization functions to implement the UPC builtins barrier, notify and wait.
  Locks to implement upc_lock, upc_unlock and upc_lock_attempt.
  Dynamic memory allocation functions to implement upc_local_alloc, upc_global_alloc and upc_all_alloc.
  String functions to implement upc_memcpy, upc_memget, upc_memset and upc_memput.
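To make the "gets and puts" part concrete, here is a rough single-process sketch of one-sided access in plain C. Everything in it is an assumption for illustration: the names sim_put and sim_get, the segment sizes, and the use of a static array as the shared segment are hypothetical, and none of it is the actual runtime interface. The point is only that the caller names the target thread and offset, and the target does not take part in the call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SIM_THREADS = 2, SEGMENT_BYTES = 64 };   /* hypothetical sizes */

/* One shared segment per simulated UPC thread. */
static unsigned char segment[SIM_THREADS][SEGMENT_BYTES];

/* Hypothetical put: copy n bytes from local memory into a thread's segment. */
void sim_put(int thread, size_t offset, const void *src, size_t n)
{
    assert(thread >= 0 && thread < SIM_THREADS);
    assert(offset <= SEGMENT_BYTES && n <= SEGMENT_BYTES - offset);
    memcpy(&segment[thread][offset], src, n);
}

/* Hypothetical get: copy n bytes from a thread's segment into local memory. */
void sim_get(void *dst, int thread, size_t offset, size_t n)
{
    assert(thread >= 0 && thread < SIM_THREADS);
    assert(offset <= SEGMENT_BYTES && n <= SEGMENT_BYTES - offset);
    memcpy(dst, &segment[thread][offset], n);
}
```

In a distributed runtime the memcpy would be replaced by a message to the process that holds the segment; the call shape seen by the translated UPC code could stay the same.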
5 MuPC
MuPC is Michigan Technological University's implementation of Compaq's runtime system interface.
MuPC is open source.
MuPC is available on AlphaServer, Sun Solaris and Linux clusters.
MuPC is a user-level implementation based on Pthreads and MPI.
6 MuPC Design
mupcrun -n 3 a.out
Each of the three processes: pthread_create -> Send/Recv Pthread + User UPC Pthread -> upc_finalize
1 UPC thread = 2 Pthreads = 1 Unix process.
The user UPC Pthread runs the user's code.
The send/recv Pthread uses MPI for interprocess communication.
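The two-Pthreads-per-process arrangement can be sketched with plain Pthreads. This is an illustration under stated assumptions, not MuPC's code: the mailbox_t type, the function names, and the one-slot request/reply protocol are all hypothetical, and the communication thread simply doubles the request where a real runtime would call MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

/* Hypothetical one-slot mailbox shared by the two Pthreads of one process. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int request;       /* value the user thread wants serviced */
    int reply;         /* value produced by the communication thread */
    int has_request;   /* 1 while a request is waiting */
    int has_reply;     /* 1 once the reply is ready */
} mailbox_t;

/* Stand-in for the send/recv Pthread: wait for one request and answer it.
   A real runtime would talk to MPI here; this sketch doubles the value. */
static void *comm_thread(void *arg)
{
    mailbox_t *mb = arg;
    pthread_mutex_lock(&mb->mu);
    while (!mb->has_request)
        pthread_cond_wait(&mb->cv, &mb->mu);
    mb->reply = 2 * mb->request;
    mb->has_request = 0;
    mb->has_reply = 1;
    pthread_cond_broadcast(&mb->cv);
    pthread_mutex_unlock(&mb->mu);
    return NULL;
}

/* Stand-in for the user UPC Pthread: post a request, block until the reply. */
static void *user_thread(void *arg)
{
    mailbox_t *mb = arg;
    pthread_mutex_lock(&mb->mu);
    mb->has_request = 1;
    pthread_cond_broadcast(&mb->cv);
    while (!mb->has_reply)
        pthread_cond_wait(&mb->cv, &mb->mu);
    pthread_mutex_unlock(&mb->mu);
    return NULL;
}

/* Start both Pthreads of one simulated process, join them, return the reply. */
int run_process(int request)
{
    mailbox_t mb;
    pthread_t comm, user;
    int rc;

    pthread_mutex_init(&mb.mu, NULL);
    pthread_cond_init(&mb.cv, NULL);
    mb.request = request;
    mb.reply = 0;
    mb.has_request = 0;
    mb.has_reply = 0;

    rc = pthread_create(&comm, NULL, comm_thread, &mb);
    assert(rc == 0);
    rc = pthread_create(&user, NULL, user_thread, &mb);
    assert(rc == 0);
    (void)rc;

    pthread_join(user, NULL);
    pthread_join(comm, NULL);
    pthread_cond_destroy(&mb.cv);
    pthread_mutex_destroy(&mb.mu);
    return mb.reply;
}
```

Keeping communication on its own thread lets the user thread block on a remote reference while incoming requests from other processes are still serviced.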
7 Ping-Pong Test Performance
Platform                                     MPI                MuPC
2GHz Intel processors, Gigabit ethernet      LAM MPI   37ms     63ms
AlphaServer                                  Elan MPI  40ms     55ms
Sun Enterprise 4500                          Sun MPI    7ms     75ms
(Chart axis: Time)
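8 Matrix Multiplication (naïve)
16 x 2 x 2GHz Intel processors, Gigabit ethernet
Total problem size 128x128 integer

    shared [P] int a[N][P];     /* one row of a per block */
    shared int b[P][M];         /* default cyclic distribution */
    shared [M] int c[N][M];     /* one row of c per block */
    int i, j, k, sum;

    upc_forall (i = 0; i < N; i++; &a[i][0])    /* run row i where a[i] lives */
        for (j = 0; j < M; j++) {
            sum = 0;
            for (k = 0; k < P; k++)
                sum += a[i][k] * b[k][j];       /* b[k][j] is usually remote */
            c[i][j] = sum;
        }

(Performance chart; horizontal axis ticks: 1, 2, 4, 8, 16)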
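The way the parallel loop above hands out work can be imitated in sequential C. The sketch below assumes that row i of a has affinity to thread i % threads, which is what the block size of P gives, and it loops over simulated threads in turn instead of running them in parallel. The function name and the small sizes N, P, M are hypothetical choices for illustration.

```c
#include <assert.h>

enum { N = 4, P = 3, M = 2 };   /* small hypothetical sizes */

/* Multiply a (N x P) by b (P x M) into c, visiting rows in the order a
   parallel loop with affinity &a[i][0] would assign them when row i belongs
   to thread i % threads. Returns the number of rows computed. */
int matmul_by_row_owner(int a[N][P], int b[P][M], int c[N][M], int threads)
{
    assert(threads > 0);
    int rows_done = 0;
    for (int t = 0; t < threads; t++) {         /* one pass per simulated thread */
        for (int i = 0; i < N; i++) {
            if (i % threads != t)
                continue;                       /* row i belongs to another thread */
            for (int j = 0; j < M; j++) {
                int sum = 0;
                for (int k = 0; k < P; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
            rows_done++;
        }
    }
    return rows_done;
}
```

Every row is computed exactly once whatever the thread count, so the return value equals N.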
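9 Matrix Multiplication (with prefetching)
16 x 2 x 2GHz Intel processors, Gigabit ethernet
Total problem size 128x128 integer

    int local_a[P];             /* private copy of one row of a */
    int i, j, k, sum;

    upc_forall (j = 0; j < M; j++; &b[0][j])    /* run column j where b[0][j] lives */
        for (i = 0; i < N; i++) {
            upc_memget(local_a, a[i], P * sizeof(int));    /* fetch row i in one transfer */
            sum = 0;
            for (k = 0; k < P; k++)
                sum += local_a[k] * b[k][j];
            c[i][j] = sum;
        }

(Performance chart; horizontal axis ticks: 1, 2, 4, 8, 16)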
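10 Matrix Multiplication (prefetching and local pointer)
16 x 2 x 2GHz Intel processors, Gigabit ethernet
Total problem size 128x128 integer

    int local_a[P];             /* private copy of one row of a */
    int *pb;                    /* private pointer into the local part of b */
    int stride = M / THREADS;   /* local distance between b[k][j] and b[k+1][j] */
    int i, j, k, s, sum;

    upc_forall (j = 0; j < M; j++; &b[0][j])
        for (i = 0; i < N; i++) {
            pb = (int *)&b[0][j];                          /* b[0][j] is local here */
            upc_memget(local_a, a[i], P * sizeof(int));    /* fetch row i in one transfer */
            sum = 0;
            for (k = 0, s = 0; k < P; k++, s += stride)
                sum += local_a[k] * pb[s];
            c[i][j] = sum;
        }

(Performance chart; horizontal axis ticks: 1, 2, 4, 8, 16)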
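The stride M/THREADS used with the local pointer can be checked with a small layout model in plain C. The sketch assumes the default cyclic distribution of b, that each thread stores its elements contiguously in index order, and that M is a multiple of the thread count. The function names are hypothetical.

```c
#include <assert.h>

/* Thread that element (k, j) of a cyclically distributed array with m
   columns has affinity to. */
int owner(int k, int j, int m, int threads)
{
    return (k * m + j) % threads;
}

/* Position of element (k, j) inside its owner's local portion. */
int local_offset(int k, int j, int m, int threads)
{
    return (k * m + j) / threads;
}

/* Returns 1 if stepping by m/threads from the local position of (0, j)
   reaches every element of column j for rows 0..p-1, all on one thread. */
int stride_walk_matches(int p, int m, int threads, int j)
{
    assert(threads > 0 && m % threads == 0);
    int stride = m / threads;
    int base = local_offset(0, j, m, threads);
    int s = 0;
    for (int k = 0; k < p; k++, s += stride) {
        if (owner(k, j, m, threads) != owner(0, j, m, threads))
            return 0;
        if (local_offset(k, j, m, threads) != base + s)
            return 0;
    }
    return 1;
}
```

Under these assumptions the whole column j sits on thread j % threads, which is why the inner loop can read b through a private pointer without any remote references.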