Title: Steven Seidel
1. Steven Seidel
- Department of Computer Science
- Michigan Technological University
- steve_at_mtu.edu
2. Overview
- Background
- Collective operations in the UPC language
- The V1.0 UPC collectives specification
- Relocalization operations
- Computational operations
- Performance and implementation issues
- Extensions
- Other work
3. Background
- UPC is an extension of C that provides a partitioned shared memory programming model.
- The V1.1 UPC spec was adopted on March 25.
- Processes in UPC are called threads.
- Each thread has a private (local) address space.
- All threads share a global address space that is partitioned among the threads.
- A shared object that resides in thread i's partition is said to have affinity to thread i.
- If thread i has affinity to a shared object x, it is expected that accesses to x take less time than accesses to shared objects to which thread i does not have affinity.
4. UPC programming model
5. Collective operations in UPC
- If any thread calls a collective function, then all threads must also call that function.
- Collective arguments are single-valued: corresponding function arguments have the same value on every thread.
- V1.1 UPC contains several collective functions:
  - upc_notify and upc_wait
  - upc_barrier
  - upc_all_alloc
  - upc_all_lock_alloc
- These collectives provide synchronization and memory allocation across all threads (a small usage sketch follows).
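A minimal sketch of these rules, assuming a program compiled as UPC with upc.h included: every thread makes the same collective calls, with single-valued arguments.

    #include <upc.h>

    upc_lock_t *lock;   /* same lock pointer on every thread after the collective call */

    int main(void) {
        /* every thread must make the same collective calls, in the same order */
        lock = upc_all_lock_alloc();   /* collective lock allocation */
        upc_barrier;                   /* collective synchronization point */

        upc_lock(lock);
        /* ... critical section ... */
        upc_unlock(lock);
        return 0;
    }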
6. shared void *upc_all_alloc(nblocks, nbytes)
- This function collectively allocates shared [nbytes] char [nblocks*nbytes]; every thread that calls it receives the same pointer to the allocated space.
- (Figure: each thread declares  shared [5] char *p;  and calls  p = upc_all_alloc(4,5);  a compilable sketch follows.)
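A sketch of the figure's example as a complete program (the 4-by-5 sizes are the values from the slide; assumes a UPC compiler):

    #include <upc.h>

    shared [5] char *p;   /* block size 5 matches the nbytes argument */

    int main(void) {
        /* all threads call upc_all_alloc with the same (single-valued) arguments;
           4 blocks of 5 bytes are allocated, distributed round-robin by block */
        p = (shared [5] char *) upc_all_alloc(4, 5);
        upc_barrier;
        if (MYTHREAD == 0)
            p[0] = 'x';   /* the first block has affinity to thread 0 */
        return 0;
    }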
7. The V1.0 UPC Collectives Spec
- First draft by Wiebel and Greenberg, March 2002.
- Spec discussed at the May 2002 and SC02 UPC workshops.
- Many helpful comments from Dan Bonachea and Brian Wibecan.
- V1.0 will be released shortly.
8. Collective functions
- Initialization
  - upc_all_init
- Relocalization collectives change data affinity:
  - upc_all_broadcast
  - upc_all_scatter
  - upc_all_gather
  - upc_all_gather_all
  - upc_all_exchange
  - upc_all_permute
- Computational collectives for reduction and sorting:
  - upc_all_reduce
  - upc_all_prefix_reduce
  - upc_all_sort
9. void upc_all_broadcast(dst, src, blk)
Thread 0 sends the same block of data to each thread.

shared [blk] char dst[blk*THREADS];
shared [] char src[blk];
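A sketch of a call with these declarations, using the draft's three-argument form from the slide title (blk is taken here as a compile-time constant, and upc.h plus the collectives declarations are assumed to be included):

    #define blk 16

    shared [blk] char dst[blk*THREADS];
    shared [] char src[blk];        /* all blk source bytes have affinity to thread 0 */

    /* every thread calls the collective; afterwards each thread's block of dst
       holds a copy of src */
    upc_all_broadcast(dst, src, blk);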
10. void upc_all_scatter(dst, src, blk)
Thread 0 sends a unique block of data to each thread.

shared [blk] char dst[blk*THREADS];
shared [] char src[blk*THREADS];
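A sketch of a call (same assumptions as the broadcast example; the initialization loop is illustrative):

    #define blk 16

    shared [blk] char dst[blk*THREADS];
    shared [] char src[blk*THREADS];   /* the entire source has affinity to thread 0 */

    int i;
    if (MYTHREAD == 0)
        for (i = 0; i < blk*THREADS; i++)
            src[i] = (char) i;         /* thread 0 fills the source */
    upc_barrier;                       /* make the source visible before the collective */
    upc_all_scatter(dst, src, blk);    /* block i of src goes to thread i's block of dst */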
11. void upc_all_gather(dst, src, blk)
Each thread sends a block of data to thread 0.

shared [] char dst[blk*THREADS];
shared [blk] char src[blk*THREADS];
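A sketch of a call (same assumptions as above):

    #define blk 16

    shared [] char dst[blk*THREADS];     /* the destination lies entirely on thread 0 */
    shared [blk] char src[blk*THREADS];  /* one block per thread */

    /* each thread's block of src is copied into the corresponding block of dst,
       all of which has affinity to thread 0 */
    upc_all_gather(dst, src, blk);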
12. void upc_all_gather_all(dst, src, blk)
Each thread sends one block of data to all
threads.
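This slide gives no declarations; plausible ones, by analogy with the other relocalization slides, might be:

    #define blk 16

    shared [blk*THREADS] char dst[blk*THREADS*THREADS];  /* every thread gets all blocks */
    shared [blk]         char src[blk*THREADS];          /* one block per thread */

    /* after the call, each thread's blk*THREADS-byte block of dst holds a copy
       of every thread's block of src */
    upc_all_gather_all(dst, src, blk);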
13. void upc_all_exchange(dst, src, blk)
Each thread sends a unique block of data to each
thread.
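Plausible declarations and a call, again by analogy (an assumption, since the slide shows only a figure):

    #define blk 16

    shared [blk*THREADS] char dst[blk*THREADS*THREADS];
    shared [blk*THREADS] char src[blk*THREADS*THREADS];

    /* all-to-all: block j within thread i's part of src is delivered to
       block i within thread j's part of dst */
    upc_all_exchange(dst, src, blk);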
14. void upc_all_permute(dst, src, perm, blk)
Thread i sends a block of data to thread perm(i).
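A sketch of a call with a cyclic-shift permutation (the perm layout is an assumption; upc.h and the collectives declarations are assumed to be included):

    #define blk 16

    shared [blk] char dst[blk*THREADS];
    shared [blk] char src[blk*THREADS];
    shared int perm[THREADS];       /* perm[i] names the destination thread for thread i's block */

    perm[MYTHREAD] = (MYTHREAD + 1) % THREADS;  /* example: shift each block to the next thread */
    upc_barrier;                                /* perm must be complete before the collective */
    upc_all_permute(dst, src, perm, blk);       /* thread i's block of src lands on thread perm[i] */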
15. Computational collectives
- Reduce and prefix reduce
  - One function for each C scalar type, e.g., upc_all_reduceI() returns an integer.
  - Operations: +, *, &, |, XOR, &&, ||, min, max, or a user-defined binary function.
- Sort
  - User-defined comparison function (a sample follows this list):
    void upc_all_sort(shared void *A,
                      size_t size, size_t n, size_t blk,
                      int (*func)(shared void *, shared void *));
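A sketch of a comparison function and call, assuming the array holds ints and that n and blk are counted in elements (an assumption; the parameter types follow the signature above):

    /* compare two shared ints; returns <0, 0, or >0 */
    int cmp_int(shared void *a, shared void *b)
    {
        int x = *(shared int *)a;
        int y = *(shared int *)b;
        return (x > y) - (x < y);
    }

    shared [4] int A[4*THREADS];

    /* sort the 4*THREADS ints of A, block size 4, using cmp_int */
    upc_all_sort(A, sizeof(int), 4*THREADS, 4, cmp_int);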
16. int upc_all_reduceI(src, UPC_ADD, n, blk, NULL)

int i;
shared [3] int src[4*THREADS];

(Figure: with THREADS = 3, src holds the values 1, 2, 4, ..., 2048; each thread forms the sum of the blocks it owns (3591, 56, and 448), the partial sums are combined, and every thread's call returns 4095.)

Each thread calls:
i = upc_all_reduceI(src, UPC_ADD, 12, 3, NULL);
17. void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL)

shared int src[3*THREADS], dst[3*THREADS];

(Figure: with THREADS = 3, src holds the values 1, 2, 4, ..., 256 in the default cyclic layout; after the call dst holds the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511.)
18. Performance and implementation issues
- Push or pull?
- Synchronization semantics
- Effects of data distribution
19. A pull implementation of upc_all_broadcast
void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,   /* each thread pulls into its own block */
                (shared char *)src, blk );
}
20. A push implementation of upc_all_broadcast
void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    int i;
    upc_forall( i = 0; i < THREADS; i++; 0 )     // Thread 0 only
        upc_memcpy( (shared char *)dst + i,
                    (shared char *)src, blk );
}
21. Synchronization semantics
- When are function arguments ready?
- When are function results available?
22. Synchronization semantics
- Arguments with affinity to thread i are ready when thread i calls the function; results with affinity to thread i are ready when thread i returns.
- This is appealing, but it is incorrect: in a broadcast, thread 1 does not know when thread 0 is ready.
23. Synchronization semantics
- Require the implementation to provide barriers at function entry and exit.
- This is convenient for the programmer, but it is likely to adversely affect performance.

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_barrier;                                 /* entry barrier */
    upc_memcpy( (shared char *)dst + MYTHREAD,   /* pull */
                (shared char *)src, blk );
    upc_barrier;                                 /* exit barrier */
}
24. Synchronization semantics
- V1.0 spec: Synchronization is a user responsibility.

#define numelems 10
shared [] int A[numelems];
shared [numelems] int B[numelems*THREADS];

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}
. . .
// Initialize A.
. . .
upc_barrier;
upc_all_broadcast( B, A, sizeof(int)*numelems );
upc_barrier;
25. Performance and implementation issues
- Data distribution affects both performance and
implementation.
26. void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL)

shared int src[3*THREADS], dst[3*THREADS];

(Figure: the same prefix-sum example as slide 17, with THREADS = 3 and src holding 1, 2, 4, ..., 256 in the default cyclic layout; assembling the prefix sums 1, 3, 7, ..., 511 across the cyclically distributed elements illustrates how the data distribution affects the implementation.)
27. Extensions
- Strided copying
- Vectors of offsets for src and dst arrays
- Variable-sized blocks
- Reblocking (cf. the preceding prefix reduce example):

  shared int src[3*THREADS];
  shared [3] int dst[3*THREADS];
  int i;
  upc_forall( i = 0; i < 3*THREADS; i++; ? )
      dst[i] = src[i];
28. More sophisticated synchronization semantics
- Consider the pull implementation of broadcast. There is no need for arbitrary threads i and j (i, j != 0) to synchronize with each other. Each thread does a pairwise synchronization with thread 0. Thread i will not have to wait if it reaches its synchronization point after thread 0. Thread 0 returns from the call after it has synced with each thread. (A sketch of such pairwise synchronization follows.)
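A minimal, single-use sketch of this pairwise scheme. The strict flag arrays src_ready and pulled are illustrative (not part of the spec), and the flags would have to be reset before the collective could be called again:

    strict shared int src_ready[THREADS];  /* set by thread 0 when its source block may be pulled */
    strict shared int pulled[THREADS];     /* set by thread i when it has finished pulling */

    void upc_all_broadcast( shared void *dst,
                            shared const void *src, size_t blk )
    {
        int i;
        if (MYTHREAD == 0) {
            for (i = 0; i < THREADS; i++)
                src_ready[i] = 1;                  /* pairwise signal to each thread */
            upc_memcpy( (shared char *)dst, (shared char *)src, blk );
            pulled[0] = 1;
            for (i = 1; i < THREADS; i++)
                while (!pulled[i]) ;               /* thread 0 returns after syncing with each thread */
        } else {
            while (!src_ready[MYTHREAD]) ;         /* wait only for thread 0, never for other threads */
            upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
            pulled[MYTHREAD] = 1;
        }
    }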
29. What's next?
- The V1.0 collective spec will be adopted in the next few weeks.
- A reference implementation will be available from MTU immediately afterwards.
30. Other work
- MuPC run-time system for UPC
- UPC memory model (Chuck Wallace)
- UPC programmability (Phil Merkey)
- UPC test suite (Phil Merkey)

http://www.upc.mtu.edu