Steven Seidel presentation

1

Steven Seidel
Department of Computer Science
Michigan Technological University
steve_at_mtu.edu

2
Overview

Background
Collective operations in the UPC language
The V1.0 UPC collectives specification
Relocalization operations
Computational operations
Performance and implementation issues
Extensions
Other work

3
Background

UPC is an extension of C that provides a
partitioned shared memory programming model.
The V1.1 UPC spec was adopted on March 25.
Processes in UPC are called threads.
Each thread has a private (local) address space.
All threads share a global address space that is
partitioned among the threads.
A shared object that resides in thread i's
partition is said to have affinity to thread i.
If thread i has affinity to a shared object x, it
is expected that accesses to x take less time
than accesses to shared objects to which thread i
does not have affinity.
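
As a rough illustration of affinity, the following plain C sketch computes which thread owns element i of an array declared as shared [blk] T a[n]. The function name affinity_of is hypothetical and not part of UPC; it only models the rule that blocks of blk consecutive elements are dealt out to threads round-robin.

```c
#include <assert.h>
#include <stddef.h>

/* Thread that has affinity to element i of an array declared as
 * shared [blk] T a[n] in a program running with nthreads threads.
 * Blocks of blk consecutive elements are assigned round-robin. */
int affinity_of(size_t i, size_t blk, int nthreads)
{
    assert(blk > 0 && nthreads > 0);
    return (int)((i / blk) % (size_t)nthreads);
}
```

With 3 threads and blk = 3, elements 0..2 belong to thread 0, elements 3..5 to thread 1, and element 9 wraps around to thread 0 again.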

4
UPC programming model
5
Collective operations in UPC

If any thread calls a collective function, then
all threads must also call that function.
Collective arguments are single-valued:
corresponding function arguments have the same
value on every thread.
V1.1 UPC contains several collective functions:
upc_notify and upc_wait
upc_barrier
upc_all_alloc
upc_all_lock_alloc
These collectives provide synchronization and
memory allocation across all threads.

6
shared void *upc_all_alloc(nblocks, nbytes)

This function allocates shared [nbytes]
char[nblocks*nbytes].

shared [5] char *p
(Figure: shared and local memory. Each thread executes
p = upc_all_alloc(4,5) and receives a pointer to the
same shared allocation.)
7
The V1.0 UPC Collectives Spec

First draft by Wiebel and Greenberg, March 2002.
Spec discussed at the May 2002 and SC02 UPC
workshops.
Many helpful comments from Dan Bonachea and Brian
Wibecan.
V1.0 will be released shortly.

8
Collective functions

Initialization
upc_all_init
Relocalization collectives change data
affinity.
upc_all_broadcast
upc_all_scatter
upc_all_gather
upc_all_gather_all
upc_all_exchange
upc_all_permute
Computational collectives for reduction and
sorting.
upc_all_reduce
upc_all_prefix_reduce
upc_all_sort

9
void upc_all_broadcast(dst, src, blk)
Thread 0 sends the same block of data to each
thread.
shared [blk] char dst[blk*THREADS]
shared [] char src[blk]
10
void upc_all_scatter(dst, src, blk)
Thread 0 sends a unique block of data to each
thread.
shared [blk] char dst[blk*THREADS]
shared [] char src[blk*THREADS]
11
void upc_all_gather(dst, src, blk)
Each thread sends a block of data to thread 0.
shared [] char dst[blk*THREADS]
shared [blk] char src[blk*THREADS]
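
The data movement of broadcast, scatter, and gather can be sketched in plain C. This is a sequential simulation, not UPC: a row dst[t] or src[t] stands for the block with affinity to thread t, a flat array stands for data residing entirely on thread 0, and the names sim_broadcast, sim_scatter, sim_gather, NT, and BLK are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NT  3   /* assumed number of threads (hypothetical) */
#define BLK 2   /* assumed block size in bytes (hypothetical) */

/* Broadcast: thread 0's single block is copied to every thread. */
void sim_broadcast(char dst[NT][BLK], const char *src0)
{
    assert(dst != NULL && src0 != NULL);
    for (int t = 0; t < NT; t++)
        memcpy(dst[t], src0, BLK);
}

/* Scatter: the t-th block on thread 0 goes to thread t. */
void sim_scatter(char dst[NT][BLK], const char *src0)
{
    assert(dst != NULL && src0 != NULL);
    for (int t = 0; t < NT; t++)
        memcpy(dst[t], src0 + t * BLK, BLK);
}

/* Gather: thread t's block becomes the t-th block on thread 0. */
void sim_gather(char *dst0, char src[NT][BLK])
{
    assert(dst0 != NULL && src != NULL);
    for (int t = 0; t < NT; t++)
        memcpy(dst0 + t * BLK, src[t], BLK);
}
```

In this model a scatter followed by a gather returns the original data to thread 0, which is a convenient sanity check.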
12
void upc_all_gather_all(dst, src, blk)
Each thread sends one block of data to all
threads.
13
void upc_all_exchange(dst, src, blk)
Each thread sends a unique block of data to each
thread.
14
void upc_all_permute(dst, src, perm, blk)
Thread i sends a block of data to thread perm(i).
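
The remaining relocalization collectives can be modeled the same way in plain C. This is a sequential sketch under stated assumptions, not UPC code: the first array index stands for the owning thread, and sim_gather_all, sim_exchange, sim_permute, NT, and BLK are hypothetical names. The exchange is assumed to behave as a block transpose, with the t-th block of thread s arriving as the s-th block of thread t.

```c
#include <assert.h>
#include <string.h>

#define NT  3   /* assumed number of threads (hypothetical) */
#define BLK 2   /* assumed block size in bytes (hypothetical) */

/* Gather-all: every thread t ends up with the block of every thread s. */
void sim_gather_all(char dst[NT][NT][BLK], char src[NT][BLK])
{
    for (int t = 0; t < NT; t++)
        for (int s = 0; s < NT; s++)
            memcpy(dst[t][s], src[s], BLK);
}

/* Exchange: the t-th block of thread s becomes the s-th block of thread t. */
void sim_exchange(char dst[NT][NT][BLK], char src[NT][NT][BLK])
{
    for (int s = 0; s < NT; s++)
        for (int t = 0; t < NT; t++)
            memcpy(dst[t][s], src[s][t], BLK);
}

/* Permute: thread i's block is delivered to thread perm[i]. */
void sim_permute(char dst[NT][BLK], char src[NT][BLK], const int *perm)
{
    for (int i = 0; i < NT; i++) {
        assert(perm[i] >= 0 && perm[i] < NT);
        memcpy(dst[perm[i]], src[i], BLK);
    }
}
```

For the result of sim_permute to be fully defined, perm should be a permutation of 0..NT-1; otherwise some destination blocks are never written.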
15
Computational collectives

Reduce and prefix reduce
One function for each C scalar type, e.g.,
upc_all_reduceI() returns an integer
Operations
+, *, &, |, XOR, &&, ||, min, max
user-defined binary function
Sort
User-defined comparison function
void upc_all_sort(shared void *A,
size_t size, size_t n, size_t blk,
int (*func)(shared void *, shared void *))

16
int upc_all_reduceI(src, UPC_ADD, n, blk, NULL)
int i
shared [3] int src[4*THREADS]
(Figure: with three threads, src holds the twelve values
1, 2, 4, ..., 2048 in blocks of 3. The per-thread partial
sums are 3591, 56, and 448. Each thread executes
i = upc_all_reduceI(src,UPC_ADD,12,3,NULL) and obtains 4095.)
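
The arithmetic in this example can be checked with a small sequential C sketch. It is not an implementation of upc_all_reduceI; the name sim_all_reduce_add and the partial output array are hypothetical. Each element is credited to the thread that has affinity to it under a shared [blk] layout, and the per-thread partial sums are then combined.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n ints laid out as shared [blk] int src[n] over nthreads threads.
 * partial[t] receives the sum of the elements with affinity to thread t;
 * the return value is the overall sum. */
int sim_all_reduce_add(const int *src, size_t n, size_t blk,
                       int nthreads, int *partial)
{
    assert(src != NULL && partial != NULL && blk > 0 && nthreads > 0);
    for (int t = 0; t < nthreads; t++)
        partial[t] = 0;
    for (size_t i = 0; i < n; i++)
        partial[(i / blk) % (size_t)nthreads] += src[i];

    int total = 0;
    for (int t = 0; t < nthreads; t++)
        total += partial[t];
    return total;
}
```

With the twelve powers of two from 1 to 2048, blk = 3, and 3 threads, the partial sums are 3591, 56, and 448, and the total is 4095.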
17
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n,
blk, NULL)
shared int src[3*THREADS], dst[3*THREADS]
(Figure: src holds 1, 2, 4, 8, 16, 32, 64, 128, 256;
dst receives the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511.)
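
A minimal sequential C sketch of the prefix reduction with addition follows. It is an assumption-laden model rather than the spec's algorithm: the name sim_prefix_reduce_add is hypothetical, and the work is organized block by block, with a running carry holding the sum of all preceding blocks, to suggest how per-block results could be combined.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive prefix sum: dst[i] = src[0] + ... + src[i].
 * Elements are processed in blocks of blk; carry is the sum of
 * every element in the blocks before the current one. */
void sim_prefix_reduce_add(int *dst, const int *src, size_t n, size_t blk)
{
    assert(dst != NULL && src != NULL && blk > 0);
    int carry = 0;
    for (size_t start = 0; start < n; start += blk) {
        size_t end = (start + blk < n) ? start + blk : n;
        int local = 0;              /* running sum within this block */
        for (size_t i = start; i < end; i++) {
            local += src[i];
            dst[i] = carry + local;
        }
        carry += local;
    }
}
```

Applied to 1, 2, 4, ..., 256 this yields 1, 3, 7, ..., 511, matching the values on the slide.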
18
Performance and implementation issues

Push or pull?
Synchronization semantics
Effects of data distribution

19
A pull implementation of upc_all_broadcast
void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}
(Figure: threads 0, 1, and 2; each thread copies the block from thread 0.)
20
A push implementation of upc_all_broadcast
void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
{
    int i;
    upc_forall( i=0; i<THREADS; i++; 0 )   // Thread 0 only
        upc_memcpy( (shared char *)dst + i,
                    (shared char *)src, blk );
}
(Figure: threads 0, 1, and 2; thread 0 copies the block to each thread.)
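
The difference between the two approaches can be sketched in plain C by writing the body that a single simulated thread would run. This is only a model: pull_broadcast, push_broadcast, NT, and BLK are hypothetical names, memcpy stands in for upc_memcpy, and each function returns the number of copies that thread issued.

```c
#include <assert.h>
#include <string.h>

#define NT  3   /* assumed number of threads (hypothetical) */
#define BLK 4   /* assumed block size in bytes (hypothetical) */

/* Pull: every thread fetches the source block into its own dst block. */
int pull_broadcast(char dst[NT][BLK], const char *src, int me)
{
    assert(me >= 0 && me < NT);
    memcpy(dst[me], src, BLK);
    return 1;
}

/* Push: only thread 0 does any work, writing every thread's dst block. */
int push_broadcast(char dst[NT][BLK], const char *src, int me)
{
    assert(me >= 0 && me < NT);
    if (me != 0)
        return 0;
    for (int t = 0; t < NT; t++)
        memcpy(dst[t], src, BLK);
    return NT;
}
```

Both variants leave dst in the same state and issue NT copies in total; pull spreads them one per thread, while push concentrates all of them on thread 0.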
21
Synchronization semantics

When are function arguments ready?
When are function results available?

22
Synchronization semantics

Arguments with affinity to thread i are ready
when thread i calls the function; results with
affinity to thread i are ready when thread i
returns.
This is appealing but it is incorrect: in a
broadcast, thread 1 does not know when thread 0
is ready.
23
Synchronization semantics

Require the implementation to provide barriers at
function entry and exit.
This is convenient for the programmer but it is
likely to adversely affect performance.

void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
{
    upc_barrier;
    // pull
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
    upc_barrier;
}
24
Synchronization semantics

V1.0 spec: Synchronization is a user
responsibility.

#define numelems 10
shared [] int A[numelems];
shared [numelems] int B[numelems*THREADS];

void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}
.
.
// Initialize A.
.
.
upc_barrier;
upc_all_broadcast( B, A, sizeof(int)*numelems );
upc_barrier;
25
Performance and implementation issues

Data distribution affects both performance and
implementation.

26
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n,
blk, NULL)
shared int src[3*THREADS], dst[3*THREADS]
(Figure: src holds 1, 2, 4, ..., 256 and dst receives the
prefix sums 1, 3, 7, ..., 511, as in the earlier prefix
reduce example.)
27
Extensions

Strided copying
Vectors of offsets for src and dst arrays
Variable-sized blocks
Reblocking (cf. preceding example of prefix
reduce)
shared int src[3*THREADS]
shared [3] int dst[3*THREADS]
upc_forall(i=0; i<3*THREADS; i++; ?)
    dst[i] = src[i]
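
To see why reblocking involves communication, the following plain C sketch counts how many elements change owner when an array is copied from one block size to another. The function count_moved is hypothetical; it only applies the round-robin rule that element i of a shared [blk] array has affinity to thread (i / blk) mod THREADS.

```c
#include <assert.h>
#include <stddef.h>

/* Number of indices i in [0, n) whose owning thread differs between
 * a shared [blk_src] layout and a shared [blk_dst] layout. */
size_t count_moved(size_t n, size_t blk_src, size_t blk_dst, int nthreads)
{
    assert(blk_src > 0 && blk_dst > 0 && nthreads > 0);
    size_t moved = 0;
    for (size_t i = 0; i < n; i++) {
        size_t from = (i / blk_src) % (size_t)nthreads;
        size_t to   = (i / blk_dst) % (size_t)nthreads;
        if (from != to)
            moved++;
    }
    return moved;
}
```

For the declarations above with 3 threads (block size 1 to block size 3, 9 elements), 6 of the 9 assignments dst[i] = src[i] cross thread boundaries whichever thread executes them.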

28
More sophisticated synchronization semantics

Consider the pull implementation of broadcast.

There is no need for arbitrary threads i and j
(i, j != 0) to synchronize with each other. Each
thread does a pairwise synchronization with
thread 0. Thread i will not have to wait if it
reaches its synchronization point after thread
0. Thread 0 returns from the call after it has
synchronized with each thread.
29
What's next?

The V1.0 collective spec will be adopted in the
next few weeks.
A reference implementation will be available from
MTU immediately afterwards.

MuPC run time system for UPC
UPC memory model (Chuck Wallace)
UPC programmability (Phil Merkey)
UPC test suite (Phil Merkey)

http://www.upc.mtu.edu
