Title: Steven Seidel


1
  • Steven Seidel
  • Department of Computer Science
  • Michigan Technological University
  • steve_at_mtu.edu

2
Overview
  • Background
  • Collective operations in the UPC language
  • The V1.0 UPC collectives specification
  • Relocalization operations
  • Computational operations
  • Performance and implementation issues
  • Extensions
  • Other work

3
Background
  • UPC is an extension of C that provides a
    partitioned shared memory programming model.
  • The V1.1 UPC spec was adopted on March 25.
  • Processes in UPC are called threads.
  • Each thread has a private (local) address space.
  • All threads share a global address space that is
    partitioned among the threads.
  • A shared object that resides in thread i's
    partition is said to have affinity to thread i.
  • If thread i has affinity to a shared object x, it
    is expected that accesses to x take less time
    than accesses to shared objects to which thread i
    does not have affinity.
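
Affinity follows from how a shared array is laid out across the threads. The following is a minimal sketch in plain C (not UPC), assuming the usual UPC layout in which blocks of consecutive elements are dealt to threads round-robin starting at thread 0. The name owner_of and its parameters are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the thread with affinity to element i of
   a shared array whose block size is blk, on a run with `threads` threads.
   Blocks of blk consecutive elements are assigned round-robin from thread 0. */
size_t owner_of(size_t i, size_t blk, size_t threads)
{
    assert(blk > 0 && threads > 0);
    return (i / blk) % threads;
}
```

For example, with a block size of 3 and 3 threads, element 5 belongs to thread 1 and element 9 wraps around to thread 0.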

4
UPC programming model
5
Collective operations in UPC
  • If any thread calls a collective function, then
    all threads must also call that function.
  • Collective arguments are single-valued:
    corresponding function arguments have the same
    value on every thread.
  • V1.1 UPC contains several collective functions:
  • upc_notify and upc_wait
  • upc_barrier
  • upc_all_alloc
  • upc_all_lock_alloc
  • These collectives provide synchronization and
    memory allocation across all threads.

6
shared void *upc_all_alloc(nblocks, nbytes)
  • This function allocates
    shared [nbytes] char[nblocks*nbytes]

shared [5] char *p;
[Figure: each thread calls p = upc_all_alloc(4,5); the local pointer p on every thread refers to the same shared allocation.]
7
The V1.0 UPC Collectives Spec
  • First draft by Wiebel and Greenberg, March 2002.
  • Spec discussed at the May 2002 and SC02 UPC
    workshops.
  • Many helpful comments from Dan Bonachea and Brian
    Wibecan.
  • V1.0 will be released shortly.

8
Collective functions
  • Initialization
  • upc_all_init
  • Relocalization collectives change data
    affinity.
  • upc_all_broadcast
  • upc_all_scatter
  • upc_all_gather
  • upc_all_gather_all
  • upc_all_exchange
  • upc_all_permute
  • Computational collectives for reduction and
    sorting.
  • upc_all_reduce
  • upc_all_prefix_reduce
  • upc_all_sort

9
void upc_all_broadcast(dst, src, blk)
Thread 0 sends the same block of data to each
thread.
shared [blk] char dst[blk*THREADS];
shared [] char src[blk];
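
The data movement of a broadcast can be sketched as a sequential model in plain C. This is not UPC and not the reference implementation. It assumes each thread's part of dst is represented by a separate buffer, and model_broadcast is a name invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of a broadcast. dst[t] points to the blk-byte block
   that would have affinity to thread t; src is thread 0's block. */
void model_broadcast(char *const dst[], const char *src,
                     size_t blk, size_t threads)
{
    assert(dst != NULL && src != NULL);
    for (size_t t = 0; t < threads; ++t)
        memcpy(dst[t], src, blk);   /* every thread receives the same block */
}
```

After the call, all `threads` destination blocks hold identical copies of the source block.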
10
void upc_all_scatter(dst, src, blk)
Thread 0 sends a unique block of data to each
thread.
shared [blk] char dst[blk*THREADS];
shared [] char src[blk*THREADS];
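
A scatter differs from a broadcast only in which source bytes each thread receives. The following plain-C model (not UPC) is a sketch that assumes one buffer per thread for dst. The name model_scatter is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of a scatter. src holds `threads` consecutive blocks on
   thread 0; dst[t] is the blk-byte block with affinity to thread t. */
void model_scatter(char *const dst[], const char *src,
                   size_t blk, size_t threads)
{
    assert(dst != NULL && src != NULL);
    for (size_t t = 0; t < threads; ++t)
        memcpy(dst[t], src + t * blk, blk);   /* block t goes to thread t */
}
```

With src = "aabbcc", blk = 2 and three threads, the threads receive "aa", "bb" and "cc" respectively.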
11
void upc_all_gather(dst, src, blk)
Each thread sends a block of data to thread 0.
shared char dstblkTHREADS
shared blk char srcblkTHREADS
12
void upc_all_gather_all(dst, src, blk)
Each thread sends one block of data to all
threads.
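
As a rough sequential model in plain C (not UPC), gather_all leaves every thread with the concatenation of all threads' blocks. The per-thread buffers and the name model_gather_all are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of gather_all. src[i] is the blk-byte block held by
   thread i; dst[j] is a buffer of threads*blk bytes held by thread j. */
void model_gather_all(char *const dst[], const char *const src[],
                      size_t blk, size_t threads)
{
    assert(dst != NULL && src != NULL);
    for (size_t j = 0; j < threads; ++j)          /* every receiver j ...   */
        for (size_t i = 0; i < threads; ++i)      /* ... gets every block i */
            memcpy(dst[j] + i * blk, src[i], blk);
}
```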
13
void upc_all_exchange(dst, src, blk)
Each thread sends a unique block of data to each
thread.
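
An exchange behaves like a transpose of blocks: block j of thread i becomes block i of thread j. The plain-C model below (not UPC) sketches that movement under the assumption of one buffer per thread. The name model_exchange is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of an exchange. src[i] and dst[j] are buffers of
   threads*blk bytes held by threads i and j respectively. */
void model_exchange(char *const dst[], const char *const src[],
                    size_t blk, size_t threads)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < threads; ++i)
        for (size_t j = 0; j < threads; ++j)
            /* sender i's j-th block lands in receiver j's i-th slot */
            memcpy(dst[j] + i * blk, src[i] + j * blk, blk);
}
```

With two threads holding "ab" and "cd" and blk = 1, the threads end up with "ac" and "bd".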
14
void upc_all_permute(dst, src, perm, blk)
Thread i sends a block of data to thread perm(i).
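
The permute operation can be sketched in the same sequential plain-C style (not UPC). The sketch assumes perm is a valid permutation of 0..threads-1 and is passed as an ordinary array. The name model_permute is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of a permute. src[i] is thread i's blk-byte block and
   dst[t] is the blk-byte block that thread t receives. */
void model_permute(char *const dst[], const char *const src[],
                   const size_t perm[], size_t blk, size_t threads)
{
    assert(dst != NULL && src != NULL && perm != NULL);
    for (size_t i = 0; i < threads; ++i) {
        assert(perm[i] < threads);          /* target must be a real thread */
        memcpy(dst[perm[i]], src[i], blk);  /* thread i sends to perm[i]    */
    }
}
```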
15
Computational collectives
  • Reduce and prefix reduce
  • One function for each C scalar type, e.g.,
    upc_all_reduceI() returns an integer
  • Operations:
    +, *, &, |, XOR, &&, ||, min, max
  • user-defined binary function
  • Sort
  • User-defined comparison function
  • void upc_all_sort(shared void *A,
        size_t size, size_t n, size_t blk,
        int (*func)(shared void *, shared void *))

16
int upc_all_reduceI(src, UPC_ADD, n, blk, NULL)
int i;
shared [3] int src[4*THREADS];
[Figure: with THREADS = 3, src holds 1, 2, 4, ..., 2048 in blocks of 3. Each thread calls i = upc_all_reduceI(src, UPC_ADD, 12, 3, NULL). The per-thread partial sums are 3591 (thread 0), 56 (thread 1) and 448 (thread 2); the result is 4095.]
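
The arithmetic in this example can be checked with a sequential plain-C model (not UPC). The sketch assumes one plausible strategy, in which each thread first sums the elements it has affinity to and the partial sums are then combined. The specification does not require this strategy, and model_reduce_add is a name invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential model of an additive reduction over n ints stored with block
   size blk across `threads` threads. partial[t] receives the sum of the
   elements that would have affinity to thread t; the return value is the
   overall sum. */
long model_reduce_add(const int *src, size_t n, size_t blk,
                      size_t threads, long *partial)
{
    assert(blk > 0 && threads > 0);
    for (size_t t = 0; t < threads; ++t)
        partial[t] = 0;
    for (size_t i = 0; i < n; ++i)
        partial[(i / blk) % threads] += src[i];   /* owner's local sum */
    long total = 0;
    for (size_t t = 0; t < threads; ++t)
        total += partial[t];                      /* combine step */
    return total;
}
```

Called with the twelve powers of two 1, 2, 4, ..., 2048, blk = 3 and three threads, the model gives partial sums of 3591, 56 and 448 and a total of 4095.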
17
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n,
blk, NULL)
shared int src[3*THREADS], dst[3*THREADS];
[Figure: with THREADS = 3, src holds 1, 2, 4, ..., 256 and dst receives the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511.]
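
What a prefix reduction with UPC_ADD computes can be stated as a short sequential loop. The plain-C sketch below (not UPC) ignores data distribution and only defines the expected values. The name model_prefix_add is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential model of an additive prefix reduction:
   dst[i] = src[0] + src[1] + ... + src[i]. */
void model_prefix_add(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        running += src[i];
        dst[i] = running;
    }
}
```

For the nine inputs 1, 2, 4, ..., 256 the outputs run from 1 up to 511.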
18
Performance and implementation issues
  • Push or pull?
  • Synchronization semantics
  • Effects of data distribution

19
A pull implementation of upc_all_broadcast
void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}
20
A push implementation of upc_all_broadcast
void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    int i;
    upc_forall( i=0; i<THREADS; ++i; 0 )   // Thread 0 only
        upc_memcpy( (shared char *)dst + i,
                    (shared char *)src, blk );
}
21
Synchronization semantics
  • When are function arguments ready?
  • When are function results available?

22
Synchronization semantics
  • Arguments with affinity to thread i are ready
    when thread i calls the function; results with
    affinity to thread i are ready when thread i
    returns.
  • This is appealing but it is incorrect: in a
    broadcast, thread 1 does not know when thread 0
    is ready.
23
Synchronization semantics
  • Require the implementation to provide barriers at
    function entry and exit.
  • This is convenient for the programmer but it is
    likely to adversely affect performance.

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_barrier;
    // pull
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
    upc_barrier;
}
24
Synchronization semantics
  • V1.0 spec: Synchronization is a user
    responsibility.

#define numelems 10
shared [] int A[numelems];
shared [numelems] int B[numelems*THREADS];

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}
. .
// Initialize A.
. .
upc_barrier;
upc_all_broadcast( B, A, sizeof(int)*numelems );
upc_barrier;
25
Performance and implementation issues
  • Data distribution affects both performance and
    implementation.

26
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n,
blk, NULL)
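shared int src[3*THREADS], dst[3*THREADS];
[Figure: with THREADS = 3, src holds 1, 2, 4, ..., 256 and dst receives the prefix sums 1, 3, 7, 15, 31, 63, 127, 255, 511, shown under a different data distribution.]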
27
Extensions
  • Strided copying
  • Vectors of offsets for src and dst arrays
  • Variable-sized blocks
  • Reblocking (cf. the preceding example of prefix
    reduce)
  • shared int src[3*THREADS];
  • shared [3] int dst[3*THREADS];
  • upc_forall(i=0; i<3*THREADS; i++; ?)
  • dst[i] = src[i];
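
Why reblocking is costly can be seen by counting how many element copies cross thread boundaries. The following plain-C sketch (not UPC) assumes the usual round-robin block layout and uses an invented helper name, count_remote_copies.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the elements i in 0..n-1 whose owning thread differs between a
   source array with block size src_blk and a destination array with block
   size dst_blk, so that dst[i] = src[i] moves data between threads. */
size_t count_remote_copies(size_t n, size_t src_blk, size_t dst_blk,
                           size_t threads)
{
    assert(src_blk > 0 && dst_blk > 0 && threads > 0);
    size_t remote = 0;
    for (size_t i = 0; i < n; ++i)
        if ((i / src_blk) % threads != (i / dst_blk) % threads)
            ++remote;
    return remote;
}
```

For 9 elements on 3 threads, copying from a block size of 1 to a block size of 3 moves 6 of the 9 elements between threads, whichever affinity expression is chosen for the loop.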

28
More sophisticated synchronization semantics
  • Consider the pull implementation of broadcast.

  • There is no need for arbitrary threads i and j
    (i, j != 0) to synchronize with each other.
  • Each thread does a pairwise synchronization with
    thread 0.
  • Thread i will not have to wait if it reaches its
    synchronization point after thread 0.
  • Thread 0 returns from the call after it has
    synchronized with each thread.
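
The benefit of pairwise synchronization can be illustrated with a small timing model in plain C (not UPC). It covers only the entry-side wait of a pull broadcast, uses abstract integer arrival times, and does not model thread 0's wait before returning. Both function names are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Entry wait of thread i under a full barrier: it waits for the last
   thread to arrive. arrival[t] is the time thread t reaches the call. */
long barrier_entry_wait(const long *arrival, size_t threads, size_t i)
{
    assert(threads > 0 && i < threads);
    long last = arrival[0];
    for (size_t t = 1; t < threads; ++t)
        if (arrival[t] > last)
            last = arrival[t];
    return last - arrival[i];
}

/* Entry wait of thread i under pairwise synchronization with thread 0:
   it waits only if it arrives before thread 0. */
long pairwise_entry_wait(const long *arrival, size_t i)
{
    long d = arrival[0] - arrival[i];
    return d > 0 ? d : 0;
}
```

With arrival times 5, 2 and 9 for threads 0, 1 and 2, thread 1 waits 7 time units under a barrier but only 3 under pairwise synchronization, and thread 2 does not wait in either case.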
29
What's next?
  • The V1.0 collective spec will be adopted in the
    next few weeks.
  • A reference implementation will be available from
    MTU immediately afterwards.

30
Other work
  • MuPC runtime system for UPC
  • UPC memory model (Chuck Wallace)
  • UPC programmability (Phil Merkey)
  • UPC test suite (Phil Merkey)

http://www.upc.mtu.edu