1
A UPC Runtime System Based on MPI and POSIX
Threads
  • Zhang Zhang, Jeevan Savant, Steve Seidel
  • Department of Computer Science
  • Michigan Technological University
  • {zhazhang, jvsavant, steve}@mtu.edu
  • http://www.upc.mtu.edu


2
Outline
  • Introduction
  • The UPC programming model
  • MuPC overview
  • Related work
  • Runtime system design
  • Performance features
  • Benchmark measurements
  • Summary and continuing work

3
1. Introduction
  • Unified Parallel C (UPC) is an extension of ANSI
    C that provides a partitioned shared memory model
    for parallel programming.
  • UPC programs are SPMD.
  • UPC compilers are available for platforms ranging
    from Linux clusters to the Cray X1.
  • MuPC is a runtime system that manages the
    execution of the user's UPC program.
  • The design and performance of MuPC will be
    discussed.

4
2. The UPC programming model
  • A short history
  • An overview of UPC

5
A short history
  • PCP: Eugene Brooks et al., 1991, Lawrence
    Livermore National Laboratory
  • AC: Draper and Carlson, 1995, IDA
  • Split-C: Culler, Yelick, et al., 1993, UC Berkeley
  • These efforts converged in Unified Parallel C
    (El-Ghazawi, Carlson, Draper):
    v1.0 February 2001, v1.1 July 2003, v1.2 June 2005
6
UPC, the language
  • UPC is an extension of ISO C99.
  • Every C program is a UPC program.
  • UPC is not a library like MPI, though UPC has
    libraries.
  • UPC processes are called threads.
  • Predefined identifiers THREADS and MYTHREAD are
    provided.
  • UPC programs are SPMD.
  • UPC is based on a partitioned shared memory
    model.
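  • For illustration (not from the original slides),
    a minimal UPC program using the predefined
    identifiers THREADS and MYTHREAD:

    /* every C program is a UPC program; THREADS and MYTHREAD
       are predefined identifiers */
    #include <upc.h>
    #include <stdio.h>

    int main(void)
    {
        /* each of the THREADS SPMD threads prints its identity */
        printf("hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;        /* all threads synchronize before exiting */
        return 0;
    }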

7
Partitioned shared memory model
  • Shared memory: a single address space shared by
    all processors. (Sun Enterprise, Cray T3E)
  • Distributed memory: each process has its own
    private address space. (Beowulf clusters)
  • Distributed shared memory: a single address space
    that is distributed among processors. (An
    illusion usually provided by a runtime system.)
  • Partitioned shared memory: a single address space
    that is logically partitioned among processors.
    The distribution of memory is built into the
    language. A physical partition may or may not be
    present on the hardware.

8
UPC partitioned shared address space
  • Each thread has a private (local) address space.
  • All threads share a global address space that is
    partitioned among the threads.
  • A shared object in thread i's region of the
    partition is said to have affinity to thread i.
  • If thread i has affinity to a shared object x, it
    is likely that accesses to x take less time than
    accesses to shared objects to which thread i does
    not have affinity.
  • Shared objects must be declared at file scope.
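  • A small sketch (added for illustration) of
    affinity, assuming the default cyclic layout:

    #include <upc.h>
    #include <stdio.h>

    /* shared objects are declared at file scope; with the default
       layout, element x[i] has affinity to thread i */
    shared int x[THREADS];

    int main(void)
    {
        x[MYTHREAD] = MYTHREAD;                   /* local shared access       */
        upc_barrier;
        int right = x[(MYTHREAD + 1) % THREADS];  /* likely a remote access    */
        printf("thread %d sees %d\n", MYTHREAD, right);
        return 0;
    }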

9
UPC programming model
10
Shared arrays and pointers-to-shared
  • Arrays are the fundamental shared objects in UPC.
  • Shared arrays are distributed block-cyclically
    (round robin, in blocksize chunks).
  • shared [5] int *p can be used to point to the
    array in the previous example.
  • Pointer arithmetic on p is transparent, that is,
    if p points to A[4], then ++p points to A[5].
  • int *q;
    q = (int *)p;
  • casts p to a true private pointer that can
    access elements of A that have affinity to this
    thread.

11
Parallel matrix multiply in UPC
#include <upc.h>
shared [100] double a[100][100];
shared [100] double b[100][100];
shared [100] double c[100][100];
int i, j, k;
upc_forall (i=0; i<100; i++; &a[i][0])
    for (j=0; j<100; j++) {
        c[i][j] = 0.0;
        for (k=0; k<100; k++)
            c[i][j] += a[i][k]*b[k][j];
    }
12
3. MuPC Overview
  • The complete MuPC system consists of a UPC
    compiler and a runtime system based on MPI-1 and
    POSIX threads.
  • The user's code app.c is translated by the EDG
    front end into an intermediate language (IL) tree.
  • The IL tree is lowered to pure C which includes
  • local structures representing shared data objects
  • calls to MuPC functions to perform remote
    accesses and other nonlocal actions, such as
    synchronization.
  • The lowered code is emitted as app.int.c.
  • mupcc compiles app.int.c and links with
    libmupc.a.
  • Run using mpirun -np n ./app

13
MuPC Overview
  • The MuPC runtime system is portable to any system
    that supports POSIX threads and MPI-1.
  • MuPC has been ported to Linux clusters,
    Alphaserver clusters, and Sun Enterprise
    platforms.
  • MuPC currently supports UPC v1.1.1
    (not yet at v1.2, though the
    differences are minor).
  • MuPC is open source. The EDG front end is
    distributed as a binary.

14
4. Related work
  • Berkeley UPC (Yelick, Bonachea, et al.)
  • Core and extended GASNet communication APIs allow
    mating UPC and Titanium runtime systems with many
    different transport layers, e.g., Myrinet,
    Quadrics, and even MPI.
  • Front end is Open64 source-to-source translator
  • Runtime system is platform independent
  • Highly portable
  • Open64 translator has a large footprint that
    encourages remote compilation at Berkeley.

15
Other UPC compilers
  • Hewlett-Packard
  • the first commercial UPC compiler
  • supports Tru64 Unix, HP-UX, and XC Linux clusters
  • offers a runtime cache and a trainable prefetcher
  • Intrepid UPC
  • extends GNU GCC compiler
  • only for shared memory platforms such as SGI Irix
    and Cray T3E
  • Cray UPC for the X1 and other current Cray
    platforms

16
5. Runtime system design
  • Runtime objects
  • Shared memory management
  • Shared memory accesses
  • Synchronization
  • Shared memory consistency
  • 2-threaded design

17
a) Runtime objects
  • Constants THREADS and MYTHREAD
  • A pointer-to-shared is a structure containing
  • 64 address bits
  • 32 phase (offset) bits
  • 32 thread number bits
  • linked lists of shared object attributes
  • init and fini routines handle startup/shutdown
    protocol.
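  • A sketch of how the pointer-to-shared described
    above might be laid out in C (field and type names
    are illustrative, not MuPC's actual declarations):

    #include <stdint.h>

    typedef struct {
        uint64_t addr;    /* 64 address bits: virtual address on the owner  */
        uint32_t phase;   /* 32 phase bits: offset within the current block */
        uint32_t thread;  /* 32 thread bits: thread that has affinity       */
    } mupc_shared_ptr_t;  /* hypothetical name */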

18
b) Shared memory management
  • The front end allocates static shared objects
  • On distributed memory platforms corresponding
    elements have the same local address in each
    thread.
  • The memory image of shared objects is the same
    for each thread. Some space may be wasted.
  • At startup, part of the heap is allocated for
    dynamically created shared objects.
  • The same approach to object allocation and
    addressing is used with dynamically created
    shared objects.
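  • For illustration, dynamic shared allocation goes
    through the standard UPC collectives, e.g.
    upc_all_alloc (a sketch, assuming the usual API):

    #include <upc.h>

    shared double *A;   /* private pointer-to-shared, one copy per thread */

    int main(void)
    {
        /* collective call: allocate THREADS blocks of 100 doubles,
           one block with affinity to each thread; every thread gets
           the same pointer, carved from the shared heap reserved at
           startup */
        A = (shared double *)upc_all_alloc(THREADS, 100 * sizeof(double));
        upc_barrier;
        if (MYTHREAD == 0)
            upc_free((shared void *)A);
        return 0;
    }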

19
c) Shared memory accesses
  • Reads and writes of shared memory are performed
    by corresponding get and put functions.
  • get functions are synchronous (blocking)
  • put functions are asynchronous, in general
  • Nonscalar shared objects are always moved
    synchronously as blocks of raw bytes.
  • UPC provides one-sided message passing functions
    such as upc_memcpy(). MuPC implements these with
    its block get and put functions.
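  • For illustration, a one-sided bulk transfer with
    the standard upc_memcpy, which MuPC services with
    its block get and put functions (a sketch, not
    from the original slides):

    #include <upc.h>

    #define N 1024
    shared [N] char src[N*THREADS];   /* one N-byte block per thread */
    shared [N] char dst[N*THREADS];

    int main(void)
    {
        int nbr = (MYTHREAD + 1) % THREADS;
        /* copy the neighbor's block into this thread's block;
           the raw bytes are moved synchronously as one block */
        upc_memcpy(&dst[MYTHREAD * N], &src[nbr * N], N);
        upc_barrier;
        return 0;
    }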

20
d) Synchronization
  • upc_barrier and its variants are implemented with
    a tree-based synchronization routine.
    (MPI_Barrier() cannot be used to implement all of
    the variants provided in UPC.)
  • Fences force completion of shared memory
    accesses. MuPC implements fences by blocking
    until pending accesses are complete.
  • UPC provides locks so the programmer can
    synchronize accesses to shared memory. MuPC
    implements a lock with a shared array of THREADS
    bits, one per thread.
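  • For illustration, programmer-level locking with
    the standard UPC lock API (each such lock is what
    MuPC backs with a shared array of THREADS bits):

    #include <upc.h>

    shared int counter;   /* has affinity to thread 0 */
    upc_lock_t *lock;

    int main(void)
    {
        lock = upc_all_lock_alloc();   /* collective: all threads get the same lock */
        upc_lock(lock);
        counter++;                     /* mutually exclusive update of shared data  */
        upc_unlock(lock);
        upc_barrier;
        if (MYTHREAD == 0)
            upc_lock_free(lock);
        return 0;
    }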

21
e) Shared memory consistency
  • UPC supports a noncoherent memory model.
  • relaxed accesses are the default.
  • The programmer can force consistency by
    explicitly stating that a memory operation is
    strict.
  • All threads must see strict operations occur in
    the same order.
  • A fence forces the completion of all outstanding
    memory operations in this thread.
  • Strict accesses consist of a relaxed access
    surrounded by fences.
  • A fence requires an ack from all threads written
    to since the last fence.
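  • The standard producer/consumer idiom illustrates
    the model (a sketch added here, not from the
    original slides):

    #include <upc.h>
    #include <upc_relaxed.h>   /* relaxed is the default consistency mode */

    shared int data;
    strict shared int flag;    /* strict: all threads see its updates in order */

    int main(void)
    {
        if (MYTHREAD == 0) {
            data = 42;         /* relaxed write                               */
            flag = 1;          /* strict write: visible only after data = 42  */
        } else {
            while (flag == 0)  /* strict read: spin until the producer is done */
                ;
            /* data == 42 is now guaranteed to be visible here */
        }
        return 0;
    }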

22
f) 2-threaded design
  • Each UPC thread is implemented as two POSIX
    threads
  • The user Pthread runs the user's UPC program.
  • compiled C program with calls to MuPC functions
  • The communication Pthread handles remote
    accesses.
  • an event-driven MPI program servicing requests
    for operations on shared objects
  • includes a persistent MPI_Irecv() to catch
    requests from other threads
  • yields when there are no requests on the queue
  • Thread safety is guaranteed by isolating all MPI
    calls within the communication Pthread.
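  • A hedged sketch (not MuPC's actual code) of such an
    event loop, with a hypothetical request type; the
    function would be launched with pthread_create:

    #include <mpi.h>
    #include <sched.h>

    #define REQ_TAG 1                                /* illustrative tag     */
    typedef struct { int op, thread, len; } req_t;   /* hypothetical request */

    void *comm_thread(void *arg)
    {
        req_t req;
        MPI_Request handle;
        int done;
        (void)arg;

        /* keep a receive posted for remote-access requests from other threads */
        MPI_Irecv(&req, sizeof req, MPI_BYTE, MPI_ANY_SOURCE,
                  REQ_TAG, MPI_COMM_WORLD, &handle);
        for (;;) {                /* shutdown handling omitted in this sketch */
            MPI_Test(&handle, &done, MPI_STATUS_IGNORE);
            if (done) {
                /* ...service the get/put/synchronization request... */
                MPI_Irecv(&req, sizeof req, MPI_BYTE, MPI_ANY_SOURCE,
                          REQ_TAG, MPI_COMM_WORLD, &handle);
            } else {
                sched_yield();    /* no pending requests: yield the processor */
            }
        }
        return NULL;
    }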

23
6. Performance features
  • Detecting accesses to local shared memory
  • easy, and critical to good performance
  • Runtime software cache
  • improves performance in some cases

24
Detecting local shared accesses
  • Accesses to local shared memory are detected at
    run time.
  • shared [10] int a[10*THREADS];
  • int i, b[10];
  • int *p;
  • // detected by MuPC
  • i = 10*MYTHREAD;
  • b[i] = a[i];
  • // detected by user
  • p = (int *)&a[i];
  • b[i] = *p;

25
Runtime software cache
  • Scalar remote references can be cached.
  • Direct mapped, write back
  • LRU replacement
  • small victim cache
  • THREADS cache segments in each thread (a thread's
    own segment is unused)
  • bit vector used to avoid false sharing
  • Performance of stride-1 reads and writes is improved
    by more than a factor of 10.
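  • A sketch of a cache line consistent with the
    features above (names are illustrative, not taken
    from MuPC's source):

    #include <stdint.h>

    #define LINE_BYTES 1024

    typedef struct {
        uint64_t tag;                      /* remote address of the cached block */
        int      valid, dirty;             /* write-back state                   */
        uint8_t  written[LINE_BYTES / 8];  /* one bit per byte written, so only
                                              locally modified bytes are flushed
                                              back, avoiding false sharing       */
        char     data[LINE_BYTES];
    } cache_line_t;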

26
7. Benchmark measurements
  • Streaming remote access
  • measures single-thread remote access rate
  • stride-1 reads and writes
  • random reads and writes
  • Natural ring
  • all threads read and write to neighbor in a ring
  • similar to all-processes-in-a-ring HPC benchmark
  • Units are thousands of references per second.

27
Parallel systems
  • Compiler and RTS
  • MuPC V1.1 without cache
  • MuPC V1.1 with cache
  • Berkeley V2.0 (has no cache)
  • HP V2.2 with cache
  • caches configured for 256xTHREADS 1K blocks
  • Platforms
  • HP AlphaServer SC
  • 8 4-way 667MHz EV67 nodes
  • runs HP UPC and MuPC
  • Linux/Myrinet cluster
  • 16 2-way 2GHz Pentiums
  • runs MuPC and Berkeley UPC

28
Streaming remote accesses
  • Stride-1 reads
  • Stride-1 writes
  • Random reads
  • Random writes

29
Single stream accesses, no cache, Pentium cluster
(chart: 10³ references/sec)
30
Single stream accesses, cache, AlphaServer cluster
(chart: 10³ references/sec)
31
Single stream accesses, Pentium cluster, MuPC with
and without cache
(chart: 10³ references/sec)
32
Natural ring, no cache, Pentium cluster
(chart: 10³ references/sec)
33
Natural ring, cache, AlphaServer cluster
(chart: 10³ references/sec)
34
Natural ring, Pentium cluster, MuPC with and
without cache
(chart: 10³ references/sec)
35
Performance summary
  • MuPC runtime cache significantly improves
    performance for stride-1 accesses.
  • The MuPC runtime cache penalizes random accesses
    far less than it benefits stride-1 accesses.
  • HP runtime cache is much slower for writes than
    reads, perhaps due to its write-through design.
  • Without a runtime cache, Berkeley cannot match
    stride-1 access performance of other systems.

36
Performance summary
  • Heavy network traffic increased MuPC and Berkeley
    access times as much as a factor of 10. HP held
    up better except for random reads.
  • HP performs better, sometimes much better, than
    MuPC except for stride-1 writes.
  • When the cache is turned off or not available,
    MuPC and Berkeley performance is a toss-up.

37
Other MuPC performance results
  • NAS Parallel Benchmarks
  • EP, CG, FT, IS, MG
  • IPDPS'05 PMEO Workshop
  • A UPC performance model
  • histogramming, matrix multiply, Sobel edge
    detection
  • IPDPS'06 (to appear)

38
8. Summary and continuing work
  • MuPC is a portable, open source implementation of
    UPC that provides performance comparable to other
    systems of similar design.
  • MuPC is a practical testbed for experimental
    features of partitioned shared memory languages.
  • Work on MuPC is continuing in the areas of
  • performance improvements
  • atomic memory operations
  • one-sided collective operations