1
A UPC Runtime System Based on MPI and POSIX
Threads
  • Zhang Zhang, Jeevan Savant, Steve Seidel
  • Department of Computer Science
  • Michigan Technological University
  • {zhazhang, jvsavant, steve}@mtu.edu
  • http://www.upc.mtu.edu


2
Outline
  • Introduction
  • The UPC programming model
  • MuPC overview
  • Related work
  • Runtime system design
  • Performance features
  • Benchmark measurements
  • Summary and continuing work

3
1. Introduction
  • Unified Parallel C (UPC) is an extension of ANSI
    C that provides a partitioned shared memory model
    for parallel programming.
  • UPC programs are SPMD.
  • UPC compilers are available for platforms ranging
    from Linux clusters to the Cray X1.
  • MuPC is a runtime system that manages the
    execution of the user's UPC program.
  • The design and performance of MuPC will be
    discussed.

4
2. The UPC programming model
  • A short history
  • An overview of UPC

5
A short history
  • PCP: Eugene Brooks et al., 1991, Lawrence
    Livermore National Laboratory
  • AC: Draper and Carlson, 1995, IDA
  • Split-C: Culler, Yelick, et al., 1993, UC Berkeley
  • These efforts converged in Unified Parallel C
    (El-Ghazawi, Carlson, Draper):
    v1.0 February 2001, v1.1 July 2003, v1.2 June 2005
6
UPC, the language
  • UPC is an extension of ISO C99.
  • Every C program is a UPC program.
  • UPC is not a library like MPI, though UPC has
    libraries.
  • UPC processes are called threads.
  • Predefined identifiers THREADS and MYTHREAD are
    provided.
  • UPC programs are SPMD.
  • UPC is based on a partitioned shared memory
    model.
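  • For illustration (not from the original slides),
    a minimal UPC program using the predefined
    identifiers THREADS and MYTHREAD:

    /* every C program is a UPC program; THREADS and MYTHREAD
       are predefined identifiers */
    #include <upc.h>
    #include <stdio.h>

    int main(void)
    {
        /* each of the THREADS SPMD threads prints its identity */
        printf("hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;        /* all threads synchronize before exiting */
        return 0;
    }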

7
Partitioned shared memory model
  • Shared memory: a single address space shared by
    all processors. (Sun Enterprise, Cray T3E)
  • Distributed memory: each process has its own
    private address space. (Beowulf clusters)
  • Distributed shared memory: a single address space
    that is distributed among processors. (An
    illusion usually provided by a runtime system.)
  • Partitioned shared memory: a single address space
    that is logically partitioned among processors.
    The distribution of memory is built into the
    language. A physical partition may or may not be
    present on the hardware.

8
UPC partitioned shared address space
  • Each thread has a private (local) address space.
  • All threads share a global address space that is
    partitioned among the threads.
  • A shared object in thread i's region of the
    partition is said to have affinity to thread i.
  • If thread i has affinity to a shared object x, it
    is likely that accesses to x take less time than
    accesses to shared objects to which thread i does
    not have affinity.
  • Shared objects must be declared at file scope.
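  • A small sketch (added for illustration) of
    affinity, assuming the default cyclic layout:

    #include <upc.h>
    #include <stdio.h>

    /* shared objects are declared at file scope; with the default
       layout, element x[i] has affinity to thread i */
    shared int x[THREADS];

    int main(void)
    {
        x[MYTHREAD] = MYTHREAD;                   /* local shared access       */
        upc_barrier;
        int right = x[(MYTHREAD + 1) % THREADS];  /* likely a remote access    */
        printf("thread %d sees %d\n", MYTHREAD, right);
        return 0;
    }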

9
UPC programming model
10
Shared arrays and pointers-to-shared
  • Arrays are the fundamental shared objects in UPC.
  • Shared arrays are distributed block-cyclically
    (round robin, in blocksize chunks).
  • shared [5] int *p can be used to point to the
    array in the previous example.
  • Pointer arithmetic on p is transparent, that is,
    if p points to A[4], then ++p points to A[5].
  • int *q;
    q = (int *)p;
  • casts p to a true private pointer that can
    access elements of A that have affinity to this
    thread.

11
Parallel matrix multiply in UPC
#include <upc.h>
shared [100] double a[100][100];
shared [100] double b[100][100];
shared [100] double c[100][100];
int i, j, k;
upc_forall (i=0; i<100; i++; &a[i][0])
    for (j=0; j<100; j++) {
        c[i][j] = 0.0;
        for (k=0; k<100; k++)
            c[i][j] += a[i][k]*b[k][j];
    }
12
3. MuPC Overview
  • The complete MuPC system consists of a UPC
    compiler and a runtime system based on MPI-1 and
    POSIX threads.
  • The user's code app.c is translated by the EDG
    front end into an intermediate language (IL) tree.
  • The IL tree is lowered to pure C which includes
  • local structures representing shared data objects
  • calls to MuPC functions to perform remote
    accesses and other nonlocal actions, such as
    synchronization.
  • The lowered code is emitted as app.int.c.
  • mupcc compiles app.int.c and links with
    libmupc.a.
  • Run using mpirun -np n ./app

13
MuPC Overview
  • The MuPC runtime system is portable to any system
    that supports POSIX threads and MPI-1.
  • MuPC has been ported to Linux clusters,
    Alphaserver clusters, and Sun Enterprise
    platforms.
  • MuPC currently supports UPC v1.1.1
    (not yet at v1.2, though the
    differences are minor).
  • MuPC is open source. The EDG front end is
    distributed as a binary.

14
4. Related work
  • Berkeley UPC (Yelick, Bonachea, et al.)
  • Core and extended GASNet communication APIs allow
    mating UPC and Titanium runtime systems with many
    different transport layers, e.g., Myrinet,
    Quadrics, and even MPI.
  • Front end is Open64 source-to-source translator
  • Runtime system is platform independent
  • Highly portable
  • Open64 translator has a large footprint that
    encourages remote compilation at Berkeley.

15
Other UPC compilers
  • Hewlett-Packard
  • the first commercial UPC compiler
  • supports Tru64 Unix, HP-UX, and XC Linux clusters
  • offers a runtime cache and a trainable prefetcher
  • Intrepid UPC
  • extends GNU GCC compiler
  • only for shared memory platforms such as SGI Irix
    and Cray T3E
  • Cray UPC for the X1 and other current Cray
    platforms

16
5. Runtime system design
  • Runtime objects
  • Shared memory management
  • Shared memory accesses
  • Synchronization
  • Shared memory consistency
  • 2-threaded design

17
a) Runtime objects
  • Constants THREADS and MYTHREAD
  • A pointer-to-shared is a structure containing
  • 64 address bits
  • 32 phase (offset) bits
  • 32 thread number bits
  • linked lists of shared object attributes
  • init and fini routines handle startup/shutdown
    protocol.
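  • A sketch of how the pointer-to-shared described
    above might be laid out in C (field and type names
    are illustrative, not MuPC's actual declarations):

    #include <stdint.h>

    typedef struct {
        uint64_t addr;    /* 64 address bits: virtual address on the owner  */
        uint32_t phase;   /* 32 phase bits: offset within the current block */
        uint32_t thread;  /* 32 thread bits: thread that has affinity       */
    } mupc_shared_ptr_t;  /* hypothetical name */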

18
b) Shared memory management
  • The front end allocates static shared objects
  • On distributed memory platforms corresponding
    elements have the same local address in each
    thread.
  • The memory image of shared objects is the same
    for each thread. Some space may be wasted.
  • At startup, part of the heap is allocated for
    dynamically created shared objects.
  • The same approach to object allocation and
    addressing is used with dynamically created
    shared objects.
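  • For illustration, dynamic shared allocation goes
    through the standard UPC collectives, e.g.
    upc_all_alloc (a sketch, assuming the usual API):

    #include <upc.h>

    shared double *A;   /* private pointer-to-shared, one copy per thread */

    int main(void)
    {
        /* collective call: allocate THREADS blocks of 100 doubles,
           one block with affinity to each thread; every thread gets
           the same pointer, carved from the shared heap reserved at
           startup */
        A = (shared double *)upc_all_alloc(THREADS, 100 * sizeof(double));
        upc_barrier;
        if (MYTHREAD == 0)
            upc_free((shared void *)A);
        return 0;
    }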

19
c) Shared memory accesses
  • Reads and writes of shared memory are performed
    by corresponding get and put functions.
  • get functions are synchronous (blocking)
  • put functions are asynchronous, in general
  • Nonscalar shared objects are always moved
    synchronously as blocks of raw bytes.
  • UPC provides one-sided message passing functions
    such as upc_memcpy(). MuPC implements these with
    its block get and put functions.
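  • For illustration, a one-sided bulk transfer with
    the standard upc_memcpy, which MuPC services with
    its block get and put functions (a sketch, not
    from the original slides):

    #include <upc.h>

    #define N 1024
    shared [N] char src[N*THREADS];   /* one N-byte block per thread */
    shared [N] char dst[N*THREADS];

    int main(void)
    {
        int nbr = (MYTHREAD + 1) % THREADS;
        /* copy the neighbor's block into this thread's block;
           the raw bytes are moved synchronously as one block */
        upc_memcpy(&dst[MYTHREAD * N], &src[nbr * N], N);
        upc_barrier;
        return 0;
    }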

20
d) Synchronization
  • upc_barrier and its variants are implemented with
    a tree-based synchronization routine.
    (MPI_Barrier() cannot be used to implement all of
    the variants provided in UPC.)
  • Fences force completion of shared memory
    accesses. MuPC implements fences by blocking
    until pending accesses are complete.
  • UPC provides locks so the programmer can
    synchronize accesses to shared memory. MuPC
    implements a lock with a shared array of THREADS
    bits, one per thread.
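  • For illustration, programmer-level locking with
    the standard UPC lock API (each such lock is what
    MuPC backs with a shared array of THREADS bits):

    #include <upc.h>

    shared int counter;   /* has affinity to thread 0 */
    upc_lock_t *lock;

    int main(void)
    {
        lock = upc_all_lock_alloc();   /* collective: all threads get the same lock */
        upc_lock(lock);
        counter++;                     /* mutually exclusive update of shared data  */
        upc_unlock(lock);
        upc_barrier;
        if (MYTHREAD == 0)
            upc_lock_free(lock);
        return 0;
    }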

21
e) Shared memory consistency
  • UPC supports a noncoherent memory model.
  • relaxed accesses are the default.
  • The programmer can force consistency by
    explicitly stating that a memory operation is
    strict.
  • All threads must see strict operations occur in
    the same order.
  • A fence forces the completion of all outstanding
    memory operations in this thread.
  • Strict accesses consist of a relaxed access
    surrounded by fences.
  • A fence requires an ack from all threads written
    to since the last fence.
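  • The standard producer/consumer idiom illustrates
    the model (a sketch added here, not from the
    original slides):

    #include <upc.h>
    #include <upc_relaxed.h>   /* relaxed is the default consistency mode */

    shared int data;
    strict shared int flag;    /* strict: all threads see its updates in order */

    int main(void)
    {
        if (MYTHREAD == 0) {
            data = 42;         /* relaxed write                               */
            flag = 1;          /* strict write: visible only after data = 42  */
        } else {
            while (flag == 0)  /* strict read: spin until the producer is done */
                ;
            /* data == 42 is now guaranteed to be visible here */
        }
        return 0;
    }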

22
f) 2-threaded design
  • Each UPC thread is implemented as two POSIX
    threads
  • The user Pthread runs the user's UPC program.
  • compiled C program with calls to MuPC functions
  • The communication Pthread handles remote
    accesses.
  • an event-driven MPI program servicing requests
    for operations on shared objects
  • includes a persistent MPI_Irecv() to catch
    requests from other threads
  • yields when there are no requests on the queue
  • Thread safety is guaranteed by isolating all MPI
    calls within the communication Pthread.
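  • A hedged sketch (not MuPC's actual code) of such an
    event loop, with a hypothetical request type; the
    function would be launched with pthread_create:

    #include <mpi.h>
    #include <sched.h>

    #define REQ_TAG 1                                /* illustrative tag     */
    typedef struct { int op, thread, len; } req_t;   /* hypothetical request */

    void *comm_thread(void *arg)
    {
        req_t req;
        MPI_Request handle;
        int done;
        (void)arg;

        /* keep a receive posted for remote-access requests from other threads */
        MPI_Irecv(&req, sizeof req, MPI_BYTE, MPI_ANY_SOURCE,
                  REQ_TAG, MPI_COMM_WORLD, &handle);
        for (;;) {                /* shutdown handling omitted in this sketch */
            MPI_Test(&handle, &done, MPI_STATUS_IGNORE);
            if (done) {
                /* ...service the get/put/synchronization request... */
                MPI_Irecv(&req, sizeof req, MPI_BYTE, MPI_ANY_SOURCE,
                          REQ_TAG, MPI_COMM_WORLD, &handle);
            } else {
                sched_yield();    /* no pending requests: yield the processor */
            }
        }
        return NULL;
    }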

23
6. Performance features
  • Detecting accesses to local shared memory
  • easy, and critical to good performance
  • Runtime software cache
  • improves performance in some cases

24
Detecting local shared accesses
  • Accesses to local shared memory are detected at
    run time.
  • shared [10] int a[10*THREADS];
  • int i, b[10];
  • int *p;
  • // detected by MuPC
  • i = 10*MYTHREAD;
  • b[i] = a[i];
  • // detected by user
  • p = (int *)&a[i];
  • b[i] = *p;

25
Runtime software cache
  • Scalar remote references can be cached.
  • Direct mapped, write back
  • LRU replacement
  • small victim cache
  • THREADS cache segments in each thread (a thread's
    own segment is unused)
  • bit vector used to avoid false sharing
  • Performance of stride-1 reads and writes is improved
    by more than a factor of 10.
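  • A sketch of a cache line consistent with the
    features above (names are illustrative, not taken
    from MuPC's source):

    #include <stdint.h>

    #define LINE_BYTES 1024

    typedef struct {
        uint64_t tag;                      /* remote address of the cached block */
        int      valid, dirty;             /* write-back state                   */
        uint8_t  written[LINE_BYTES / 8];  /* one bit per byte written, so only
                                              locally modified bytes are flushed
                                              back, avoiding false sharing       */
        char     data[LINE_BYTES];
    } cache_line_t;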

26
7. Benchmark measurements
  • Streaming remote access
  • measures single-thread remote access rate
  • stride-1 reads and writes
  • random reads and writes
  • Natural ring
  • all threads read and write to neighbor in a ring
  • similar to all-processes-in-a-ring HPC benchmark
  • Units are thousands of references per second.

27
Parallel systems
  • Compiler and RTS
  • MuPC V1.1 without cache
  • MuPC V1.1 with cache
  • Berkeley V2.0 (has no cache)
  • HP V2.2 with cache
  • caches configured for 256xTHREADS 1K blocks
  • Platforms
  • HP AlphaServer SC
  • 8 4-way 667MHz EV67 nodes
  • runs HP UPC and MuPC
  • Linux/Myrinet cluster
  • 16 2-way 2GHz Pentiums
  • runs MuPC and Berkeley UPC

28
Streaming remote accesses
  • Stride-1 reads
  • Stride-1 writes
  • Random reads
  • Random writes

29
Single stream accesses, no cache, Pentium cluster
(chart: 10³ references/sec)
30
Single stream accesses, cache, AlphaServer cluster
(chart: 10³ references/sec)
31
Single stream accesses, Pentium cluster, MuPC with
and without cache
(chart: 10³ references/sec)
32
Natural ring, no cache, Pentium cluster
(chart: 10³ references/sec)
33
Natural ring, cache, AlphaServer cluster
(chart: 10³ references/sec)
34
Natural ring, Pentium cluster, MuPC with and
without cache
(chart: 10³ references/sec)
35
Performance summary
  • MuPC runtime cache significantly improves
    performance for stride-1 accesses.
  • The MuPC runtime cache penalizes random accesses
    far less than it benefits stride-1 accesses.
  • HP runtime cache is much slower for writes than
    reads, perhaps due to its write-through design.
  • Without a runtime cache, Berkeley cannot match
    stride-1 access performance of other systems.

36
Performance summary
  • Heavy network traffic increased MuPC and Berkeley
    access times as much as a factor of 10. HP held
    up better except for random reads.
  • HP performs better, sometimes much better, than
    MuPC except for stride-1 writes.
  • When the cache is turned off or not available,
    MuPC and Berkeley performance is a toss-up.

37
Other MuPC performance results
  • NAS Parallel Benchmarks
  • EP, CG, FT, IS, MG
  • IPDPS'05 PMEO Workshop
  • A UPC performance model
  • histogramming, matrix multiply, Sobel edge
    detection
  • IPDPS'06 (to appear)

38
8. Summary and continuing work
  • MuPC is a portable, open source implementation of
    UPC that provides performance comparable to other
    systems of similar design.
  • MuPC is a practical testbed for experimental
    features of partitioned shared memory languages.
  • Work on MuPC is continuing in the areas of
  • performance improvements
  • atomic memory operations
  • one-sided collective operations