1
Shared Memory Programming for Large Scale Machines
C. Barton¹, C. Cascaval², G. Almasi², Y. Zheng³, M. Farreras⁴, J. Nelson Amaral¹
¹University of Alberta  ²IBM Watson Research Center  ³Purdue  ⁴Universitat Politecnica de Catalunya
IBM Research Report RC23853, January 27, 2006
2
Abstract
  • UPC is scalable and competitive with MPI on
    hundreds of thousands of processors.
  • This paper discusses the compiler and runtime
    system features that achieve this performance on
    the IBM BlueGene/L.
  • Three benchmarks are used:
  • HPC RandomAccess
  • HPC STREAM Triad
  • NAS Conjugate Gradient (CG)

3
1. BlueGene/L
  • 65,536 x 2-way 700 MHz processors (low power)
  • 280 sustained TFlops on HPL Linpack
  • 64 x 32 x 32 3D packet-switched torus network
  • XL UPC compiler and UPC runtime system (RTS)

4
2.1 XL Compiler Structure
  • UPC source is translated to W-code
  • An early version did as MuPC does: calls to the RTS
    were inserted directly into W-code. This prevents
    optimizations such as copy propagation and common
    subexpression elimination.
  • The current version delays the insertion of RTS
    calls. W-code is extended to represent shared
    variables and the memory access mode (strict or
    relaxed).
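  • A minimal sketch of the issue, with hypothetical RTS
    call names (the real XL RTS API is not shown here):

      #include <upc_relaxed.h>

      shared int a[THREADS];

      int twice_mine(void) {
          /* Two reads of the same shared element. */
          return a[MYTHREAD] + a[MYTHREAD];
      }

      /* Early lowering (MuPC-style) would turn each access into an
         opaque call, e.g.
             t1 = __rts_deref(h_a, MYTHREAD);   (hypothetical)
             t2 = __rts_deref(h_a, MYTHREAD);
         so TPO could not apply common subexpression elimination.
         Delayed lowering keeps both accesses as shared loads in
         W-code, lets TPO merge them, and emits one RTS call later. */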

5
XL Compiler (cont'd)
  • Toronto Portable Optimizer (TPO) can apply all
    the classical optimizations to shared memory
    accesses.
  • UPC-specific optimizations are also performed.

6
2.2 UPC Runtime System
  • The RTS targets:
  • SMPs using Pthreads
  • Ethernet and LAPI clusters using LAPI
  • BlueGene/L using the BlueGene/L message layer
  • TPO does link-time optimizations between the user
    program and the RTS.
  • Shared objects are accessed through handles.

7
Shared objects
  • The RTS identifies five shared object types:
  • shared scalars
  • shared structures/unions/enumerations
  • shared arrays
  • shared pointers [sic] with shared targets
  • shared pointers [sic] with private targets
  • Fat pointers increase remote access costs and
    limit scalability.
  • (optimizing remote accesses is discussed soon)
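  • A rough illustration of the cost (field names are
    illustrative; as the next slides describe, the RTS
    replaces global addresses with SVD handles):

      /* Illustrative layout of a pointer-to-shared ("fat" pointer). */
      struct fat_ptr {
          unsigned long thread;  /* UPC thread with affinity to the target */
          unsigned long phase;   /* position within the current block      */
          void         *addr;    /* local address, or an opaque handle     */
      };
      /* Every dereference must unpack all three fields and possibly
         issue a remote operation, so it costs far more than an
         ordinary C pointer dereference. */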

8
Shared Variable Directory (SVD)
  • Each thread on a distributed-memory machine
    maintains a two-level SVD that holds handles for all
    shared objects.
  • The SVD in each thread has THREADS+1 partitions.
  • Partition i contains handles for shared objects with
    affinity to thread i; the last partition contains
    handles for statically declared shared arrays.
  • Local sections of shared arrays do not have to be
    mapped to the same address on each thread.
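  • A hypothetical sketch of that layout (not the actual
    RTS types); a shared object is named by (partition,
    index), so only the owner needs its local address:

      typedef struct {
          void **handles;  /* local descriptors, valid only on the owner */
          int    count;
      } svd_partition;

      typedef struct {
          /* Partitions 0..THREADS-1 hold handles for objects with
             affinity to that thread; the extra partition holds the
             statically declared shared arrays. */
          svd_partition part[THREADS + 1];
      } svd;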

9
SVD benefits
  • Scalability: pointers-to-shared do not have to span
    all of shared memory. Only the owner knows the
    addresses of its shared objects; remote accesses are
    made via handles.
  • Each thread mediates access to its shared objects,
    so coherence problems are reduced.¹
  • Only nonblocking synchronization is needed for
    upc_global_alloc(), for example.

¹ Runtime caching is beyond the scope of this paper.
10
2.3 Messaging Library
  • This topic is beyond the scope of this talk.
  • Note, however, that the network layer does not
    support one-sided communication.

11
3. Compiler Optimizations
  • 3.1 upc_forall(init; limit; incr; affinity)
  • 3.2 local memory optimizations
  • 3.3 update optimizations

12
3.1 upc_forall
  • The affinity parameter may be:
  • pointer-to-shared
  • integer type
  • continue
  • If the (unmodified) induction variable is used as the
    affinity expression, the per-iteration affinity
    conditional is eliminated (see the sketch below).
  • This is the only optimization technique used.
  • "... even this simple optimization captures most
    of the loops in the existing UPC benchmarks."
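  • A sketch of the transformation on the common pattern
    (the strip-mined loop shows the idea, not the exact
    code TPO emits; N is a placeholder):

      #include <upc_relaxed.h>
      #define N 1024

      shared double a[N], b[N];

      void scale(void) {
          int i;
          /* Affinity is the unmodified induction variable i ... */
          upc_forall (i = 0; i < N; i++; i)
              a[i] = 2.0 * b[i];

          /* ... so the implicit "does iteration i have affinity to
             MYTHREAD?" test disappears and the loop is strip-mined:
                 for (i = MYTHREAD; i < N; i += THREADS)
                     a[i] = 2.0 * b[i];                              */
      }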

13
Observations
  • upc_forall loops cannot be meaningfully nested.
  • upc_forall loops must be inner loops for this
    optimization to pay off.

14
3.2 Local Memory Operations
  • Try to turn dereferences of fat pointers into
    dereferences of ordinary C pointers.
  • Optimization is attempted only when affinity can
    be statically determined.
  • Move the base address calculation to the loop
    preheader (initialization block).
  • Generate code to access intrinsic types directly;
    otherwise, use memcpy (see the sketch below).
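  • A sketch of the result, following the usual UPC
    privatization idiom (BLOCK is a placeholder; the exact
    code TPO generates is not shown):

      #include <upc_relaxed.h>
      #define BLOCK 1024

      shared [BLOCK] double a[BLOCK * THREADS];

      void zero_local_section(void) {
          /* Base address computed once, in the loop preheader: this
             block of a has affinity to MYTHREAD, so the cast to a
             plain C pointer is legal and removes fat-pointer math. */
          double *la = (double *) &a[MYTHREAD * BLOCK];
          int i;
          for (i = 0; i < BLOCK; i++)
              la[i] = 0.0;   /* ordinary local store */
      }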

15
3.3 Update Optimizations
  • Consider operations of the form r = r op B, where
    r is a remote shared object and B is local or
    remote.
  • Implement this as an active message [Culler, UC
    Berkeley].
  • Send the entire instruction to the thread with
    affinity to r.
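  • A sketch of the mechanism with a hypothetical message
    format and handler (the BlueGene/L message-layer
    interface itself is not shown in the paper):

      #include <stddef.h>

      /* Source statement: r = r + b, where r has affinity to another
         thread. Instead of a remote get followed by a remote put
         (two round trips), the whole operation travels in one
         message to r's owner. */
      struct update_msg {
          size_t index;    /* identifies r within the owner's partition */
          double operand;  /* the value b, evaluated at the sender      */
      };

      /* Run by the thread with affinity to r when the message arrives
         (registration with the message layer is omitted). */
      void update_handler(const struct update_msg *m, double *base) {
          base[m->index] += m->operand;  /* update applied locally */
      }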

16
4. Experimental Results
  • 4.1 Hardware
  • 4.2 HPC RandomAccess benchmark
  • 4.3 Embarrassingly Parallel (EP) STREAM triad
  • 4.4 NAS CG
  • 4.5 Performance evaluation

17
4.1 Hardware
  • Development done on 64-processor node cards.
  • TJ Watson: 20 racks, 40,960 processors
  • LLNL: 64 racks, 131,072 processors

18
4.2 HPC RandomAccess
  • 111 lines of code
  • Read-modify-write updates to randomly selected
    remote objects.
  • Uses 50% of memory.
  • Seems like a good match for the update optimization
    (the kernel is sketched below).
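  • The heart of the benchmark is roughly the following
    (a simplified sketch of the HPCC update loop;
    TABLE_SIZE, NUPDATES, and next_random() are
    placeholders):

      #include <upc_relaxed.h>
      #include <stdint.h>

      #define TABLE_SIZE (1UL << 20)   /* sized to ~50% of memory in the real run */
      #define NUPDATES   (4 * TABLE_SIZE)

      shared uint64_t Table[TABLE_SIZE];

      extern uint64_t next_random(uint64_t ran);  /* placeholder PRNG step */

      void random_access(void) {
          uint64_t ran = MYTHREAD + 1;
          uint64_t i;
          for (i = 0; i < NUPDATES / THREADS; i++) {
              ran = next_random(ran);
              /* Random remote read-modify-write: exactly the r = r op B
                 pattern that the update optimization packs into one
                 message to the owning thread. */
              Table[ran & (TABLE_SIZE - 1)] ^= ran;
          }
      }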

19
4.3 EP STREAM Triad
  • 105 lines of code
  • All computation is done locally within a
    upc_forall loop.
  • Seems like a good match for the loop
    optimization.
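  • The triad is essentially a single upc_forall loop (a
    sketch; N and the scalar are placeholders). Because
    the affinity expression is the induction variable,
    every iteration touches only data it owns:

      #include <upc_relaxed.h>
      #define N 1024

      shared double a[N], b[N], c[N];

      void triad(void) {
          const double scalar = 3.0;
          int i;
          /* With affinity i and the default cyclic layout, the thread
             running iteration i owns a[i], b[i], and c[i]: all accesses
             are local, so the forall and indexing optimizations apply. */
          upc_forall (i = 0; i < N; i++; i)
              a[i] = b[i] + scalar * c[i];
      }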

20
4.4 NAS CG
  • GW's translation of the MPI code into UPC.
  • Uses upc_memcpy in place of MPI sends and
    receives.
  • It is not clear whether IBM used GW's
    hand-optimized version.
  • IBM mentions that they manually privatized some
    pointers, which is what is done in GW's optimized
    version.

21
4.5 Performance Evaluation
  • Table 1
  • FE is the MuPC-style front end, which contains some
    optimizations.
  • The others all use the TPO front end:
  • no optimizations
  • indexing: shared-to-local pointer reduction
  • update: active messages
  • forall: the upc_forall affinity optimization
  • Speedups are relative to no TPO optimization.
  • The maximum speedup for random and stream is 2.11.

22
Combined Speedup
  • The combined stream speedup is 241!
  • This is attributed to the shared-to-local pointer
    reductions.
  • This seems inconsistent with indexing speedups
    of 1.01 and 1.32 for the random and stream
    benchmarks, respectively.

23
Table 2 Random Access
  • This is basically a measurement of how many
    asynchronous messages can be started up.
  • It is not known whether the network can do
    combining.
  • Beats MPI (0.56 vs. 0.45) on 2048 processors.

24
Table 3 Streams
  • This is EP.

25
CG
  • Speedup tracks MPI through 512 processors.
  • Speedup exceeds MPI on 1024 and 2048 processors.
  • This is a fixed-problem-size benchmark so network
    latency eventually dominates.
  • The improvement over MPI is explained as follows: "In
    the UPC implementation, due to the use of one-sided
    communication, the overheads are smaller compared
    to MPI two-sided overhead." But the BlueGene/L
    network does not implement one-sided
    communication.

26
Comments? I have some
  • Recall slide 12: "... even this simple
    upc_forall affinity optimization captures most
    of the loops in the existing UPC benchmarks."
  • From the abstract: "We demonstrate not only that
    shared memory programming for hundreds of
    thousands of processors is possible, but also
    that with the right support from the compiler and
    run-time system, the performance of the resulting
    codes is comparable to MPI implementations."
  • The large-scale scalability demonstrated is for
    two 100-line codes implementing the simplest of
    all benchmarks.
  • The scalability of CG was knowingly limited by
    fixed problem size. Only two data points are
    offered that outperform MPI.
