Title: Shared Memory Programming for Large Scale Machines
Slide 1: Shared Memory Programming for Large Scale Machines
C. Barton (1), C. Cascaval (2), G. Almasi (2), Y. Zheng (3), M. Farreras (4), J. Nelson Amaral (1)
(1) University of Alberta  (2) IBM Watson Research Center  (3) Purdue  (4) Universitat Politecnica de Catalunya
IBM Research Report RC23853, January 27, 2006
Slide 2: Abstract
- UPC is scalable and competitive with MPI on hundreds of thousands of processors.
- This paper discusses the compiler and runtime system features that achieve this performance on the IBM BlueGene/L.
- Three benchmarks are used:
  - HPC RandomAccess
  - HPC STREAMS
  - NAS Conjugate Gradient (CG)
Slide 3: 1. BlueGene/L
- 65,536 x 2-way 700 MHz processors (low power)
- 280 sustained TFlops on HPL Linpack
- 64 x 32 x 32 3D packet-switched torus network
- XL UPC compiler and UPC runtime system (RTS)
Slide 4: 2.1 XL Compiler Structure
- UPC source is translated to W-code.
- An early version worked the way MuPC does: calls to the RTS were inserted directly into W-code. This prevents optimizations such as copy propagation and common sub-expression elimination.
- The current version delays the insertion of RTS calls. W-code is extended to represent shared variables and the memory access mode (strict or relaxed).
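As an illustration of what the extended W-code must carry (this fragment is mine, not from the paper), UPC distinguishes relaxed and strict accesses at the source level:

```c
#include <upc_relaxed.h>  /* relaxed is the default access mode in this file */

shared int data[THREADS]; /* relaxed shared accesses unless qualified */
strict shared int flag;   /* strict: accesses keep program order      */

int main(void) {
    data[MYTHREAD] = MYTHREAD;  /* relaxed shared store; freely reorderable */
    upc_barrier;
    if (MYTHREAD == 0)
        flag = 1;               /* strict shared store; ordering preserved  */
    return 0;
}
```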
Slide 5: XL Compiler (cont'd)
- The Toronto Portable Optimizer (TPO) can apply all the classical optimizations to shared memory accesses.
- UPC-specific optimizations are also performed.
Slide 6: 2.2 UPC Runtime System
- The RTS targets:
  - SMPs, using Pthreads
  - Ethernet and LAPI clusters, using LAPI
  - BlueGene/L, using the BlueGene/L message layer
- TPO does link-time optimizations between the user program and the RTS.
- Shared objects are accessed through handles.
Slide 7: Shared Objects
- The RTS identifies five shared object types:
  - shared scalars
  - shared structures/unions/enumerations
  - shared arrays
  - shared pointers [sic] with shared targets
  - shared pointers [sic] with private targets
- Fat pointers increase remote access costs and limit scalability (a sketch of a fat pointer follows below).
- (Optimizing remote accesses is discussed soon.)
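For reference, a UPC pointer-to-shared is typically a "fat" descriptor rather than a raw address; here is a minimal sketch (field names and widths are illustrative, not IBM's layout):

```c
/* Illustrative fat-pointer layout; a real implementation packs these
 * fields differently, but every remote dereference must decode them. */
typedef struct {
    unsigned long thread; /* thread that owns the referenced element */
    unsigned long phase;  /* position within the current block       */
    unsigned long addr;   /* per-thread address or handle            */
} fat_ptr_t;
```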
Slide 8: Shared Variable Directory (SVD)
- On a distributed memory machine, each thread holds a two-level SVD containing handles that point to all shared objects.
- The SVD in each thread has THREADS+1 partitions.
- Partition i contains handles for shared objects with affinity to thread i, except the last partition, which contains handles for statically declared shared arrays.
- Local sections of shared arrays do not have to be mapped to the same address on each thread.
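A hypothetical sketch of the two-level structure (types and field names are mine, for exposition only):

```c
/* Hypothetical two-level Shared Variable Directory.  Partition i holds
 * handles for shared objects with affinity to thread i; the extra
 * partition (index THREADS) holds statically declared shared arrays. */
typedef struct {
    void  *local_addr;  /* meaningful only on the owning thread */
    size_t elem_size;   /* element size in bytes                */
    size_t block_size;  /* UPC layout (blocking) factor         */
} svd_handle_t;

typedef struct {
    svd_handle_t **partitions; /* THREADS + 1 partitions of handles   */
    size_t        *count;      /* number of handles in each partition */
} svd_t;

/* A remote access names (partition, index) rather than a raw address,
 * so only the owning thread ever translates a handle into a pointer. */
```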
Slide 9: SVD Benefits
- Scalability: pointers-to-shared do not have to span all of shared memory. Only the owner knows the addresses of its shared objects; remote accesses are made via handles.
- Each thread mediates access to its shared objects, so coherence problems are reduced. (1)
- Only non-blocking synchronization is needed for upc_global_alloc(), for example.

(1) Runtime caching is beyond the scope of this paper.
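For context, upc_global_alloc() is the non-collective shared allocator: one thread allocates blocks with affinity to every thread, and with the SVD the other threads only need their directories updated, which the RTS can do without blocking them. A usage sketch (names are illustrative):

```c
#include <upc.h>

#define BLK 256  /* doubles per block, one block per thread */

/* Called by a single thread (non-collective): allocates THREADS blocks
 * of BLK doubles, the i-th block having affinity to thread i. */
shared [BLK] double *make_table(void) {
    return (shared [BLK] double *)
        upc_global_alloc(THREADS, BLK * sizeof(double));
}
```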
Slide 10: 2.3 Messaging Library
- This topic is beyond the scope of this talk.
- Note, however, that the network layer does not
support one-sided communication.
Slide 11: 3. Compiler Optimizations
- 3.1 upc_forall(init; limit; incr; affinity)
- 3.2 Local memory optimizations
- 3.3 Update optimizations
Slide 12: 3.1 upc_forall
- The affinity parameter may be:
  - a pointer-to-shared
  - an integer type
  - "continue"
- If the affinity expression is the (unmodified) induction variable, the affinity conditional is eliminated (see the sketch below).
- This is the only optimization technique used.
- "... even this simple optimization captures most of the loops in the existing UPC benchmarks."
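A plausible before/after sketch of that transformation (my reconstruction, not the compiler's literal output), for an integer affinity expression equal to the induction variable:

```c
#include <upc.h>
#define N (1024 * THREADS)

shared double a[N];  /* default block size 1: a[i] lives on thread i % THREADS */

void zero(void) {
    int i;
    /* Source form: iteration i runs on the thread with affinity to i. */
    upc_forall (i = 0; i < N; i++; i)
        a[i] = 0.0;
}

void zero_naive(void) {
    int i;
    /* Naive translation: every thread walks all N iterations and
     * tests affinity each time. */
    for (i = 0; i < N; i++)
        if (i % THREADS == MYTHREAD)
            a[i] = 0.0;
}

void zero_optimized(void) {
    int i;
    /* Optimized form: start at MYTHREAD, stride by THREADS, and the
     * affinity conditional disappears. */
    for (i = MYTHREAD; i < N; i += THREADS)
        a[i] = 0.0;
}
```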
Slide 13: Observations
- upc_forall loops cannot be meaningfully nested.
- upc_forall loops must be inner loops for this
optimization to pay off.
Slide 14: 3.2 Local Memory Operations
- Try to turn dereferences of fat pointers into dereferences of ordinary C pointers (a before/after sketch follows below).
- The optimization is attempted only when affinity can be statically determined.
- Move the base address calculation to the loop preheader (initialization block).
- Generate code to access intrinsic types directly; otherwise use memcpy.
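A simplified before/after sketch of the rewrite (mine, not the compiler's output), for a loop that only touches elements with affinity to MYTHREAD:

```c
#include <upc.h>
#define N (1024 * THREADS)

shared double a[N];  /* default block size 1 */

void scale(double s) {
    int i;
    /* Before: every access to a[i] goes through the fat pointer / RTS,
     * even though the affinity clause guarantees it is local. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = s * a[i];
}

void scale_privatized(double s) {
    int i, j;
    /* After: the base address of the local slice is computed once in
     * the loop preheader; the body uses an ordinary C pointer.  The
     * cast is legal because a[MYTHREAD] has affinity to this thread. */
    double *base = (double *)&a[MYTHREAD];
    for (i = MYTHREAD, j = 0; i < N; i += THREADS, j++)
        base[j] = s * base[j];
}
```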
Slide 15: 3.3 Update Optimizations
- Consider operations of the form r = r op B, where r is a remote shared object and B is local or remote.
- Implement this as an active message (Culler, UC Berkeley).
- Send the entire instruction to the thread with affinity to r (see the sketch below).
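A sketch of the pattern and of the lowering described on this slide (the lowering is expressed only in comments; the actual message-layer interface is not shown in the talk):

```c
#include <upc_relaxed.h>
#define TABSIZE (1024 * THREADS)

shared unsigned long table[TABSIZE];

/* Source pattern r = r op B: table[idx] may live on any thread. */
void update(unsigned long idx, unsigned long val) {
    table[idx] ^= val;  /* remote read-modify-write */
}

/* Naive lowering: GET table[idx], XOR locally, PUT it back -- two
 * network traversals per update.  With the update optimization the
 * compiler instead ships one active message carrying (XOR, idx, val)
 * to the owner of table[idx], which applies the operation locally. */
```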
Slide 16: 4. Experimental Results
- 4.1 Hardware
- 4.2 HPC RandomAccess benchmark
- 4.3 Embarrassingly Parallel (EP) STREAM triad
- 4.4 NAS CG
- 4.5 Performance evaluation
Slide 17: 4.1 Hardware
- Development done on 64-processor node cards.
- TJ Watson: 20 racks, 40,960 processors
- LLNL: 64 racks, 131,072 processors
Slide 18: 4.2 HPC RandomAccess
- 111 lines of code.
- Read-modify-write of randomly selected remote objects.
- Uses 50% of memory.
- Seems a good match for the update optimization (a simplified kernel follows below).
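A simplified UPC-style kernel in the spirit of the benchmark (this is not the 111-line HPCC code; the random-number generator here is a stand-in for the HPCC polynomial generator):

```c
#include <upc_relaxed.h>
#include <stdint.h>

#define TABSIZE (4096UL * THREADS)  /* illustrative table size */
#define NUPDATE (4UL * TABSIZE)

shared uint64_t Table[TABSIZE];

static inline uint64_t next_random(uint64_t x) {
    /* xorshift stand-in for the HPCC generator */
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return x;
}

void random_access(void) {
    uint64_t i, ran = next_random(MYTHREAD + 1);
    for (i = MYTHREAD; i < NUPDATE; i += THREADS) {
        ran = next_random(ran);
        Table[ran % TABSIZE] ^= ran;  /* mostly-remote r = r op B update */
    }
}
```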
Slide 19: 4.3 EP STREAM Triad
- 105 lines of code.
- All computation is done locally within a upc_forall loop.
- Seems like a good match for the loop optimization (a minimal triad kernel follows below).
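A minimal UPC triad in the same spirit (array sizes, blocking, and names are illustrative, not the 105-line benchmark itself):

```c
#include <upc_relaxed.h>

#define BLK 1024              /* elements per thread (illustrative) */
#define N   (BLK * THREADS)

/* Block the arrays so each thread owns one contiguous chunk. */
shared [BLK] double a[N], b[N], c[N];

void triad(double scalar) {
    int i;
    /* Every iteration touches only data with affinity to the executing
     * thread, so both the forall and the privatization optimizations apply. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = b[i] + scalar * c[i];
}
```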
Slide 20: 4.4 NAS CG
- GW's translation of MPI code into UPC.
- Uses upc_memcpy in place of MPI sends and receives (sketch below).
- It is not clear whether IBM used GW's hand-optimized version.
- IBM mentions that they manually privatized some pointers, which is what is done in GW's optimized version.
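The flavor of that translation, as a hedged sketch (buffer names and the exchange pattern are mine, not GW's code): each thread exposes a send buffer in shared space, and the receiver pulls from its partner with upc_memcpy instead of a matched MPI_Send/MPI_Recv pair.

```c
#include <upc.h>

#define CHUNK 1024  /* doubles exchanged per step (illustrative) */

/* One CHUNK-sized slice per thread; slice t has affinity to thread t. */
shared [CHUNK] double sendbuf[CHUNK * THREADS];
shared [CHUNK] double recvbuf[CHUNK * THREADS];

void exchange_with(int partner) {
    upc_barrier;                              /* partner's data is ready       */
    upc_memcpy(&recvbuf[MYTHREAD * CHUNK],    /* my slice of recvbuf (local)   */
               &sendbuf[partner * CHUNK],     /* partner's slice of sendbuf    */
               CHUNK * sizeof(double));
    upc_barrier;                              /* partner may now reuse sendbuf */
}
```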
Slide 21: 4.5 Performance Evaluation
- Table 1:
  - FE is a MuPC-style front end containing some optimizations.
  - The others all use the TPO front end:
    - no optimizations
    - indexing: shared-to-local pointer reduction
    - update: active messages
    - forall: upc_forall affinity optimization
- Speedups are relative to no TPO optimization.
- The maximum speedup for random and stream is 2.11.
Slide 22: Combined Speedup
- The combined stream speedup is 241!
- This is attributed to the shared-to-local pointer reductions.
- This seems inconsistent with indexing speedups of 1.01 and 1.32 for the random and streams benchmarks, respectively.
Slide 23: Table 2: Random Access
- This is basically a measurement of how many asynchronous messages can be started up.
- It is not known whether the network can do combining.
- Beats MPI (0.56 vs. 0.45) on 2048 processors.
Slide 24: Table 3: Streams
Slide 25: CG
- Speedup tracks MPI through 512 processors.
- Speedup exceeds MPI on 1024 and 2048 processors.
- This is a fixed-problem-size benchmark, so network latency eventually dominates.
- The improvement over MPI is explained thus: "In the UPC implementation, due to the use of one-sided communication, the overheads are smaller compared to MPI two-sided overhead." But the BlueGene/L network does not implement one-sided communication.
Slide 26: Comments? I have some
- Recall slide 12: "... even this simple upc_forall affinity optimization captures most of the loops in the existing UPC benchmarks."
- From the abstract: "We demonstrate not only that shared memory programming for hundreds of thousands of processors is possible, but also that with the right support from the compiler and run-time system, the performance of the resulting codes is comparable to MPI implementations."
- The large-scale scalability demonstrated is for two 100-line codes for the simplest of all benchmarks.
- The scalability of CG was knowingly limited by fixed problem size. Only two data points that outperform MPI are offered.