Title: Shared Memory Programming for Large Scale Machines
Slide 1: Shared Memory Programming for Large Scale Machines
C. Barton (1), C. Cascaval (2), G. Almasi (2), Y. Zheng (3), M. Farreras (4), J. Nelson Amaral (1)
(1) University of Alberta  (2) IBM Watson Research Center  (3) Purdue  (4) Universitat Politecnica de Catalunya
IBM Research Report RC23853, January 27, 2006
Slide 2: Abstract
- UPC is scalable and competitive with MPI on hundreds of thousands of processors.
- This paper discusses the compiler and runtime system features that achieve this performance on the IBM BlueGene/L.
- Three benchmarks are used:
  - HPC RandomAccess
  - HPC STREAMS
  - NAS Conjugate Gradient (CG)
Slide 3: 1. BlueGene/L
- 65,536 x 2-way 700 MHz processors (low power)
- 280 sustained TFlops on HPL Linpack
- 64 x 32 x 32 3D packet-switched torus network
- XL UPC compiler and UPC runtime system (RTS)
Slide 4: 2.1 XL Compiler Structure
- UPC source is translated to W-code.
- An early version worked the way MuPC does: calls to the RTS were inserted directly into W-code. This prevents optimizations such as copy propagation and common sub-expression elimination.
- The current version delays the insertion of RTS calls. W-code is extended to represent shared variables and the memory access mode (strict or relaxed).
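As an illustration of what the extended W-code must carry (this fragment is mine, not from the paper), UPC distinguishes relaxed and strict accesses at the source level:

```c
#include <upc_relaxed.h>  /* relaxed is the default access mode in this file */

shared int data[THREADS]; /* relaxed shared accesses unless qualified */
strict shared int flag;   /* strict: accesses keep program order      */

int main(void) {
    data[MYTHREAD] = MYTHREAD;  /* relaxed shared store; freely reorderable */
    upc_barrier;
    if (MYTHREAD == 0)
        flag = 1;               /* strict shared store; ordering preserved  */
    return 0;
}
```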
Slide 5: XL Compiler (cont'd)
- The Toronto Portable Optimizer (TPO) can apply all the classical optimizations to shared memory accesses.
- UPC-specific optimizations are also performed.
Slide 6: 2.2 UPC Runtime System
- The RTS targets:
  - SMPs, using Pthreads
  - Ethernet and LAPI clusters, using LAPI
  - BlueGene/L, using the BlueGene/L message layer
- TPO does link-time optimizations between the user program and the RTS.
- Shared objects are accessed through handles.
Slide 7: Shared Objects
- The RTS identifies five shared object types:
  - shared scalars
  - shared structures/unions/enumerations
  - shared arrays
  - shared pointers [sic] with shared targets
  - shared pointers [sic] with private targets
- Fat pointers increase remote access costs and limit scalability (a sketch of a fat pointer follows below).
- (Optimizing remote accesses is discussed soon.)
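For reference, a UPC pointer-to-shared is typically a "fat" descriptor rather than a raw address; here is a minimal sketch (field names and widths are illustrative, not IBM's layout):

```c
/* Illustrative fat-pointer layout; a real implementation packs these
 * fields differently, but every remote dereference must decode them. */
typedef struct {
    unsigned long thread; /* thread that owns the referenced element */
    unsigned long phase;  /* position within the current block       */
    unsigned long addr;   /* per-thread address or handle            */
} fat_ptr_t;
```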
Slide 8: Shared Variable Directory (SVD)
- On a distributed memory machine, each thread holds a two-level SVD containing handles that point to all shared objects.
- The SVD in each thread has THREADS+1 partitions.
- Partition i contains handles for shared objects with affinity to thread i, except the last partition, which contains handles for statically declared shared arrays.
- Local sections of shared arrays do not have to be mapped to the same address on each thread.
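A hypothetical sketch of the two-level structure (types and field names are mine, for exposition only):

```c
/* Hypothetical two-level Shared Variable Directory.  Partition i holds
 * handles for shared objects with affinity to thread i; the extra
 * partition (index THREADS) holds statically declared shared arrays. */
typedef struct {
    void  *local_addr;  /* meaningful only on the owning thread */
    size_t elem_size;   /* element size in bytes                */
    size_t block_size;  /* UPC layout (blocking) factor         */
} svd_handle_t;

typedef struct {
    svd_handle_t **partitions; /* THREADS + 1 partitions of handles   */
    size_t        *count;      /* number of handles in each partition */
} svd_t;

/* A remote access names (partition, index) rather than a raw address,
 * so only the owning thread ever translates a handle into a pointer. */
```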
Slide 9: SVD Benefits
- Scalability: pointers-to-shared do not have to span all of shared memory. Only the owner knows the addresses of its shared objects; remote accesses are made via handles.
- Each thread mediates access to its shared objects, so coherence problems are reduced. (1)
- Only non-blocking synchronization is needed for upc_global_alloc(), for example.

(1) Runtime caching is beyond the scope of this paper.
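For context, upc_global_alloc() is the non-collective shared allocator: one thread allocates blocks with affinity to every thread, and with the SVD the other threads only need their directories updated, which the RTS can do without blocking them. A usage sketch (names are illustrative):

```c
#include <upc.h>

#define BLK 256  /* doubles per block, one block per thread */

/* Called by a single thread (non-collective): allocates THREADS blocks
 * of BLK doubles, the i-th block having affinity to thread i. */
shared [BLK] double *make_table(void) {
    return (shared [BLK] double *)
        upc_global_alloc(THREADS, BLK * sizeof(double));
}
```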
Slide 10: 2.3 Messaging Library
- This topic is beyond the scope of this talk.
- Note, however, that the network layer does not
support one-sided communication.
Slide 11: 3. Compiler Optimizations
- 3.1 upc_forall(init; limit; incr; affinity)
- 3.2 Local memory optimizations
- 3.3 Update optimizations
Slide 12: 3.1 upc_forall
- The affinity parameter may be:
  - a pointer-to-shared
  - an integer type
  - "continue"
- If the affinity expression is the (unmodified) induction variable, the affinity conditional is eliminated (see the sketch below).
- This is the only optimization technique used.
- "... even this simple optimization captures most of the loops in the existing UPC benchmarks."
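A plausible before/after sketch of that transformation (my reconstruction, not the compiler's literal output), for an integer affinity expression equal to the induction variable:

```c
#include <upc.h>
#define N (1024 * THREADS)

shared double a[N];  /* default block size 1: a[i] lives on thread i % THREADS */

void zero(void) {
    int i;
    /* Source form: iteration i runs on the thread with affinity to i. */
    upc_forall (i = 0; i < N; i++; i)
        a[i] = 0.0;
}

void zero_naive(void) {
    int i;
    /* Naive translation: every thread walks all N iterations and
     * tests affinity each time. */
    for (i = 0; i < N; i++)
        if (i % THREADS == MYTHREAD)
            a[i] = 0.0;
}

void zero_optimized(void) {
    int i;
    /* Optimized form: start at MYTHREAD, stride by THREADS, and the
     * affinity conditional disappears. */
    for (i = MYTHREAD; i < N; i += THREADS)
        a[i] = 0.0;
}
```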
Slide 13: Observations
- upc_forall loops cannot be meaningfully nested.
- upc_forall loops must be inner loops for this
optimization to pay off.
Slide 14: 3.2 Local Memory Operations
- Try to turn dereferences of fat pointers into dereferences of ordinary C pointers (a before/after sketch follows below).
- The optimization is attempted only when affinity can be statically determined.
- Move the base address calculation to the loop preheader (initialization block).
- Generate code to access intrinsic types directly; otherwise use memcpy.
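A simplified before/after sketch of the rewrite (mine, not the compiler's output), for a loop that only touches elements with affinity to MYTHREAD:

```c
#include <upc.h>
#define N (1024 * THREADS)

shared double a[N];  /* default block size 1 */

void scale(double s) {
    int i;
    /* Before: every access to a[i] goes through the fat pointer / RTS,
     * even though the affinity clause guarantees it is local. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = s * a[i];
}

void scale_privatized(double s) {
    int i, j;
    /* After: the base address of the local slice is computed once in
     * the loop preheader; the body uses an ordinary C pointer.  The
     * cast is legal because a[MYTHREAD] has affinity to this thread. */
    double *base = (double *)&a[MYTHREAD];
    for (i = MYTHREAD, j = 0; i < N; i += THREADS, j++)
        base[j] = s * base[j];
}
```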
Slide 15: 3.3 Update Optimizations
- Consider operations of the form r = r op B, where r is a remote shared object and B is local or remote.
- Implement this as an active message (Culler, UC Berkeley).
- Send the entire instruction to the thread with affinity to r (see the sketch below).
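A sketch of the pattern and of the lowering described on this slide (the lowering is expressed only in comments; the actual message-layer interface is not shown in the talk):

```c
#include <upc_relaxed.h>
#define TABSIZE (1024 * THREADS)

shared unsigned long table[TABSIZE];

/* Source pattern r = r op B: table[idx] may live on any thread. */
void update(unsigned long idx, unsigned long val) {
    table[idx] ^= val;  /* remote read-modify-write */
}

/* Naive lowering: GET table[idx], XOR locally, PUT it back -- two
 * network traversals per update.  With the update optimization the
 * compiler instead ships one active message carrying (XOR, idx, val)
 * to the owner of table[idx], which applies the operation locally. */
```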
Slide 16: 4. Experimental Results
- 4.1 Hardware
- 4.2 HPC RandomAccess benchmark
- 4.3 Embarrassingly Parallel (EP) STREAM triad
- 4.4 NAS CG
- 4.5 Performance evaluation
Slide 17: 4.1 Hardware
- Development done on 64-processor node cards.
- TJ Watson: 20 racks, 40,960 processors
- LLNL: 64 racks, 131,072 processors
Slide 18: 4.2 HPC RandomAccess
- 111 lines of code.
- Read-modify-write of randomly selected remote objects.
- Uses 50% of memory.
- Seems a good match for the update optimization (a simplified kernel follows below).
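A simplified UPC-style kernel in the spirit of the benchmark (this is not the 111-line HPCC code; the random-number generator here is a stand-in for the HPCC polynomial generator):

```c
#include <upc_relaxed.h>
#include <stdint.h>

#define TABSIZE (4096UL * THREADS)  /* illustrative table size */
#define NUPDATE (4UL * TABSIZE)

shared uint64_t Table[TABSIZE];

static inline uint64_t next_random(uint64_t x) {
    /* xorshift stand-in for the HPCC generator */
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return x;
}

void random_access(void) {
    uint64_t i, ran = next_random(MYTHREAD + 1);
    for (i = MYTHREAD; i < NUPDATE; i += THREADS) {
        ran = next_random(ran);
        Table[ran % TABSIZE] ^= ran;  /* mostly-remote r = r op B update */
    }
}
```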
Slide 19: 4.3 EP STREAM Triad
- 105 lines of code.
- All computation is done locally within a upc_forall loop.
- Seems like a good match for the loop optimization (a minimal triad kernel follows below).
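A minimal UPC triad in the same spirit (array sizes, blocking, and names are illustrative, not the 105-line benchmark itself):

```c
#include <upc_relaxed.h>

#define BLK 1024              /* elements per thread (illustrative) */
#define N   (BLK * THREADS)

/* Block the arrays so each thread owns one contiguous chunk. */
shared [BLK] double a[N], b[N], c[N];

void triad(double scalar) {
    int i;
    /* Every iteration touches only data with affinity to the executing
     * thread, so both the forall and the privatization optimizations apply. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = b[i] + scalar * c[i];
}
```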
Slide 20: 4.4 NAS CG
- GW's translation of MPI code into UPC.
- Uses upc_memcpy in place of MPI sends and receives (sketch below).
- It is not clear whether IBM used GW's hand-optimized version.
- IBM mentions that they manually privatized some pointers, which is what is done in GW's optimized version.
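The flavor of that translation, as a hedged sketch (buffer names and the exchange pattern are mine, not GW's code): each thread exposes a send buffer in shared space, and the receiver pulls from its partner with upc_memcpy instead of a matched MPI_Send/MPI_Recv pair.

```c
#include <upc.h>

#define CHUNK 1024  /* doubles exchanged per step (illustrative) */

/* One CHUNK-sized slice per thread; slice t has affinity to thread t. */
shared [CHUNK] double sendbuf[CHUNK * THREADS];
shared [CHUNK] double recvbuf[CHUNK * THREADS];

void exchange_with(int partner) {
    upc_barrier;                              /* partner's data is ready       */
    upc_memcpy(&recvbuf[MYTHREAD * CHUNK],    /* my slice of recvbuf (local)   */
               &sendbuf[partner * CHUNK],     /* partner's slice of sendbuf    */
               CHUNK * sizeof(double));
    upc_barrier;                              /* partner may now reuse sendbuf */
}
```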
Slide 21: 4.5 Performance Evaluation
- Table 1:
  - FE is a MuPC-style front end containing some optimizations.
  - The others all use the TPO front end:
    - no optimizations
    - indexing: shared-to-local pointer reduction
    - update: active messages
    - forall: upc_forall affinity optimization
- Speedups are relative to no TPO optimization.
- The maximum speedup for random and stream is 2.11.
Slide 22: Combined Speedup
- The combined stream speedup is 241!
- This is attributed to the shared-to-local pointer reductions.
- This seems inconsistent with indexing speedups of 1.01 and 1.32 for the random and streams benchmarks, respectively.
Slide 23: Table 2: Random Access
- This is basically a measurement of how many asynchronous messages can be started up.
- It is not known whether the network can do combining.
- Beats MPI (0.56 vs. 0.45) on 2048 processors.
Slide 24: Table 3: Streams
Slide 25: CG
- Speedup tracks MPI through 512 processors.
- Speedup exceeds MPI on 1024 and 2048 processors.
- This is a fixed-problem-size benchmark, so network latency eventually dominates.
- The improvement over MPI is explained thus: "In the UPC implementation, due to the use of one-sided communication, the overheads are smaller compared to MPI two-sided overhead." But the BlueGene/L network does not implement one-sided communication.
Slide 26: Comments? I have some
- Recall slide 12: "... even this simple upc_forall affinity optimization captures most of the loops in the existing UPC benchmarks."
- From the abstract: "We demonstrate not only that shared memory programming for hundreds of thousands of processors is possible, but also that with the right support from the compiler and run-time system, the performance of the resulting codes is comparable to MPI implementations."
- The large-scale scalability demonstrated is for two 100-line codes for the simplest of all benchmarks.
- The scalability of CG was knowingly limited by fixed problem size. Only two data points that outperform MPI are offered.