Title: Evolution of the NERSC SP System (NERSC User Services)
Evolution of the NERSC SP System
NERSC User Services
- Original Plans
- Phase 1
- Phase 2
- Programming Models and Code Porting
- Using the System
Original Plans: The NERSC-3 Procurement
- Complete, reliable, high-end scientific system
- High availability and long MTBF
- Fully configured: processing, storage, software, networking, support
- Commercially available components
- The greatest amount of computational power for the money
- Can be integrated with the existing computing environment
- Can be evolved with the product line
- Much careful benchmarking and acceptance testing done
Original Plans: The NERSC-3 Procurement
- What we wanted
- > 1 teraflop of peak performance
- 10 terabytes of storage
- 1 terabyte of memory
- What we got in phase 1
- 410 gigaflops of peak performance
- 10 terabytes of storage
- 512 gigabytes of memory
- What we will get in phase 2
- 3 teraflops of peak performance
- 15 terabytes of storage
- 1 terabyte of memory
Hardware, Phase 1
- 304 Power 3 nodes (Nighthawk 1)
- Node usage
  - 256 compute/batch nodes (512 CPUs)
  - 8 login nodes (16 CPUs)
  - 16 GPFS nodes (32 CPUs)
  - 8 network nodes (16 CPUs)
  - 16 service nodes (32 CPUs)
- 2 processors/node
- 200 MHz clock
- 4 flops/clock (2 multiply-add ops) = 800 Mflops/CPU, 1.6 Gflops/node
- 64 KB L1 data cache per CPU @ 5 nsec, 3.2 GB/sec
- 4 MB L2 cache per CPU @ 45 nsec, 6.4 GB/sec
- 1 GB RAM per node @ 175 nsec, 1.6 GB/sec
- 150 MB/sec switch bandwidth
- 9 GB local disk (two-way RAID)
Hardware, Phase 2
- 152 Power 3 nodes (Winterhawk 2)
- Node usage
  - 128 compute/batch nodes (2048 CPUs)
  - 2 login nodes (32 CPUs)
  - 16 GPFS nodes (256 CPUs)
  - 2 network nodes (32 CPUs)
  - 4 service nodes (64 CPUs)
- 16 processors/node
- 375 MHz clock
- 4 flops/clock (2 multiply-add ops) = 1.5 Gflops/CPU, 24 Gflops/node
- 64 KB L1 data cache per CPU @ 5 nsec, 3.2 GB/sec
- 8 MB L2 cache per CPU @ 45 nsec, 6.4 GB/sec
- 8 GB RAM per node @ 175 nsec, 14.0 GB/sec
- 2000 (?) MB/sec switch bandwidth
- 9 GB local disk (two-way RAID)
Programming Models, Phase 1
- Phase 1 will rely on MPI, with threading available via
  - OpenMP directives
  - Pthreads
  - IBM SMP directives
- MPI now does intra-node communications efficiently
- Mixed-model programming is not currently very advantageous
- PVM and LAPI messaging systems are also available
- SHMEM is planned
- The SP has cache and virtual memory, which means
  - There are more ways to reduce code performance
  - There are more ways to lose portability
Programming Models, Phase 2
- Phase 2 will offer more payback for mixed-model programming
- Single-node parallelism is a good target for PVP users
  - Vector and shared-memory codes can be expanded into MPI
- MPI codes can be ported from the T3E
  - Threading can be added within MPI
- In both cases, re-engineering will be required to exploit new and different levels of granularity
  - This can be done along with increasing problem sizes
Porting Considerations, part 1
- Things to watch out for in porting codes to the SP:
- Cache
  - Not enough on the T3E to make worrying about it worth the trouble
  - Enough on the SP to boost performance, if it is used well
  - Tuning for cache is different from tuning for vectorization
  - False sharing caused by cache can reduce performance
- Virtual memory
  - Gives you access to 1.75 GB of (virtual) RAM address space
  - To use all of virtual (or even real) memory, you must explicitly request segments
  - Causes performance degradation due to paging
- Data types
  - Default sizes are different on PVP, T3E, and SP systems
  - integer, int, real, and float must be used carefully
  - Best to say what you mean: real*8, integer*4
  - Do the same in MPI calls: MPI_REAL8, MPI_INTEGER4 (see the sketch below)
  - Be careful with intrinsic function use, as well
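A minimal sketch of the explicit-size style recommended above, with illustrative names (n, buf) and a hypothetical 100-element broadcast; the point is that a real*8 buffer travels as MPI_REAL8, never as a default REAL whose size differs across PVP, T3E, and SP systems:

      program kinds
      implicit none
      include 'mpif.h'
      integer*4 :: n, ierr
      real*8    :: buf(100)          ! say what you mean: real*8, not real
      call MPI_INIT(ierr)
      buf = 0.0d0
      n = 100
      ! The MPI datatype must match the declared size exactly
      call MPI_BCAST(buf, n, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end program kinds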
Porting Considerations, part 2
- More things to watch out for in porting codes to the SP:
- Arithmetic
  - Architecture tuning can help exploit special processor instructions
  - Both T3E and SP can optimize beyond IEEE arithmetic
  - T3E and PVP can also do fast reduced-precision arithmetic
  - Compiler options on the T3E and SP can force IEEE compliance
  - Compiler options can also throttle other optimizations for safety
  - Special libraries offer faster intrinsics
- MPI
  - SP compilers and runtime will catch loose usage that was accepted on the T3E
  - Communication bandwidth on the SP Phase 1 is lower than on the T3E
  - Message latency on the SP Phase 1 is higher than on the T3E
  - We expect approximate parity with the T3E in these areas on the Phase 2 system
  - Limited number of communication ports per node (approximately one per CPU)
  - Default versus eager buffer management in MPI_SEND (see the sketch below)
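A minimal sketch of the MPI_SEND buffering pitfall, assuming exactly two tasks (the names out, in, and the 1000-element messages are illustrative). An exchange in which both ranks send first completes only while messages are small enough to be eagerly buffered; once the library falls back to rendezvous, both ranks block in MPI_SEND. MPI_SENDRECV is one portable fix:

      program eager
      implicit none
      include 'mpif.h'
      integer :: me, other, ierr, stat(MPI_STATUS_SIZE)
      real*8  :: out(1000), in(1000)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
      other = 1 - me                   ! neighbor rank, assuming two tasks
      out = dble(me)
      ! Unsafe: both ranks would sit in MPI_SEND under rendezvous protocol
      !   call MPI_SEND(out, 1000, MPI_REAL8, other, 0, MPI_COMM_WORLD, ierr)
      !   call MPI_RECV(in, 1000, MPI_REAL8, other, 0, MPI_COMM_WORLD, stat, ierr)
      ! Safe: the combined call pairs the send and receive internally
      call MPI_SENDRECV(out, 1000, MPI_REAL8, other, 0, &
                        in, 1000, MPI_REAL8, other, 0,  &
                        MPI_COMM_WORLD, stat, ierr)
      call MPI_FINALIZE(ierr)
      end program eager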
Porting Considerations, part 3
- Compiling and linking
- The compiler to use depends on language and parallelization scheme
- Language versions
  - Fortran 77: f77, xlf
  - Fortran 90: xlf90
  - Fortran 95: xlf95
  - C: cc, xlc, c89
  - C++: xlC
  - MPI-included: mpxlf, mpxlf90, mpcc, mpCC
  - Thread-safe: xlf_r, xlf90_r, xlf95_r, mpxlf_r, mpxlf90_r
- Preprocessing can be ordered by compiler flag or source file suffix
- Use the same driver consistently for all related compilations; the following may NOT produce a parallel executable:
  - mpxlf90 -c *.F
  - xlf90 -o foo *.o
- Use the -bmaxdata:bytes option to get more than a single 256 MB data segment (up to 7 segments, or 1.75 GB, can be specified; only 3, or 0.75 GB, are real)
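For contrast, a minimal sketch of a consistent build (foo.F and the segment count are illustrative); linking with the same MPI driver pulls in the MPI library and startup code, and -bmaxdata requests three 256 MB data segments:

    mpxlf90 -c foo.F                            # preprocess (.F suffix) and compile
    mpxlf90 -o foo foo.o -bmaxdata:0x30000000   # link; 3 x 256 MB = 0.75 GB of data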
Porting MPI
- MPI codes should port relatively well
- Use one MPI task per node or per processor
  - One per node during porting
  - One per processor during production
- Let MPI worry about where it is communicating to
- Environment variables, execution parameters, and/or batch options can specify (see the sketch below):
  - Tasks per node
  - Total tasks
  - Total processors
  - Total nodes
  - Communications subsystem in use
    - User Space is best in batch jobs
    - IP may be best for interactive developmental runs
- There is a debug queue/class in batch
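A minimal sketch using IBM POE environment variables (the executable name ./foo is illustrative, and the exact variable set honored at a given site may differ):

    export MP_NODES=2               # total nodes
    export MP_TASKS_PER_NODE=2      # MPI tasks per node
    export MP_PROCS=4               # total MPI tasks
    export MP_EUILIB=ip             # IP for interactive runs; "us" (User Space) in batch
    ./foo

The same choices can be given as execution parameters (e.g., poe ./foo -procs 4) or as batch options.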
Porting Shared Memory
- Don't throw away old shared-memory directives
  - OpenMP will work as is
  - Cray tasking directives will be useful for documentation
  - We recommend porting Cray directives to OpenMP (see the sketch below)
- Even small-scale parallelism can be useful
  - Larger-scale parallelism will be available next year
- If your problems and/or algorithms will scale to larger granularities and greater parallelism, prepare for message passing
  - We recommend MPI
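A minimal sketch of such a port; the Cray autotasking spelling (CMIC$ DO ALL) is shown from memory, and the loop body is illustrative:

    ! Cray autotasking version, kept as documentation:
    !     CMIC$ DO ALL SHARED(a, n) PRIVATE(i)
    ! OpenMP version of the same loop:
    !$OMP PARALLEL DO SHARED(a, n) PRIVATE(i)
          do i = 1, n
             a(i) = 2.0 * a(i)
          enddo
    !$OMP END PARALLEL DO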
From Loop-slicing to MPI, before...
      allocate(A(1:imax, 1:jmax))
      !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
      do I = 1, imax
        do J = 1, jmax
          A(I,J) = deep_thought(A, I, J, ...)
        enddo
      enddo
- Sanity checking
  - Run the program on one CPU to get baseline answers
  - Run on several CPUs to see parallel speedups and answers
- Optimization
  - Consider changing memory access patterns to improve cache usage (see the sketch below)
  - How big can your problem get before you run out of real memory?
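A minimal sketch of the memory-access point, reusing the slide's deep_thought placeholder: Fortran arrays are column-major, so the first index should vary fastest. The loop nest above strides through memory with J innermost; interchanging the loops makes the inner loop touch contiguous elements of A:

      !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
      do J = 1, jmax
        do I = 1, imax
          A(I,J) = deep_thought(A, I, J, ...)
        enddo
      enddo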
From Loop-slicing to MPI, after...
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call my_indices(my_id, nprocs, my_imin, my_imax, my_jmin, my_jmax)
      allocate(A(my_imin:my_imax, my_jmin:my_jmax))
      !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, my_imin, my_imax, my_jmin, my_jmax)
      do I = my_imin, my_imax
        do J = my_jmin, my_jmax
          A(I,J) = deep_thought(A, I, J, ...)
        enddo
      enddo
      ! Communicate the shared values with neighbors.
      ! (The neighbor ranks left_id, right_id, top_id, bottom_id and the
      !  status array were omitted on the original slide; they are added
      !  here to make the MPI argument lists complete.)
      if (odd(my_id)) then
        call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_id, tag, MPI_COMM_WORLD, ierr)
        call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_id, tag, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(my_top(...), topsize, MPI_REAL, top_id, tag, MPI_COMM_WORLD, ierr)
        call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
      else
        call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_id, tag, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_id, tag, MPI_COMM_WORLD, ierr)
        ! ... (the matching top/bottom exchange continues as above)
      endif
From Loop-slicing to MPI, after...
- You now have one MPI task and many OpenMP threads per node
  - The MPI task does all the communicating between nodes
  - The OpenMP threads do the parallelizable work
  - Do NOT use MPI within an OpenMP parallel region
- Sanity checking
  - Run on one node and one CPU to check baseline answers
  - Run on one node and several CPUs to see parallel speedup and answers
  - Run on several nodes, one CPU per node, and check answers
  - Run on several nodes, several CPUs per node, and check answers
- Scaling checking
  - Run a larger version of a similar problem on the same set of ensemble sizes
  - Run the same-sized problem on a larger ensemble
- (Re-)Consider your I/O strategy
From MPI to Loop-slicing
- Add OpenMP directives to existing MPI code (see the sketch below)
- Perform sanity and scaling checks, as before
- Results in the same overall code structure as on the previous slides
  - One MPI task and several OpenMP threads per node
- For irregular codes, Pthreads may serve better, at the cost of increased complexity
- Nobody really expects it to be this easy...
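A minimal sketch of this direction (B, nlocal, and work are hypothetical names): the OpenMP directive is dropped onto the compute loop that each MPI task already owns, and the MPI calls stay outside the parallel region:

      !$OMP PARALLEL DO PRIVATE(i), SHARED(B, nlocal)
      do i = 1, nlocal              ! each MPI task loops over its own block
        B(i) = work(B, i)           ! work() stands in for the real kernel
      enddo
      !$OMP END PARALLEL DO
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)   ! communication outside the region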
Using the Machine, part 1
- Somewhat similar to the Crays
- Interactive and batch jobs are possible
Using the Machine, part 2
- Interactive runs
  - Sequential executions run immediately on your login node
  - Every login will likely put you on a different node, so be careful about looking for your executions: ps returns info about only the node you're logged into
  - Small-scale parallel jobs may be rejected if LoadLeveler can't find the resources
- There are two pools of nodes that can be used for interactive jobs
  - Login nodes
  - A small subset of the compute nodes
- Parallel execution can often be achieved by
  - Trying again, after initial rejection
  - Changing communication mechanisms from User Space to IP
  - Using the other pool
Using the Machine, part 3
- Batch jobs
  - Currently, very similar in capability to the T3E
    - Similar run times, processor counts
    - More memory available on the SP
  - Limits and capabilities may change, as we learn the machine
- LoadLeveler is similar to, but simpler than, NQE/NQS on the T3E
  - Jobs are submitted, monitored, and cancelled by special commands
- Each batch job requires a script that is essentially a shell script (see the sketch below)
  - The first few lines contain batch options that look like comments to the shell
  - The rest of the script can contain any shell constructs
  - Scripts can be debugged by executing them interactively
- Users are limited to 3 running jobs, 10 queued jobs, and 30 submitted jobs, at any given time
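A minimal sketch of such a script; the class name, node counts, limits, and file names are illustrative rather than site defaults. The leading "# @" lines are LoadLeveler options, invisible to the shell:

    #!/usr/bin/ksh
    # @ job_type         = parallel
    # @ class            = regular
    # @ node             = 4
    # @ tasks_per_node   = 2
    # @ wall_clock_limit = 04:00:00
    # @ output           = myjob.$(jobid).out
    # @ error            = myjob.$(jobid).err
    # @ queue
    ./foo                 # the parallel executable (illustrative)

Submit with llsubmit, monitor with llq, and cancel with llcancel.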
Using the Machine, part 4
- File systems
  - Use the environment variables to let the system manage your file usage
  - Sequential work can be done in $HOME (not backed up) or $TMPDIR (transient)
    - Medium performance, node-local
  - Parallel work can be done in $SCRATCH (transient) or /scratch/username (purgeable)
    - High performance, located in GPFS
  - HPSS is available from batch jobs via HSI, and interactively via FTP, PFTP, and HSI
  - There are quotas on space and inode usage
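A minimal sketch of the intended usage (file names are illustrative, and the hsi invocation shown is one common form):

    cd $SCRATCH                  # high-performance GPFS space, transient
    ./foo                        # parallel run producing output.dat
    hsi put output.dat           # archive the result to HPSS via HSI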
Using the Machine, part 5
- The future?
- The allowed scale of parallelism (CPU counts) may change
  - Max now 512 CPUs, the same as on the T3E
- The allowed duration of runs may change
  - Max now 4 hours; max on the T3E is 12 hours
- The size of possible problems will definitely change
  - More CPUs in Phase 1 than on the T3E
  - More memory per CPU, in both phases, than on the T3E
- The amount of work possible per unit time will definitely change
  - CPUs in both phases are faster than those on the T3E
  - The Phase 2 interconnect will be faster than Phase 1's
- Better machine management
  - Checkpointing will be available
  - We will learn what can be adjusted in the batch system
- There will be more and better tools for monitoring and tuning
  - HPM, KAP, Tau, PAPI...
- Some current problems will go away (e.g., memory-mapped files)