Title: Evolution of the NERSC SP System (NERSC User Services)
Evolution of the NERSC SP System
NERSC User Services
- Original Plans
- Phase 1
- Phase 2
- Programming Models and Code Porting
- Using the System
Original Plans: The NERSC-3 Procurement
- Complete, reliable, high-end scientific system
- High availability and long MTBF
- Fully configured: processing, storage, software, networking, support
- Commercially available components
- The greatest amount of computational power for the money
- Can be integrated with the existing computing environment
- Can be evolved with the product line
- Much careful benchmarking and acceptance testing done
Original Plans: The NERSC-3 Procurement
- What we wanted
- > 1 teraflop of peak performance
- 10 terabytes of storage
- 1 terabyte of memory
- What we got in phase 1
- 410 gigaflops of peak performance
- 10 terabytes of storage
- 512 gigabytes of memory
- What we will get in phase 2
- 3 teraflops of peak performance
- 15 terabytes of storage
- 1 terabyte of memory
Hardware, Phase 1
- 304 Power 3 nodes (Nighthawk 1)
- Node usage
  - 256 compute/batch nodes (512 CPUs)
  - 8 login nodes (16 CPUs)
  - 16 GPFS nodes (32 CPUs)
  - 8 network nodes (16 CPUs)
  - 16 service nodes (32 CPUs)
- 2 processors/node
- 200 MHz clock
- 4 flops/clock (2 multiply-add ops) = 800 Mflops/CPU, 1.6 Gflops/node
- 64 KB L1 data cache per CPU @ 5 nsec, 3.2 GB/sec
- 4 MB L2 cache per CPU @ 45 nsec, 6.4 GB/sec
- 1 GB RAM per node @ 175 nsec, 1.6 GB/sec
- 150 MB/sec switch bandwidth
- 9 GB local disk (two-way RAID)
Hardware, Phase 2
- 152 Power 3 nodes (Winterhawk 2)
- Node usage
  - 128 compute/batch nodes (2048 CPUs)
  - 2 login nodes (32 CPUs)
  - 16 GPFS nodes (256 CPUs)
  - 2 network nodes (32 CPUs)
  - 4 service nodes (64 CPUs)
- 16 processors/node
- 375 MHz clock
- 4 flops/clock (2 multiply-add ops) = 1.5 Gflops/CPU, 24 Gflops/node
- 64 KB L1 data cache per CPU @ 5 nsec, 3.2 GB/sec
- 8 MB L2 cache per CPU @ 45 nsec, 6.4 GB/sec
- 8 GB RAM per node @ 175 nsec, 14.0 GB/sec
- 2000 (?) MB/sec switch bandwidth
- 9 GB local disk (two-way RAID)
Programming Models, Phase 1
- Phase 1 will rely on MPI, with threading available via
  - OpenMP directives
  - Pthreads
  - IBM SMP directives
- MPI now does intra-node communications efficiently
- Mixed-model programming is not currently very advantageous
- PVM and LAPI messaging systems are also available
- SHMEM is planned
- The SP has cache and virtual memory, which means
  - There are more ways to reduce code performance
  - There are more ways to lose portability
Programming Models, Phase 2
- Phase 2 will offer more payback for mixed-model programming
- Single-node parallelism is a good target for PVP users
  - Vector and shared-memory codes can be expanded into MPI
- MPI codes can be ported from the T3E
  - Threading can be added within MPI
- In both cases, re-engineering will be required to exploit new and different levels of granularity
  - This can be done along with increasing problem sizes
Porting Considerations, part 1
- Things to watch out for in porting codes to the SP:
- Cache
  - Not enough on the T3E to make worrying about it worth the trouble
  - Enough on the SP to boost performance, if it is used well
  - Tuning for cache is different from tuning for vectorization
  - False sharing caused by cache can reduce performance
- Virtual memory
  - Gives you access to 1.75 GB of (virtual) RAM address space
  - To use all of virtual (or even real) memory, you must explicitly request segments
  - Causes performance degradation due to paging
- Data types
  - Default sizes are different on PVP, T3E, and SP systems
  - integer, int, real, and float must be used carefully
  - Best to say what you mean: real*8, integer*4
  - Do the same in MPI calls: MPI_REAL8, MPI_INTEGER4 (see the sketch below)
  - Be careful with intrinsic function use, as well
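A minimal sketch of the explicit-size style recommended above, with illustrative names (n, buf) and a hypothetical 100-element broadcast; the point is that a real*8 buffer travels as MPI_REAL8, never as a default REAL whose size differs across PVP, T3E, and SP systems:

      program kinds
      implicit none
      include 'mpif.h'
      integer*4 :: n, ierr
      real*8    :: buf(100)          ! say what you mean: real*8, not real
      call MPI_INIT(ierr)
      buf = 0.0d0
      n = 100
      ! The MPI datatype must match the declared size exactly
      call MPI_BCAST(buf, n, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end program kinds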
Porting Considerations, part 2
- More things to watch out for in porting codes to the SP:
- Arithmetic
  - Architecture tuning can help exploit special processor instructions
  - Both T3E and SP can optimize beyond IEEE arithmetic
  - T3E and PVP can also do fast reduced-precision arithmetic
  - Compiler options on the T3E and SP can force IEEE compliance
  - Compiler options can also throttle other optimizations for safety
  - Special libraries offer faster intrinsics
- MPI
  - SP compilers and runtime will catch loose usage that was accepted on the T3E
  - Communication bandwidth on the SP Phase 1 is lower than on the T3E
  - Message latency on the SP Phase 1 is higher than on the T3E
  - We expect approximate parity with the T3E in these areas on the Phase 2 system
  - Limited number of communication ports per node (approximately one per CPU)
  - Default versus eager buffer management in MPI_SEND (see the sketch below)
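A minimal sketch of the MPI_SEND buffering pitfall, assuming exactly two tasks (the names out, in, and the 1000-element messages are illustrative). An exchange in which both ranks send first completes only while messages are small enough to be eagerly buffered; once the library falls back to rendezvous, both ranks block in MPI_SEND. MPI_SENDRECV is one portable fix:

      program eager
      implicit none
      include 'mpif.h'
      integer :: me, other, ierr, stat(MPI_STATUS_SIZE)
      real*8  :: out(1000), in(1000)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
      other = 1 - me                   ! neighbor rank, assuming two tasks
      out = dble(me)
      ! Unsafe: both ranks would sit in MPI_SEND under rendezvous protocol
      !   call MPI_SEND(out, 1000, MPI_REAL8, other, 0, MPI_COMM_WORLD, ierr)
      !   call MPI_RECV(in, 1000, MPI_REAL8, other, 0, MPI_COMM_WORLD, stat, ierr)
      ! Safe: the combined call pairs the send and receive internally
      call MPI_SENDRECV(out, 1000, MPI_REAL8, other, 0, &
                        in, 1000, MPI_REAL8, other, 0,  &
                        MPI_COMM_WORLD, stat, ierr)
      call MPI_FINALIZE(ierr)
      end program eager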
Porting Considerations, part 3
- Compiling and linking
- The compiler to use depends on language and parallelization scheme
- Language versions
  - Fortran 77: f77, xlf
  - Fortran 90: xlf90
  - Fortran 95: xlf95
  - C: cc, xlc, c89
  - C++: xlC
  - MPI-included: mpxlf, mpxlf90, mpcc, mpCC
  - Thread-safe: xlf_r, xlf90_r, xlf95_r, mpxlf_r, mpxlf90_r
- Preprocessing can be ordered by compiler flag or source file suffix
- Use the same driver consistently for all related compilations; the following may NOT produce a parallel executable:
  - mpxlf90 -c *.F
  - xlf90 -o foo *.o
- Use the -bmaxdata:bytes option to get more than a single 256 MB data segment (up to 7 segments, or 1.75 GB, can be specified; only 3, or 0.75 GB, are real)
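For contrast, a minimal sketch of a consistent build (foo.F and the segment count are illustrative); linking with the same MPI driver pulls in the MPI library and startup code, and -bmaxdata requests three 256 MB data segments:

    mpxlf90 -c foo.F                            # preprocess (.F suffix) and compile
    mpxlf90 -o foo foo.o -bmaxdata:0x30000000   # link; 3 x 256 MB = 0.75 GB of data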
Porting MPI
- MPI codes should port relatively well
- Use one MPI task per node or per processor
  - One per node during porting
  - One per processor during production
- Let MPI worry about where it is communicating to
- Environment variables, execution parameters, and/or batch options can specify (see the sketch below):
  - Tasks per node
  - Total tasks
  - Total processors
  - Total nodes
  - Communications subsystem in use
    - User Space is best in batch jobs
    - IP may be best for interactive developmental runs
- There is a debug queue/class in batch
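A minimal sketch using IBM POE environment variables (the executable name ./foo is illustrative, and the exact variable set honored at a given site may differ):

    export MP_NODES=2               # total nodes
    export MP_TASKS_PER_NODE=2      # MPI tasks per node
    export MP_PROCS=4               # total MPI tasks
    export MP_EUILIB=ip             # IP for interactive runs; "us" (User Space) in batch
    ./foo

The same choices can be given as execution parameters (e.g., poe ./foo -procs 4) or as batch options.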
Porting Shared Memory
- Don't throw away old shared-memory directives
  - OpenMP will work as is
  - Cray tasking directives will be useful for documentation
  - We recommend porting Cray directives to OpenMP (see the sketch below)
- Even small-scale parallelism can be useful
  - Larger-scale parallelism will be available next year
- If your problems and/or algorithms will scale to larger granularities and greater parallelism, prepare for message passing
  - We recommend MPI
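A minimal sketch of such a port; the Cray autotasking spelling (CMIC$ DO ALL) is shown from memory, and the loop body is illustrative:

    ! Cray autotasking version, kept as documentation:
    !     CMIC$ DO ALL SHARED(a, n) PRIVATE(i)
    ! OpenMP version of the same loop:
    !$OMP PARALLEL DO SHARED(a, n) PRIVATE(i)
          do i = 1, n
             a(i) = 2.0 * a(i)
          enddo
    !$OMP END PARALLEL DO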
From Loop-slicing to MPI, before...
      allocate(A(1:imax, 1:jmax))
      !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
      do I = 1, imax
        do J = 1, jmax
          A(I,J) = deep_thought(A, I, J, ...)
        enddo
      enddo
- Sanity checking
  - Run the program on one CPU to get baseline answers
  - Run on several CPUs to see parallel speedups and answers
- Optimization
  - Consider changing memory access patterns to improve cache usage (see the sketch below)
  - How big can your problem get before you run out of real memory?
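A minimal sketch of the memory-access point, reusing the slide's deep_thought placeholder: Fortran arrays are column-major, so the first index should vary fastest. The loop nest above strides through memory with J innermost; interchanging the loops makes the inner loop touch contiguous elements of A:

      !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
      do J = 1, jmax
        do I = 1, imax
          A(I,J) = deep_thought(A, I, J, ...)
        enddo
      enddo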
From Loop-slicing to MPI, after...
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call my_indices(my_id, nprocs, my_imin, my_imax, my_jmin, my_jmax)
      allocate(A(my_imin:my_imax, my_jmin:my_jmax))
      !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, my_imin, my_imax, my_jmin, my_jmax)
      do I = my_imin, my_imax
        do J = my_jmin, my_jmax
          A(I,J) = deep_thought(A, I, J, ...)
        enddo
      enddo
      ! Communicate the shared values with neighbors.
      ! (The neighbor ranks left_id, right_id, top_id, bottom_id and the
      !  status array were omitted on the original slide; they are added
      !  here to make the MPI argument lists complete.)
      if (odd(my_id)) then
        call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_id, tag, MPI_COMM_WORLD, ierr)
        call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_id, tag, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(my_top(...), topsize, MPI_REAL, top_id, tag, MPI_COMM_WORLD, ierr)
        call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
      else
        call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_id, tag, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_id, tag, MPI_COMM_WORLD, ierr)
        ! ... (the matching top/bottom exchange continues as above)
      endif
From Loop-slicing to MPI, after...
- You now have one MPI task and many OpenMP threads per node
  - The MPI task does all the communicating between nodes
  - The OpenMP threads do the parallelizable work
  - Do NOT use MPI within an OpenMP parallel region
- Sanity checking
  - Run on one node and one CPU to check baseline answers
  - Run on one node and several CPUs to see parallel speedup and answers
  - Run on several nodes, one CPU per node, and check answers
  - Run on several nodes, several CPUs per node, and check answers
- Scaling checking
  - Run a larger version of a similar problem on the same set of ensemble sizes
  - Run the same-sized problem on a larger ensemble
- (Re-)Consider your I/O strategy
From MPI to Loop-slicing
- Add OpenMP directives to existing MPI code (see the sketch below)
- Perform sanity and scaling checks, as before
- Results in the same overall code structure as on the previous slides
  - One MPI task and several OpenMP threads per node
- For irregular codes, Pthreads may serve better, at the cost of increased complexity
- Nobody really expects it to be this easy...
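A minimal sketch of this direction (B, nlocal, and work are hypothetical names): the OpenMP directive is dropped onto the compute loop that each MPI task already owns, and the MPI calls stay outside the parallel region:

      !$OMP PARALLEL DO PRIVATE(i), SHARED(B, nlocal)
      do i = 1, nlocal              ! each MPI task loops over its own block
        B(i) = work(B, i)           ! work() stands in for the real kernel
      enddo
      !$OMP END PARALLEL DO
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)   ! communication outside the region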
Using the Machine, part 1
- Somewhat similar to the Crays
- Interactive and batch jobs are possible
Using the Machine, part 2
- Interactive runs
  - Sequential executions run immediately on your login node
  - Every login will likely put you on a different node, so be careful about looking for your executions: ps returns info about only the node you're logged into
  - Small-scale parallel jobs may be rejected if LoadLeveler can't find the resources
- There are two pools of nodes that can be used for interactive jobs
  - Login nodes
  - A small subset of the compute nodes
- Parallel execution can often be achieved by
  - Trying again, after initial rejection
  - Changing communication mechanisms from User Space to IP
  - Using the other pool
Using the Machine, part 3
- Batch jobs
  - Currently, very similar in capability to the T3E
    - Similar run times, processor counts
    - More memory available on the SP
  - Limits and capabilities may change, as we learn the machine
- LoadLeveler is similar to, but simpler than, NQE/NQS on the T3E
  - Jobs are submitted, monitored, and cancelled by special commands
- Each batch job requires a script that is essentially a shell script (see the sketch below)
  - The first few lines contain batch options that look like comments to the shell
  - The rest of the script can contain any shell constructs
  - Scripts can be debugged by executing them interactively
- Users are limited to 3 running jobs, 10 queued jobs, and 30 submitted jobs, at any given time
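A minimal sketch of such a script; the class name, node counts, limits, and file names are illustrative rather than site defaults. The leading "# @" lines are LoadLeveler options, invisible to the shell:

    #!/usr/bin/ksh
    # @ job_type         = parallel
    # @ class            = regular
    # @ node             = 4
    # @ tasks_per_node   = 2
    # @ wall_clock_limit = 04:00:00
    # @ output           = myjob.$(jobid).out
    # @ error            = myjob.$(jobid).err
    # @ queue
    ./foo                 # the parallel executable (illustrative)

Submit with llsubmit, monitor with llq, and cancel with llcancel.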
Using the Machine, part 4
- File systems
  - Use the environment variables to let the system manage your file usage
  - Sequential work can be done in $HOME (not backed up) or $TMPDIR (transient)
    - Medium performance, node-local
  - Parallel work can be done in $SCRATCH (transient) or /scratch/username (purgeable)
    - High performance, located in GPFS
  - HPSS is available from batch jobs via HSI, and interactively via FTP, PFTP, and HSI
  - There are quotas on space and inode usage
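A minimal sketch of the intended usage (file names are illustrative, and the hsi invocation shown is one common form):

    cd $SCRATCH                  # high-performance GPFS space, transient
    ./foo                        # parallel run producing output.dat
    hsi put output.dat           # archive the result to HPSS via HSI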
Using the Machine, part 5
- The future?
- The allowed scale of parallelism (CPU counts) may change
  - Max now 512 CPUs, the same as on the T3E
- The allowed duration of runs may change
  - Max now 4 hours; max on the T3E is 12 hours
- The size of possible problems will definitely change
  - More CPUs in Phase 1 than on the T3E
  - More memory per CPU, in both phases, than on the T3E
- The amount of work possible per unit time will definitely change
  - CPUs in both phases are faster than those on the T3E
  - The Phase 2 interconnect will be faster than Phase 1's
- Better machine management
  - Checkpointing will be available
  - We will learn what can be adjusted in the batch system
- There will be more and better tools for monitoring and tuning
  - HPM, KAP, Tau, PAPI...
- Some current problems will go away (e.g., memory-mapped files)