Evolution of the NERSC SP System - NERSC User Services (PowerPoint presentation transcript)

1
Evolution of the NERSC SP System
NERSC User Services
  • Original Plans
  • Phase 1
  • Phase 2
  • Programming Models and Code Porting
  • Using the System

2
Original Plans: The NERSC-3 Procurement
  • Complete, reliable, high-end scientific system
  • High availability and long MTBF (mean time between failures)
  • Fully configured - processing, storage, software,
    networking, support
  • Commercially available components
  • The greatest amount of computational power for
    the money
  • Can be integrated with existing computing
    environment
  • Can be evolved with product line
  • Extensive benchmarking and acceptance testing
    were performed

3
Original Plans: The NERSC-3 Procurement
  • What we wanted
  • >1 teraflop of peak performance
  • 10 terabytes of storage
  • 1 terabyte of memory
  • What we got in phase 1
  • 410 gigaflops of peak performance
  • 10 terabytes of storage
  • 512 gigabytes of memory
  • What we will get in phase 2
  • 3 teraflops of peak performance
  • 15 terabytes of storage
  • 1 terabyte of memory

4
Hardware, Phase 1
  • 304 Power 3 (Nighthawk 1) nodes
  • Node usage
  • 256 compute/batch nodes (512 CPUs)
  • 8 login nodes (16 CPUs)
  • 16 GPFS nodes (32 CPUs)
  • 8 network nodes (16 CPUs)
  • 16 service nodes (32 CPUs)
  • 2 processors/node
  • 200 MHz clock
  • 4 flops/clock (2 multiply-add ops): 800
    Mflops/CPU, 1.6 Gflops/node
  • 64 KB L1 d-cache per CPU @ 5 nsec: 3.2
    GB/sec
  • 4 MB L2 cache per CPU @ 45 nsec: 6.4
    GB/sec
  • 1 GB RAM per node @ 175 nsec:
    1.6 GB/sec
  • 150 MB/sec switch bandwidth
  • 9 GB local disk (two-way RAID)
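
These figures are consistent with the Phase 1 peak quoted on the previous slide:

    200 MHz x 4 flops/clock              = 800 Mflops/CPU
    800 Mflops/CPU x 2 CPUs/node         = 1.6 Gflops/node
    1.6 Gflops/node x 256 compute nodes  = 409.6 Gflops, i.e. the ~410 Gflops Phase 1 peak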

5
Hardware, Phase 2
  • 152 Power 3 (Winterhawk 2) nodes
  • Node usage
  • 128 compute/batch nodes (2048 CPUs)
  • 2 login nodes (32 CPUs)
  • 16 GPFS nodes (256 CPUs)
  • 2 network nodes (32 CPUs)
  • 4 service nodes (64 CPUs)
  • 16 processors/node
  • 375 MHz clock
  • 4 flops/clock (2 multiply-add ops): 1.5
    Gflops/CPU, 22.4 Gflops/node
  • 64 KB L1 d-cache per CPU @ 5 nsec: 3.2
    GB/sec
  • 8 MB L2 cache per CPU @ 45 nsec: 6.4
    GB/sec
  • 8 GB RAM per node @ 175 nsec: 14.0
    GB/sec
  • 2000 (?) MB/sec switch bandwidth
  • 9 GB local disk (two-way RAID)

6
Programming Models, Phase 1
  • Phase 1 will rely on MPI, with threading
    available via:
  • OpenMP directives
  • Pthreads
  • IBM SMP directives
  • MPI now does intra-node communications
    efficiently
  • Mixed-model programming not currently very
    advantageous
  • PVM and LAPI messaging systems are also available
  • SHMEM is planned
  • The SP has cache and virtual memory, which means
  • There are more ways to reduce code performance
  • There are more ways to lose portability

7
Programming Models, Phase 2
  • Phase 2 will offer more payback for mixed-model
    programming
  • Single node parallelism is a good target for PVP
    users
  • Vector and shared-memory codes can be expanded
    into MPI
  • MPI codes can be ported from the T3E
  • Threading can be added within MPI
  • In both cases, re-engineering will be required
    to exploit new and different levels of
    granularity
  • This can be done along with increasing problem
    sizes

8
Porting Considerations, part 1
  • Things to watch out for when porting codes to
    the SP:
  • Cache
  • Not enough on the T3E to make worrying about it
    worth the trouble
  • Enough on the SP to boost performance, if it's
    used well
  • Tuning for cache is different from tuning for
    vectorization
  • False sharing of cache lines can reduce
    performance
  • Virtual memory
  • Gives you access to 1.75 GB of (virtual) RAM
    address space
  • To use all of virtual (or even real) memory, must
    explicitly request segments
  • Causes performance degradation due to paging
  • Data types
  • Default sizes are different on PVP, T3E, and SP
    systems
  • integer, int, real, and float must be
    used carefully
  • Best to say what you mean: real*8, integer*4
  • Do the same in MPI calls: MPI_REAL8,
    MPI_INTEGER4 (see the sketch below)
  • Be careful with intrinsic function use, as well
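
A minimal sketch of the explicit-size advice above, with the datatype in the MPI call matched to the declaration (the array length, ranks, and tag are illustrative):

    program sizes
      include 'mpif.h'
      real*8    buf(100)
      integer*4 ierr, myrank, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      buf = 1.0d0
      ! real*8 data is described to MPI as MPI_REAL8, not the default-size MPI_REAL
      if (myrank == 0) then
        call MPI_SEND(buf, 100, MPI_REAL8, 1, 0, MPI_COMM_WORLD, ierr)
      else if (myrank == 1) then
        call MPI_RECV(buf, 100, MPI_REAL8, 0, 0, MPI_COMM_WORLD, status, ierr)
      end if
      call MPI_FINALIZE(ierr)
    end program sizes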

9
Porting Considerations, part 2
  • More things to watch out for when porting codes
    to the SP:
  • Arithmetic
  • Architecture tuning can help exploit special
    processor instructions
  • Both T3E and SP can optimize beyond IEEE
    arithmetic
  • T3E and PVP can also do fast reduced precision
    arithmetic
  • Compiler options on T3E and SP can force IEEE
    compliance
  • Compiler options can also throttle other
    optimizations for safety
  • Special libraries offer faster intrinsics
  • MPI
  • SP compilers and runtime will catch loose usage
    that was accepted on the T3E
  • Communication bandwidth on SP Phase 1 is lower
    than on the T3E
  • Message latency on the SP Phase 1 is higher than
    on the T3E
  • We expect approximate parity with T3E in these
    areas, on the Phase 2 system
  • Limited number of communication ports per node -
    approximately one per CPU
  • Default versus eager buffer management in
    MPI_SEND
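
The last point deserves care: two tasks that both call MPI_SEND before MPI_RECV may run while small messages are buffered eagerly, but can deadlock once messages are large enough to fall back to the default (rendezvous) protocol. A minimal sketch of a safe pairwise exchange using MPI_SENDRECV (message size and tag are illustrative; assumes an even number of tasks):

    program exchange
      include 'mpif.h'
      integer, parameter :: n = 100000
      real*8    sendbuf(n), recvbuf(n)
      integer*4 ierr, myrank, partner, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      sendbuf = dble(myrank)
      ! pair up neighboring ranks: 0<->1, 2<->3, ...
      if (mod(myrank, 2) == 0) then
        partner = myrank + 1
      else
        partner = myrank - 1
      end if
      ! MPI_SENDRECV completes regardless of how MPI_SEND buffers messages
      call MPI_SENDRECV(sendbuf, n, MPI_REAL8, partner, 0, &
                        recvbuf, n, MPI_REAL8, partner, 0, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_FINALIZE(ierr)
    end program exchange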

10
Porting Considerations, part 3
  • Compiling and linking
  • The compiler command depends on the language and
    parallelization scheme
  • Language versions:
  • Fortran 77: f77, xlf
  • Fortran 90: xlf90
  • Fortran 95: xlf95
  • C: cc, xlc, c89
  • C++: xlC
  • MPI-included: mpxlf, mpxlf90, mpcc, mpCC
  • Thread-safe: xlf_r, xlf90_r, xlf95_r, mpxlf_r,
    mpxlf90_r
  • Preprocessing can be ordered by compiler flag or
    source file suffix
  • Use consistently for all related compilations;
    the following may NOT produce a parallel
    executable (a consistent pair is sketched after
    this list):
  • mpxlf90 -c *.F
  • xlf90 -o foo *.o
  • Use the -bmaxdata:bytes linker option to get more
    than a single 256 MB data segment (up to 7
    segments, or 1.75 GB, can be specified; only 3,
    or 0.75 GB, are real)
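
A consistent pair that links in the MPI layer at both steps (assuming the same source files as above; the value shown for -bmaxdata requests the full 7 segments):

    mpxlf90 -c *.F
    mpxlf90 -o foo *.o -bmaxdata:0x70000000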

11
Porting MPI
  • MPI codes should port relatively well
  • Use one MPI task per node or processor
  • One per node during porting
  • One per processor during production
  • Let MPI worry about where it's communicating to
  • Environment variables, execution parameters,
    and/or batch options can specify (see the sketch
    after this list):
  • Tasks per node
  • Total tasks
  • Total processors
  • Total nodes
  • Communications subsystem in use
  • User Space is best in batch jobs
  • IP may be best for interactive developmental runs
  • There is a debug queue/class in batch
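
As an illustration, the job shape and communication subsystem can be chosen through IBM POE environment variables before launching the executable (ksh syntax; the values are illustrative and the variable names should be checked against local documentation):

    export MP_NODES=4             # total nodes
    export MP_TASKS_PER_NODE=2    # MPI tasks per node
    export MP_EUILIB=us           # User Space; use ip for IP
    poe ./a.out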

12
Porting Shared Memory
  • Don't throw away old shared-memory directives
  • OpenMP will work as-is
  • Cray tasking directives will be useful for
    documentation
  • We recommend porting Cray directives to OpenMP
    (see the sketch below)
  • Even small-scale parallelism can be useful
  • Larger scale parallelism will be available next
    year
  • If your problems and/or algorithms will scale to
    larger granularities and greater parallelism,
    prepare for message passing
  • We recommend MPI
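
As a rough sketch of that porting step (the Cray autotasking form shown is assumed; check the actual directives in your code), a sliced loop maps directly onto OpenMP:

    ! Assumed Cray autotasking directive:
    !   CMIC$ DO ALL SHARED(A, n) PRIVATE(i)
    ! OpenMP equivalent:
    !$OMP PARALLEL DO SHARED(A, n) PRIVATE(i)
    do i = 1, n
      A(i) = 2.0d0 * A(i)
    enddo
    !$OMP END PARALLEL DO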

13
From Loop-slicing to MPI, before...
  • allocate(A(1:imax,1:jmax))
  • !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax,
    jmax)
  • do I = 1, imax
  • do J = 1, jmax
  • A(I,J) = deep_thought(A, I, J, ...)
  • enddo
  • enddo
  • Sanity checking
  • Run the program on one CPU to get baseline
    answers
  • Run on several CPUs to see parallel speedups and
    answers
  • Optimization
  • Consider changing memory access patterns to
    improve cache usage
  • How big can your problem get before you run out
    of real memory?
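
One concrete instance of the access-pattern advice above: Fortran arrays are column-major, so making I the innermost loop walks A contiguously and uses each fetched cache line fully (same hypothetical deep_thought kernel as above, trailing arguments omitted):

    !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
    do J = 1, jmax
      do I = 1, imax
        A(I,J) = deep_thought(A, I, J)
      enddo
    enddo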

14
From Loop-slicing to MPI, after...
  • call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
  • call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  • call my_indices(my_id, nprocs, my_imin, my_imax,
    my_jmin, my_jmax)
  • allocate(A(my_imin:my_imax, my_jmin:my_jmax))
  • !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A,
    my_imin, my_imax, my_jmin, my_jmax)
  • do I = my_imin, my_imax
  • do J = my_jmin, my_jmax
  • A(I,J) = deep_thought(A, I, J, ...)
  • enddo
  • enddo
  • ! Communicate the shared values with neighbors
  • if(odd(my_ID)) then
  • call MPI_SEND(my_left(...), leftsize,
    MPI_REAL, tag, MPI_COMM_WORLD, ierr)
  • call MPI_RECV(my_right(...), rightsize,
    MPI_REAL, tag, MPI_COMM_WORLD, ierr)
  • call MPI_SEND(my_top(...), leftsize,
    MPI_REAL, tag, MPI_COMM_WORLD, ierr)
  • call MPI_RECV(my_bottom(...), rightsize,
    MPI_REAL, tag, MPI_COMM_WORLD, ierr)
  • else
  • call MPI_RECV(my_right(...), rightsize,
    MPI_REAL, tag, MPI_COMM_WORLD, ierr)
  • call MPI_SEND(my_left(...), leftsize,
    MPI_REAL, tag, MPI_COMM_WORLD, ierr)
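
The send/receive calls above are schematic: the destination/source rank and, for MPI_RECV, the status argument are omitted. A minimal sketch of the left/right part of the exchange with full argument lists, keeping the slide's odd() test (left_pe, right_pe, tag, and the edge buffers are hypothetical names):

    real      my_left(leftsize), my_right(rightsize)
    integer   left_pe, right_pe, tag, ierr, status(MPI_STATUS_SIZE)
    if (odd(my_id)) then
      ! odd tasks send first, then receive...
      call MPI_SEND(my_left,  leftsize,  MPI_REAL, left_pe,  tag, MPI_COMM_WORLD, ierr)
      call MPI_RECV(my_right, rightsize, MPI_REAL, right_pe, tag, MPI_COMM_WORLD, status, ierr)
    else
      ! ...while even tasks receive first, then send, so every send meets a waiting receive
      call MPI_RECV(my_right, rightsize, MPI_REAL, right_pe, tag, MPI_COMM_WORLD, status, ierr)
      call MPI_SEND(my_left,  leftsize,  MPI_REAL, left_pe,  tag, MPI_COMM_WORLD, ierr)
    end if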

15
From Loop-slicing to MPI, after...
  • You now have one MPI task and many OpenMP threads
    per node
  • The MPI task does all the communicating between
    nodes
  • The OpenMP threads do the parallelizable work
  • Do NOT use MPI within an OpenMP parallel region
  • Sanity checking
  • Run on one node and one CPU to check baseline
    answers
  • Run on one node and several CPUs to see parallel
    speedup and answers
  • Run on several nodes, one CPU per node, and check
    answers
  • Run on several nodes, several CPUs per node, and
    check answers
  • Scaling checking
  • Run a larger version of a similar problem on the
    same set of ensemble sizes
  • Run the same sized problem on a larger ensemble
  • (Re-)Consider your I/O strategy

16
From MPI to Loop-slicing
  • Add OpenMP directives to existing code
  • Perform sanity and scaling checks, as before
  • Results in the same overall code structure as on
    the previous slides
  • One MPI task and several OpenMP threads per node
  • For irregular codes, Pthreads may serve better,
    at the cost of increased complexity
  • Nobody really expects it to be this easy...

17
Using the Machine, part 1
  • Somewhat similar to the Crays
  • Interactive and batch jobs are possible

18
Using the Machine, part 2
  • Interactive runs
  • Sequential executions run immediately on your
    login node
  • Every login will likely put you on a different
    node, so be careful about looking for your
    executions - ps returns info about only the
    node you're logged into
  • Small-scale parallel jobs may be rejected if
    LoadLeveler can't find the resources
  • There are two pools of nodes that can be used for
    interactive jobs
  • Login nodes
  • A small subset of the compute nodes
  • Parallel execution can often be achieved by
  • Trying again, after initial rejection
  • Changing communication mechanisms from User Space
    to IP
  • Using the other pool

19
Using the Machine, part 3
  • Batch jobs
  • Currently, very similar in capability to the T3E
  • Similar run times, processor counts
  • More memory available on the SP
  • Limits and capabilities may change, as we learn
    the machine
  • LoadLeveler is similar to, but simpler than,
    NQE/NQS on the T3E
  • Jobs are submitted, monitored, and cancelled by
    special commands
  • Each batch job requires a script that is
    essentially a shell script (a sketch follows
    this list)
  • The first few lines contain batch options that
    look like comments to the shell
  • The rest of the script can contain any shell
    constructs
  • Scripts can be debugged by executing them
    interactively
  • Users are limited to 3 running jobs, 10 queued
    jobs, and 30 submitted jobs, at any given time
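
A sketch of such a script (the class name, node counts, time limit, and executable are illustrative; check the keyword names against the local LoadLeveler configuration):

    #!/bin/ksh
    #@ job_type         = parallel
    #@ class            = regular
    #@ node             = 4
    #@ tasks_per_node   = 2
    #@ wall_clock_limit = 04:00:00
    #@ output           = myjob.out
    #@ error            = myjob.err
    #@ queue
    # everything after the directives is an ordinary shell script
    cd $SCRATCH
    ./a.out

The script can be debugged by running it interactively first, then handed to LoadLeveler with llsubmit.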

20
Using the Machine, part 4
  • File systems
  • Use the environment variables to let the system
    manage your file usage
  • Sequential work can be done in $HOME (not backed
    up) or $TMPDIR (transient)
  • Medium performance, node-local
  • Parallel work can be done in $SCRATCH (transient)
    or /scratch/username (purgeable)
  • High performance, located in GPFS
  • HPSS is available from batch jobs via HSI, and
    interactively via FTP, PFTP, and HSI (see the
    example below)
  • There are quotas on space and inode usage
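
For example, a batch job might archive its results to HPSS with HSI (file name illustrative):

    hsi put results.tar
    hsi get results.tar

put stores the file under your HPSS home directory; get retrieves it later.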

21
Using the Machine, part 5
  • The future?
  • The allowed scale of parallelism (CPU counts) may
    change
  • Max is now 512 CPUs, the same as on the T3E
  • The allowed duration of runs may change
  • Max is now 4 hours; max on the T3E is 12 hours
  • The size of possible problems will definitely
    change
  • More CPUs in Phase 1 than on the T3E
  • More memory per CPU, in both phases, than on the
    T3E
  • The amount of work possible per unit time will
    definitely change
  • CPUs in both phases are faster than those on the
    T3E
  • Phase 2 interconnect will be faster than on Phase
    1
  • Better machine management
  • Checkpointing will be available
  • We will learn what can be adjusted in the batch
    system
  • There will be more and better tools for
    monitoring and tuning
  • HPM, KAP, Tau, PAPI...
  • Some current problems will go away (e.g. memory
    mapped files)