1
Getting Started on Bluefire
  • CISL High Performance Systems Section
  • August 14, 2008

2
Agenda
  • System Architecture Overview
  • Security Model and File Transfer
  • User environment and porting
  • Compilers, tools, libraries
  • Optimization, compile and run time
  • Performance - What to expect
  • Question and answer

3
POWER6 Architecture
  • Ultra-high frequency dual-core chip (3.5 to 5
    GHz)
  • 7-way superscalar, 2-way SMT core
  • Up to 5 instructions for one thread, up to 2 for
    the other thread in the same cycle
  • Multiple execution units (nine)
  • 2 LSU (load/store)
  • 2 FPU (binary floating point)
  • 2 FXU (fixed point)
  • 1 branch
  • 1 DFU (decimal floating point)
  • 1 VMX (SIMD)
  • 790M transistors, 341 mm2 die
  • Up to 64-core SMP systems
  • 2 x 4 MB on-chip L2 - point of coherency
  • On-chip L3 directory and controller
  • Two memory controllers on-chip
  • Technology
  • CMOS 65 nm lithography, SOI Cu
  • High-speed elastic bus interface at 2:1 frequency

Note: Sourced from IBM documentation
4
Configuration
  • 127 32-Way 4.7 GHz nodes
  • 4,064 POWER6 processors
  • SMT enabled (64 SMT threads per node)
  • 76.4 TFLOPS
  • 117 Compute nodes (70.4 TFLOPS peak)
  • 3,744 POWER6 processors (32 per node)
  • 69 compute nodes have 64 GB memory
  • 48 compute nodes have 128 GB memory
  • 2 Interactive/Login nodes
  • 256 GB memory
  • 2 Share-queue nodes
  • 256 GB memory
  • 4 GPFS/VSD nodes
  • 2 Service nodes (to be reclaimed)
  • Infiniband Switch
  • 4X Infiniband DDR links capable of 2 GB/sec with
    3.4 microsecond latency
  • 8 links per node
  • 150 TeraBytes of usable file system space

5
Peripherals
  • 6 HMCs (Hardware Management Consoles)
  • Manages virtualization on the nodes
  • Manages system configuration
  • 2 IB subnet managers
  • 4 Infiniband Switches
  • 1 Force10 Network Switch
  • 1 Management Server
  • 2 Login/Authentication Proxy Servers
  • 1 Storage Manager
  • 4 DS4800 Disk Controllers
  • 4 SAN Switches
  • 768 x 300 GB RAID disks

6
Memory
  • 2 to 256 GB buffered DDR2 memory
  • 16 - 128 GB/sec bandwidth
  • On-Chip L1 Cache
  • 128 KB per processor
  • 64 KB data
  • 64 KB instruction
  • On-Chip L2 Cache
  • 4 MB per processor
  • Off-Chip L3 Cache
  • 32 MB shared by two processors
  • Connected to the chip via an 80 GB/sec bus
  • Multiple Page Size Support
  • Supports multiple virtual memory page sizes
  • Supports 4 page sizes: 4 KB, 64 KB, 16 MB, and 16 GB
  • Supports using 64KB pages in segments with base
    page size 4KB

Note: Sourced from IBM documentation
7
Infiniband
  • Infiniband Galaxy-2 Adapter
  • IB DDR 4X Bandwidth
  • 2 GB/s per link
  • Latency 3.4 microsecs
  • 8 Links per node on Bluefire
  • Local Identifiers (LIDs)
  • similar to IP addresses
  • I/O - Infiniband Stack
  • Low Latency
  • CPU offload
  • Recovery at multiple levels
  • 4X link can fail down to 1X link
  • GPFS and VSD use Infiniband on Bluefire

Note: Sourced from IBM documentation
8
User Access Authentication
[Diagram: external user logins and data reach the Bluefire login nodes
through host firewall/IP filters, authenticated with OTP or SSH keys;
internal NCAR user logins and data travel over the Bluefire network.]
9
Security Data Movement
  • Single OTP Challenge only.
  • To log in, use: ssh -X bluefire.ucar.edu
  • Shared high-performance file system between
    Bluefire, data analysis and visualization servers
  • File transfer can be automated to /ptmp
  • User can specify file transfer destination
    location in /ptmp
  • File transfer file size limited by user quota
    only
  • Users on UCAR network can transfer data to any
    bluefire file system using OTP
  • SCD portal is online

10
File Systems
Name          Size (TB)      Quota             Scheduled Backup
/home         4.4 / 17.6     5 GB / 20 GB      Yes
/ccsm         4.4            Div Q
/cgd          2.2            Div Q
/hao          1.1            Div Q
/m3/mmmtmp    1.1            Div Q
/ptmp         65.3 / 109     250 GB / 400 GB
/rap          2.2            Div Q
/ncar         1.1            SSG
/m3/users     1.1            Div Q
/waccm        2.2            Div Q
/acd          1.1            Div Q
TOTAL         143.1
11
/fis File Systems
The /fis functionality has been kept. The current
hardware and file system services were moved from
blueice.
File system names: /contrib, /fis/cgd,
/fis/cgd/home, /fis/scd/home, /fis/other/home,
/fis/hao/home, /fis/hao/tgcm,
/fis/m3/projects/mesouser, /fis/m3/projects/wrfhelp
12
Software Stack (compiler, etc)
  • Operating System: AIX 5.3
  • Batch system: Load Sharing Facility (LSF 7.02)
  • Compilers: XL Fortran V11, XL C/C++ V9
  • (Note: the compilers produce 64-bit objects by
    default. To produce 32-bit objects, set the
    environment variable OBJECT_MODE to 32; see the
    example below.)
  • Utilities: pmrinfo, spinfo, batchview, etc.
  • Please refer to /bin and /usr/local/bin on
    bluefire for a more complete list of user
    utilities.
  • Debugger: TotalView (available soon)
  • File System: General Parallel File System (GPFS)
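For example, a minimal ksh sketch of switching object modes (the source
and program names here are hypothetical):
  # 64-bit compile (the default on bluefire)
  xlf90_r -o model64 model.f90
  # 32-bit compile: set OBJECT_MODE before invoking the compiler
  export OBJECT_MODE=32
  xlf90_r -o model32 model.f90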

13
Note: Sourced from IBM documentation
14
Note: Sourced from IBM documentation
15
(No Transcript)
16

17
POWER6 Tour
CISL has planned a POWER6 tour of 15-20 minutes
in groups of 4-6 users, daily (business days) at
9:00 AM and 9:30 AM, from August 25, 2008 until
September 05, 2008. If you are interested in a
personal tour to see and discuss the POWER6
architecture, then send e-mail to juliana@ucar.edu
or irfan@ucar.edu. Please include the following
information in your e-mail: your name, your phone
number, preferred tour date, and preferred tour time.
18
Transferring Files
  • Initiated from bluefire, works as previously:
    scp file remote@meteo.edu:file
  • May now initiate from a remote machine using ssh
    keys that you install on bluefire (see the sketch
    below)
  • scp /localdir/file loginid@bfft.ucar.edu:/ptmp/loginid/
  • Files must go to/from the /ptmp/loginid tree
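For example, a minimal sketch of the remote-initiated path, assuming
OpenSSH on the remote machine (loginid and file paths are placeholders):
  # On the remote machine: generate a key pair once
  ssh-keygen -t rsa
  # Install the public key (~/.ssh/id_rsa.pub) on bluefire following the
  # CISL bluefire documentation; transfers can then start remotely:
  scp /localdir/file loginid@bfft.ucar.edu:/ptmp/loginid/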

19
Good Practices
  • Use the share queue for batch file transfer,
    including msrcp to the MSS
  • Mixing LSF and MSS usage
  • Multistep applications
  • See /usr/local/examples/lsf/multistep (a sketch
    follows below)
  • Synchronous reads/writes
  • Step 1 - read data from the MSS (share queue)
  • Step 2 - run the model (regular queue)
  • Step 3 - write data to the MSS (share queue)
  • Saves GAUs by reducing processor count
  • Pre- and/or post-processing can follow the same
    outline as this example
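A minimal sketch of such a three-step chain using LSF job dependencies;
the queue names follow this slide, but the job names, file names, and MSS
paths are hypothetical (see /usr/local/examples/lsf/multistep for the
supported example):
  # Step 1: read input from the MSS in the share queue
  bsub -q share -J getdata -n 1 "msrcp mss:/MYUSER/input.nc /ptmp/$USER/input.nc"
  # Step 2: run the model in the regular queue once step 1 completes
  bsub -q regular -J model -n 32 -w 'done(getdata)' \
       "mpirun.lsf /usr/local/bin/launch ./model.exe"
  # Step 3: write results back to the MSS once the model finishes
  bsub -q share -J putdata -n 1 -w 'done(model)' \
       "msrcp /ptmp/$USER/output.nc mss:/MYUSER/output.nc"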

20
User Environment and Porting
  • Korn shell is the default shell. To change it,
    log in to bems.ucar.edu from bluefire
  • Blueice users' dotfiles have been transferred
  • Blueice files have been transferred
  • Quotas: 5 GB in /home (soon to be 20 GB), 400 GB
    in /ptmp

21
Queues and Charging
  • Queues for regular memory (64 GB) and large
    memory (128 GB) nodes
  • 69 regular-memory nodes and 48 large-memory nodes
  • Charges began on July 1
  • Machine charge factor is 1.4

22
Queues and Charging, cont.
  Queue Name          Queue Charge Factor   Run Limit
  capability          1                     12 hours
  debug, dedicated    1                     6 hours
  economy             0.5                   6 hours
  hold                0.33                  6 hours
  premium             1.5                   6 hours
  regular, special    1                     6 hours
  share (2 nodes)     1                     12 hours
  standby             0.1                   6 hours
  lrg_capability      1                     12 hours
  lrg_economy         0.5                   6 hours
  lrg_hold            0.33                  6 hours
  lrg_premium         1.5                   6 hours
  lrg_regular         1                     6 hours
  lrg_standby         0.1                   6 hours

23
Libraries
  • ESSL and PESSL: IBM Engineering and Scientific
    Subroutine Library (linking sketch below)
  • high-performance, general-purpose math libraries
  • linear algebra, eigensystems, random numbers,
    quadrature, sorting, etc.
  • ESSL supports sequential and SMP versions; PESSL
    is parallel, MPI-compatible
  • does not use MPI naming constructs
  • is not in 1-1 correspondence with ESSL
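For illustration, a minimal linking sketch on AIX, assuming the standard
ESSL/PESSL library names (program names are hypothetical):
  # Sequential ESSL
  xlf90_r -O3 solver.f90 -lessl
  # SMP ESSL with a threaded build
  xlf90_r -O3 -qsmp=omp solver.f90 -lesslsmp
  # PESSL (parallel, MPI-based) layered on ESSL
  mpxlf90_r -O3 psolver.f90 -lpessl -lessl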

24
Libraries, cont.
  • MASS and MASSV: IBM Math Acceleration SubSystem
    library (linking sketch below)
  • optimized special functions
  • trigonometric functions, exp, sqrt, and more
  • MASSV is a vector version of MASS
  • /contrib libraries, e.g. FFTW, as requested by
    users
  • Other: NCAR Graphics, interpolation routines,
    sequential LAPACK
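Similarly, a minimal linking sketch assuming the standard MASS library
names (the program name is hypothetical):
  # Scalar MASS plus vector MASSV
  xlf90_r -O3 physics.f90 -lmass -lmassv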

25
Compilers
  • Fortran 90: XLF v. 11.1 for AIX/Linux is the
    default. If needed, XLF v. 10.0.0.4 is available
    in /contrib; consult the user guide for linking
    instructions
  • C/C++: Visual Age v. 9.0 is the default, with
    mostly the same features as XLF 11.1

26
XLF 11.1- Fortran 2003
  • Full support of procedure pointers and
    allocatable object semantics
  • Object-oriented Fortran programming with
    constructs similar to C++ classes, methods, and
    destructors
  • User-defined derived-type I/O
  • Derived-type parameters (similar to C++
    templates) will be the only major feature not
    available in 11.1

27
XLF 11.1, cont.
  • Compliant with OpenMP v. 2.5
  • Performs a subset of loop transformations at the
    -O3 optimization level
  • Tuned BLAS routines (DGEMM and DGEMV) are
    included in the compiler runtime (libxlopt)
  • Recognizes matrix multiply and replaces it with a
    call to DGEMM

28
XLF 11.1, cont.
  • Runtime check for availability of ESSL
  • Support for auto-SIMDization and VMX (AltiVec)
    intrinsics (and data types) on AIX
  • Inline MASS library functions (math functions)

29
XLF 11.1 - New Options
  • New suboptions to -qfloat (usage sketch below):
  • -qfloat=fenv asserts that the FPSCR may be
    accessed (default is nofenv)
  • -qfloat=hscmplx gives better performance for
    complex divide/abs (default is nohscmplx)
  • -qfloat=nosingle does not generate
    single-precision float operations (default is
    single)
  • -qfloat=norngchk does not generate a range check
    for software divide (default is rngchk)
  • -qoptdebug for debugging optimized code
  • Expected value directive for function arguments
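For illustration, a sketch of combining -qfloat suboptions on one compile
line (suboptions are colon-separated; the source file name is
hypothetical):
  xlf90_r -O3 -qfloat=hscmplx:norngchk model.f90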

30
XLF 11.1 - new options, cont.
  • -qxlf90=nosignedzero is the default when
    -qnostrict is in effect (improves max/min
    performance)
  • Little-endian data I/O support
  • -qsmp=stackcheck to detect if a thread's stack
    goes beyond its limit
  • -qtune=balanced to obtain good performance on
    POWER6 without causing major degradation on
    POWER5
  • Built-in functions for new POWER6 instructions:
    dcbfl (local flush), new dcbt variant (prefetch
    depth), dcbst (store stream)

31
Optimization, compile time
  • Compiler options -O2 through -qhot offer
    optimizations (see the compile-line sketch below)
  • Use of the medium page size (64 KB) is
    beneficial. Add it to an existing executable:
    ldedit -btextpsize=64K -bdatapsize=64K \
           -bstackpsize=64K a.out
    Or use an environment variable (Korn shell):
    export LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K
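A minimal compile-and-tag sketch assuming POWER6-specific tuning flags
(the source file name is hypothetical):
  # Aggressive optimization targeted at POWER6
  xlf90_r -O3 -qhot -qarch=pwr6 -qtune=pwr6 -o model model.f90
  # Then mark the executable to use 64 KB pages
  ldedit -btextpsize=64K -bdatapsize=64K -bstackpsize=64K model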

32
Optimization, runtime: Processor binding is
mandatory!
  • bindproc.x in /contrib on blueice is replaced
    with an IBM-provided launch script on bluefire.
    Set:
    export TARGET_CPU_LIST="-1"
    mpirun.lsf /usr/local/bin/launch ./wrf.exe
  • For hybrid programs, use:
    export TARGET_CPU_RANGE="-1"
    mpirun.lsf /usr/local/bin/hybrid_launch ./wrf.exe
    along with OMP_NUM_THREADS
  • All parallel jobs should begin using one of the
    launch scripts mentioned above with their
    mpirun.lsf command.

33
Optimization, runtime: Simultaneous
Multithreading (SMT)
  • Doubles the number of active threads on a
    processor by implementing a second, on-board
    "virtual" processor on the chip
  • Easy to use: tell LSF you have twice as many
    processors. Simply double the value of the LSF
    ptile parameter, i.e. ptile=64 instead of
    ptile=32 (see the batch-script sketch below)
  • Boosts performance by 20% or more on some
    applications.
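A minimal LSF batch-script sketch pulling these runtime settings together
(the queue, task counts, and executable name are hypothetical):
  #!/bin/ksh
  #BSUB -q regular                # regular-memory nodes
  #BSUB -n 128                    # total SMT tasks requested
  #BSUB -R "span[ptile=64]"       # 64 SMT tasks per 32-way node
  #BSUB -W 6:00                   # stay within the 6-hour run limit
  export TARGET_CPU_LIST="-1"     # processor binding (mandatory)
  mpirun.lsf /usr/local/bin/launch ./wrf.exe
  # Submit with: bsub < run_wrf.ksh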

34
Bluefire Performance
  • blueice was 4.12 bluesky-equivalents
  • bluefire is 15.58 bluesky-equivalents

Weighted average: 45% CAM, 10% HD3D, 20% POP,
25% WRF
While the clock speed is 2.47x blueice's, we measured
an average speedup of 1.6x per processor over
blueice for typical NCAR applications (your
mileage may vary). We expect the bluefire
numbers to improve as IBM makes improvements to
the POWER6 compiler and InfiniBand communications
drivers.
35
User Experience: Community Atmosphere Model (CAM)
Performance
  • Standard CAM configuration using the FV dycore at
    1.9x2.5 degrees
  • On bluefire:
    nodes   PEs   perf (yr/day)   efficiency (w.r.t. 1 node)
    1       32    20.5            100%
    2       64    33.4            81%
    4       128   54.6            67%
  • On blueice:
    nodes   PEs   perf (yr/day)   efficiency (w.r.t. 1 node)
    1       16    6.8             100%
    2       32    12.3            90%
    4       64    21.0            77%
    8       128   36.4            67%
  • Results show a 67% performance increase at 32 PEs
    (from 12.3 to 20.5 yr/day) and a 50% performance
    increase at 128 PEs

36
Tools: TotalView debugger
37
Tools: Memory Monitoring
  • job_memusage prints the total (peak) memory usage
    of a job (serial, MPI, OpenMP, or hybrid)
  • C program based on the getrusage() system call
    provided by AIX. Located in /contrib/bin
  • Simple to use: prefix it to your executable:
    /contrib/bin/job_memusage.exe program args
  • For MPI and hybrid jobs, we recommend the
    MP_LABELIO environment variable to recognize the
    memory usage of every task (ksh):
    export MP_LABELIO=yes
    mpirun.lsf job_memusage.exe ./cam < namelist

38
Fall is here - plan now!
  • Decommissionings: Bluevista decommissioning
    September 30
  • Accelerated Scientific Discovery (ASD) Campaign
    (successor to BTS)
  • 1 September - 30 November. Expect a handful of
    projects to consume 3 million GAUs

39
Questions or Problems?
  • Documentation:
    http://www.cisl.ucar.edu/computers/bluefire
    http://www.cisl.ucar.edu/docs/bluefire/be_quickstart.html
  • Contact CISL Customer Support for help:
    http://www.cisl.ucar.edu/support (ExtraView ticket)
  • Telephone: (303) 497-1278

40
Additional Reading
  • POWER6:
    http://www.research.ibm.com/journal/rd51-6.html
    http://www-03.ibm.com/systems/power/hardware/575/index.html
  • AIX Operating System:
    http://www-03.ibm.com/systems/power/software/aix/index.html
  • Infiniband:
    http://www.infinibandta.org/itinfo
    http://www.mellanox.com/pdf/whitepapers/IB_intro_WP_190.pdf
    http://www.open-mpi.org/papers/workshop2006/thu_01_mpi_on_infiniband.pdf

41
Information We Need From You
  • Job ID, date
  • Nodes job ran on, if known
  • Description of the problem
  • Command you typed to submit job
  • Error code you are getting
  • These are best provided in an ExtraView ticket or
    via email to consult1@ucar.edu

42
Questions?