Title: Getting Started on Bluefire
Slide 1: Getting Started on Bluefire
- CISL High Performance Systems Section
- August 14, 2008
Slide 2: Agenda
- System Architecture Overview
- Security Model and File Transfer
- User environment and porting
- Compilers, tools, libraries
- Optimization, compile and run time
- Performance - What to expect
- Question and answer
Slide 3: POWER6 Architecture
- Ultra-high-frequency dual-core chip (3.5 to 5 GHz)
- 7-way superscalar, 2-way SMT core
  - Up to 5 instructions for one thread, up to 2 for the other thread in the same cycle
- Multiple execution units (nine)
  - 2 LSU (load/store)
  - 2 FPU (binary floating point)
  - 2 FXU (fixed point)
  - 1 Branch
  - 1 DFU (decimal floating point)
  - 1 VMX (SIMD)
- 790M transistors, 341 mm2 die
- Up to 64-core SMP systems
- 2x4 MB on-chip L2 - point of coherency
- On-chip L3 directory and controller
- Two memory controllers on-chip
- Technology
  - CMOS 65 nm lithography, SOI Cu
  - High-speed elastic bus interface at 2:1
Note: Sourced from IBM documentation
Slide 4: Configuration
- 127 32-way 4.7 GHz nodes
  - 4,064 POWER6 processors
  - SMT enabled (64 SMT threads per node)
  - 76.4 TFLOPS
- 117 compute nodes (70.4 TFLOPS peak)
  - 3,744 POWER6 processors (32 per node)
  - 69 compute nodes have 64 GB memory
  - 48 compute nodes have 128 GB memory
- 2 interactive/login nodes (256 GB memory)
- 2 share-queue nodes (256 GB memory)
- 4 GPFS/VSD nodes
- 2 service nodes (to be reclaimed)
- InfiniBand switch
  - 4X InfiniBand DDR links capable of 2 GB/sec with 3.4-microsecond latency
  - 8 links per node
- 150 terabytes of usable file system space
Slide 5: Peripherals
- 6 HMCs (Hardware Management Consoles)
  - Manage virtualization on the nodes
  - Manage system configuration
- 2 IB subnet managers
- 4 InfiniBand switches
- 1 Force10 network switch
- 1 management server
- 2 login/authentication proxy servers
- 1 storage manager
- 4 DS4800 disk controllers
- 4 SAN switches
- 768 RAID disks (300 GB each)
Slide 6: Memory
- 2 to 256 GB buffered DDR2 memory
- 16 to 128 GB/sec bandwidth
- On-chip L1 cache
  - 128 KB per processor: 64 KB data, 64 KB instruction
- On-chip L2 cache
  - 4 MB per processor
- Off-chip L3 cache
  - 32 MB shared by two processors
  - Connected to the chip via an 80 GB/sec bus
- Multiple page size support
  - Supports 4 virtual memory page sizes: 4 KB, 64 KB, 16 MB, and 16 GB
  - Supports using 64 KB pages in segments with base page size 4 KB
Note: Sourced from IBM documentation
Slide 7: InfiniBand
- InfiniBand Galaxy-2 adapter
  - IB DDR 4X bandwidth: 2 GB/s per link
  - Latency: 3.4 microseconds
  - 8 links per node on Bluefire
- Local Identifiers (LIDs): similar to IP addresses
- I/O: InfiniBand stack
  - Low latency
  - CPU offload
- Recovery at multiple levels
  - A 4X link can fail down to a 1X link
- GPFS and VSD use InfiniBand on Bluefire
Note: Sourced from IBM documentation
Slide 8: User Access Authentication
[Diagram: external user logins and data pass through a host firewall with IP filters to the Bluefire login nodes, authenticating with OTP or SSH keys; NCAR-internal user logins and data travel directly over the Bluefire network.]
Slide 9: Security and Data Movement
- Single OTP challenge only
- To log in, use: ssh -X bluefire.ucar.edu
- Shared high-performance file system between Bluefire and the data analysis and visualization servers
- File transfer can be automated to /ptmp
  - Users can specify the file transfer destination location in /ptmp
  - File transfer size is limited only by user quota
- Users on the UCAR network can transfer data to any Bluefire file system using OTP
- SCD portal is online
Slide 10: File Systems

Name         Size (TB)  Scheduled (TB)  Quota    Scheduled Quota  Backup
/home        4.4        17.6            5 GB     20 GB            Yes
/ccsm        4.4        -               Div Q    -                -
/cgd         2.2        -               Div Q    -                -
/hao         1.1        -               Div Q    -                -
/m3/mmmtmp   1.1        -               Div Q    -                -
/ptmp        65.3       109             250 GB   400 GB           -
/rap         2.2        -               Div Q    -                -
/ncar        1.1        -               SSG      -                -
/m3/users    1.1        -               Div Q    -                -
/waccm       2.2        -               Div Q    -                -
/acd         1.1        -               Div Q    -                -
TOTAL        143.1 TB (including scheduled expansions)
Slide 11: /fis File Systems
The /fis functionality has been kept. The current hardware and file system services were moved from blueice.

/fis replacement file systems:
- /contrib
- /fis/cgd
- /fis/cgd/home
- /fis/scd/home
- /fis/other/home
- /fis/hao/home
- /fis/hao/tgcm
- /fis/m3/projects/mesouser
- /fis/m3/projects/wrfhelp
Slide 12: Software Stack
- Operating system: AIX 5.3
- Batch system: Load Sharing Facility (LSF 7.02)
- Compilers: XL Fortran V11, XL C/C++ V9
  - Note: The compilers produce 64-bit objects by default. To produce 32-bit objects, set the environment variable OBJECT_MODE to 32.
- Utilities: pmrinfo, spinfo, batchview, etc.
  - Refer to /bin and /usr/local/bin on bluefire for a more complete list of user utilities.
- Debugger: TotalView (available soon)
- File system: General Parallel File System (GPFS)
Slide 17: POWER6 Tour
CISL has planned POWER6 tours of 15-20 minutes, in groups of 4-6 users, each business day at 9:00 AM and 9:30 AM from August 25, 2008 until September 5, 2008. If you are interested in a personal tour to see and discuss the POWER6 architecture, send e-mail to juliana@ucar.edu or irfan@ucar.edu. Please include the following information in your e-mail: your name, your phone number, preferred tour date, and preferred tour time.
Slide 18: Transferring Files
- Initiated from bluefire, works as previously:
  scp file remote@meteo.edu:file
- May now be initiated from a remote machine using SSH keys that you install on bluefire:
  scp /localdir/file loginid@bfft.ucar.edu:/ptmp/loginid/
- Files must go to/from the /ptmp/loginid tree
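A minimal sketch of the key setup that enables the remote-initiated transfer above, using the standard OpenSSH procedure. The bfft.ucar.edu host and the /ptmp/loginid destination come from the slide; the key file names and the loginid placeholder are illustrative.

```shell
# On the remote machine: generate a key pair (standard OpenSSH)
ssh-keygen -t rsa -f ~/.ssh/id_rsa

# Install the public key on bluefire (one-time step; prompts for OTP)
cat ~/.ssh/id_rsa.pub | \
  ssh loginid@bfft.ucar.edu 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'

# Subsequent transfers from the remote machine need no interactive login;
# files must land under the /ptmp/loginid tree
scp /localdir/file loginid@bfft.ucar.edu:/ptmp/loginid/
```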
Slide 19: Good Practices
- Use the share queue for batch file transfer, including msrcp to the MSS
- Mixing LSF and MSS usage
  - Multistep applications
  - See /usr/local/examples/lsf/multistep
  - Synchronous reads/writes
    - Step 1: read data from the MSS (share queue)
    - Step 2: run the model (regular queue)
    - Step 3: write data to the MSS (share queue)
  - Saves GAUs by reducing processor count
- Pre- and/or post-processing can follow the same outline as this example
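As a sketch of the multistep pattern above, the three steps can be chained with LSF job dependencies. The mechanism shown is an assumption: the script names, job names, and task counts are illustrative, and the site-provided example in /usr/local/examples/lsf/multistep may use a different approach.

```shell
#!/bin/ksh
# Illustrative three-step job chain using LSF dependency expressions.
# The scripts read_mss.ksh / run_model.ksh / write_mss.ksh are hypothetical.

# Step 1: read input data from the MSS on the cheap share queue
bsub -J step1_read  -q share   -n 1  ./read_mss.ksh

# Step 2: run the model on a regular queue, only after step 1 succeeds
bsub -J step2_model -q regular -n 64 -w "done(step1_read)" ./run_model.ksh

# Step 3: write results back to the MSS, only after step 2 succeeds
bsub -J step3_write -q share   -n 1  -w "done(step2_model)" ./write_mss.ksh
```

Because steps 1 and 3 hold only one processor on the share queue, the full 64-way allocation is charged only while the model itself runs, which is the GAU saving the slide describes.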
Slide 20: User Environment and Porting
- Korn shell is the default shell
  - To change it, log in to bems.ucar.edu from bluefire
- Blueice users' dotfiles have been transferred
- Blueice files have been transferred
- Quotas: 5 GB home (soon to be 20 GB), 400 GB in /ptmp
Slide 21: Queues and Charging
- Queues for regular-memory (64 GB) and large-memory (128 GB) nodes
  - 69 regular-memory nodes and 48 large-memory nodes
- Charges began on July 1
- Machine charge factor is 1.4
Slide 22: Queues and Charging, cont.

Queue Name        Queue Charge Factor  Run Limit
capability        1                    12 hours
debug, dedicated  1                    6 hours
economy           0.5                  6 hours
hold              0.33                 6 hours
premium           1.5                  6 hours
regular, special  1                    6 hours
share (2 nodes)   1                    12 hours
standby           0.1                  6 hours
lrg_capability    1                    12 hours
lrg_economy       0.5                  6 hours
lrg_hold          0.33                 6 hours
lrg_premium       1.5                  6 hours
lrg_regular       1                    6 hours
lrg_standby       0.1                  6 hours
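To make the charge factors concrete, the sketch below estimates a job's cost. The formula (wallclock hours x processors x queue charge factor x machine charge factor) is an assumption for illustration only; consult CISL accounting documentation for the actual GAU formula.

```shell
# Hypothetical GAU estimate for a 6-hour, 64-processor economy-queue job.
wall_hours=6        # wallclock hours used
procs=64            # processors held
queue_factor=0.5    # economy queue, from the table above
machine_factor=1.4  # Bluefire machine charge factor (slide 21)

awk -v h="$wall_hours" -v p="$procs" -v q="$queue_factor" -v m="$machine_factor" \
    'BEGIN { printf "%.1f GAUs\n", h * p * q * m }'
# prints: 268.8 GAUs
```

Running the same job on the premium queue (factor 1.5) would triple the charge, which is why the share queue is recommended for low-processor file-transfer steps.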
Slide 23: Libraries
- ESSL and PESSL: IBM Engineering and Scientific Subroutine Library
  - High-performance, general-purpose math libraries
  - Linear algebra, eigensystems, random numbers, quadrature, sorting, etc.
  - ESSL supports sequential and SMP versions
  - PESSL is parallel and MPI-compatible
    - does not use MPI naming constructs
    - is not in 1-1 correspondence with ESSL
Slide 24: Libraries, cont.
- MASS and MASSV: IBM Math Acceleration SubSystem library
  - Optimized special functions: trigonometric functions, exp, sqrt, and more
  - MASSV is a vector version of MASS
- /contrib libraries, e.g. FFTW, as requested by users
- Other: NCAR Graphics, interpolation routines, sequential LAPACK
Slide 25: Compilers
- Fortran 90: XLF v. 11.1 for AIX/Linux is the default. If needed, XLF v. 10.0.0.4 is available in /contrib; consult the user guide for linking instructions.
- C/C++: Visual Age v. 9.0 is the default, with mostly the same features as XLF 11.1.
Slide 26: XLF 11.1 - Fortran 2003
- Full support of procedure pointers and allocatable object semantics
- Object-oriented Fortran programming with constructs similar to C++ classes, methods, and destructors
- User-defined derived-type I/O
- Derived-type parameters (similar to C++ templates) will be the only major feature not available in 11.1
Slide 27: XLF 11.1, cont.
- Compliant with OpenMP v. 2.5
- Performs a subset of loop transformations at the -O3 optimization level
- Tuned BLAS routines (DGEMM and DGEMV) are included in the compiler runtime (libxlopt)
- Recognizes matrix multiplication and replaces it with a call to DGEMM
Slide 28: XLF 11.1, cont.
- Runtime check for availability of ESSL
- Support for auto-simdization and VMX (AltiVec) intrinsics (and data types) on AIX
- Inlines MASS library functions (math functions)
Slide 29: XLF 11.1 - New Options
- New suboptions to -qfloat:
  - -qfloat=fenv asserts that the FPSCR may be accessed (default is nofenv)
  - -qfloat=hscmplx: better performance for complex divide/abs (default is nohscmplx)
  - -qfloat=nosingle: does not generate single-precision float operations (default is single)
  - -qfloat=norngchk: does not generate a range check for software divide (default is rngchk)
- -qoptdebug for debugging optimized code
- Expected-value directive for function arguments
Slide 30: XLF 11.1 - New Options, cont.
- -qxlf90=nosignedzero is the default when -qnostrict is in effect (improves max/min performance)
- Little-endian data I/O support
- -qsmp=stackcheck to detect if a thread's stack goes beyond its limit
- -qtune=balanced to obtain good performance on POWER6 without causing major degradation on POWER5
- Built-in functions for new POWER6 instructions: dcbfl (local flush), new dcbt variant (prefetch depth), dcbst (store stream)
Slide 31: Optimization, Compile Time
- Compiler options -O2 through -qhot offer optimizations
- Use of the medium page size (64 KB) is beneficial. Add it to an existing executable:
  ldedit -btextpsize=64K -bdatapsize=64K -bstackpsize=64K a.out
  Or use an environment variable (Korn shell):
  export LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K
Slide 32: Optimization, Runtime: Processor Binding Is Mandatory!
- bindproc.x in /contrib on blueice has been replaced with an IBM-provided launch script on bluefire. Set:
  export TARGET_CPU_LIST="-1"
  mpirun.lsf /usr/local/bin/launch ./wrf.exe
- For hybrid programs, use:
  export TARGET_CPU_RANGE="-1"
  mpirun.lsf /usr/local/bin/hybrid_launch ./wrf.exe
  along with OMP_NUM_THREADS
- All parallel jobs should begin using one of the launch scripts mentioned above with their mpirun.lsf command.
Slide 33: Optimization, Runtime: Simultaneous Multithreading (SMT)
- Doubles the number of active threads on a processor by implementing a second, on-board "virtual" processor on the chip
- Easy to use: tell LSF you have twice as many processors. Simply double the value of the LSF ptile parameter, i.e. ptile=64 instead of ptile=32.
- Boosts performance by 20% or more on some applications.
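Putting the runtime recommendations together, a minimal LSF batch script might look like the sketch below. The launch script path and TARGET_CPU_LIST setting come from slide 32 and the ptile=64 value from this slide; the queue name, job geometry, wallclock limit, and executable name are illustrative assumptions, not a site-verified template.

```shell
#!/bin/ksh
# Illustrative Bluefire LSF batch script (sketch).
#BSUB -J wrf_run              # job name
#BSUB -q regular              # queue from the charging table
#BSUB -n 64                   # 64 SMT tasks = 1 node with SMT enabled
#BSUB -R "span[ptile=64]"     # double ptile (64, not 32) to use SMT
#BSUB -W 6:00                 # regular-queue run limit is 6 hours
#BSUB -o wrf.%J.out           # stdout file
#BSUB -e wrf.%J.err           # stderr file

# Processor binding via the IBM-provided launch script (mandatory)
export TARGET_CPU_LIST="-1"
mpirun.lsf /usr/local/bin/launch ./wrf.exe
```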
Slide 34: Bluefire Performance
- blueice was 4.12 bluesky-equivalents
- bluefire is 15.58 bluesky-equivalents
  - Weighted average: 45% CAM, 10% HD3D, 20% POP, 25% WRF

While the clock speed is 2.47x blueice's, we measured an average speedup of 1.6x per processor over blueice for typical NCAR applications (your mileage may vary). We expect the bluefire numbers to improve as IBM makes improvements to the POWER6 compiler and InfiniBand communications drivers.
Slide 35: User Experience: Community Atmosphere Model (CAM) Performance
- Standard CAM configuration using the FV dycore at 1.9x2.5º
- On bluefire:
  nodes  PEs  perf (yr/day)  efficiency (w.r.t. 1 node)
  1      32   20.5           100%
  2      64   33.4           81%
  4      128  54.6           67%
- On blueice:
  nodes  PEs  perf (yr/day)  efficiency (w.r.t. 1 node)
  1      16   6.8            100%
  2      32   12.3           90%
  4      64   21.0           77%
  8      128  36.4           67%
- Results show a 67% performance increase at 32 PEs (from 12.3 to 20.5 yr/day) and a 50% performance increase at 128 PEs.
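The efficiency column is just per-node throughput relative to the single-node run. The snippet below recomputes it from the bluefire numbers in the table above (values hard-coded from the slide):

```shell
# Recompute the bluefire efficiency column: per-node yr/day at N nodes
# divided by per-node yr/day at 1 node. Numbers are from the table above.
awk 'BEGIN {
  base = 20.5 / 1                          # yr/day per node at 1 node
  split("1 2 4", nodes, " ")
  split("20.5 33.4 54.6", perf, " ")
  for (i = 1; i <= 3; i++)
    printf "%d nodes: %.0f%%\n", nodes[i], 100 * (perf[i] / nodes[i]) / base
}'
# prints:
# 1 nodes: 100%
# 2 nodes: 81%
# 4 nodes: 67%
```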
Slide 36: Tools: TotalView Debugger
Slide 37: Tools: Memory Monitoring
- job_memusage prints the total (peak) memory usage of a job (serial, MPI, OpenMP, or hybrid)
- A C program based on the getrusage() system call provided by AIX; located in /contrib/bin
- Simple to use: prefix it to your executable:
  /contrib/bin/job_memusage.exe program args
- For MPI and hybrid jobs, we recommend the MP_LABELIO environment variable to attribute the memory usage of each task:
  export MP_LABELIO=yes   # ksh
  mpirun.lsf job_memusage.exe ./cam < namelist
Slide 38: Fall Is Here - Plan Now!
- Decommissionings: bluevista decommissioned September 30
- Accelerated Scientific Discovery (ASD) campaign (successor to BTS)
  - September 1 to November 30
  - Expect a handful of projects to consume 3 million GAUs
Slide 39: Questions or Problems?
- Documentation:
  http://www.cisl.ucar.edu/computers/bluefire
  http://www.cisl.ucar.edu/docs/bluefire/be_quickstart.html
- Contact CISL Customer Support for help:
  http://www.cisl.ucar.edu/support (ExtraView ticket)
- Telephone: (303) 497-1278
Slide 40: Additional Reading
- POWER6:
  http://www.research.ibm.com/journal/rd51-6.html
  http://www-03.ibm.com/systems/power/hardware/575/index.html
- AIX operating system:
  http://www-03.ibm.com/systems/power/software/aix/index.html
- InfiniBand:
  http://www.infinibandta.org/itinfo
  http://www.mellanox.com/pdf/whitepapers/IB_intro_WP_190.pdf
  http://www.open-mpi.org/papers/workshop2006/thu_01_mpi_on_infiniband.pdf
Slide 41: Information We Need From You
- Job ID, date
- Nodes the job ran on, if known
- Description of the problem
- Command you typed to submit the job
- Error code you are getting
- These are best provided in an ExtraView ticket or via email to consult1@ucar.edu
Slide 42: Questions?