Title: NASA NCCS APPLICATION PERFORMANCE DISCUSSION
1 NASA NCCS Application Performance Discussion
- Koushik Ghosh, Ph.D.
- IBM Federal HPC, HPC Technical Specialist
- IBM System x iDataPlex - Parallel Scientific Applications Development
- April 22-23, 2009
2 Topics
- HW/SW System Architecture
- Platform/Chipset
- Processor
- Memory
- Interconnect
- Building Apps on System x
- Compilation
- MPI
- Executing Apps on System x
- Runtime options
- Tools: Oprofile / MIO (I/O performance) / MPI Trace
- Discussion of NCCS apps
3 Scalable Unit Summary
4 2-SCU Configuration
5 SDR/DDR/QDR
6 iDataPlex footprint
7 Compute Node
- iDataPlex 2U Flex
- SCU 3 / SCU 4: Intel Harpertown (Xeon L5200)
- dual-socket, quad-core, 2.5 GHz, 50 W
- SCU 5: Nehalem
- dual-socket, quad-core, 2.8? GHz
8 Harpertown: Seaburg Chipset
9 Harpertown: Intel Core 2 Quad Processor
10 Nehalem: Tylersburg Chipset
11 Nehalem: Intel Core i7 Processor
12 Nehalem: QPI (QuickPath Interconnect)
13 Cache Details
14 cpuinfo (Harpertown) (/opt/intel/impi/3.1/bin64/cpuinfo)
- Architecture: x86_64
- Hyperthreading: disabled
- Packages: 2
- Cores: 8
- Processors: 8
- Processor identification
    Processor  Thread  Core  Package
    0          0       0     1
    1          0       0     0
    2          0       2     0
    3          0       2     1
    4          0       1     0
    5          0       3     0
    6          0       1     1
    7          0       3     1
- Processor placement
    Package  Cores    Processors
    1        0,2,1,3  0,3,6,7
    0        0,2,1,3  1,2,4,5
15 cpuinfo (Nehalem) (/opt/intel/impi/3.2.0.011/bin64/cpuinfo)
- Architecture: x86_64
- Hyperthreading: enabled
- Packages: 2
- Cores: 8
- Processors: 16
- Processor identification
    Processor  Thread  Core  Package
    0          0       0     0
    1          1       0     0
    2          0       1     0
    3          1       1     0
    4          0       2     0
    5          1       2     0
    6          0       3     0
    7          1       3     0
    8          0       0     1
    9          1       0     1
    10         0       1     1
    11         1       1     1
16 cat /proc/cpuinfo (Harpertown)
- processor : 0
- vendor_id : GenuineIntel
- cpu family : 6
- model : 23
- model name : Intel(R) Xeon(R) CPU E5472 @ 3.00GHz
- stepping : 6
- cpu MHz : 2992.509
- cache size : 6144 KB
- physical id : 1
- siblings : 4
- core id : 0
- cpu cores : 4
- fpu : yes
- fpu_exception : yes
- cpuid level : 10
- wp : yes
- flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
- bogomips : 5988.95
- clflush size : 64
17 /proc/cpuinfo (Nehalem)
- processor : 0
- vendor_id : GenuineIntel
- cpu family : 6
- model : 26
- model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
- stepping : 4
- cpu MHz : 2927.000
- cache size : 8192 KB
- physical id : 0
- siblings : 8
- core id : 0
- cpu cores : 4
- apicid : 0
- initial apicid : 0
- fpu : yes
- fpu_exception : yes
- cpuid level : 11
- wp : yes
- flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
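Since slides 14-17 hinge on the siblings vs. cpu cores distinction, here is a small C sketch (a hypothetical helper, parsing only the two fields shown above) that infers whether Hyper-Threading is enabled:

    #include <stdio.h>

    /* When SMT is on, "siblings" (hardware threads per package) exceeds
       "cpu cores" (physical cores per package): 8 vs 4 on the Nehalem
       node above, 4 vs 4 on Harpertown. */
    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        int siblings = 0, cores = 0;

        if (!f) { perror("/proc/cpuinfo"); return 1; }
        while (fgets(line, sizeof line, f) && !(siblings && cores)) {
            sscanf(line, "siblings : %d", &siblings);
            sscanf(line, "cpu cores : %d", &cores);
        }
        fclose(f);
        printf("siblings=%d cores=%d -> HT %s\n",
               siblings, cores, siblings > cores ? "enabled" : "disabled");
        return 0;
    }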
18 cat /proc/meminfo
- MemTotal 24737232 kB
- MemFree 21152912 kB
- Buffers 77376 kB
- Cached 2230344 kB
- SwapCached 0 kB
- Active 1650908 kB
- Inactive 1720616 kB
- Active(anon) 955796 kB
- Inactive(anon) 0 kB
- Active(file) 695112 kB
- Inactive(file) 1720616 kB
- Unevictable 0 kB
- Mlocked 0 kB
- SwapTotal 2104472 kB
- SwapFree 2104472 kB
- Dirty 536 kB
- Writeback 0 kB
- AnonPages 955608 kB
- Mapped 28632 kB
- Slab 123752 kB
- SReclaimable 101028 kB
- SUnreclaim 22724 kB
- PageTables 5364 kB
- NFS_Unstable 0 kB
- Bounce 0 kB
- WritebackTmp 0 kB
- CommitLimit 14473088 kB
- Committed_AS 1156568 kB
- VmallocTotal 34359738367 kB
- VmallocUsed 337244 kB
- VmallocChunk 34359395451 kB
- HugePages_Total 0
- HugePages_Free 0
- HugePages_Rsvd 0
- HugePages_Surp 0
- Hugepagesize 2048 kB
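The headline fields can also be read programmatically; a minimal C sketch assuming only the /proc/meminfo format shown above:

    #include <stdio.h>

    /* Report the total and free memory fields from /proc/meminfo. */
    int main(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        long total = 0, freemem = 0;

        if (!f) { perror("/proc/meminfo"); return 1; }
        while (fgets(line, sizeof line, f)) {
            sscanf(line, "MemTotal: %ld kB", &total);
            sscanf(line, "MemFree: %ld kB", &freemem);
        }
        fclose(f);
        printf("MemTotal %ld kB, MemFree %ld kB\n", total, freemem);
        return 0;
    }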
19 Meminfo explanation
- High-level statistics
- MemTotal: total usable RAM (i.e., physical RAM minus a few reserved bits and the kernel binary code)
- MemFree: the sum of LowFree + HighFree (overall stat)
- MemShared: present for compatibility reasons, but always zero
- Buffers: memory in the buffer cache; mostly useless as a metric nowadays
- Cached: memory in the pagecache (disk cache) minus SwapCache
- SwapCache: memory that was once swapped out and has been swapped back in, but is still present in the swapfile (if memory is needed it does not have to be swapped out again, because it is already in the swapfile; this saves I/O)
20 Memory
- Memory on Harpertown compute nodes (SCU3 and SCU4)
- 4 x 4GB (9W) PC2-5300 CL5 ECC DDR2 667MHz FBDIMMs
- 16 GB per node
- Memory on Nehalem compute nodes
- 3 DDR3 channels on each socket / total of 8 DIMM slots
- e.g. a 4 GB DIMM on each DDR3 channel: 24 GB/node at 1333 MHz
- e.g. 18 GB per node at 1066 MHz:
- 2GB/2GB/2GB on channel 1
- 2GB/2GB/2GB on channel 2
- 1GB on channel 3
21 Interconnect
- (1) Mellanox ConnectX dual-port DDR IB 4X HCA, PCIe 2.0 x8
- IB 4X DDR Cisco 9024D 288-port DDR switches for each scalable unit, cabled in the following manner:
- 256 ports to compute nodes
- 2 ports to spare compute nodes
- 6 ports to service nodes
- 24 ports uplinked to the Tier 1 InfiniBand switch
- ConnectX InfiniBand 4X DDR HCAs
- 16 Gb/second of unidirectional peak MPI bandwidth
- less than 2 microseconds MPI latency
22 Nehalem Features
- The Nehalem microarchitecture has many new features, some of which are present in the Core i7. The ones that represent significant changes from the Core 2 include:
- The new LGA 1366 socket, incompatible with earlier processors.
- On-die memory controller: the memory is directly connected to the processor. It sits in the "uncore" part of the chip and runs at a different clock (the uncore clock) from the execution cores.
- Three-channel memory: each channel can support one or two DDR3 DIMMs. Motherboards for Core i7 generally have three, four (3+1), or six DIMM slots.
- Support for DDR3 only.
- No ECC support.
- The front side bus has been replaced by the Intel QuickPath Interconnect interface. Motherboards must use a chipset that supports QuickPath.
- The following caches:
- 32 KB L1 instruction and 32 KB L1 data cache per core
- 256 KB L2 cache (combined instruction and data) per core
- 8 MB L3 cache (combined instruction and data), "inclusive", shared by all cores
- Single-die device: all four cores, the memory controller, and all cache are on a single die.
23 Nehalem Features contd.
- "Turbo Boost" technology
- allows all active cores to intelligently clock themselves up
- in steps of 133 MHz over the design clock rate
- as long as the CPU's predetermined thermal/electrical requirements are still met
- Re-implemented Hyper-Threading
- Each of the four cores can process up to two threads simultaneously,
- so the processor appears to the OS as eight CPUs
- This feature was absent from the Core microarchitecture (e.g. Harpertown)
- Only one QuickPath interface: not intended for multi-processor motherboards
- 45nm process technology
- 731M transistors
- 263 mm2 die size
- Sophisticated power management can place an unused core in a zero-power mode
- Support for the SSE4.2 and SSE4.1 instruction sets
24 I/O and Filesystem
- /discover/home
- /discover/nobackup
- IBM General Parallel File System (GPFS) used on all nodes
- Serviced by 4 I/O nodes
- read/write access from all nodes
- /discover/home 2 TB, /discover/nobackup 4 TB
- Individual quota
- /discover/home 500 MB
- /discover/nobackup 100 GB
- Very fast (peak 350 MB/sec, normally 150-250 MB/sec)
25 Software
- OS: Linux (RHEL 5.2)
- Compilers: Intel Fortran, C/C++
- Math libs: BLAS, LAPACK, ScaLAPACK, MKL
- MPI: MPI-2
- Scheduler: PBS Pro
26 Linux pagesize
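A process can query the page size at run time; below is a minimal C sketch using the standard POSIX sysconf call (the file name is made up; the 2 MB Hugepagesize shown in /proc/meminfo above is a separate, optional page size):

    /* pagesize.c: print the base page size (4 KB on these x86_64 nodes). */
    #include <stdio.h>
    #include <unistd.h>   /* sysconf, _SC_PAGESIZE */

    int main(void)
    {
        long psize = sysconf(_SC_PAGESIZE);
        printf("page size: %ld bytes\n", psize);
        return 0;
    }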
27 Which modules
- modules loaded (64-bit compilers)
- intel-cce-10.1.017
- intel-fce-10.1.017
- intel-mkl-10.0.3.020
- intel-mpi-3.1-64bit
- /opt/intel/fce/10.1.017/bin/ifort
- /opt/intel/impi/3.1/bin64/mpiifort
- modules loaded (32-bit compilers)
- intel-cc-10.1.017
- intel-fc-10.1.017
- /opt/intel/fc/10.1.017/bin/ifort
- /opt/intel/impi/3.1/bin/mpiifort
28 IFC Compiler Options of some Physics/Chemistry/Climate Applications
- CubedSphere: -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -O3 -xS
- NPB3.2: -O3 -xT -ip -no-prec-div -ansi-alias -fno-alias
- HPCC: -O2 -xT
- GAMESS: -O3 -xS -ipo -i-static -fno-pic
- GTC: -O1
- CAM: -O3 -xT
- MILC: -O3 -xT
- PARATEC: -O3 -xS -ipo -i-static -fno-fnalias -fno-alias
- STREAM: -O3 -opt-streaming-stores always -xS -ip
- SpecCPU2006: ?????
29 Optimization Level O2 (-O2)
- Inlining of intrinsics
- Intra-file interprocedural optimizations:
- inlining, constant propagation, forward substitution
- routine attribute propagation, variable address-taken analysis
- dead static function elimination, removal of unreferenced variables
- constant propagation, copy propagation, dead-code elimination
- global register allocation
- global instruction scheduling and control speculation
- loop unrolling, optimized code selection
- partial redundancy elimination
- strength reduction / induction variable simplification
- variable renaming, exception handling optimizations
- tail recursion, peephole optimizations
- structure assignment lowering and optimizations
- dead store elimination
30 Optimization Level O3 (-O3)
- Enables O2 optimizations plus more aggressive optimizations, such as:
- prefetching, scalar replacement
- loop and memory access transformations
- loop unrolling, including instruction scheduling
- code replication to eliminate branches
- padding the size of certain power-of-two arrays to allow more efficient cache use
- O3 optimizations may not yield higher performance unless loop and memory access transformations take place
- O3 optimizations may slow down code in some cases compared to O2 optimizations
- The O3 option is recommended for:
- loops that heavily use floating-point calculations
- loops that process large data sets
31 O2 vs. O3
- O2 already delivers a significant share of the achievable performance
- The difference depends on code constructs and memory optimizations
- Both should be experimented with
32 Interprocedural Optimizations (-ip)
- Interprocedural optimizations for single-file compilation
- A subset of the full multi-file interprocedural optimizations (-ipo)
- e.g. performs inline function expansion for calls to functions defined within the current source file
33 Interprocedural Optimization (-ipo)
- Multi-file IP optimizations that include:
- inline function expansion
- interprocedural constant propagation
- dead code elimination
- propagation of function characteristics
- passing arguments in registers
- loop-invariant code motion
34 Inlining
- -inline-level=<n>
- controls inline expansion
- n=0: disable inlining
- n=1: no inlining (unless -ip is specified)
- n=2: inline any function, at the compiler's discretion (same as -ip)
- -finline-functions
- inline any function at the compiler's discretion
- -finline-limit=<n>
- sets the maximum number of statements to be considered for inlining
- -no-inline-min-size
- no size limit for inlining small routines
- -no-inline-max-size
- no size limit for inlining large routines
35 Did Inlining, IPO and PGO Help?
- Use selectively on bottlenecks
- Better for small chunks of code
36 The -fast Option
- Includes options that can improve run-time performance:
- -O3 (maximum speed and high-level optimizations)
- -ipo (enables interprocedural optimizations across files)
- -xT (generates code specialized for Intel(R) Xeon(R) processors with SSE3, SSE4, etc.)
- -static (statically link in libraries at link time)
- -no-prec-div (disables -prec-div, where -prec-div improves the precision of FP divides at some speed cost)
37 SSE and Vectorization
- -xT: Intel(R) Core(TM)2 processor family with SSSE3 (or use -xSSSE3)
- Harpertown
- -xS: future Intel processors supporting the SSE4 Vectorizing Compiler and Media Accelerator instructions (or use -xSSE4.1)
- -xSSE4.2 for Nehalem processors (SSE4.2 instructions)
- -xSSE4.1 for Nehalem processors (SSE4.1 instructions)
38 What is SSE4
- SSE: Streaming SIMD Extensions (SSE1, SSE2, SSE3)
- SSSE3: Supplemental SSE3
- SSE4.2 is first available in Core i7 (aka Nehalem)
- consists of 54 instructions divided into two major categories:
- Vectorizing Compiler and Media Accelerators
- Efficient Accelerated String and Text Processing
- Graphics / video encoding and processing / 3-D imaging / gaming
- High-performance applications
- Efficient Accelerated String and Text Processing will benefit database and data-mining applications, and those that use parsing, search, and pattern-matching algorithms such as virus scanners and compilers
- A subset of 47 instructions, SSE4.1, is in Penryn (Core 2) Harpertown
39 Vectorization (Intra-register) -vec

    void vecadd(float *a, float *b, float *c, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

- The Intel compiler will transform the loop to allow four floating-point additions to occur simultaneously using the addps instruction. Simply put, using a pseudo-vector notation, the result would look something like this:

    for (i = 0; i < n; i += 4)
        c[i:i+3] = a[i:i+3] + b[i:i+3];
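For illustration only, the addps transformation can be written by hand with SSE intrinsics; this is a sketch of the idea, not the compiler's actual output (the tail loop handles n not divisible by 4):

    #include <xmmintrin.h>   /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

    void vecadd_sse(float *a, float *b, float *c, int n)
    {
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* one addps */
        }
        for (; i < n; i++)                              /* scalar remainder */
            c[i] = a[i] + b[i];
    }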
40 OpenMP
- -openmp
- generates multi-threaded code based on the OpenMP directives
- -openmp-profile
- enables analysis of an OpenMP application; the Intel(R) Thread Profiler should be installed
- -openmp-stubs
- enables the user to compile OpenMP programs in sequential mode
- OpenMP directives are ignored and a stub OpenMP library is linked
- -openmp-report[0|1|2]
- controls the OpenMP parallelizer diagnostic level
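A minimal sketch of the kind of code -openmp compiles (the loop and file name are made up for illustration):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double sum = 0.0;
        int i;
        /* Independent iterations; the reduction clause combines the
           per-thread partial sums. Thread count comes from OMP_NUM_THREADS. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < 1000000; i++)
            sum += 1.0 / (1.0 + i);
        printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }

Assumed build line for the compilers above: icc -openmp omp_sum.c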
41 Auto Parallel (-parallel)
- -parallel
- generates multithreaded code for loops that can be safely executed in parallel
- Must use O2 or O3
- The default number of threads spawned equals the number of processors detected on the system where the binary runs
- can be changed by setting the environment variable OMP_NUM_THREADS
- -par-report is very useful
42 Auto Parallel Experiment Outcome
- 8 cores
- 6 MPI tasks
- OMP_NUM_THREADS=2
- -stack_temps -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -i4 -r8 -w95 -O3 -inline-level=2
- Total runtime: 415 seconds
- -stack_temps -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -w95 -O3 -inline-level=2 -parallel
- Total runtime: 594 seconds
- Have to include -parallel in LDFLAGS
43 Profile Guided Optimization (PGO)
- Traditional static compilation model:
- optimization decisions are based only on an estimate of important execution characteristics
- Branch probabilities are estimated by assuming that controlling conditions that test equality are less likely to succeed than conditions that test inequality
- Relative execution counts are based on static properties such as nesting depth
- These estimated execution characteristics are subsequently used to make optimization decisions, such as:
- selecting an appropriate instruction layout
- procedure inlining
- generating a sequential and a vector version of a loop
- The quality of such decisions can improve substantially if more accurate execution characteristics are available, which becomes possible under profile-guided optimization
44 PGO steps
- Phase 1 (Compile)
- mpiifort -O3 -prof-gen -prof-dir dirx
- Phase 2 (Run code, collect profile)
- run_CubedSphere_BMK2.sh > BMK2.out 2>&1
- Produces .dyn files
- comp_fv/49e5dea6_12099.dyn, etc.
- Phase 3 (Recompile)
- mpiifort -O3 -prof-use -prof-dir dirx
- ipo: remark #11000: performing multi-file optimizations
- ipo-1: remark #11004: starting multi-object compilation
- Phase 4 (Re-run code)
- rerun_CubedSphere_BMK2.sh > BMK2.out 2>&1
45 PGO Outcome: mxm
- -O3 -prof-gen: 96 seconds
- -O3 -prof-use: 10 seconds
- -O2: 27 seconds
46 Optimization Reports
- -vec-report<n>
- controls the amount of vectorizer diagnostic information
- n=3: indicate vectorized/non-vectorized loops and prohibiting data-dependence info
- -opt-report <n>
- generates an optimization report to stderr
- n=3: maximum report output
- -opt-report-file=<file>
- specifies the filename for the generated report
- -opt-report-routine=<name>
- reports on routines containing the given name
47 Interprocedural Optimization (-ipo)
- Multi-file IP optimizations that include:
- inline function expansion
- interprocedural constant propagation
- dead code elimination
- propagation of function characteristics
- passing arguments in registers
- loop-invariant code motion
48 Compiler Options: Simple MXM Example
49 CubedSphere Performance for various IFC Options
50 Summary of MPI options
- Stacks available
- OFED 1.3.1 / OFED 1.4.1
- MPI implementations
- Intel MPI 3.2
- MVAPICH 1.0.1
- MVAPICH2 1.2.6
- OpenMPI 1.3.1
- Compilers
- Intel compilers
- intel-cce-10.1.017
- intel-fce-10.1.017
- PGI
- Pathscale
- gcc
51 Which MPI flavor
- intel-mpi-3.1-64bit
- intel-openmpi-1.2.6
- intel-mvapich-1.0.1
- intel-mvapich2-1.2rc2
- gcc-openmpi-1.2.6
- gcc-mvapich-1.0.1
- gcc-mvapich2-1.2rc2
- pathscale-openmpi-1.2.6
- pathscale-mvapich-1.0.1
- pathscale-mvapich2-1.2rc2
- pgi-openmpi-1
- pgi-mvapich-1.0.1
- pgi-mvapich2-1.2rc2
- ofed-1.4-pgi-openmpi-1.2.8
- ofed-1.4-pgi-mvapich-1.1.0
- ofed-1.4-pgi-mvapich2-1.2p1
- ofed-1.4-gcc-openmpi-1.2.8
- ofed-1.4-gcc-mvapich-1.1.0
- ofed-1.4-gcc-mvapich2-1.2p1
- ofed-1.4-pathscale-openmpi-1.2.8
- ofed-1.4-pathscale-mvapich-1.1.0
- ofed-1.4-pathscale-mvapich2-1.2p1
52 OpenFabrics Enterprise Distribution: OFED 1.3.1/1.4
- The OpenFabrics Alliance software stacks: OFED 1.3.1/1.4.x
- Goal: develop, distribute, and promote a unified, transport-independent, open-source software stack
- for RDMA-capable fabrics and networks
- InfiniBand and Ethernet
- developed for many hardware architectures and OSes
- Linux and Windows
- for server and storage clustering and grid connectivity
- optimized for performance (i.e., BW, low latency)
- using transport-offload technologies available in adapter hardware
53 MVAPICH
- MVAPICH
- (MPI-1 over OpenFabrics/Gen2, OpenFabrics/Gen2-UD, uDAPL, InfiniPath, VAPI and TCP/IP)
- An MPI-1 implementation
- Based on MPICH and MVICH
- The latest release is MVAPICH 1.1 (includes MPICH 1.2.7)
- Available under BSD licensing
54 MVAPICH2
- MVAPICH2
- (MPI-2 over OpenFabrics-IB, OpenFabrics-iWARP, uDAPL and TCP/IP)
- An MPI-2 implementation which includes all MPI-1 features
- Based on MPICH2 and MVICH
- The latest release is MVAPICH2 1.2 (includes MPICH2 1.0.7)
55 Open MPI Version 1.3.1
- http://www.open-mpi.org
- High-performance message-passing library
- Open-source MPI-2 implementation
- Developed and maintained by a consortium of academic, research, and industry partners
- Many OSes supported
56 MPIEXEC options
- Two major areas
- DEVICE
- PINNING
57 RUNTIME MPI Issues
- shm
- shared memory only (no sockets)
- ssm
- combined sockets + shared memory (for clusters with SMP nodes)
- rdma
- RDMA-capable network fabrics including InfiniBand and Myrinet (via DAPL)
- rdssm
- combined sockets + shared memory + DAPL
- for clusters with SMP nodes and RDMA-capable network fabrics
58 Typical mpiexec command
- mpiexec -genv I_MPI_DEVICE rdssm \
-         -genv I_MPI_PIN 1 \
-         -genv I_MPI_PIN_PROCESSOR_LIST 0,2-3,4 \
-         -np 16 -perhost 4 ./a.out
- -genv X Y associates env var X with value Y for all MPI ranks
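To verify where ranks actually land when pinning is enabled, a small test program can print each rank's current CPU; a sketch (sched_getcpu is a glibc extension and may be missing on older glibc):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>   /* sched_getcpu (glibc extension) */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* Which logical CPU is this rank on right now? */
        printf("rank %d of %d on cpu %d\n", rank, size, sched_getcpu());
        MPI_Finalize();
        return 0;
    }

Launched with the mpiexec line above, the printed CPU numbers should match I_MPI_PIN_PROCESSOR_LIST.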
59 MPI_DEVICE for CubedSphere
- ssm: 250 seconds
- rdma: 250 seconds
- rdssm: 250 seconds
60 Rank Pinning
61 Task Affinity
- taskset
- taskset -c 0,1,4,5 ...
- numactl
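Affinity can also be set from inside a process; a minimal sketch using the Linux sched_setaffinity API, with the same CPU set as the taskset example above:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>   /* cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity */

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        /* Restrict this process (pid 0 = self) to CPUs 0, 1, 4, 5. */
        CPU_SET(0, &mask); CPU_SET(1, &mask);
        CPU_SET(4, &mask); CPU_SET(5, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the workload here ... */
        return 0;
    }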
62 Interactive Tools for Monitoring
63 SMT
- BIOS option, set at boot time
- Runs 2 threads at the same time per core
- Threads share resources (execution units)
- Takes advantage of the 4-wide execution engine
- Keeps it fed with multiple threads
- Hides the latency of a single thread
- Most power-efficient performance feature
- Very low die-area cost
- Can provide significant performance benefit depending on the application
- Much more efficient than adding an entire core
- Implications for out-of-order execution
- Might be good for MPI + OpenMP
- Might lead to extra BW pressure and pressure on the L1/L2/L3 caches
64 SMT + MPI
- NOAA NCEP GFS code, T190 (240-hour simulation)
- SMT OFF: 9709 seconds
- SMT ON + TURBO ON: 7276 seconds
65 TURBO
- Turbo mode boosts operating frequency based on thermal headroom:
- when the processor is operating below its peak power,
- it increases the clock speed of the active cores by one or more bins to increase performance
- Common reasons for operating below peak power:
- one or more cores may be powered down
- the active workload is relatively low-power (e.g. no floating point, or few memory accesses)
- Active cores can increase their clock frequency in relatively coarse increments of 133 MHz speed bins, depending on:
- the SKU
- the available power
- thermal headroom
- other environmental factors
66 SMT and TURBO
67 SMT + Hybrid MPI + OpenMP
- 8 MPI tasks
- OMP_NUM_THREADS=1
- OMP_NUM_THREADS=2
- Potentially a good way to exploit SMT; a minimal skeleton follows
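A hedged hybrid skeleton, illustrative only (not the NCCS codes):

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);                 /* e.g. 8 MPI tasks */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* With OMP_NUM_THREADS=2, each rank runs two threads, e.g. the
           two hardware threads of one core when SMT is on. */
        #pragma omp parallel
        {
            printf("rank %d thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }
        MPI_Finalize();
        return 0;
    }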
68 Partially Filled Nodes
69 MPI optimization
- Affinity
- Mapping Tasks to Nodes
- Mapping Tasks to Cores
- Barriers
- Collectives
- Environment variables
- Partially/Fully loaded nodes
70 SHM vs. SSM
- shm: 411 seconds
- ssm: 424 seconds
71 Events available for Oprofile
- CPU_CLK_UNHALTED
- UNHALTED_REFERENCE_CYCLES
- INST_RETIRED_ANY_P
- LLC_MISSES
- LLC_REFS
- BR_MISS_PRED_RETIRED
72 PAPI issues
- Agree 100% that performance tools are desperately needed
- Our LTC team has been actively driving the distros to add support
- A decision was made to drive perfmon2 as the preferred method
- Have had some success driving it into the next major releases, RHEL6 and SLES11
- Unfortunately, we (possibly) missed the first release of SLES11, so it will land in SP1
- That would be the first time we could officially support it as installed
- If run with the kernel patch, problems have to be reproduced on a non-patched system
- This has worked for POWER Linux users at some pretty large sites
- Use TDS systems as the vehicle to carry the patches and do some perf testing
- SCU5 without the PAPI patch and SCU6 with??
- If a kernel problem occurs that needs to be reproduced, it could just be rerun on SCU5??
73 Oprofile LLC_MISSES
74 Oprofile
- set CUR_DIR = `pwd`
- sudo rm -rf samples
- echo " shutdown "
- sudo opcontrol --shutdown
- echo " start-daemon "
- sudo opcontrol --verbose=all --start-daemon --no-vmlinux --session-dir=$CUR_DIR --separate=thread --callgraph=10 --event=LLC_REFS:10000 --image=$EXE
- sudo opcontrol --status
- echo " start "
- sudo opcontrol --start
- setenv OMP_NUM_THREADS 1
- mpiexec -genv I_MPI_DEVICE shm -perhost 8 -n $NUMPRO $EXE
- sudo opcontrol --stop
- echo " shutdown "
- sudo opcontrol --shutdown
75 I/O optimization
- High Performance Filesystems
- Striped disks
- GPFS (parallel filesystem)
- MIO
76 MIOSTAT Statistics Collection
- set MIOSTAT = /home/kghosh/vmio/tools/bin/miostats
- $MIOSTAT -v ./c2l.x
77 MIO optimized code execution
- setenv MIO /home/kghosh/vmio/tools
- setenv MIO_LIBRARY_PATH $MIO"/BLD/xLinux.64/lib"
- setenv LD_PRELOAD $MIO"/BLD/xLinux.64/lib/libTKIO.so"
- setenv TKIO_ALTLIB "fv.x$MIO/BLD/xLinux.64/lib/get_MIO_ptrs_64.so/abort"
- setenv MIO_STATS "./MIO.PID.stats"
- setenv MIO_FILES "*.nc [ trace/stats/mbytes | pf/cache=2g/page=2m/pref=2 | trace/stats/mbytes | async/nthread=2/naio=100/nchild=1 | trace/stats/mbytes ]"
78 MIO with C2L (CubeToLatLon)
- BEFORE
- Timestamp @ Start: 14:00:45, Cumulative time: 0.000 sec
- Timestamp @ Stop: 14:08:25, Cumulative time: 460.451 sec
- AFTER
- MIO_FILES "*.nc [ trace/stats/mbytes | pf/cache=2g/page=2m/pref=2 | trace/stats/mbytes | async/nthread=2/naio=100/nchild=1 | trace/stats/mbytes ]"
- Timestamp @ Start: 14:31:53, Cumulative time: 0.004 sec
- Timestamp @ Stop: 14:34:04, Cumulative time: 130.618 sec
79 GPFS I/O
- Timestamp @ Start: 10:14:44, Cumulative time: 0.012 sec
- Timestamp @ Stop: 10:15:12, Cumulative time: 27.835 sec
80 Non-invasive MPI Trace Tool from IBM
- No recompile needed
- Uses the PMPI layer
- mpiifort $(LDFLAGS) libmpi_trace.a *.o -o a.out
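The no-recompile property comes from the PMPI profiling layer: every MPI routine also exists under a PMPI_ name, so a trace library can define its own MPI_Isend, do its bookkeeping, and forward to the real one. A minimal sketch of one wrapper (not IBM's actual implementation):

    #include <mpi.h>

    static long isend_calls = 0;   /* what a real tool would time and log */

    /* Linking this ahead of the MPI library intercepts the call;
       the application itself is not recompiled. */
    int MPI_Isend(void *buf, int count, MPI_Datatype type, int dest,
                  int tag, MPI_Comm comm, MPI_Request *req)
    {
        isend_calls++;
        return PMPI_Isend(buf, count, type, dest, tag, comm, req);
    }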
81 MPI TRACE output
- Data for MPI rank 62 of 128

    -----------------------------------------------------------
    MPI Routine        calls     avg. bytes    time(sec)
    -----------------------------------------------------------
    MPI_Comm_size          1            0.0        0.000
    MPI_Comm_rank          1            0.0        0.000
    MPI_Isend         114554         4106.8        0.953
    MPI_Irecv         114554         4117.5        0.188
    MPI_Wait          229108            0.0        5.190
    MPI_Bcast             28           11.1        0.039
    MPI_Barrier            2            0.0        0.003
    MPI_Reduce             2            8.0        0.000
    -----------------------------------------------------------

- MPI task 62 of 128 had the median communication time
- total communication time: 6.373 seconds
- total elapsed time: 34.825 seconds
- user cpu time: 34.799 seconds
- system time: 0.002 seconds
82 MPI TRACE output
- Message size distributions

    MPI_Isend        calls    avg. bytes    time(sec)
                    114252        4096.0        0.911
                       302        8192.0        0.042
    MPI_Irecv        calls    avg. bytes    time(sec)
                    113954        4096.0        0.186
                       600        8192.0        0.002
    MPI_Bcast        calls    avg. bytes    time(sec)
                        24           4.0        0.034
                         2           8.0        0.006
                         2         100.0        0.000
    MPI_Reduce       calls    avg. bytes    time(sec)
                         2           8.0        0.000
83 CubedSphere on Nehalem and Harpertown (previous generation: 150 PEs, 9977 seconds)