NASA NCCS APPLICATION PERFORMANCE DISCUSSION

Slides: 84
Provided by: william581

1
NASA NCCS APPLICATION PERFORMANCE DISCUSSION
  • Koushik Ghosh, Ph.D.
  • IBM Federal HPC
  • HPC Technical Specialist

IBM System x iDataPlex - Parallel Scientific Applications
Development, April 22-23, 2009
2
Topics
  • HW & SW System Architecture
      • Platform/Chipset
      • Processor
      • Memory
      • Interconnect
  • Building Apps on System x
      • Compilation
      • MPI
  • Executing Apps on System x
      • Runtime options
      • Tools: Oprofile / MIO I/O Perf / MPI Trace
  • Discussion of NCCS apps

3
Scalable Unit Summary
4
2 SCU Configuration
5
SDR/DDR/QDR
6
iDataPlex footprint
7
Compute Node
  • iDataPlex 2U Flex
  • SCU 3 and SCU 4: Intel Harpertown (Xeon L5200),
    dual-socket, quad-core, 2.5 GHz, 50W
  • SCU 5: Nehalem, dual-socket, quad-core, 2.8? GHz

8
Harpertown Seaburg Chipset
9
Harpertown Intel Core2 Quad processor
10
Nehalem Tylersburg Chipset
11
Nehalem Intel Core i7 Processor
12
Nehalem QPI Quick Path Interconnect
13
Cache Details
14
cpuinfo (Harpertown) (/opt/intel/impi/3.1/bin64/cpuinfo)
  • Architecture: x86_64
  • Hyperthreading: disabled
  • Packages: 2
  • Cores: 8
  • Processors: 8
  • Processor identification
      Processor  Thread  Core  Package
      0          0       0     1
      1          0       0     0
      2          0       2     0
      3          0       2     1
      4          0       1     0
      5          0       3     0
      6          0       1     1
      7          0       3     1
  • Processor placement
      Package  Cores    Processors
      1        0,2,1,3  0,3,6,7
      0        0,2,1,3  1,2,4,5
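
The "Processor placement" summary can be derived from the "Processor identification" rows. The following is a minimal sketch in Python, not part of the cpuinfo tool: the row tuples are copied from the slide, and the function name `placement` is made up for illustration.

```python
from collections import defaultdict

# Rows copied from the Harpertown cpuinfo slide: (processor, thread, core, package).
ROWS = [
    (0, 0, 0, 1), (1, 0, 0, 0), (2, 0, 2, 0), (3, 0, 2, 1),
    (4, 0, 1, 0), (5, 0, 3, 0), (6, 0, 1, 1), (7, 0, 3, 1),
]

def placement(rows):
    """Group logical processors by package, in processor-number order."""
    by_package = defaultdict(list)
    for processor, _thread, core, package in sorted(rows):
        by_package[package].append((core, processor))
    return {
        pkg: {"cores": [c for c, _ in pairs], "processors": [p for _, p in pairs]}
        for pkg, pairs in by_package.items()
    }

table = placement(ROWS)
for pkg in sorted(table, reverse=True):
    print(pkg, table[pkg]["cores"], table[pkg]["processors"])
```

The grouping shows that consecutive processor numbers alternate irregularly between the two packages on this node, which matters when ranks are pinned by processor number.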

15
cpuinfo (Nehalem) (/opt/intel/impi/3.2.0.011/bin64/cpuinfo)
  • Architecture: x86_64
  • Hyperthreading: enabled
  • Packages: 2
  • Cores: 8
  • Processors: 16
  • Processor identification
      Processor  Thread  Core  Package
      0          0       0     0
      1          1       0     0
      2          0       1     0
      3          1       1     0
      4          0       2     0
      5          1       2     0
      6          0       3     0
      7          1       3     0
      8          0       0     1
      9          1       0     1
      10         0       1     1
      11         1       1     1

16
cat /proc/cpuinfo (Harpertown)
  • processor: 0
  • vendor_id: GenuineIntel
  • cpu family: 6
  • model: 23
  • model name: Intel(R) Xeon(R) CPU E5472 @ 3.00GHz
  • stepping: 6
  • cpu MHz: 2992.509
  • cache size: 6144 KB
  • physical id: 1
  • siblings: 4
  • core id: 0
  • cpu cores: 4
  • fpu: yes
  • fpu_exception: yes
  • cpuid level: 10
  • wp: yes
  • flags: fpu vme de pse tsc msr pae mce
    cx8 apic sep mtrr pge mca cmov pat pse36 clflush
    dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
    constant_tsc pni monitor ds_cpl vmx est tm2 cx16
    xtpr lahf_lm
  • bogomips: 5988.95
  • clflush size: 64

17
/proc/cpuinfo (Nehalem)
  • processor: 0
  • vendor_id: GenuineIntel
  • cpu family: 6
  • model: 26
  • model name: Intel(R) Xeon(R) CPU X5570 @ .
  • stepping: 4
  • cpu MHz: 2927.000
  • cache size: 8192 KB
  • physical id: 0
  • siblings: 8
  • core id: 0
  • cpu cores: 4
  • apicid: 0
  • initial apicid: 0
  • fpu: yes
  • fpu_exception: yes
  • cpuid level: 11
  • wp: yes
  • flags: fpu vme de pse tsc msr pae mce
    cx8 apic sep mtrr pge mca cmov pat pse36 clflush
    dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
    nx rdtscp lm constant_tsc arch_perfmon pebs bts
    rep_good xtopology pni dtes64 monitor ds_cpl vmx
    est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2
    lahf_lm ida tpr_shadow vnmi flexpriority ept vpid

18
cat /proc/meminfo
  • MemTotal: 24737232 kB
  • MemFree: 21152912 kB
  • Buffers: 77376 kB
  • Cached: 2230344 kB
  • SwapCached: 0 kB
  • Active: 1650908 kB
  • Inactive: 1720616 kB
  • Active(anon): 955796 kB
  • Inactive(anon): 0 kB
  • Active(file): 695112 kB
  • Inactive(file): 1720616 kB
  • Unevictable: 0 kB
  • Mlocked: 0 kB
  • SwapTotal: 2104472 kB
  • SwapFree: 2104472 kB
  • Dirty: 536 kB
  • Writeback: 0 kB
  • AnonPages: 955608 kB
  • Mapped: 28632 kB
  • Slab: 123752 kB
  • SReclaimable: 101028 kB
  • SUnreclaim: 22724 kB
  • PageTables: 5364 kB
  • NFS_Unstable: 0 kB
  • Bounce: 0 kB
  • WritebackTmp: 0 kB
  • CommitLimit: 14473088 kB
  • Committed_AS: 1156568 kB
  • VmallocTotal: 34359738367 kB
  • VmallocUsed: 337244 kB
  • VmallocChunk: 34359395451 kB
  • HugePages_Total: 0
  • HugePages_Free: 0
  • HugePages_Rsvd: 0
  • HugePages_Surp: 0
  • Hugepagesize: 2048 kB

19
Meminfo explanation
  • High-Level Statistics
  • MemTotal: total usable RAM (i.e. physical RAM
    minus a few reserved bits and the kernel binary
    code)
  • MemFree: the sum of LowFree + HighFree (overall
    stat)
  • MemShared: 0; kept for compatibility reasons but
    always zero
  • Buffers: memory in the buffer cache; mostly useless
    as a metric nowadays
  • Cached: memory in the pagecache (diskcache) minus
    SwapCache
  • SwapCache: memory that was once swapped out and has
    been swapped back in, but is still also in the
    swapfile (if memory is needed it does not have to
    be swapped out AGAIN because it is already in the
    swapfile; this saves I/O)
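
As a rough illustration of how these fields can be used, the sketch below parses meminfo-style text and derives two summary figures. It assumes the usual `Name: value kB` layout of /proc/meminfo; the helper name `parse_meminfo` and the choice of summary figures are illustrative, and the sample values are the ones shown on the meminfo slide.

```python
def parse_meminfo(text):
    """Return {field: value}; values are in kB except the HugePages_* counters."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields

# Values copied from the meminfo slide.
SAMPLE = """\
MemTotal:       24737232 kB
MemFree:        21152912 kB
Buffers:           77376 kB
Cached:          2230344 kB
"""

info = parse_meminfo(SAMPLE)
total_gib = info["MemTotal"] / 1024**2
# Free memory plus buffer cache and page cache, as a rough upper bound on what
# an application could obtain without swapping.
reclaimable_kb = info["MemFree"] + info["Buffers"] + info["Cached"]
print(f"MemTotal: {total_gib:.2f} GiB")
print(f"free + buffers + cached: {100 * reclaimable_kb / info['MemTotal']:.1f}% of MemTotal")
```

On a Linux node the same function can be fed the live file, for example `parse_meminfo(open("/proc/meminfo").read())`.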

20
Memory
  • Memory on Harpertown Compute Node (SCU3 and SCU4)
      • 4 x 4GB (9W) PC2-5300 CL5 ECC DDR2 667MHz FBDIMMs
      • 16 GB per node
  • Memory on Nehalem Compute Node
      • 3 DDR3 channels on each socket / total of 8 DIMM
        slots
      • e.g. 4 GB DIMM on each DDR3 channel (24GB/node),
        1300 MHz
      • e.g. 18GB per node (1066 MHz)
          • 2GB/2GB/2GB on channel 1
          • 2GB/2GB/2GB on channel 2
          • 1GB on channel 3

21
Interconnect
  • (1) Mellanox ConnectX dual-port DDR IB 4X HCA,
    PCIe 2.0 x8
  • IB 4X DDR Cisco 9024D 288-port DDR switches for
    each scalable unit, cabled in the following
    manner:
      • 256 ports to compute nodes
      • 2 ports to spare compute nodes
      • 6 ports to service nodes
      • 24 ports uplinked to Tier 1 InfiniBand switch
  • ConnectX InfiniBand 4X DDR HCAs
      • 16 Gb/second of uni-directional peak MPI
        bandwidth
      • less than 2 microseconds MPI latency
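
The port counts and the 16 Gb/second figure can be cross-checked with a little arithmetic. The sketch below takes the port allocation from the slide; the lane count, per-lane signalling rate and 8b/10b encoding overhead are assumptions about InfiniBand DDR 4X links that the slide does not state.

```python
# Port allocation per 288-port switch, copied from the slide.
PORTS = {"compute": 256, "spare compute": 2, "service": 6, "uplink to Tier 1": 24}

# Assumed InfiniBand DDR 4X link parameters (not stated on the slide).
LANES = 4
SIGNAL_GBPS_PER_LANE = 5.0     # DDR signalling rate per lane
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line encoding

total_ports = sum(PORTS.values())
data_rate_gbps = LANES * SIGNAL_GBPS_PER_LANE * ENCODING_EFFICIENCY

# Ratio of node-facing ports to uplink ports on one scalable-unit switch.
node_ports = total_ports - PORTS["uplink to Tier 1"]
node_to_uplink_ratio = node_ports / PORTS["uplink to Tier 1"]

print(total_ports, data_rate_gbps, node_to_uplink_ratio)
```

Under these assumptions the four allocations account for every port on the switch, and the link data rate matches the peak bandwidth quoted above.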

22
Nehalem Features
  • The Nehalem microarchitecture has many new
    features, some of which are present in the Core
    i7. The ones that represent significant changes
    from the Core 2 include:
  • The new LGA 1366 socket is incompatible with
    earlier processors.
  • On-die memory controller: the memory is directly
    connected to the processor. It is called the
    uncore part and runs at a different clock (the
    uncore clock) than the execution cores.
  • Three-channel memory: each channel can support
    one or two DDR3 DIMMs. Motherboards for Core i7
    generally have three, four (3+1) or six DIMM
    slots.
  • Support for DDR3 only.
  • No ECC support.
  • The front side bus has been replaced by the Intel
    QuickPath Interconnect interface. Motherboards
    must use a chipset that supports QuickPath.
  • The following caches:
      • 32 KB L1 instruction and 32 KB L1 data cache per
        core
      • 256 KB L2 cache (combined instruction and data)
        per core
      • 8 MB L3 (combined instruction and data),
        "inclusive", shared by all cores
  • Single-die device: all four cores, the memory
    controller, and all cache are on a single die.

23
Nehalem Features contd.
  • "Turbo Boost" technology
      • allows all active cores to intelligently clock
        themselves up
      • in steps of 133 MHz over the design clock rate
      • as long as the CPU's predetermined
        thermal/electrical requirements are still met.
  • Re-implemented Hyper-threading.
      • Each of the four cores can process up to two
        threads simultaneously,
      • so the processor appears to the OS as eight CPUs.
      • This feature was dropped in Core (Harpertown).
  • Only one QuickPath interface: not intended for
    multi-processor motherboards.
  • 45nm process technology.
  • 731M transistors.
  • 263 mm^2 die size.
  • Sophisticated power management places unused cores
    in a zero-power mode.
  • Support for the SSE4.2 and SSE4.1 instruction sets.

24
I/O and Filesystem
  • /discover/home
  • /discover/nobackup
  • IBM Global Parallel File System (GPFS) used on
    all nodes
  • Serviced by 4 I/O nodes
  • read/write access from all nodes
  • /discover/home 2TB, /discover/nobackup 4 TB
  • Individual quota
  • /discover/home 500 MB
  • /discover/nobackup 100 GB
  • Very fast (peak 350 MB/sec, normal 150 - 250
    MB/sec)

25
Software
  • OS: Linux (RHEL5.2)
  • Compilers: Intel Fortran, C/C++
  • Math libs: BLAS, LAPACK, ScaLAPACK, MKL
  • MPI: MPI-2
  • Scheduler: PBSPro

26
LINUX pagesize
  • getconf PAGESIZE
  • 4096

27
Which modules
  • modules loaded (64-bit compilers)
  • intel-cce-10.1.017
  • intel-fce-10.1.017
  • intel-mkl-10.0.3.020
  • intel-mpi-3.1-64bit
  • /opt/intel/fce/10.1.017/bin/ifort
  • /opt/intel/impi/3.1/bin64/mpiifort
  • modules loaded (32-bit compilers)
  • intel-cc-10.1.017
  • intel-fc-10.1.017
  • /opt/intel/fc/10.1.017/bin/ifort
  • /opt/intel/impi/3.1/bin/mpiifort

28
IFC Compiler Options of some Physics/Chemistry/Climate Applications
  • CubedSphere: -safe_cray_ptr -i_dynamic -convert
    big_endian -assume byterecl -ftz -i4 -r8 -O3 -xS
  • NPB3.2: -O3 -xT -ip -no-prec-div -ansi-alias
    -fno-alias
  • HPCC: -O2 -xT
  • GAMESS: -O3 -xS -ipo -i-static -fno-pic -ipo
  • GTC: -O1
  • CAM: -O3 -xT
  • MILC: -O3 -xT
  • PARATEC: -O3 -xS -ipo -i-static -fno-fnalias
    -fno-alias
  • STREAM: -O3 -opt-streaming-stores always -xS -ip
  • SpecCPU2006: ?????

29
Optimization Level O2 (-O2)
  • Inlining of intrinsics
  • Intra-file interprocedural optimizations:
    inlining, constant propagation, forward
    substitution, routine attribute propagation,
    variable address-taken analysis, dead static
    function elimination, removal of unreferenced
    variables
  • Constant propagation, copy propagation,
    dead-code elimination, global register
    allocation, global instruction scheduling and
    control speculation, loop unrolling, optimized
    code selection, partial redundancy elimination,
    strength reduction/induction, variable renaming,
    exception handling optimizations, tail
    recursions, peephole optimizations, structure
    assignment lowering and optimizations, dead store
    elimination

30
Optimization Level O3 (-O3)
  • Enables O2 optimizations plus more aggressive
    optimizations, such as
  • prefetching, scalar replacement
  • loop and memory access transformations.
  • Loop unrolling, including instruction scheduling
  • Code replication to eliminate branches
  • Padding the size of certain power-of-two arrays
    to allow more efficient cache use.
  • O3 optimizations may not cause higher performance
    unless loop and memory access transformations
    take place.
  • O3 optimizations may slow down code in some cases
    compared to O2 optimizations.
  • O3 option is recommended for
  • loops that heavily use floating-point
    calculations
  • Loops that process large data sets.

31
O2 vs. O3
  • O2 will get a significant amount of performance
  • Depends on code constructs, memory optimizations
  • Both of these should be experimented with

32
Interprocedural Optimizations (-ip)
  • Interprocedural optimizations for single file
    compilation.
  • Subset of full intra-file interprocedural
    optimizations
  • e.g. Perform inline function expansion for calls
    to functions defined within the current source
    file.

33
Interprocedural Optimization (-ipo)
  • Multi-file ip optimizations that include:
      • inline function expansion
      • interprocedural constant propagation
      • dead code elimination
      • propagation of function characteristics
      • passing arguments in registers
      • loop-invariant code motion

34
Inlining
  • -inline-level=<n>: control inline expansion
      • n=0: disable inlining
      • n=1: no inlining (unless -ip specified)
      • n=2: inline any function, at the compiler's
        discretion (same as -ip)
  • -f[no-]inline-functions: inline any function at
    the compiler's discretion
  • -finline-limit=<n>: set maximum number of
    statements to be considered for inlining
  • -no-inline-min-size: no size limit for inlining
    small routines
  • -no-inline-max-size: no size limit for inlining
    large routines

35
Did Inlining, IPO and PGO Help?
  • Use selectively on bottlenecks
  • Better for small chunks of code

36
The fast Option
  • Include options that can improve run-time
    performance
  • -O3   (maximum speed and high-level
    optimizations)
  • -ipo (enables interprocedural optimizations
    across files)
  • -xT  (generate code specialized for Intel(R)
    Xeon(R) processors with SSE3, SSE4 etc.)
  • -static  Statically link in libraries at link
    time
  • -no-prec-div (disable -prec-div) where -prec-div
    improves precision of FP divides (some speed
    impact)

37
SSE and Vectorization
  • -xT: Intel(R) Core(TM)2 processor family with
    SSSE3 (Harpertown); use -xSSSE3
  • -xS: future Intel processors supporting SSE4
    Vectorizing Compiler and Media Accelerator
    instructions; use -xSSE4.1
  • -xsse4.2 for Nehalem processors (SSE4.2
    instructions)
  • -xsse4.1 for Nehalem processors (SSE4.1
    instructions)

38
What is SSE4
  • SSE: Streaming SIMD Extensions (SSE = SSE1, SSE2,
    SSE3)
  • SSSE3: Supplemental SSE3
  • SSE4 consists of 54 instructions divided into two
    major categories:
      • Vectorizing Compiler and Media Accelerators:
        graphics / video encoding and processing / 3-D
        imaging / gaming, and high-performance
        applications
      • Efficient Accelerated String and Text Processing:
        will benefit database and data mining
        applications, and those that utilize parsing,
        search, and pattern matching algorithms like
        virus scanners and compilers
  • A subset of 47 instructions, SSE4.1, is available
    in Penryn (Core 2), i.e. Harpertown
  • SSE4.2 is first available in Core i7 (aka
    Nehalem)

39
Vectorization (Intra register) -vec
  • Example loop:
      void vecadd(float *a, float *b, float *c, int n)
      {
          int i;
          for (i = 0; i < n; i++)
              c[i] = a[i] + b[i];
      }
  • The Intel compiler will transform the loop to
    allow four floating-point additions to occur
    simultaneously using the addps instruction.
    Simply put, using a pseudo-vector notation, the
    result would look something like this:
      for (i = 0; i < n; i += 4)
          c[i:i+3] = a[i:i+3] + b[i:i+3];
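
The transformation can be emulated in Python to show the blocked loop structure and the remainder loop that is needed when n is not a multiple of four. This is only a sketch of the idea: Python does not emit addps, and the function names are invented for illustration.

```python
def vecadd_scalar(a, b):
    """Reference version: one addition per iteration."""
    return [x + y for x, y in zip(a, b)]

def vecadd_blocked(a, b, width=4):
    """Add in blocks of `width` elements, then finish the leftover tail one by one."""
    n = len(a)
    c = [0.0] * n
    main = n - n % width
    for i in range(0, main, width):
        # One iteration stands in for a single packed add (addps handles 4 floats).
        c[i:i + width] = [a[i + k] + b[i + k] for k in range(width)]
    for i in range(main, n):
        # Remainder loop for the elements that do not fill a whole block.
        c[i] = a[i] + b[i]
    return c

a = [float(i) for i in range(10)]
b = [2.0 * i for i in range(10)]
print(vecadd_blocked(a, b) == vecadd_scalar(a, b))
```

With ten elements the blocked loop runs twice and the remainder loop handles the last two elements.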

40
OpenMP
  • -openmp: generate multi-threaded code based on the
    OpenMP directives
  • -openmp-profile: enable analysis of OpenMP
    applications; requires the Intel(R) Thread
    Profiler to be installed
  • -openmp-stubs: compile OpenMP programs in
    sequential mode; OpenMP directives are ignored
    and a stub OpenMP library is linked
  • -openmp-report{0|1|2}: control the OpenMP
    parallelizer diagnostic level

41
Auto Parallel (-parallel)
  • -parallel
  • generate multithreaded code for loops that can be
    safely executed in parallel.
  • Must use -O2 or -O3.
  • The default numbers of threads spawned is equal
    to the number of processors detected in the
    system where the binary is compiled
  • can be changed by setting the environment
    variable OMP_NUM_THREADS
  • -parallel-report is very useful

42
Auto Parallel Experiment Outcome
  • 8 cores
  • 6 MPI tasks
  • OMP_NUM_THREADS=2
  • -stack_temps -safe_cray_ptr -i_dynamic -convert
    big_endian -assume byterecl -i4 -r8 -w95 -O3
    -inline-level=2
  • Total runtime: 415 seconds
  • -stack_temps -safe_cray_ptr -i_dynamic -convert
    big_endian -assume byterecl -ftz -i4 -r8 -w95 -O3
    -inline-level=2 -parallel
  • Total runtime: 594 seconds
  • Have to include -parallel in LDFLAGS

43
Profile Guided Optimization (PGO)
  • Traditional static compilation model
  • Optimization decisions based on only an estimate
    of important execution characteristics.
  • Branch probabilities are estimated by assuming
    that controlling conditions that test equality
    are less likely to succeed than conditions that
    test inequality.
  • Relative execution counts are based on static
    properties such as nesting depth.
  • These estimated execution characteristics are
    subsequently used to make optimization decisions
  • such as selecting an appropriate instruction
    layout,
  • procedure inlining
  • generating a sequential and vector version of a
    loop.
  • The quality of such decisions can substantially
    improve if more accurate execution
    characteristics are available, which becomes
    possible under profile-guided optimization.

44
PGO steps
  • Phase 1 (Compile)
      • mpiifort -O3 -prof-gen -prof-dir dirx
  • Phase 2 (Run code, collect profile)
      • run_CubedSphere_BMK2.sh > BMK2.out 2>&1
      • Produces .dyn files
      • comp_fv/49e5dea6_12099.dyn etc.
  • Phase 3 (Recompile)
      • mpiifort -O3 -prof-use -prof-dir dirx
      • ipo: remark #11000: performing multi-file
        optimizations
      • ipo-1: remark #11004: starting multi-object
        compilation
  • Phase 4 (Re-run code)
      • rerun_CubedSphere_BMK2.sh > BMK2.out 2>&1
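
The two compile phases differ only in the profiling flag, which a small build helper can make explicit. The sketch below only assembles the command lines and does not run them; the function name `pgo_commands`, the source file name `model.f90` and the output name are hypothetical, while the compiler name, flags and profile directory are the ones shown on the slide.

```python
def pgo_commands(sources, prof_dir="dirx", compiler="mpiifort", output="a.out"):
    """Return the instrumented (phase 1) and profile-using (phase 3) command lines."""
    common = [compiler, "-O3"]
    tail = list(sources) + ["-o", output]
    instrumented = common + ["-prof-gen", "-prof-dir", prof_dir] + tail
    optimized = common + ["-prof-use", "-prof-dir", prof_dir] + tail
    return instrumented, optimized

# "model.f90" is a placeholder source file, not one named in the presentation.
phase1, phase3 = pgo_commands(["model.f90"])
print(" ".join(phase1))
print(" ".join(phase3))
```

Between the two commands the instrumented binary has to be run on a representative workload so that the .dyn files exist in the profile directory.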

45
PGO Outcome mxm
  • -O3 -prof-gen: 96 seconds
  • -O3 -prof-use: 10 seconds
  • -O2: 27 seconds

46
Optimization Reports
  • -vec-report<n>: control amount of vectorizer
    diagnostic information
      • n=3: indicate vectorized/non-vectorized loops
        and prohibiting data dependence info
  • -opt-report <n>: generate an optimization report
    to stderr
      • n=3: maximum report output
  • -opt-report-file=<file>: specify the filename for
    the generated report
  • -opt-report-routine=<name>: report on routines
    containing the given name

47
Inter procedure optimization (-ipo)
  • Multi-file ip optimizations that include:
      • inline function expansion
      • interprocedural constant propagation
      • dead code elimination
      • propagation of function characteristics
      • passing arguments in registers
      • loop-invariant code motion

48
Compiler Options Simple MXM Example
49
CubedSphere Performance for various IFC Options
50
Summary of MPI options
  • Stacks available
  • OFED 1.3.1 / OFED 1.4.1
  • MPI Implementations
  • Intel MPI 3.2
  • Mvapich1 1.0.1
  • Mvapich2 1.2.6
  • OpenMPI 3.1
  • Compilers
  • Intel compilers
  • intel-cce-10.1.017
  • intel-fce-10.1.017
  • PGI
  • Pathscale
  • gcc

51
Which MPI flavor
  • intel-mpi-3.1-64bit
  • intel-openmpi-1.2.6
  • intel-mvapich-1.0.1
  • intel-mvapich2-1.2rc2
  • gcc-openmpi-1.2.6
  • gcc-mvapich-1.0.1
  • gcc-mvapich2-1.2rc2
  • pathscale-openmpi-1.2.6
  • pathscale-mvapich-1.0.1
  • pathscale-mvapich2-1.2rc2
  • pgi-openmpi-1
  • pgi-mvapich-1.0.1
  • pgi-mvapich2-1.2rc2
  • ofed-1.4-pgi-openmpi-1.2.8
  • ofed-1.4-pgi-mvapich-1.1.0
  • ofed-1.4-pgi-mvapich2-1.2p1
  • ofed-1.4-gcc-openmpi-1.2.8
  • ofed-1.4-gcc-mvapich-1.1.0
  • ofed-1.4-gcc-mvapich2-1.2p1
  • ofed-1.4-pathscale-openmpi-1.2.8
  • ofed-1.4-pathscale-mvapich-1.1.0
  • ofed-1.4-pathscale-mvapich2-1.2p1

52
OpenFabrics Enterprise Distribution OFED 1.3.1/1.4
  • The OpenFabrics Alliance software stacks: OFED
    1.3.1/1.4.x
  • Goal: develop, distribute and promote a unified,
    transport-independent, open-source software
    stack
      • for RDMA-capable fabrics and networks, including
        InfiniBand and Ethernet
      • developed for many hardware architectures and
        operating systems, including Linux and Windows
      • for server and storage clustering and grid
        connectivity
      • optimized for performance (i.e., BW, low latency)
        using transport-offload technologies available
        in adapter hardware

53
MVAPICH
  • MVAPICH
  • (MPI-1 over OpenFabrics/Gen2, OpenFabrics/Gen2-UD,
    uDAPL, InfiniPath, VAPI and TCP/IP)
  • MPI-1 implementation
  • Based on MPICH and MVICH
  • The latest release is MVAPICH 1.1 (includes MPICH
    1.2.7).
  • It is available under BSD licensing.

54
MVAPICH2
  • MVAPICH2
  • MPI-2 over OpenFabrics-IB, OpenFabrics-iWARP,
    uDAPL and TCP/IP
  • MPI-2 implementation which includes all MPI-1
    features.
  • Based on MPICH2 and MVICH.
  • The latest release is MVAPICH2 1.2 (includes
    MPICH2 1.0.7).

55
Open MPI Version 1.3.1
  • http://www.open-mpi.org
  • High performance message passing library
  • Open source MPI-2 implementation
  • Developed and maintained by a consortium of
    academic, research, and industry partners
  • Many OS supported

56
MPIEXEC options
  • Two major areas
  • DEVICE
  • PINNING

57
RUNTIME MPI Issues
  • shm: shared memory only (no sockets)
  • ssm: combined sockets + shared memory (for
    clusters with SMP nodes)
  • rdma: RDMA-capable network fabrics including
    InfiniBand, Myrinet (via DAPL)
  • rdssm: combined sockets + shared memory + DAPL
    (for clusters with SMP nodes and RDMA-capable
    network fabrics)

58
Typical mpiexec command
  • mpiexec -genv I_MPI_DEVICE rdssm \
      -genv I_MPI_PIN 1 \
      -genv I_MPI_PIN_PROCESSOR_LIST 0,2-3,4 \
      -np 16 -perhost 4 a.out
  • -genv X Y: associate env var X with value Y for
    all MPI ranks.
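
The processor list mixes single processor numbers and ranges. The sketch below expands such a list and pairs it with the local ranks on one host; it assumes that the i-th rank on a node is pinned to the i-th listed processor, and both function names are illustrative rather than part of Intel MPI.

```python
def expand_processor_list(spec):
    """Expand a list such as "0,2-3,4" into [0, 2, 3, 4]."""
    cpus = []
    for item in spec.split(","):
        first, dash, last = item.partition("-")
        if dash:
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(first))
    return cpus

def local_rank_to_cpu(spec, ranks_per_host):
    """Assumed mapping: local rank i on a host goes to the i-th listed processor."""
    cpus = expand_processor_list(spec)
    return dict(zip(range(ranks_per_host), cpus))

print(expand_processor_list("0,2-3,4"))
print(local_rank_to_cpu("0,2-3,4", 4))
```

With `-perhost 4`, the four entries of the expanded list cover the four ranks placed on each node.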

59
MPI_DEVICE for CubedSphere
  • ssm: 250 seconds
  • rdma: 250 seconds
  • rdssm: 250 seconds

60
Rank Pinning
61
Task Affinity
  • taskset
  • taskset -c 0,1,4,5 .
  • numactl

62
Interactive Tools for Monitoring
  • top
  • mpstat
  • vmstat
  • iostat

63
SMT
  • BIOS option set at boot time
  • Run 2 threads at the same time per core
  • Share resources (execution units)
  • Take advantage of 4-wide execution engine
  • Keep it fed with multiple threads
  • Hide latency of a single thread
  • Most power efficient performance feature
  • Very low die area cost
  • Can provide significant performance benefit
    depending on application
  • Much more efficient than adding an entire core
  • Implications for Out of Order executions
  • Might be good for MPI + OpenMP
  • Might lead to extra BW pressure and pressure on
    L1 L2 L3 caches

64
SMT MPI
  • NOAA NCEP GFS code T190 (240 hour simulation)
  • SMT OFF: 9709 seconds
  • SMT ON, TURBO ON: 7276 seconds

65
TURBO
  • Turbo mode boosts operating frequency based on
    thermal headroom
  • when the processor is operating below its peak
    power,
  • increase the clock speed of the active cores by
    one or more bins to increase performance.
  • Common reasons for operating below peak power are
  • one or more cores may be powered down
  • the active workload is relatively low power (e.g.
    no floating point, or few memory accesses).
  • Active cores can increase their clock frequency
    in relatively coarse increments of 133MHz speed
    bins,
  • depending on the SKU
  • the available power
  • thermal headroom
  • other environmental factors.
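
The bin arithmetic can be written down directly. In the sketch below the 133 MHz bin size comes from the slide and the base clock from the "cpu MHz" line of the Nehalem /proc/cpuinfo slide; the function name is illustrative, and how many bins are actually reachable depends on the SKU, power and thermal conditions listed above.

```python
BIN_MHZ = 133  # size of one speed bin, as quoted on the slide

def turbo_frequency_mhz(base_mhz, bins):
    """Clock after stepping up `bins` speed bins from the base frequency."""
    if bins < 0:
        raise ValueError("bins must be non-negative")
    return base_mhz + bins * BIN_MHZ

# Base clock taken from the "cpu MHz" line of the Nehalem /proc/cpuinfo slide.
base = 2927
for bins in range(3):
    print(bins, turbo_frequency_mhz(base, bins))
```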

66
SMT and TURBO
67
SMT Hybrid MPI OpenMP
  • 8 MPI tasks
  • OMP_NUM_THREADS=1
  • OMP_NUM_THREADS=2
  • Potentially a good way to exploit SMP

68
Partially Filled Nodes
69
MPI optimization
  • Affinity
  • Mapping Tasks to Nodes
  • Mapping Tasks to Cores
  • Barriers
  • Collectives
  • Environment variables
  • Partially/Fully loaded nodes

70
SHM vs. SSM
  • shm 411 seconds
  • ssm 424 seconds

71
Events available for Oprofile
  • CPU_CLK_UNHALTED
  • UNHALTED_REFERENCE_CYCLES
  • INST_RETIRED_ANY_P
  • LLC_MISSES
  • LLC_REFS
  • BR_MISS_PRED_RETIRED

72
PAPI issues
  • Agree 100% that performance tools are desperately
    needed
  • Our LTC team has been actively driving the
    distros to add support.
  • End 2009, a decision was made to drive perfmon2
    as the preferred method
  • Have had some success in driving into next major
    releases of RHEL6 and SLES11
  • Unfortunately, (possibly) we missed the first
    release of SLES11 and it will be in SP1
  • This would be the first time we could officially
    support it installed.
  • Run with the kernel patch, problems have to be
    reproduced on a non-patched system.
  • This has worked for POWER Linux users at some
    pretty large sites.
  • Use TDS systems as the vehicle to have the
    patches and do some perf testing.
  • SCU5 without the PAPI patch and SCU6 with??.
  • If a kernel problem occurs that needs to be
    reproduced, it could just be rerun on SCU5??

73
Oprofile LLC_MISSES
74
Oprofile
  • set CUR_DIR = `pwd`
  • sudo rm -rf samples
  • echo " shutdown "
  • sudo opcontrol --shutdown
  • echo " start-daemon "
  • sudo opcontrol --verbose=all --start-daemon
    --no-vmlinux --session-dir=$CUR_DIR
    --separate=thread --callgraph=10
    --event=LLC_REFS:10000 --image=$EXE
  • sudo opcontrol --status
  • echo " start "
  • sudo opcontrol --start
  • setenv OMP_NUM_THREADS 1
  • mpiexec -genv I_MPI_DEVICE shm -perhost 8 -n
    $NUMPRO $EXE
  • sudo opcontrol --stop
  • echo " shutdown "
  • sudo opcontrol --shutdown

75
I/O optimization
  • High Performance Filesystems
  • Striped disks
  • GPFS (parallel filesystem)
  • MIO

76
MIOSTAT Statistics Collection
  • set MIOSTAT = /home/kghosh/vmio/tools/bin/miostats
  • $MIOSTAT -v ./c2l.x
77
MIO optimized code execution
  • setenv MIO /home/kghosh/vmio/tools
  • setenv MIO_LIBRARY_PATH $MIO"/BLD/xLinux.64/lib"
  • setenv LD_PRELOAD $MIO"/BLD/xLinux.64/lib/libTKIO.so"
  • setenv TKIO_ALTLIBX "fv.x[$MIO/BLD/xLinux.64/lib/get_MIO_ptrs_64.so/abort]"
  • setenv MIO_STATS "./MIO.PID.stats"
  • setenv MIO_FILES "*.nc [ \
    trace/stats/mbytes | \
    pf/cache=2g/page=2m/pref=2 | \
    trace/stats/mbytes | \
    async/nthread=2/naio=100/nchild=1 | \
    trace/stats/mbytes ]"

78
MIO with C2L (CubeToLatLon)
  • BEFORE
      • Timestamp @ Start 14:00:45, Cumulative time
        0.000 sec
      • Timestamp @ Stop 14:08:25, Cumulative time
        460.451 sec
  • AFTER
      • MIO_FILES "*.nc [
        trace/stats/mbytes |
        pf/cache=2g/page=2m/pref=2 |
        trace/stats/mbytes |
        async/nthread=2/naio=100/nchild=1 |
        trace/stats/mbytes ]"
      • Timestamp @ Start 14:31:53, Cumulative time
        0.004 sec
      • Timestamp @ Stop 14:34:04, Cumulative time
        130.618 sec
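
The wall-clock stamps and the cumulative times can be checked against each other. The sketch below assumes the six-digit stamps are HH:MM:SS values with the colons lost in extraction (14:00:45 and so on); the function name is illustrative, and the cumulative times are the ones reported on the slide.

```python
from datetime import datetime

def wall_seconds(start, stop):
    """Elapsed seconds between two HH:MM:SS wall-clock stamps on the same day."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)).total_seconds()

before = wall_seconds("14:00:45", "14:08:25")   # cumulative time reported: 460.451 s
after = wall_seconds("14:31:53", "14:34:04")    # cumulative time reported: 130.618 s
speedup = 460.451 / 130.618

print(before, after, round(speedup, 2))
```

Under that reading the wall-clock differences agree with the reported cumulative times to within a second, and the MIO settings cut the C2L run time by roughly a factor of 3.5.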

79
GPFS I/O
  • Timestamp @ Start 10:14:44, Cumulative time
    0.012 sec
  • Timestamp @ Stop 10:15:12, Cumulative time
    27.835 sec

80
Non invasive MPI Trace Tool from IBM
  • No recompile needed
  • Uses PMPI layer
  • mpiifort $(LDFLAGS) libmpi_trace.a *.o -o a.out

81
MPI TRACE output
  • Data for MPI rank 62 of 128
    -----------------------------------------------------------------
    MPI Routine           calls     avg. bytes     time(sec)
    -----------------------------------------------------------------
    MPI_Comm_size             1            0.0         0.000
    MPI_Comm_rank             1            0.0         0.000
    MPI_Isend            114554         4106.8         0.953
    MPI_Irecv            114554         4117.5         0.188
    MPI_Wait             229108            0.0         5.190
    MPI_Bcast                28           11.1         0.039
    MPI_Barrier               2            0.0         0.003
    MPI_Reduce                2            8.0         0.000
    -----------------------------------------------------------------
  • MPI task 62 of 128 had the median communication
    time.
  • total communication time: 6.373 seconds.
  • total elapsed time: 34.825 seconds.
  • user cpu time: 34.799 seconds.
  • system time: 0.002 seconds.
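
The per-routine times add up to the reported total communication time, which also gives the share of the run spent in MPI. A minimal sketch using the numbers from the trace output above; the variable names are illustrative.

```python
# (calls, avg. bytes, time in seconds) for rank 62, copied from the trace output.
ROUTINES = {
    "MPI_Comm_size": (1, 0.0, 0.000),
    "MPI_Comm_rank": (1, 0.0, 0.000),
    "MPI_Isend": (114554, 4106.8, 0.953),
    "MPI_Irecv": (114554, 4117.5, 0.188),
    "MPI_Wait": (229108, 0.0, 5.190),
    "MPI_Bcast": (28, 11.1, 0.039),
    "MPI_Barrier": (2, 0.0, 0.003),
    "MPI_Reduce": (2, 8.0, 0.000),
}
ELAPSED_SECONDS = 34.825

comm_time = sum(seconds for _calls, _bytes, seconds in ROUTINES.values())
comm_fraction = comm_time / ELAPSED_SECONDS
wait_share = ROUTINES["MPI_Wait"][2] / comm_time

print(f"communication: {comm_time:.3f} s ({100 * comm_fraction:.1f}% of elapsed)")
print(f"MPI_Wait share of communication: {100 * wait_share:.1f}%")
```

For this rank most of the communication time is spent in MPI_Wait, that is, waiting for the non-blocking sends and receives to complete.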

82
MPI TRACE OUTPUT
  • Message size distributions
    MPI_Isend        calls     avg. bytes     time(sec)
                    114252         4096.0         0.911
                       302         8192.0         0.042
    MPI_Irecv        calls     avg. bytes     time(sec)
                    113954         4096.0         0.186
                       600         8192.0         0.002
    MPI_Bcast        calls     avg. bytes     time(sec)
                        24            4.0         0.034
                         2            8.0         0.006
                         2          100.0         0.000
    MPI_Reduce       calls     avg. bytes     time(sec)
                         2            8.0         0.000
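
The per-routine averages in the rank summary are the call-weighted means of these size bins. The sketch below recomputes them from the bin data above; the function name `summarize` is illustrative.

```python
# (calls, avg. bytes) per message-size bin, copied from the distribution above.
BINS = {
    "MPI_Isend": [(114252, 4096.0), (302, 8192.0)],
    "MPI_Irecv": [(113954, 4096.0), (600, 8192.0)],
    "MPI_Bcast": [(24, 4.0), (2, 8.0), (2, 100.0)],
}

def summarize(bins):
    """Return (total calls, call-weighted average message size in bytes)."""
    calls = sum(n for n, _size in bins)
    total_bytes = sum(n * size for n, size in bins)
    return calls, total_bytes / calls

for routine, bins in BINS.items():
    calls, avg_bytes = summarize(bins)
    print(f"{routine}: {calls} calls, {avg_bytes:.1f} bytes on average")
```

The recomputed call counts and averages agree with the MPI_Isend, MPI_Irecv and MPI_Bcast rows of the per-rank table.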

83
CubedSphere on Nehalem and Harpertown (Previous
Generation, 150 PEs: 9977 seconds)