Title: NASA NCCS APPLICATION PERFORMANCE DISCUSSION
1 NASA NCCS Application Performance Discussion
- Koushik Ghosh, Ph.D.
- IBM Federal HPC, HPC Technical Specialist
- IBM System x iDataPlex - Parallel Scientific Applications Development
- April 22-23, 2009
2 Topics
- HW/SW System Architecture
- Platform/Chipset
- Processor
- Memory
- Interconnect
- Building Apps on System x
- Compilation
- MPI
- Executing Apps on System x
- Runtime options
- Tools: Oprofile / MIO (I/O performance) / MPI Trace
- Discussion of NCCS apps
3 Scalable Unit Summary
4 2-SCU Configuration
5 SDR/DDR/QDR
6 iDataPlex footprint
7 Compute Node
- iDataPlex 2U Flex
- SCU 3 / SCU 4: Intel Harpertown (Xeon L5200)
- dual-socket, quad-core, 2.5 GHz, 50 W
- SCU 5: Nehalem
- dual-socket, quad-core, 2.8? GHz
8 Harpertown: Seaburg Chipset
9 Harpertown: Intel Core 2 Quad Processor
10 Nehalem: Tylersburg Chipset
11 Nehalem: Intel Core i7 Processor
12 Nehalem: QPI (QuickPath Interconnect)
13 Cache Details
14 cpuinfo (Harpertown) (/opt/intel/impi/3.1/bin64/cpuinfo)
- Architecture: x86_64
- Hyperthreading: disabled
- Packages: 2
- Cores: 8
- Processors: 8
- Processor identification
    Processor  Thread  Core  Package
    0          0       0     1
    1          0       0     0
    2          0       2     0
    3          0       2     1
    4          0       1     0
    5          0       3     0
    6          0       1     1
    7          0       3     1
- Processor placement
    Package  Cores    Processors
    1        0,2,1,3  0,3,6,7
    0        0,2,1,3  1,2,4,5
15 cpuinfo (Nehalem) (/opt/intel/impi/3.2.0.011/bin64/cpuinfo)
- Architecture: x86_64
- Hyperthreading: enabled
- Packages: 2
- Cores: 8
- Processors: 16
- Processor identification
    Processor  Thread  Core  Package
    0          0       0     0
    1          1       0     0
    2          0       1     0
    3          1       1     0
    4          0       2     0
    5          1       2     0
    6          0       3     0
    7          1       3     0
    8          0       0     1
    9          1       0     1
    10         0       1     1
    11         1       1     1
16 cat /proc/cpuinfo (Harpertown)
- processor : 0
- vendor_id : GenuineIntel
- cpu family : 6
- model : 23
- model name : Intel(R) Xeon(R) CPU E5472 @ 3.00GHz
- stepping : 6
- cpu MHz : 2992.509
- cache size : 6144 KB
- physical id : 1
- siblings : 4
- core id : 0
- cpu cores : 4
- fpu : yes
- fpu_exception : yes
- cpuid level : 10
- wp : yes
- flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
- bogomips : 5988.95
- clflush size : 64
17 /proc/cpuinfo (Nehalem)
- processor : 0
- vendor_id : GenuineIntel
- cpu family : 6
- model : 26
- model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
- stepping : 4
- cpu MHz : 2927.000
- cache size : 8192 KB
- physical id : 0
- siblings : 8
- core id : 0
- cpu cores : 4
- apicid : 0
- initial apicid : 0
- fpu : yes
- fpu_exception : yes
- cpuid level : 11
- wp : yes
- flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
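Since slides 14-17 hinge on the siblings vs. cpu cores distinction, here is a small C sketch (a hypothetical helper, parsing only the two fields shown above) that infers whether Hyper-Threading is enabled:

    #include <stdio.h>

    /* When SMT is on, "siblings" (hardware threads per package) exceeds
       "cpu cores" (physical cores per package): 8 vs 4 on the Nehalem
       node above, 4 vs 4 on Harpertown. */
    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        int siblings = 0, cores = 0;

        if (!f) { perror("/proc/cpuinfo"); return 1; }
        while (fgets(line, sizeof line, f) && !(siblings && cores)) {
            sscanf(line, "siblings : %d", &siblings);
            sscanf(line, "cpu cores : %d", &cores);
        }
        fclose(f);
        printf("siblings=%d cores=%d -> HT %s\n",
               siblings, cores, siblings > cores ? "enabled" : "disabled");
        return 0;
    }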
18 cat /proc/meminfo
- MemTotal 24737232 kB
- MemFree 21152912 kB
- Buffers 77376 kB
- Cached 2230344 kB
- SwapCached 0 kB
- Active 1650908 kB
- Inactive 1720616 kB
- Active(anon) 955796 kB
- Inactive(anon) 0 kB
- Active(file) 695112 kB
- Inactive(file) 1720616 kB
- Unevictable 0 kB
- Mlocked 0 kB
- SwapTotal 2104472 kB
- SwapFree 2104472 kB
- Dirty 536 kB
- Writeback 0 kB
- AnonPages 955608 kB
- Mapped 28632 kB
- Slab 123752 kB
- SReclaimable 101028 kB
- SUnreclaim 22724 kB
- PageTables 5364 kB
- NFS_Unstable 0 kB
- Bounce 0 kB
- WritebackTmp 0 kB
- CommitLimit 14473088 kB
- Committed_AS 1156568 kB
- VmallocTotal 34359738367 kB
- VmallocUsed 337244 kB
- VmallocChunk 34359395451 kB
- HugePages_Total 0
- HugePages_Free 0
- HugePages_Rsvd 0
- HugePages_Surp 0
- Hugepagesize 2048 kB
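The headline fields can also be read programmatically; a minimal C sketch assuming only the /proc/meminfo format shown above:

    #include <stdio.h>

    /* Report the total and free memory fields from /proc/meminfo. */
    int main(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        long total = 0, freemem = 0;

        if (!f) { perror("/proc/meminfo"); return 1; }
        while (fgets(line, sizeof line, f)) {
            sscanf(line, "MemTotal: %ld kB", &total);
            sscanf(line, "MemFree: %ld kB", &freemem);
        }
        fclose(f);
        printf("MemTotal %ld kB, MemFree %ld kB\n", total, freemem);
        return 0;
    }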
19 Meminfo explanation
- High-level statistics
- MemTotal: total usable RAM (i.e., physical RAM minus a few reserved bits and the kernel binary code)
- MemFree: the sum of LowFree + HighFree (overall stat)
- MemShared: present for compatibility reasons, but always zero
- Buffers: memory in the buffer cache; mostly useless as a metric nowadays
- Cached: memory in the pagecache (disk cache) minus SwapCache
- SwapCache: memory that was once swapped out and has been swapped back in, but is still present in the swapfile (if memory is needed it does not have to be swapped out again, because it is already in the swapfile; this saves I/O)
20 Memory
- Memory on Harpertown compute nodes (SCU3 and SCU4)
- 4 x 4GB (9W) PC2-5300 CL5 ECC DDR2 667MHz FBDIMMs
- 16 GB per node
- Memory on Nehalem compute nodes
- 3 DDR3 channels on each socket / total of 8 DIMM slots
- e.g. a 4 GB DIMM on each DDR3 channel: 24 GB/node at 1333 MHz
- e.g. 18 GB per node at 1066 MHz:
- 2GB/2GB/2GB on channel 1
- 2GB/2GB/2GB on channel 2
- 1GB on channel 3
21 Interconnect
- (1) Mellanox ConnectX dual-port DDR IB 4X HCA, PCIe 2.0 x8
- IB 4X DDR Cisco 9024D 288-port DDR switches for each scalable unit, cabled in the following manner:
- 256 ports to compute nodes
- 2 ports to spare compute nodes
- 6 ports to service nodes
- 24 ports uplinked to the Tier 1 InfiniBand switch
- ConnectX InfiniBand 4X DDR HCAs
- 16 Gb/second of unidirectional peak MPI bandwidth
- less than 2 microseconds MPI latency
22 Nehalem Features
- The Nehalem microarchitecture has many new features, some of which are present in the Core i7. The ones that represent significant changes from the Core 2 include:
- The new LGA 1366 socket, incompatible with earlier processors.
- On-die memory controller: the memory is directly connected to the processor. It sits in the "uncore" part of the chip and runs at a different clock (the uncore clock) from the execution cores.
- Three-channel memory: each channel can support one or two DDR3 DIMMs. Motherboards for Core i7 generally have three, four (3+1), or six DIMM slots.
- Support for DDR3 only.
- No ECC support.
- The front side bus has been replaced by the Intel QuickPath Interconnect interface. Motherboards must use a chipset that supports QuickPath.
- The following caches:
- 32 KB L1 instruction and 32 KB L1 data cache per core
- 256 KB L2 cache (combined instruction and data) per core
- 8 MB L3 cache (combined instruction and data), "inclusive", shared by all cores
- Single-die device: all four cores, the memory controller, and all cache are on a single die.
23 Nehalem Features contd.
- "Turbo Boost" technology
- allows all active cores to intelligently clock themselves up
- in steps of 133 MHz over the design clock rate
- as long as the CPU's predetermined thermal/electrical requirements are still met
- Re-implemented Hyper-Threading
- Each of the four cores can process up to two threads simultaneously,
- so the processor appears to the OS as eight CPUs
- This feature was absent from the Core microarchitecture (e.g. Harpertown)
- Only one QuickPath interface: not intended for multi-processor motherboards
- 45nm process technology
- 731M transistors
- 263 mm2 die size
- Sophisticated power management can place an unused core in a zero-power mode
- Support for the SSE4.2 and SSE4.1 instruction sets
24 I/O and Filesystem
- /discover/home
- /discover/nobackup
- IBM General Parallel File System (GPFS) used on all nodes
- Serviced by 4 I/O nodes
- read/write access from all nodes
- /discover/home 2 TB, /discover/nobackup 4 TB
- Individual quota
- /discover/home 500 MB
- /discover/nobackup 100 GB
- Very fast (peak 350 MB/sec, normally 150-250 MB/sec)
25 Software
- OS: Linux (RHEL 5.2)
- Compilers: Intel Fortran, C/C++
- Math libs: BLAS, LAPACK, ScaLAPACK, MKL
- MPI: MPI-2
- Scheduler: PBS Pro
26 Linux pagesize
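A process can query the page size at run time; below is a minimal C sketch using the standard POSIX sysconf call (the file name is made up; the 2 MB Hugepagesize shown in /proc/meminfo above is a separate, optional page size):

    /* pagesize.c: print the base page size (4 KB on these x86_64 nodes). */
    #include <stdio.h>
    #include <unistd.h>   /* sysconf, _SC_PAGESIZE */

    int main(void)
    {
        long psize = sysconf(_SC_PAGESIZE);
        printf("page size: %ld bytes\n", psize);
        return 0;
    }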
27 Which modules
- modules loaded (64-bit compilers)
- intel-cce-10.1.017
- intel-fce-10.1.017
- intel-mkl-10.0.3.020
- intel-mpi-3.1-64bit
- /opt/intel/fce/10.1.017/bin/ifort
- /opt/intel/impi/3.1/bin64/mpiifort
- modules loaded (32-bit compilers)
- intel-cc-10.1.017
- intel-fc-10.1.017
- /opt/intel/fc/10.1.017/bin/ifort
- /opt/intel/impi/3.1/bin/mpiifort
28 IFC Compiler Options of some Physics/Chemistry/Climate Applications
- CubedSphere: -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -O3 -xS
- NPB3.2: -O3 -xT -ip -no-prec-div -ansi-alias -fno-alias
- HPCC: -O2 -xT
- GAMESS: -O3 -xS -ipo -i-static -fno-pic
- GTC: -O1
- CAM: -O3 -xT
- MILC: -O3 -xT
- PARATEC: -O3 -xS -ipo -i-static -fno-fnalias -fno-alias
- STREAM: -O3 -opt-streaming-stores always -xS -ip
- SpecCPU2006: ?????
29 Optimization Level O2 (-O2)
- Inlining of intrinsics
- Intra-file interprocedural optimizations:
- inlining, constant propagation, forward substitution
- routine attribute propagation, variable address-taken analysis
- dead static function elimination, removal of unreferenced variables
- constant propagation, copy propagation, dead-code elimination
- global register allocation
- global instruction scheduling and control speculation
- loop unrolling, optimized code selection
- partial redundancy elimination
- strength reduction / induction variable simplification
- variable renaming, exception handling optimizations
- tail recursion, peephole optimizations
- structure assignment lowering and optimizations
- dead store elimination
30 Optimization Level O3 (-O3)
- Enables O2 optimizations plus more aggressive optimizations, such as:
- prefetching, scalar replacement
- loop and memory access transformations
- loop unrolling, including instruction scheduling
- code replication to eliminate branches
- padding the size of certain power-of-two arrays to allow more efficient cache use
- O3 optimizations may not yield higher performance unless loop and memory access transformations take place
- O3 optimizations may slow down code in some cases compared to O2 optimizations
- The O3 option is recommended for:
- loops that heavily use floating-point calculations
- loops that process large data sets
31 O2 vs. O3
- O2 already delivers a significant share of the achievable performance
- The difference depends on code constructs and memory optimizations
- Both should be experimented with
32 Interprocedural Optimizations (-ip)
- Interprocedural optimizations for single-file compilation
- A subset of the full multi-file interprocedural optimizations (-ipo)
- e.g. performs inline function expansion for calls to functions defined within the current source file
33 Interprocedural Optimization (-ipo)
- Multi-file IP optimizations that include:
- inline function expansion
- interprocedural constant propagation
- dead code elimination
- propagation of function characteristics
- passing arguments in registers
- loop-invariant code motion
34 Inlining
- -inline-level=<n>
- controls inline expansion
- n=0: disable inlining
- n=1: no inlining (unless -ip is specified)
- n=2: inline any function, at the compiler's discretion (same as -ip)
- -finline-functions
- inline any function at the compiler's discretion
- -finline-limit=<n>
- sets the maximum number of statements to be considered for inlining
- -no-inline-min-size
- no size limit for inlining small routines
- -no-inline-max-size
- no size limit for inlining large routines
35 Did Inlining, IPO and PGO Help?
- Use selectively on bottlenecks
- Better for small chunks of code
36 The -fast Option
- Includes options that can improve run-time performance:
- -O3 (maximum speed and high-level optimizations)
- -ipo (enables interprocedural optimizations across files)
- -xT (generates code specialized for Intel(R) Xeon(R) processors with SSE3, SSE4, etc.)
- -static (statically link in libraries at link time)
- -no-prec-div (disables -prec-div, where -prec-div improves the precision of FP divides at some speed cost)
37 SSE and Vectorization
- -xT: Intel(R) Core(TM)2 processor family with SSSE3 (or use -xSSSE3)
- Harpertown
- -xS: future Intel processors supporting the SSE4 Vectorizing Compiler and Media Accelerator instructions (or use -xSSE4.1)
- -xSSE4.2 for Nehalem processors (SSE4.2 instructions)
- -xSSE4.1 for Nehalem processors (SSE4.1 instructions)
38 What is SSE4
- SSE: Streaming SIMD Extensions (SSE1, SSE2, SSE3)
- SSSE3: Supplemental SSE3
- SSE4.2 is first available in Core i7 (aka Nehalem)
- consists of 54 instructions divided into two major categories:
- Vectorizing Compiler and Media Accelerators
- Efficient Accelerated String and Text Processing
- Graphics / video encoding and processing / 3-D imaging / gaming
- High-performance applications
- Efficient Accelerated String and Text Processing will benefit database and data-mining applications, and those that use parsing, search, and pattern-matching algorithms such as virus scanners and compilers
- A subset of 47 instructions, SSE4.1, is in Penryn (Core 2) Harpertown
39 Vectorization (Intra-register) -vec

    void vecadd(float *a, float *b, float *c, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

- The Intel compiler will transform the loop to allow four floating-point additions to occur simultaneously using the addps instruction. Simply put, using a pseudo-vector notation, the result would look something like this:

    for (i = 0; i < n; i += 4)
        c[i:i+3] = a[i:i+3] + b[i:i+3];
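For illustration only, the addps transformation can be written by hand with SSE intrinsics; this is a sketch of the idea, not the compiler's actual output (the tail loop handles n not divisible by 4):

    #include <xmmintrin.h>   /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

    void vecadd_sse(float *a, float *b, float *c, int n)
    {
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* one addps */
        }
        for (; i < n; i++)                              /* scalar remainder */
            c[i] = a[i] + b[i];
    }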
40 OpenMP
- -openmp
- generates multi-threaded code based on the OpenMP directives
- -openmp-profile
- enables analysis of an OpenMP application; the Intel(R) Thread Profiler should be installed
- -openmp-stubs
- enables the user to compile OpenMP programs in sequential mode
- OpenMP directives are ignored and a stub OpenMP library is linked
- -openmp-report[0|1|2]
- controls the OpenMP parallelizer diagnostic level
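A minimal sketch of the kind of code -openmp compiles (the loop and file name are made up for illustration):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double sum = 0.0;
        int i;
        /* Independent iterations; the reduction clause combines the
           per-thread partial sums. Thread count comes from OMP_NUM_THREADS. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < 1000000; i++)
            sum += 1.0 / (1.0 + i);
        printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }

Assumed build line for the compilers above: icc -openmp omp_sum.c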
41 Auto Parallel (-parallel)
- -parallel
- generates multithreaded code for loops that can be safely executed in parallel
- Must use O2 or O3
- The default number of threads spawned equals the number of processors detected on the system where the binary runs
- can be changed by setting the environment variable OMP_NUM_THREADS
- -par-report is very useful
42 Auto Parallel Experiment Outcome
- 8 cores
- 6 MPI tasks
- OMP_NUM_THREADS=2
- -stack_temps -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -i4 -r8 -w95 -O3 -inline-level=2
- Total runtime: 415 seconds
- -stack_temps -safe_cray_ptr -i_dynamic -convert big_endian -assume byterecl -ftz -i4 -r8 -w95 -O3 -inline-level=2 -parallel
- Total runtime: 594 seconds
- Have to include -parallel in LDFLAGS
43 Profile Guided Optimization (PGO)
- Traditional static compilation model:
- optimization decisions are based only on an estimate of important execution characteristics
- Branch probabilities are estimated by assuming that controlling conditions that test equality are less likely to succeed than conditions that test inequality
- Relative execution counts are based on static properties such as nesting depth
- These estimated execution characteristics are subsequently used to make optimization decisions, such as:
- selecting an appropriate instruction layout
- procedure inlining
- generating a sequential and a vector version of a loop
- The quality of such decisions can improve substantially if more accurate execution characteristics are available, which becomes possible under profile-guided optimization
44 PGO steps
- Phase 1 (Compile)
- mpiifort -O3 -prof-gen -prof-dir dirx
- Phase 2 (Run code, collect profile)
- run_CubedSphere_BMK2.sh > BMK2.out 2>&1
- Produces .dyn files
- comp_fv/49e5dea6_12099.dyn, etc.
- Phase 3 (Recompile)
- mpiifort -O3 -prof-use -prof-dir dirx
- ipo: remark #11000: performing multi-file optimizations
- ipo-1: remark #11004: starting multi-object compilation
- Phase 4 (Re-run code)
- rerun_CubedSphere_BMK2.sh > BMK2.out 2>&1
45 PGO Outcome: mxm
- -O3 -prof-gen: 96 seconds
- -O3 -prof-use: 10 seconds
- -O2: 27 seconds
46 Optimization Reports
- -vec-report<n>
- controls the amount of vectorizer diagnostic information
- n=3: indicate vectorized/non-vectorized loops and prohibiting data-dependence info
- -opt-report <n>
- generates an optimization report to stderr
- n=3: maximum report output
- -opt-report-file=<file>
- specifies the filename for the generated report
- -opt-report-routine=<name>
- reports on routines containing the given name
47 Interprocedural Optimization (-ipo)
- Multi-file IP optimizations that include:
- inline function expansion
- interprocedural constant propagation
- dead code elimination
- propagation of function characteristics
- passing arguments in registers
- loop-invariant code motion
48 Compiler Options: Simple MXM Example
49 CubedSphere Performance for various IFC Options
50 Summary of MPI options
- Stacks available
- OFED 1.3.1 / OFED 1.4.1
- MPI implementations
- Intel MPI 3.2
- MVAPICH 1.0.1
- MVAPICH2 1.2.6
- OpenMPI 1.3.1
- Compilers
- Intel compilers
- intel-cce-10.1.017
- intel-fce-10.1.017
- PGI
- Pathscale
- gcc
51 Which MPI flavor
- intel-mpi-3.1-64bit
- intel-openmpi-1.2.6
- intel-mvapich-1.0.1
- intel-mvapich2-1.2rc2
- gcc-openmpi-1.2.6
- gcc-mvapich-1.0.1
- gcc-mvapich2-1.2rc2
- pathscale-openmpi-1.2.6
- pathscale-mvapich-1.0.1
- pathscale-mvapich2-1.2rc2
- pgi-openmpi-1
- pgi-mvapich-1.0.1
- pgi-mvapich2-1.2rc2
- ofed-1.4-pgi-openmpi-1.2.8
- ofed-1.4-pgi-mvapich-1.1.0
- ofed-1.4-pgi-mvapich2-1.2p1
- ofed-1.4-gcc-openmpi-1.2.8
- ofed-1.4-gcc-mvapich-1.1.0
- ofed-1.4-gcc-mvapich2-1.2p1
- ofed-1.4-pathscale-openmpi-1.2.8
- ofed-1.4-pathscale-mvapich-1.1.0
- ofed-1.4-pathscale-mvapich2-1.2p1
52 OpenFabrics Enterprise Distribution: OFED 1.3.1/1.4
- The OpenFabrics Alliance software stacks: OFED 1.3.1/1.4.x
- Goal: develop, distribute, and promote a unified, transport-independent, open-source software stack
- for RDMA-capable fabrics and networks
- InfiniBand and Ethernet
- developed for many hardware architectures and OSes
- Linux and Windows
- for server and storage clustering and grid connectivity
- optimized for performance (i.e., BW, low latency)
- using transport-offload technologies available in adapter hardware
53 MVAPICH
- MVAPICH
- (MPI-1 over OpenFabrics/Gen2, OpenFabrics/Gen2-UD, uDAPL, InfiniPath, VAPI and TCP/IP)
- An MPI-1 implementation
- Based on MPICH and MVICH
- The latest release is MVAPICH 1.1 (includes MPICH 1.2.7)
- Available under BSD licensing
54 MVAPICH2
- MVAPICH2
- (MPI-2 over OpenFabrics-IB, OpenFabrics-iWARP, uDAPL and TCP/IP)
- An MPI-2 implementation which includes all MPI-1 features
- Based on MPICH2 and MVICH
- The latest release is MVAPICH2 1.2 (includes MPICH2 1.0.7)
55 Open MPI Version 1.3.1
- http://www.open-mpi.org
- High-performance message-passing library
- Open-source MPI-2 implementation
- Developed and maintained by a consortium of academic, research, and industry partners
- Many OSes supported
56 MPIEXEC options
- Two major areas
- DEVICE
- PINNING
57 RUNTIME MPI Issues
- shm
- shared memory only (no sockets)
- ssm
- combined sockets + shared memory (for clusters with SMP nodes)
- rdma
- RDMA-capable network fabrics including InfiniBand and Myrinet (via DAPL)
- rdssm
- combined sockets + shared memory + DAPL
- for clusters with SMP nodes and RDMA-capable network fabrics
58 Typical mpiexec command
- mpiexec -genv I_MPI_DEVICE rdssm \
-         -genv I_MPI_PIN 1 \
-         -genv I_MPI_PIN_PROCESSOR_LIST 0,2-3,4 \
-         -np 16 -perhost 4 ./a.out
- -genv X Y associates env var X with value Y for all MPI ranks
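To verify where ranks actually land when pinning is enabled, a small test program can print each rank's current CPU; a sketch (sched_getcpu is a glibc extension and may be missing on older glibc):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>   /* sched_getcpu (glibc extension) */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* Which logical CPU is this rank on right now? */
        printf("rank %d of %d on cpu %d\n", rank, size, sched_getcpu());
        MPI_Finalize();
        return 0;
    }

Launched with the mpiexec line above, the printed CPU numbers should match I_MPI_PIN_PROCESSOR_LIST.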
59 MPI_DEVICE for CubedSphere
- ssm: 250 seconds
- rdma: 250 seconds
- rdssm: 250 seconds
60 Rank Pinning
61 Task Affinity
- taskset
- taskset -c 0,1,4,5 ...
- numactl
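Affinity can also be set from inside a process; a minimal sketch using the Linux sched_setaffinity API, with the same CPU set as the taskset example above:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>   /* cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity */

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        /* Restrict this process (pid 0 = self) to CPUs 0, 1, 4, 5. */
        CPU_SET(0, &mask); CPU_SET(1, &mask);
        CPU_SET(4, &mask); CPU_SET(5, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the workload here ... */
        return 0;
    }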
62 Interactive Tools for Monitoring
63 SMT
- BIOS option, set at boot time
- Runs 2 threads at the same time per core
- Threads share resources (execution units)
- Takes advantage of the 4-wide execution engine
- Keeps it fed with multiple threads
- Hides the latency of a single thread
- Most power-efficient performance feature
- Very low die-area cost
- Can provide significant performance benefit depending on the application
- Much more efficient than adding an entire core
- Implications for out-of-order execution
- Might be good for MPI + OpenMP
- Might lead to extra BW pressure and pressure on the L1/L2/L3 caches
64 SMT + MPI
- NOAA NCEP GFS code, T190 (240-hour simulation)
- SMT OFF: 9709 seconds
- SMT ON + TURBO ON: 7276 seconds
65 TURBO
- Turbo mode boosts operating frequency based on thermal headroom:
- when the processor is operating below its peak power,
- it increases the clock speed of the active cores by one or more bins to increase performance
- Common reasons for operating below peak power:
- one or more cores may be powered down
- the active workload is relatively low-power (e.g. no floating point, or few memory accesses)
- Active cores can increase their clock frequency in relatively coarse increments of 133 MHz speed bins, depending on:
- the SKU
- the available power
- thermal headroom
- other environmental factors
66 SMT and TURBO
67 SMT + Hybrid MPI + OpenMP
- 8 MPI tasks
- OMP_NUM_THREADS=1
- OMP_NUM_THREADS=2
- Potentially a good way to exploit SMT; a minimal skeleton follows
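A hedged hybrid skeleton, illustrative only (not the NCCS codes):

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);                 /* e.g. 8 MPI tasks */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* With OMP_NUM_THREADS=2, each rank runs two threads, e.g. the
           two hardware threads of one core when SMT is on. */
        #pragma omp parallel
        {
            printf("rank %d thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }
        MPI_Finalize();
        return 0;
    }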
68 Partially Filled Nodes
69 MPI optimization
- Affinity
- Mapping Tasks to Nodes
- Mapping Tasks to Cores
- Barriers
- Collectives
- Environment variables
- Partially/Fully loaded nodes
70 SHM vs. SSM
- shm: 411 seconds
- ssm: 424 seconds
71 Events available for Oprofile
- CPU_CLK_UNHALTED
- UNHALTED_REFERENCE_CYCLES
- INST_RETIRED_ANY_P
- LLC_MISSES
- LLC_REFS
- BR_MISS_PRED_RETIRED
72 PAPI issues
- Agree 100% that performance tools are desperately needed
- Our LTC team has been actively driving the distros to add support
- A decision was made to drive perfmon2 as the preferred method
- Have had some success driving it into the next major releases, RHEL6 and SLES11
- Unfortunately, we (possibly) missed the first release of SLES11, so it will land in SP1
- That would be the first time we could officially support it as installed
- If run with the kernel patch, problems have to be reproduced on a non-patched system
- This has worked for POWER Linux users at some pretty large sites
- Use TDS systems as the vehicle to carry the patches and do some perf testing
- SCU5 without the PAPI patch and SCU6 with??
- If a kernel problem occurs that needs to be reproduced, it could just be rerun on SCU5??
73 Oprofile LLC_MISSES
74 Oprofile
- set CUR_DIR = `pwd`
- sudo rm -rf samples
- echo " shutdown "
- sudo opcontrol --shutdown
- echo " start-daemon "
- sudo opcontrol --verbose=all --start-daemon --no-vmlinux --session-dir=$CUR_DIR --separate=thread --callgraph=10 --event=LLC_REFS:10000 --image=$EXE
- sudo opcontrol --status
- echo " start "
- sudo opcontrol --start
- setenv OMP_NUM_THREADS 1
- mpiexec -genv I_MPI_DEVICE shm -perhost 8 -n $NUMPRO $EXE
- sudo opcontrol --stop
- echo " shutdown "
- sudo opcontrol --shutdown
75 I/O optimization
- High Performance Filesystems
- Striped disks
- GPFS (parallel filesystem)
- MIO
76 MIOSTAT Statistics Collection
- set MIOSTAT = /home/kghosh/vmio/tools/bin/miostats
- $MIOSTAT -v ./c2l.x
77 MIO optimized code execution
- setenv MIO /home/kghosh/vmio/tools
- setenv MIO_LIBRARY_PATH $MIO"/BLD/xLinux.64/lib"
- setenv LD_PRELOAD $MIO"/BLD/xLinux.64/lib/libTKIO.so"
- setenv TKIO_ALTLIB "fv.x$MIO/BLD/xLinux.64/lib/get_MIO_ptrs_64.so/abort"
- setenv MIO_STATS "./MIO.PID.stats"
- setenv MIO_FILES "*.nc [ trace/stats/mbytes | pf/cache=2g/page=2m/pref=2 | trace/stats/mbytes | async/nthread=2/naio=100/nchild=1 | trace/stats/mbytes ]"
78 MIO with C2L (CubeToLatLon)
- BEFORE
- Timestamp @ Start: 14:00:45, Cumulative time: 0.000 sec
- Timestamp @ Stop: 14:08:25, Cumulative time: 460.451 sec
- AFTER
- MIO_FILES "*.nc [ trace/stats/mbytes | pf/cache=2g/page=2m/pref=2 | trace/stats/mbytes | async/nthread=2/naio=100/nchild=1 | trace/stats/mbytes ]"
- Timestamp @ Start: 14:31:53, Cumulative time: 0.004 sec
- Timestamp @ Stop: 14:34:04, Cumulative time: 130.618 sec
79 GPFS I/O
- Timestamp @ Start: 10:14:44, Cumulative time: 0.012 sec
- Timestamp @ Stop: 10:15:12, Cumulative time: 27.835 sec
80 Non-invasive MPI Trace Tool from IBM
- No recompile needed
- Uses the PMPI layer
- mpiifort $(LDFLAGS) libmpi_trace.a *.o -o a.out
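The no-recompile property comes from the PMPI profiling layer: every MPI routine also exists under a PMPI_ name, so a trace library can define its own MPI_Isend, do its bookkeeping, and forward to the real one. A minimal sketch of one wrapper (not IBM's actual implementation):

    #include <mpi.h>

    static long isend_calls = 0;   /* what a real tool would time and log */

    /* Linking this ahead of the MPI library intercepts the call;
       the application itself is not recompiled. */
    int MPI_Isend(void *buf, int count, MPI_Datatype type, int dest,
                  int tag, MPI_Comm comm, MPI_Request *req)
    {
        isend_calls++;
        return PMPI_Isend(buf, count, type, dest, tag, comm, req);
    }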
81 MPI TRACE output
- Data for MPI rank 62 of 128

    -----------------------------------------------------------
    MPI Routine        calls     avg. bytes    time(sec)
    -----------------------------------------------------------
    MPI_Comm_size          1            0.0        0.000
    MPI_Comm_rank          1            0.0        0.000
    MPI_Isend         114554         4106.8        0.953
    MPI_Irecv         114554         4117.5        0.188
    MPI_Wait          229108            0.0        5.190
    MPI_Bcast             28           11.1        0.039
    MPI_Barrier            2            0.0        0.003
    MPI_Reduce             2            8.0        0.000
    -----------------------------------------------------------

- MPI task 62 of 128 had the median communication time
- total communication time: 6.373 seconds
- total elapsed time: 34.825 seconds
- user cpu time: 34.799 seconds
- system time: 0.002 seconds
82 MPI TRACE output
- Message size distributions

    MPI_Isend        calls    avg. bytes    time(sec)
                    114252        4096.0        0.911
                       302        8192.0        0.042
    MPI_Irecv        calls    avg. bytes    time(sec)
                    113954        4096.0        0.186
                       600        8192.0        0.002
    MPI_Bcast        calls    avg. bytes    time(sec)
                        24           4.0        0.034
                         2           8.0        0.006
                         2         100.0        0.000
    MPI_Reduce       calls    avg. bytes    time(sec)
                         2           8.0        0.000
83 CubedSphere on Nehalem and Harpertown (previous generation: 150 PEs, 9977 seconds)