Title: HP and Intel Compilers
1. HP and Intel Compilers
- See http://www.hp.com/go/lang
2. Performance Expectations
- ITANIUM-2 vs PA-RISC performance ratio: 1.5X to 2.5X
- ITANIUM-2 vs ALPHA performance ratio: 1.0X to 1.2X
- If these ratios can't be achieved in a benchmark, then something is wrong.
3. HP-UX hardware and runtime environment (1): Data models
- Evolution from 32 bit to 64 bit. Kernel is 64 bit. User processes can be either 32 or 64 bit. Benefit: no forced 64 bit migration from 32 bit platforms.
- Compiler default is 32 bit. For the 64 bit data model use +DD64 on IPF, +DA2.0W on PA-RISC.
- 64 bit model is LP64, 32 bit model is ILP32. Caution with long in C! Use long long to always get 64 bit integers, or use size_t whenever possible.
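The caution about long can be made concrete with a small C sketch. The function names (long_bits, data_model, square_wide, check_data_model) are illustrative and not part of any HP-UX API; the sketch only assumes a compiler that supports long long.

```c
#include <assert.h>
#include <limits.h>

/* Width of long in bits: 32 under ILP32, 64 under LP64. */
int long_bits(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}

/* Name of the data model, judged from the widths of long and pointers. */
const char *data_model(void)
{
    if (sizeof(long) == 8 && sizeof(void *) == 8) return "LP64";
    if (sizeof(long) == 4 && sizeof(void *) == 4) return "ILP32";
    return "other";
}

/* Widen before multiplying: long long holds at least 64 bits in both
   models, whereas a long result would overflow under ILP32. */
long long square_wide(int n)
{
    return (long long)n * n;
}

/* Usage example: these checks hold under both ILP32 and LP64. */
void check_data_model(void)
{
    assert(sizeof(long long) * CHAR_BIT >= 64);
    assert(square_wide(100000) == 10000000000LL);
}
```

Code that stores such products in a long gives the same answer only when built for the 64 bit model, which is why the type has to be chosen explicitly.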
4. HP-UX hardware and runtime environment (2): Data models and system libraries
- Each system library and API is available in 4 flavours in separate subdirectories:
- /usr/lib/pa1.1: PA-RISC 32 bit
- /usr/lib/pa20_64: PA-RISC 64 bit
- /usr/lib/hpux32: IPF 32 bit
- /usr/lib/hpux64: IPF 64 bit
- Same thing for extra products like MPI, MLIB, ...
- Object format is ELF32 and ELF64, except for PA-RISC 32 bit (+DA2.0: SOM)
- Mixed linking is impossible. The linker returns an explicit message.
5. HP-UX hardware and runtime environment (3): Data alignment
- LINUX is little endian.
- HP-UX is big endian with both IPF and PA-RISC. Binary data compatibility with MIPS, SPARC, POWER. Binary data incompatibility with ALPHA, IA-32, LINUX.
- Data alignment is very similar between Tru64 and HP-UX. Unaligned access on HP-UX causes SIGBUS.
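When binary data has to move between big endian and little endian systems, or is read from an unaligned position in a buffer, a few helper routines in C avoid both problems. This is a minimal sketch; the helper names (load_be32, swap32, load_double_unaligned) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32 bit big endian value from a byte buffer. It touches single
   bytes only, so it works on any host byte order and at any alignment. */
uint32_t load_be32(const unsigned char *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Reverse the byte order of a 32 bit word (big endian <-> little endian). */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Copy a possibly misaligned double out of a buffer. memcpy avoids the
   direct misaligned load that would raise SIGBUS. */
double load_double_unaligned(const void *p)
{
    double d;
    assert(p != NULL);
    memcpy(&d, p, sizeof d);
    return d;
}
```

Casting a byte pointer to double * and dereferencing it is the pattern that tends to work on IA-32 and fail with SIGBUS on HP-UX; the memcpy form is safe on both.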
6. HP-UX hardware and runtime environment (4): Exception handling
- HP-UX ignores FP exceptions by default. Link with +FPDVONZ for Tru64-like behaviour. The API for runtime control of the FPU is defined in /usr/include/fenv.h.
- NULL pointer dereference by default returns 0. Link with -z for Tru64-like SIGSEGV generation.
- malloc(0) returns a valid pointer.
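Code that must behave the same on HP-UX and Tru64 should not depend on either platform's handling of malloc(0) or NULL dereferences. A defensive sketch in C follows; portable_alloc and read_or_default are hypothetical helper names, not library functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n bytes. A zero byte request is rounded up to one byte, so a
   successful call always yields a unique pointer that can be freed.
   Returns NULL only when the allocation itself fails. */
void *portable_alloc(size_t n)
{
    return malloc(n ? n : 1);
}

/* Guard the dereference explicitly instead of relying on the platform
   (a silent 0 by default on HP-UX, SIGSEGV on Tru64). */
int read_or_default(const int *p, int fallback)
{
    return p ? *p : fallback;
}

/* Usage example. */
void check_defensive_helpers(void)
{
    int v = 3;
    void *p = portable_alloc(0);
    assert(p != NULL);
    free(p);
    assert(read_or_default(&v, 7) == 3);
    assert(read_or_default(NULL, 7) == 7);
}
```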
7. HP-UX hardware and runtime environment (5): Memory page size
- HP-UX supports variable sized pages / large pages. HW page size is still 4k.
- Page size is a property of the executable and can be modified with the chatr command: chatr +pd 4M +pi 4M, or chatr +pd L +pi D.
- Large pages can drastically reduce TLB misses. Many HPTC apps get a huge performance boost from large pages.
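Why large pages reduce TLB misses follows from simple arithmetic: the number of pages, and so the number of TLB entries, needed to map a working set shrinks in proportion to the page size. The C sketch below illustrates this; pages_needed is an illustrative helper and the 1 GB working set is an arbitrary example.

```c
#include <assert.h>

/* Number of pages (and thus TLB entries) needed to map `bytes` of data,
   rounded up to whole pages. */
unsigned long long pages_needed(unsigned long long bytes,
                                unsigned long long page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/* Usage example: a 1 GB working set with 4k pages and with 4M pages. */
void check_pages_needed(void)
{
    const unsigned long long one_gb = 1ULL << 30;
    assert(pages_needed(one_gb, 4096ULL) == 262144ULL);
    assert(pages_needed(one_gb, 4ULL << 20) == 256ULL);
}
```

With 4M pages the same data is covered by a thousand times fewer translations, so far more of the working set stays mapped in the TLB at once.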
8. HP-UX hardware and runtime environment (6): Large files
- Large files are the default for HP-UX / IPF but not PA-RISC
- On PA-RISC use -o largefiles for newfs and mount
- Some HP-UX commands don't support large files, e.g. tar, cpio and pax fail to back up large files; same problem with some open source tools
- Rebuild 32 bit programs with -D_LARGEFILE64_SOURCE
- No problem with 64 bit programs
9. HP-UX hardware and runtime environment (7): ARIES
- PA-RISC binaries can be run on IPF through dynamic translation
- Slowdown is 3X for GUIs, up to 10X for solvers
- The slowdown is hardly noticeable with interactive tools like Vim, Netscape, Acroread
- Many HP-UX tools on IPF, like SAM, are still PA-RISC binaries
- Use the file command to identify the nature of an executable
- IPF migration approach with ISVs:
- Rebuild solvers first
- Rebuild GUIs, pre/post later
10. HP-UX Compilers: C/C++
- /opt/aCC/bin/aCC and /opt/ansic/bin/cc are the ANSI C++ and C compilers
- -AA: ANSI C++ with namespace std and new C++ standard library. This is the default.
- -AP: Turn off -AA and use the older classic C++ runtime libraries. Very useful for porting legacy and open source codes.
- -Aa: strict ANSI C (TRU64 -std1)
- -Ae: ANSI C with extensions (TRU64 -std)
- No support for K&R mode
- Instantiation files are written to a repository on TRU64, to the object file on HP-UX
11. HP-UX Compilers: Fortran (77), 90, 95
- /opt/fortran90/bin/f90 is the Fortran 90/95 compiler
- Supports native OpenMP 1.1 and legacy CONVEX/HP directives. +Oopenmp: enable OpenMP directives. +Oparallel +O3: enable legacy CONVEX/HP directives.
- Legacy f77 compiler is obsolete, f90 handles f77 codes very well. Use +U77 to enable BSD 3f intrinsics.
- f90 adds a trailing underscore to function names on IPF and PA-RISC 64 bit. No trailing underscores for PA-RISC 32 bit. Explicit control with +ppu (add trailing underscore) and +noppu (do not add trailing underscore).
- Pragma example: !DEC$ ALIAS becomes !$HP$ ALIAS
12. HP-UX Compilers: Mixing Languages
- HP-UX compiler drivers do NOT recognize other languages; C and Fortran programs need to be compiled separately
- Make sure C symbols are lowercase and have a trailing underscore, or compile Fortran sources with +noppu
- If aCC or ld is used for linking, FORTRAN libraries have to be passed explicitly to the linker (libF90.a, libIO77.a, -lm, -lc)
- If f90 is used for linking it will find its libraries automatically
- The what command returns the exact compiler version string
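A common way to keep C routines callable from Fortran under both naming schemes is a small name-mangling macro. The sketch below assumes a hypothetical build macro FORTRAN_UNDERSCORE (set it to 0 for PA-RISC 32 bit or when the Fortran sources are compiled with +noppu); the routine scale_add is an invented example, not a library function.

```c
#include <assert.h>
#include <stddef.h>

/* FORTRAN_UNDERSCORE is a hypothetical macro chosen at build time:
   1 when f90 appends an underscore (IPF, PA-RISC 64 bit, +ppu),
   0 when it does not (PA-RISC 32 bit, +noppu). */
#ifndef FORTRAN_UNDERSCORE
#define FORTRAN_UNDERSCORE 1
#endif

#if FORTRAN_UNDERSCORE
#define F_NAME(name) name##_
#else
#define F_NAME(name) name
#endif

/* Callable from Fortran as: call scale_add(n, alpha, x, y)
   Fortran passes every argument by reference, hence the pointers.
   The symbol is lowercase, as the Fortran compiler expects. */
void F_NAME(scale_add)(const int *n, const double *alpha,
                       const double *x, double *y)
{
    int i;
    assert(n != NULL && alpha != NULL && x != NULL && y != NULL);
    for (i = 0; i < *n; ++i)
        y[i] += *alpha * x[i];
}
```

From C, the same routine is reached through the macro, for example F_NAME(scale_add)(&n, &alpha, x, y), so one source file serves both conventions.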
13. Intel V7.0 Linux Compilers
- /opt/intel/compiler points to the latest compiler
- efc: Fortran
- /opt/intel/compiler70/ia64/bin/efc
- ecc: C, C++
- /opt/intel/compiler70/ia64/bin/ecc
- Source a subset of the following to set up the environment: /opt/intel/compiler70/ia64/bin/eccvars.csh or eccvars.sh, /opt/intel/compiler70/ia64/bin/efcvars.csh or efcvars.sh
- Useful web page: http://www.intel.com/software/products/compilers/
14. Compilers: Directives and Pragmas
- HP Fortran compiler directive
- c$dir
- HP C compiler pragma
- #pragma _cnx
- (note the blank between pragma and _cnx above)
- OpenMP directives
- c$omp
- Preferred for directive based parallelism
- C$OMP parallel do private(x,y) shared(z)
- Intel compiler directives
- cdec$
- cdir$
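For C codes, the counterpart of the C$OMP parallel do directive is the omp parallel for pragma. A minimal sketch follows; the function name dot is illustrative, and the loop simply runs serially when the compiler's OpenMP switch is not given, because the pragma is then ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with an OpenMP work-sharing loop. The reduction clause
   gives each thread a private partial sum and combines them at the end;
   the loop index i is private to each thread automatically. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    int i;

    assert(n >= 0);
    assert(n == 0 || (x != NULL && y != NULL));

#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```

Note that a parallel reduction may add the terms in a different order than the serial loop, so results can differ in the last bits for general floating point data.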
15. HP-UX and Intel Linux Compilers: Architecture and Data Model Switches
- +DA2.0N +DS2.0: PA-RISC (2.0) 32 bit
- +DA2.0W +DS2.0: PA-RISC (2.0) 64 bit
- +DSmckinley +DD32: ITANIUM-2 32 bit
- +DSmckinley +DD64: ITANIUM-2 64 bit
- (+DSmckinley = +DSitanium2)
- -tpp2: ITANIUM-2 64 bit with the INTEL compiler
- Recommendations
- +DSmckinley and -tpp2 are also the best choice for Madison code.
- +DSitanium, +DSblended and -tpp1 should be used only if the target is ITANIUM-1. This code performs 20% slower on ITANIUM-2.
16. Intel Linux Compilers: Optimisation Levels
- -O2
- very safe, register rotation, no extra unrolling, no prefetch instructions
- -O3
- usually safe, lots of optimizations including load word pair generation, up to 8-way unrolling, prefetch instructions
17. HP-UX Compilers: Optimisation Levels (1)
- +O0: Default, minimal optimisation. Fastest compile time. Good debugging support.
- +O1: Basic block level optimisation. Pretty fast compile time, improved runtime performance. Good debugging support.
- +O2: Full routine level optimisation. Register rotation and data prefetching. Limited debugging support, good runtime performance. Inlining for sqrt.
- +O2 is sufficient for most FORTRAN codes
18. HP-UX Compilers: Optimisation Levels (2)
- +O3: Full source file level optimisation. No debugging support (-g is invalid). Adds subroutine cloning and inlining (only within the source file). Adds transformations for nested loops. Inlines all math intrinsics on IPF. Matches and inlines inverse square roots if +Ofltacc=relaxed. Use +Oinfo or +Oreport=all for an optimisation report.
- +O3 is not always better than +O2. Use it deliberately for:
- inlining of math intrinsics and frequently called routines
- transformation of nested loops
- optimized inverse square roots (e.g. quantum chemistry)
19. HP-UX Compilers: Global and Profile Based Optimisation
- +O4: Performs global optimisation at link time. Can be combined with Profile Based Optimisation (PBO).
- +Oprofile=collect: Make an instrumented executable for profiling. After execution it will dump the data in flow.data.
- +Oprofile=use: Use the profile data from flow.data for global optimisation.
- +O4 and PBO are most useful for C and C++, as they provide global inlining capability and reduce branch misprediction. The benefit for FORTRAN codes is very limited due to common programming practices.
20. HP-UX Compilers: Prefetching
- +O[no]dataprefetch[=direct|indirect|none]
- Control generation of data prefetch instructions for data structures referenced within innermost loops. The defined values for kind are:
- direct: Enable generation of data prefetch instructions for the benefit of direct memory accesses, but not indirect memory accesses.
- indirect: Enable generation of data prefetch instructions for the benefit of both direct and indirect memory accesses. This is the default at optimization levels +O2 and above.
- none: Disable generation of data prefetch instructions. This is the default at optimization levels +O1 and below.
21. HP-UX and Intel Linux Compilers: Fortran prefetch directives
- HP-UX: c$dir prefetch (expression)
- no special compile options needed
- Intel: cdir$ noprefetch A,B,..
- Allows the user to prefetch explicitly where the compiler fails, e.g. when addresses are computed:
      do i = 1,n
        ia = func(i)
c$dir prefetch b(func(i+50))
        b(ia) = b(ia) + a(i)
      enddo
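The same idea, prefetching an indirectly addressed element a fixed number of iterations ahead, can be sketched in C. This is not HP's directive: the sketch assumes the GCC/Clang builtin __builtin_prefetch and falls back to a no-op elsewhere, and it uses an index array idx in place of func(i). The names PREFETCH, PREFETCH_DISTANCE and scatter_add are illustrative.

```c
#include <assert.h>

/* Map PREFETCH to the GCC/Clang builtin where available (an assumption
   about the compiler in use); otherwise it expands to nothing. */
#if defined(__GNUC__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)0)
#endif

/* Look-ahead in iterations, matching the i+50 of the Fortran example. */
enum { PREFETCH_DISTANCE = 50 };

/* Scatter-add b[idx[i]] += a[i]. The element needed PREFETCH_DISTANCE
   iterations later is prefetched, since the compiler cannot predict
   addresses that are computed through idx. */
void scatter_add(double *b, const double *a, const int *idx, int n)
{
    int i;
    assert(n >= 0);
    for (i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH(&b[idx[i + PREFETCH_DISTANCE]]);
        b[idx[i]] += a[i];
    }
}
```

The prefetch is only a hint and does not change the result; the best distance depends on memory latency and loop cost and has to be tuned by measurement.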
22. HP-UX Compilers: Floating-Point Accuracy
- +Ofltacc=[strict|default|limited|relaxed]
- Control the level of FP optimizations that the compiler may perform.
- Useful for debugging when there are numerical instabilities
- default: Allow contractions, such as fused multiply-add (FMA), but disallow any other optimization that can result in numerical differences.
- limited: Like default, but also allows floating point optimizations which may affect the generation and propagation of infinities, NaNs, and the sign of zero.
- relaxed: In addition to the optimizations allowed by limited, permits optimizations, such as reordering of expressions, even if parenthesized, that may affect a rounding error. This is the same as +Onofltacc.
- strict: Disallow any floating point optimization that can result in numerical differences. This is the same as +Ofltacc.
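Why reordering of expressions can change answers is easy to show with three doubles. The C sketch below evaluates the same sum in two association orders; sum_left and sum_right are illustrative names, and the volatile temporaries only serve to force each intermediate result to be rounded to double.

```c
#include <assert.h>

/* (a + b) + c, with the intermediate rounded to double. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c), with the intermediate rounded to double. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}

/* Usage example: near 1e16 the spacing between doubles is 2, so adding
   1.0 to -1e16 is lost in rounding, while adding it last is exact. */
void check_reordering(void)
{
    assert(sum_left(1e16, -1e16, 1.0) == 1.0);
    assert(sum_right(1e16, -1e16, 1.0) == 0.0);
}
```

A compiler that is allowed to reassociate may turn one form into the other, which is the kind of difference a strict setting rules out.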
23. Intel Linux Compilers: Floating-Point Accuracy
- -IPF_fma[-] (-IPF_fma- to turn off fma generation)
- Enable / disable the combining of floating point multiplies and add / subtract operations. Note: fmas are still generated, but each corresponds to either an fmpy (fma x,y,f0) instruction or an fadd (fma x,f1,y) instruction.
- -IPF_fltacc[-]
- Enable / disable optimizations that affect floating point accuracy
24. Inline Math Intrinsics with +Olibcalls
- Not all intrinsics are treated equal
- abs is inlined at all optimisation levels
- sqrt is inlined at +O2 and above
- Other math intrinsics like exp, log, pow, sin are inlined at +O3
- Reciprocal square roots (y = 1./sqrt(x))
- IPF can compute rsqrt directly (no separate div/sqrt)
- HP-UX comes with a nonstandard rsqrt intrinsic
- With +Ofltacc=relaxed the f90 compiler matches and calls rsqrt at +O2, but inlines rsqrt only at +O3. Use it carefully!
- Nice performance boost in quantum chemistry (Coulomb forces)
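How a reciprocal square root can be formed without a separate divide and sqrt is sketched below in C. It is not the instruction sequence the HP compiler emits: the starting estimate comes from a well-known integer bit trick on the IEEE single precision representation (standing in for a hardware approximation), followed by Newton-Raphson refinement. The name rsqrt_newton is illustrative, and the sketch assumes IEEE 754 floats and an argument within the normal range of float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) using only multiplies and subtractions.
   Assumes x > 0 and x within the normal range of an IEEE 754 float. */
double rsqrt_newton(double x)
{
    float xf = (float)x;
    float guess;
    uint32_t bits;
    double y;
    int i;

    assert(x > 0.0);

    /* Initial estimate, accurate to a few percent, obtained by
       manipulating the exponent and mantissa bits of the float. */
    memcpy(&bits, &xf, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&guess, &bits, sizeof guess);

    /* Newton-Raphson for f(y) = 1/y^2 - x. Each step roughly squares
       the relative error, so four steps reach double precision. */
    y = guess;
    for (i = 0; i < 4; ++i)
        y = y * (1.5 - 0.5 * x * y * y);

    return y;
}
```

In a Coulomb-type kernel the inverse distance would then be obtained as rsqrt_newton(r2) from the squared distance r2, avoiding both the square root and the division.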
25. Important Linker Options
- Flush denormalized values to zero
- HP-UX: Linking with +FPD flushes denormalized values to zero
- Linux: Compile the main routine with -ftz and link normally
- Archived libraries
- HP-UX: -Wl,-aarchive or -Wl,-aarchive_shared to ensure archived libraries are used as much as possible
- Linux: -static prevents linking with shared libraries. The ld default corresponds to HP-UX's -ashared_archive. Use -Bstatic for archived libraries, -Bdynamic for shared libraries.
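To see which values are affected by flush-to-zero, the following C sketch classifies denormalized doubles and emulates the flush in software. It does not switch the hardware mode (that is what the linker and compiler options above do); is_denormal and flush_to_zero are illustrative helpers, and the checks assume the program runs with the default setting, where denormals are not flushed.

```c
#include <assert.h>
#include <float.h>

/* 1 if x is nonzero but smaller in magnitude than DBL_MIN, the
   smallest normalized double; 0 otherwise. */
int is_denormal(double x)
{
    volatile double v = x;
    return v != 0.0 && v > -DBL_MIN && v < DBL_MIN;
}

/* Software stand-in for flush-to-zero: denormals become 0. */
double flush_to_zero(double x)
{
    return is_denormal(x) ? 0.0 : x;
}

/* Usage example, assuming the default (non-flushing) FP mode. */
void check_flush(void)
{
    assert(is_denormal(DBL_MIN / 2.0));
    assert(!is_denormal(DBL_MIN));
    assert(flush_to_zero(DBL_MIN / 2.0) == 0.0);
    assert(flush_to_zero(1.0) == 1.0);
}
```

Flushing discards these tiny values, which is why it belongs on the list of switches whose results should be checked.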
26. Compiler Flags for Parallelism
- HP-UX
- +Oopenmp: Enable OpenMP directives. Available at any optimisation level.
- +Oparallel +O3: Enable HP/Convex directives and automatic parallelisation. Requires +O3.
- +Oparallel +O3 +Onoautopar: Disable automatic parallelisation. Keep directive based parallelism.
- +Oparallel +O3 +Onodynsel: Disable dynamic loop selection.
- Intel Linux
- -openmp: Enable OpenMP directives. Available at any optimisation level.
- -parallel: Enable automatic parallelisation.
- OpenMP is process based with Intel Linux while it is pthread based with HP-UX. Linking on HP-UX involves libomp, libcps, libpthread.
27. Environment Variables for Parallelism
- HP-UX: MP_NUMBER_OF_THREADS sets the number of threads with HP / CONVEX directives
- HP-UX: MP_IDLE_THREADS_WAIT sets the number of milliseconds a thread spins before suspending itself. If the number is less than 0, the threads will spin wait. Useful to prevent context switches and thread migration.
- HP-UX: MP_GANG=[ON|OFF] enables / disables gang scheduling for multithreaded and MPI apps. Useful for oversubscribed and throughput scenarios.
- HP-UX and Linux: OMP_NUM_THREADS sets OpenMP parallelism
- HP-UX and Linux: MLIB_NUMBER_OF_THREADS sets MLIB shared memory parallelism
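An application that wants to report or validate these settings can read them with getenv. The C sketch below is one cautious way to do it; parse_thread_count and threads_from_env are hypothetical helpers, and the upper bound of 1024 is an arbitrary sanity limit, not a value from any runtime.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert the text of a thread-count variable to an int. Returns
   `fallback` if the value is missing, not a plain decimal number, or
   outside 1..1024 (the cap is an arbitrary sanity limit). */
int parse_thread_count(const char *value, int fallback)
{
    char *end;
    long n;

    assert(fallback >= 1);
    if (value == NULL || *value == '\0')
        return fallback;

    n = strtol(value, &end, 10);
    if (*end != '\0' || n < 1 || n > 1024)
        return fallback;

    return (int)n;
}

/* Look up a variable such as OMP_NUM_THREADS or MP_NUMBER_OF_THREADS. */
int threads_from_env(const char *name, int fallback)
{
    assert(name != NULL);
    return parse_thread_count(getenv(name), fallback);
}
```

A call such as threads_from_env("OMP_NUM_THREADS", 1) then yields the requested thread count, or 1 when the variable is unset or malformed.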
28. HP-UX Compilers: Dangerous and Useless Switches
- Wrong floating point answers can be caused by:
- +O3, +O4, +Ofltacc=[default|relaxed], +Onoparmsoverlap, +FPD. Use with caution and check your answers.
- Useless switches, don't waste your time!
- +Ovectorize: Matches specific loop patterns and replaces them with optimized library calls. Useful only for SPECfp.
- +Oaggressive, +Oall: Lots of aggressive optimisations including +Ovectorize
- +fastallocatable was never observed to improve anything
29. Recommended Build Approach
- Get reference timings/outputs from PA-RISC or whatever
- Set the right architecture and data model switches
- Start with +O2 +Odataprefetch +Onolimit -g -Wl,+pd,L
- In case of wrong answers or divergence add +Ofltacc=strict. In case of right answers you can add +FPD +Ofltacc=relaxed and check the answers again.
- For C/C++ try +Onoparmsoverlap and check answers
- Now make a profile with prospect or caliper
- Try +O3 and +Oloop_block for selected hotspot routines
- Start trying source changes
- For C/C++ try profile based optimisation