Title: Compilers and Optimization on AIX systems
1Compilers and Optimization on AIX systems
2Outline
- Overview
- Basic compiler options
- Optimization
- General programming tips
- Compiler options
- Optimized libraries
4Overview
5Overview
- Most flags and options are the same for all three
groups of compilers.
- Prefix mp indicates compatibility with MPI
- e.g. mpxlf is the Fortran compiler compatible
with MPI
- Suffix _r indicates a thread-safe compiler
- Usage
- compiler <options> input_files
6Documentation and references
- IBM AIX compiler center
- http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp
- LSU HPC documentation
- http://appl003.lsu.edu/ocsweb/hpchome.nsf/Content/document?OpenDocument
8Basic options
9Basic options (cont'd)
11Optimization general tips
- Do not excessively hand-tune your code.
- Unusual constructs may confuse the compiler and
make it difficult to optimize for new machines
- Use the MASS and ESSL libraries rather than
writing your own code (details later)
- Optimized for Power5 machines
- Try not to break your code into too many small
functions and subroutines, to avoid excessive call
overhead.
12Optimization general tips (cont'd)
- Avoid unnecessary use of global variables
- Use local variables for loop index and bounds
when possible
- Example: When using a global variable in a loop,
load it into a local variable before the loop and
store it back after.
- Limit the use of ALLOCATABLE arrays to
situations that demand dynamic allocation.
14High Order Transformations (-qhot)
- What does it do?
- Scalar replacement
- Loop transformation (blocking, interchange,
fusion, reversal and unrolling of loops)
- Reduce the generation of temporary arrays
- Controlled by the characteristics of loops and
the cost of loop transformations
- When -qhot is specified, the compiler assumes an
optimization level of -O2 (details later)
15Example outer loop unroll
Original loop:
    Do I=1,N
      Do J=1,N
        Sum = Sum + X(J)*A(J,I)
      Enddo
    Enddo
After unrolling the outer loop by 4:
    Do I=1,N,4
      Do J=1,N
        Sum = Sum + X(J)*A(J,I) + X(J)*A(J,I+1) &
                  + X(J)*A(J,I+2) + X(J)*A(J,I+3)
      Enddo
    Enddo
- Minimize loads/stores by finding variables that
can be loaded once and used multiple times
- Original: 2 flops/2 loads
- Unrolled: 8 flops/5 loads
16Outer loop unroll test
[Figure: outer loop unroll test, performance in MFLOP/s]
17Example interchange loops
Original loop:
    Do I=1,N
      Do J=1,N
        Sum = Sum + A(I,J)
      Enddo
    Enddo
After interchanging the loops:
    Do J=1,N
      Do I=1,N
        Sum = Sum + A(I,J)
      Enddo
    Enddo
- Minimize strides
- Remember Fortran and C are different
- Fortran: column-major arrays
- C: row-major arrays
18Interchange loop test
[Figure: interchange loop test, performance in MFLOP/s]
19Optimizing for a target machine
- Instruct the compiler to generate code for
optimal execution on a given processor or
architecture.
- Target machine options
- -q32: generates code for 32-bit environment
- -q64: generates code for 64-bit environment
- -qarch: selects specific architecture
- -qtune: biases optimization toward execution on a
given machine
- -qcache: defines specific cache or memory geometry
2032/64-bit environment
- Performance consideration
- 64-bit mode
- Capable of handling larger amounts of data
directly in physical memory rather than relying
on disk I/O
- 32-bit mode
- Smaller program, less demanding on physical
memory
- Division operations are faster
2132/64-bit environment
- Specify -q32 (default) or -q64 when compiling
- Alternative: set the OBJECT_MODE environment
variable to 32 or 64
- Some tips on working with 64-bit programs
- Avoid performing mixed 32-bit and 64-bit
operations
- Avoid long division whenever possible
- For C and C++ programs, use long types instead of
signed, unsigned and plain int types for
variables which will be frequently accessed.
22Target a specific architecture (-qarch)
- Syntax: -qarch=architecture
- On Pelican system
- -qarch=pwr4: Power4 machines
- -qarch=pwr5: Power5 machines
- -qarch=auto: Use the architecture of the
compiling machine.
- Remember the head node on Pelican is a Power4
machine!
- On LONI AIX systems
- -qarch=auto or -qarch=pwr5: it does not matter
because all nodes are Power5 machines
23-qtune
- Bias optimization toward a specific machine
- Tunes instruction selection, scheduling and other
implementation-dependent performance enhancements
- Affects performance but not correctness
- Primarily of benefit for floating-point intensive
programs
- Is controlled by the -qarch, -q32 and -q64 options if
not explicitly specified
- -qtune=auto assumes that the execution
environment will be the same as the compilation
environment
24-qcache
- Specifies the cache configuration for a specific
machine
- Especially useful for loop operations (process
only the amount of data that can fit into the
data cache)
- Must be used in conjunction with -qhot
- Options
- line=bytes: line size of the cache
- size=bytes: total size of the cache
- level=level: specifies the level of cache
affected
- cost=cycles: specifies the performance penalty
resulting from a cache miss
25Profile directed optimization
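- Profile-directed feedback (PDF)
- Two-stage optimization: compile with -qpdf1, run
the program on representative input to collect a
profile, then recompile with -qpdf2
- Should be mainly used on code that has rarely
executed conditional error handling or
instrumentation.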
26Interprocedural analysis (-qipa)
- Optimizes across different files (whole-program
analysis)
- Has different levels
- level=0
- Program partitioning and simple interprocedural
optimization
- level=1
- Default level of -qipa
- Inlining and global data mapping
- level=2
- Global alias analysis
- Interprocedural data flow
27Inlining
- Can be turned on by specifying -qipa=inline=inline-options
(or -qinline=inline-options)
- Useful when your program has many subprogram
calls
- Reduces the call overhead
- Identify the subprograms that are called the most
and inline only those subprograms
- Examples
- -qipa=inline=auto: inline procedures automatically
(the compiler chooses)
- -qipa=inline=sub1:inline=noauto: only inline the
procedure sub1
28Choose an optimization level
- -On option
- -O0: very limited optimization, fast compilation,
debuggable code
- -O2: comprehensive low-level optimization,
partial debug support
- -O3: more extensive optimization, some precision
trade-off
- -O4: Everything from -O3 plus -qhot -qipa
-qarch=auto -qtune=auto -qcache=auto
- -O5: Everything from -O4 plus -qipa=level=2
29Choose an optimization level
- Test and debug your code before moving to any level of
optimization
- If you encounter problems with -O2, check the code
for any non-standard use of aliasing rules.
- Consider using -qalias=nostd (Fortran) or
-qalias=noansi (C) to tell the compiler that your
compilation unit may not follow the standard
aliasing rules.
- If you encounter problems with -O3, consider using
-qstrict along with -O3.
- -qstrict ensures the optimizer will not alter the
semantics of a program
- Try to at least optimize your program with -O3
-qhot
31Optimized libraries
- Mathematical Acceleration SubSystem (MASS)
- Engineering and Scientific Subroutine Library
(ESSL)
- Both support Fortran, C, and C++.
32MASS
- Optimized intrinsic functions
- Examples: sqrt, sin, cos, exp, log, x^y
- Better performance at the expense of reduced
precision (1 to 2 bits less)
- Have both scalar and vector versions
- Thread safe
- Usage
- Compile normally, then link with the option -lmass
33MASS performance
[Figure: MASS performance, in Moperations/s]
34Intrinsic vector functions
- Intrinsic vector functions
- Compiler generates vector intrinsic functions
when -qhot is specified
- Examples: vlog, vexp, vdiv, vsqrt
- The performance is very good
Source loop:
    Do i=1,n
      A(i) = log(B(i))
    Enddo
Generated with -qhot:
    Call __vlog(A,B,n)
35Vector MASS library
- Usage
- Need to call explicitly in the code
- Example: call vexp(A,B,n) rather than
do i=1,n A(i)=exp(B(i)) enddo
- Link with -lmassv
36Vector MASS performance
[Figure: vector MASS performance, in Moperations/s]
37ESSL
- Has over 400 subroutines
- Tuned for PowerPC systems
- Available for parallel computing environment also
- Usage
- Link with -lessl
38ESSL library
- Linear algebra subprograms
- Linear equations
- Eigensystem analysis
- Fourier transforms
- Convolution and correlation
- Sorting and searching
- Interpolation
- Numerical quadrature
- Random number generation
39BLAS (Basic Linear Algebra Subprograms)
- BLAS 1 (vector-vector operation)
- Compiler generated code is faster than ESSL
- BLAS 2 (vector-matrix operation)
- Compiler generated code is equivalent to ESSL
- BLAS 3 (matrix-matrix operation)
- ESSL is significantly faster
40ESSL performance