Compilers and Optimization on AIX systems - PowerPoint PPT Presentation

Provided by: ley6

Title: Compilers and Optimization on AIX systems

1
Compilers and Optimization on AIX systems

Le Yan
Jan 26, 2007

2
Outline

Overview
Basic compiler options
Optimization
General programming tips
Compiler options
Optimized libraries

4
Overview
5
Overview

Most flags and options are the same for all three groups of compilers.
The prefix mp indicates compatibility with MPI, e.g. mpxlf is the Fortran compiler compatible with MPI.
The suffix _r indicates a thread-safe compiler.
Usage: compiler <options> input_files

6
Documentation and references

IBM AIX compiler center: http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp
LSU HPC documentation: http://appl003.lsu.edu/ocsweb/hpchome.nsf/Content/document?OpenDocument

8
Basic options
9
Basic options (cont'd)

11
Optimization: general tips

Do not excessively hand-tune your code. Unusual constructs may confuse the compiler and make the code difficult to optimize for new machines.
Use the MASS and ESSL libraries rather than writing your own code (details later). They are optimized for Power5 machines.
Try not to break your code into too many small functions and subroutines, to avoid excessive call overhead.

12
Optimization: general tips (cont'd)

Avoid unnecessary use of global variables.
Use local variables for loop indices and bounds when possible.
Example: when using a global variable in a loop, copy it into a local variable before the loop and store it back afterwards.
Limit the use of ALLOCATABLE arrays to situations that demand dynamic allocation.
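
The tip about global variables in loops can be sketched in C as follows. This is only an illustration: the global g_total and both function names are hypothetical and do not come from the presentation.

```c
#include <stddef.h>

/* Hypothetical global accumulator, used only for this illustration. */
double g_total = 0.0;

/* Pattern to avoid: the global is read and written on every iteration,
   because the compiler often cannot prove that nothing else touches it. */
void accumulate_global(const double *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        g_total += v[i];
    }
}

/* Pattern suggested above: copy the global into a local before the
   loop, work on the local, and store it back afterwards. */
void accumulate_local(const double *v, size_t n)
{
    double total = g_total;      /* load the global once */
    for (size_t i = 0; i < n; i++) {
        total += v[i];           /* the local can stay in a register */
    }
    g_total = total;             /* store the result back once */
}
```

Both functions leave the same value in g_total; the second one simply makes it easier for the compiler to keep the running sum in a register.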

14
High Order Transformations (-qhot)

What does it do?
Scalar replacement
Loop transformations (blocking, interchange, fusion, reversal and unrolling of loops)
Reduces the generation of temporary arrays
Controlled by the characteristics of the loops and the cost of the loop transformations
When -qhot is specified, the compiler assumes an
optimization level of -O2 (details later)
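
Two of the transformations listed above, loop fusion and scalar replacement, can be written out by hand to show what the compiler is aiming for. The sketch below is a hand-made C illustration, not output of -qhot; the function names are hypothetical and the arrays are assumed not to overlap.

```c
#include <stddef.h>

/* Two separate passes: b[] is written by the first loop and read
   again from memory by the second. Arrays are assumed not to overlap. */
void scale_then_add_separate(const double *a, double *b, double *c,
                             double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = s * a[i];
    for (size_t i = 0; i < n; i++)
        c[i] = b[i] + a[i];
}

/* Fused version: a single pass, with the value of b[i] kept in a
   scalar (scalar replacement) instead of being reloaded. */
void scale_then_add_fused(const double *a, double *b, double *c,
                          double s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double t = s * a[i];
        b[i] = t;
        c[i] = t + a[i];
    }
}
```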

15
Example: outer loop unroll

Original loop:

    Do I=1,N
      Do J=1,N
        Sum=Sum+X(J)*A(J,I)
      Enddo
    Enddo

After unrolling the outer loop by 4:

    Do I=1,N,4
      Do J=1,N
        Sum=Sum+X(J)*A(J,I)+X(J)*A(J,I+1)+X(J)*A(J,I+2)+X(J)*A(J,I+3)
      Enddo
    Enddo

Minimize loads/stores by finding variables that can be loaded once and used multiple times.
Original: 2 flops / 2 loads
Unrolled: 8 flops / 5 loads
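
For C programmers, roughly the same outer loop unrolling might look like the sketch below. It assumes an n-by-n matrix stored row by row in a flat array, and it adds a remainder loop for the case where n is not a multiple of 4, which the slide's example does not show. The function names are hypothetical.

```c
#include <stddef.h>

/* Reference version: sum of x[j] * a[i][j] over all i and j.
   a is an n-by-n matrix stored row by row in a flat array. */
double dot_all_plain(const double *a, const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += x[j] * a[i * n + j];
    return sum;
}

/* Outer loop unrolled by 4: x[j] is loaded once and used four times. */
double dot_all_unrolled(const double *a, const double *x, size_t n)
{
    double sum = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (size_t j = 0; j < n; j++) {
            double xj = x[j];
            sum += xj * a[i * n + j]
                 + xj * a[(i + 1) * n + j]
                 + xj * a[(i + 2) * n + j]
                 + xj * a[(i + 3) * n + j];
        }
    }
    /* Remainder rows when n is not a multiple of 4. */
    for (; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += x[j] * a[i * n + j];
    return sum;
}
```

Because the unrolled version adds the terms in a different order, results for general floating-point data may differ in the last bits from the plain version.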

16
Outer loop unroll test

[Chart: performance in MFLOP/s]

17
Example: interchange loops

Original loop:

    Do I=1,N
      Do J=1,N
        Sum=Sum+A(I,J)
      Enddo
    Enddo

After interchanging the loops:

    Do J=1,N
      Do I=1,N
        Sum=Sum+A(I,J)
      Enddo
    Enddo

Minimize strides.
Remember: Fortran and C are different.
Fortran: column-major arrays
C: row-major arrays
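
Since C arrays are row-major, the favourable loop order in C is the opposite of the Fortran one: the inner loop should run over the last index. A minimal sketch, assuming an n-by-n matrix stored row by row in a flat array (function names are hypothetical):

```c
#include <stddef.h>

/* Column-first traversal: the inner loop jumps n elements at a time,
   which is a large stride for a row-major C array. */
double sum_column_first(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}

/* Interchanged loops: the inner loop walks consecutive memory
   locations (stride 1). */
double sum_row_first(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}
```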

18
Interchange loop test

[Chart: performance in MFLOP/s]

19
Optimizing for a target machine

Instruct the compiler to generate code for
optimal execution on a given processor or
architecture.
Target machine options:
-q32: generates code for a 32-bit environment
-q64: generates code for a 64-bit environment
-qarch: selects a specific architecture
-qtune: biases optimization toward execution on a given machine
-qcache: defines a specific cache or memory geometry

20
32/64-bit environment

Performance considerations
64-bit mode: capable of handling larger amounts of data directly in physical memory rather than relying on disk I/O
32-bit mode: smaller programs that are less demanding on physical memory; division is faster

21
32/64-bit environment

Specify -q32 (default) or -q64 when compiling
Alternative: set the OBJECT_MODE environment variable to 32 or 64
Some tips on working with 64-bit programs:
Avoid performing mixed 32-bit and 64-bit operations
Avoid long division whenever possible
For C and C++ programs, use long types instead of signed, unsigned and plain int types for variables that will be frequently accessed.
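
The last tip can be illustrated with a small C sketch. It assumes the usual AIX data models, where long is 32 bits wide under -q32 and 64 bits wide under -q64, so a long index matches the pointer width in either mode. The function names are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

/* Width of a long in bits. On AIX this is expected to be 32 with -q32
   and 64 with -q64; other platforms may differ. */
int long_width_bits(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}

/* Frequently accessed loop index and bound declared as long, so that
   in 64-bit mode the index does not have to be sign-extended from a
   32-bit int before it is used to address v[]. */
double sum_with_long_index(const double *v, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```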

22
Target a specific architecture (-qarch)

Syntax: -qarch=architecture
On the Pelican system:
-qarch=pwr4: Power4 machines
-qarch=pwr5: Power5 machines
-qarch=auto: use the architecture of the compiling machine
Remember: the head node on Pelican is a Power4 machine!
On LONI AIX systems:
-qarch=auto or -qarch=pwr5; it does not matter because all nodes are Power5 machines

23
-qtune

Biases optimization toward a specific machine
Tunes instruction selection, scheduling and other implementation-dependent performance enhancements
Affects performance but not correctness
Primarily of benefit for floating-point intensive programs
Is determined by the -qarch, -q32 and -q64 options if not explicitly specified
-qtune=auto assumes that the execution environment will be the same as the compilation environment

24
-qcache

Specifies the cache configuration for a specific machine
Especially useful for loop operations (process only the amount of data that can fit into the data cache)
Must be used in conjunction with -qhot
Options:
line=bytes: line size of the cache
size=bytes: total size of the cache
level=level: specifies the level of cache affected
cost=cycles: specifies the performance penalty resulting from a cache miss
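
The idea of processing only as much data as fits into the data cache is usually called loop blocking. The compiler is meant to do this automatically from the cache description; the C sketch below only shows the idea by hand on a matrix transpose. The tile size TILE is a made-up value for illustration, not one derived from any real Power4 or Power5 cache, and the function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical tile edge; a real value would be chosen from the cache
   line size and total cache size of the target machine. */
#define TILE 4

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Naive transpose: out[] is written with stride n. */
void transpose_naive(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            out[j * n + i] = in[i * n + j];
}

/* Blocked transpose: works on TILE x TILE sub-blocks so that the data
   touched by the two inner loops is small enough to stay in cache. */
void transpose_blocked(const double *in, double *out, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < min_size(ii + TILE, n); i++)
                for (size_t j = jj; j < min_size(jj + TILE, n); j++)
                    out[j * n + i] = in[i * n + j];
}
```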

25
Profile directed optimization

Profile-directed feedback (PDF)
Two-stage optimization
Should mainly be used on code that has rarely executed conditional error handling or instrumentation.

26
Interprocedural analysis (-qipa)

Optimizes across different files (whole program analysis)
Has different levels:
level=0: program partitioning and simple interprocedural optimization
level=1 (default level of -qipa): inlining and global data mapping
level=2: global alias analysis and interprocedural data flow

27
Inlining

Can be turned on by specifying -qipa=inline=inline-options (or -qinline=inline-options)
Useful when your program has many subprogram calls
Reduces the call overhead
Identify the subprograms that are called the most and inline only those subprograms
Examples:
-qipa=inline=auto (inline all procedures)
-qipa=inline=sub1:inline=noauto (only inline the procedure sub1)
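
What inlining achieves can be seen by comparing a loop that calls a small helper with the same loop after the helper's body has been substituted by hand. This C sketch is only an illustration of the effect; the function names are hypothetical and the compiler performs the substitution itself when inlining is enabled.

```c
#include <stddef.h>

/* Small helper called once per element: a typical inlining candidate. */
static double square(double v)
{
    return v * v;
}

/* Version with one call per loop iteration. */
double sum_squares_call(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += square(v[i]);
    return sum;
}

/* The same loop after inlining: the body of square() replaces the
   call, so there is no call overhead inside the loop. */
double sum_squares_inlined(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i] * v[i];
    return sum;
}
```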

28
Choose an optimization level

-On option
-O0: very limited optimization, fast compilation, debuggable code
-O2: comprehensive low-level optimization, partial debug support
-O3: more extensive optimization, some precision trade-off
-O4: everything from -O3 plus -qhot -qipa -qarch=auto -qtune=auto -qcache=auto
-O5: everything from -O4 plus -qipa=level=2
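
One reason a precision trade-off can appear at higher optimization levels is that floating-point addition is not associative, so an optimizer that reorders a sum may change its result. The C sketch below only demonstrates the arithmetic effect with two explicitly parenthesized orderings; it assumes IEEE 754 double arithmetic and says nothing about which reorderings a particular compiler actually performs. The function names are hypothetical.

```c
/* Evaluates (a + b) + c in the order written. */
double add_left_first(double a, double b, double c)
{
    return (a + b) + c;
}

/* Evaluates a + (b + c): mathematically the same sum, but the
   intermediate rounding can differ. */
double add_right_first(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e16, b = -1e16 and c = 1.0, the first ordering gives 1.0, while in the second the 1.0 is lost when it is added to -1e16, so the result is 0.0.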

29
Choose an optimization level

Test and debug your code before moving to any level of optimization.
If you encounter problems with -O2, check the code for any non-standard use of aliasing rules.
Consider using -qalias=nostd (Fortran) or -qalias=noansi (C) to instruct the compiler which aliasing assertions to apply to your compilation unit.
If you encounter problems with -O3, consider using -qstrict along with -O3.
-qstrict ensures the optimizer will not alter the semantics of a program.
Try to at least optimize your program with -O3 -qhot.
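
A common example of non-standard aliasing in C is reading an object through a pointer to an unrelated type. The sketch below shows the problematic pattern only as a comment, next to a standard-conforming alternative using memcpy. It assumes a 32-bit IEEE 754 float; the function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Non-standard pattern (shown only as a comment): reading a float
   through a pointer to an unrelated integer type breaks the C aliasing
   rules and may misbehave once the optimizer relies on them:

       uint32_t bits = *(uint32_t *)&f;
*/

/* Standard-conforming alternative: copy the bytes with memcpy.
   Assumes float is a 32-bit IEEE 754 type. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Rewriting such code in a conforming way is generally preferable to relaxing the aliasing assumptions for the whole compilation unit, since the relaxed setting limits what the optimizer can do.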

30
Outline