Title: PGI Compilers and Tools for Scientists and Engineers
1. PGI Compilers and Tools for Scientists and Engineers
September 2006
Brent Leback, brent.leback_at_pgroup.com
Dave Norton, dave.norton_at_pgroup.com
www.pgroup.com
2. Outline of Today's Topics
- Introduction to PGI Compilers and Tools
- Documentation and Getting Help
- Basic Compiler Options
- Optimization Strategies
- 6.2 Features and Roadmap
- Questions and Answers
3. PGI Compilers and Tools: Features
- Optimization: state-of-the-art vector, parallel, IPA, feedback
- Cross-platform: AMD and Intel, 32/64-bit, Linux and Windows
- PGI Unified Binary for AMD and Intel processors
- Tools: integrated OpenMP/MPI debug and profile, IDE integration
- Parallel: MPI, OpenMP 2.5, auto-parallel for multi-core
- Comprehensive OS support: Red Hat 7.3 - 9.0, RHEL 3.0/4.0, Fedora Core 2/3/4/5, SuSE 7.1 - 10.1, SLES 8/9/10, Windows XP, Windows x64
4. PGI Tools Enable Developers to
- View x64 as a unified CPU architecture
- Extract peak performance from x64 CPUs
- Ride innovation waves from both Intel and AMD
- Use a single source base and toolset across Linux and Windows
- Develop, debug, and tune parallel applications for multi-core, multi-core SMP, and clustered multi-core SMP
5. PGI Documentation and Support
- PGI-provided documentation
- PGI User Forums at www.pgroup.com
- PGI FAQs, Tips & Techniques pages
- Email support via trs_at_pgroup.com
- Web support, a form-based system similar to email support
- Fax support
6. PGI Docs and Support, cont.
- Legacy phone support, direct access, etc.
- PGI download web page
- PGI prepared/personalized training
- PGI ISV program
- PGI Premier Service program
7. PGI Basic Compiler Options
- Basic Usage
- Language Dialects
- Target Architectures
- Debugging aids
- Optimization switches
8. PGI Basic Compiler Usage
- A compiler driver interprets options and invokes pre-processors, compilers, assembler, linker, etc.
- Options precedence: if options conflict, the last option on the command line takes precedence
- Use -Minfo to see a listing of optimizations and transformations performed by the compiler
- Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help
- Use man pages for more details on options, e.g. man pgf90
- Use -v to see under the hood
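As a quick illustration (the file name is hypothetical):

    pgf90 -fast -Minfo myprog.f90     (report optimizations and transformations)
    pgf90 -Mvect -help                (show the sub-options of -Mvect)
    pgf90 -v -fast myprog.f90         (show each tool the driver invokes)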
9. Flags to support language dialects
- Fortran
  - pgf77, pgf90, pgf95, pghpf tools
  - Suffixes: .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
  - -Mextend, -Mfixed, -Mfreeform
  - Type size: -i2, -i4, -i8, -r4, -r8, etc.
  - -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.
- C/C++
  - pgcc, pgCC, aka pgcpp
  - Suffixes: .c, .C, .cc, .cpp, .i
  - -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
  - -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs
10. Specifying the target architecture
- Not an issue on XT3
- Defaults to the type of processor/OS you are running on
- Use the -tp switch:
  - -tp k8-64, -tp p7-64, or -tp core2-64 for 64-bit code
  - -tp amd64e for AMD Opteron Rev E or later
  - -tp x64 for a unified binary
  - -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code
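For example (hypothetical file name), one build can target AMD64 explicitly or produce a PGI Unified Binary covering both AMD and Intel 64-bit processors:

    pgf90 -tp k8-64 -fastsse prog.f90     (AMD64 code only)
    pgf90 -tp x64 -fastsse prog.f90       (unified binary for AMD and Intel)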
11. Flags for debugging aids
- -g generates symbolic debug information used by a debugger
- -gopt generates debug information in the presence of optimization
- -Mbounds adds array bounds checking
- -v gives verbose output, useful for debugging system or build problems
- -Mlist will generate a listing
- -Minfo provides feedback on optimizations made by the compiler
- -S or -Mkeepasm to see the exact assembly generated
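A minimal sketch of -Mbounds in action (program and file name hypothetical); built with pgf90 -g -Mbounds bounds_demo.f90, the out-of-range store below is reported at run time instead of silently corrupting memory:

    program bounds_demo
      real :: a(10)
      integer :: i
      do i = 1, 11          ! the i = 11 iteration writes past the end of a
        a(i) = real(i)
      end do
      print *, a(10)
    end program bounds_demo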
12. Basic optimization switches
- Traditional optimization is controlled through -O<n>, where n is 0 to 4
- The -fast switch combines a common set into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
- For -Munroll, c:<n> specifies completely unrolling loops with this loop count or less
- -Munroll=n:<m> says unroll other loops m times
- -Mnoframe does not set up a stack frame
- -Mlre is loop-carried redundancy elimination
13. Basic optimization switches, cont.
- The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization
- -fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz
- -Mcache_align aligns top-level arrays and objects on cache-line boundaries
- -Mflushz flushes SSE denormal numbers to zero
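A minimal loop to try these switches on (routine and file names hypothetical); compiling with pgf95 -c -fastsse -Minfo vadd.f90 should report that vector SSE code was generated for the loop:

    subroutine vadd(a, b, c, n)
      integer :: n, i
      real :: a(n), b(n), c(n)
      do i = 1, n
        c(i) = a(i) + b(i)      ! unit stride, no dependencies: vectorizable
      end do
    end subroutine vadd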
14. Optimization Strategies
- Establish a workload
- Optimize from the top down
- Use proper tools and methods
- Processor-level optimizations, parallel methods
- Different flags/features for different types of code
15. Node-level tuning
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating examples)
- Function inlining: especially important for C and C++
- Parallelization: for Cray XD1 and multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
16. Vectorizable F90 Array Syntax (data is REAL*4)

350 !
351 ! Initialize vertex, similarity and coordinate arrays
352 !
353       Do Index = 1, NodeCount
354         IX = MOD (Index - 1, NodesX) + 1
355         IY = ((Index - 1) / NodesX) + 1
356         CoordX (IX, IY) = Position (1) + (IX - 1) * StepX
357         CoordY (IX, IY) = Position (2) + (IY - 1) * StepY
358         JetSim (Index) = SUM (Graph (:, :, Index) * &
359                   GaborTrafo (:, :, CoordX(IX,IY), CoordY(IX,IY)))
360         VertexX (Index) = MOD (Params%Graph%RandomIndex (Index) - 1, NodesX) + 1
361         VertexY (Index) = ((Params%Graph%RandomIndex (Index) - 1) / NodesX) + 1
362       End Do

The inner loop at line 358 is vectorizable and can use packed SSE instructions.
17. -fastsse to Enable SSE Vectorization; -Minfo to List Optimizations to stderr

pgf95 -fastsse -Mipa=fast -Minfo -S graphRoutines.f90

localmove:
    334, Loop unrolled 1 times (completely unrolled)
    343, Loop unrolled 2 times (completely unrolled)
    358, Generated an alternate loop for the inner loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
18. Vector SSE vs. Scalar SSE

Vector SSE:

.LB6_1245:                      # lineno: 358
        movlps  (%rdx,%rcx),%xmm2
        subl    $8,%eax
        movlps  16(%rcx,%rdx),%xmm3
        prefetcht0 64(%rcx,%rsi)
        prefetcht0 64(%rcx,%rdx)
        movhps  8(%rcx,%rdx),%xmm2
        mulps   (%rsi,%rcx),%xmm2
        movhps  24(%rcx,%rdx),%xmm3
        addps   %xmm2,%xmm0
        mulps   16(%rcx,%rsi),%xmm3
        addq    $32,%rcx
        testl   %eax,%eax
        addps   %xmm3,%xmm0
        jg      .LB6_1245

Scalar SSE:

.LB6_668:                       # lineno: 358
        movss   -12(%rax),%xmm2
        movss   -4(%rax),%xmm3
        subl    $1,%edx
        mulss   -12(%rcx),%xmm2
        addss   %xmm0,%xmm2
        mulss   -4(%rcx),%xmm3
        movss   -8(%rax),%xmm0
        mulss   -8(%rcx),%xmm0
        addss   %xmm0,%xmm2
        movss   (%rax),%xmm0
        addq    $16,%rax
        addss   %xmm3,%xmm2
        mulss   (%rcx),%xmm0
        addq    $16,%rcx
        testl   %edx,%edx
        addss   %xmm0,%xmm2
        movaps  %xmm2,%xmm0
        jg      .LB6_625

Facerec scalar: 104.2 sec    Facerec vector: 84.3 sec
19. Vectorizable C Code Fragment?

217 void func4(float *u1, float *u2, float *u3,
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx*NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx*NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i] - u1[i] + vdt*vdt*u3[i];
226   }

pgcc -fastsse -Minfo functions.c

func4:
    221, Loop unrolled 4 times
    221, Loop not vectorized due to data dependency
    223, Loop not vectorized due to data dependency
20. Pointer Arguments Inhibit Vectorization

217 void func4(float *u1, float *u2, float *u3,
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx*NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx*NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i] - u1[i] + vdt*vdt*u3[i];
226   }

pgcc -fastsse -Msafeptr -Minfo functions.c

func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Unrolled inner loop 4 times
21. C Constant Inhibits Vectorization

217 void func4(float *u1, float *u2, float *u3,
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx*NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx*NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i] - u1[i] + vdt*vdt*u3[i];
226   }

pgcc -fastsse -Msafeptr -Mfcon -Minfo functions.c

func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Generated vector SSE code for inner loop
         Generated 4 prefetch instructions for this loop
22. -Msafeptr Option and Pragma

-M[no]safeptr[=all|arg|auto|dummy|local|static|global]

    all       All pointers are safe
    arg       Argument pointers are safe
    local     Local pointers are safe
    static    Static local pointers are safe
    global    Global pointers are safe

#pragma [scope] [no]safeptr={arg|local|global|static|all}

    where scope is global, routine or loop
23. Common Barriers to SSE Vectorization
- Potential dependencies / C pointers: give the compiler more info with -Msafeptr, pragmas, or the restrict type qualifier
- Function calls: try inlining with -Minline or -Mipa=inline (see the sketch after this list)
- Type conversions: manually convert constants or use flags
- Large number of statements: try -Mvect=nosizelimit
- Too few iterations: usually better to unroll the loop
- Real dependencies: must restructure the loop, if possible
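A sketch of the function-call barrier (routine and file names hypothetical): as written, the call to f inhibits vectorization of the loop; compiling with pgf95 -c -fastsse -Minfo -Minline=name:f (or with -Mipa=inline at compile and link) lets the compiler inline f, after which the loop is a candidate for packed SSE code.

    real function f(x)
      real, intent(in) :: x
      f = 2.0*x + 1.0
    end function f

    subroutine scale_all(a, b, n)
      integer :: n, i
      real :: a(n), b(n)
      real, external :: f
      do i = 1, n
        b(i) = f(a(i))        ! the call blocks vectorization until f is inlined
      end do
    end subroutine scale_all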
24. Barriers to Efficient Execution of Vector SSE Loops
- Not enough work: vectors are too short
- Vectors not aligned to a cache-line boundary
- Non-unity strides
- Code bloat if altcode is generated
25. Node-level tuning (cont.)
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating example)
- Function inlining: especially important for C and C++
- Parallelization: for Cray XD1 and multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
26. What can Interprocedural Analysis and Optimization with -Mipa do for You?
- Interprocedural constant propagation
- Pointer disambiguation
- Alignment detection, alignment propagation
- Global variable mod/ref detection
- F90 shape propagation
- Function inlining
- IPA optimization of libraries, including inlining
27. Effect of IPA on the WUPWISE Benchmark
- -Mipa=fast => constant propagation => compiler sees the complex matrices are all 4x3 => completely unrolls loops
- -Mipa=fast,inline => small matrix multiplies are all inlined
28. Using Interprocedural Analysis
- Must be used at both compile time and link time
- Non-disruptive to the development process (edit/build/run)
- Speed-ups of 5-10% are common
- -Mipa=safe:<name> - safe to optimize functions which call or are called from unknown function/library name
- -Mipa=libopt - perform IPA optimizations on libraries
- -Mipa=libinline - perform IPA inlining from libraries
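A hypothetical two-file build showing -Mipa on both the compile and link lines; the link step is where IPA propagates information across files and re-optimizes as needed:

    pgf90 -fastsse -Mipa=fast,inline -c main.f90 work.f90
    pgf90 -fastsse -Mipa=fast,inline -o app main.o work.o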
29. Node-level tuning (cont.)
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating examples)
- Function inlining: especially important for C and C++
- SMP parallelization: for Cray XD1 and multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
30. Explicit Function Inlining

-Minline[=lib:<inlib>|name:<func>|except:<func>|size:<n>|levels:<n>]

    lib:<inlib>     Inline extracted functions from inlib
    name:<func>     Inline function func
    except:<func>   Do not inline function func
    size:<n>        Inline only functions smaller than n statements (approximate)
    levels:<n>      Inline n levels of functions

For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!
31. Other C++ recommendations
- Encapsulation, data hiding - small functions, inline!
- Exception handling - use --no_exceptions until 7.0
- Overloaded operators, overloaded functions - okay
- Pointer chasing - -Msafeptr, restrict qualifier, 32 bits?
- Templates, generic programming - now okay
- Inheritance, polymorphism, virtual functions - runtime lookup or check, no inlining, potential performance penalties
32. Node-level tuning (cont.)
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating examples)
- Function inlining: especially important for C and C++
- SMP parallelization: for Cray XT (CNL) on multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
33. SMP Parallelization
- -Mconcur for auto-parallelization on multi-core
- Compiler strives for parallel outer loops, vector SSE inner loops
- -Mconcur=innermost forces a vector/parallel innermost loop
- -Mconcur=cncall enables parallelization of loops with calls
- -mp to enable the OpenMP 2.5 parallel programming model (see the sketch after this list)
- See the PGI User's Guide or the OpenMP 2.5 standard
- OpenMP programs compiled without -mp "just work"
- Not supported on Cray XT (Catamount); would require some custom work
- -Mconcur and -mp can be used together!
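A minimal OpenMP sketch (routine and file names hypothetical), built with pgf90 -c -mp -fastsse saxpy_omp.f90; without -mp the directive is treated as a comment and the loop runs serially, and the same loop is also a candidate for -Mconcur auto-parallelization:

    subroutine saxpy(n, a, x, y)
      integer :: n, i
      real :: a, x(n), y(n)
    !$omp parallel do
      do i = 1, n
        y(i) = a*x(i) + y(i)
      end do
    !$omp end parallel do
    end subroutine saxpy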
34. MGRID Benchmark Main Loop

      DO 10 I3 = 2, N-1
        DO 10 I2 = 2, N-1
          DO 10 I1 = 2, N-1
 10         R(I1,I2,I3) = V(I1,I2,I3)
     &        - A(0) * ( U(I1,I2,I3) )
     &        - A(1) * ( U(I1-1,I2,I3) + U(I1+1,I2,I3)
     &                 + U(I1,I2-1,I3) + U(I1,I2+1,I3)
     &                 + U(I1,I2,I3-1) + U(I1,I2,I3+1) )
     &        - A(2) * ( U(I1-1,I2-1,I3) + U(I1+1,I2-1,I3)
     &                 + U(I1-1,I2+1,I3) + U(I1+1,I2+1,I3)
     &                 + U(I1,I2-1,I3-1) + U(I1,I2+1,I3-1)
     &                 + U(I1,I2-1,I3+1) + U(I1,I2+1,I3+1)
     &                 + U(I1-1,I2,I3-1) + U(I1-1,I2,I3+1)
     &                 + U(I1+1,I2,I3-1) + U(I1+1,I2,I3+1) )
     &        - A(3) * ( U(I1-1,I2-1,I3-1) + U(I1+1,I2-1,I3-1)
     &                 + U(I1-1,I2+1,I3-1) + U(I1+1,I2+1,I3-1)
     &                 + U(I1-1,I2-1,I3+1) + U(I1+1,I2-1,I3+1)
     &                 + U(I1-1,I2+1,I3+1) + U(I1+1,I2+1,I3+1) )
35. Auto-parallel MGRID: Overall Speed-up is 40% on Dual-core AMD Opteron

pgf95 -fastsse -Mipa=fast,inline -Minfo -Mconcur mgrid.f

resid:
    . . .
    189, Parallel code for non-innermost loop activated
         if loop count > 33; block distribution
    291, 4 loop-carried redundant expressions removed
         with 12 operations and 16 arrays
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop
36. Node-level tuning (cont.)
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating examples)
- Function inlining: especially important for C and C++
- SMP parallelization: for Cray XD1 and multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
37. Miscellaneous Optimizations (1)
- -Mfprelaxed - single-precision sqrt, rsqrt, and div performed using a reduced-precision reciprocal approximation
- -lacml and -lacml_mp link in the AMD Core Math Library
- -Mprefetch=d:<p>,n:<q> controls the prefetching distance and the max number of prefetch instructions per loop
- -tp k8-32 can result in a big performance win on some C/C++ codes that don't require > 2GB addressing; pointer and long data become 32 bits
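A hypothetical build line combining these flags (it assumes the ACML install directory is on the library search path, and uses the -Mprefetch sub-option syntax listed above with example values):

    pgf90 -fastsse -Mprefetch=d:8,n:4 prog.f90 -lacml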
38. Miscellaneous Optimizations (2)
- -O3 - more aggressive hoisting and scalar replacement; not part of -fastsse, so always time your code to make sure it's faster
- For C++ codes: --no_exceptions -Minline=levels:10
- -M[no]movnt - disable / force non-temporal moves
- -V[version] to switch between PGI releases at the file level
- -Mvect=noaltcode - disable multiple versions of loops
39. What's New in PGI 6.2
- Industry-leading SPECFP06 and SPECINT06 performance
- PGI Visual Fortran for Windows x64 and Windows XP
- Full-featured PGI Workstation/Server for 32-bit Windows XP
- PGI Unified Binary performance enhancements
- More gcc extensions / compatibility
- New SSE intrinsics
- PGI CDK ROLL for ROCKS clusters
- MPICH1 and MPICH2 support in the PGI CDK
- Incremental debugger/profiler enhancements
- Limited tuning for Intel Core2 (Woodcrest et al)
40. PGI Visual Fortran 6.2
- Deep integration with Visual Studio 2005
- PGI-custom Fortran-aware text editor
- Syntax coloring, keyword completion
- Fortran 95 Intrinsics tips
- PGI-custom project system and icons
- PGI-custom property pages
- One-touch project build / execute
- MS Visual C++ interoperability
- Mixed VC++ / PGI Fortran applications
- PGI-custom parallel F95 debug engine
- OpenMP 2.5 / threads debugging
- Just-in-time debugging features
- DVF/CVF compatibility features
- Win32 API support
- Complete (Visual Studio bundled) and Standard (no Visual Studio) versions
- PGI Unified Binary executables
- Auto-parallel for multi-core CPUs
- Native OpenMP 2.5 parallelization
- World-class performance
- 64-bit Windows x64 support
- 32-bit Windows 2000/XP support
- Optimization/support for AMD64
- Optimization/support for Intel EM64T
- DEC/IBM/Cray compatibility features
- cpp-compatible pre-processing
- Visual Studio 2005 bundled
- MSDN Library bundled
- GUI parallel debugging/profiling
- Assembly-optimized BLAS/LAPACK/FFTs
- Boxed CD-ROM/Manuals media kit
PVF Workstation Complete Only
41. On the PGI Roadmap
- PGI Unified Binary directives and enhancements
- Aggressive Intel Core2 and next-gen AMD64 tuning
- Industry-leading SPECFP06 and SPECINT06 performance on Linux/Windows, AMD/Intel, 32/64-bit
- Incremental PGDBG enhancements, improved C++ support
- MPI debugging / profiling for Windows x64 CCS clusters
- All-new cross-platform PGPROF performance profiler
- Fortran 2003 / C99 language features
- GCC front-end compatibility, g++ interoperability
- PGC++ tuning, PGC++/VC++ interoperability
- Windows SUA and Apple/MacOS platform support
- De facto standard scalable C/Fortran language/tools extensions
42. Questions?
Reach me at brent.leback_at_pgroup.com
Thanks for your time!