Title: PGI Compilers and Tools for Scientists and Engineers
1. PGI Compilers and Tools for Scientists and Engineers
September 2006
Brent Leback, brent.leback_at_pgroup.com
Dave Norton, dave.norton_at_pgroup.com
www.pgroup.com
2. Outline of Today's Topics
- Introduction to PGI Compilers and Tools
- Documentation and Getting Help
- Basic Compiler Options
- Optimization Strategies
- 6.2 Features and Roadmap
- Questions and Answers
3. PGI Compilers and Tools: Features
- Optimization: state-of-the-art vector, parallel, IPA, feedback
- Cross-platform: AMD and Intel, 32/64-bit, Linux and Windows
- PGI Unified Binary for AMD and Intel processors
- Tools: integrated OpenMP/MPI debug and profile, IDE integration
- Parallel: MPI, OpenMP 2.5, auto-parallel for multi-core
- Comprehensive OS support: Red Hat 7.3 - 9.0, RHEL 3.0/4.0, Fedora Core 2/3/4/5, SuSE 7.1 - 10.1, SLES 8/9/10, Windows XP, Windows x64
4. PGI Tools Enable Developers to
- View x64 as a unified CPU architecture
- Extract peak performance from x64 CPUs
- Ride innovation waves from both Intel and AMD
- Use a single source base and toolset across Linux and Windows
- Develop, debug, and tune parallel applications for multi-core, multi-core SMP, and clustered multi-core SMP
5. PGI Documentation and Support
- PGI-provided documentation
- PGI User Forums at www.pgroup.com
- PGI FAQs, Tips & Techniques pages
- Email support via trs_at_pgroup.com
- Web support, a form-based system similar to email support
- Fax support
6. PGI Docs and Support, cont.
- Legacy phone support, direct access, etc.
- PGI download web page
- PGI prepared/personalized training
- PGI ISV program
- PGI Premier Service program
7. PGI Basic Compiler Options
- Basic Usage
- Language Dialects
- Target Architectures
- Debugging aids
- Optimization switches
8. PGI Basic Compiler Usage
- A compiler driver interprets options and invokes pre-processors, compilers, assembler, linker, etc.
- Options precedence: if options conflict, the last option on the command line takes precedence
- Use -Minfo to see a listing of optimizations and transformations performed by the compiler
- Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help
- Use man pages for more details on options, e.g. man pgf90
- Use -v to see under the hood
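As a quick illustration (the file name is hypothetical):

    pgf90 -fast -Minfo myprog.f90     (report optimizations and transformations)
    pgf90 -Mvect -help                (show the sub-options of -Mvect)
    pgf90 -v -fast myprog.f90         (show each tool the driver invokes)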
9. Flags to support language dialects
- Fortran
  - pgf77, pgf90, pgf95, pghpf tools
  - Suffixes: .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
  - -Mextend, -Mfixed, -Mfreeform
  - Type size: -i2, -i4, -i8, -r4, -r8, etc.
  - -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.
- C/C++
  - pgcc, pgCC, aka pgcpp
  - Suffixes: .c, .C, .cc, .cpp, .i
  - -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
  - -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs
10. Specifying the target architecture
- Not an issue on XT3
- Defaults to the type of processor/OS you are running on
- Use the -tp switch:
  - -tp k8-64, -tp p7-64, or -tp core2-64 for 64-bit code
  - -tp amd64e for AMD Opteron Rev E or later
  - -tp x64 for a unified binary
  - -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code
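For example (hypothetical file name), one build can target AMD64 explicitly or produce a PGI Unified Binary covering both AMD and Intel 64-bit processors:

    pgf90 -tp k8-64 -fastsse prog.f90     (AMD64 code only)
    pgf90 -tp x64 -fastsse prog.f90       (unified binary for AMD and Intel)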
11. Flags for debugging aids
- -g generates symbolic debug information used by a debugger
- -gopt generates debug information in the presence of optimization
- -Mbounds adds array bounds checking
- -v gives verbose output, useful for debugging system or build problems
- -Mlist will generate a listing
- -Minfo provides feedback on optimizations made by the compiler
- -S or -Mkeepasm to see the exact assembly generated
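A minimal sketch of -Mbounds in action (program and file name hypothetical); built with pgf90 -g -Mbounds bounds_demo.f90, the out-of-range store below is reported at run time instead of silently corrupting memory:

    program bounds_demo
      real :: a(10)
      integer :: i
      do i = 1, 11          ! the i = 11 iteration writes past the end of a
        a(i) = real(i)
      end do
      print *, a(10)
    end program bounds_demo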
12. Basic optimization switches
- Traditional optimization is controlled through -O<n>, where n is 0 to 4
- The -fast switch combines a common set into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
- For -Munroll, c:<n> specifies completely unrolling loops with this loop count or less
- -Munroll=n:<m> says unroll other loops m times
- -Mnoframe does not set up a stack frame
- -Mlre is loop-carried redundancy elimination
13. Basic optimization switches, cont.
- The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization
- -fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz
- -Mcache_align aligns top-level arrays and objects on cache-line boundaries
- -Mflushz flushes SSE denormal numbers to zero
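A minimal loop to try these switches on (routine and file names hypothetical); compiling with pgf95 -c -fastsse -Minfo vadd.f90 should report that vector SSE code was generated for the loop:

    subroutine vadd(a, b, c, n)
      integer :: n, i
      real :: a(n), b(n), c(n)
      do i = 1, n
        c(i) = a(i) + b(i)      ! unit stride, no dependencies: vectorizable
      end do
    end subroutine vadd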
14. Optimization Strategies
- Establish a workload
- Optimize from the top down
- Use proper tools and methods
- Processor-level optimizations, parallel methods
- Different flags/features for different types of code
15. Node-level tuning
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating examples)
- Function inlining: especially important for C and C++
- Parallelization: for Cray XD1 and multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
16. Vectorizable F90 Array Syntax (data is REAL*4)

350 !
351 ! Initialize vertex, similarity and coordinate arrays
352 !
353       Do Index = 1, NodeCount
354         IX = MOD (Index - 1, NodesX) + 1
355         IY = ((Index - 1) / NodesX) + 1
356         CoordX (IX, IY) = Position (1) + (IX - 1) * StepX
357         CoordY (IX, IY) = Position (2) + (IY - 1) * StepY
358         JetSim (Index) = SUM (Graph (:, :, Index) * &
359                   GaborTrafo (:, :, CoordX(IX,IY), CoordY(IX,IY)))
360         VertexX (Index) = MOD (Params%Graph%RandomIndex (Index) - 1, NodesX) + 1
361         VertexY (Index) = ((Params%Graph%RandomIndex (Index) - 1) / NodesX) + 1
362       End Do

The inner loop at line 358 is vectorizable and can use packed SSE instructions.
17. -fastsse to Enable SSE Vectorization; -Minfo to List Optimizations to stderr

pgf95 -fastsse -Mipa=fast -Minfo -S graphRoutines.f90

localmove:
    334, Loop unrolled 1 times (completely unrolled)
    343, Loop unrolled 2 times (completely unrolled)
    358, Generated an alternate loop for the inner loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
18. Vector SSE vs. Scalar SSE

Vector SSE:

.LB6_1245:                      # lineno: 358
        movlps  (%rdx,%rcx),%xmm2
        subl    $8,%eax
        movlps  16(%rcx,%rdx),%xmm3
        prefetcht0 64(%rcx,%rsi)
        prefetcht0 64(%rcx,%rdx)
        movhps  8(%rcx,%rdx),%xmm2
        mulps   (%rsi,%rcx),%xmm2
        movhps  24(%rcx,%rdx),%xmm3
        addps   %xmm2,%xmm0
        mulps   16(%rcx,%rsi),%xmm3
        addq    $32,%rcx
        testl   %eax,%eax
        addps   %xmm3,%xmm0
        jg      .LB6_1245

Scalar SSE:

.LB6_668:                       # lineno: 358
        movss   -12(%rax),%xmm2
        movss   -4(%rax),%xmm3
        subl    $1,%edx
        mulss   -12(%rcx),%xmm2
        addss   %xmm0,%xmm2
        mulss   -4(%rcx),%xmm3
        movss   -8(%rax),%xmm0
        mulss   -8(%rcx),%xmm0
        addss   %xmm0,%xmm2
        movss   (%rax),%xmm0
        addq    $16,%rax
        addss   %xmm3,%xmm2
        mulss   (%rcx),%xmm0
        addq    $16,%rcx
        testl   %edx,%edx
        addss   %xmm0,%xmm2
        movaps  %xmm2,%xmm0
        jg      .LB6_625

Facerec scalar: 104.2 sec    Facerec vector: 84.3 sec
19. Vectorizable C Code Fragment?

217 void func4(float *u1, float *u2, float *u3,
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx*NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx*NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i] - u1[i] + vdt*vdt*u3[i];
226   }

pgcc -fastsse -Minfo functions.c

func4:
    221, Loop unrolled 4 times
    221, Loop not vectorized due to data dependency
    223, Loop not vectorized due to data dependency
20. Pointer Arguments Inhibit Vectorization

217 void func4(float *u1, float *u2, float *u3,
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx*NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx*NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i] - u1[i] + vdt*vdt*u3[i];
226   }

pgcc -fastsse -Msafeptr -Minfo functions.c

func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Unrolled inner loop 4 times
21. C Constant Inhibits Vectorization

217 void func4(float *u1, float *u2, float *u3,
221   for (i = -NE+1, p1 = u2-ny, p2 = u2+ny; i < nx*NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx*NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i] - u1[i] + vdt*vdt*u3[i];
226   }

pgcc -fastsse -Msafeptr -Mfcon -Minfo functions.c

func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Generated vector SSE code for inner loop
         Generated 4 prefetch instructions for this loop
22. -Msafeptr Option and Pragma

-M[no]safeptr[=all|arg|auto|dummy|local|static|global]

    all       All pointers are safe
    arg       Argument pointers are safe
    local     Local pointers are safe
    static    Static local pointers are safe
    global    Global pointers are safe

#pragma [scope] [no]safeptr={arg|local|global|static|all}

    where scope is global, routine or loop
23. Common Barriers to SSE Vectorization
- Potential dependencies / C pointers: give the compiler more info with -Msafeptr, pragmas, or the restrict type qualifier
- Function calls: try inlining with -Minline or -Mipa=inline (see the sketch after this list)
- Type conversions: manually convert constants or use flags
- Large number of statements: try -Mvect=nosizelimit
- Too few iterations: usually better to unroll the loop
- Real dependencies: must restructure the loop, if possible
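A sketch of the function-call barrier (routine and file names hypothetical): as written, the call to f inhibits vectorization of the loop; compiling with pgf95 -c -fastsse -Minfo -Minline=name:f (or with -Mipa=inline at compile and link) lets the compiler inline f, after which the loop is a candidate for packed SSE code.

    real function f(x)
      real, intent(in) :: x
      f = 2.0*x + 1.0
    end function f

    subroutine scale_all(a, b, n)
      integer :: n, i
      real :: a(n), b(n)
      real, external :: f
      do i = 1, n
        b(i) = f(a(i))        ! the call blocks vectorization until f is inlined
      end do
    end subroutine scale_all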
24. Barriers to Efficient Execution of Vector SSE Loops
- Not enough work: vectors are too short
- Vectors not aligned to a cache-line boundary
- Non-unity strides
- Code bloat if altcode is generated
25. Node-level tuning (cont.)
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating example)
- Function inlining: especially important for C and C++
- Parallelization: for Cray XD1 and multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
26. What can Interprocedural Analysis and Optimization with -Mipa do for You?
- Interprocedural constant propagation
- Pointer disambiguation
- Alignment detection, alignment propagation
- Global variable mod/ref detection
- F90 shape propagation
- Function inlining
- IPA optimization of libraries, including inlining
27. Effect of IPA on the WUPWISE Benchmark
- -Mipa=fast => constant propagation => compiler sees the complex matrices are all 4x3 => completely unrolls loops
- -Mipa=fast,inline => small matrix multiplies are all inlined
28. Using Interprocedural Analysis
- Must be used at both compile time and link time
- Non-disruptive to the development process (edit/build/run)
- Speed-ups of 5-10% are common
- -Mipa=safe:<name> - safe to optimize functions which call or are called from unknown function/library name
- -Mipa=libopt - perform IPA optimizations on libraries
- -Mipa=libinline - perform IPA inlining from libraries
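A hypothetical two-file build showing -Mipa on both the compile and link lines; the link step is where IPA propagates information across files and re-optimizes as needed:

    pgf90 -fastsse -Mipa=fast,inline -c main.f90 work.f90
    pgf90 -fastsse -Mipa=fast,inline -o app main.o work.o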
29. Node-level tuning (cont.)
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating examples)
- Function inlining: especially important for C and C++
- SMP parallelization: for Cray XD1 and multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
30. Explicit Function Inlining

-Minline[=lib:<inlib>|name:<func>|except:<func>|size:<n>|levels:<n>]

    lib:<inlib>     Inline extracted functions from inlib
    name:<func>     Inline function func
    except:<func>   Do not inline function func
    size:<n>        Inline only functions smaller than n statements (approximate)
    levels:<n>      Inline n levels of functions

For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!
31. Other C++ recommendations
- Encapsulation, data hiding - small functions, inline!
- Exception handling - use --no_exceptions until 7.0
- Overloaded operators, overloaded functions - okay
- Pointer chasing - -Msafeptr, restrict qualifier, 32 bits?
- Templates, generic programming - now okay
- Inheritance, polymorphism, virtual functions - runtime lookup or check, no inlining, potential performance penalties
32. Node-level tuning (cont.)
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating examples)
- Function inlining: especially important for C and C++
- SMP parallelization: for Cray XT (CNL) on multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
33. SMP Parallelization
- -Mconcur for auto-parallelization on multi-core
- Compiler strives for parallel outer loops, vector SSE inner loops
- -Mconcur=innermost forces a vector/parallel innermost loop
- -Mconcur=cncall enables parallelization of loops with calls
- -mp to enable the OpenMP 2.5 parallel programming model (see the sketch after this list)
- See the PGI User's Guide or the OpenMP 2.5 standard
- OpenMP programs compiled without -mp "just work"
- Not supported on Cray XT (Catamount); would require some custom work
- -Mconcur and -mp can be used together!
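A minimal OpenMP sketch (routine and file names hypothetical), built with pgf90 -c -mp -fastsse saxpy_omp.f90; without -mp the directive is treated as a comment and the loop runs serially, and the same loop is also a candidate for -Mconcur auto-parallelization:

    subroutine saxpy(n, a, x, y)
      integer :: n, i
      real :: a, x(n), y(n)
    !$omp parallel do
      do i = 1, n
        y(i) = a*x(i) + y(i)
      end do
    !$omp end parallel do
    end subroutine saxpy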
34. MGRID Benchmark Main Loop

      DO 10 I3 = 2, N-1
        DO 10 I2 = 2, N-1
          DO 10 I1 = 2, N-1
 10         R(I1,I2,I3) = V(I1,I2,I3)
     &        - A(0) * ( U(I1,I2,I3) )
     &        - A(1) * ( U(I1-1,I2,I3) + U(I1+1,I2,I3)
     &                 + U(I1,I2-1,I3) + U(I1,I2+1,I3)
     &                 + U(I1,I2,I3-1) + U(I1,I2,I3+1) )
     &        - A(2) * ( U(I1-1,I2-1,I3) + U(I1+1,I2-1,I3)
     &                 + U(I1-1,I2+1,I3) + U(I1+1,I2+1,I3)
     &                 + U(I1,I2-1,I3-1) + U(I1,I2+1,I3-1)
     &                 + U(I1,I2-1,I3+1) + U(I1,I2+1,I3+1)
     &                 + U(I1-1,I2,I3-1) + U(I1-1,I2,I3+1)
     &                 + U(I1+1,I2,I3-1) + U(I1+1,I2,I3+1) )
     &        - A(3) * ( U(I1-1,I2-1,I3-1) + U(I1+1,I2-1,I3-1)
     &                 + U(I1-1,I2+1,I3-1) + U(I1+1,I2+1,I3-1)
     &                 + U(I1-1,I2-1,I3+1) + U(I1+1,I2-1,I3+1)
     &                 + U(I1-1,I2+1,I3+1) + U(I1+1,I2+1,I3+1) )
35. Auto-parallel MGRID: Overall Speed-up is 40% on Dual-core AMD Opteron

pgf95 -fastsse -Mipa=fast,inline -Minfo -Mconcur mgrid.f

resid:
    . . .
    189, Parallel code for non-innermost loop activated
         if loop count > 33; block distribution
    291, 4 loop-carried redundant expressions removed
         with 12 operations and 16 arrays
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop
36. Node-level tuning (cont.)
- Vectorization: packed SSE instructions maximize performance
- Interprocedural Analysis (IPA): use it! (motivating examples)
- Function inlining: especially important for C and C++
- SMP parallelization: for Cray XD1 and multi-core processors
- Miscellaneous optimizations: hit or miss, but worth a try
37. Miscellaneous Optimizations (1)
- -Mfprelaxed - single-precision sqrt, rsqrt, and div performed using a reduced-precision reciprocal approximation
- -lacml and -lacml_mp link in the AMD Core Math Library
- -Mprefetch=d:<p>,n:<q> controls the prefetching distance and the max number of prefetch instructions per loop
- -tp k8-32 can result in a big performance win on some C/C++ codes that don't require > 2GB addressing; pointer and long data become 32 bits
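A hypothetical build line combining these flags (it assumes the ACML install directory is on the library search path, and uses the -Mprefetch sub-option syntax listed above with example values):

    pgf90 -fastsse -Mprefetch=d:8,n:4 prog.f90 -lacml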
38. Miscellaneous Optimizations (2)
- -O3 - more aggressive hoisting and scalar replacement; not part of -fastsse, so always time your code to make sure it's faster
- For C++ codes: --no_exceptions -Minline=levels:10
- -M[no]movnt - disable / force non-temporal moves
- -V[version] to switch between PGI releases at the file level
- -Mvect=noaltcode - disable multiple versions of loops
39. What's New in PGI 6.2
- Industry-leading SPECFP06 and SPECINT06 performance
- PGI Visual Fortran for Windows x64 and Windows XP
- Full-featured PGI Workstation/Server for 32-bit Windows XP
- PGI Unified Binary performance enhancements
- More gcc extensions / compatibility
- New SSE intrinsics
- PGI CDK ROLL for ROCKS clusters
- MPICH1 and MPICH2 support in the PGI CDK
- Incremental debugger/profiler enhancements
- Limited tuning for Intel Core2 (Woodcrest et al)
40. PGI Visual Fortran 6.2
- Deep integration with Visual Studio 2005
- PGI-custom Fortran-aware text editor
- Syntax coloring, keyword completion
- Fortran 95 Intrinsics tips
- PGI-custom project system and icons
- PGI-custom property pages
- One-touch project build / execute
- MS Visual C++ interoperability
- Mixed VC++ / PGI Fortran applications
- PGI-custom parallel F95 debug engine
- OpenMP 2.5 / threads debugging
- Just-in-time debugging features
- DVF/CVF compatibility features
- Win32 API support
- Complete (Visual Studio bundled) and Standard (no Visual Studio) versions
- PGI Unified Binary executables
- Auto-parallel for multi-core CPUs
- Native OpenMP 2.5 parallelization
- World-class performance
- 64-bit Windows x64 support
- 32-bit Windows 2000/XP support
- Optimization/support for AMD64
- Optimization/support for Intel EM64T
- DEC/IBM/Cray compatibility features
- cpp-compatible pre-processing
- Visual Studio 2005 bundled
- MSDN Library bundled
- GUI parallel debugging/profiling
- Assembly-optimized BLAS/LAPACK/FFTs
- Boxed CD-ROM/Manuals media kit
PVF Workstation Complete Only
41. On the PGI Roadmap
- PGI Unified Binary directives and enhancements
- Aggressive Intel Core2 and next-gen AMD64 tuning
- Industry-leading SPECFP06 and SPECINT06 performance on Linux/Windows, AMD/Intel, 32/64-bit
- Incremental PGDBG enhancements, improved C++ support
- MPI debugging / profiling for Windows x64 CCS clusters
- All-new cross-platform PGPROF performance profiler
- Fortran 2003 / C99 language features
- GCC front-end compatibility, g++ interoperability
- PGC++ tuning, PGC++/VC++ interoperability
- Windows SUA and Apple/MacOS platform support
- De facto standard scalable C/Fortran language/tools extensions
42. Questions?
Reach me at brent.leback_at_pgroup.com
Thanks for your time!