Title: High Performance Computing
1  High Performance Computing
- By Mark Mc Keown
- ITC Research Computing Support Group
- res-consult@virginia.edu
2  High Performance Computing
- Compiler Options for Producing the Fastest Executable.
- Code Optimization.
- Profiling and Timing.
- Auto-Parallelization.
- OpenMP.
- MPI.
- Hardware Resources Available to UVa Researchers.
3  Compiler Options for Producing the Fastest Executable
- Using optimization flags when compiling can greatly reduce the runtime of an executable.
- Each compiler has a different set of options for creating the fastest executable.
- Often the best compiler options can only be arrived at by empirical testing and timing of your code.
- A good reference for compiler flags that can be used with various architectures is the SPEC web site, www.spec.org.
- Read the compiler manpages.
4  Example of Compiler Flags Used on a Sun Ultra 10 Workstation

  Compiler      Flags                          Runtime
  SUNWpro 4.2   none                           23 min 22.4 s
  SUNWpro 5.0   none                           14 min 21.0 s
  SUNWpro 5.0   -O3                             2 min 24.4 s
  SUNWpro 5.0   -fast                           2 min 06.7 s
  SUNWpro 5.0   -fast -xcrossfile               1 min 59.6 s
  SUNWpro 5.0   -fast -xcrossfile -xprofile     1 min 57.3 s
7  Useful Compiler Options
- IBM AIX for SP2: -O3 -qstrict -qtune=p2sc -qhot -qarch=p2sc -qipa
- SGI: -Ofast
- Sun: -fast -xcrossfile
- GNU: -O3 -ffast-math -funroll-loops
8  Code Optimization
- Strength Reduction
- A*X**2.0 becomes A*X*X
- A*X/2.0 becomes A*0.5*X
- A * ( 2.0*exp(T) + exp(2*T) ) becomes
-   tmp = exp(T)
-   A * ( 2.0*tmp + tmp*tmp )
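A minimal C sketch of these transformations (an addition to the slide; names and values are illustrative). Since exp(2*T) equals exp(T)*exp(T), one exp call can serve for both terms:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double A = 3.0, X = 1.5, T = 0.7;

        /* Before: two exp() calls and a general power function. */
        double slow = A * (2.0 * exp(T) + exp(2.0 * T)) + pow(X, 2.0);

        /* After strength reduction: exp(T) is computed once and reused,
           and X**2.0 becomes a single multiplication. */
        double tmp  = exp(T);
        double fast = A * (2.0 * tmp + tmp * tmp) + X * X;

        printf("%f %f\n", slow, fast);   /* the two results agree */
        return 0;
    }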
10  Loop Ordering
- Fortran stores arrays column-major, so the inner loop should run over the first index:
-   DO J = 1, N
-     DO I = 1, N
-       A(I,J) = B(I,J) + C(I,J)*D
-     ENDDO
-   ENDDO
- C/C++ stores arrays row-major, so the inner loop should run over the last index:
-   for (I = 0; I < n; I++)
-     for (J = 0; J < n; J++)
-       a[I][J] = a[I][J] + C[I][J]*D;
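A small C benchmark sketch (an addition, with illustrative array sizes) that times both orderings; compiled without aggressive optimization, the row-major order is typically much faster because it walks memory contiguously:

    #include <stdio.h>
    #include <time.h>

    #define N 2000
    static double a[N][N], b[N][N], c[N][N];   /* ~96MB, statically allocated */

    int main(void)
    {
        double d = 1.5;
        clock_t t0, t1;

        /* Good for C: inner loop over the last index is stride-1 in memory. */
        t0 = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = b[i][j] + c[i][j] * d;
        t1 = clock();
        printf("row-major order:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        /* Bad for C (but right for Fortran): strides through memory by N elements. */
        t0 = clock();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = b[i][j] + c[i][j] * d;
        t1 = clock();
        printf("column-major order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        return 0;
    }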
11  Other Optimizations
- Copy Propagation
- Constant Folding
- Dead Code Removal
- Induction Variable Simplification
- Function Inlining
- Loop Invariant Conditionals
- Variable Renaming
- Loop Invariant Code Motion (see the sketch after this list)
- Loop Fusion
- Pushing Loops inside Subroutines
- Loop Index Dependent Conditionals
- Loop Unrolling
- Loop Stride Size
- Floating Point Optimizations
- Faster Algorithms
- External Libraries
- Assembly Code
- Lookup Tables
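A C sketch (an addition, with illustrative names) of two items from this list, loop-invariant code motion and loop unrolling:

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double a[N], b[N], x = 2.0, y = 3.0;
        for (int i = 0; i < N; i++) b[i] = i;

        /* Before: x*y is loop-invariant but recomputed every iteration. */
        for (int i = 0; i < N; i++)
            a[i] = b[i] * x * y;

        /* After loop-invariant code motion: hoist the constant product. */
        double xy = x * y;
        for (int i = 0; i < N; i++)
            a[i] = b[i] * xy;

        /* Manual 4-way loop unrolling (N is a multiple of 4 here): fewer
           loop-termination tests and more independent work per iteration. */
        for (int i = 0; i < N; i += 4) {
            a[i]     = b[i]     * xy;
            a[i + 1] = b[i + 1] * xy;
            a[i + 2] = b[i + 2] * xy;
            a[i + 3] = b[i + 3] * xy;
        }

        printf("%f\n", a[N - 1]);
        return 0;
    }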
13  Code Optimization References
- http://www.itc.virginia.edu/research/Optim.html
- http://www.cs.utk.edu/mucci/MPPopt.html
- http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg245155.pdf
- http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg245611.pdf
- http://www.npaci.edu/T3E/single_pe.html
- http://www.epcc.ed.ac.uk/epcc-tec/documents/coursemat.html
- Software Optimization for High Performance Computing by Crawford and Wadleigh
- High Performance Computing by Kevin Dowd et al.
14  Timing and Profiling Codes
- "Premature optimization is the root of all evil" - Donald Knuth
- The 80-20 rule: codes generally spend 80% of their time executing 20% of their instructions.
15  time command
- Useful for measuring how long a code runs; provides both user and system time.
- /usr/bin/time test.x
-   real 111.7
-   user 109.7
-   sys 0.5
16  etime function
- Can be used to time sections of Fortran code:
-   real*4 tarray(2), etime
-   real*4 start, finish
-   start = etime(tarray)
-   ...
-   finish = etime(tarray)
-   write(*,*) finish - start
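For C codes an analogous approach (a sketch, not from the slides) is to wrap the section of interest in timer calls; note that etime reports elapsed CPU time, while gettimeofday below reports wall-clock time:

    #include <stdio.h>
    #include <sys/time.h>

    /* Return wall-clock seconds, used like the etime deltas in Fortran. */
    static double wall_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        double start = wall_seconds();

        /* ... section of code to be timed ... */
        double s = 0.0;
        for (long i = 1; i <= 10000000L; i++)
            s += 1.0 / (double)i;

        double finish = wall_seconds();
        printf("sum = %f, elapsed = %f s\n", s, finish - start);
        return 0;
    }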
17  gprof
- Provides a very detailed breakdown of how much time is spent in each function of a code.
- Compile with -pg:
-   f77 -pg -O -o test.x test.f
- Execute the code in the normal manner:
-   ./test.x
- Create the profile with gprof:
-   gprof test.x > test.prof
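A toy C program (hypothetical, for illustration) whose flat profile should be dominated by a single routine; profile it the same way, e.g. cc -pg -O -o test.x test.c, then ./test.x, then gprof test.x > test.prof:

    #include <stdio.h>

    /* Deliberately expensive function: gprof's flat profile should show
       nearly all of the runtime here. */
    static double hot(int n)
    {
        double s = 0.0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                s += 1.0 / ((double)i * (double)j);
        return s;
    }

    static double cold(void)
    {
        return 42.0;   /* trivial: should barely register in the profile */
    }

    int main(void)
    {
        printf("%f %f\n", hot(5000), cold());
        return 0;
    }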
18  Hardware Counters
- Some CPUs have special registers that allow them to count certain events.
- Events can include FLOPS, cache misses, and floating point exceptions.
- Can provide a very detailed picture of what is happening in a region of the code.
- Vendor tools can provide easy access to hardware counters; third-party tools such as PAPI and PCL can be used across a range of systems.
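A minimal sketch using PAPI's classic high-level counter API (assumed here; the exact calls and the availability of the PAPI_FP_OPS preset depend on the platform and PAPI version):

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int events[1] = { PAPI_FP_OPS };   /* count floating point operations */
        long long counts[1];
        double a = 0.0;

        /* Start counting the requested event on this CPU. */
        if (PAPI_start_counters(events, 1) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return 1;
        }

        for (int i = 0; i < 1000000; i++)
            a += 1.0e-6;

        /* Stop counting and read back the result (counts are approximate). */
        if (PAPI_stop_counters(counts, 1) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return 1;
        }

        printf("a = %f, FLOPs counted = %lld\n", a, counts[0]);
        return 0;
    }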
19  Vendor Tools for Profiling and Timing
- Most vendors provide specialized tools for profiling and timing codes, usually with a simple-to-use GUI.
- Sun: Forte Workshop
- SGI: SpeedShop
20  Timing and Profiling References
- High Performance Computing by Dowd et al.
- Unix for Fortran Programmers by Loukides
- http://www.kfa-juelich.de/zam/PCL
- http://icl.cs.utk.edu/projects/papi
- http://www.sun.com/forte/
- http://www.sgi.com/developers/devtools/tools/prodev.html
- http://www.itc.virginia.edu/research/profile.html
- http://www.deloire.com/gnu/docs/binutils/grprof_toc.html
21  Single Processor System
22  Shared Memory System
23  Distributed Memory System
24  NUMA - Non-Uniform Memory Architecture
25  Auto-Parallelization
- Can be used to program shared memory machines.
- Very simple to use: only requires a compiler flag.
- Safe: the compiler guarantees the code will execute safely across the processors.
26  A Parallelized Loop
- DO I = 1, 100
-   A(I) = B(I) + C(I)
- ENDDO
- CPU 1:
- DO I = 1, 50
-   A(I) = B(I) + C(I)
- ENDDO
- CPU 2:
- DO I = 51, 100
-   A(I) = B(I) + C(I)
- ENDDO
27  Compiler Flags for Auto-parallelization
- Sun:
-   f77 -fast -autopar -loopinfo -o code.x code.f
- SGI:
-   f77 -Ofast -IPA -apo list -o code.x code.f
- IBM:
-   xlf_r -O3 -qsmp=auto -qreport=smplist -qsource -o code.x code.f
28  Where Auto-Parallelization Fails
- Not enough work in the loop
- I/O statements in the loop
- Early exits from the loop (while loops)
- Scalar Dependence
- Reductions
- Subroutine Calls
- Array Dependence (dependence and reduction loops are sketched below)
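Two illustrative C loops (an addition, not from the slides) that an auto-parallelizing compiler must leave serial:

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Array dependence: iteration i reads a[i-1], which iteration i-1
           writes, so the iterations cannot safely run in any other order. */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];

        /* Reduction: every iteration updates the same scalar; without
           special handling this serializes the loop too. */
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("%f\n", sum);
        return 0;
    }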
29  OpenMP
- Can be used on SMP machines only.
- OpenMP solves a lot of the problems of auto-parallelization.
- Simple to use: simply add pragmas or directives to define which loops to parallelize.
- Supported by most vendors.
30  OpenMP Example
- !$OMP PARALLEL DO
- DO I = 2, N
-   B(I) = (A(I) + C(I)) / 2.0
- ENDDO
- !$OMP END PARALLEL DO
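The C/C++ equivalent (a sketch; the loop body mirrors the Fortran example above, and the enabling compiler flag varies by compiler):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; c[i] = 2.0 * i; }

        /* The pragma asks the compiler to split the loop iterations
           across the available threads. */
        #pragma omp parallel for
        for (int i = 1; i < N; i++)
            b[i] = (a[i] + c[i]) / 2.0;

        printf("b[N-1] = %f, threads available = %d\n",
               b[N - 1], omp_get_max_threads());
        return 0;
    }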
31  Compiler Flags for OpenMP
- Sun: flag is -explicitpar
- IBM: flag is -qsmp=omp
- SGI: flag is -MP:open_mp=ON
32  References for Auto-Parallelization and OpenMP
- www.openmp.org
- http://techpubs.sgi.com/library/
- www.sgi.com/software/openmp
- www.redbooks.ibm.com/redbooks/SG245611.html
- www.npaci.edu/online/v3.16/SCAN.html
- High Performance Computing by K. Dowd et al.
- Parallel Programming in OpenMP by Chandra et al.
33  Message Passing Interface
- MPI can be used to program a distributed memory system, such as a Linux cluster or the IBM SP2, as well as SMP machines.
- MPI is superseding PVM.
- MPI is an industry standard supported by most vendors.
- MPI versions run on most Unix and Windows NT/2000 systems.
34  MPI
- MPI is a library of functions that can be called by a user's code to pass information between processors.
- The MPI library consists of over 200 functions; in general only a small subset of these are used in any code.
- MPI can be used with Fortran, C and C++.
- MPI can be used on a single processor system to emulate a parallel system, which is useful for developing and testing code.
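A minimal C sketch using a handful of the core calls (rank 0 collects one value from every other process); build and launch commands, e.g. mpicc and mpirun, vary by MPI implementation:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start up MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        if (rank == 0) {
            int msg;
            MPI_Status status;
            /* Collect one integer from every other process. */
            for (int src = 1; src < size; src++) {
                MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
                printf("received %d from process %d\n", msg, src);
            }
        } else {
            int msg = rank * rank;
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }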
35  MPI v OpenMP
- MPI scales better than OpenMP; the very largest supercomputers use distributed memory.
- MPI is more difficult to use than OpenMP.
- MPI is more portable than OpenMP: it can run on both distributed memory and shared memory machines.
36  MPI Books
- Parallel Programming with MPI by Peter Pacheco
- Using MPI: Portable Parallel Programming with the Message-Passing Interface by William Gropp et al.
- Using MPI-2: Advanced Features of the Message-Passing Interface by William Gropp et al.
- MPI: The Complete Reference, The MPI Core by Marc Snir et al.
- MPI: The Complete Reference, The MPI-2 Extensions by William Gropp et al.
37  MPI Web References
- http://www.rs6000.ibm.com/support/Education/sp_tutor/mpi.html
- http://www-unix.mcs.anl.gov/mpi/mpich/index.html
- http://www.lam-mpi.org/
- http://www.mpi-forum.org/
- http://fawlty.cs.usfca.edu/mpi/
38  ACTS - Advanced Computational Testing and Simulation
- A set of software tools for High Performance Computing.
- Tools are developed by the Department of Energy.
- Examples include ScaLAPACK, PETSc, TAO, etc.
- http://www.nersc.gov/ACTS/index.html
39  Hardware Resources Available to UVa Researchers
- IBM SP2: 24 Power2SC nodes with 512MB of memory, connected with a high speed interconnect.
- http://www.itc.virginia.edu/research/sp2.html
- Unixlab Cluster: 54 SGI and Sun workstations.
- http://www.people.virginia.edu/userv/usage.html
40  Centurion
- Centurion is a large Linux cluster run by the CS department for the Legion Project.
- Consists of 128 dual-processor Pentium II 400MHz nodes and 128 533MHz Alpha nodes.
- Each node has 256MB of memory.
- 64 of the Alpha nodes have a high speed Myrinet interconnect.
- http://legion.virginia.edu/centurion/Centurion.html
41  National Supercomputer Centers
- NPACI - National Partnership for Advanced Computational Infrastructure.
- NCSA - National Center for Supercomputing Applications.
- US researchers are eligible to apply for time on NCSA/NPACI machines.
- If the application is successful, the time is free.
- Small allocations of 10,000 hours are relatively easy to get.
42  Some Sample Machines
- Blue Horizon: 1152 IBM Power3 CPUs configured in 144 nodes with 8 CPUs and 4GB of memory per node, based in San Diego.
- http://www.npaci.edu/BlueHorizon/
- TeraScale Computing System: 2728 1GHz Alpha CPUs configured in 682 nodes.
- http://www.psc.edu/machines/tcs/
- Distributed Terascale Facility: a 13.6 Teraflop Linux cluster.
- http://www.npaci.edu/teragrid/index.html
43- NPACI
- http//www.npaci.edu/
- NCSA
- http//www.ncsa.edu/
- NERSC National Energy Research Scientific
Computing Center - (DoE funded research only) - http//www.nersc.gov/