Title: High Performance Computing
1  High Performance Computing
- By Mark Mc Keown
- ITC Research Computing Support Group
- res-consult@virginia.edu
2  High Performance Computing
- Compiler Options for Producing the Fastest Executable.
- Code Optimization.
- Profiling and Timing.
- Auto-Parallelization.
- OpenMP.
- MPI.
- Hardware Resources Available to UVa Researchers.
3  Compiler Options for Producing the Fastest Executable
- Using optimization flags when compiling can greatly reduce the runtime of an executable.
- Each compiler has a different set of options for creating the fastest executable.
- Often the best compiler options can only be arrived at by empirical testing and timing of your code.
- A good reference for compiler flags that can be used with various architectures is the SPEC web site, www.spec.org.
- Read the compiler manpages.
4  Example of Compiler Flags Used on a Sun Ultra 10 Workstation

  Compiler      Flags                          Runtime
  SUNWpro 4.2   none                           23 min 22.4 s
  SUNWpro 5.0   none                           14 min 21.0 s
  SUNWpro 5.0   -O3                             2 min 24.4 s
  SUNWpro 5.0   -fast                           2 min 06.7 s
  SUNWpro 5.0   -fast -xcrossfile               1 min 59.6 s
  SUNWpro 5.0   -fast -xcrossfile -xprofile     1 min 57.3 s
7  Useful Compiler Options
- IBM AIX for SP2: -O3 -qstrict -qtune=p2sc -qhot -qarch=p2sc -qipa
- SGI: -Ofast
- Sun: -fast -xcrossfile
- GNU: -O3 -ffast-math -funroll-loops
8  Code Optimization
- Strength Reduction
- A*X**2.0 becomes A*X*X
- A*X/2.0 becomes A*0.5*X
- A * ( 2.0*exp(T) + exp(2*T) ) becomes
-   tmp = exp(T)
-   A * ( 2.0*tmp + tmp*tmp )
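A minimal C sketch of these transformations (an addition to the slide; names and values are illustrative). Since exp(2*T) equals exp(T)*exp(T), one exp call can serve for both terms:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double A = 3.0, X = 1.5, T = 0.7;

        /* Before: two exp() calls and a general power function. */
        double slow = A * (2.0 * exp(T) + exp(2.0 * T)) + pow(X, 2.0);

        /* After strength reduction: exp(T) is computed once and reused,
           and X**2.0 becomes a single multiplication. */
        double tmp  = exp(T);
        double fast = A * (2.0 * tmp + tmp * tmp) + X * X;

        printf("%f %f\n", slow, fast);   /* the two results agree */
        return 0;
    }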
10  Loop Ordering
- Fortran stores arrays column-major, so the inner loop should run over the first index:
-   DO J = 1, N
-     DO I = 1, N
-       A(I,J) = B(I,J) + C(I,J)*D
-     ENDDO
-   ENDDO
- C/C++ stores arrays row-major, so the inner loop should run over the last index:
-   for (I = 0; I < n; I++)
-     for (J = 0; J < n; J++)
-       a[I][J] = a[I][J] + C[I][J]*D;
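A small C benchmark sketch (an addition, with illustrative array sizes) that times both orderings; compiled without aggressive optimization, the row-major order is typically much faster because it walks memory contiguously:

    #include <stdio.h>
    #include <time.h>

    #define N 2000
    static double a[N][N], b[N][N], c[N][N];   /* ~96MB, statically allocated */

    int main(void)
    {
        double d = 1.5;
        clock_t t0, t1;

        /* Good for C: inner loop over the last index is stride-1 in memory. */
        t0 = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = b[i][j] + c[i][j] * d;
        t1 = clock();
        printf("row-major order:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        /* Bad for C (but right for Fortran): strides through memory by N elements. */
        t0 = clock();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = b[i][j] + c[i][j] * d;
        t1 = clock();
        printf("column-major order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        return 0;
    }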
11  Other Optimizations
- Copy Propagation
- Constant Folding
- Dead Code Removal
- Induction Variable Simplification
- Function Inlining
- Loop Invariant Conditionals
- Variable Renaming
- Loop Invariant Code Motion (see the sketch after this list)
- Loop Fusion
- Pushing Loops inside Subroutines
- Loop Index Dependent Conditionals
- Loop Unrolling
- Loop Stride Size
- Floating Point Optimizations
- Faster Algorithms
- External Libraries
- Assembly Code
- Lookup Tables
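A C sketch (an addition, with illustrative names) of two items from this list, loop-invariant code motion and loop unrolling:

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double a[N], b[N], x = 2.0, y = 3.0;
        for (int i = 0; i < N; i++) b[i] = i;

        /* Before: x*y is loop-invariant but recomputed every iteration. */
        for (int i = 0; i < N; i++)
            a[i] = b[i] * x * y;

        /* After loop-invariant code motion: hoist the constant product. */
        double xy = x * y;
        for (int i = 0; i < N; i++)
            a[i] = b[i] * xy;

        /* Manual 4-way loop unrolling (N is a multiple of 4 here): fewer
           loop-termination tests and more independent work per iteration. */
        for (int i = 0; i < N; i += 4) {
            a[i]     = b[i]     * xy;
            a[i + 1] = b[i + 1] * xy;
            a[i + 2] = b[i + 2] * xy;
            a[i + 3] = b[i + 3] * xy;
        }

        printf("%f\n", a[N - 1]);
        return 0;
    }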
13  Code Optimization References
- http://www.itc.virginia.edu/research/Optim.html
- http://www.cs.utk.edu/mucci/MPPopt.html
- http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg245155.pdf
- http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg245611.pdf
- http://www.npaci.edu/T3E/single_pe.html
- http://www.epcc.ed.ac.uk/epcc-tec/documents/coursemat.html
- Software Optimization for High Performance Computing by Crawford and Wadleigh
- High Performance Computing by Kevin Dowd et al.
14  Timing and Profiling Codes
- "Premature optimization is the root of all evil" - Donald Knuth
- The 80-20 rule: codes generally spend 80% of their time executing 20% of their instructions.
15  time command
- Useful for measuring how long a code runs; provides both user and system time.
- /usr/bin/time test.x
-   real 111.7
-   user 109.7
-   sys 0.5
16  etime function
- Can be used to time sections of Fortran code:
-   real*4 tarray(2), etime
-   real*4 start, finish
-   start = etime(tarray)
-   ...
-   finish = etime(tarray)
-   write(*,*) finish - start
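For C codes an analogous approach (a sketch, not from the slides) is to wrap the section of interest in timer calls; note that etime reports elapsed CPU time, while gettimeofday below reports wall-clock time:

    #include <stdio.h>
    #include <sys/time.h>

    /* Return wall-clock seconds, used like the etime deltas in Fortran. */
    static double wall_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        double start = wall_seconds();

        /* ... section of code to be timed ... */
        double s = 0.0;
        for (long i = 1; i <= 10000000L; i++)
            s += 1.0 / (double)i;

        double finish = wall_seconds();
        printf("sum = %f, elapsed = %f s\n", s, finish - start);
        return 0;
    }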
17  gprof
- Provides a very detailed breakdown of how much time is spent in each function of a code.
- Compile with -pg:
-   f77 -pg -O -o test.x test.f
- Execute the code in the normal manner:
-   ./test.x
- Create the profile with gprof:
-   gprof test.x > test.prof
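A toy C program (hypothetical, for illustration) whose flat profile should be dominated by a single routine; profile it the same way, e.g. cc -pg -O -o test.x test.c, then ./test.x, then gprof test.x > test.prof:

    #include <stdio.h>

    /* Deliberately expensive function: gprof's flat profile should show
       nearly all of the runtime here. */
    static double hot(int n)
    {
        double s = 0.0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                s += 1.0 / ((double)i * (double)j);
        return s;
    }

    static double cold(void)
    {
        return 42.0;   /* trivial: should barely register in the profile */
    }

    int main(void)
    {
        printf("%f %f\n", hot(5000), cold());
        return 0;
    }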
18  Hardware Counters
- Some CPUs have special registers that allow them to count certain events.
- Events can include FLOPS, cache misses, and floating point exceptions.
- Can provide a very detailed picture of what is happening in a region of the code.
- Vendor tools can provide easy access to hardware counters; third-party tools such as PAPI and PCL can be used across a range of systems.
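A minimal sketch using PAPI's classic high-level counter API (assumed here; the exact calls and the availability of the PAPI_FP_OPS preset depend on the platform and PAPI version):

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int events[1] = { PAPI_FP_OPS };   /* count floating point operations */
        long long counts[1];
        double a = 0.0;

        /* Start counting the requested event on this CPU. */
        if (PAPI_start_counters(events, 1) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return 1;
        }

        for (int i = 0; i < 1000000; i++)
            a += 1.0e-6;

        /* Stop counting and read back the result (counts are approximate). */
        if (PAPI_stop_counters(counts, 1) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return 1;
        }

        printf("a = %f, FLOPs counted = %lld\n", a, counts[0]);
        return 0;
    }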
19  Vendor Tools for Profiling and Timing
- Most vendors provide specialized tools for profiling and timing codes, usually with a simple-to-use GUI.
- Sun: Forte Workshop
- SGI: SpeedShop
20  Timing and Profiling References
- High Performance Computing by Dowd et al.
- Unix for Fortran Programmers by Loukides
- http://www.kfa-juelich.de/zam/PCL
- http://icl.cs.utk.edu/projects/papi
- http://www.sun.com/forte/
- http://www.sgi.com/developers/devtools/tools/prodev.html
- http://www.itc.virginia.edu/research/profile.html
- http://www.deloire.com/gnu/docs/binutils/grprof_toc.html
21  Single Processor System
22  Shared Memory System
23  Distributed Memory System
24  NUMA - Non-Uniform Memory Architecture
25  Auto-Parallelization
- Can be used to program shared memory machines.
- Very simple to use: only requires a compiler flag.
- Safe: the compiler guarantees the code will execute safely across the processors.
26  A Parallelized Loop
- DO I = 1, 100
-   A(I) = B(I) + C(I)
- ENDDO
- CPU 1:
- DO I = 1, 50
-   A(I) = B(I) + C(I)
- ENDDO
- CPU 2:
- DO I = 51, 100
-   A(I) = B(I) + C(I)
- ENDDO
27  Compiler Flags for Auto-parallelization
- Sun:
-   f77 -fast -autopar -loopinfo -o code.x code.f
- SGI:
-   f77 -Ofast -IPA -apo list -o code.x code.f
- IBM:
-   xlf_r -O3 -qsmp=auto -qreport=smplist -qsource -o code.x code.f
28  Where Auto-Parallelization Fails
- Not enough work in the loop
- I/O statements in the loop
- Early exits from the loop (while loops)
- Scalar Dependence
- Reductions
- Subroutine Calls
- Array Dependence (dependence and reduction loops are sketched below)
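Two illustrative C loops (an addition, not from the slides) that an auto-parallelizing compiler must leave serial:

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Array dependence: iteration i reads a[i-1], which iteration i-1
           writes, so the iterations cannot safely run in any other order. */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];

        /* Reduction: every iteration updates the same scalar; without
           special handling this serializes the loop too. */
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("%f\n", sum);
        return 0;
    }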
29  OpenMP
- Can be used on SMP machines only.
- OpenMP solves a lot of the problems of auto-parallelization.
- Simple to use: simply add pragmas or directives to define which loops to parallelize.
- Supported by most vendors.
30  OpenMP Example
- !$OMP PARALLEL DO
- DO I = 2, N
-   B(I) = (A(I) + C(I)) / 2.0
- ENDDO
- !$OMP END PARALLEL DO
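The C/C++ equivalent (a sketch; the loop body mirrors the Fortran example above, and the enabling compiler flag varies by compiler):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; c[i] = 2.0 * i; }

        /* The pragma asks the compiler to split the loop iterations
           across the available threads. */
        #pragma omp parallel for
        for (int i = 1; i < N; i++)
            b[i] = (a[i] + c[i]) / 2.0;

        printf("b[N-1] = %f, threads available = %d\n",
               b[N - 1], omp_get_max_threads());
        return 0;
    }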
31  Compiler Flags for OpenMP
- Sun: flag is -explicitpar
- IBM: flag is -qsmp=omp
- SGI: flag is -MP:open_mp=ON
32  References for Auto-Parallelization and OpenMP
- www.openmp.org
- http://techpubs.sgi.com/library/
- www.sgi.com/software/openmp
- www.redbooks.ibm.com/redbooks/SG245611.html
- www.npaci.edu/online/v3.16/SCAN.html
- High Performance Computing by K. Dowd et al.
- Parallel Programming in OpenMP by Chandra et al.
33  Message Passing Interface
- MPI can be used to program a distributed memory system, such as a Linux cluster or the IBM SP2, as well as SMP machines.
- MPI is superseding PVM.
- MPI is an industry standard supported by most vendors.
- MPI versions run on most Unix and Windows NT/2000 systems.
34  MPI
- MPI is a library of functions that can be called by a user's code to pass information between processors.
- The MPI library consists of over 200 functions; in general only a small subset of these are used in any code.
- MPI can be used with Fortran, C and C++.
- MPI can be used on a single processor system to emulate a parallel system, which is useful for developing and testing code.
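A minimal C sketch using a handful of the core calls (rank 0 collects one value from every other process); build and launch commands, e.g. mpicc and mpirun, vary by MPI implementation:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start up MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        if (rank == 0) {
            int msg;
            MPI_Status status;
            /* Collect one integer from every other process. */
            for (int src = 1; src < size; src++) {
                MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
                printf("received %d from process %d\n", msg, src);
            }
        } else {
            int msg = rank * rank;
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }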
35  MPI v OpenMP
- MPI scales better than OpenMP; the very largest supercomputers use distributed memory.
- MPI is more difficult to use than OpenMP.
- MPI is more portable than OpenMP: it can run on both distributed memory and shared memory machines.
36  MPI Books
- Parallel Programming with MPI by Peter Pacheco
- Using MPI: Portable Parallel Programming with the Message-Passing Interface by William Gropp et al.
- Using MPI-2: Advanced Features of the Message-Passing Interface by William Gropp et al.
- MPI: The Complete Reference, The MPI Core by Marc Snir et al.
- MPI: The Complete Reference, The MPI-2 Extensions by William Gropp et al.
37  MPI Web References
- http://www.rs6000.ibm.com/support/Education/sp_tutor/mpi.html
- http://www-unix.mcs.anl.gov/mpi/mpich/index.html
- http://www.lam-mpi.org/
- http://www.mpi-forum.org/
- http://fawlty.cs.usfca.edu/mpi/
38  ACTS - Advanced Computational Testing and Simulation
- A set of software tools for High Performance Computing.
- Tools are developed by the Department of Energy.
- Examples include ScaLAPACK, PETSc, TAO, etc.
- http://www.nersc.gov/ACTS/index.html
39  Hardware Resources Available to UVa Researchers
- IBM SP2: 24 Power2SC nodes with 512MB of memory, connected with a high speed interconnect.
- http://www.itc.virginia.edu/research/sp2.html
- Unixlab Cluster: 54 SGI and Sun workstations.
- http://www.people.virginia.edu/userv/usage.html
40  Centurion
- Centurion is a large Linux cluster run by the CS department for the Legion Project.
- Consists of 128 dual-processor Pentium II 400MHz nodes and 128 533MHz Alpha nodes.
- Each node has 256MB of memory.
- 64 of the Alpha nodes have a high speed Myrinet interconnect.
- http://legion.virginia.edu/centurion/Centurion.html
41  National Supercomputer Centers
- NPACI - National Partnership for Advanced Computational Infrastructure.
- NCSA - National Center for Supercomputing Applications.
- US researchers are eligible to apply for time on NCSA/NPACI machines.
- If the application is successful, the time is free.
- Small allocations of 10,000 hours are relatively easy to get.
42  Some Sample Machines
- Blue Horizon: 1152 IBM Power3 CPUs configured in 144 nodes with 8 CPUs and 4GB of memory per node, based in San Diego.
- http://www.npaci.edu/BlueHorizon/
- TeraScale Computing System: 2728 1GHz Alpha CPUs configured in 682 nodes.
- http://www.psc.edu/machines/tcs/
- Distributed Terascale Facility: a 13.6 Teraflop Linux cluster.
- http://www.npaci.edu/teragrid/index.html
43- NPACI
- http//www.npaci.edu/
- NCSA
- http//www.ncsa.edu/
- NERSC National Energy Research Scientific
Computing Center - (DoE funded research only) - http//www.nersc.gov/