High Performance Computing - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
High Performance Computing
  • By Mark Mc Keown
  • ITC Research Computing Support Group
  • res-consult@virginia.edu

2
High Performance Computing
  • Compiler Options for Producing the Fastest
    Executable.
  • Code Optimization.
  • Profiling and Timing.
  • Auto-Parallelization.
  • OpenMP.
  • MPI.
  • Hardware Resources Available to UVa Researchers.

3
Compiler Options for producing the Fastest
Executable
  • Using optimization flags when compiling can
    greatly reduce the runtime of an executable.
  • Each compiler has a different set of options for
    creating the fastest executable.
  • Often the best compiler options can only be
    arrived at by empirical testing and timing of
    your code.
  • A good reference for compiler flags that can be
    used with various architectures is the SPEC web
    site www.spec.org.
  • Read the Compiler manpages.

4
Example of Compiler Flags used on a Sun Ultra 10
workstation
  • Compiler: SUNWpro 4.2
  • Flags: none
  • Runtime: 23 min 22.4 sec
  • Compiler: SUNWpro 5.0
  • Flags: none
  • Runtime: 14 min 21.0 sec

5
  • Compiler: SUNWpro 5.0
  • Flags: -O3
  • Runtime: 2 min 24.4 sec
  • Compiler: SUNWpro 5.0
  • Flags: -fast
  • Runtime: 2 min 06.7 sec
  • Compiler: SUNWpro 5.0
  • Flags: -fast -xcrossfile
  • Runtime: 1 min 59.6 sec

6
  • Compiler: SUNWpro 5.0
  • Flags: -fast -xcrossfile -xprofile
  • Runtime: 1 min 57.3 sec

7
Useful Compiler Options
  • IBM AIX for SP2: -O3 -qstrict -qtune=p2sc -qhot
    -qarch=p2sc -qipa
  • SGI: -Ofast
  • Sun: -fast -xcrossfile
  • GNU: -O3 -ffast-math -funroll-loops
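  • For example (a hedged illustration - test.f and
    test.x are placeholder names), the Sun flags
    above would appear on a compile line such as:
  • f77 -fast -xcrossfile -o test.x test.f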

8
Code Optimization
  • Strength Reduction
  • A*X**2.0 becomes A*X*X
  • A*X/2.0 becomes A*0.5*X

9
  • A * ( 2.0*exp(T) + exp(2*T) )
  • tmp = exp(T)
  • A * ( 2.0*tmp + tmp*tmp )
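  • A minimal Fortran sketch of the two rewrites
    above; the program name and the values of A, X
    and T are placeholders chosen for illustration:
      PROGRAM STRRED
      REAL A, X, T, TMP, Y1, Y2
      A = 2.0
      X = 3.0
      T = 0.5
C     Strength reduction: A*X**2.0 becomes a multiply
      Y1 = A*X*X
C     Common subexpression elimination:
C     A*(2.0*EXP(T) + EXP(2*T)) computes EXP(T) once
      TMP = EXP(T)
      Y2 = A*(2.0*TMP + TMP*TMP)
      WRITE(*,*) Y1, Y2
      END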

10
Loop Ordering
  • Fortran
  • DO J=1,N
  • DO I=1,N
  • A(I,J) = B(I,J) + C(I,J)*D
  • ENDDO
  • ENDDO
  • C/C++
  • for(I=0; I<n; I++)
  • for(J=0; J<n; J++)
  • a[I][J] = a[I][J] + C[I][J]*D;

11
Other Optimizations
  • Copy Propagation
  • Constant Folding
  • Dead Code Removal
  • Induction Variable Simplification
  • Function Inlining
  • Loop Invariant Conditionals
  • Variable Renaming

12
  • Loop Invariant Code Motion (see the sketch after
    this list)
  • Loop Fusion
  • Pushing Loops inside Subroutines
  • Loop Index Dependent Conditionals
  • Loop Unrolling
  • Loop Stride Size
  • Floating Point Optimizations
  • Faster Algorithms
  • External Libraries
  • Assembly Code
  • Lookup Tables
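  • A minimal Fortran sketch of one item from the
    list above, loop invariant code motion; the
    arrays, sizes and constants are placeholders:
      PROGRAM LICM
      INTEGER N, I
      PARAMETER (N = 1000)
      REAL A(N), B(N), X, C
      X = 4.0
      DO I = 1, N
         B(I) = REAL(I)
      ENDDO
C     Before the transformation the invariant X/3.0
C     is recomputed on every iteration:
C        DO I = 1, N
C           A(I) = B(I) + X/3.0
C        ENDDO
C     After loop invariant code motion it is
C     computed once, outside the loop:
      C = X/3.0
      DO I = 1, N
         A(I) = B(I) + C
      ENDDO
      WRITE(*,*) A(N)
      END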

13
Code Optimization References
  • http://www.itc.virginia.edu/research/Optim.html
  • http://www.cs.utk.edu/mucci/MPPopt.html
  • http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg245155.pdf
  • http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg245611.pdf
  • http://www.npaci.edu/T3E/single_pe.html
  • http://www.epcc.ed.ac.uk/epcc-tec/documents/coursemat.html
  • Software Optimization for High Performance
    Computing by Crawford and Wadleigh
  • High Performance Computing by Kevin Dowd et al

14
Timing and Profiling Codes
  • "Premature optimization is the root of all evil"
    - Donald Knuth
  • The 80-20 rule: codes generally spend 80% of
    their time executing 20% of their instructions

15
time command
  • Useful for measuring how long a code runs; it
    reports both user and system time.
  • /usr/bin/time test.x
  • real 111.7
  • user 109.7
  • sys 0.5

16
etime function
  • Can be used to time sections of code
  • real*4 tarray(2), etime
  • real*4 start, finish
  • start = etime(tarray)
  • ..
  • ..
  • finish = etime(tarray)
  • write(*,*) finish-start
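  • A complete, compilable version of the fragment
    above, assuming a compiler that supplies the
    etime extension (e.g. Sun f77); the work loop is
    a placeholder:
      PROGRAM TIMER
      REAL*4 TARRAY(2), ETIME
      REAL*4 START, FINISH, S
      INTEGER I
      START = ETIME(TARRAY)
C     Placeholder work to be timed
      S = 0.0
      DO I = 1, 10000000
         S = S + SQRT(REAL(I))
      ENDDO
      FINISH = ETIME(TARRAY)
C     TARRAY(1) holds user time, TARRAY(2) holds
C     system time; the function value is their sum
      WRITE(*,*) 'Elapsed CPU seconds:', FINISH - START, S
      END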

17
gprof
  • Provides a very detailed breakdown of how much
    time is spent in each function of a code
  • Compile with -pg
  • f77 -pg -O -o test.x test.f
  • Execute the code in the normal manner
  • ./test.x
  • Create the profile with gprof
  • gprof test.x > test.prof

18
Hardware Counters
  • Some CPUs have special registers that allow them
    to count certain events
  • Events can include FLOPS, cache misses, floating
    point exceptions.
  • Can provide a very detailed picture of what is
    happening in a region of the code.
  • Vendor tools can provide easy access to hardware
    counters; third-party tools can be used across a
    range of systems. Examples include PAPI and PCL.

19
Vendor Tools for Profiling and Timing
  • Most vendors provide specialized tools for
    profiling and timing codes, usually with a
    simple-to-use GUI.
  • Sun Forte Workshop
  • SGI SpeedShop

20
Timing and Profiling References
  • High Performance Computing by Dowd et al
  • Unix for Fortran Programmers by Loukides
  • http://www.kfa-juelich.de/zam/PCL
  • http://icl.cs.utk.edu/projects/papi
  • http://www.sun.com/forte/
  • http://www.sgi.com/developers/devtools/tools/prodev.html
  • http://www.itc.virginia.edu/research/profile.html
  • http://www.delorie.com/gnu/docs/binutils/gprof_toc.html

21
Single Processor System
22
Shared Memory System
23
Distributed Memory System
24
NUMA - Non-Uniform Memory Architecture
25
Auto-Parallelization
  • Can be used to program shared memory machines
  • Very simple to use - only requires a compiler
    flag.
  • Safe - the compiler guarantees the code will
    execute safely across the processors.

26
A Parallelized Loop
  • DO I=1,100
  • A(I) = B(I) + C(I)
  • ENDDO
  • CPU 1
  • DO I=1,50
  • A(I) = B(I) + C(I)
  • ENDDO
  • CPU 2
  • DO I=51,100
  • A(I) = B(I) + C(I)
  • ENDDO

27
Compiler Flags for Auto-parallelization
  • Sun
  • f77 -fast -autopar -loopinfo -o code.x code.f
  • SGI
  • f77 -Ofast -IPA -apolist -o code.x code.f
  • IBM
  • xlf_r -O3 -qsmp=auto -qreport=smplist -qsource
    -o code.x code.f

28
Where Auto-Parallelization Fails
  • Not enough work in the loop
  • I/O statements in the loop
  • Early exits in the loop (while loops)
  • Scalar Dependence
  • Reductions
  • Subroutine Calls
  • Array Dependence
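  • Two minimal Fortran sketches of loops from the
    list above that typically defeat
    auto-parallelization - an array dependence and a
    reduction; the arrays and sizes are placeholders:
      PROGRAM NOAUTO
      INTEGER N, I
      PARAMETER (N = 100)
      REAL A(N), S
      DO I = 1, N
         A(I) = REAL(I)
      ENDDO
C     Array dependence: each A(I) uses A(I-1) from
C     the previous iteration, so the iterations
C     cannot simply be divided among processors
      DO I = 2, N
         A(I) = A(I-1) + 1.0
      ENDDO
C     Reduction: every iteration updates the same
C     scalar S
      S = 0.0
      DO I = 1, N
         S = S + A(I)
      ENDDO
      WRITE(*,*) S
      END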

29
OpenMP
  • Can be used on SMP machines only.
  • OpenMP solves a lot of the problems of
    auto-parallelization.
  • Simple to use - simply add pragmas or directives
    to define which loops to parallelize.
  • Supported by most vendors.

30
OpenMP Example
  • !$OMP PARALLEL DO
  • DO I=2,N
  • B(I) = A(I) + C(I)/2.0
  • ENDDO
  • !$OMP END PARALLEL DO
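  • A slightly larger hedged sketch, assuming an
    OpenMP-aware Fortran compiler; the REDUCTION
    clause handles a sum that auto-parallelization
    often cannot:
      PROGRAM OMPSUM
      INTEGER N, I
      PARAMETER (N = 1000)
      REAL A(N), S
      DO I = 1, N
         A(I) = REAL(I)
      ENDDO
      S = 0.0
C     The REDUCTION clause gives each thread a
C     private copy of S and combines the copies at
C     the end of the loop
!$OMP PARALLEL DO REDUCTION(+:S)
      DO I = 1, N
         S = S + A(I)
      ENDDO
!$OMP END PARALLEL DO
      WRITE(*,*) 'Sum =', S
      END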

31
Compiler Flags for OpenMP
  • Sun: the flag is -explicitpar
  • IBM: the flag is -qsmp=omp
  • SGI: the flag is -MP:open_mp=ON
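  • For example (hedged - code.f and code.x are
    placeholder names, as on the auto-parallelization
    slide), the IBM compile line would be:
  • xlf_r -O3 -qsmp=omp -o code.x code.f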

32
References for Auto-Parallelization and OpenMP
  • www.openmp.org
  • http://techpubs.sgi.com/library/
  • www.sgi.com/software/openmp
  • www.redbooks.ibm.com/redbooks/SG245611.html
  • www.npaci.edu/online/v3.16/SCAN.html
  • High Performance Computing by K. Dowd et al
  • Parallel Programming in OpenMP by Chandra et al

33
Message Passing Interface
  • MPI can be used to program a distributed memory
    system such as a Linux cluster or the IBM SP2 as
    well as SMP machines.
  • MPI is superseding PVM.
  • MPI is an industry standard supported by most
    vendors.
  • MPI implementations run on most Unix systems and
    on Windows NT/2000.

34
MPI
  • MPI is a library of functions that can be called
    by a user's code to pass information between
    processors.
  • The MPI library consists of over 200 functions;
    in general only a small subset of these are used
    in any code.
  • MPI can be used with Fortran, C and C++.
  • MPI can be used on a single processor system to
    emulate a parallel system - useful for developing
    and testing code.
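  • A minimal hedged sketch of an MPI Fortran
    program using four of the most common calls,
    assuming an MPI implementation (such as MPICH or
    LAM, listed in the references) that provides
    mpif.h:
      PROGRAM HELLO
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, NPROCS
C     Set up MPI, then ask for this process's rank
C     and the total number of processes in the job
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
      WRITE(*,*) 'Hello from process', RANK, 'of', NPROCS
      CALL MPI_FINALIZE(IERR)
      END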

35
MPI v OpenMP
  • MPI scales better than OpenMP - the very largest
    supercomputers use distributed memory.
  • MPI is more difficult to use than OpenMP.
  • MPI is more portable than OpenMP; it can run on
    both distributed memory and shared memory
    machines.

36
MPI Books
  • Parallel Programming with MPI by Peter Pacheco
  • Using MPI: Portable Parallel Programming with
    the Message-Passing Interface by William Gropp
    et al
  • Using MPI-2: Advanced Features of the
    Message-Passing Interface by William Gropp et al
  • MPI - The Complete Reference: The MPI Core by
    Marc Snir
  • MPI - The Complete Reference: The MPI-2
    Extensions by William Gropp et al

37
MPI Web-References
  • http://www.rs6000.ibm.com/support/Education/sp_tutor/mpi.html
  • http://www-unix.mcs.anl.gov/mpi/mpich/index.html
  • http://www.lam-mpi.org/
  • http://www.mpi-forum.org/
  • http://fawlty.cs.usfca.edu/mpi/

38
ACTS - Advanced Computational Testing and
Simulation
  • A set of software tools for High Performance
    Computing.
  • The tools are developed by the Department of
    Energy.
  • Examples include ScaLAPACK, PETSc, and TAO.
  • http://www.nersc.gov/ACTS/index.html

39
Hardware Resources Available to UVa Researchers
  • IBM SP2 - 24 Power2SC nodes with 512MB of memory,
    connected with a high-speed interconnect.
  • http://www.itc.virginia.edu/research/sp2.html
  • Unixlab Cluster - 54 SGI and Sun workstations.
  • http://www.people.virginia.edu/userv/usage.html

40
Centurion
  • Centurion is a large Linux cluster run by the CS
    department for the Legion Project.
  • It consists of 128 dual-processor 400MHz Pentium
    II nodes and 128 533MHz Alpha nodes.
  • Each node has 256MB of memory.
  • 64 of the Alpha nodes have a high-speed Myrinet
    interconnect.
  • http://legion.virginia.edu/centurion/Centurion.html

41
National SuperComputer Centers
  • NPACI - National Partnership for Advanced
    Computational Infrastructure.
  • NCSA - National Center for Supercomputing
    Applications
  • US researchers are eligible to apply for time on
    NCSA/NPACI machines.
  • If the application is successful, the time is
    free.
  • Small allocations of 10,000 hours are relatively
    easy to get.

42
Some Sample Machines.
  • Blue Horizon - 1152 IBM Power3 CPUs configured in
    144 nodes with 8 CPUs and 4GB per node, based at
    San Diego.
  • http://www.npaci.edu/BlueHorizon/
  • TeraScale Computing System - 2728 1GHz Alpha CPUs
    configured in 682 nodes.
  • http://www.psc.edu/machines/tcs/
  • Distributed Terascale Facility - a 13.6 teraflop
    Linux cluster.
  • http://www.npaci.edu/teragrid/index.html

43
  • NPACI
  • http://www.npaci.edu/
  • NCSA
  • http://www.ncsa.edu/
  • NERSC - National Energy Research Scientific
    Computing Center (DoE-funded research only)
  • http://www.nersc.gov/