Title: Designing and Building Parallel Programs
1 Designing and Building Parallel Programs
A Tutorial Presented by the Department of Defense
HPC Modernization Programming Environment Training Program
- (c) 1995, 1996, 1997, 1998
- Ian Foster, Gina Goff
- Ehtesham Hayder, Charles Koelbel
2 Outline
- Day 1
- Introduction to Parallel Programming
- The OpenMP Programming Language
- Day 2
- Introduction to MPI
- The PETSc Library
- Individual consulting in afternoons
3 Outline
- Day 1
- Introduction to Parallel Programming
- Parallel Computers and Algorithms
- Understanding Performance
- Parallel Programming Tools
- The OpenMP Programming Language
- Day 2
- Introduction to MPI
- The PETSc Library
4 Why Parallel Computing?
- Continuing demands for higher performance
- Physical limits on single processor performance
- High costs of internal concurrency
- Result is rise of multiprocessor architectures
- And the number of processors will continue to increase
- Networking is another contributing factor
- Future software must be concurrent and scalable
5 The Multicomputer: an Idealized Parallel Computer
6 Multicomputer Architecture
- Multicomputer = nodes + network
- Node = processor(s) + local memory
- Access to local memory is cheap
- 10s of cycles; does not involve the network
- Conventional memory reference
- Access to remote memory is expensive
- 100s or 1000s of cycles; uses the network
- Use I/O-like mechanisms (e.g., send/receive)
7 Multicomputer Cost Model
- Cost of remote memory access/communication (including synchronization):
- T = ts + tw·N
- ts = per-message cost (latency)
- tw = per-word cost
- N = message size in words
- Hence locality is an important property of good parallel algorithms (see the worked sketch below)
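To make the cost model concrete, here is a minimal sketch in C; the function name message_time is ours, and the sample numbers simply illustrate the formula (measured ts and tw values for several machines appear in the communication-cost table later in the tutorial).

      /* Multicomputer cost model: time to send an N-word message,
         where ts is the per-message cost and tw the per-word cost. */
      double message_time(double ts, double tw, double n) {
          return ts + tw * n;
      }

      /* With ts = 40 us and tw = 0.11 us/word, a single 1000-word message
         costs about 40 + 0.11*1000 = 150 us, while ten 100-word messages
         cost 10*(40 + 0.11*100) = 510 us: fewer, larger messages are
         cheaper for the same data volume, which is why locality matters. */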
8 How do Real Parallel Computers Fit the Model?
- Major architecture types
- Distributed memory MIMD
- Shared memory MIMD
- Distributed shared memory (DSM)
- Workstation clusters and metacomputers
- Model fits current architectures pretty well
9 Distributed Memory MIMD Multiprocessor
- Multiple Instruction/Multiple Data
- Processors with local memory connected by a high-speed interconnection network
- Typically high bandwidth, medium latency
- Hardware support for remote memory access
- Model breaks down when topology matters
- Examples: Cray T3E, IBM SP
10 Shared Memory MIMD Multiprocessor
- Processors access shared memory via bus
- Low latency, high bandwidth
- Bus contention limits scalability
- Search for scalability introduces locality
- Cache (a form of local memory)
- Multistage architectures (some memory closer)
- Examples: Cray T90, SGI PCA, Sun
11 Distributed Shared Memory (DSM)
- A hybrid of distributed and shared memory
- Small groups of processors share memory; others access across a scalable network
- Low to moderate latency, high bandwidth
- Model simplifies the multilevel hierarchy
- Examples: SGI Origin, HP Exemplar
12 Workstation Clusters
- Workstations connected by network
- Cost effective
- High latency, low to moderate bandwidth
- Often lack integrated software environment
- Model breaks down if connectivity limited
- Examples: Ethernet, ATM crossbar, Myrinet
13 A Simple Parallel Programming Model
- A parallel computation is a set of tasks
- Each task has local data, and can be connected to other tasks by channels
- A task can
- Compute using local data
- Send to/receive from other tasks
- Create new tasks, or terminate itself
- A receiving task blocks until data available
14 Properties
- Concurrency is enhanced by creating multiple tasks
- Scalability: more tasks than nodes
- Locality: access local data when possible
- A task (with local data and subtasks) is a unit for modular design
- Mapping to nodes affects performance only
15 Parallel Algorithm Design
- Goal: develop an efficient (parallel) solution to a programming problem
- Identify sensible parallel algorithms
- Evaluate performance and complexity
- We present
- A systematic design methodology
- Some basic design techniques
- Illustrative examples
16 A Design Methodology
- Partition
- Define tasks
- Communication
- Identify requirements
- Agglomeration
- Enhance locality
- Mapping
- Place tasks
17 Partitioning
- Goal: identify opportunities for concurrent execution (define tasks = computation + data)
- Focus on the data operated on by the algorithm ...
- Then distribute computation appropriately
- Domain decomposition
- ... or on the operations performed
- Then distribute data appropriately
- Functional decomposition
18 Communication
- Identify communication requirements
- If computation in one task requires data located in another, communication is needed
- Example: finite difference computation
- Must communicate with each neighbor
Xi = (Xi-1 + 2·Xi + Xi+1) / 4
Partition creates one task per point (a serial sketch of the stencil follows below)
(Figure: tasks X1, X2, X3, ... connected to their neighbors by channels)
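As a concrete illustration, a minimal serial sketch of this stencil in C is shown below; in the fine-grained partition each loop iteration would become a separate task holding one point and exchanging boundary values with its two neighbors over channels (the array and function names are ours).

      /* One sweep of the 1-D stencil Xi = (Xi-1 + 2*Xi + Xi+1) / 4
         over the interior points of an n-element array.             */
      void stencil_sweep(const double *x, double *xnew, int n) {
          for (int i = 1; i < n - 1; i++)
              xnew[i] = (x[i-1] + 2.0 * x[i] + x[i+1]) / 4.0;
      }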
19 Agglomeration
- Once tasks and communication are determined, agglomerate small tasks into larger tasks
- Motivations
- To reduce communication costs
- If tasks cannot execute concurrently
- To reduce software engineering costs
- Caveats
- May involve replicating computation or data
20 Mapping
- Place tasks on processors, to
- Maximize concurrency
- Minimize communication
- (Note the potential conflict between these two goals)
- Techniques
- Regular problems: agglomerate to P tasks (see the block-mapping sketch below)
- Irregular problems: use static load balancing
- If irregular in time: dynamic load balancing
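For the regular case, the usual block mapping assigns each processor a contiguous slab of the domain. A minimal sketch in C (the function name block_range is ours) is:

      /* Block mapping: task p of nprocs gets rows lo..hi-1 of an n-row
         domain.  The first (n % nprocs) tasks get one extra row, so the
         load imbalance is at most one row.                             */
      void block_range(int n, int p, int nprocs, int *lo, int *hi) {
          int base = n / nprocs, extra = n % nprocs;
          *lo = p * base + (p < extra ? p : extra);
          *hi = *lo + base + (p < extra ? 1 : 0);
      }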
21 Example: Atmosphere Model
- Simulate atmospheric processes
- Conservation of momentum, mass, energy
- Ideal gas law, hydrostatic approximation
- Represent atmosphere state by 3-D grid
- Periodic in two horizontal dimensions
- Nx × Ny × Nz grid points; e.g., Ny = 50-500, Nx = 2·Ny, Nz = 15-30
- Computation includes
- Atmospheric dynamics: finite differences
- Physics (radiation etc.): in vertical only
22 Atmosphere Model: Numerical Methods
- Discretize the (continuous) domain by a regular Nx × Ny × Nz grid
- Store p, u, v, T, etc. at every grid point
- Approximate derivatives by finite differences
- Leads to stencils in vertical and horizontal
23 Atmosphere Model: Partition
- Use domain decomposition
- Because model operates on large, regular grid
- Can decompose in 1, 2, or 3 dimensions
- 3-D decomposition offers greatest flexibility
24 Atmosphere Model: Communication
- Finite difference stencil horizontally
- Local, regular, structured
- Radiation calculations vertically
- Global, regular, structured
- Diagnostic sums
- Global, regular, structured
25 Atmosphere Model: Agglomeration
- In horizontal
- Clump so that each task has 4 points
- Efficiency: communicate with 4 neighbors only
- In vertical, clump all points in a column
- Performance: avoid communication
- Modularity: reuse physics modules
- Resulting algorithm reasonably scalable
- (Nx·Ny)/4 = at least 1250 tasks
26 Atmosphere Model: Mapping
- Technique depends on load distribution
- 1) Agglomerate to one task per processor
- Appropriate if little load imbalance
- 2) Extend (1) to incorporate cyclic mapping
- Works well for diurnal cycle imbalance
- 3) Use dynamic, local load balancing
- Works well for unpredictable, local imbalances
27 Modeling Performance
- Execution time (sums are over P nodes) is
- T = (Σ Tcomp + Σ Tcomm + Σ Tidle) / P
- Computation time comprises both
- Operations required by sequential algorithm
- Additional work, replicated work
- Idle time due to
- Load imbalance, and/or
- Latency (waiting for remote data)
28 Bandwidth and Latency
- Recall cost model
- T = ts + tw·N
- ts = per-message cost (latency)
- tw = per-word cost (1/tw = bandwidth)
- N = message size in words
- Model works well for many algorithms, and on many computers
29 Measured Costs
30 Typical Communication Costs

  Computer          ts (microseconds)   tw (microseconds/word)
  IBM SP2                    40                 0.11
  Intel Paragon             121                 0.07
  Meiko CS-2                 87                 0.08
  Sparc/Ethernet           1500                 5.0
  Sparc/FDDI               1150                 1.1
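As a worked example of the cost model T = ts + tw·N applied to this table: a 10,000-word message costs roughly 40 + 0.11·10,000 ≈ 1,140 microseconds on the IBM SP2, but about 1,500 + 5.0·10,000 ≈ 51,500 microseconds on Ethernet-connected Sparcs; for very short messages the latency term ts dominates instead.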
31 Example: Finite Difference
- Finite difference computation on an N × N × Z grid
- 9-point stencil
- Similar to atmosphere model earlier
- Decompose along one horizontal dimension
32 Time for Finite Difference
- Identical computations at each grid point
- Tcomp = tc·N²·Z (tc is the compute time per point)
- 1-D decomposition, so each node sends 2NZ data to each of its 2 neighbors (assuming ≥ 2 rows per node)
- Tcomm = P·(2·ts + 4·N·Z·tw)
- No significant idle time if load balanced
- Tidle = 0
- Therefore, T = tc·N²·Z/P + 2·ts + 4·N·Z·tw (coded as a sketch below)
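A minimal sketch of this performance model in C (the function and parameter names are ours) makes it easy to predict run times for candidate values of tc, ts, and tw:

      /* Predicted time for the 1-D decomposition of the N x N x Z
         finite difference on P nodes:
         T = tc*N*N*Z/P + 2*ts + 4*N*Z*tw                           */
      double fd_time_1d(double tc, double ts, double tw,
                        int n, int z, int p) {
          return tc * (double)n * n * z / p
               + 2.0 * ts
               + 4.0 * (double)n * z * tw;
      }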
33 Using Performance Models
- During design
- Use models to study qualitative behavior
- Calibrate models by measuring tc, ts, tw, etc.
- Use calibrated models to evaluate design alternatives
- During implementation
- Compare predictions with observations
- Relate discrepancies to implementation or model
- Use models to guide optimization process
34 Design Alternatives: Finite Difference
- Consider 2-D and 3-D decompositions
- Are they ever a win?
- If so, when?
35 Design Alternatives (2)
- 2-D decomposition: on a √P × √P processor grid, messages of size 2·(N/√P)·Z go to 4 neighbors, so
- T = tc·N²·Z/P + 4·(ts + 2·N·Z·tw/√P)
- Good if ts < tw·N·Z·(2 - 4/√P)
- 3-D decomposition: on a Px × Py × Pz processor grid,
- T = tc·N²·Z/P + 6·ts + 2·N²·tw/(Px·Py) + 4·N·Z·tw/(Px·Pz) + 4·N·Z·tw/(Py·Pz)
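Extending the sketch given for the 1-D model, the 2-D model can be coded the same way and the two compared directly; the crossover condition stated above falls out of that comparison (function names are ours):

      #include <math.h>

      /* Predicted time for the 2-D decomposition on a sqrt(P) x sqrt(P)
         processor grid: T = tc*N*N*Z/P + 4*(ts + 2*N*Z*tw/sqrt(P)).     */
      double fd_time_2d(double tc, double ts, double tw,
                        int n, int z, int p) {
          double sp = sqrt((double)p);
          return tc * (double)n * n * z / p
               + 4.0 * (ts + 2.0 * (double)n * z * tw / sp);
      }

      /* 2-D beats 1-D when 4*ts + 8*N*Z*tw/sqrt(P) < 2*ts + 4*N*Z*tw,
         i.e. when ts < tw*N*Z*(2 - 4/sqrt(P)), as stated above.        */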
36 Finding Model Discrepancies
What we have here is a failure to communicate
37 Impact of Network Topology
- Multicomputer model assumes communication cost is independent of location and of other communications
- Real networks are not fully connected
- Multicomputer model can break down
(Figures: 2-D mesh and Ethernet topologies)
38 Competition for Bandwidth
- In many cases, a bandwidth-constrained model can give sufficiently accurate results
- If S processes must communicate over the same wire at the same time, each gets 1/S of the bandwidth
- Example: finite difference on Ethernet
- All processors share a single Ethernet
- Hence the bandwidth term is scaled by P
- T = tc·N²·Z/P + 2·ts + 4·N·Z·P·tw
39 Bandwidth-Constrained Model Versus Observations
Bandwidth-constrained model gives better fit
40 Tool Survey
- High Performance Fortran (HPF)
- Message Passing Interface (MPI)
- Parallel Computing Forum (PCF) and OpenMP
- Portable, Extensible Toolkit for Scientific Computations (PETSc)
41 High Performance Fortran (HPF)
- A standard data-parallel language
- CM Fortran, C, HPC are related
- Programmer specifies
- Concurrency (concurrent operations on arrays)
- Locality (data distribution)
- Compiler infers
- Mapping of computation (owner-computes rule)
- Communication
42 HPF Example

      PROGRAM hpf_finite_difference
!HPF$ PROCESSORS pr(4)
      REAL x(100,100), new(100,100)
!HPF$ ALIGN new(:,:) WITH x(:,:)
!HPF$ DISTRIBUTE x(BLOCK,*) ONTO pr
      new(2:99,2:99) = (x(1:98,2:99) + x(3:100,2:99) +
     &                  x(2:99,1:98) + x(2:99,3:100)) / 4
      diff = MAXVAL(ABS(new-x))
      END
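In this fragment the array assignment updates every interior point new(2:99,2:99) with the average of its four neighbors in x, and the DISTRIBUTE directive splits the rows of x (and, through ALIGN, of new) in blocks across the four processors in pr; the compiler derives the owner-computes mapping and the boundary communication from these declarations.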
43 HPF Analysis
- Advantages
- High level; preserves sequential semantics
- Standard
- Disadvantages
- Restricted applicability
- Requires sophisticated compiler technology
- Good for regular, SPMD problems
44 Message Passing Interface (MPI)
- A standard message-passing library
- p4, NX, Express, PARMACS are precursors
- An MPI program defines a set of processes
- (usually one process per node)
- ... that communicate by calling MPI functions
- (point-to-point and collective)
- ... and can be constructed in a modular fashion
- (communicators are the key)
45 MPI Example

      main(int argc, char *argv[]) {
          MPI_Comm com = MPI_COMM_WORLD;
          MPI_Init(&argc, &argv);
          MPI_Comm_size(com, &np);
          MPI_Send(&local[1],       1, MPI_FLOAT, lnbr, 10, com);
          MPI_Recv(&local[0],       1, MPI_FLOAT, rnbr, 10, com, &status);
          MPI_Send(&local[lsize],   1, MPI_FLOAT, rnbr, 10, com);
          MPI_Recv(&local[lsize+1], 1, MPI_FLOAT, lnbr, 10, com, &status);
          ldiff = maxerror(local);
          MPI_Allreduce(&ldiff, &diff, 1, MPI_FLOAT, MPI_MAX, com);
          MPI_Finalize();
      }
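This fragment assumes the surrounding declarations are made elsewhere: local is the node's slice of the array with ghost cells at local[0] and local[lsize+1], lnbr and rnbr are the ranks of the left and right neighbors, and maxerror is an application routine. The two send/receive pairs exchange boundary values with the neighbors, and MPI_Allreduce combines the per-node error estimates into a global maximum available on every process.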
46 MPI Analysis
- Advantages
- Wide availability of efficient implementations
- Support for modular design, code reuse
- Disadvantages
- Low level (parallel assembly code)
- Less well-suited to shared-memory machines
- Good for performance-critical codes with natural task modularity
47 PCF
- Standardization (circa 1993) of shared memory parallelism in Fortran
- A PCF program is multithreaded, with explicit synchronization between the threads and shared variables
- A PCF program is divided into regions
- Serial regions: only the master thread executes
- Parallel regions: work is shared by all threads
48 PCF and OpenMP
- PCF per se was not widely implemented
- Timing: distributed memory became popular
- Complexity: many details for special cases
- Not Invented Here (NIH) syndrome
- Its ideas resurfaced in OpenMP
- Primary differences are the spelling and low-level controls
- Also some significant simplification (claimed to add scalability)
49 PCF Example (SGI Variant)

!$DOACROSS LOCAL(I), SHARE(A,B,C),
!$&        REDUCTION(X),
!$&        IF (N.GT.1000),
!$&        MP_SCHEDTYPE=DYNAMIC, CHUNK=100
      DO I = 2, N-1
         A(I) = (B(I-1) + B(I) + B(I+1)) / 3
         X = X + A(I)/C(I)
      END DO

!$DOACROSS LOCAL(I), SHARE(D,E),
!$&        MP_SCHEDTYPE=SIMPLE
      DO I = 1, N
         D(I) = SIN(E(I))
      END DO

Callouts on the slide:
- PCF standard
- X is a summation
- Conditional parallelization
- Iterations managed first-come, first-served, in blocks of 100
- Iterations blocked evenly among threads (INTERLEAVE, GSS, RUNTIME scheduling also available)
50 OpenMP Example

!$OMP PARALLEL DO SHARED(A,B,C),
!$OMP&  REDUCTION(+:X),
!$OMP&  SCHEDULE(DYNAMIC, 100)
      DO I = 2, N-1
         A(I) = (B(I-1) + B(I) + B(I+1)) / 3
         X = X + A(I)/C(I)
      END DO
!$OMP END PARALLEL DO

!$OMP PARALLEL DO SHARED(D,E),
!$OMP&  SCHEDULE(STATIC)
      DO I = 1, N
         D(I) = SIN(E(I))
      END DO
!$OMP END PARALLEL DO

Callouts on the slide:
- X is a summation
- SCHEDULE(DYNAMIC, 100): iterations managed first-come, first-served, in blocks of 100
- SCHEDULE(STATIC): iterations blocked evenly among threads (GUIDED scheduling also available)
51 PCF/OpenMP Analysis
- Advantages
- Convenient for shared memory, especially when using vendor extensions
- Disadvantages
- Tied strongly to shared memory
- Few standard features for locality control
- A good choice for shared-memory and DSM machines, but portability is still hard
52 Portable, Extensible Toolkit for Scientific Computations (PETSc)
- A higher-level approach to solving PDEs
- Not parallel per se, but easy to use that way
- User-level library provides
- Linear and nonlinear solvers
- Standard and advanced options (e.g. parallel preconditioners)
- Programmer supplies
- Application-specific set-up, data structures, PDE operators
53 PETSc Example

      SNES     snes
      Mat      J
      Vec      x, F
      integer  n, its, ierr

      call MatCreate(MPI_COMM_WORLD,n,n,J,ierr)
      call VecCreate(MPI_COMM_WORLD,n,x,ierr)
      call VecDuplicate(x,F,ierr)
      call SNESCreate(MPI_COMM_WORLD,SNES_NONLINEAR_EQUATIONS,snes,ierr)
      call SNESSetFunction(snes,F,EvaluateFunction,PETSC_NULL,ierr)
      call SNESSetJacobian(snes,J,EvaluateJacobian,PETSC_NULL,ierr)
      call SNESSetFromOptions(snes,ierr)
      call SNESSolve(snes,x,its,ierr)
      call SNESDestroy(snes,ierr)
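Because the example calls SNESSetFromOptions, solver behavior can then be adjusted at run time with PETSc command-line options (for instance -snes_monitor to print convergence history) without recompiling; the program itself is launched like any other MPI program.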
54 PETSc Analysis
- A rather different beast from the other tools
- The P does not stand for Parallel
- Most concurrency in user code, often MPI
- Advantages
- Easy access to advanced numerical methods
- Disadvantages
- Limited scope
- Good for implicit or explicit PDE solutions