Title: Programming Paradigms and Algorithms
1. Programming Paradigms and Algorithms
- WA 3.1, 3.2, p. 178, 5.1, 5.3.3, Chapter 6, 9.2.8, 10.4.1
- Kumar 12.1.3
- Berman, F., Wolski, R., Figueira, S., Schopf, J. and Shao, G., "Application-Level Scheduling on Distributed Heterogeneous Networks," Proceedings of Supercomputing '96 (http://apples.ucsd.edu)
2. Common Parallel Programming Paradigms
- Embarrassingly parallel programs
- Workqueue
- Master/Slave programs
- Monte Carlo methods
- Regular, Iterative (Stencil) Computations
- Pipelined Computations
- Synchronous Computations
3. Regular, Iterative Stencil Applications
- Many scientific applications have the following format:
- Loop until some condition is true
- Perform a computation which involves communicating with the N, E, W, S neighbors of a point (5-point stencil)
- Convergence test?
4. Stencil Example: Jacobi 2D
- The Jacobi algorithm, also known as the method of simultaneous corrections, is an iterative method for approximating the solution to a system of linear equations.
- Jacobi addresses the problem of solving n linear equations in n unknowns, $Ax = b$, where the ith equation is
  $\sum_{j=1}^{n} a_{ij} x_j = b_i$
  or alternatively
  $x_i = \left( b_i - \sum_{j \neq i} a_{ij} x_j \right) / a_{ii}$
- As the a's and b's are known, we want to solve for the x's.
5. Jacobi 2D Strategy
- The Jacobi strategy iterates until the computation converges to a solution, i.e. at each iteration we solve
  $x_i^{(k)} = \left( b_i - \sum_{j \neq i} a_{ij} x_j^{(k-1)} \right) / a_{ii}$
  where the values from the (k-1)st iteration are used to compute the values for the kth iteration.
- For important classes of problems, Jacobi converges to a good solution after O(log N) iterations [Leighton].
- Typically, the solution is approximated to a desired error threshold.
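To make the iteration concrete, here is a minimal C sketch of one Jacobi sweep for a dense system; the function name, array layout, and convergence bookkeeping are illustrative assumptions, not from the original slides.

    #include <math.h>

    /* One Jacobi sweep: compute x^(k) from x^(k-1) for Ax = b.
       Returns the maximum change across all unknowns. */
    double jacobi_sweep(int n, double a[n][n], const double b[n],
                        const double xold[n], double xnew[n])
    {
        double maxdiff = 0.0;
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                if (j != i)
                    sum += a[i][j] * xold[j];   /* off-diagonal terms */
            xnew[i] = (b[i] - sum) / a[i][i];   /* requires a[i][i] != 0 */
            double d = fabs(xnew[i] - xold[i]);
            if (d > maxdiff) maxdiff = d;
        }
        return maxdiff;
    }

The caller repeats the sweep, swapping xold and xnew each time, until the returned maximum change falls below the desired error threshold.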
6. Jacobi 2D
- The system is most efficient to solve when most of the a's are 0.
- When most of the a entries are non-zero, A is dense.
- When most of the a's are 0, A is sparse.
- Sparse matrices are regularly found in many scientific applications.
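Sparsity is what makes large instances tractable: rather than storing all $n \times n$ entries, only the non-zeros are kept. Below is a minimal compressed sparse row (CSR) sketch in C; the struct and field names are illustrative, not from the slides.

    /* Compressed Sparse Row (CSR) storage: only non-zero entries kept.
       For the Laplace A matrix this means at most 5 stored entries
       per row instead of n*n values per row. */
    typedef struct {
        int     n;       /* number of rows                             */
        int    *rowptr;  /* entries of row i: rowptr[i]..rowptr[i+1]-1 */
        int    *colidx;  /* column index of each stored entry          */
        double *val;     /* value of each stored entry                 */
    } csr_matrix;

    /* y = A * x for a CSR matrix: the kernel inside a Jacobi sweep. */
    void csr_matvec(const csr_matrix *A, const double *x, double *y)
    {
        for (int i = 0; i < A->n; i++) {
            double sum = 0.0;
            for (int k = A->rowptr[i]; k < A->rowptr[i+1]; k++)
                sum += A->val[k] * x[A->colidx[k]];
            y[i] = sum;
        }
    }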
7. Laplace's Equation
- The Jacobi strategy can be used effectively to solve sparse linear equations.
- One such equation is Laplace's equation:
  $\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = 0$
- f is solved over a 2D space having coordinates x and y.
- If the distance between points (D) is small enough, f can be approximated by the central differences
  $\frac{\partial^2 f}{\partial x^2} \approx \frac{f(x+D, y) - 2 f(x, y) + f(x-D, y)}{D^2}$
  $\frac{\partial^2 f}{\partial y^2} \approx \frac{f(x, y+D) - 2 f(x, y) + f(x, y-D)}{D^2}$
- These equations reduce to
  $f(x, y) = \frac{f(x+D, y) + f(x-D, y) + f(x, y+D) + f(x, y-D)}{4}$
8. Laplace's Equation
- Note the relationship between the parameters: the updated value at (x, y) is the average of its four nearest neighbors.
- This forms a 4-point stencil.
- Any update will involve only local communication!
9. Solving Laplace's Equation Using the Jacobi Strategy
- Note that in Laplace's equation, we want to solve for all f(x, y), which has 2 parameters.
- In Jacobi, we want to solve for x_i, which has only 1 index.
- How do we convert f(x, y) into x_i?
- Associate the x_i's with the f(x, y)'s by distributing them in the f 2D matrix in row-major (natural) order (see the sketch below).
- For an n x n matrix there are then n x n x_i's, so the A matrix will need to be (n x n) x (n x n).
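A small sketch of the row-major mapping, assuming 0-based indices; the function name is illustrative.

    /* Row-major (natural) ordering of an n x n grid, 0-based:
       grid point f(row, col) <-> unknown x[i] with i = row * n + col. */
    int index_of(int row, int col, int n) { return row * n + col; }

    /* For an interior point i, the flattened N/E/W/S neighbors are
       north = i - n, south = i + n, west = i - 1, east = i + 1. */

Boundary points have fewer neighbors, which is why some rows of A end up with fewer than 5 non-zero entries.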
10. Solving Laplace's Equation Using the Jacobi Strategy
- When the x_i's are distributed in the f 2D matrix in row-major (natural) order,
  $f(x, y) = \frac{f(x+D, y) + f(x-D, y) + f(x, y+D) + f(x, y-D)}{4}$
  becomes
  $x_i = \frac{x_{i+1} + x_{i-1} + x_{i+n} + x_{i-n}}{4}$
11. Working Backward
- Now we want to work backward to find out what the A matrix and b vector will be for Jacobi.
- Our solution to Laplace's equation gives us equations of this form:
  $x_i = \frac{x_{i+1} + x_{i-1} + x_{i+n} + x_{i-n}}{4}$
- Rewriting, we get
  $4 x_i - x_{i+1} - x_{i-1} - x_{i+n} - x_{i-n} = 0$
- So the b_i are all 0. What is the A matrix?
12. Finding the A Matrix
- Each row has at most 5 non-zero entries.
- All entries on the diagonal are 4; each existing N/E/W/S neighbor contributes a -1 off the diagonal.
- (Figure: the A matrix for N = 9, n = 3; a sketch that builds and prints it follows.)
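To see the structure concretely, this sketch (not from the slides) builds the dense A for an n x n grid in row-major order and prints the N = 9, n = 3 example: 4's on the diagonal, -1 for each N/E/W/S neighbor, and at most 5 non-zeros per row.

    #include <stdio.h>

    /* Build the (n*n) x (n*n) Laplace matrix A in row-major order:
       A[i][i] = 4, A[i][j] = -1 for each N/E/W/S neighbor j of i. */
    void build_laplace(int n, double A[n*n][n*n])
    {
        int N = n * n;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = 0.0;
        for (int r = 0; r < n; r++)
            for (int c = 0; c < n; c++) {
                int i = r * n + c;                  /* row-major index */
                A[i][i] = 4.0;                      /* diagonal        */
                if (r > 0)     A[i][i - n] = -1.0;  /* north neighbor  */
                if (r < n - 1) A[i][i + n] = -1.0;  /* south neighbor  */
                if (c > 0)     A[i][i - 1] = -1.0;  /* west neighbor   */
                if (c < n - 1) A[i][i + 1] = -1.0;  /* east neighbor   */
            }
    }

    int main(void)    /* prints the N = 9 (n = 3) example */
    {
        enum { n = 3 };
        double A[n*n][n*n];
        build_laplace(n, A);
        for (int i = 0; i < n*n; i++) {
            for (int j = 0; j < n*n; j++)
                printf("%3.0f", A[i][j]);
            printf("\n");
        }
        return 0;
    }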
13. Jacobi Implementation Strategy
- An initial guess is made for all the unknowns, typically x_i = b_i.
- New values for the x_i's are calculated using the iteration equations.
- The updated values are substituted in the iteration equations and the process repeats.
- The user provides a "termination condition" to end the iteration.
- An example termination condition is
  $\max_i |x_i^{(k)} - x_i^{(k-1)}| <$ error threshold.
14. Data Parallel Jacobi 2D Pseudo-code

    // Initialize ghost regions
    for (i = 1; i <= N; i++) {
        x[0][i]   = north[i];
        x[N+1][i] = south[i];
        x[i][0]   = west[i];
        x[i][N+1] = east[i];
    }
    // Initialize matrix
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            x[i][j] = init_value;
    // Iterative refinement of x until values converge
    maxdiff = CONVERG + 1;   // force at least one iteration
    while (maxdiff > CONVERG) {
        // Update x array
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++)
                new_x[i][j] = 0.25 * (x[i-1][j] + x[i][j+1] +
                                      x[i+1][j] + x[i][j-1]);
        // Convergence test
        maxdiff = 0;
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++) {
                if (fabs(new_x[i][j] - x[i][j]) > maxdiff)
                    maxdiff = fabs(new_x[i][j] - x[i][j]);
                x[i][j] = new_x[i][j];
            }
    }
15. Jacobi 2D Programming Issues
- Synchronization
- Should we synchronize between iterations? Between multiple iterations?
- Should we tag information and let the application run asynchronously? (How bad can things get?)
- How often should we test for convergence?
- How important is it to know when we're done?
- How expensive is it?
16. Jacobi 2D Programming Issues
- Block decomposition or strip decomposition?
- How big should the blocks or strips be?
- How should blocks/strips be allocated to processors?
- (Figure: Block, Uniform Strip, and Non-uniform Strip decompositions.)
17. HPF-Style Data Decompositions
- 1D (processors P0, P1, P2, P3; tasks 0-15); owner formulas are sketched after the figure below
- Block decomposition (task i allocated to processor floor(i/b) for block size b = n/p; here floor(i/4))
- Cyclic decomposition (task i allocated to processor i mod p)
- Block-cyclic decomposition (block i allocated to processor i mod p)
- (Figure: Block, Cyclic, and Block-cyclic decompositions.)
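Each of these owner mappings reduces to a one-line formula. A hedged C sketch (function names are illustrative; b is the block size, ceil(n/p) in the block case):

    /* Owner of task i under the three 1D decompositions.
       n = number of tasks, p = number of processors, b = block size. */
    int block_owner(int i, int n, int p)        { return i / ((n + p - 1) / p); }
    int cyclic_owner(int i, int p)              { return i % p; }
    int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }

With n = 16 and p = 4 the block size is 4, so block_owner matches the slide's floor(i/4).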
18. HPF-Style Data Decompositions
- 2D
- Each dimension is partitioned by block, cyclic, block-cyclic, or * (do nothing).
- A useful set of uniform decompositions can be constructed (see the sketch below).
- (Figure: (Block, Block), (Block, *), and (*, Cyclic) decompositions.)
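Since each dimension is partitioned independently, a 2D owner is just the pair of 1D owners. A minimal sketch for the (Block, Block) case, assuming a Pr x Pc processor grid (names illustrative):

    /* (Block, Block) owner of grid point (i, j) on an n x n grid
       mapped to a Pr x Pc processor grid (linearized row-major). */
    int block_block_owner(int i, int j, int n, int Pr, int Pc)
    {
        int br = (n + Pr - 1) / Pr;   /* block height per processor row   */
        int bc = (n + Pc - 1) / Pc;   /* block width per processor column */
        return (i / br) * Pc + (j / bc);
    }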
19. Jacobi on a Cluster
- If each partition of Jacobi is executed on a
processor in a lab cluster, we can no longer
assume we have dedicated processors and network - In particular, the performance exhibited by the
cluster will vary over time and with load - How can we go about developing a
performance-efficient implementation in a more
dynamic environment?
20. Jacobi AppLeS
- We developed an AppLeS application scheduler.
- AppLeS = Application-Level Scheduler.
- An AppLeS is a scheduling agent which integrates with the application to form a Grid-aware, adaptive, self-scheduling application.
- We targeted the Jacobi AppLeS to a distributed clustered environment.
21. How Does AppLeS Work?
(Diagram: the AppLeS agent couples with the application to form a self-scheduling application. From the accessible resources it forms feasible resource sets, evaluates candidate schedules using the Grid Infrastructure and NWS, and dispatches the best schedule to the resources.)
22. Network Weather Service (Wolski, U. Tenn.)
- The NWS provides dynamic resource information for AppLeS.
- NWS is a stand-alone system.
- NWS:
- monitors the current system state
- provides the best forecast of resource load from multiple models
23. Jacobi 2D AppLeS Resource Selector
- Feasible resources are determined according to an application-specific distance metric.
- Choose the fastest machine as the locus.
- Compute the distance D from the locus based on a unit-sized application-specific benchmark:
  D(locus, X) = comp(unit, locus) - comp(unit, X) + comm(W, E columns)
- Resources are sorted according to distance from the locus, forming a desirability list.
- Feasible resource sets are formed from initial subsets of the sorted desirability list (see the sketch below).
- Next step: plan a schedule for each feasible resource set.
- The scheduler will choose the schedule with the best predicted execution time.
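A sketch of how the desirability list might be computed; the struct, field names, and the sign convention in D are assumptions, since the slide gives the metric only schematically.

    #include <stdlib.h>

    typedef struct {
        const char *name;
        double comp_unit;  /* benchmark time for a unit of work  */
        double comm_cols;  /* time to exchange W/E ghost columns */
        double dist;       /* distance D from the locus          */
    } machine;

    static int by_dist(const void *a, const void *b)
    {
        double d = ((const machine *)a)->dist - ((const machine *)b)->dist;
        return (d > 0) - (d < 0);
    }

    /* Sort machines by distance from the locus; m[0] is assumed to be
       the fastest machine (the locus) on entry. Feasible resource
       sets are then the prefixes of the sorted list. */
    void build_desirability_list(machine *m, int count)
    {
        for (int i = 0; i < count; i++)   /* slower or farther => larger D */
            m[i].dist = (m[i].comp_unit - m[0].comp_unit) + m[i].comm_cols;
        qsort(m, count, sizeof(machine), by_dist);
    }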
24. Jacobi 2D Performance Model and Schedule Planning
- Execution time for the ith strip:
  $T_i = comp_i / load_i + comm_i$
  where load_i is the predicted percentage of CPU time available (NWS), and comm_i is the time to send and receive messages, factored by the predicted bandwidth (NWS).
- AppLeS uses time-balancing to determine the best partition on a given set of resources:
- Solve
  $T_1 = T_2 = \cdots = T_p$
  for the strip sizes.
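If the communication term is ignored, time-balancing has a simple closed form: assign each machine rows in proportion to its predicted available speed, which equalizes the T_i. A hedged sketch under that simplification (function and parameter names are assumptions):

    /* Split n rows among p machines so predicted compute times match.
       speed[i] = predicted available speed of machine i, e.g.
       (benchmark rate) * (NWS-predicted fraction of CPU available).
       Simplification: communication cost is ignored here. */
    void time_balance(int n, int p, const double speed[], int rows[])
    {
        double total = 0.0;
        for (int i = 0; i < p; i++) total += speed[i];

        int assigned = 0;
        for (int i = 0; i < p; i++) {
            rows[i] = (int)(n * speed[i] / total);  /* proportional share */
            assigned += rows[i];
        }
        /* Hand leftover rows (from rounding down) to the fastest machine. */
        int fastest = 0;
        for (int i = 1; i < p; i++)
            if (speed[i] > speed[fastest]) fastest = i;
        rows[fastest] += n - assigned;
    }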
25. Jacobi 2D Experiments
- Experiments compare:
- Compile-time block (HPF-style) partitioning
- Compile-time irregular strip partitioning (no NWS forecasts, no resource selection)
- Run-time strip AppLeS partitioning
- Runs for the different partitioning methods were performed back-to-back on production systems.
- Average execution time was recorded.
- Distributed UCSD/SDSC platform: Sparcs, RS6000, Alpha Farm, SP-2.
26. Jacobi 2D AppLeS Experiments
- A representative Jacobi 2D AppLeS experiment.
- Adaptive scheduling leverages the deliverable performance of a contended system.
- The spike occurs when a gateway between the PCL and SDSC goes down.
- Subsequent AppLeS experiments avoid the slow link.