Title: Patterns of parallelism
1 Patterns of parallelism
- Materials customized from Prof. Sandhya Dwarkadas's lecture notes: http://www.cs.rochester.edu/users/faculty/sandhya/csc258/
2 Steps in parallelization
- Decompose a job into tasks
- Assign the tasks to processes
- Orchestration: communication of data, synchronization among processes
- Decomposing the job is the most important part of the process.
3 Decomposing jobs
- Two types of jobs
- Same operation on a large set of data
- Vector addition: A = B + C
- for (i = 0; i < size; i++) A[i] = B[i] + C[i];
- Decompose based on data domain (see the sketch after this list)
- Apply a sequence of functions to the data
- Photoshop
- final_photo = red_eye_removal(brighter(increased_contrast(original_photo)))
- Decompose based on function.
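- As a concrete sketch of decomposition by data domain, the vector addition above can be split so that each worker handles a contiguous block of the index range. This is a minimal pthreads version; the thread count, array size, and function names are illustrative and not part of the original notes.

#include <pthread.h>

#define SIZE 1000000
#define P    4                         /* number of worker threads (illustrative) */

double A[SIZE], B[SIZE], C[SIZE];

/* Each worker adds its own contiguous block of the vectors. */
void *add_block(void *arg)
{
    long id = (long)arg;
    long lo = id * (SIZE / P);
    long hi = (id == P - 1) ? SIZE : lo + SIZE / P;
    for (long i = lo; i < hi; i++)
        A[i] = B[i] + C[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[P];
    for (long p = 0; p < P; p++)
        pthread_create(&tid[p], NULL, add_block, (void *)p);
    for (long p = 0; p < P; p++)
        pthread_join(tid[p], NULL);
    return 0;
}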
4 How to decompose a job?
- Domain decomposition and functional decomposition
- Domain decomposition: the data associated with the problem is decomposed. Each parallel task then works on a portion of the data.
- Functional decomposition: the focus is on the computation that is to be performed. The problem is decomposed according to the work that must be done, and each task then performs a portion of the overall work.
5 How to decompose a job?
- Data parallelism and task parallelism
- Data parallelism (domain decomposition): all processes do the same thing on different data
- Regular and irregular problems (using linear or non-linear indexing)
- Task parallelism (functional decomposition): different processes do different tasks
- Task queues
- Pipelining
6 Domain decomposition
7 Domain decomposition methods
8 Functional decomposition
9 Functional decomposition example: signal processing
10 Data parallelism examples
- Example 1: Matrix multiplication
- Multiply two n-by-n matrices into a third n-by-n matrix.
11 Matrix multiply

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    c[i][j] = 0.0;
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];

C[i][j] = a[i][0]*b[0][j] + a[i][1]*b[1][j] + ... + a[i][N-1]*b[N-1][j]
12Parallel matrix multiply
- No loop-carried dependence in i- and j- loops.
- Loop-carried dependence in k-loop.
- All i- and j- iterations can be run in parallel
- If we have P processors, we can give n/P rows or
columns to each processor (block or cyclic). - Or we can divide the matrix into P squares and
give each processor a square.
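- A minimal sketch of the row-block assignment described above, assuming shared memory and pthreads; the fixed N, thread count P, and the function name mm_rows are illustrative.

#include <pthread.h>

#define N 512                          /* matrix dimension (illustrative) */
#define P 4                            /* number of processors/threads */

double a[N][N], b[N][N], c[N][N];

/* Each thread computes a contiguous block of N/P rows of c. */
void *mm_rows(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / P);
    long hi = (id == P - 1) ? N : lo + N / P;
    for (long i = lo; i < hi; i++)
        for (long j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (long k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    return NULL;
}

int main(void)
{
    pthread_t tid[P];
    for (long p = 0; p < P; p++)
        pthread_create(&tid[p], NULL, mm_rows, (void *)p);
    for (long p = 0; p < P; p++)
        pthread_join(tid[p], NULL);
    return 0;
}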
13 SOR
- SOR (successive over-relaxation) implements a mathematical model for many natural phenomena.
- Example: heat dissipation in a metal sheet.
- The model solves a partial differential equation.
- Iterative method.
- Discretize the problem into a mesh of grid points and solve for the values at the mesh points.
15 Relaxation algorithm
- For some number of iterations
- For each internal grid point
- Compute the average of its four neighbors.
- Termination condition
- The values at the grid points change very little (the difference between the values in the new array and the old array is below a threshold).
16 SOR code

/* initialization */
for (i = 0; i <= n+1; i++) grid[i][0] = 0.0;
for (i = 0; i <= n+1; i++) grid[i][n+1] = 0.0;
for (j = 0; j <= n+1; j++) grid[0][j] = 1.0;
for (j = 0; j <= n+1; j++) grid[n+1][j] = 0.0;
for (i = 1; i <= n; i++)
  for (j = 1; j <= n; j++)
    grid[i][j] = 0.0;
17 SOR code

/* iteration */
error = 1000.0;
while (error > threshold) {
  for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++)
      temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                           grid[i][j-1] + grid[i][j+1]);
  error = 0.0;
  for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++) {
      if (fabs(temp[i][j] - grid[i][j]) > error)
        error = fabs(temp[i][j] - grid[i][j]);
      grid[i][j] = temp[i][j];
    }
}
18 Parallel SOR
- No dependences between iterations of the first (i, j) loop nest.
- No dependences between iterations of the second (i, j) loop nest besides the reduction of error.
- Anti-dependence between the first and the second loop nest within the same step.
- True dependence between the second loop nest and the first loop nest of the next step.
19 Parallel SOR
- The first loop nest can be parallelized.
- The second loop nest can be parallelized (with reduction support).
- We must make processors wait at the end of each loop nest.
- Use either a barrier or fork-join (see the sketch below).
- If we have P processors, we can give n/P rows or columns to each processor.
- Or we can divide the array into P squares and give each processor a square to compute.
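- One way to express this structure is with OpenMP, which supplies the implicit barrier after each loop nest and (in OpenMP 3.1 and later) a max-reduction for error. The sketch below assumes the grid and temp arrays and the threshold from the sequential code on slides 16 and 17; it is an illustration, not the notes' own implementation.

#include <math.h>

/* Assumes grid and temp are (n+2)x(n+2) arrays initialized as on slide 16. */
void sor_parallel(int n, double grid[n+2][n+2], double temp[n+2][n+2],
                  double threshold)
{
    double error = 1000.0;
    while (error > threshold) {
        /* First loop nest: iterations are independent. */
        #pragma omp parallel for
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);
        /* implicit barrier here */

        error = 0.0;
        /* Second loop nest: independent except for the error reduction. */
        #pragma omp parallel for reduction(max:error)
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                if (fabs(temp[i][j] - grid[i][j]) > error)
                    error = fabs(temp[i][j] - grid[i][j]);
                grid[i][j] = temp[i][j];
            }
        /* implicit barrier before the next while-iteration */
    }
}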
20 Molecular Dynamics (MD)
- Simulation of a set of bodies under the influence of physical laws.
- Atoms, molecules, celestial bodies.
- Applications in many areas: weather modeling, materials science, ocean modeling, galaxy formation.
- MD has different forms but the same basic structure.
21 Molecular Dynamics code structure

for some number of timesteps
  for all molecules i
    for all other molecules j
      force[i] += f(loc[i], loc[j])
  for all molecules i
    loc[i] = g(loc[i], force[i])
22 Molecular Dynamics code structure
- To reduce the amount of computation, account for interactions only with nearby molecules (molecules beyond the cut-off distance have no effect).
23 Molecular Dynamics code structure

for some number of timesteps
  for all molecules i
    for all nearby molecules j
      force[i] += f(loc[i], loc[j])
  for all molecules i
    loc[i] = g(loc[i], force[i])
24Optimized Molecular Dynamics code
- For each molecule I
- keep the number of nearby molecules at
countI - array of indices of nearby molecules
indexij (0 ltj lt countI) - For some number of timesteps
- for (i0 iltnum_mol i)
- for (j0 jltcounti j)
- forcei f(loci,
locindexij) - for (i0 iltnum_mol i)
- loci g (loci, forcei)
25 Molecular dynamics
- No loop-carried dependence in the first i-loop.
- Loop-carried dependence (reduction) in the j-loop.
- No loop-carried dependence in the second i-loop.
- True dependence between the first and second i-loops.
26 Molecular dynamics
- The first i-loop can be parallelized.
- The second i-loop can be parallelized.
- Must make processors wait between the loops
- Barrier or fork-join (see the sketch below).
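- A sketch of the same barrier-separated structure for the optimized MD loops of slide 24, again using OpenMP for brevity. The functions f and g are application-defined placeholders, and the array types are illustrative.

double f(double loc_i, double loc_j);    /* force contribution (application-defined) */
double g(double loc_i, double force_i);  /* position update (application-defined) */

/* One timestep of the parallel MD loop structure (sketch).
   force, loc, count, index follow the layout on slide 24. */
void md_step(int num_mol, double force[], double loc[],
             int count[], int *index[])
{
    /* First i-loop: each i accumulates only into force[i],
       so iterations are independent. */
    #pragma omp parallel for
    for (int i = 0; i < num_mol; i++) {
        force[i] = 0.0;
        for (int j = 0; j < count[i]; j++)
            force[i] += f(loc[i], loc[index[i][j]]);
    }
    /* Implicit barrier: all forces are ready before positions move. */

    /* Second i-loop: also independent across i. */
    #pragma omp parallel for
    for (int i = 0; i < num_mol; i++)
        loc[i] = g(loc[i], force[i]);
}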
27 Irregular vs. regular data parallelism
- In SOR and MM, all arrays are accessed through linear expressions of the loop indices, known at compile time (regular).
- In MD, some arrays are accessed through non-linear expressions of the loop indices, some known only at runtime (irregular).
28 Irregular vs. regular data parallelism
- No real difference in terms of parallelization (based on dependences).
- Leads to fundamental differences in how the parallelism is expressed:
- Irregularity makes parallelism based on data distribution difficult.
- It is not difficult for parallelism based on iteration distribution.
29 Irregular vs. regular data parallelism
- Parallelization of the first loop
- Has a load-balancing issue
- Some molecules have few neighbors, others have many.
- Needs more sophisticated loop-partitioning schemes (see the sketch below).
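- One common remedy (an assumption here, not prescribed by the notes) is dynamic, chunked self-scheduling of the i-iterations, so threads that finish molecules with few neighbors grab more work. In OpenMP this is just a schedule clause on the force loop from the previous sketch:

/* Dynamic (chunked) scheduling: threads grab small batches of
   iterations as they finish, smoothing out uneven neighbor counts. */
#pragma omp parallel for schedule(dynamic, 16)
for (int i = 0; i < num_mol; i++) {
    force[i] = 0.0;
    for (int j = 0; j < count[i]; j++)
        force[i] += f(loc[i], loc[index[i][j]]);
}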
30 Task parallelism
- Each process performs a different task.
- Two principal flavors:
- Pipelines
- Task queues (see the sketch after this list)
- Program examples: PIPE (pipeline), TSP (task queue)
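- A minimal task-queue sketch, assuming shared memory and a mutex-protected counter that hands out task indices; all names are illustrative (the TSP program in the notes uses a richer queue of partial tours).

#include <pthread.h>

#define NUM_TASKS 1000

static int next_task = 0;                              /* shared queue index */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

void do_task(int t) { (void)t; /* application-specific work */ }

/* Each worker repeatedly grabs the next unclaimed task until none remain. */
void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int t = next_task++;
        pthread_mutex_unlock(&qlock);
        if (t >= NUM_TASKS)
            break;
        do_task(t);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    for (long p = 0; p < 4; p++)
        pthread_create(&tid[p], NULL, worker, NULL);
    for (long p = 0; p < 4; p++)
        pthread_join(tid[p], NULL);
    return 0;
}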
31 Pipeline
- Often occurs in image-processing applications (Photoshop), where a number of images undergo a sequence of transformations.

for (i = 0; i < num_pics; i++) {
  read(in_pic[i]);
  int_pic_1[i] = trans1(in_pic[i]);
  int_pic_2[i] = trans2(int_pic_1[i]);
  int_pic_3[i] = trans3(int_pic_2[i]);
  out_pic[i] = trans4(int_pic_3[i]);
}
32 Sequential vs. pipelined execution
(figure: sequential execution vs. pipelined execution timelines)
33 Realizing the pipeline

Process 1:
for (i = 0; i < num_pics; i++) {
  read(in_pic[i]);
  int_pic_1[i] = trans1(in_pic[i]);
  signal(event_1_2[i]);
}
34 Realizing the pipeline

Process 2:
for (i = 0; i < num_pics; i++) {
  wait(event_1_2[i]);
  int_pic_2[i] = trans2(int_pic_1[i]);
  signal(event_2_3[i]);
}

Process 3 is similar.
35 Realizing the pipeline

Process 4:
for (i = 0; i < num_pics; i++) {
  wait(event_3_4[i]);
  out_pic[i] = trans4(int_pic_3[i]);
}
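- The signal/wait events above can be realized with counting semaphores: each stage posts once per completed picture, and the next stage waits before consuming it. The sketch below covers only stages 1 and 2, with placeholder picture types and transforms; none of these names come from the original notes.

#include <semaphore.h>
#include <pthread.h>

#define NUM_PICS 100

/* Placeholder picture storage and transforms (application-defined). */
typedef struct { unsigned char data[64]; } pic_t;
pic_t in_pic[NUM_PICS], int_pic_1[NUM_PICS], int_pic_2[NUM_PICS];
pic_t trans1(pic_t p) { return p; }
pic_t trans2(pic_t p) { return p; }
void  read_pic(pic_t *p) { (void)p; }

sem_t event_1_2;                       /* counts pictures finished by stage 1 */

/* Stage 1: read and apply trans1, then signal stage 2. */
void *stage1(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_PICS; i++) {
        read_pic(&in_pic[i]);
        int_pic_1[i] = trans1(in_pic[i]);
        sem_post(&event_1_2);          /* signal(event_1_2[i]) */
    }
    return NULL;
}

/* Stage 2: wait for each picture from stage 1, then apply trans2. */
void *stage2(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_PICS; i++) {
        sem_wait(&event_1_2);          /* wait(event_1_2[i]) */
        int_pic_2[i] = trans2(int_pic_1[i]);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    sem_init(&event_1_2, 0, 0);
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}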
36 Pipeline issues
- One stage takes more time than the others.
- Stages take a variable amount of time.
- Extra buffers provide some cushion against variability.