Title: Patterns of parallelism
1 Patterns of parallelism
- Materials customized from Prof. Sandhya Dwarkadas's lecture notes: http://www.cs.rochester.edu/users/faculty/sandhya/csc258/
2 Steps in parallelization
- Decompose a job into tasks
- Assign the tasks to processes
- Orchestration: communication of data, synchronization among processes
- Decomposing the job is the most important part of the process.
3 Decomposing jobs
- Two types of jobs
- Same operation on a large set of data
- Vector addition: A = B + C
- for (i = 0; i < size; i++) A[i] = B[i] + C[i];
- Decompose based on data domain (see the sketch after this list)
- Apply a sequence of functions to the data
- Photoshop
- final_photo = red_eye_removal(brighter(increased_contrast(original_photo)))
- Decompose based on function.
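- As a concrete sketch of decomposition by data domain, the vector addition above can be split so that each worker handles a contiguous block of the index range. This is a minimal pthreads version; the thread count, array size, and function names are illustrative and not part of the original notes.

#include <pthread.h>

#define SIZE 1000000
#define P    4                         /* number of worker threads (illustrative) */

double A[SIZE], B[SIZE], C[SIZE];

/* Each worker adds its own contiguous block of the vectors. */
void *add_block(void *arg)
{
    long id = (long)arg;
    long lo = id * (SIZE / P);
    long hi = (id == P - 1) ? SIZE : lo + SIZE / P;
    for (long i = lo; i < hi; i++)
        A[i] = B[i] + C[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[P];
    for (long p = 0; p < P; p++)
        pthread_create(&tid[p], NULL, add_block, (void *)p);
    for (long p = 0; p < P; p++)
        pthread_join(tid[p], NULL);
    return 0;
}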
4 How to decompose a job?
- Domain decomposition and functional decomposition
- Domain decomposition: the data associated with the problem is decomposed. Each parallel task then works on a portion of the data.
- Functional decomposition: the focus is on the computation that is to be performed. The problem is decomposed according to the work that must be done, and each task then performs a portion of the overall work.
5 How to decompose a job?
- Data parallelism and task parallelism
- Data parallelism (domain decomposition): all processes do the same thing on different data
- Regular and irregular problems (using linear or non-linear indexing)
- Task parallelism (functional decomposition): different processes do different tasks
- Task queues
- Pipelining
6 Domain decomposition
7 Domain decomposition methods
8 Functional decomposition
9 Functional decomposition example: signal processing
10 Data parallelism examples
- Example 1: Matrix multiplication
- Multiply two n-by-n matrices into a third n-by-n matrix.
11 Matrix multiply

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    c[i][j] = 0.0;
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];

C[i][j] = a[i][0]*b[0][j] + a[i][1]*b[1][j] + ... + a[i][N-1]*b[N-1][j]
12Parallel matrix multiply
- No loop-carried dependence in i- and j- loops.
- Loop-carried dependence in k-loop.
- All i- and j- iterations can be run in parallel
- If we have P processors, we can give n/P rows or
columns to each processor (block or cyclic). - Or we can divide the matrix into P squares and
give each processor a square.
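- A minimal sketch of the row-block assignment described above, assuming shared memory and pthreads; the fixed N, thread count P, and the function name mm_rows are illustrative.

#include <pthread.h>

#define N 512                          /* matrix dimension (illustrative) */
#define P 4                            /* number of processors/threads */

double a[N][N], b[N][N], c[N][N];

/* Each thread computes a contiguous block of N/P rows of c. */
void *mm_rows(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / P);
    long hi = (id == P - 1) ? N : lo + N / P;
    for (long i = lo; i < hi; i++)
        for (long j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (long k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    return NULL;
}

int main(void)
{
    pthread_t tid[P];
    for (long p = 0; p < P; p++)
        pthread_create(&tid[p], NULL, mm_rows, (void *)p);
    for (long p = 0; p < P; p++)
        pthread_join(tid[p], NULL);
    return 0;
}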
13 SOR
- SOR (successive over-relaxation) implements a mathematical model for many natural phenomena.
- Example: heat dissipation in a metal sheet.
- The model solves a partial differential equation.
- Iterative method.
- Discretize the problem into a mesh of grid points and solve for the values at the mesh points.
15 Relaxation algorithm
- For some number of iterations
- For each internal grid point
- Compute the average of its four neighbors.
- Termination condition
- The values at the grid points change very little (the difference between the values in the new array and the old array is below a threshold).
16 SOR code

/* initialization */
for (i = 0; i <= n+1; i++) grid[i][0] = 0.0;
for (i = 0; i <= n+1; i++) grid[i][n+1] = 0.0;
for (j = 0; j <= n+1; j++) grid[0][j] = 1.0;
for (j = 0; j <= n+1; j++) grid[n+1][j] = 0.0;
for (i = 1; i <= n; i++)
  for (j = 1; j <= n; j++)
    grid[i][j] = 0.0;
17 SOR code

/* iteration */
error = 1000.0;
while (error > threshold) {
  for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++)
      temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                           grid[i][j-1] + grid[i][j+1]);
  error = 0.0;
  for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++) {
      if (fabs(temp[i][j] - grid[i][j]) > error)
        error = fabs(temp[i][j] - grid[i][j]);
      grid[i][j] = temp[i][j];
    }
}
18 Parallel SOR
- No dependences between iterations of the first (i, j) loop nest.
- No dependences between iterations of the second (i, j) loop nest besides the reduction of error.
- Anti-dependence between the first and the second loop nest within the same step.
- True dependence between the second loop nest and the first loop nest of the next step.
19 Parallel SOR
- The first loop nest can be parallelized.
- The second loop nest can be parallelized (with reduction support).
- We must make processors wait at the end of each loop nest.
- Use either a barrier or fork-join (see the sketch below).
- If we have P processors, we can give n/P rows or columns to each processor.
- Or we can divide the array into P squares and give each processor a square to compute.
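- One way to express this structure is with OpenMP, which supplies the implicit barrier after each loop nest and (in OpenMP 3.1 and later) a max-reduction for error. The sketch below assumes the grid and temp arrays and the threshold from the sequential code on slides 16 and 17; it is an illustration, not the notes' own implementation.

#include <math.h>

/* Assumes grid and temp are (n+2)x(n+2) arrays initialized as on slide 16. */
void sor_parallel(int n, double grid[n+2][n+2], double temp[n+2][n+2],
                  double threshold)
{
    double error = 1000.0;
    while (error > threshold) {
        /* First loop nest: iterations are independent. */
        #pragma omp parallel for
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);
        /* implicit barrier here */

        error = 0.0;
        /* Second loop nest: independent except for the error reduction. */
        #pragma omp parallel for reduction(max:error)
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                if (fabs(temp[i][j] - grid[i][j]) > error)
                    error = fabs(temp[i][j] - grid[i][j]);
                grid[i][j] = temp[i][j];
            }
        /* implicit barrier before the next while-iteration */
    }
}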
20 Molecular Dynamics (MD)
- Simulation of a set of bodies under the influence of physical laws.
- Atoms, molecules, celestial bodies.
- Applications in many areas: weather modeling, materials science, ocean modeling, galaxy formation.
- MD has different forms but the same basic structure.
21 Molecular Dynamics code structure

for some number of timesteps
  for all molecules i
    for all other molecules j
      force[i] += f(loc[i], loc[j])
  for all molecules i
    loc[i] = g(loc[i], force[i])
22 Molecular Dynamics code structure
- To reduce the amount of computation, account for interactions only with nearby molecules (molecules beyond the cut-off distance have no effect).
23 Molecular Dynamics code structure

for some number of timesteps
  for all molecules i
    for all nearby molecules j
      force[i] += f(loc[i], loc[j])
  for all molecules i
    loc[i] = g(loc[i], force[i])
24Optimized Molecular Dynamics code
- For each molecule I
- keep the number of nearby molecules at
countI - array of indices of nearby molecules
indexij (0 ltj lt countI) - For some number of timesteps
- for (i0 iltnum_mol i)
- for (j0 jltcounti j)
- forcei f(loci,
locindexij) - for (i0 iltnum_mol i)
- loci g (loci, forcei)
25 Molecular dynamics
- No loop-carried dependence in the first i-loop.
- Loop-carried dependence (reduction) in the j-loop.
- No loop-carried dependence in the second i-loop.
- True dependence between the first and second i-loops.
26 Molecular dynamics
- The first i-loop can be parallelized.
- The second i-loop can be parallelized.
- Must make processors wait between the loops
- Barrier or fork-join (see the sketch below).
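- A sketch of the same barrier-separated structure for the optimized MD loops of slide 24, again using OpenMP for brevity. The functions f and g are application-defined placeholders, and the array types are illustrative.

double f(double loc_i, double loc_j);    /* force contribution (application-defined) */
double g(double loc_i, double force_i);  /* position update (application-defined) */

/* One timestep of the parallel MD loop structure (sketch).
   force, loc, count, index follow the layout on slide 24. */
void md_step(int num_mol, double force[], double loc[],
             int count[], int *index[])
{
    /* First i-loop: each i accumulates only into force[i],
       so iterations are independent. */
    #pragma omp parallel for
    for (int i = 0; i < num_mol; i++) {
        force[i] = 0.0;
        for (int j = 0; j < count[i]; j++)
            force[i] += f(loc[i], loc[index[i][j]]);
    }
    /* Implicit barrier: all forces are ready before positions move. */

    /* Second i-loop: also independent across i. */
    #pragma omp parallel for
    for (int i = 0; i < num_mol; i++)
        loc[i] = g(loc[i], force[i]);
}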
27 Irregular vs. regular data parallelism
- In SOR and MM, all arrays are accessed through linear expressions of the loop indices, known at compile time (regular).
- In MD, some arrays are accessed through non-linear expressions of the loop indices, some known only at runtime (irregular).
28 Irregular vs. regular data parallelism
- No real difference in terms of parallelization (based on dependences).
- Leads to fundamental differences in how the parallelism is expressed:
- Irregularity makes parallelism based on data distribution difficult.
- It is not difficult for parallelism based on iteration distribution.
29 Irregular vs. regular data parallelism
- Parallelization of the first loop
- Has a load-balancing issue
- Some molecules have few neighbors, others have many.
- Needs more sophisticated loop-partitioning schemes (see the sketch below).
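- One common remedy (an assumption here, not prescribed by the notes) is dynamic, chunked self-scheduling of the i-iterations, so threads that finish molecules with few neighbors grab more work. In OpenMP this is just a schedule clause on the force loop from the previous sketch:

/* Dynamic (chunked) scheduling: threads grab small batches of
   iterations as they finish, smoothing out uneven neighbor counts. */
#pragma omp parallel for schedule(dynamic, 16)
for (int i = 0; i < num_mol; i++) {
    force[i] = 0.0;
    for (int j = 0; j < count[i]; j++)
        force[i] += f(loc[i], loc[index[i][j]]);
}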
30 Task parallelism
- Each process performs a different task.
- Two principal flavors:
- Pipelines
- Task queues (see the sketch after this list)
- Program examples: PIPE (pipeline), TSP (task queue)
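- A minimal task-queue sketch, assuming shared memory and a mutex-protected counter that hands out task indices; all names are illustrative (the TSP program in the notes uses a richer queue of partial tours).

#include <pthread.h>

#define NUM_TASKS 1000

static int next_task = 0;                              /* shared queue index */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

void do_task(int t) { (void)t; /* application-specific work */ }

/* Each worker repeatedly grabs the next unclaimed task until none remain. */
void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int t = next_task++;
        pthread_mutex_unlock(&qlock);
        if (t >= NUM_TASKS)
            break;
        do_task(t);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    for (long p = 0; p < 4; p++)
        pthread_create(&tid[p], NULL, worker, NULL);
    for (long p = 0; p < 4; p++)
        pthread_join(tid[p], NULL);
    return 0;
}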
31 Pipeline
- Often occurs in image-processing applications (Photoshop), where a number of images undergo a sequence of transformations.

for (i = 0; i < num_pics; i++) {
  read(in_pic[i]);
  int_pic_1[i] = trans1(in_pic[i]);
  int_pic_2[i] = trans2(int_pic_1[i]);
  int_pic_3[i] = trans3(int_pic_2[i]);
  out_pic[i] = trans4(int_pic_3[i]);
}
32 Sequential vs. pipelined execution
(figure: sequential execution vs. pipelined execution timelines)
33 Realizing the pipeline

Process 1:
for (i = 0; i < num_pics; i++) {
  read(in_pic[i]);
  int_pic_1[i] = trans1(in_pic[i]);
  signal(event_1_2[i]);
}
34 Realizing the pipeline

Process 2:
for (i = 0; i < num_pics; i++) {
  wait(event_1_2[i]);
  int_pic_2[i] = trans2(int_pic_1[i]);
  signal(event_2_3[i]);
}

Process 3 is similar.
35 Realizing the pipeline

Process 4:
for (i = 0; i < num_pics; i++) {
  wait(event_3_4[i]);
  out_pic[i] = trans4(int_pic_3[i]);
}
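- The signal/wait events above can be realized with counting semaphores: each stage posts once per completed picture, and the next stage waits before consuming it. The sketch below covers only stages 1 and 2, with placeholder picture types and transforms; none of these names come from the original notes.

#include <semaphore.h>
#include <pthread.h>

#define NUM_PICS 100

/* Placeholder picture storage and transforms (application-defined). */
typedef struct { unsigned char data[64]; } pic_t;
pic_t in_pic[NUM_PICS], int_pic_1[NUM_PICS], int_pic_2[NUM_PICS];
pic_t trans1(pic_t p) { return p; }
pic_t trans2(pic_t p) { return p; }
void  read_pic(pic_t *p) { (void)p; }

sem_t event_1_2;                       /* counts pictures finished by stage 1 */

/* Stage 1: read and apply trans1, then signal stage 2. */
void *stage1(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_PICS; i++) {
        read_pic(&in_pic[i]);
        int_pic_1[i] = trans1(in_pic[i]);
        sem_post(&event_1_2);          /* signal(event_1_2[i]) */
    }
    return NULL;
}

/* Stage 2: wait for each picture from stage 1, then apply trans2. */
void *stage2(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_PICS; i++) {
        sem_wait(&event_1_2);          /* wait(event_1_2[i]) */
        int_pic_2[i] = trans2(int_pic_1[i]);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    sem_init(&event_1_2, 0, 0);
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}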
36 Pipeline issues
- One stage takes more time than the others.
- Stages take a variable amount of time.
- Extra buffers provide some cushion against variability.