Patterns of parallelism

Provided by: xiny2
1
Patterns of parallelism
  • Materials customized from Prof. Sandhya
    Dwarkadas's lecture notes
  • http://www.cs.rochester.edu/users/faculty/sandhya/
    csc258/

2
Steps in parallelization
  • Decompose a job into tasks
  • Assign the tasks to processes
  • Orchestration: communication of data,
    synchronization among processes.
  • Decomposing the job is the most important part of
    the process.

3
Decomposing jobs
  • Two types of jobs
  • Same operation on a large set of data
  • Vector addition: A = B + C
  • for (i = 0; i < size; i++) A[i] = B[i] + C[i];
  • Decompose based on data domain
  • Apply a sequence of functions on the data
  • Photoshop
  • Final photo = red_eye_removal(brighter(
    increased_contrast(original photo)))
  • Decompose based on function.

4
How to decompose a job?
  • Domain decomposition and functional decomposition
  • Domain decomposition: the data associated with a
    problem is decomposed. Each parallel task then
    works on a portion of the data.
  • Functional decomposition: focus on the
    computation that is to be performed. The problem
    is decomposed according to the work that must be
    done. Each task then performs a portion of the
    overall work.

5
How to decompose a job?
  • Data parallelism and task parallelism
  • Data parallelism (domain decomposition): all
    processes do the same thing on different data
  • Regular and irregular problems (using linear or
    non-linear indexing).
  • Task parallelism (functional decomposition):
    different processes do different tasks.
  • Task queues
  • Pipelining

6
Domain decomposition
7
Domain decomposition methods
8
Functional decomposition
9
Functional decomposition example: signal processing
10
Data parallelism examples
  • Example 1: Matrix multiplication
  • Multiply two n-by-n matrices into a third n-by-n
    matrix.

11
Matrix multiply
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    c[i][j] = 0.0;
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];

C[i][j] = a[i][0]*b[0][j] + a[i][1]*b[1][j] + ... + a[i][N-1]*b[N-1][j]
12
Parallel matrix multiply
  • No loop-carried dependence in the i- and j-loops.
  • Loop-carried dependence in the k-loop.
  • All i- and j-iterations can be run in parallel.
  • If we have P processors, we can give n/P rows or
    columns to each processor (block or cyclic).
  • Or we can divide the matrix into P squares and
    give each processor a square.

13
SOR
  • SOR implements a mathematical model for many
    natural phenomena.
  • Example: heat dissipation in a metal sheet.
  • The model solves a partial differential equation
    with an iterative method.
  • Discretize the problem into a mesh of grid points
    and solve for the values at the mesh points.

14
(No Transcript)
15
Relaxation algorithm
  • For some number of iterations:
  • For each internal grid point:
  • Compute the average of its four neighbors.
  • Termination condition:
  • The values at the grid points change very little
    (the difference between the values in the new
    array and the old array is below a threshold).

16
SOR code
/* initialization */
for (i = 0; i <= n+1; i++) grid[i][0] = 0.0;
for (i = 0; i <= n+1; i++) grid[i][n+1] = 0.0;
for (j = 0; j <= n+1; j++) grid[0][j] = 1.0;
for (j = 0; j <= n+1; j++) grid[n+1][j] = 0.0;
for (i = 1; i <= n; i++)
  for (j = 1; j <= n; j++)
    grid[i][j] = 0.0;

17
SOR code
/* iteration */
error = 1000.0;
while (error > threshold) {
  for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++)
      temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                           grid[i][j-1] + grid[i][j+1]);
  error = 0.0;
  for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++) {
      if (fabs(temp[i][j] - grid[i][j]) > error)
        error = fabs(temp[i][j] - grid[i][j]);
      grid[i][j] = temp[i][j];
    }
}

18
Parallel SOR
  • No dependences between iterations of the first
    (i, j) loop nest.
  • No dependences between iterations of the second
    (i, j) loop nest besides the reduction on error.
  • Anti-dependence between the first and the second
    loop nest in the same step.
  • True dependence between the second loop nest and
    the first loop nest of the next step.

19
Parallel SOR
  • First loop nest can be parallelized.
  • Second loop nest can be parallelized (with
    reduction support).
  • We must make processors wait at the end of each
    loop nest.
  • Use either a barrier or fork-join.
  • If we have P processors, we can give n/P rows or
    columns to each processor.
  • Or we can divide the array into P squares, and
    give each processor a square to compute.

20
Molecular Dynamics (MD)
  • Simulation of a set of bodies under the influence
    of physical laws.
  • Atoms, molecules, celestial bodies.
  • Applications in many areas: weather models,
    material science, ocean models, galaxy formation.
  • MD has different forms but the same basic
    structure.

21
Molecular Dynamics code structure
for some number of timesteps {
  for all molecules i
    for all other molecules j
      force[i] += f(loc[i], loc[j]);
  for all molecules i
    loc[i] = g(loc[i], force[i]);
}

22
Molecular Dynamics code structure
  • To reduce the amount of computation, account for
    interactions only with nearby molecules (molecules
    beyond the cut-off distance have no effect).

23
Molecular Dynamics code structure
for some number of timesteps {
  for all molecules i
    for all nearby molecules j
      force[i] += f(loc[i], loc[j]);
  for all molecules i
    loc[i] = g(loc[i], force[i]);
}

24
Optimized Molecular Dynamics code
  • For each molecule i:
  • keep the number of nearby molecules in count[i]
  • keep an array of indices of nearby molecules,
    index[i][j] (0 <= j < count[i])

for some number of timesteps {
  for (i = 0; i < num_mol; i++)
    for (j = 0; j < count[i]; j++)
      force[i] += f(loc[i], loc[index[i][j]]);
  for (i = 0; i < num_mol; i++)
    loc[i] = g(loc[i], force[i]);
}

25
Molecular dynamics
  • No loop-carried dependence in the first i-loop.
  • Loop-carried dependence (reduction) in the j-loop.
  • No loop-carried dependence in the second i-loop.
  • True dependence between the first and second
    i-loops.

26
Molecular dynamics
  • First i-loop can be parallelized.
  • Second i-loop can be parallelized.
  • Must make processors wait between the loops:
  • barrier or fork-join.

27
Irregular vs. regular data parallel
  • In SOR and matrix multiply, all arrays are
    accessed through linear expressions of the loop
    indices, known at compile time (regular).
  • In MD, some arrays are accessed through
    non-linear expressions of the loop indices, some
    known only at runtime (irregular).

28
Irregular vs. regular data parallel
  • No real difference in terms of parallelization
    (based on dependences).
  • Will lead to fundamental differences in how the
    parallelism is expressed:
  • irregularity makes parallelism based on data
    distribution difficult,
  • but not parallelism based on iteration
    distribution.

29
Irregular vs. regular data parallel
  • Parallelization of the first loop
  • has a load-balancing issue:
  • some molecules have few neighbors, some many.
  • Needs more sophisticated loop-partitioning
    schemes.

30
Task parallelism
  • Each process performs a different task.
  • Two principal flavors:
  • pipelines
  • task queues
  • Program examples: PIPE (pipeline), TSP (task
    queue)

31
Pipeline
  • Often occurs in image-processing applications
    (Photoshop), where a number of images undergo a
    sequence of transformations.

for (i = 0; i < num_pics; i++) {
  read(in_pic[i]);
  int_pic_1[i] = trans1(in_pic[i]);
  int_pic_2[i] = trans2(int_pic_1[i]);
  int_pic_3[i] = trans3(int_pic_2[i]);
  out_pic[i] = trans4(int_pic_3[i]);
}
32
Sequential vs. pipelined execution
Sequential
pipeline
33
Realizing the pipeline
  • Process 1:

for (i = 0; i < num_pics; i++) {
  read(in_pic[i]);
  int_pic_1[i] = trans1(in_pic[i]);
  signal(event_1_2[i]);
}

34
Realizing the pipeline
  • Process 2:

for (i = 0; i < num_pics; i++) {
  wait(event_1_2[i]);
  int_pic_2[i] = trans2(int_pic_1[i]);
  signal(event_2_3[i]);
}

  • Process 3 is similar.

35
Realizing the pipeline
  • Process 4:

for (i = 0; i < num_pics; i++) {
  wait(event_3_4[i]);
  out_pic[i] = trans4(int_pic_3[i]);
}

36
Pipeline issues
  • One stage may take more time than the others.
  • Stages may take a variable amount of time.
  • Extra buffers provide some cushion against
    variability.