Title: Principles of Parallel Algorithm Design
1 Principles of Parallel Algorithm Design
- Prof. Dr. Cevdet Aykanat
- Bilkent University
- Department of Computer Engineering
2 Principles of Parallel Algorithm Design
- Identifying concurrent tasks
- Mapping tasks onto multiple processes
- Distributing input, output, intermediate data
- Managing access to shared data
- Synchronizing processors
3 Principles of Parallel Algorithm Design
- Identifying concurrent tasks
- Mapping tasks onto multiple processes
- Distributing input, output, intermediate data
- Managing access to shared data
- Synchronizing processors
- several choices for each step
- relatively few combinations lead to a good parallel algorithm
- different choices yield best performance on
- different parallel architectures
- different parallel programming paradigms
4 Decomposition, Tasks
- decomposition
- dividing a computation into smaller parts
- some or all parts can be executed concurrently
- atomic task
- user defined
- indivisible units of computation
- same size or different size
6 Task Dependence Graphs (TDG)
- directed acyclic graph
- nodes: atomic tasks
- directed edges: dependencies
- some tasks use data produced by other tasks
- TDG can be weighted
- node wgt: amount of computation
- edge wgt: amount of data
- multiple ways of expressing certain computations
- different ways of arranging computations
- lead to different TDGs
10 Granularity, Concurrency
- granularity: number (#) and size of tasks
- fine grain: large # of small tasks
- coarse grain: small # of large tasks
- degree of concurrency (DoC)
- # of tasks that can be executed simultaneously
- max DoC: max # of concurrent tasks at any given time
- tree TDGs: max DoC = # of leaves (usually)
- avg DoC: average DoC over the entire duration
12 Degree of Concurrency
- depends on granularity
- finer task granularity → larger DoC
- bound on fine granularity of a decomposition
- depends on shape of TDG
- shallow and wide TDG → larger DoC
- deep and thin TDG → smaller DoC
- critical path
- longest directed path between a start node and a finish node
- critical path length: sum of node wgts along the path
- avg DoC = total work / critical path length (see the sketch below)
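To make these definitions concrete, here is a minimal Python sketch that computes the critical path length and the average DoC of a weighted TDG; the four-task graph and its weights are invented for illustration.

```python
# Hypothetical weighted TDG: node wgt = amount of computation,
# edges point from a task to the tasks that consume its results.
from functools import lru_cache

work = {"a": 10, "b": 6, "c": 4, "d": 8}                # node weights
succ = {"a": ["c", "d"], "b": ["d"], "c": [], "d": []}  # directed edges

@lru_cache(maxsize=None)
def path_work(v):
    """Max total node weight on any directed path starting at v (DAG assumed)."""
    return work[v] + max((path_work(u) for u in succ[v]), default=0)

critical_path_length = max(path_work(v) for v in work)  # 18 (path a -> d)
avg_doc = sum(work.values()) / critical_path_length     # 28 / 18 ≈ 1.56
print(critical_path_length, avg_doc)
```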
14 Task Interaction Graph (TIG)
- tasks share input, output or intermediate data
- interactions among independent tasks of a TDG
- TIG: pattern of interactions among tasks
- node: task
- edge: connects tasks that interact with each other
- TIG can be weighted
- node wgt: amount of computation
- edge wgt: amount of interaction
16 Processes and Mapping
- process vs processor
- processes: logical computing agents that perform tasks
- mapping: assigning tasks to processes
- conflicting goals in a good mapping
- maximize concurrency
- map independent tasks to different processes
- minimize idle time / interaction overhead
- map tasks along the critical path to the same process
- map tasks with high interaction to the same process
- e.g., mapping all tasks to the same process minimizes interaction but eliminates concurrency
18 Decomposition Techniques
- recursive decomposition
- data decomposition
- exploratory decomposition
- speculative decomposition
19 Recursive Decomposition
- divide-and-conquer strategy → natural concurrency
- divide: problem into a set of independent subproblems
- conquer: recursively solve each subproblem
- combine: solns to subproblems into a soln of the problem
- if sequential algorithm is not based on DAC
- restructure computation as a DAC algorithm
- recursive decomposition to extract concurrency
- e.g., finding minimum of an array A of n numbers (sketch below)
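A minimal sketch of the min-of-an-array example; the two-way split and the process pool are illustrative choices. The two halves are independent subproblems, so they can be solved concurrently and combined with a single min.

```python
from concurrent.futures import ProcessPoolExecutor

def rec_min(a):
    """Divide-and-conquer minimum of a non-empty list."""
    if len(a) == 1:
        return a[0]
    mid = len(a) // 2
    # divide into independent subproblems, conquer, then combine
    return min(rec_min(a[:mid]), rec_min(a[mid:]))

def parallel_min(a):
    mid = len(a) // 2
    with ProcessPoolExecutor(max_workers=2) as pool:
        left = pool.submit(rec_min, a[:mid])        # concurrent subproblem 1
        right = pool.submit(rec_min, a[mid:])       # concurrent subproblem 2
        return min(left.result(), right.result())   # combine

if __name__ == "__main__":
    print(parallel_min([7, 3, 9, 1, 8, 4, 6, 2]))   # -> 1
```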
23 Data Decomposition
- partition/decompose computational data domain
- use this partition to induce task decomposition
- tasks: similar operations on different data parts
- partitioning output data
- each output can be computed independently as a fn of input
- example: block matrix multiplication (sketch below)
- data decomposition may not lead to a unique task decomposition
- another example: computing itemset frequencies
- input: transactions; output: itemset frequencies
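A minimal sketch of output-data partitioning for block matrix multiplication; the matrix size, block count, and operand values are illustrative assumptions. C is partitioned into 2x2 blocks, and each output block is an independent task computed from the corresponding blocks of A and B.

```python
import numpy as np

n, nb = 4, 2                      # matrix size, blocks per dimension
bs = n // nb                      # block size (assumes nb divides n)
A = np.arange(n * n, dtype=float).reshape(n, n)
B = np.ones((n, n))
C = np.zeros((n, n))

def blk(M, i, j):
    """View of block (i, j) of matrix M."""
    return M[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]

# One task per output block; the nb*nb tasks are independent and could
# run concurrently, since they write disjoint parts of C.
for i in range(nb):
    for j in range(nb):
        blk(C, i, j)[:] = sum(blk(A, i, k) @ blk(B, k, j) for k in range(nb))

assert np.allclose(C, A @ B)
```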
28 Data Decomposition
- partitioning input data
- may not be possible/desirable to partition output data
- e.g., finding min or sum of a set of numbers, sorting
- a task created for each part of the input data
- task: all computations that can be done using local data
- a combine step may be needed to combine results of tasks
- example: finding the sum of an array A of n numbers (sketch below)
- example: computing itemset frequencies
- partitioning both output and input data
- output data partitioning is feasible
- partitioning of input data offers additional concurrency
- example: computing itemset frequencies
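As a sketch of input-data partitioning, the array-sum example can be written as follows; the part count p = 4 and the process pool are illustrative choices. One task per input part computes a partial sum over its local data, and a combine step adds the partial results.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(part):
    # each task uses only its local part of the input
    return sum(part)

def parallel_sum(a, p=4):
    size = (len(a) + p - 1) // p                       # ceil(n / p)
    parts = [a[i * size:(i + 1) * size] for i in range(p)]
    with ProcessPoolExecutor(max_workers=p) as pool:
        partials = pool.map(partial_sum, parts)
    return sum(partials)                               # combine step

if __name__ == "__main__":
    print(parallel_sum(list(range(101))))              # -> 5050
```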
29 Data Decomposition
- partitioning intermediate data
- multistage computations
- partitioning input or output data of an intermediate stage
- may lead to higher concurrency
- some restructuring of the algorithm may be needed
- example: block matrix multiplication
- owner-computes rule
- each part performs all computations involving data it owns
- input partitioning: perform all computations that can be done using local data
- output partitioning: compute all data in the partition
33 Other Decomposition Techniques
- exploratory decomposition
- search of a configuration space for a solution
- partition the search space into smaller parts
- search each part concurrently
- total parallel work can be <, =, or > total serial work
- example: 15-puzzle problem
- speculative decomposition
- hybrid decompositions
- computation structured into multiple stages
- may apply different decompositions in different stages
- examples: finding min of an array and quicksort
- data decomposition then recursive decomposition
39 Characteristics of Tasks
- task generation: static vs dynamic
- static: all tasks are known a priori to execution of the algorithm
- data decomposition: matrix multiplication
- recursive decomposition: finding min of an array
- dynamic: actual tasks and TDG/TIG not available a priori
- rules/guidelines governing task generation may be known
- recursive decomposition: quicksort
- another example: ray tracing
- task sizes: uniform vs non-uniform
- complexity of mapping depends on this
- tasks in matrix multiplication: uniform
- tasks in quicksort: non-uniform
40 Characteristics of Tasks
- knowledge of task sizes
- can be used in mapping
- known: tasks in decompositions for matrix multiplication
- unknown: tasks in 15-puzzle problem
- do not know a priori how many moves will lead to a soln.
- size of data associated with tasks
- associated data must be available to the process
- size and location of the associated data
- consider data migration overhead in the mapping
41 Characteristics of Inter-Task Interactions
- static vs dynamic
- static: pattern and timing of interactions known a priori
- static interaction: decompositions for matrix multiplication
- message-passing paradigm (MPP)
- active involvement of both interacting tasks
- static interactions: easy to program
- dynamic interactions: harder to program
- tasks assigned additional synchronization and polling responsibilities
- shared-address-space paradigm (SASP) can handle both equally easily
- regular vs irregular (spatial structure)
- regular: structure that can be exploited for efficient implementation
- structured/curvilinear grids (implicit connectivity)
- image dithering (example)
- irregular: no such regular pattern exists
- unstructured grids (connectivity maintained explicitly)
- SpMxV (sparse matrix-vector multiplication)
- irregular and dynamic interactions harder to handle in MPP
43 Characteristics of Inter-Task Interactions
- read-only vs read-write
- read-only: tasks require read-only access to shared data
- example: decompositions for matrix multiplication
- read-write: tasks need to read and write shared data
- example: heuristic search for 15-puzzle problem
- one-way vs two-way
- 2-way: data/work needed by a task explicitly supplied by another
- usually involves predefined producer and consumer
- 1-way: only one of a pair of communicating tasks initiates and completes interaction
- read-only → 1-way; read-write → either 1-way or 2-way
- SASP can handle both interactions equally easily
- MPP cannot handle 1-way interactions directly
- source of data should explicitly send it to the recipient
- static 1-way: easily converted to 2-way via program restructuring
- dynamic 1-way: nontrivial program restructuring to convert to 2-way
- polling: task checks for pending requests from others at regular intervals
44 Mapping Techniques
- minimize overheads of parallel task execution
- overhead: inter-process interaction
- overhead: process idle time (uneven load distribution)
- load balancing
- balanced aggregate load: necessary but not sufficient
- computations and interactions well balanced at each stage
- example: 12-task decomposition (tasks 9-12 depend on tasks 1-8)
46 Static vs Dynamic Mapping
- static: distribute tasks prior to execution
- static task generation: either static or dynamic mapping
- good mapping: knowledge of task sizes, data sizes, TIG
- non-trivial problem (usually NP-hard)
- task sizes known but non-uniform
- even if no TDG/TIG → number partitioning problem
- dynamic: distribute workload during execution
- dynamic task generation: dynamic mapping
- task sizes unknown: dynamic mapping more effective
- large data size: dynamic mapping costly (in MPP)
47 Static-Mapping Schemes
- mapping based on data partitioning
- data partitioning induces a decomposition
- partitioning selected with final mapping in mind
- i.e., p-way data decomposition
- dense arrays
- sparse data structures, graphs (FE meshes)
- mapping based on task partitioning
- task dependence graphs, task interaction graphs
- hierarchical partitioning
- hybrid decomposition and mapping techniques
48 Array Distribution Schemes
- block distributions: spatial locality of interaction
- each process receives a contiguous block of entries
- 1D: each part contains a block of consecutive rows
- i.e., kth part contains rows kn/p ... (k+1)n/p - 1
- 2D: checkerboard partitioning
- higher-dimensional distributions
- higher degree of concurrency
- less inter-process interaction
- example: matrix multiplication (sketch below)
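A small sketch of the 1D block-distribution formulas; n = 12 rows and p = 4 processes are illustrative, with p assumed to divide n. Part k owns rows kn/p through (k+1)n/p - 1, and the owner of row r is r div (n/p).

```python
n, p = 12, 4                       # rows, processes (assumes p divides n)

def block_rows(k):
    """Rows owned by part k under a 1D block distribution."""
    return range(k * n // p, (k + 1) * n // p)

def block_owner(r):
    """Process that owns row r."""
    return r // (n // p)

for k in range(p):
    print(k, list(block_rows(k)))  # part 0: rows 0-2, part 1: rows 3-5, ...
assert block_owner(7) == 2
```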
56 Array Distribution Schemes
- cyclic distribution
- amount of work differs for different matrix entries
- examples: ray casting, dense LU factorization
- block distribution leads to load imbalance
- all processes get tasks from all parts of the matrix
- good load balance, but complete loss of locality
- block-cyclic distribution
- partition array into more than p blocks
- map blocks to processes in a round-robin (scattered) manner (sketch below)
- randomized block distribution
- when the distribution of work has some special pattern
- adaptive 2D array partitionings
- rectilinear, jagged, orthogonal bisection
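The three row distributions can be contrasted through their owner functions; this sketch assumes n divisible by p and by p*bs, with sizes chosen for illustration.

```python
n, p, bs = 16, 4, 2    # rows, processes, block size for block-cyclic

def owner_block(r):          # contiguous blocks of n/p rows each
    return r // (n // p)

def owner_cyclic(r):         # rows dealt out one at a time, round robin
    return r % p

def owner_block_cyclic(r):   # blocks of bs rows dealt out round robin
    return (r // bs) % p

for r in range(n):
    print(r, owner_block(r), owner_cyclic(r), owner_block_cyclic(r))
```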
68 Dynamic Mapping Schemes
- centralized schemes
- all tasks maintained in a common pool or by a master process
- idle processes take task(s) from central pool or master process
- easier to implement
- limited scalability: central pool/process becomes a bottleneck
- chunk scheduling: idle processes get a group of tasks (sketch below)
- danger of load imbalance due to large chunk sizes
- decrease chunk size as program progresses
- e.g., sorting entries in each row of a matrix
- non-uniform tasks, unknown task sizes
- e.g., image-space parallel ray casting
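A minimal shared-address-space sketch of a centralized pool with chunk scheduling; the dummy task bodies, chunk size, and thread count are placeholders. Idle workers repeatedly grab a chunk of tasks from a shared queue until it is empty.

```python
import queue
import threading

tasks = queue.Queue()
for t in range(100):
    tasks.put(t)                       # 100 dummy tasks

CHUNK = 5                              # could shrink as the pool drains

def worker(wid):
    while True:
        chunk = []
        try:
            for _ in range(CHUNK):     # grab up to CHUNK tasks at once
                chunk.append(tasks.get_nowait())
        except queue.Empty:
            pass
        if not chunk:
            return                     # pool empty: worker terminates
        for t in chunk:
            pass                       # execute task t here

threads = [threading.Thread(target=worker, args=(w,)) for w in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```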
69 Dynamic Mapping Schemes
- distributed schemes
- tasks are distributed among processes
- more scalable (no bottleneck)
- critical parameters of distributed load balancing
- how are sending and receiving processes paired?
- who initiates the work transfer: sender or receiver?
- how much work is transferred in each exchange?
- when is the work transfer performed?
- suitability to parallel architectures
- both can be implemented in both SAS and MP paradigms
- dynamic schemes require movement of tasks
- computational granularity of tasks should be high in MP systems
70 Methods for Reducing Interaction Overheads
- factors
- volume and frequency of interaction
- spatial and temporal pattern of interactions
- maximizing data locality
- minimize volume of data exchange
- minimize overall volume of shared data
- similar to maximizing temporal data locality
- minimize frequency of interaction
- high startup cost associated with each interaction
- restructure algorithm: access shared data in large pieces
- similar to increasing spatial locality of data access
- minimizing contention and hot spots
- multiple tasks try to access the same resource concurrently
- multiple simultaneous accesses to the same memory block/bank
- multiple processes sending messages to the same process at the same time
71 Methods for Reducing Interaction Overheads
- minimizing contention and hot spots
- multiple tasks try to access the same resource concurrently
- multiple simultaneous accesses to the same memory block/bank
- multiple processes sending messages to the same process simultaneously
- e.g., matrix multiplication based on 2D partitioning
- overlapping computations with interactions
- early initiation of an interaction
- support from programming paradigm, OS, hardware
- MP: non-blocking message-passing primitives (sketch below)
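A sketch of overlapping computation with interaction via non-blocking primitives, using mpi4py's Isend/Irecv; the buffer size and ranks are illustrative, and the script assumes an MPI launcher with two processes.

```python
# run as: mpirun -n 2 python overlap.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1000)

if rank == 0:
    buf[:] = 1.0
    req = comm.Isend(buf, dest=1)    # initiate the interaction early
    # ... useful computation that does not touch buf goes here ...
    req.Wait()                       # complete the interaction later
elif rank == 1:
    req = comm.Irecv(buf, source=0)
    # ... computation that does not need buf goes here ...
    req.Wait()
```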
72 Methods for Reducing Interaction Overheads
- replicating data or computation
- replicating frequently accessed read-only shared data
- MP paradigm benefits more from data replication
- replicated computation for shared intermediate results
- using optimized collective interaction operations
- usually use available implementations (e.g., by MPI)
- sometimes it may be better to write your own procedure
- overlapping interactions with other interactions
- example: one-to-all broadcast (sketch below)
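A tiny sketch of using a library-optimized collective for one-to-all broadcast, again with mpi4py and an illustrative 8-element buffer.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    data = np.arange(8, dtype='i')    # root supplies the data
else:
    data = np.empty(8, dtype='i')     # others receive into a buffer
comm.Bcast(data, root=0)              # library-optimized, typically tree-based
```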
74 Parallel Algorithm Models
- data-parallel model
- data parallelism: identical operations applied concurrently on different data items
- task graph model
- task parallelism: independent tasks in a TDG
- quicksort, sparse matrix factorization
- work-pool or task-pool model
- dynamic mapping of tasks onto processes
- mapping may be centralized or distributed
75 Parallel Algorithm Models
- master-slave or manager-worker model
- master process generates work and allocates it to worker processes
- pipeline or producer-consumer model
- stream parallelism: execution of different programs on a data stream
- each process in the pipeline is
- a consumer of the sequence of data items from the preceding process
- a producer of data for the process following it in the pipeline
- pipeline need not be a linear chain (it can be a DAG); see the sketch below
- hybrid models
- multiple models applied hierarchically
- multiple models applied sequentially to different stages
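A minimal sketch of the pipeline model with two stages connected by queues; the stage bodies and the sentinel protocol are illustrative. Each stage consumes items from the preceding stage and produces items for the next.

```python
import queue
import threading

q01, q12 = queue.Queue(), queue.Queue()
DONE = object()                        # end-of-stream sentinel

def stage1():
    while (item := q01.get()) is not DONE:
        q12.put(item * 2)              # consume, transform, produce
    q12.put(DONE)                      # propagate end of stream

def stage2():
    while (item := q12.get()) is not DONE:
        print(item)                    # final consumer

t1 = threading.Thread(target=stage1)
t2 = threading.Thread(target=stage2)
t1.start(); t2.start()
for x in range(5):
    q01.put(x)                         # producer feeds the pipeline
q01.put(DONE)
t1.join(); t2.join()
```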