Title: Principles of Parallel Algorithm Design
1 Principles of Parallel Algorithm Design
- Prof. Dr. Cevdet Aykanat
- Bilkent University
- Department of Computer Engineering
2 Principles of Parallel Algorithm Design
- Identifying concurrent tasks
- Mapping tasks onto multiple processes
- Distributing input, output, intermediate data
- Managing access to shared data
- Synchronizing processors
3 Principles of Parallel Algorithm Design
- Identifying concurrent tasks
- Mapping tasks onto multiple processes
- Distributing input, output, intermediate data
- Managing access to shared data
- Synchronizing processors
- several choices for each step
- relatively few combinations lead to a good parallel algorithm
- different choices yield best performance on
- different parallel architectures
- different parallel programming paradigms
4 Decomposition, Tasks
- decomposition
- dividing a computation into smaller parts
- some or all parts can be executed concurrently
- atomic task
- user defined
- indivisible units of computation
- same size or different size
6 Task Dependence Graphs (TDG)
- directed acyclic graph
- nodes: atomic tasks
- directed edges: dependencies
- some tasks use data produced by other tasks
- TDG can be weighted
- node wgt: amount of computation
- edge wgt: amount of data
- multiple ways of expressing certain computations
- different ways of arranging computations
- lead to different TDGs
10 Granularity, Concurrency
- granularity: number (#) and size of tasks
- fine grain: large # of small tasks
- coarse grain: small # of large tasks
- degree of concurrency (DoC)
- # of tasks that can be executed simultaneously
- max DoC: max # of concurrent tasks at any given time
- tree TDGs: max DoC = # of leaves (usually)
- avg DoC: average DoC over the entire duration
12 Degree of Concurrency
- depends on granularity
- finer task granularity → larger DoC
- bound on fine granularity of a decomposition
- depends on shape of TDG
- shallow and wide TDG → larger DoC
- deep and thin TDG → smaller DoC
- critical path
- longest directed path between a start node and a finish node
- critical path length: sum of node wgts along the path
- avg DoC = total work / critical path length (see the sketch below)
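To make these definitions concrete, here is a minimal Python sketch that computes the critical path length and the average DoC of a weighted TDG; the four-task graph and its weights are invented for illustration.

```python
# Hypothetical weighted TDG: node wgt = amount of computation,
# edges point from a task to the tasks that consume its results.
from functools import lru_cache

work = {"a": 10, "b": 6, "c": 4, "d": 8}                # node weights
succ = {"a": ["c", "d"], "b": ["d"], "c": [], "d": []}  # directed edges

@lru_cache(maxsize=None)
def path_work(v):
    """Max total node weight on any directed path starting at v (DAG assumed)."""
    return work[v] + max((path_work(u) for u in succ[v]), default=0)

critical_path_length = max(path_work(v) for v in work)  # 18 (path a -> d)
avg_doc = sum(work.values()) / critical_path_length     # 28 / 18 ≈ 1.56
print(critical_path_length, avg_doc)
```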
14 Task Interaction Graph (TIG)
- tasks share input, output or intermediate data
- interactions among independent tasks of a TDG
- TIG: pattern of interactions among tasks
- node: task
- edge: connects tasks that interact with each other
- TIG can be weighted
- node wgt: amount of computation
- edge wgt: amount of interaction
16 Processes and Mapping
- process vs processor
- processes: logical computing agents that perform tasks
- mapping: assigning tasks to processes
- conflicting goals in a good mapping
- maximize concurrency
- map independent tasks to different processes
- minimize idle time / interaction overhead
- map tasks along the critical path to the same process
- map tasks with high interaction to the same process
- e.g., mapping all tasks to the same process minimizes interaction but eliminates concurrency
18 Decomposition Techniques
- recursive decomposition
- data decomposition
- exploratory decomposition
- speculative decomposition
19 Recursive Decomposition
- divide-and-conquer strategy → natural concurrency
- divide: problem into a set of independent subproblems
- conquer: recursively solve each subproblem
- combine: solns to subproblems into a soln of the problem
- if sequential algorithm is not based on DAC
- restructure computation as a DAC algorithm
- recursive decomposition to extract concurrency
- e.g., finding minimum of an array A of n numbers (sketch below)
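A minimal sketch of the min-of-an-array example; the two-way split and the process pool are illustrative choices. The two halves are independent subproblems, so they can be solved concurrently and combined with a single min.

```python
from concurrent.futures import ProcessPoolExecutor

def rec_min(a):
    """Divide-and-conquer minimum of a non-empty list."""
    if len(a) == 1:
        return a[0]
    mid = len(a) // 2
    # divide into independent subproblems, conquer, then combine
    return min(rec_min(a[:mid]), rec_min(a[mid:]))

def parallel_min(a):
    mid = len(a) // 2
    with ProcessPoolExecutor(max_workers=2) as pool:
        left = pool.submit(rec_min, a[:mid])        # concurrent subproblem 1
        right = pool.submit(rec_min, a[mid:])       # concurrent subproblem 2
        return min(left.result(), right.result())   # combine

if __name__ == "__main__":
    print(parallel_min([7, 3, 9, 1, 8, 4, 6, 2]))   # -> 1
```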
23 Data Decomposition
- partition/decompose computational data domain
- use this partition to induce task decomposition
- tasks: similar operations on different data parts
- partitioning output data
- each output can be computed independently as a fn of input
- example: block matrix multiplication (sketch below)
- data decomposition may not lead to a unique task decomposition
- another example: computing itemset frequencies
- input: transactions; output: itemset frequencies
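A minimal sketch of output-data partitioning for block matrix multiplication; the matrix size, block count, and operand values are illustrative assumptions. C is partitioned into 2x2 blocks, and each output block is an independent task computed from the corresponding blocks of A and B.

```python
import numpy as np

n, nb = 4, 2                      # matrix size, blocks per dimension
bs = n // nb                      # block size (assumes nb divides n)
A = np.arange(n * n, dtype=float).reshape(n, n)
B = np.ones((n, n))
C = np.zeros((n, n))

def blk(M, i, j):
    """View of block (i, j) of matrix M."""
    return M[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]

# One task per output block; the nb*nb tasks are independent and could
# run concurrently, since they write disjoint parts of C.
for i in range(nb):
    for j in range(nb):
        blk(C, i, j)[:] = sum(blk(A, i, k) @ blk(B, k, j) for k in range(nb))

assert np.allclose(C, A @ B)
```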
28 Data Decomposition
- partitioning input data
- may not be possible/desirable to partition output data
- e.g., finding min or sum of a set of numbers, sorting
- a task created for each part of the input data
- task: all computations that can be done using local data
- a combine step may be needed to combine results of tasks
- example: finding the sum of an array A of n numbers (sketch below)
- example: computing itemset frequencies
- partitioning both output and input data
- output data partitioning is feasible
- partitioning of input data offers additional concurrency
- example: computing itemset frequencies
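As a sketch of input-data partitioning, the array-sum example can be written as follows; the part count p = 4 and the process pool are illustrative choices. One task per input part computes a partial sum over its local data, and a combine step adds the partial results.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(part):
    # each task uses only its local part of the input
    return sum(part)

def parallel_sum(a, p=4):
    size = (len(a) + p - 1) // p                       # ceil(n / p)
    parts = [a[i * size:(i + 1) * size] for i in range(p)]
    with ProcessPoolExecutor(max_workers=p) as pool:
        partials = pool.map(partial_sum, parts)
    return sum(partials)                               # combine step

if __name__ == "__main__":
    print(parallel_sum(list(range(101))))              # -> 5050
```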
29 Data Decomposition
- partitioning intermediate data
- multistage computations
- partitioning input or output data of an intermediate stage
- may lead to higher concurrency
- some restructuring of the algorithm may be needed
- example: block matrix multiplication
- owner-computes rule
- each part performs all computations involving data it owns
- input partitioning: perform all computations that can be done using local data
- output partitioning: compute all data in the partition
33 Other Decomposition Techniques
- exploratory decomposition
- search of a configuration space for a solution
- partition the search space into smaller parts
- search each part concurrently
- total parallel work can be <, =, or > total serial work
- example: 15-puzzle problem
- speculative decomposition
- hybrid decompositions
- computation structured into multiple stages
- may apply different decompositions in different stages
- examples: finding min of an array and quicksort
- data decomposition then recursive decomposition
39 Characteristics of Tasks
- task generation: static vs dynamic
- static: all tasks are known a priori to execution of the algorithm
- data decomposition: matrix multiplication
- recursive decomposition: finding min of an array
- dynamic: actual tasks and TDG/TIG not available a priori
- rules/guidelines governing task generation may be known
- recursive decomposition: quicksort
- another example: ray tracing
- task sizes: uniform vs non-uniform
- complexity of mapping depends on this
- tasks in matrix multiplication: uniform
- tasks in quicksort: non-uniform
40 Characteristics of Tasks
- knowledge of task sizes
- can be used in mapping
- known: tasks in decompositions for matrix multiplication
- unknown: tasks in 15-puzzle problem
- do not know a priori how many moves will lead to a soln.
- size of data associated with tasks
- associated data must be available to the process
- size and location of the associated data
- consider data migration overhead in the mapping
41 Characteristics of Inter-Task Interactions
- static vs dynamic
- static: pattern and timing of interactions known a priori
- static interaction: decompositions for matrix multiplication
- message-passing paradigm (MPP)
- active involvement of both interacting tasks
- static interactions: easy to program
- dynamic interactions: harder to program
- tasks assigned additional synchronization and polling responsibilities
- shared-address-space paradigm (SASP) can handle both equally easily
- regular vs irregular (spatial structure)
- regular: structure that can be exploited for efficient implementation
- structured/curvilinear grids (implicit connectivity)
- image dithering (example)
- irregular: no such regular pattern exists
- unstructured grids (connectivity maintained explicitly)
- SpMxV (sparse matrix-vector multiplication)
- irregular and dynamic interactions harder to handle in MPP
43 Characteristics of Inter-Task Interactions
- read-only vs read-write
- read-only: tasks require read-only access to shared data
- example: decompositions for matrix multiplication
- read-write: tasks need to read and write shared data
- example: heuristic search for 15-puzzle problem
- one-way vs two-way
- 2-way: data/work needed by a task explicitly supplied by another
- usually involves predefined producer and consumer
- 1-way: only one of a pair of communicating tasks initiates and completes interaction
- read-only → 1-way; read-write → either 1-way or 2-way
- SASP can handle both interactions equally easily
- MPP cannot handle 1-way interactions directly
- source of data should explicitly send it to the recipient
- static 1-way: easily converted to 2-way via program restructuring
- dynamic 1-way: nontrivial program restructuring to convert to 2-way
- polling: task checks for pending requests from others at regular intervals
44 Mapping Techniques
- minimize overheads of parallel task execution
- overhead: inter-process interaction
- overhead: process idle time (uneven load distribution)
- load balancing
- balanced aggregate load: necessary but not sufficient
- computations and interactions well balanced at each stage
- example: 12-task decomposition (tasks 9-12 depend on tasks 1-8)
46 Static vs Dynamic Mapping
- static: distribute tasks prior to execution
- static task generation: either static or dynamic mapping
- good mapping: knowledge of task sizes, data sizes, TIG
- non-trivial problem (usually NP-hard)
- task sizes known but non-uniform
- even if no TDG/TIG → number partitioning problem
- dynamic: distribute workload during execution
- dynamic task generation: dynamic mapping
- task sizes unknown: dynamic mapping more effective
- large data size: dynamic mapping costly (in MPP)
47 Static-Mapping Schemes
- mapping based on data partitioning
- data partitioning induces a decomposition
- partitioning selected with final mapping in mind
- i.e., p-way data decomposition
- dense arrays
- sparse data structures, graphs (FE meshes)
- mapping based on task partitioning
- task dependence graphs, task interaction graphs
- hierarchical partitioning
- hybrid decomposition and mapping techniques
48 Array Distribution Schemes
- block distributions: spatial locality of interaction
- each process receives a contiguous block of entries
- 1D: each part contains a block of consecutive rows
- i.e., kth part contains rows kn/p ... (k+1)n/p - 1
- 2D: checkerboard partitioning
- higher-dimensional distributions
- higher degree of concurrency
- less inter-process interaction
- example: matrix multiplication (sketch below)
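A small sketch of the 1D block-distribution formulas; n = 12 rows and p = 4 processes are illustrative, with p assumed to divide n. Part k owns rows kn/p through (k+1)n/p - 1, and the owner of row r is r div (n/p).

```python
n, p = 12, 4                       # rows, processes (assumes p divides n)

def block_rows(k):
    """Rows owned by part k under a 1D block distribution."""
    return range(k * n // p, (k + 1) * n // p)

def block_owner(r):
    """Process that owns row r."""
    return r // (n // p)

for k in range(p):
    print(k, list(block_rows(k)))  # part 0: rows 0-2, part 1: rows 3-5, ...
assert block_owner(7) == 2
```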
56 Array Distribution Schemes
- cyclic distribution
- amount of work differs for different matrix entries
- examples: ray casting, dense LU factorization
- block distribution leads to load imbalance
- all processes get tasks from all parts of the matrix
- good load balance, but complete loss of locality
- block-cyclic distribution
- partition array into more than p blocks
- map blocks to processes in a round-robin (scattered) manner (sketch below)
- randomized block distribution
- when the distribution of work has some special pattern
- adaptive 2D array partitionings
- rectilinear, jagged, orthogonal bisection
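The three row distributions can be contrasted through their owner functions; this sketch assumes n divisible by p and by p*bs, with sizes chosen for illustration.

```python
n, p, bs = 16, 4, 2    # rows, processes, block size for block-cyclic

def owner_block(r):          # contiguous blocks of n/p rows each
    return r // (n // p)

def owner_cyclic(r):         # rows dealt out one at a time, round robin
    return r % p

def owner_block_cyclic(r):   # blocks of bs rows dealt out round robin
    return (r // bs) % p

for r in range(n):
    print(r, owner_block(r), owner_cyclic(r), owner_block_cyclic(r))
```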
68 Dynamic Mapping Schemes
- centralized schemes
- all tasks maintained in a common pool or by a master process
- idle processes take task(s) from central pool or master process
- easier to implement
- limited scalability: central pool/process becomes a bottleneck
- chunk scheduling: idle processes get a group of tasks (sketch below)
- danger of load imbalance due to large chunk sizes
- decrease chunk size as program progresses
- e.g., sorting entries in each row of a matrix
- non-uniform tasks, unknown task sizes
- e.g., image-space parallel ray casting
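A minimal shared-address-space sketch of a centralized pool with chunk scheduling; the dummy task bodies, chunk size, and thread count are placeholders. Idle workers repeatedly grab a chunk of tasks from a shared queue until it is empty.

```python
import queue
import threading

tasks = queue.Queue()
for t in range(100):
    tasks.put(t)                       # 100 dummy tasks

CHUNK = 5                              # could shrink as the pool drains

def worker(wid):
    while True:
        chunk = []
        try:
            for _ in range(CHUNK):     # grab up to CHUNK tasks at once
                chunk.append(tasks.get_nowait())
        except queue.Empty:
            pass
        if not chunk:
            return                     # pool empty: worker terminates
        for t in chunk:
            pass                       # execute task t here

threads = [threading.Thread(target=worker, args=(w,)) for w in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```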
69 Dynamic Mapping Schemes
- distributed schemes
- tasks are distributed among processes
- more scalable (no bottleneck)
- critical parameters of distributed load balancing
- how are sending and receiving processes paired?
- who initiates the work transfer: sender or receiver?
- how much work is transferred in each exchange?
- when is the work transfer performed?
- suitability to parallel architectures
- both can be implemented in both SAS and MP paradigms
- dynamic schemes require movement of tasks
- computational granularity of tasks should be high in MP systems
70 Methods for Reducing Interaction Overheads
- factors
- volume and frequency of interaction
- spatial and temporal pattern of interactions
- maximizing data locality
- minimize volume of data exchange
- minimize overall volume of shared data
- similar to maximizing temporal data locality
- minimize frequency of interaction
- high startup cost associated with each interaction
- restructure algorithm: access shared data in large pieces
- similar to increasing spatial locality of data access
- minimizing contention and hot spots
- multiple tasks try to access the same resource concurrently
- multiple simultaneous accesses to the same memory block/bank
- multiple processes sending messages to the same process at the same time
71 Methods for Reducing Interaction Overheads
- minimizing contention and hot spots
- multiple tasks try to access the same resource concurrently
- multiple simultaneous accesses to the same memory block/bank
- multiple processes sending messages to the same process simultaneously
- e.g., matrix multiplication based on 2D partitioning
- overlapping computations with interactions
- early initiation of an interaction
- support from programming paradigm, OS, hardware
- MP: non-blocking message-passing primitives (sketch below)
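A sketch of overlapping computation with interaction via non-blocking primitives, using mpi4py's Isend/Irecv; the buffer size and ranks are illustrative, and the script assumes an MPI launcher with two processes.

```python
# run as: mpirun -n 2 python overlap.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1000)

if rank == 0:
    buf[:] = 1.0
    req = comm.Isend(buf, dest=1)    # initiate the interaction early
    # ... useful computation that does not touch buf goes here ...
    req.Wait()                       # complete the interaction later
elif rank == 1:
    req = comm.Irecv(buf, source=0)
    # ... computation that does not need buf goes here ...
    req.Wait()
```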
72 Methods for Reducing Interaction Overheads
- replicating data or computation
- replicating frequently accessed read-only shared data
- MP paradigm benefits more from data replication
- replicated computation for shared intermediate results
- using optimized collective interaction operations
- usually use available implementations (e.g., by MPI)
- sometimes it may be better to write your own procedure
- overlapping interactions with other interactions
- example: one-to-all broadcast (sketch below)
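A tiny sketch of using a library-optimized collective for one-to-all broadcast, again with mpi4py and an illustrative 8-element buffer.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    data = np.arange(8, dtype='i')    # root supplies the data
else:
    data = np.empty(8, dtype='i')     # others receive into a buffer
comm.Bcast(data, root=0)              # library-optimized, typically tree-based
```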
74 Parallel Algorithm Models
- data-parallel model
- data parallelism: identical operations applied concurrently on different data items
- task graph model
- task parallelism: independent tasks in a TDG
- quicksort, sparse matrix factorization
- work-pool or task-pool model
- dynamic mapping of tasks onto processes
- mapping may be centralized or distributed
75 Parallel Algorithm Models
- master-slave or manager-worker model
- master process generates work and allocates it to worker processes
- pipeline or producer-consumer model
- stream parallelism: execution of different programs on a data stream
- each process in the pipeline is
- a consumer of the sequence of data items from the preceding process
- a producer of data for the process following it in the pipeline
- pipeline need not be a linear chain (it can be a DAG); see the sketch below
- hybrid models
- multiple models applied hierarchically
- multiple models applied sequentially to different stages
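A minimal sketch of the pipeline model with two stages connected by queues; the stage bodies and the sentinel protocol are illustrative. Each stage consumes items from the preceding stage and produces items for the next.

```python
import queue
import threading

q01, q12 = queue.Queue(), queue.Queue()
DONE = object()                        # end-of-stream sentinel

def stage1():
    while (item := q01.get()) is not DONE:
        q12.put(item * 2)              # consume, transform, produce
    q12.put(DONE)                      # propagate end of stream

def stage2():
    while (item := q12.get()) is not DONE:
        print(item)                    # final consumer

t1 = threading.Thread(target=stage1)
t2 = threading.Thread(target=stage2)
t1.start(); t2.start()
for x in range(5):
    q01.put(x)                         # producer feeds the pipeline
q01.put(DONE)
t1.join(); t2.join()
```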