Title: Partitioning, Divide & Conquer
1 Partitioning, Divide & Conquer
- Useful strategies in designing efficient sequential algorithms
- merge sort, quick sort, ...
- Even more useful in a parallel setting
- partition the workload and perform it concurrently
- divide the problem and the set of processes and have each subset solve the corresponding subproblem
2 General Skeleton
- Partition the problem
- might be as simple as dividing an array into non-overlapping blocks
- or significant processing might be needed and new data structures created
- 2-way, k-way, p-way, static, dynamic (data dependent)
- Solve the subproblems
- from simple processing to complex recursive schemes
- Combine results
- from simple concatenation to complex merging
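The three steps of the skeleton can be sketched on a toy problem. This is a minimal illustration, assuming the problem is summing a list; the names `partition`, `solve`, `combine` and `parallel_sum` are hypothetical and not from the slides. Python threads are used only to show the structure, not to obtain real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, parts):
    """Partition step: split data into `parts` contiguous, non-overlapping blocks."""
    size, extra = divmod(len(data), parts)
    blocks, start = [], 0
    for i in range(parts):
        # the first `extra` blocks get one additional element
        end = start + size + (1 if i < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks

def solve(block):
    """Solve step: here each subproblem is just the sum of one block."""
    return sum(block)

def combine(partials):
    """Combine step: here a simple addition of the partial results."""
    return sum(partials)

def parallel_sum(data, parts=4):
    blocks = partition(data, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(solve, blocks))
    return combine(partials)
```

For a recursive scheme such as merge sort, `solve` would call the same skeleton again and `combine` would become a merge.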
3 Data Partitioning
- Partitioning according to
- input data
- output data
- intermediate data (bucket sort)
- Associate tasks with the data
- do as much as you can with the data before further communication
- Partition in a way that minimizes the cost of communication
- maximize data locality
- minimize volume of data exchange
- minimize frequency of interactions
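Partitioning according to intermediate data can be illustrated with the bucket sort mentioned above. The sketch below assumes values lie in a known range [lo, hi) and uses equal-width buckets; the function names and the range parameters are hypothetical. Each bucket could be handed to a different process and sorted without further communication.

```python
def bucket_partition(values, p, lo, hi):
    """Assign each value in [lo, hi) to one of p equal-width buckets."""
    buckets = [[] for _ in range(p)]
    width = (hi - lo) / p
    for v in values:
        # clamp guards against floating-point rounding at the top edge
        idx = min(int((v - lo) / width), p - 1)
        buckets[idx].append(v)
    return buckets

def bucket_sort(values, p, lo, hi):
    """Sort each bucket independently, then concatenate in bucket order."""
    out = []
    for bucket in bucket_partition(values, p, lo, hi):
        out.extend(sorted(bucket))  # this step is what one process would do locally
    return out
```

The combine step is a plain concatenation because every value in bucket i is smaller than every value in bucket i+1.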
4 Minimizing Communication Overhead
- Maximizing data locality
- Minimizing volume of data exchange
- Minimizing frequency of interaction
- Minimizing contention and hot spots
- Overlapping computation and communication
- Replicate data or computation
- Use optimized collective communication routines
- Overlap communication with other communication
5 Maximizing data locality I
Minimizing volume of data exchange: matrix multiplication example
Communication: n^2 + n^2/p per processor (1D row-block partitioning)
Communication: 2n^2/sqrt(p) per processor (2D block partitioning)
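The two communication figures can be compared numerically. This sketch assumes n x n matrices, that the first figure is n^2 + n^2/p (n/p rows of A plus all of B, 1D row-block partitioning) and that the second is 2n^2/sqrt(p) (n/sqrt(p) rows of A plus n/sqrt(p) columns of B, 2D block partitioning). The function names are hypothetical.

```python
import math

def comm_1d(n, p):
    """Input elements needed per processor with 1D row-block partitioning:
    n/p rows of A (n*n/p elements) plus all of B (n*n elements)."""
    return n * n + n * n / p

def comm_2d(n, p):
    """Input elements needed per processor with 2D block partitioning:
    n/sqrt(p) rows of A plus n/sqrt(p) columns of B, each of length n."""
    return 2 * n * n / math.sqrt(p)
```

For n = 1000 and p = 16 these give 1062500 and 500000 elements. The 1D volume stays above n^2 however large p gets, while the 2D volume keeps shrinking as p grows.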
6 Maximizing data locality II
Minimizing volume of data exchange (surface-to-volume ratio): 2D simulation example on an n x n grid
Communication volume: 2n per processor (strip partitioning)
Communication volume: 4n/sqrt(p) per processor (block partitioning)
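The surface-to-volume argument can be checked with a few lines of arithmetic. The sketch assumes an n x n grid of points, an interior processor with neighbours on all sides, and one boundary layer exchanged per step; the function names are hypothetical. If n instead denotes the total number of grid points, the corresponding volumes are 2 sqrt(n) for strips and 4 sqrt(n/p) for blocks.

```python
import math

def halo_strips(n, p):
    """Strip partitioning: two boundary rows of n points each."""
    return 2 * n

def halo_blocks(n, p):
    """Block partitioning into sqrt(p) x sqrt(p) blocks:
    four sides of n/sqrt(p) points each."""
    return 4 * n / math.sqrt(p)

def surface_to_volume(halo, n, p):
    """Boundary points exchanged per grid point owned (n*n/p points per processor)."""
    return halo / (n * n / p)
```

For n = 1024 and p = 16, strips exchange 2048 points and blocks 1024, so blocks halve the surface for the same volume of local work.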
7 Maximizing data locality III
- Minimizing frequency of interaction
- communication start-up time is much greater than per-byte time
- Sparse matrix multiplication example
- examine your vectors, figure out which entries are non-zero
- request all data you need in one block (or as much as fits into your memory)
- process locally with received data
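Why one large request beats many small ones follows from a simple linear cost model. The sketch assumes each message costs a fixed start-up time plus a per-byte time; the function name and the numeric values (10 microseconds start-up, 1 nanosecond per byte) are hypothetical, chosen only to make the start-up term dominate.

```python
def transfer_time(num_messages, bytes_per_message, t_startup, t_byte):
    """Linear cost model: every message pays start-up plus per-byte cost."""
    return num_messages * (t_startup + bytes_per_message * t_byte)

# 1000 separate 8-byte requests versus one 8000-byte block (illustrative values)
many_small = transfer_time(1000, 8, t_startup=1e-5, t_byte=1e-9)
one_block = transfer_time(1, 8000, t_startup=1e-5, t_byte=1e-9)
```

Both variants move the same 8000 bytes, but the batched request pays the start-up cost once instead of 1000 times.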
8 Overlapping Computation
Example: 4 consecutive broadcasts
[Figure: broadcast among processes p0 to p6, performed 4 x]
8 steps without pipelining
6 steps with pipelining
9 Outline of the remainder
- Simple static partitioning schemes
- 1D array
- 2D array
- More involved examples
- Bucket sort
- Divide & conquer
- binary tree computations
- Examples with dynamic partitioning/divide & conquer
- numerical integration
- N-body problem
- Barnes-Hut algorithm
10 1D Array Partitioning
block partitioning
striped partitioning
block-striped partitioning
[Figure: each scheme shown as an array divided among processes p0 to p4]
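The three 1D schemes can be expressed as functions mapping an array index to its owning process. The figures are not available here, so this sketch assumes that "striped" means cyclic (round-robin by element) and "block-striped" means block-cyclic (round-robin by blocks of b elements). The function names and the block size parameter `b` are hypothetical.

```python
def block_owner(i, n, p):
    """Block partitioning: contiguous chunks of ceil(n/p) elements."""
    chunk = -(-n // p)  # ceiling division
    return i // chunk

def striped_owner(i, p):
    """Striped (cyclic) partitioning: element i goes to process i mod p."""
    return i % p

def block_striped_owner(i, p, b):
    """Block-striped (block-cyclic) partitioning:
    blocks of b consecutive elements are dealt out round-robin."""
    return (i // b) % p
```

With n = 10 and p = 5, block partitioning gives owners 0,0,1,1,2,2,3,3,4,4 and striped partitioning gives 0,1,2,3,4,0,1,2,3,4. Block-striped sits between the two: it keeps some locality while spreading work whose cost varies along the array.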
11 2D Array Partitioning
[Figure: 2D array partitioning schemes among processes p0 to p5]
12 Exercise
- Problem: Given an array of n integers, use p processors to find the position of the first 0 in the array
- assume communication is costly (p << comm < n)
- what is the complexity? Speedup?
- describe that in terms of k, the position of the first 0
- how would you modify the algorithm to be better in terms of k?
- what if the communication is cheap? (comm p)
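One possible starting point for the exercise is the plain block-partitioned version below. It is a baseline sketch, not the intended answer: it assumes each worker scans its own contiguous block and the local results are combined with a single minimum. The function names are hypothetical, and Python threads only illustrate the structure.

```python
from concurrent.futures import ThreadPoolExecutor

def first_zero_in_block(a, start, end):
    """Scan a[start:end] and return the index of its first 0, or None."""
    for i in range(start, end):
        if a[i] == 0:
            return i
    return None

def first_zero(a, p):
    """Block-partition a among p workers, then combine local answers with min."""
    n = len(a)
    chunk = -(-n // p) if n else 1  # ceiling division; avoid a zero step for empty input
    ranges = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        hits = pool.map(lambda r: first_zero_in_block(a, *r), ranges)
        found = [h for h in hits if h is not None]
    return min(found) if found else None
```

Each worker touches at most ceil(n/p) elements and there is one combine step, regardless of where the first 0 lies. The remaining questions about k ask how to improve on exactly that.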