Title: On-line adaptive parallel prefix computation
1 On-line adaptive parallel prefix computation
- Jean-Louis Roch, Daouda Traore, Julien Bernard
- INRIA-CNRS Moais team - LIG Grenoble, France
Contents
I. Motivation
II. Work-stealing scheduling of parallel algorithms
III. Processor-oblivious parallel prefix computation
EUROPAR 2006 - Dresden, Germany - August 29th, 2006
2 Parallel prefix on fixed architecture
- Prefix problem
- input: a_0, a_1, ..., a_n
- output: π_1, ..., π_n with π_k = a_0 ∘ a_1 ∘ ... ∘ a_k, for an associative operation ∘
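As a baseline, a minimal sequential implementation of this definition (illustrative only, with + as the associative operation and 0-based indexing):

```cpp
// Sequential prefix: pi[k] = a[0] + a[1] + ... + a[k] (op = + for illustration).
// One left-to-right pass, about n operations: the work-optimal baseline.
#include <cstddef>
#include <vector>

std::vector<double> sequential_prefix(const std::vector<double>& a) {
    std::vector<double> pi(a.size());
    if (a.empty()) return pi;
    double acc = a[0];
    pi[0] = acc;
    for (std::size_t k = 1; k < a.size(); ++k) {
        acc += a[k];            // acc = a[0] + ... + a[k]
        pi[k] = acc;
    }
    return pi;
}
```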
3 The problem
To design a single algorithm that computes prefix(a) efficiently on an arbitrary dynamic architecture
- Which algorithm to choose: the sequential algorithm, parallel P2, parallel P100, ..., parallel Pmax?
- Target platforms: multi-user SMP server, grid, heterogeneous network
- Dynamic architecture: non-fixed number of resources, variable speeds (e.g. a grid, but not only: an SMP server in multi-user mode)
4 Lower bound for prefix on processors with changing speeds
- Model of heterogeneous processors with changing speeds [Bender et al. 02]: Π_i(t) = instantaneous speed of processor i at time t (in operations per second)
- Assumption: Π_max(t) ≤ constant · Π_min(t)
- Def: Π_ave = average speed per processor over a computation of duration T
- Theorem 2 (lower bound): the time of any prefix computation of n elements on p processors with changing speeds is at least 2n / ((p+1) · Π_ave)
- Sketch of the proof: extension of the lower bound on p identical processors [Faith 82], based on counting the number of performed operations.
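In symbols (the integral form of Π_ave below is my reading of "average speed per processor over a computation of duration T" in the Bender et al. model):

```latex
% Speed model: processor i delivers \Pi_i(t) operations per second at time t,
% with \Pi_{\max}(t) \le c \cdot \Pi_{\min}(t) for some constant c.
\begin{align*}
\Pi_{ave} \;=\; \frac{1}{p\,T}\sum_{i=1}^{p}\int_{0}^{T}\Pi_i(t)\,dt,
\qquad
T_{\text{prefix}} \;\ge\; \frac{2n}{(p+1)\,\Pi_{ave}}.
\end{align*}
```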
5 Changing speeds and work-stealing
- A work-stealing schedule adapts on-line to processor availability and speeds [Bender-Rabin 02]
- Principle of work-stealing: a greedy schedule, but distributed and randomized
- Each processor manages locally the tasks it creates
- When idle, a processor steals the oldest ready task on a remote, non-idle victim processor (chosen at random)
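A minimal sketch of the stealing rule described above (illustrative, not the Moais/KAAPI implementation; the Task type and the single-lock protection are assumptions): the owner pushes and pops at the bottom of its own deque, while a thief removes the oldest ready task from the top.

```cpp
// Sketch of a work-stealing deque. Owner works on the bottom (newest tasks);
// a thief steals from the top, i.e. the oldest ready task.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

using Task = std::function<void()>;

class WorkStealingDeque {
public:
    void push(Task t) {                    // owner: register a task it created
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }
    std::optional<Task> pop() {            // owner: resume its newest task
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }
    std::optional<Task> steal() {          // thief: take the oldest ready task
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }
private:
    std::mutex m_;
    std::deque<Task> tasks_;
};
```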
6Work-stealing and adaptation
 Work W1 total operations
performed
Depth W? ops on a critical path (parallel
time on ?? resources)
- Interest if W1 fixed and W? small, near-optimal
adaptative schedulewith good probability on p
processors with average speeds ?ave - Moreover steals task migrations lt p.W?
Blumofe 98 Narlikar 01 Bender 02 - But lower bounds for prefix
- Minimal work W1 n ? W? n
? - Minimal depth W? lt 2 log n ? W1 gt 2n ?
- With work-stealing, how to reach the lower bound
?
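For contrast with the sequential baseline, here is a sketch (not from the talk) of a work-efficient, depth-optimal scan in the Blelloch up-sweep/down-sweep style, written with plain loops whose inner iterations would run in parallel; it illustrates the tradeoff above: about 2n operations (W1 ≈ 2n) but only about 2 log2 n parallel steps (W∞).

```cpp
// Work-efficient parallel scan sketch (up-sweep / down-sweep).
// Computes the EXCLUSIVE prefix sums in place; n must be a power of 2.
// About 2n additions (W1 ≈ 2n) in about 2 log2 n parallel steps (W∞).
#include <cstddef>
#include <vector>

void blelloch_exclusive_scan(std::vector<double>& a) {
    const std::size_t n = a.size();
    // Up-sweep (reduction tree): log2 n steps, n-1 additions.
    for (std::size_t d = 1; d < n; d *= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)   // parallel loop
            a[i] += a[i - d];
    a[n - 1] = 0.0;                                          // identity at the root
    // Down-sweep: log2 n steps, n-1 additions (plus swaps).
    for (std::size_t d = n / 2; d >= 1; d /= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) { // parallel loop
            double left = a[i - d];
            a[i - d] = a[i];
            a[i] += left;
        }
}
```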
7 How to get both work W1 and depth W∞ small?
- General approach: couple two algorithms
  - a sequential algorithm with an optimal number of operations Wseq
  - a fine-grain parallel algorithm with minimal critical time W∞, but parallel work >> Wseq
- Folk technique: parallel, then sequential
  - run the parallel algorithm down to a certain grain, then use the sequential one
  - drawback with changing speeds: either too many idle processors or too many operations
- Work-preserving speed-up technique [Bini-Pan 94], cascading [Jaja 92]: sequential, then parallel
  - careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq) (see the sketch after this list)
  - use the work-optimal sequential algorithm to reduce the size
  - then use the time-optimal parallel algorithm to decrease the time
  - drawback: sequential at coarse grain and parallel at fine grain
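A sketch of the fixed-partition "sequential to reduce the size, then a small prefix on the reduced data" scheme mentioned above (an illustration with p static blocks and + as the operation, not the authors' code); its weakness is precisely the static grain n/p when speeds change.

```cpp
// Blocked prefix sum: sequential inside blocks, small prefix over block totals.
// The two "for each block" loops are what would be distributed over p processors.
// Work W1 ≈ 2n (one reduction pass + one scan pass), depth ≈ 2n/p + p.
#include <algorithm>
#include <cstddef>
#include <vector>

void blocked_prefix_sum(std::vector<double>& a, std::size_t p) {
    const std::size_t n = a.size();
    if (n == 0 || p == 0) return;
    const std::size_t block = (n + p - 1) / p;      // static grain: n/p elements

    // Phase 1 (parallel over blocks): sequential reduction of each block.
    std::vector<double> totals(p, 0.0);
    for (std::size_t b = 0; b < p; ++b)
        for (std::size_t i = b * block; i < std::min(n, (b + 1) * block); ++i)
            totals[b] += a[i];

    // Phase 2 (small, size p): exclusive prefix of the block totals.
    double running = 0.0;
    for (std::size_t b = 0; b < p; ++b) {
        double t = totals[b];
        totals[b] = running;                        // offset for block b
        running += t;
    }

    // Phase 3 (parallel over blocks): sequential prefix inside each block.
    for (std::size_t b = 0; b < p; ++b) {
        double acc = totals[b];
        for (std::size_t i = b * block; i < std::min(n, (b + 1) * block); ++i) {
            acc += a[i];
            a[i] = acc;
        }
    }
}
```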
8 Alternative: concurrently sequential and parallel
- Based on work-stealing and the work-first principle: always execute the sequential algorithm, to reduce parallelism overhead
- Use the parallel algorithm only when a processor becomes idle (i.e. work-stealing), by extracting parallelism from the remaining sequential computation (i.e. adaptive granularity)
- Hypothesis: two algorithms
  - 1 sequential: SeqCompute
  - 1 parallel: LastPartComputation; at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
- Self-adaptive granularity based on work-stealing (a sketch of this coupling follows below)
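A minimal, single-lock sketch of this coupling (the names PrefixWork, extract_par, seq_compute and merge are illustrative, inspired by but not taken from the talk): the owner advances the sequential prefix over a shared range, a thief shrinks that range to steal its last half and scans it locally with offset 0, and a merge step adds the owner's last prefix value once the owner reaches the steal point.

```cpp
// Simplified sketch of the concurrent coupling for prefix sums.
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

struct PrefixWork {
    std::vector<double>* a;
    std::size_t begin, end;      // remaining range of the sequential process
    std::mutex m;
};

// Thief side ("LastPartComputation"): steal the second half of what remains.
// Returns the stolen range [lo, hi), or an empty range if too little is left.
std::pair<std::size_t, std::size_t> extract_par(PrefixWork& w, std::size_t min_grain) {
    std::lock_guard<std::mutex> lock(w.m);
    if (w.end - w.begin < 2 * min_grain) return {w.end, w.end};
    std::size_t mid = w.begin + (w.end - w.begin) / 2;
    std::size_t hi = w.end;
    w.end = mid;                 // the owner now stops at mid
    return {mid, hi};
}

// Owner side ("SeqCompute"): plain sequential prefix over what it still owns.
void seq_compute(PrefixWork& w, double carry_in) {
    double acc = carry_in;
    for (;;) {
        std::size_t i;
        {
            std::lock_guard<std::mutex> lock(w.m);
            if (w.begin >= w.end) return;
            i = w.begin++;
        }
        acc += (*w.a)[i];
        (*w.a)[i] = acc;
    }
}

// Merge step: the stolen part was scanned with offset 0; add the owner's
// last prefix value a[lo-1] to it (itself a parallelizable loop).
void merge(std::vector<double>& a, std::size_t lo, std::size_t hi) {
    double offset = a[lo - 1];
    for (std::size_t i = lo; i < hi; ++i) a[i] += offset;
}
```

A thief would pair extract_par with its own local scan of the stolen range (carry 0), and merge runs once the owner reaches the steal point; in the real scheme the stolen part and the merge are themselves splittable, which is what keeps the depth logarithmic.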
9 Alternative: concurrently sequential and parallel
[Figure: an idle processor preempts the last part of a running SeqCompute and computes it in parallel]
10 Alternative: concurrently sequential and parallel
[Figure: when the stolen part is complete, the owner's SeqCompute merges/jumps over it]
11 Adaptive Prefix on 3 processors
[Animation, slides 11-15: one process runs the sequential prefix (π1, π2, π3, ...) while two parallel work-stealers extract the last part of the remaining work and compute partial prefixes (π5 ... π12), which are merged back]
16 Adaptive Prefix on 3 processors
Implicit critical path on the sequential process
17 Analysis of the algorithm
- Theorem 3 (execution time): T_p ≤ 2n / ((p+1) · Π_ave) + O(log n / Π_ave), i.e. the lower bound of Theorem 2 up to a logarithmic additive term
- Sketch of the proof: analysis of the operations performed by each side of the coupling
  - the sequential main process performs S operations on one processor
  - the (p-1) work-stealers perform X = 2(n - S) operations, with depth O(log X)
  - each non-constant-time task can potentially be split (variable speeds)
  - the coupling ensures both algorithms complete simultaneously: Ts = Tp + O(log X), which bounds the total number X of operations performed and the parallelism overhead (S + X) / #ops_optimal
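The balance equation behind the theorem can be made explicit; this is a back-of-the-envelope version under the simplifying assumption that every processor runs exactly at the average speed Π_ave:

```latex
% Balance the two sides of the coupling: the sequential process does S ops,
% the (p-1) work-stealers do X = 2(n-S) ops, and both finish together.
\begin{align*}
\frac{S}{\Pi_{ave}} \;\approx\; \frac{2(n-S)}{(p-1)\,\Pi_{ave}}
  \;&\Longrightarrow\; S(p-1) \approx 2n - 2S
  \;\Longrightarrow\; S \approx \frac{2n}{p+1},\\
T \;\approx\; \frac{S}{\Pi_{ave}}
  \;&\approx\; \frac{2n}{(p+1)\,\Pi_{ave}}
  \quad\text{(plus the $O(\log X)$ term from the critical path of the steals).}
\end{align*}
```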
18 Adaptive prefix: experiments (1)
[Figure: time (s) vs number of processors; prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context]
Single-user context: processor-adaptive prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically optimal off-line parallel algorithm on p processors
19 Adaptive prefix: experiments (2)
[Figure: time (s) vs number of processors; prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context with an external load of (9-p) processes]
Multi-user context with additional external load: (9-p) additional external dummy processes are executed concurrently.
Processor-adaptive prefix computation is always the fastest, with a 15% benefit over a parallel algorithm for p processors with an off-line schedule.
20 Conclusion
- The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms
- Application to prefix computation:
  - theoretically reaches the lower bound on heterogeneous processors with changing speeds
  - practically achieves near-optimal performance on multi-user SMPs
- A generic adaptive scheme to implement parallel algorithms with provable performance
- Work in progress: parallel 3D reconstruction (oct-tree scheme with a deadline constraint)
21 Thank you!
Interactive Distributed Simulation [B. Raffin, E. Boyer]
- 5 cameras, 6 PCs: 3D reconstruction + simulation + rendering
-> Adaptive scheme to maximize the 3D-reconstruction precision within a fixed time step [L. Suares, B. Raffin, J.-L. Roch]
22 The prefix race: sequential / parallel fixed / adaptive
On each of the 10 executions, the adaptive version completes first
23 Adaptive prefix: some experiments
[Figures: time (s) vs number of processors; prefix of 10000 elements on an 8-processor SMP (IA64 / Linux), with and without external load]
- Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm
- Single-user context: adaptive is equivalent to
  - the sequential algorithm on 1 processor
  - the optimal 2-processor parallel algorithm on 2 processors
  - ...
  - the optimal 8-processor parallel algorithm on 8 processors
24 With double sum (r_i = r_{i-1} + x_i)
- Finest grain limited to 1 page = 16384 bytes = 2048 doubles
[Figures: single-user context; processors with variable speeds]
- Remark, for n = 4,096,000 doubles:
  - pure sequential: 0.20 s
  - minimal grain of 100 doubles: 0.26 s on 1 processor and 0.175 s on 2 processors (close to the lower bound)