Title: On-line adaptive parallel prefix computation
1 On-line adaptive parallel prefix computation
- Jean-Louis Roch, Daouda Traore, Julien Bernard
- INRIA-CNRS Moais team - LIG Grenoble, France
Contents
I. Motivation
II. Work-stealing scheduling of parallel algorithms
III. Processor-oblivious parallel prefix computation
EUROPAR 2006 - Dresden, Germany - August 29th, 2006
2 Parallel prefix on fixed architecture
- Prefix problem
- input: a_0, a_1, ..., a_n
- output: π_1, ..., π_n with π_k = a_0 ∘ a_1 ∘ ... ∘ a_k, for an associative operation ∘
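As a baseline, a minimal sequential implementation of this definition (illustrative only, with + as the associative operation and 0-based indexing):

```cpp
// Sequential prefix: pi[k] = a[0] + a[1] + ... + a[k] (op = + for illustration).
// One left-to-right pass, about n operations: the work-optimal baseline.
#include <cstddef>
#include <vector>

std::vector<double> sequential_prefix(const std::vector<double>& a) {
    std::vector<double> pi(a.size());
    if (a.empty()) return pi;
    double acc = a[0];
    pi[0] = acc;
    for (std::size_t k = 1; k < a.size(); ++k) {
        acc += a[k];            // acc = a[0] + ... + a[k]
        pi[k] = acc;
    }
    return pi;
}
```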
3 The problem
To design a single algorithm that computes prefix(a) efficiently on an arbitrary dynamic architecture
- Which algorithm to choose: the sequential algorithm, parallel P2, parallel P100, ..., parallel Pmax?
- Target platforms: multi-user SMP server, grid, heterogeneous network
- Dynamic architecture: non-fixed number of resources, variable speeds (e.g. a grid, but not only: an SMP server in multi-user mode)
4 Lower bound for prefix on processors with changing speeds
- Model of heterogeneous processors with changing speeds [Bender et al. 02]: Π_i(t) = instantaneous speed of processor i at time t (in operations per second)
- Assumption: Π_max(t) ≤ constant · Π_min(t)
- Def: Π_ave = average speed per processor over a computation of duration T
- Theorem 2 (lower bound): the time of any prefix computation of n elements on p processors with changing speeds is at least 2n / ((p+1) · Π_ave)
- Sketch of the proof: extension of the lower bound on p identical processors [Faith 82], based on counting the number of performed operations.
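In symbols (the integral form of Π_ave below is my reading of "average speed per processor over a computation of duration T" in the Bender et al. model):

```latex
% Speed model: processor i delivers \Pi_i(t) operations per second at time t,
% with \Pi_{\max}(t) \le c \cdot \Pi_{\min}(t) for some constant c.
\begin{align*}
\Pi_{ave} \;=\; \frac{1}{p\,T}\sum_{i=1}^{p}\int_{0}^{T}\Pi_i(t)\,dt,
\qquad
T_{\text{prefix}} \;\ge\; \frac{2n}{(p+1)\,\Pi_{ave}}.
\end{align*}
```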
5 Changing speeds and work-stealing
- A work-stealing schedule adapts on-line to processor availability and speeds [Bender-Rabin 02]
- Principle of work-stealing: a greedy schedule, but distributed and randomized
- Each processor manages locally the tasks it creates
- When idle, a processor steals the oldest ready task on a remote, non-idle victim processor (chosen at random)
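A minimal sketch of the stealing rule described above (illustrative, not the Moais/KAAPI implementation; the Task type and the single-lock protection are assumptions): the owner pushes and pops at the bottom of its own deque, while a thief removes the oldest ready task from the top.

```cpp
// Sketch of a work-stealing deque. Owner works on the bottom (newest tasks);
// a thief steals from the top, i.e. the oldest ready task.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

using Task = std::function<void()>;

class WorkStealingDeque {
public:
    void push(Task t) {                    // owner: register a task it created
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }
    std::optional<Task> pop() {            // owner: resume its newest task
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }
    std::optional<Task> steal() {          // thief: take the oldest ready task
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }
private:
    std::mutex m_;
    std::deque<Task> tasks_;
};
```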
6Work-stealing and adaptation
 Work W1 total operations
performed
Depth W? ops on a critical path (parallel
time on ?? resources)
- Interest if W1 fixed and W? small, near-optimal
adaptative schedulewith good probability on p
processors with average speeds ?ave - Moreover steals task migrations lt p.W?
Blumofe 98 Narlikar 01 Bender 02 - But lower bounds for prefix
- Minimal work W1 n ? W? n
? - Minimal depth W? lt 2 log n ? W1 gt 2n ?
- With work-stealing, how to reach the lower bound
?
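For contrast with the sequential baseline, here is a sketch (not from the talk) of a work-efficient, depth-optimal scan in the Blelloch up-sweep/down-sweep style, written with plain loops whose inner iterations would run in parallel; it illustrates the tradeoff above: about 2n operations (W1 ≈ 2n) but only about 2 log2 n parallel steps (W∞).

```cpp
// Work-efficient parallel scan sketch (up-sweep / down-sweep).
// Computes the EXCLUSIVE prefix sums in place; n must be a power of 2.
// About 2n additions (W1 ≈ 2n) in about 2 log2 n parallel steps (W∞).
#include <cstddef>
#include <vector>

void blelloch_exclusive_scan(std::vector<double>& a) {
    const std::size_t n = a.size();
    // Up-sweep (reduction tree): log2 n steps, n-1 additions.
    for (std::size_t d = 1; d < n; d *= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)   // parallel loop
            a[i] += a[i - d];
    a[n - 1] = 0.0;                                          // identity at the root
    // Down-sweep: log2 n steps, n-1 additions (plus swaps).
    for (std::size_t d = n / 2; d >= 1; d /= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) { // parallel loop
            double left = a[i - d];
            a[i - d] = a[i];
            a[i] += left;
        }
}
```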
7 How to get both work W1 and depth W∞ small?
- General approach: couple two algorithms
  - a sequential algorithm with an optimal number of operations Wseq
  - a fine-grain parallel algorithm with minimal critical time W∞, but parallel work >> Wseq
- Folk technique: parallel, then sequential
  - run the parallel algorithm down to a certain grain, then use the sequential one
  - drawback with changing speeds: either too many idle processors or too many operations
- Work-preserving speed-up technique [Bini-Pan 94], cascading [Jaja 92]: sequential, then parallel
  - careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq) (see the sketch after this list)
  - use the work-optimal sequential algorithm to reduce the size
  - then use the time-optimal parallel algorithm to decrease the time
  - drawback: sequential at coarse grain and parallel at fine grain
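A sketch of the fixed-partition "sequential to reduce the size, then a small prefix on the reduced data" scheme mentioned above (an illustration with p static blocks and + as the operation, not the authors' code); its weakness is precisely the static grain n/p when speeds change.

```cpp
// Blocked prefix sum: sequential inside blocks, small prefix over block totals.
// The two "for each block" loops are what would be distributed over p processors.
// Work W1 ≈ 2n (one reduction pass + one scan pass), depth ≈ 2n/p + p.
#include <algorithm>
#include <cstddef>
#include <vector>

void blocked_prefix_sum(std::vector<double>& a, std::size_t p) {
    const std::size_t n = a.size();
    if (n == 0 || p == 0) return;
    const std::size_t block = (n + p - 1) / p;      // static grain: n/p elements

    // Phase 1 (parallel over blocks): sequential reduction of each block.
    std::vector<double> totals(p, 0.0);
    for (std::size_t b = 0; b < p; ++b)
        for (std::size_t i = b * block; i < std::min(n, (b + 1) * block); ++i)
            totals[b] += a[i];

    // Phase 2 (small, size p): exclusive prefix of the block totals.
    double running = 0.0;
    for (std::size_t b = 0; b < p; ++b) {
        double t = totals[b];
        totals[b] = running;                        // offset for block b
        running += t;
    }

    // Phase 3 (parallel over blocks): sequential prefix inside each block.
    for (std::size_t b = 0; b < p; ++b) {
        double acc = totals[b];
        for (std::size_t i = b * block; i < std::min(n, (b + 1) * block); ++i) {
            acc += a[i];
            a[i] = acc;
        }
    }
}
```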
8 Alternative: concurrently sequential and parallel
- Based on work-stealing and the work-first principle: always execute the sequential algorithm, to reduce parallelism overhead
- Use the parallel algorithm only when a processor becomes idle (i.e. work-stealing), by extracting parallelism from the remaining sequential computation (i.e. adaptive granularity)
- Hypothesis: two algorithms
  - 1 sequential: SeqCompute
  - 1 parallel: LastPartComputation; at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
- Self-adaptive granularity based on work-stealing (a sketch of this coupling follows below)
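A minimal, single-lock sketch of this coupling (the names PrefixWork, extract_par, seq_compute and merge are illustrative, inspired by but not taken from the talk): the owner advances the sequential prefix over a shared range, a thief shrinks that range to steal its last half and scans it locally with offset 0, and a merge step adds the owner's last prefix value once the owner reaches the steal point.

```cpp
// Simplified sketch of the concurrent coupling for prefix sums.
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

struct PrefixWork {
    std::vector<double>* a;
    std::size_t begin, end;      // remaining range of the sequential process
    std::mutex m;
};

// Thief side ("LastPartComputation"): steal the second half of what remains.
// Returns the stolen range [lo, hi), or an empty range if too little is left.
std::pair<std::size_t, std::size_t> extract_par(PrefixWork& w, std::size_t min_grain) {
    std::lock_guard<std::mutex> lock(w.m);
    if (w.end - w.begin < 2 * min_grain) return {w.end, w.end};
    std::size_t mid = w.begin + (w.end - w.begin) / 2;
    std::size_t hi = w.end;
    w.end = mid;                 // the owner now stops at mid
    return {mid, hi};
}

// Owner side ("SeqCompute"): plain sequential prefix over what it still owns.
void seq_compute(PrefixWork& w, double carry_in) {
    double acc = carry_in;
    for (;;) {
        std::size_t i;
        {
            std::lock_guard<std::mutex> lock(w.m);
            if (w.begin >= w.end) return;
            i = w.begin++;
        }
        acc += (*w.a)[i];
        (*w.a)[i] = acc;
    }
}

// Merge step: the stolen part was scanned with offset 0; add the owner's
// last prefix value a[lo-1] to it (itself a parallelizable loop).
void merge(std::vector<double>& a, std::size_t lo, std::size_t hi) {
    double offset = a[lo - 1];
    for (std::size_t i = lo; i < hi; ++i) a[i] += offset;
}
```

A thief would pair extract_par with its own local scan of the stolen range (carry 0), and merge runs once the owner reaches the steal point; in the real scheme the stolen part and the merge are themselves splittable, which is what keeps the depth logarithmic.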
9 Alternative: concurrently sequential and parallel
[Figure: an idle processor preempts the last part of a running SeqCompute and computes it in parallel]
10 Alternative: concurrently sequential and parallel
[Figure: when the stolen part is complete, the owner's SeqCompute merges/jumps over it]
11 Adaptive Prefix on 3 processors
[Animation, slides 11-15: one process runs the sequential prefix (π1, π2, π3, ...) while two parallel work-stealers extract the last part of the remaining work and compute partial prefixes (π5 ... π12), which are merged back]
16 Adaptive Prefix on 3 processors
Implicit critical path on the sequential process
17 Analysis of the algorithm
- Theorem 3 (execution time): T_p ≤ 2n / ((p+1) · Π_ave) + O(log n / Π_ave), i.e. the lower bound of Theorem 2 up to a logarithmic additive term
- Sketch of the proof: analysis of the operations performed by each side of the coupling
  - the sequential main process performs S operations on one processor
  - the (p-1) work-stealers perform X = 2(n - S) operations, with depth O(log X)
  - each non-constant-time task can potentially be split (variable speeds)
  - the coupling ensures both algorithms complete simultaneously: Ts = Tp + O(log X), which bounds the total number X of operations performed and the parallelism overhead (S + X) / #ops_optimal
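The balance equation behind the theorem can be made explicit; this is a back-of-the-envelope version under the simplifying assumption that every processor runs exactly at the average speed Π_ave:

```latex
% Balance the two sides of the coupling: the sequential process does S ops,
% the (p-1) work-stealers do X = 2(n-S) ops, and both finish together.
\begin{align*}
\frac{S}{\Pi_{ave}} \;\approx\; \frac{2(n-S)}{(p-1)\,\Pi_{ave}}
  \;&\Longrightarrow\; S(p-1) \approx 2n - 2S
  \;\Longrightarrow\; S \approx \frac{2n}{p+1},\\
T \;\approx\; \frac{S}{\Pi_{ave}}
  \;&\approx\; \frac{2n}{(p+1)\,\Pi_{ave}}
  \quad\text{(plus the $O(\log X)$ term from the critical path of the steals).}
\end{align*}
```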
18 Adaptive prefix: experiments (1)
[Figure: time (s) vs number of processors; prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context]
Single-user context: processor-adaptive prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically optimal off-line parallel algorithm on p processors
19 Adaptive prefix: experiments (2)
[Figure: time (s) vs number of processors; prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context with an external load of (9-p) processes]
Multi-user context with additional external load: (9-p) additional external dummy processes are executed concurrently.
Processor-adaptive prefix computation is always the fastest, with a 15% benefit over a parallel algorithm for p processors with an off-line schedule.
20 Conclusion
- The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms
- Application to prefix computation:
  - theoretically reaches the lower bound on heterogeneous processors with changing speeds
  - practically achieves near-optimal performance on multi-user SMPs
- A generic adaptive scheme to implement parallel algorithms with provable performance
- Work in progress: parallel 3D reconstruction (oct-tree scheme with a deadline constraint)
21 Thank you!
Interactive Distributed Simulation [B. Raffin, E. Boyer]
- 5 cameras, 6 PCs: 3D reconstruction + simulation + rendering
-> Adaptive scheme to maximize the 3D-reconstruction precision within a fixed time step [L. Suares, B. Raffin, J.-L. Roch]
22 The prefix race: sequential / parallel fixed / adaptive
On each of the 10 executions, the adaptive version completes first
23 Adaptive prefix: some experiments
[Figures: time (s) vs number of processors; prefix of 10000 elements on an 8-processor SMP (IA64 / Linux), with and without external load]
- Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm
- Single-user context: adaptive is equivalent to
  - the sequential algorithm on 1 processor
  - the optimal 2-processor parallel algorithm on 2 processors
  - ...
  - the optimal 8-processor parallel algorithm on 8 processors
24 With double sum (r_i = r_{i-1} + x_i)
- Finest grain limited to 1 page = 16384 bytes = 2048 doubles
[Figures: single-user context; processors with variable speeds]
- Remark, for n = 4,096,000 doubles:
  - pure sequential: 0.20 s
  - minimal grain of 100 doubles: 0.26 s on 1 processor and 0.175 s on 2 processors (close to the lower bound)