Title: PARALLEL AND DISTRIBUTED COMPUTATION (slides by L. Pagli)
1 PARALLEL AND DISTRIBUTED COMPUTATION
- MANY INTERCONNECTED PROCESSORS WORKING
CONCURRENTLY
[Figure: processors P1, P2, ..., Pn attached to an INTERCONNECTION NETWORK]
- CONNECTION MACHINE (THINKING MACHINES CORP.): 64,000 processors
- INTERNET: connects all the computers of the world
2 - THREE TYPES OF MULTIPROCESSING FRAMEWORKS, CLOSELY RELATED:
  - CONCURRENT
  - PARALLEL
    - PRAM
    - BOUNDED-DEGREE NETWORKS AND VLSI
  - DISTRIBUTED
- CONCURRENT: MULTIPROCESSING ACTIVITIES TAKE PLACE IN A SINGLE MACHINE (POSSIBLY USING SEVERAL PROCESSORS), SHARING MEMORY AND TASKS.
- TECHNICAL ASPECTS
  - PARALLEL COMPUTERS (USUALLY) WORK IN TIGHT SYNCHRONY, SHARE MEMORY TO A LARGE EXTENT AND HAVE A VERY FAST AND RELIABLE COMMUNICATION MECHANISM BETWEEN THEM.
  - DISTRIBUTED COMPUTERS ARE MORE INDEPENDENT, COMMUNICATION IS LESS FREQUENT AND LESS SYNCHRONOUS, AND THE COOPERATION IS LIMITED.
- PURPOSES
  - PARALLEL COMPUTERS COOPERATE TO SOLVE (POSSIBLY) DIFFICULT PROBLEMS MORE EFFICIENTLY.
  - DISTRIBUTED COMPUTERS HAVE INDIVIDUAL GOALS AND PRIVATE ACTIVITIES. SOMETIMES COMMUNICATION WITH OTHER COMPUTERS IS NEEDED (E.G. DISTRIBUTED DATABASE OPERATIONS).
3 - FOR PARALLEL SYSTEMS
  - WE ARE INTERESTED IN SOLVING ANY PROBLEM IN PARALLEL
- FOR DISTRIBUTED SYSTEMS
  - WE ARE INTERESTED IN SOLVING IN PARALLEL PARTICULAR PROBLEMS ONLY. TYPICAL EXAMPLES ARE:
    - COMMUNICATION SERVICES
    - ROUTING
    - BROADCASTING
    - MAINTENANCE OF CONTROL STRUCTURE
    - SPANNING TREE CONSTRUCTION
    - TOPOLOGY UPDATE
    - LEADER ELECTION
    - RESOURCE CONTROL ACTIVITIES
4 PARALLEL ALGORITHMS
- WHICH MODEL OF COMPUTATION IS THE BEST TO USE?
- HOW MUCH TIME DO WE EXPECT TO SAVE USING A PARALLEL ALGORITHM?
- HOW TO CONSTRUCT EFFICIENT ALGORITHMS?
- MANY CONCEPTS OF COMPLEXITY THEORY MUST BE REVISITED
- IS PARALLELISM A SOLUTION FOR HARD PROBLEMS?
- ARE THERE PROBLEMS NOT ADMITTING AN EFFICIENT PARALLEL SOLUTION, THAT IS, INHERENTLY SEQUENTIAL PROBLEMS?
5 PRAM MODEL
- Joseph JáJá, An Introduction to Parallel Algorithms, Addison-Wesley Pub. Comp., 1992
- Karp R.M., Ramachandran V., A survey of parallel algorithms for shared-memory machines, in J. van Leeuwen (Ed.), Handbook of Theoretical Computer Science
- Ian Parberry, Parallel Complexity Theory, Research Notes in Theoretical Computer Science, John Wiley & Sons, 1987
- TO FOCUS ON ALGORITHMIC ISSUES INDEPENDENTLY OF PHYSICAL LOCATIONS
6 [Figure: processors P1, P2, ..., Pi, ..., Pn connected to a Common Memory with cells 1, 2, 3, ..., m]
PRAM: n RAM processors, numbered from 1 to n, connected to a common memory of m cells.
ASSUMPTION: at each time unit each Pi can read a memory cell, make an internal computation and write another memory cell.
CONSEQUENCE: any pair of processors Pi, Pj can communicate in constant time! Pi writes the message in cell x at time t; Pj reads the message in cell x at time t+1.
7 - ASSUMPTIONS
  - Shared memory: the array A is stored in the global memory and can be accessed by any processor.
  - Synchronous mode of operation: in each unit of time, each processor is allowed to execute an instruction or to stay idle.
- There are several variations regarding the handling of simultaneous access to the same memory location:
  - EREW-PRAM (exclusive read, exclusive write)
  - CREW-PRAM (concurrent read, exclusive write)
  - CRCW-PRAM (concurrent read, concurrent write), with a policy to resolve concurrent writes: Common, Priority, Arbitrary
- The three models do not differ substantially in their computational power!
- If each processor can execute its own local program we have a
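The three CRCW write policies named on this slide can be illustrated with a minimal Python sketch. The function name `resolve_concurrent_write`, the `(processor_id, value)` request format, and the convention that the lowest-numbered processor wins under Priority are assumptions made for this example; the slides only name the policies.

```python
def resolve_concurrent_write(requests, policy):
    """Decide which value ends up in a cell written by several processors at once.

    requests: list of (processor_id, value) pairs, all aimed at the same cell
              in the same time unit (hypothetical format for this sketch).
    policy:   "common", "priority" or "arbitrary".
    """
    if not requests:
        raise ValueError("no write requests")
    if policy == "common":
        # Common: concurrent writes are legal only if all values agree.
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("Common CRCW: writers disagree on the value")
        return values.pop()
    if policy == "priority":
        # Priority: assumed here to mean that the smallest processor index wins.
        return min(requests, key=lambda request: request[0])[1]
    if policy == "arbitrary":
        # Arbitrary: any single writer may succeed; this sketch takes the first.
        return requests[0][1]
    raise ValueError(f"unknown policy: {policy}")


writes = [(3, 7), (1, 9), (2, 5)]
print(resolve_concurrent_write(writes, "priority"))
print(resolve_concurrent_write([(1, 4), (2, 4)], "common"))
```

An algorithm written for the Arbitrary policy must be correct whichever writer succeeds, so the "take the first" choice above is only one of the legal outcomes.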
8 - From Bertossi, Ch. 27
- Summation, n log n
- Summation, n
R. Grossi
9 Important parameters of the efficiency of a parallel algorithm
$T_p(n)$ (or $T_p$): parallel time
$p(n)$ (or $p$): number of processors
LOWER BOUND of the parallel computation
Let A be a problem and $T_s$ be the complexity of the optimal (or the best known) sequential algorithm for A; we have $T_p \ge T_s / p$.
$C(n) = T_p \cdot p$: cost of the parallel algorithm
The parallel algorithm can be converted into a sequential algorithm that runs in $O(C(n))$ time: the single processor simulates the p processors in p steps for each of the $T_p$ parallel steps. If the parallel time were less than $T_s / p$, we could derive a sequential algorithm better than the optimal one!
10 Parallel algorithm
time          1    2    3
processor P1  op1  op2  op3
processor P2  op4  op5
$T_p = 3$, $p = 2$, $C = 6$
can be simulated by a single processor in a number of steps (time) ≈ 6
Sequential algorithm
time  1    2    3    4    5
      op1  op4  op2  op5  op3
$T_s = 5$
$T_p \ge T_s / p$
$T_s / T_p$: speed-up of the parallel algorithm
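The simulation argument behind the lower bound can be replayed on this two-processor example. The sketch below is only an illustration: the dictionary layout of `schedule` and the function name `simulate_sequentially` are assumptions, and `None` marks an idle slot.

```python
# Parallel schedule from the slide: one list of operations per processor.
schedule = {
    "P1": ["op1", "op2", "op3"],
    "P2": ["op4", "op5", None],  # None = the processor is idle in that time unit
}


def simulate_sequentially(schedule):
    """Replay each parallel step on one processor, visiting the processors in turn."""
    parallel_time = max(len(ops) for ops in schedule.values())
    processors = len(schedule)
    order = []
    for t in range(parallel_time):
        for name in sorted(schedule):
            ops = schedule[name]
            if t < len(ops) and ops[t] is not None:
                order.append(ops[t])
    cost = processors * parallel_time  # C = Tp * p
    return order, parallel_time, cost


order, Tp, C = simulate_sequentially(schedule)
Ts = len(order)
print(order, "Tp =", Tp, "C =", C, "Ts =", Ts, "speed-up =", Ts / Tp)
```

The single processor never needs more than C steps, which is why a parallel time below $T_s / p$ would contradict the optimality of the sequential algorithm.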
11 MAXIMUM on the PRAM
Input: an array A of $n = 2^k$ elements in the shared memory of a PRAM with n/2 processors
Output: the maximum element, stored in location S
Algorithm MAX
begin
  for j = 1 to log n do
    for all i where 1 ≤ i ≤ n/2^j do in parallel
      A[i] := max(A[2i], A[2i-1])
  S := A[1]
end
[Figure: binary tree of pairwise comparisons for n = 8. P1, ..., P4 compare A(1), ..., A(8) at the first level, P1 and P2 work at the second level, and P1 alone at the third level, writing the maximum to S]
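Algorithm MAX can be simulated round by round on a sequential machine. This is a minimal sketch, assuming n is a power of two; the name `pram_max` and the returned round counter are illustrative only. Each round computes all reads before any write, which mimics the synchronous PRAM step.

```python
def pram_max(values):
    """Simulate algorithm MAX; returns (maximum, number of parallel rounds)."""
    n = len(values)
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    A = [None] + list(values)  # 1-based indexing, as on the slide
    rounds = 0
    active = n // 2            # processors working in the current round
    while active >= 1:
        # Every processor i reads A[2i-1] and A[2i] first ...
        results = [max(A[2 * i - 1], A[2 * i]) for i in range(1, active + 1)]
        # ... and only then do all of them write A[i].
        A[1:active + 1] = results
        rounds += 1
        active //= 2
    S = A[1]
    return S, rounds


print(pram_max([5, 1, 9, 3, 7, 2, 8, 6]))
```

For n = 8 the simulation uses log n = 3 rounds with 4, 2 and 1 active processors, matching the tree in the figure.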
12 From the previous lower bound and the sequential computation: $C = T_p \cdot p \ge T_s = \Theta(n)$
From algorithm MAX: $C = T_p \cdot p = O(n \log n)$, not optimal
- Better algorithm
  - divide the n elements into $k = n/\log n$ subsets of $\log n$ elements each
[Figure: processors P1, P2, P3, ..., Pk, each producing the local maximum m1, m2, m3, ..., mk of its subset]
  - each processor computes the maximum mi of its subset with the sequential algorithm in time $O(\log n)$
  - algorithm MAX is executed among the local maxima, in time $O(\log(n/\log n)) = O(\log n - \log\log n) = O(\log n)$
Overall time $T_p = O(\log n)$ and $p = n/\log n$
$C = T_p \cdot p = O(n)$: optimal
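The two phases of the better algorithm can be sketched as follows. This is an illustrative simulation, not a PRAM program: the name `optimal_max`, the use of `ceil(log2 n)` as block size, and the returned tuple are assumptions of this example.

```python
import math


def optimal_max(values):
    """Two-phase maximum: local sequential maxima, then a tree among them.

    Returns (maximum, processors used, block size, tree rounds).
    """
    n = len(values)
    block = max(1, math.ceil(math.log2(n))) if n > 1 else 1
    # Phase 1: each processor scans its own block of about log n elements.
    blocks = [values[i:i + block] for i in range(0, n, block)]
    local_maxima = [max(b) for b in blocks]
    processors = len(blocks)
    # Phase 2: algorithm MAX among the local maxima (pairwise rounds).
    rounds = 0
    while len(local_maxima) > 1:
        local_maxima = [max(local_maxima[i:i + 2])
                        for i in range(0, len(local_maxima), 2)]
        rounds += 1
    return local_maxima[0], processors, block, rounds


data = [(7 * i) % 16 for i in range(16)]  # a permutation of 0..15
print(optimal_max(data))
```

With n = 16 the sketch uses 4 processors, 4 sequential steps per block and 2 tree rounds, so the cost is 4 · (4 + 2) = 24 rather than the 8 · 4 = 32 of plain MAX with n/2 processors.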
13 PERFORMANCE OF PARALLEL ALGORITHMS
Four ways of measuring the performance of a parallel algorithm:
1. $P(n)$ processors and $T_p(n)$ time.
2. $C(n) = P(n)\,T_p(n)$ cost and $T_p(n)$ time.
The number of processors depends on the size n of the problem. The second relation can be generalized to any number $p \le P(n)$ of processors: each of the $T_p$ parallel steps can be simulated by the p processors in $O(P(n)/p)$ substeps; this simulation takes a total of $O(T_p(n)\,P(n)/p)$ time.
3. $O(T_p(n)\,P(n)/p)$ time for any number $p \le P(n)$ of processors.
If the number of processors p is larger than $P(n)$, we can clearly achieve the running time $T_p(n)$ by using $P(n)$ processors only. Relation 3 can be further generalized.
4. $O(C(n)/p + T_p(n))$ time for any number p of processors.
In conclusion, in the design of a PRAM algorithm, we can assume as many processors as we need and use the proper relation to analyze it.
14 PERFORMANCE OF ALGORITHM MAX
1. $P(n) = n/2$ processors and $T_p(n) = O(\log n)$ time.
2. $C(n) = P(n)\,T_p(n) = O(n \log n)$ cost and $T_p(n) = O(\log n)$ time.
Assume $p = \log n$ processors:
3. $O(T_p(n)\,P(n)/p) = O(\log n \cdot n / \log n) = O(n)$ time.
Therefore
4. $O(n \log n / p + \log n)$ time. If $p \ge n$, $O(\log n)$ time, otherwise $O(n \log n / p)$ time.
Work $W(n)$ of a parallel algorithm: total number of operations used.
Work of alg. MAX:
$W(n) = \sum_{j=1}^{\log n} n/2^j + 1 = O(n)$
$W(n) \le C(n)$
$W(n)$ measures the total number of operations and has nothing to do with the number of processors available, while $C(n)$ measures the cost of the algorithm relative to the number p of processors available.
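The gap between work and cost for algorithm MAX can be checked numerically. In this sketch the function name `max_work_and_cost` is illustrative, and the final assignment S := A[1] is counted as one extra time unit (an assumption made so that the "+ 1" in W(n) has a matching step in the cost).

```python
def max_work_and_cost(n):
    """Work W(n) and cost C(n) of algorithm MAX for n a power of two, n >= 2."""
    log_n = n.bit_length() - 1
    # n/2^j comparisons in round j, plus the final assignment S := A[1].
    work = sum(n // 2 ** j for j in range(1, log_n + 1)) + 1
    processors = n // 2
    time = log_n + 1           # log n rounds plus the final assignment
    cost = processors * time   # every processor is charged for every time unit
    return work, cost


for n in (2, 8, 1024):
    print(n, max_work_and_cost(n))
```

The work grows like n while the cost grows like n log n, because most processors sit idle after the first round.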
15 Work-time presentation of a parallel algorithm
Any number of parallel operations is allowed at each time unit.
BRENT PRINCIPLE: given a parallel algorithm that runs in time $T(n)$ and requires $W(n)$ work, we can adapt this algorithm to run on a p-processor PRAM in time
$T_p(n) \le \lfloor W(n)/p \rfloor + T(n)$
Let $W_i(n)$ be the number of operations of time unit i, $1 \le i \le T(n)$. Simulate each set of $W_i(n)$ operations in $\lceil W_i(n)/p \rceil$ parallel steps of the p processors, for each $1 \le i \le T(n)$. The p-processor PRAM algorithm takes
$\le \sum_i \lceil W_i(n)/p \rceil \le \sum_i (\lfloor W_i(n)/p \rfloor + 1) \le \lfloor W(n)/p \rfloor + T(n)$
The Brent Principle assumes that the scheduling of the operations to the processors is always a trivial task. This is not always true. It is easy if we use $C(n)$ in place of $W(n)$.
16 [Figure: algorithm A1. Its 36 operations, numbered 1 to 36, are laid out over the time units t1, ..., t6; the numbers of operations per time unit are $W_i$ = 6, 7, 3, 8, 5, 7. $T_1 = 6$, $W(n) = 36$]
A1 can be simulated by A2 with 3 processors in time $T_2(n) \le 36/3 + 6 = 18$
[Figure: schedule of the same 36 operations on 3 processors; $T_2 = 14$]
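The numbers in this example can be recomputed directly from the per-step work values. A minimal sketch, with `brent_schedule` as an illustrative name: each time unit with $W_i$ operations is replayed in $\lceil W_i/p \rceil$ substeps.

```python
def brent_schedule(work_per_step, p):
    """Return (substeps per time unit, total time on p processors, Brent bound)."""
    # ceil(w / p) computed with integers only
    substeps = [(w + p - 1) // p for w in work_per_step]
    total_time = sum(substeps)
    bound = sum(work_per_step) // p + len(work_per_step)  # floor(W/p) + T
    return substeps, total_time, bound


W_i = [6, 7, 3, 8, 5, 7]          # operations per time unit of A1
substeps, T2, bound = brent_schedule(W_i, p=3)
print(substeps, T2, bound)
```

The sketch gives 2 + 3 + 1 + 3 + 2 + 3 = 14 substeps, which agrees with $T_2 = 14$ in the figure and stays below the bound of 18.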
17 - From Bertossi, Ch. 27
- Basic techniques
- Prefix sums
- Non-optimal sorting
- List ranking with pointer jumping
- Eulerian cycle
- R. Grossi
18 PARALLEL DIVIDE AND CONQUER
- Partition the input into several subsets of almost equal size
- Solve recursively the subproblem defined by each subset
- Combine the solutions of the subproblems into a solution to the overall problem
CONVEX HULL
sequential algorithm: $O(n \log n)$
[Figure: convex hull with vertices v1, ..., v7 and two marked points p and q; the chain through v1, ..., v5 is labelled UPPER HULL, the chain through v6, v7 is labelled LOWER HULL]
- Sort the points by x-coordinate in increasing order
- Compute the UPPER HULL
$T_p = O(\log n)$
19 W.l.o.g., let $x(v_1) < x(v_2) < \dots < x(v_n)$, $n = 2^k$
- Divide the points into two subsets $S_1 = (v_1, v_2, \dots, v_{n/2})$ and $S_2 = (v_{n/2+1}, \dots, v_n)$; suppose the UPPER HULLs of S1 and S2 are already computed
[Figure: the upper hulls of S1 (vertices q1, ..., q5) and S2 (vertices q'1, ..., q'4), joined by the segment ab]
- Compute ab, the upper common tangent
- The UPPER HULL of S is formed by $q_1, \dots, q_i = a,\; q'_j = b, \dots, q'_s$
Algorithm UPPER HULL (sketch)
1. If $n \le 4$, use a brute-force method to determine UH(S).
2. Let $S_1 = (v_1, v_2, \dots, v_{n/2})$ and $S_2 = (v_{n/2+1}, \dots, v_n)$; recursively compute UH(S1) and UH(S2) in parallel ($T_p(n/2)$ time and $2W(n/2)$ operations).
3. Find the upper common tangent between UH(S1) and UH(S2) and deduce UH(S) ($O(\log n)$ sequential time, $O(n)$ operations).
$T_p(n) = T_p(n/2) + O(\log n) = O(\log^2 n)$, $W(n) = O(n \log n)$
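The recursive structure of algorithm UPPER HULL can be sketched sequentially in Python. This is not the $O(\log n)$ tangent search assumed on the slide: the tangent is found here by a brute-force test over all vertex pairs, and the small cases use a simple left-to-right scan. All function names are illustrative, and the points are assumed to be sorted by strictly increasing x.

```python
def cross(o, a, b):
    """Positive if b lies to the left of (above) the directed line o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def small_upper_hull(points):
    """Base case: scan the x-sorted points, dropping those under the chain."""
    hull = []
    for p in points:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def upper_common_tangent(h1, h2):
    """Brute force: find (i, j) such that no vertex lies above line h1[i]-h2[j]."""
    for i in range(len(h1)):
        for j in range(len(h2) - 1, -1, -1):
            a, b = h1[i], h2[j]
            if all(cross(a, b, q) <= 0 for q in h1 + h2):
                return i, j
    raise ValueError("no tangent found; x-coordinates must be strictly increasing")


def upper_hull(points):
    """Divide and conquer on points sorted by strictly increasing x."""
    n = len(points)
    if n <= 4:
        return small_upper_hull(points)
    h1 = upper_hull(points[: n // 2])  # the two recursive calls are independent,
    h2 = upper_hull(points[n // 2:])   # so a PRAM would run them in parallel
    i, j = upper_common_tangent(h1, h2)
    return h1[: i + 1] + h2[j:]        # q1..qi = a, then b = q'j..q's


pts = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 0), (5, 1), (6, -1), (7, 0)]
print(upper_hull(pts))
```

Replacing the brute-force tangent search with a binary search over the two hulls is what brings the combine step down to the $O(\log n)$ time used in the recurrence.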
20 Intractable problems remain intractable in parallel
For an intractable problem (NP-hard) the only known solutions require exponential time:
$T_s = a\,b^n$, $p = n^c$ (polynomial in the size of the input)
From the lower bound: $T_p \ge a\,b^n / n^c \ge a\,(b/2)^n$ for large values of n: still exponential
We consider only the class P, and in particular the class NC ⊆ P. NC is the class of all (decision) problems that can be solved in polylogarithmic parallel time (i.e. $T_p$ is of order $O(\log^k n)$) with a polynomial number of processors.
NC contains the problems that can be efficiently solved in parallel.
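The step from $a\,b^n/n^c$ to an exponential bound can be spelled out as follows. The constant $d$ is an auxiliary symbol introduced for this derivation and does not appear on the slides.

```latex
\[
T_p \;\ge\; \frac{T_s}{p} \;=\; \frac{a\,b^{n}}{n^{c}},
\qquad
\frac{a\,b^{n}}{n^{c}} \;\ge\; a\left(\frac{b}{2}\right)^{n}
\;\iff\; 2^{n} \ge n^{c},
\]
which holds for all sufficiently large $n$. More generally, for any constant $d$
with $1 < d < b$,
\[
\frac{a\,b^{n}}{n^{c}} \;\ge\; a\,d^{\,n}
\;\iff\; \left(\frac{b}{d}\right)^{n} \ge n^{c},
\]
which again holds for large $n$, so $T_p$ grows at least like $d^{\,n}$.
```

The bound on the slide corresponds to the choice $d = b/2$, which is itself exponential when $b > 2$; the general form covers smaller bases as well.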
21
PARALLEL             | SEQUENTIAL
Class NC             | Class P
Efficient algorithm  | Efficient algorithm
NC = P ?             | P = NP ?
- There are problems belonging to P for which NO EFFICIENT PARALLEL algorithm is known.
- There is no proof that such an algorithm does not exist.
P-complete problems     | NP-complete problems
Monotone Circuit Value  | Satisfiability
Goldschlager's Th. (1977) | Cook's Th. (1971)
[Figure: in class P, the problems P1, P2, P3, ..., Ph all reduce to MCV; in class NP, the problems NP1, NP2, NP3, ..., NPk all reduce to SAT]
22 MONOTONE CIRCUIT VALUE PROBLEM (MCVP)
inputs: a, b, c, d, e, f, g
z = (((a AND b) OR c) AND d) AND ((e AND f) OR g)
Determine the value of the single output z of a Boolean circuit consisting of two-valued AND and OR gates and a set of inputs and their complements.
DEPTH FIRST SEARCH
[Figure: directed graph on the nodes a, ..., g with the arcs leaving each node numbered 1, 2, 3]
dfs numbers: a = 1, b = 2, c = 5, d = 3, e = 4, f = 6, g = 7
arcs are numbered according to their order of appearance on the adjacency list
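The circuit for z can be evaluated gate by gate in a single pass, which is the sequential O(n) evaluation that places MCVP in P. In this sketch the gate names `g1`, ..., `g5` and the list-of-tuples encoding are assumptions made for illustration; the gates are listed in topological order.

```python
from itertools import product

# (gate name, operation, left input, right input), in topological order
GATES = [
    ("g1", "AND", "a", "b"),
    ("g2", "OR", "g1", "c"),
    ("g3", "AND", "g2", "d"),
    ("g4", "AND", "e", "f"),
    ("g5", "OR", "g4", "g"),
    ("z", "AND", "g3", "g5"),
]


def eval_monotone_circuit(inputs, gates=GATES, output="z"):
    """Evaluate each gate once, after its inputs: one pass over the circuit."""
    values = dict(inputs)
    for name, op, left, right in gates:
        if op == "AND":
            values[name] = values[left] and values[right]
        elif op == "OR":
            values[name] = values[left] or values[right]
        else:
            raise ValueError(f"not a monotone gate: {op}")
    return values[output]


assignment = dict(a=True, b=False, c=True, d=True, e=False, f=False, g=True)
print(eval_monotone_circuit(assignment))

# Cross-check against the formula on all 2^7 input assignments.
for bits in product([False, True], repeat=7):
    a, b, c, d, e, f, g = bits
    expected = (((a and b) or c) and d) and ((e and f) or g)
    assert eval_monotone_circuit(dict(zip("abcdefg", bits))) == expected
```

The pass is inherently ordered by gate depth, which is the intuition for why no fast parallel evaluation is known in general.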
23 MAX FLOW
[Figure: two drawings of a network with source s and sink t, with arcs labelled by capacities and by flow values; the flow shown has value f = 6]
A directed graph N (network) with two distinguished vertices, source s and sink t; each arc is labelled with its capacity (a positive integer). A flow f is a function such that:
1. $0 \le f(e) \le c(e)$ for all arcs e (capacity constraint)
2. the sum of the flow on all incoming arcs of any node (≠ s, t) is equal to the sum of the flow on all its outgoing arcs (conservation constraint)
The value of the FLOW is given by the sum of the flow on the outgoing arcs of s (equal to the sum of the flow on all incoming arcs of t).
Find the maximum possible value of the flow.
Sequential algorithm: $O(n^3)$. No efficient parallel solution is known.
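The two constraints and the flow value translate directly into a checker. The sketch below only validates a given flow; it does not compute a maximum flow. The function name `flow_value` and the small network with nodes `u` and `v` are hypothetical, since the network in the figure cannot be recovered from the text.

```python
def flow_value(capacity, flow, s, t):
    """Check both constraints and return the value of the flow.

    capacity, flow: dicts mapping an arc (tail, head) to a non-negative integer.
    """
    if not set(flow) <= set(capacity):
        raise ValueError("flow assigned to an arc that is not in the network")
    # 1. capacity constraint: 0 <= f(e) <= c(e) on every arc
    for arc, cap in capacity.items():
        if not 0 <= flow.get(arc, 0) <= cap:
            raise ValueError(f"capacity constraint violated on arc {arc}")
    # 2. conservation constraint at every node other than s and t
    nodes = {node for arc in capacity for node in arc}
    for v in nodes - {s, t}:
        inflow = sum(f for (_, head), f in flow.items() if head == v)
        outflow = sum(f for (tail, _), f in flow.items() if tail == v)
        if inflow != outflow:
            raise ValueError(f"conservation constraint violated at node {v}")
    # value of the flow: total flow on the arcs leaving s
    return sum(f for (tail, _), f in flow.items() if tail == s)


# Hypothetical network, not the one drawn on the slide.
capacity = {("s", "u"): 3, ("s", "v"): 2, ("u", "v"): 1, ("u", "t"): 2, ("v", "t"): 3}
flow = {("s", "u"): 3, ("s", "v"): 2, ("u", "v"): 1, ("u", "t"): 2, ("v", "t"): 3}
print(flow_value(capacity, flow, "s", "t"))
```

In this example every arc is saturated, so the flow of value 5 is also a maximum flow for the hypothetical network.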
24 Parallel Decision Problems
Reducibility notion: let A1 and A2 be decision problems. A1 is NC-reducible to A2 if there exists an NC algorithm that transforms an arbitrary input u1 of A1 into an input u2 of A2, such that A1 answers yes for u1 if and only if A2 answers yes for u2.
A2 is at least as difficult as A1.
A problem A is P-complete if A ∈ P and every problem in the class P is NC-reducible to A.
If A is P-complete:  A ∈ NC iff P = NC
If A is NP-complete: A ∈ P iff P = NP
The hope of finding an efficient parallel algorithm is very low.
To show that a problem A is P-complete:
- A ∈ P
- MCVP is NC-reducible to A
MCVP. Input: an acyclic network of AND and OR gates (two-valued inputs) and an assignment of constant values 0, 1 to each input line. Output: the value of the single output.
25 - Sketch of GOLDSCHLAGER's theorem
- An arbitrary problem A ∈ P can be formulated as an MCVP problem.
- MCVP ∈ P because z can be computed in O(n) sequential time.
- If A ∈ P, A is accepted by a deterministic TM in time T(n), polynomial for any input of size n.
[Figure: TM tape with cell indices 0, 1, ..., n, showing the input and the output positions]
$Q = \{q_1, \dots, q_s\}$: set of states
$\Sigma = \{a_1, \dots, a_m\}$: tape alphabet
$\delta: Q \times \Sigma \to Q \times \Sigma \times \{L, R\}$: transition function
The corresponding Boolean circuit is defined by the following Boolean functions:
1. H(i, t) = 1 if the head is on cell i at time t, $0 \le t \le T(n)$, $1 \le i \le T(n)$.
2. C(i, j, t) = 1 if cell i contains the symbol $a_j$ at time t, $0 \le t \le T(n)$, $1 \le i \le T(n)$, $1 \le j \le m$.
3. S(k, t) = 1 if the state of the TM is $q_k$ at time t, $1 \le k \le s$, $0 \le t \le T(n)$.
Each step of the Turing machine can be described by one level of the circuit computing H(i, t), C(i, j, t) and S(k, t).
26 EX. TM: $Q = \{q_1, q_2, q_3\}$, $\Sigma = \{0, 1\}$
Transition table (rows q1, q2, q3; columns for the symbols 0 and 1):
q1: 1 q2 R and - (undefined)
q2: 0 q3 R and 1 q2 L
q3: 0 q3 L and 1 q3 R
[Figure: initial tape, cells marked 1 1]
t = 0
H(i, 0) = 1 if i = 1; 0 if 2 ≤ i ≤ n
C(i, j, 0) = 1 if i = 1, j = 2; 0 if i = 1, j ≠ 2
S(k, 0) = 1 if k = 1; 0 if k ≠ 1
[Figure: tape cells i-1, i, i+1]
t > 0
H(i, t) = (H(i-1, t-1) AND right shift) OR (H(i+1, t-1) AND left shift)
left shift = (S(2, t-1) AND C(i+1, 2, t-1)) OR (S(3, t-1) AND C(i+1, 1, t-1))
Analogously compute C(i, j, t) and S(k, t).
The circuit value is given by C(1, , T(n)), and the circuit itself can be constructed in O(log n) time with a quadratic number of processors.
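The level-by-level construction of H, C and S can be sketched in Python. The machine used below is hypothetical (a single state that flips each bit and moves right), not the example TM of the slide, whose table is only partly legible. The sketch assumes a total transition function and a head that never leaves the tape, and it uses Python's `not` where a monotone circuit would carry the complemented signal on a separate wire.

```python
def tableau(delta, tape, steps, n_states, n_symbols, start_state=1):
    """Compute H(i,t), C(i,j,t), S(k,t) level by level from level t-1.

    delta: {(state k, symbol j): (written symbol, new state, move)},
           with move = +1 for R and -1 for L (hypothetical encoding).
    tape:  symbol indices of cells 1..len(tape) at time 0.
    """
    cells = range(1, len(tape) + 1)
    symbols = range(1, n_symbols + 1)
    states = range(1, n_states + 1)
    H = {(i, 0): i == 1 for i in cells}
    C = {(i, j, 0): tape[i - 1] == j for i in cells for j in symbols}
    S = {(k, 0): k == start_state for k in states}
    for t in range(1, steps + 1):
        def fires(i, k, j):
            # At time t-1 the head is on cell i, in state q_k, reading a_j.
            return i in cells and H[i, t - 1] and S[k, t - 1] and C[i, j, t - 1]
        for i in cells:
            # The head reaches cell i from cell i - move.
            H[i, t] = any(fires(i - move, k, j)
                          for (k, j), (_, _, move) in delta.items())
            for j2 in symbols:
                unchanged = (not H[i, t - 1]) and C[i, j2, t - 1]
                written = any(fires(i, k, j)
                              for (k, j), (w, _, _) in delta.items() if w == j2)
                C[i, j2, t] = unchanged or written
        for k2 in states:
            S[k2, t] = any(fires(i, k, j) for i in cells
                           for (k, j), (_, nk, _) in delta.items() if nk == k2)
    return H, C, S


# Hypothetical machine: symbol 1 stands for '0', symbol 2 for '1'.
delta = {(1, 1): (2, 1, +1), (1, 2): (1, 1, +1)}
H, C, S = tableau(delta, [2, 1, 2, 2], steps=3, n_states=1, n_symbols=2)
final_tape = [next(j for j in (1, 2) if C[i, j, 3]) for i in range(1, 5)]
print(final_tape, [i for i in range(1, 5) if H[i, 3]])
```

Each level depends only on the previous one, so the whole tableau has T(n) levels of polynomial width, which is the shape of the circuit described above.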
27 THE PRAM IS A THEORETICAL (UNFEASIBLE) MODEL
- The interconnection network between processors and memory would require a very large amount of area.
- Message routing on the interconnection network would require time proportional to the network size (i.e. the assumption of a constant access time to the memory is not realistic).
WHY IS THE PRAM A REFERENCE MODEL?
- Algorithm designers can forget the communication problems and focus their attention on the parallel computation only.
- There exist algorithms simulating any PRAM algorithm on bounded-degree networks.
- E.g. a PRAM algorithm requiring time T(n) can be simulated on a mesh of trees in time $T(n) \log^2 n / \log\log n$, that is, each step can be simulated with a slow-down of $\log^2 n / \log\log n$.
- Instead of designing ad hoc algorithms for bounded-degree networks, design more general algorithms for the PRAM model and simulate them on a feasible network.
28 - For the PRAM model there exists a well-developed body of techniques and methods to handle different classes of computational problems.
- The discussion on parallel models of computation is still HOT.
- The current trend: COARSE-GRAINED MODELS (BSP, LogP)
  - The degree of parallelism allowed is independent of the number of processors.
  - The computation is divided into supersteps, each of which includes:
    - local computation
    - a communication phase
    - a synchronization phase
The study is still at the beginning!