PARALLEL AND DISTRIBUTED COMPUTATION (Slides by L. Pagli)

Slides: 29

Provided by: Rober905

1
PARALLEL AND DISTRIBUTED COMPUTATION (Slides by L. Pagli)

MANY INTERCONNECTED PROCESSORS WORKING CONCURRENTLY

[Figure: processors P1, P2, P3, P4, P5, ..., Pn attached to an INTERCONNECTION NETWORK]

CONNECTION MACHINE (THINKING MACHINES CORP.): 64,000 processors

INTERNET: connects all the computers of the world

THREE TYPES OF MULTIPROCESSING FRAMEWORKS, CLOSELY RELATED
- CONCURRENT
- PARALLEL
  - PRAM
  - Bounded-degree network and VLSI
- DISTRIBUTED

MULTIPROCESSING ACTIVITIES TAKE PLACE IN A SINGLE MACHINE (POSSIBLY USING SEVERAL PROCESSORS), SHARING MEMORY AND TASKS.
TECHNICAL ASPECTS
- PARALLEL COMPUTERS (USUALLY) WORK IN TIGHT SYNCHRONY, SHARE MEMORY TO A LARGE EXTENT AND HAVE A VERY FAST AND RELIABLE COMMUNICATION MECHANISM BETWEEN THEM.
- DISTRIBUTED COMPUTERS ARE MORE INDEPENDENT, COMMUNICATION IS LESS FREQUENT AND LESS SYNCHRONOUS, AND THE COOPERATION IS LIMITED.
PURPOSES
- PARALLEL COMPUTERS COOPERATE TO SOLVE (POSSIBLY) DIFFICULT PROBLEMS MORE EFFICIENTLY.
- DISTRIBUTED COMPUTERS HAVE INDIVIDUAL GOALS AND PRIVATE ACTIVITIES. SOMETIMES COMMUNICATION WITH OTHER COMPUTERS IS NEEDED (E.G. DISTRIBUTED DATABASE OPERATIONS).

FOR PARALLEL SYSTEMS
WE ARE INTERESTED IN SOLVING ANY PROBLEM IN PARALLEL
FOR DISTRIBUTED SYSTEMS
WE ARE INTERESTED IN SOLVING IN PARALLEL PARTICULAR PROBLEMS ONLY. TYPICAL EXAMPLES ARE:
- COMMUNICATION SERVICES
  - ROUTING
  - BROADCASTING
- MAINTENANCE OF CONTROL STRUCTURE
  - SPANNING TREE CONSTRUCTION
  - TOPOLOGY UPDATE
  - LEADER ELECTION
- RESOURCE CONTROL ACTIVITIES

4
PARALLEL ALGORITHMS

- WHICH MODEL OF COMPUTATION IS THE BEST TO USE?
- HOW MUCH TIME DO WE EXPECT TO SAVE USING A PARALLEL ALGORITHM?
- HOW DO WE CONSTRUCT EFFICIENT ALGORITHMS?
MANY CONCEPTS OF COMPLEXITY THEORY MUST BE REVISITED
- IS PARALLELISM A SOLUTION FOR HARD PROBLEMS?
- ARE THERE PROBLEMS NOT ADMITTING AN EFFICIENT PARALLEL SOLUTION, THAT IS, INHERENTLY SEQUENTIAL PROBLEMS?

5
PRAM MODEL

Joseph JáJá, An Introduction to Parallel Algorithms, Addison-Wesley, 1992.
R. M. Karp, V. Ramachandran, A survey of parallel algorithms for shared-memory machines, in J. van Leeuwen (Ed.), Handbook of Theoretical Computer Science.
Ian Parberry, Parallel Complexity Theory, Research Notes in Theoretical Computer Science, John Wiley & Sons, 1987.
TO FOCUS ON ALGORITHMIC ISSUES INDEPENDENTLY OF PHYSICAL LOCATIONS

6
[Figure: processors P1, P2, ..., Pi, ..., Pn connected to a Common Memory with cells 1, 2, 3, ..., m]
PRAM: n RAM processors, numbered from 1 to n and connected to a common memory of m cells.
ASSUMPTION: at each time unit each Pi can read a memory cell, make an internal computation and write another memory cell.
CONSEQUENCE: any pair of processors Pi, Pj can communicate in constant time! Pi writes the message in cell x at time t; Pj reads the message in cell x at time t + 1.
7

ASSUMPTIONS
Shared memory: the array A is stored in the global memory and can be accessed by any processor.
Synchronous mode of operation: in each unit of time, each processor is allowed to execute an instruction or to stay idle.
There are several variations regarding the handling of simultaneous access to the same memory location:
- EREW-PRAM (exclusive read, exclusive write)
- CREW-PRAM (concurrent read, exclusive write)
- CRCW-PRAM (concurrent read, concurrent write), with a policy to resolve concurrent writes: Common, Priority, Arbitrary
The three models do not differ substantially in their computational power!
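The three write policies of the CRCW-PRAM can be illustrated with a small sketch. The function name `resolve_concurrent_write` and the convention that the lowest processor index has the highest priority are assumptions made for this example, not part of the slides.

```python
def resolve_concurrent_write(requests, policy):
    """Return the value left in one memory cell after a CRCW write step.

    requests: list of (processor_id, value) pairs aimed at the same cell.
    policy:   "common", "priority" or "arbitrary".
    """
    if not requests:
        raise ValueError("no write requests")
    if policy == "common":
        # All writers must agree on the value, otherwise the step is illegal.
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("Common CRCW: processors wrote different values")
        return values.pop()
    if policy == "priority":
        # Assumption: the lowest processor index has the highest priority.
        return min(requests, key=lambda request: request[0])[1]
    if policy == "arbitrary":
        # Any single writer may succeed; this sketch simply picks the first.
        return requests[0][1]
    raise ValueError(f"unknown policy: {policy}")
```

An algorithm written for the Arbitrary policy must be correct whichever writer wins, so picking the first request here is only one of the admissible outcomes.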
If each processor can execute its own local
program we have a

From Bertossi, Chap. 27
Summation, n log n
Summation, n
R. Grossi

9
Important parameters of the efficiency of a parallel algorithm:
$T_p(n)$ (or $T_p$): parallel time
$p(n)$ (or $p$): number of processors
LOWER BOUND of the parallel computation
Let A be a problem and let $T_s$ be the complexity of the optimal (or the best known) sequential algorithm for A. Then $T_p \ge T_s / p$.
$C_n = T_p \cdot p$: cost of the parallel algorithm
The parallel algorithm can be converted into a sequential algorithm that runs in $O(C_n)$ time: the single processor simulates the p processors in p steps for each of the $T_p$ parallel steps. If the parallel time were less than $T_s / p$, we could derive a sequential algorithm better than the optimal one!
10
Parallel algorithm
time            1     2     3
processor P1    op1   op2   op3
processor P2    op4   op5
$T_p = 3$, $p = 2$, $C = 6$
It can be simulated by a single processor in a number of steps (time) $\le 6$.
Sequential algorithm
time    1     2     3     4     5
        op1   op4   op2   op5   op3
$T_s = 5$
$T_p \ge T_s / p$
$T_s / T_p$: speed-up of the parallel algorithm
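The simulation argument can be checked on this example with a minimal sketch. The list-of-lists encoding of the schedule and the helper name `simulate_sequentially` are assumptions of the example.

```python
def simulate_sequentially(schedule):
    """schedule[t] lists the operations run in parallel at time unit t + 1."""
    order = []
    for step in schedule:
        # The single processor executes the operations of one parallel
        # step one after the other, using at most p sequential steps.
        order.extend(step)
    return order


schedule = [["op1", "op4"], ["op2", "op5"], ["op3"]]
p = max(len(step) for step in schedule)   # processors used
Tp = len(schedule)                        # parallel time
order = simulate_sequentially(schedule)
print(order, len(order), "<=", p * Tp)
```

The sequential run takes one step per operation, so its length can never exceed the cost p * Tp.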
11
MAXIMUM on the PRAM
Input: an array A of $n = 2^k$ elements in the shared memory of a PRAM with n/2 processors.
Output: the maximum element, stored in location S.
Algorithm MAX
begin
  for j = 1 to log n do
    for all i where $1 \le i \le n/2^j$ do in parallel
      A[i] = max(A[2i], A[2i-1])
  S = A[1]
end
[Figure: binary tree of comparisons over A(1), ..., A(8), with processors P1 to P4 on the first level, P1 and P2 on the second, P1 at the root, and the result in S]
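A sequential simulation of algorithm MAX can make the rounds explicit. This is a sketch under the assumption that every round reads all cells before any cell is written, which is how the synchronous PRAM behaves; the function name `pram_max` is made up for the example.

```python
def pram_max(values):
    """Simulate algorithm MAX; len(values) must be a power of two.

    Returns (maximum, number of parallel rounds).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    a = [None] + list(values)      # 1-based indexing, as on the slide
    rounds = 0
    active = n // 2                # processors working in the current round
    while active >= 1:
        old = a[:]                 # all processors read before anyone writes
        for i in range(1, active + 1):   # conceptually executed in parallel
            a[i] = max(old[2 * i], old[2 * i - 1])
        active //= 2
        rounds += 1
    return a[1], rounds


print(pram_max([3, 7, 1, 8, 2, 6, 5, 4]))
```

For n = 8 the loop runs three rounds with 4, 2 and 1 active processors, matching the log n parallel time.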
12
From the previous lower bound and the sequential computation: $C = T_p \cdot p \ge n$
From algorithm MAX: $C = T_p \cdot p = O(n \log n)$, not optimal

Better algorithm
Divide the n elements into $k = n / \log n$ subsets of $\log n$ elements each.

[Figure: processors P1, P2, P3, ..., Pk, each producing the local maximum m1, m2, m3, ..., mk of its subset]

Each processor computes the maximum $m_i$ of its subset with the sequential algorithm in time $O(\log n)$.
Algorithm MAX is executed among the local maxima, in time $O(\log(n / \log n)) = O(\log n - \log\log n) = O(\log n)$.

Overall time $T_p = O(\log n)$ and $p = n / \log n$

$C = T_p \cdot p = O(n)$: optimal
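The two phases of the better algorithm can be sketched as follows. The function name `blocked_max`, the rounding of log n to an integer block size, and the step counting are assumptions of this example.

```python
import math


def blocked_max(values):
    """Return (maximum, processors used, parallel time in steps)."""
    n = len(values)
    if n == 0:
        raise ValueError("empty input")
    # Assumption: block size is floor(log2 n), and at least 1.
    block = max(1, int(math.log2(n)))
    blocks = [values[i:i + block] for i in range(0, n, block)]
    # Phase 1: each processor scans its own block sequentially.
    local_maxima = [max(b) for b in blocks]
    phase1_time = block
    # Phase 2: algorithm MAX among the local maxima, halving every round.
    phase2_time = 0
    while len(local_maxima) > 1:
        local_maxima = [max(local_maxima[i:i + 2])
                        for i in range(0, len(local_maxima), 2)]
        phase2_time += 1
    return local_maxima[0], len(blocks), phase1_time + phase2_time


maximum, processors, time = blocked_max(list(range(16)))
print(maximum, processors, time, "cost:", processors * time)
```

With n = 16 this uses 4 processors for 6 steps, a cost of 24, whereas the plain tournament uses 8 processors for 4 steps, a cost of 32.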
13
PERFORMANCE OF PARALLEL ALGORITHMS
Four ways of measuring the performance of a parallel algorithm:
1. $P(n)$ processors and $T_p(n)$ time.
2. $C(n) = P(n)\,T_p(n)$ cost and $T_p(n)$ time.
The number of processors depends on the size n of the problem. The second relation can be generalized to any number $p \le P(n)$ of processors: each of the $T_p$ parallel steps can be simulated by the p processors in $O(P(n)/p)$ substeps, so this simulation takes a total of $O(T_p(n)\,P(n)/p)$ time.
3. $O(T_p(n)\,P(n)/p)$ time for any number $p \le P(n)$ of processors.
If the number of processors p is larger than $P(n)$, we can clearly achieve the running time $T_p(n)$ by using $P(n)$ processors only. Relation 3 can be further generalized.
4. $O(C(n)/p + T_p(n))$ time for any number p of processors.
In conclusion, in the design of a PRAM algorithm we can assume as many processors as we need and use the proper relation to analyze it.
14
PERFORMANCE OF ALGORITHM MAX
1. $P(n) = n/2$ processors and $T_p(n) = O(\log n)$ time.
2. $C(n) = P(n)\,T_p(n) = O(n \log n)$ cost and $T_p(n) = O(\log n)$ time.
Assume $p = \log n$ processors:
3. $O(T_p(n)\,P(n)/p) = O(\log n \cdot n / \log n) = O(n)$ time.
Therefore
4. $O(n \log n / p + \log n)$ time. If $p \ge n$, $O(\log n)$ time, otherwise $O(n \log n / p)$ time.
Work $W(n)$ of a parallel algorithm: total number of operations used.
Work of algorithm MAX:
$W(n) = \sum_{j=1}^{\log n} n/2^j + 1 = O(n)$
$W(n) \le C(n)$
$W(n)$ measures the total number of operations and has nothing to do with the number of processors available, while $C(n)$ measures the cost of the algorithm relative to the number p of processors available.
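The gap between work and cost for algorithm MAX can be computed directly. This is a small sketch; the function name `max_work_and_cost` is an assumption, and n is taken to be a power of two.

```python
import math


def max_work_and_cost(n):
    """Work and cost of algorithm MAX on n = 2^k elements."""
    log_n = int(math.log2(n))
    # Round j performs n / 2^j comparisons; the final copy to S adds one.
    work = sum(n // 2 ** j for j in range(1, log_n + 1)) + 1
    # Cost charges all n/2 processors for every one of the log n rounds.
    cost = (n // 2) * log_n
    return work, cost


for n in (8, 1024):
    print(n, max_work_and_cost(n))
```

The work grows linearly in n while the cost grows like n log n, because most processors are idle in the later rounds.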
15
Work-time presentation of a parallel algorithm: any number of parallel operations at each time unit is allowed.
BRENT PRINCIPLE: given a parallel algorithm that runs in time $T(n)$ and requires $W(n)$ work, we can adapt this algorithm to run on a p-processor PRAM in time

$T_p(n) \le \lfloor W(n)/p \rfloor + T(n)$
Let $W_i(n)$ be the number of operations of time unit i, $1 \le i \le T(n)$. Simulate each set of $W_i(n)$ operations in $\lceil W_i(n)/p \rceil$ parallel steps of the p processors, for each $1 \le i \le T(n)$. The p-processor PRAM algorithm takes

$\le \sum_i \lceil W_i(n)/p \rceil \le \sum_i (\lfloor W_i(n)/p \rfloor + 1) \le \lfloor W(n)/p \rfloor + T(n)$.
The Brent Principle assumes that the scheduling of the operations to the processors is always a trivial task. This is not always true. It is easy if we use $C(n)$ in place of $W(n)$.
16
[Figure: algorithm A1 as a schedule of 36 numbered operations over time units t1, ..., t6; operations per time unit $W_i$ = 6, 7, 3, 8, 5, 7; $T_1 = 6$, $W(n) = 36$]
A1 can be simulated by A2 with 3 processors in time $T_2(n) \le 36/3 + 6 = 18$.
[Figure: the 36 operations of A1 rescheduled on 3 processors; $T_2 = 14$]
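The numbers of this example can be reproduced with a short sketch of the Brent schedule: each time unit with W_i operations is split into ceil(W_i / p) steps. The function name `brent_schedule_time` is an assumption of the example.

```python
def brent_schedule_time(work_per_step, p):
    """Steps needed when each time unit is simulated by p processors."""
    # -(-w // p) is the integer ceiling of w / p.
    return sum(-(-w // p) for w in work_per_step)


W = [6, 7, 3, 8, 5, 7]            # operations per time unit of A1
p = 3
T2 = brent_schedule_time(W, p)    # actual simulated time
bound = sum(W) // p + len(W)      # floor(W(n)/p) + T(n)
print(T2, "<=", bound)
```

The simulated time stays below the bound because only the time units whose W_i is not a multiple of p pay the extra step.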
17

From Bertossi, Chap. 27
Basic techniques
Prefix sums
Non-optimal sorting
List ranking with pointer jumping
Euler tour
R. Grossi

18
PARALLEL DIVIDE AND CONQUER

- Partition the input into several subsets of almost equal size
- Solve recursively the subproblem defined by each subset
- Combine the solutions of the subproblems into a solution to the overall problem

CONVEX HULL
Sequential algorithm: $O(n \log n)$
[Figure: convex hull with vertices v1, ..., v7 and points p and q, split into the UPPER HULL and the LOWER HULL]

- Sort the x-coordinates in increasing order
- Compute the UPPER HULL

$T_p = O(\log n)$
19
W.l.o.g., let $x(v_1) < x(v_2) < \dots < x(v_n)$, $n = 2^k$.

Divide the points into two subsets $S_1 = (v_1, v_2, \dots, v_{n/2})$ and $S_2 = (v_{n/2+1}, \dots, v_n)$.

Suppose the UPPER HULLs of $S_1$ and $S_2$ are already computed.
[Figure: upper hulls of S1 and S2 with their vertices q1, q2, ... and the upper common tangent ab, with a on the hull of S1 and b on the hull of S2]
Compute ab, the upper common tangent.

The UPPER HULL of S is formed by

$q_1, \dots, q_i = a, q_j = b, \dots, q_s$
Algorithm UPPER HULL (Sketch)
1. If $n \le 4$, use a brute force method to determine UH(S).
2. Let $S_1 = (v_1, v_2, \dots, v_{n/2})$, $S_2 = (v_{n/2+1}, \dots, v_n)$; recursively compute UH(S1) and UH(S2) in parallel ($T_p(n/2)$ time and $2W(n/2)$ operations).
3. Find the Upper Common Tangent between UH(S1) and UH(S2) and deduce UH(S) ($O(\log n)$ sequential time, $O(n)$ operations).
$T_p(n) = T_p(n/2) + O(\log n) = O(\log^2 n)$
$W(n) = 2W(n/2) + O(n) = O(n \log n)$
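The structure of algorithm UPPER HULL can be sketched sequentially. Two simplifications are assumed here: the upper common tangent is found by brute force over all pairs instead of the O(log n) search the analysis relies on, and the two recursive calls run one after the other. All function names are made up for the example, and the points are assumed sorted by strictly increasing x.

```python
def cross(o, a, b):
    """> 0 if b lies to the left of the directed line o -> a (above it
    when o is left of a), < 0 if to the right, 0 if collinear."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def brute_force_upper_hull(points):
    """Keep v unless it lies on or below a segment uw with u.x < v.x < w.x."""
    hull = []
    for v in points:
        dominated = any(u[0] < v[0] < w[0] and cross(u, w, v) <= 0
                        for u in points for w in points)
        if not dominated:
            hull.append(v)
    return hull


def upper_common_tangent(h1, h2):
    """Indices (i, j) such that every point lies on or below line h1[i] h2[j].

    Brute force stand-in for the logarithmic search; it prefers the
    leftmost a and the rightmost b so collinear points are dropped.
    """
    everything = h1 + h2
    for i, a in enumerate(h1):
        for j in range(len(h2) - 1, -1, -1):
            b = h2[j]
            if all(cross(a, b, q) <= 0 for q in everything):
                return i, j
    raise ValueError("no tangent found")


def upper_hull(points):
    n = len(points)
    if n <= 4:
        return brute_force_upper_hull(points)
    mid = n // 2
    h1 = upper_hull(points[:mid])   # independent subproblems: a PRAM
    h2 = upper_hull(points[mid:])   # would solve them in parallel
    i, j = upper_common_tangent(h1, h2)
    return h1[:i + 1] + h2[j:]      # q1..qi = a, then qj = b..qs


pts = [(0, 0), (1, 3), (2, 1), (3, 4), (4, 0), (5, 2), (6, 1), (7, 0)]
print(upper_hull(pts))
```

The merge step mirrors step 3 of the sketch: everything between a and b is discarded, and the rest of the two hulls is kept unchanged.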

20
Intractable problems remain intractable in parallel
For an intractable problem (NP-hard) the only known solutions require exponential time:
$T_s = a\,b^n$, $p = n^c$ (polynomial in the size of the input)
From the lower bound: $T_p \ge a\,b^n / n^c \ge a\,(b/2)^n$ for large values of n, which is still exponential.
We consider only the class P, and in particular the class $NC \subseteq P$. NC is the class of all (decision) problems that can be solved in polylog parallel time (i.e. $T_p$ is of order $O(\log^k n)$) with a polynomial number of processors.
NC contains the problems that can be efficiently solved in parallel.
21
PARALLEL                SEQUENTIAL
Class NC                Class P
Efficient Algorithm     Efficient Algorithm
NC = P ?                P = NP ?

There are problems belonging to P for which NO EFFICIENT PARALLEL algorithm is known.
There is no proof that such an algorithm does not exist.

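P-complete Problems              NP-complete Problems
Monotone Circuit Value (MCV)     Satisfiability (SAT)
Goldschlager's Theorem (1977)    Cook's Theorem (1971)
[Figure: class P with problems P1, P2, P3, ..., Ph all reducing to MCV; class NP with problems NP1, NP2, NP3, ..., NPk all reducing to SAT]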
22
MONOTONE CIRCUIT VALUE PROBLEM (MCVP)
Inputs: a, b, c, d, e, f, g
z = (((a AND b) OR c) AND d) AND ((e AND f) OR g)
Determine the value z of the single output of a Boolean circuit consisting of two-valued AND and OR gates and a set of inputs and their complements.
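The circuit for z can be evaluated gate by gate in topological order, which is the O(n) sequential evaluation. The gate names g1 to g5 and the tuple encoding are assumptions introduced for this sketch.

```python
def evaluate_circuit(inputs, gates):
    """Evaluate AND/OR gates listed in topological order.

    inputs: dict mapping input names to booleans.
    gates:  list of (name, op, x, y); the last gate is the output.
    """
    value = dict(inputs)
    for name, op, x, y in gates:
        if op == "AND":
            value[name] = value[x] and value[y]
        elif op == "OR":
            value[name] = value[x] or value[y]
        else:
            raise ValueError(f"unknown gate type: {op}")
    return value[gates[-1][0]]


# z = (((a AND b) OR c) AND d) AND ((e AND f) OR g)
GATES = [
    ("g1", "AND", "a", "b"),
    ("g2", "OR", "g1", "c"),
    ("g3", "AND", "g2", "d"),
    ("g4", "AND", "e", "f"),
    ("g5", "OR", "g4", "g"),
    ("z", "AND", "g3", "g5"),
]

assignment = dict(a=True, b=True, c=False, d=True, e=False, f=True, g=True)
print(evaluate_circuit(assignment, GATES))
```

Each gate is visited once, so the running time is linear in the size of the circuit; the difficulty lies in doing this in polylogarithmic parallel time when the circuit is deep.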
DEPTH FIRST SEARCH
dfs numbers: a = 1, b = 2, c = 5, d = 3, e = 4, f = 6, g = 7
[Figure: directed graph on the vertices a, ..., g, with arcs numbered according to the order of appearance on the adjacency list]
23
MAX FLOW
[Figure: example network with source s and sink t, shown with its arc capacities and with a flow of value f = 6]
A directed graph N (network) with two distinguished vertices, source s and sink t; each arc is labelled with its capacity (a positive integer). A flow f is a function such that:
1. $0 \le f(e) \le c(e)$ for all arcs e (capacity constraint)
2. the sum of the flow on all incoming arcs of any node ($\ne s, t$) is equal to the sum of the flow on all outgoing arcs (conservation constraint)
The value of the FLOW is given by the sum of the flow on the outgoing arcs of s (equal to the sum of the flow on all incoming arcs of t).
Find the maximum possible value of the flow.
Sequential algorithm: $O(n^3)$. No efficient parallel solution is known.
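For a concrete view of the problem, here is a minimal sequential sketch using shortest augmenting paths (the Edmonds-Karp method). It is not the O(n^3) algorithm mentioned on the slide, and the small network is hypothetical, since the capacities of the figure are not recoverable.

```python
from collections import deque


def max_flow(capacity, s, t):
    """capacity: dict mapping arcs (u, v) to positive integers; s != t."""
    residual = {}
    neighbours = {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)          # reverse arc for undoing flow
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    value = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in neighbours.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value                        # no augmenting path is left
        # Collect the path, find its bottleneck, update the residual network.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[arc] for arc in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        value += bottleneck


# Hypothetical network, not the one drawn on the slide.
network = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
           ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(network, "s", "t"))
```

Every augmentation depends on the residual network left by the previous one, which gives some intuition for why the problem resists efficient parallelization.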
24
Parallel Decision Problems
Reducibility notion: let A1 and A2 be decision problems. A1 is NC-reducible to A2 if there exists an NC algorithm that transforms an arbitrary input u1 of A1 into an input u2 of A2, such that A1 answers yes for u1 if and only if A2 answers yes for u2.
A2 is at least as difficult as A1.
A problem A in P is P-complete if every problem in the class P is NC-reducible to A.
If A is P-complete: $A \in NC$ iff $P = NC$
If A is NP-complete: $A \in P$ iff $P = NP$
The hope of finding an efficient parallel algorithm is very low.
To show that a problem A is P-complete:
- $A \in P$
- MCVP is NC-reducible to A
MCVP. Input: an acyclic network of AND and OR gates (two-valued inputs) and an assignment of constant values 0, 1 to each input line. Output: the value of the single output.
25

Sketch of GOLDSCHLAGER's theorem
An arbitrary problem $A \in P$ can be formulated as an MCVP problem.
$MCVP \in P$ because z can be computed in $O(n)$ sequential time.
If $A \in P$, A is accepted by a deterministic TM in time $T(n)$, polynomial in the input size n.
[Figure: Turing machine tape with cells 1, ..., n, with input and output marked]
$Q = \{q_1, \dots, q_s\}$: set of states
$\Sigma = \{a_1, \dots, a_m\}$: tape alphabet
$\delta: Q \times \Sigma \to Q \times \Sigma \times \{L, R\}$: transition function

The corresponding Boolean circuit is defined by the following Boolean functions:
1. $H(i, t) = 1$ if the head is on cell i at time t, $0 \le t \le T(n)$, $1 \le i \le T(n)$.
2. $C(i, j, t) = 1$ if cell i contains the symbol $a_j$ at time t, $0 \le t \le T(n)$, $1 \le i \le T(n)$, $1 \le j \le m$.
3. $S(k, t) = 1$ if the state of the TM is $q_k$ at time t, $1 \le k \le s$, $0 \le t \le T(n)$.
Each step of the Turing machine can be described by one level of the circuit computing $H(i, t)$, $C(i, j, t)$ and $S(k, t)$.
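The level-by-level construction can be sketched by computing the three families of Boolean values for a tiny machine. Everything concrete here is an assumption of the example: the function name `tableau`, the encoding of moves as -1 and +1, and the one-state bit-flipping machine. The sketch also uses `not` for brevity, so it illustrates the layering of the circuit rather than its monotone form, and it assumes the head never leaves the tape.

```python
def tableau(delta, tape, num_states, num_symbols, steps):
    """Compute H(i,t), C(i,j,t), S(k,t) for t = 0..steps, level by level.

    delta: dict (k, j) -> (written symbol index, next state, move),
           with move = +1 for R and -1 for L.
    tape:  symbol indices of cells 1..n at time 0; head starts on cell 1
           in state q1.
    """
    n = len(tape)
    cells = range(1, n + 1)
    symbols = range(1, num_symbols + 1)
    states = range(1, num_states + 1)
    H = [{i: i == 1 for i in cells}]
    C = [{(i, j): tape[i - 1] == j for i in cells for j in symbols}]
    S = [{k: k == 1 for k in states}]
    for _ in range(steps):
        h, c, s = H[-1], C[-1], S[-1]

        def fires(i, k, j):
            # The TM is in state q_k reading a_j on cell i at time t - 1.
            return 1 <= i <= n and h[i] and s[k] and c[(i, j)]

        # Head reaches cell i by moving from cell i - move.
        H.append({i: any(fires(i - move, k, j)
                         for (k, j), (_, _, move) in delta.items())
                  for i in cells})
        # A cell keeps its symbol unless the head rewrites it.
        C.append({(i, j2): (not h[i] and c[(i, j2)]) or
                           any(fires(i, k, j)
                               for (k, j), (w, _, _) in delta.items() if w == j2)
                  for i in cells for j2 in symbols})
        S.append({k2: any(fires(i, k, j) for i in cells
                          for (k, j), (_, q, _) in delta.items() if q == k2)
                  for k2 in states})
    return H, C, S


# Hypothetical machine: one state, flips a1 <-> a2 and moves right.
flip = {(1, 1): (2, 1, +1), (1, 2): (1, 1, +1)}
H, C, S = tableau(flip, tape=[1, 2, 2, 1], num_states=1, num_symbols=2, steps=3)
head = [i for i in range(1, 5) if H[3][i]]
final_tape = [j for i in range(1, 5) for j in (1, 2) if C[3][(i, j)]]
print(head, final_tape)
```

Level t is computed only from the values of level t - 1, which is the property that turns the computation of the machine into a circuit of depth T(n).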
26
EX: TM
$Q = \{q_1, q_2, q_3\}$, $\Sigma = \{0, 1\}$
Transition table over the symbols 0 and 1 (each entry: written symbol, next state, head move):
q1: 1q2R, -
q2: 0q3R, 1q2L
q3: 0q3L, 1q3R
Tape: 1 1
t = 0:
$H(i, 0) = 1$ if $i = 1$; $0$ if $2 \le i \le n$
$C(i, j, 0) = 1$ if $i = 1, j = 2$; $0$ if $i = 1, j \ne 2$
$S(k, 0) = 1$ if $k = 1$; $0$ if $k \ne 1$
t > 0 (cells i-1, i, i+1):
$H(i, t)$ = ($H(i-1, t-1)$ AND right shift) OR ($H(i+1, t-1)$ AND left shift)
left shift = ($S(2, t-1)$ AND $C(i+1, 2, t-1)$) OR ($S(3, t-1)$ AND $C(i+1, 1, t-1)$)
Analogously compute $C(i, j, t)$ and $S(k, t)$.
The circuit value is given by C(1, , T(n)) and can be computed in $O(\log n)$ time with a quadratic number of processors.
27
THE PRAM IS A THEORETICAL (UNFEASIBLE) MODEL