1
Introduction to Parallelism
by Shietung Peng
2
Parallel Computers
  • In an ordinary computer there is only one
    processor. To improve speed, we would like to
    have many processors in one computer; such a
    computer is called a parallel computer.
  • This parallelism can be illustrated by the
    character-recognition application shown in the
    following figure.

3
Character Recognition
  • A computer's ability to understand hand-written
    script is called character recognition. To
    achieve a desirable error-tolerance capacity,
    the processing speed of the computer must be
    increased.
  • Character recognition involves intensive
    computation. A hand-written character is scanned
    by a camera as an image of very high resolution,
    such as 14,400 pixels per inch (see the figure
    on the next page).

4
Character Recognition
5
Parallel Processing Concepts
  • In order to solve a problem using a parallel
    computer, one must decompose the problem into
    small sub-problems, which can be solved in
    parallel. Then these results must be efficiently
    combined to get the final result of the main
    problem.
  • Because of data dependency among the
    sub-problems, it is not easy to decompose a large
    problem properly.

6
Parallel Processing Concepts
  • Because of data dependency, the processors may
    have to communicate with one another.
  • The time taken for communication is very high
    when compared with the processing time.
  • The communication scheme should be very well
    planned in order to get a good parallel algorithm.

7
A Motivating Example
  • A major issue in devising a parallel algorithm
    for a given problem is the way in which the
    computational load is divided among the
    processors.
  • The most efficient scheme often depends on the
    problem and the architecture of the parallel
    computer.
  • We consider the problem of constructing the list
    of prime numbers in the interval [1, n].

8
A Simple Algorithm: The Sieve of Eratosthenes
  • Start with the list of numbers 1, 2, 3, …, n
    represented as a mark bit-vector initialized to
    1000…0. In each step, the next unmarked number m
    (associated with a 0 in element m of the mark
    bit-vector) is a prime. Find this number m and
    mark all multiples of m beginning with m². When
    m² > n, the computation stops and all unmarked
    elements are prime numbers.

9
The Computational Steps for n = 30
10
The Sequential Algorithm
  • The variable current prime is initialized to 2
    and, at later stages, holds the latest prime
    number found. For each prime found, index is
    initialized to the square of this prime and is
    then incremented by the current prime in order to
    mark all its multiples.
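The procedure described above can be sketched in Python; the names current_prime and index follow the slide's description, and the n = 30 case reproduces the earlier example.

```python
def sieve(n):
    """Sieve of Eratosthenes over a mark bit-vector, as described above."""
    mark = [False] * (n + 1)   # False = unmarked; element m stands for number m
    mark[1] = True             # 1 is marked from the start (it is not a prime)
    current_prime = 2
    while current_prime * current_prime <= n:
        # Mark all multiples of current_prime, beginning with its square.
        index = current_prime * current_prime
        while index <= n:
            mark[index] = True
            index += current_prime
        # The next unmarked number is the next prime.
        current_prime += 1
        while mark[current_prime]:
            current_prime += 1
    return [m for m in range(2, n + 1) if not mark[m]]

print(sieve(30))  # the n = 30 example from the earlier slide
```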

11
The First Parallel Algorithm
  • This is a control-parallel approach.
  • The list of numbers and the current prime are
    stored in a shared memory that is accessible to
    all processors.
  • Using more than 3 processors would not reduce the
    computational time in this control-parallel
    scheme.
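The control-parallel scheme can be sketched with Python threads (this code is not from the slides; it is a minimal simulation). The mark vector and the current-prime search position live in shared memory; a lock protects the search. A racing worker may occasionally grab a composite that is not yet marked, but marking its multiples is harmless redundant work, so the final vector is still correct.

```python
import threading

def control_parallel_sieve(n, p):
    """Control-parallel sieve sketch: p workers share the mark vector and
    the current-prime search position, which is protected by a lock."""
    mark = [False] * (n + 1)
    lock = threading.Lock()
    state = {"next": 2}  # next candidate number to examine

    def worker():
        while True:
            with lock:
                m = state["next"]
                # Skip numbers already marked as composite.
                while m * m <= n and mark[m]:
                    m += 1
                state["next"] = m + 1
            if m * m > n:
                return
            # Mark multiples of m outside the lock; if m was in fact composite
            # (a benign race), this only repeats work already done elsewhere.
            for idx in range(m * m, n + 1, m):
                mark[idx] = True

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [m for m in range(2, n + 1) if not mark[m]]
```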

12
Control-parallel Realization with n1000
13
The Second Parallel Algorithm
  • This is a data-parallel approach.
  • Assume that n/p ≥ √n, so that all of the primes
    whose multiples have to be marked reside in P1,
    which acts as the coordinator: it finds the next
    prime and broadcasts it to all other processors.
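The data-parallel scheme can be simulated as follows (a sketch, not from the slides): the range [1, n] is split into p equal segments, the first segment plays the coordinator P1, and the "broadcast" is just a function call into every segment. For simplicity the sketch assumes p divides n.

```python
def data_parallel_sieve(n, p):
    """Data-parallel sieve sketch: [1, n] is split into p equal segments;
    segment 0 is the coordinator P1.  Requires n/p >= sqrt(n) so that
    every prime whose multiples must be marked lies in P1's segment."""
    seg = n // p
    assert seg * seg >= n, "need n/p >= sqrt(n)"
    # One mark list per simulated processor; segment k owns [k*seg+1, (k+1)*seg].
    marks = [[False] * seg for _ in range(p)]
    marks[0][0] = True  # the number 1 is marked from the start

    def mark_multiples(k, prime):
        lo = k * seg + 1
        # First multiple of `prime` in this segment, but never below prime**2.
        start = max(prime * prime, ((lo + prime - 1) // prime) * prime)
        for x in range(start, lo + seg, prime):
            marks[k][x - lo] = True

    m = 2
    while m * m <= n:
        if not marks[0][m - 1]:          # P1 finds the next prime m ...
            for k in range(p):           # ... and "broadcasts" it to everyone
                mark_multiples(k, m)
        m += 1
    return [k * seg + i + 1
            for k in range(p) for i in range(seg)
            if not marks[k][i]]
```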

14
Trade-off Between Communication and Computation
Times
  • Because of communication overhead, adding more
    processors beyond a certain optimal number does
    not lead to any improvement in the total time.

15
Taxonomy of Parallel Computers
  • Parallel computers can be divided into two main
    categories of control flow and data flow.
  • Control-flow parallel computers are essentially
    based on the same principles as the von Neumann
    (sequential) computer, except that multiple
    instructions can be executed at any given time.
  • Data-flow parallel computers, sometimes referred
    to as non-von Neumann machines, are completely
    different in that they have no pointer to active
    instructions. The control is totally distributed,
    with the availability of operands triggering the
    activation of instructions.

16
Flynn's Classification
  • In 1966, Flynn proposed a four-way classification
    of computer systems based on the notions of
    instruction streams (single or multiple) and data
    streams (single or multiple).
  • The four classes: SISD, SIMD, MISD, and MIMD.
  • Flynn's classification has become standard and is
    widely used.
  • Attempts have been made to extend Flynn's
    classification to accommodate modern parallel
    computers.

17
Johnson's Classification
  • The MIMD category includes a wide class of
    computers. In 1988, Johnson extended Flynn's
    classification based on the memory structure
    (global or distributed) and the mechanism used
    for communication (shared variables or message
    passing).

18
Duncan's Classification
  • In 1989, Duncan modified Flynn's scheme to
    include some other architectures and to cover
    lower-level features of parallelism.

19
Effectiveness of Parallel Processing
  • Throughout this course, we will use certain
    measures to compare the effectiveness of various
    parallel algorithms. The definitions of these
    measures are given below.
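The slide's definition list did not survive transcription. The following definitions are a reconstruction, hedged but consistent with the worked example two slides later (there T(1) = 15, T(8) = 4, W(8) = 15 yield S = 3.75, E ≈ 47%, R = 1, Q ≈ 1.76):

```latex
% p    : number of processors
% W(p) : total number of unit operations performed (work); W(1) = T(1)
% T(p) : execution time with p processors
\begin{align*}
  S(p) &= \frac{T(1)}{T(p)}              && \text{speed-up}\\
  E(p) &= \frac{T(1)}{p\,T(p)}           && \text{efficiency}\\
  R(p) &= \frac{W(p)}{W(1)}              && \text{redundancy}\\
  U(p) &= \frac{W(p)}{p\,T(p)}           && \text{utilization}\\
  Q(p) &= \frac{T^{3}(1)}{p\,T^{2}(p)\,W(p)} && \text{quality}
\end{align*}
```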

20
The Equations
  • It is not difficult to establish the following
    equations (left as an exercise).
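The equation slide itself is missing from the transcript; under the standard definitions of speed-up S, efficiency E, redundancy R, utilization U, and quality Q, the relations referred to are presumably:

```latex
\begin{align*}
  1 \le S(p) \le p, \qquad
  E(p) = \frac{S(p)}{p}, \qquad
  U(p) = R(p)\,E(p), \qquad
  Q(p) = E(p)\,\frac{S(p)}{R(p)} \le S(p) \le p.
\end{align*}
```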

21
An Example
  • Finding the sum of 16 numbers can be represented
    by the following binary tree.
  • Excluding communication overhead
  • W(8) = 15, T(8) = 4, E(8) = 47%, S(8) = 3.75,
    R(8) = 1, Q(8) = 1.76
  • Assuming each data transfer requires one unit of
    time
  • W(8) = 22, T(8) = 7, E(8) = 27%, S(8) = 2.14,
    R(8) = 1.47, Q(8) = 0.39
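The figures above can be checked directly. A small script (assuming the standard definitions S = T(1)/T(p), E = T(1)/(pT(p)), R = W(p)/W(1), U = W(p)/(pT(p)), Q = T³(1)/(pT²(p)W(p)), with T(1) = W(1) = 15 additions to sum 16 numbers sequentially):

```python
def measures(t1, tp, wp, p):
    """Speed-up, efficiency, redundancy, utilization, quality for one run."""
    s = t1 / tp                 # speed-up
    e = t1 / (p * tp)           # efficiency
    r = wp / t1                 # redundancy; W(1) = T(1) for the sequential sum
    u = wp / (p * tp)           # utilization
    q = t1 ** 3 / (p * tp ** 2 * wp)  # quality
    return s, e, r, u, q

p, t1 = 8, 15                   # 15 additions to sum 16 numbers sequentially
print(measures(t1, 4, 15, p))   # no communication: S = 3.75, E ~ 0.47, R = 1, Q ~ 1.76
print(measures(t1, 7, 22, p))   # one unit per transfer: S ~ 2.14, E ~ 0.27, R ~ 1.47, Q ~ 0.39
```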

22
Exercise 1
  • Consider the data-parallel implementation of the
    sieve of Eratosthenes algorithm for n = 1,000,000.
    Assume that marking each cell takes one time
    unit and broadcasting a value to all processors
    takes b time units.
  • Plot 3 speed-up curves for b = 1, 10, and 100.
  • Repeat part (a), this time assuming that the
    broadcasting time b = 5p + 1, 5p + 10, and
    5p + 100.
  • Consider two versions of the task graph in the
    following figure. In version 1, each node
    requires one unit of computation time. In version
    2, each odd-numbered node requires one unit of
    time and each even-numbered node takes twice as
    long.
  • Find the maximum speed-up for each of the two
    versions and show the minimum number of processors
    needed to achieve that speed-up.
  • What is the maximum speed-up in each case with 3
    processors?

23
Exercise 1 (Cont.)
  • HINT for Question 1
  • The number of items per processor is ⌈1,000,000/p⌉.
  • There are a total of 168 primes in [1, 1000]:
    2, 3, …, 997. The computing time is roughly
    (the sum of 10⁶/q over those primes q)/p
    ≈ 2.2 × 10⁶/p,
  • the communication time is 168(ap + b), and
  • the running time is T ≈ 2.2 × 10⁶/p + 168(ap + b).
  • The optimal number of processors that minimizes
    the running time can be computed from the equation
    dT/dp = 0, which gives p = √(2.2 × 10⁶/(168a)).
  • For a = 5, the optimal number of processors is 51.
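The hint can be checked numerically. The sketch below uses an assumed cost model matching the hint: one time unit per cell marked (starting from each prime's square) and a broadcast cost of a·p + b per prime; it finds the optimal p by direct search instead of solving dT/dp = 0.

```python
def primes_upto(n):
    """Plain sieve used to list the primes up to sqrt(1,000,000)."""
    mark = [True] * (n + 1)
    for m in range(2, int(n ** 0.5) + 1):
        if mark[m]:
            for x in range(m * m, n + 1, m):
                mark[x] = False
    return [m for m in range(2, n + 1) if mark[m]]

n, a, b = 10 ** 6, 5, 1               # a, b as in the hint (assumed roles)
small = primes_upto(int(n ** 0.5))    # the 168 primes in [2, 997]
work = sum(n // q for q in small)     # total cell-marking operations, ~2.2e6

def running_time(p):
    # computation shared by p processors + 168 broadcasts of cost a*p + b each
    return work / p + len(small) * (a * p + b)

best_p = min(range(1, 201), key=running_time)
print(len(small), best_p)             # 168 primes; optimum near p = 51
```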