1
Introduction to Parallelism
by Shietung Peng
2
Parallel Computers
  • In an ordinary computer there is only one
    processor. To improve speed, we would like to
    have many processors in one computer; such a
    computer is called a parallel computer.
  • This parallelism can be illustrated by the
    character-recognition application shown in the
    following figure.

3
Character Recognition
  • A computer's ability to understand hand-written
    script is called character recognition. To
    achieve a desirable error-tolerance capacity,
    the processing speed of the computer must be
    increased.
  • Character recognition involves intensive
    computation. A hand-written character is scanned
    by a camera as an image of very high resolution,
    such as 14,400 pixels per inch (see the figure
    on the next page).

4
Character Recognition
5
Parallel Processing Concepts
  • In order to solve a problem using a parallel
    computer, one must decompose the problem into
    small sub-problems, which can be solved in
    parallel. Then these results must be efficiently
    combined to get the final result of the main
    problem.
  • Because of data dependency among the
    sub-problems, it is not easy to decompose a large
    problem properly.

6
Parallel Processing Concepts
  • Because of data dependency, the processors may
    have to communicate with one another.
  • The time taken for communication is very high
    when compared with the processing time.
  • The communication scheme should be very well
    planned in order to get a good parallel algorithm.

7
A Motivating Example
  • A major issue in devising a parallel algorithm
    for a given problem is the way in which the
    computational load is divided among the
    processors.
  • The most efficient scheme often depends on the
    problem and the architecture of the parallel
    computer.
  • We consider the problem of constructing the list
    of prime numbers in the interval [1, n].

8
A Simple Algorithm: The Sieve of Eratosthenes
  • Start with the list of numbers 1, 2, 3, …, n
    represented as a mark bit-vector initialized to
    1000…0. In each step, the next unmarked number m
    (associated with a 0 in element m of the mark
    bit-vector) is a prime. Find this number m and
    mark all multiples of m beginning with m². When
    m² > n, the computation stops and all unmarked
    elements are prime numbers.

9
The Computational Steps for n = 30
10
The Sequential Algorithm
  • The variable current prime is initialized to 2
    and, at later stages, holds the latest prime
    number found. For each prime found, index is
    initialized to the square of this prime and is
    then incremented by the current prime in order to
    mark all its multiples.
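The procedure described above can be sketched in Python; the names current_prime and index follow the slide's description, and the n = 30 case reproduces the earlier example.

```python
def sieve(n):
    """Sieve of Eratosthenes over a mark bit-vector, as described above."""
    mark = [False] * (n + 1)   # False = unmarked; element m stands for number m
    mark[1] = True             # 1 is marked from the start (it is not a prime)
    current_prime = 2
    while current_prime * current_prime <= n:
        # Mark all multiples of current_prime, beginning with its square.
        index = current_prime * current_prime
        while index <= n:
            mark[index] = True
            index += current_prime
        # The next unmarked number is the next prime.
        current_prime += 1
        while mark[current_prime]:
            current_prime += 1
    return [m for m in range(2, n + 1) if not mark[m]]

print(sieve(30))  # the n = 30 example from the earlier slide
```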

11
The First Parallel Algorithm
  • This is a control-parallel approach.
  • The list of numbers and the current prime are
    stored in a shared memory that is accessible to
    all processors.
  • Using more than 3 processors would not reduce the
    computational time in this control-parallel
    scheme.
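The control-parallel scheme can be sketched with Python threads (this code is not from the slides; it is a minimal simulation). The mark vector and the current-prime search position live in shared memory; a lock protects the search. A racing worker may occasionally grab a composite that is not yet marked, but marking its multiples is harmless redundant work, so the final vector is still correct.

```python
import threading

def control_parallel_sieve(n, p):
    """Control-parallel sieve sketch: p workers share the mark vector and
    the current-prime search position, which is protected by a lock."""
    mark = [False] * (n + 1)
    lock = threading.Lock()
    state = {"next": 2}  # next candidate number to examine

    def worker():
        while True:
            with lock:
                m = state["next"]
                # Skip numbers already marked as composite.
                while m * m <= n and mark[m]:
                    m += 1
                state["next"] = m + 1
            if m * m > n:
                return
            # Mark multiples of m outside the lock; if m was in fact composite
            # (a benign race), this only repeats work already done elsewhere.
            for idx in range(m * m, n + 1, m):
                mark[idx] = True

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [m for m in range(2, n + 1) if not mark[m]]
```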

12
Control-parallel Realization with n1000
13
The Second Parallel Algorithm
  • This is a data-parallel approach.
  • Assume that n/p ≥ √n, so that all of the primes
    whose multiples have to be marked reside in P1,
    which acts as the coordinator: it finds the next
    prime and broadcasts it to all other processors.
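The data-parallel scheme can be simulated as follows (a sketch, not from the slides): the range [1, n] is split into p equal segments, the first segment plays the coordinator P1, and the "broadcast" is just a function call into every segment. For simplicity the sketch assumes p divides n.

```python
def data_parallel_sieve(n, p):
    """Data-parallel sieve sketch: [1, n] is split into p equal segments;
    segment 0 is the coordinator P1.  Requires n/p >= sqrt(n) so that
    every prime whose multiples must be marked lies in P1's segment."""
    seg = n // p
    assert seg * seg >= n, "need n/p >= sqrt(n)"
    # One mark list per simulated processor; segment k owns [k*seg+1, (k+1)*seg].
    marks = [[False] * seg for _ in range(p)]
    marks[0][0] = True  # the number 1 is marked from the start

    def mark_multiples(k, prime):
        lo = k * seg + 1
        # First multiple of `prime` in this segment, but never below prime**2.
        start = max(prime * prime, ((lo + prime - 1) // prime) * prime)
        for x in range(start, lo + seg, prime):
            marks[k][x - lo] = True

    m = 2
    while m * m <= n:
        if not marks[0][m - 1]:          # P1 finds the next prime m ...
            for k in range(p):           # ... and "broadcasts" it to everyone
                mark_multiples(k, m)
        m += 1
    return [k * seg + i + 1
            for k in range(p) for i in range(seg)
            if not marks[k][i]]
```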

14
Trade-off Between Communication and Computation
Times
  • Because of communication overhead, adding more
    processors beyond a certain optimal number does
    not lead to any improvement in the total time.

15
Taxonomy of Parallel Computers
  • Parallel computers can be divided into two main
    categories of control flow and data flow.
  • Control-flow parallel computers are essentially
    based on the same principles as the von Neumann
    (sequential) computer, except that multiple
    instructions can be executed at any given time.
  • Data-flow parallel computers, sometimes referred
    to as non-von Neumann machines, are completely
    different in that they have no pointer to active
    instructions. The control is totally distributed,
    with the availability of operands triggering the
    activation of instructions.

16
Flynn's Classification
  • In 1966, Flynn proposed a four-way classification
    of computer systems based on the notions of
    instruction streams (single or multiple) and data
    streams (single or multiple).
  • The four classes: SISD, SIMD, MISD, and MIMD.
  • Flynn's classification has become standard and is
    widely used.
  • Attempts have been made to extend Flynn's
    classification to accommodate modern parallel
    computers.

17
Johnson's Classification
  • The MIMD category includes a wide class of
    computers. In 1988, Johnson extended Flynn's
    classification based on the memory structure
    (global or distributed) and the mechanism used
    for communication (shared variables or message
    passing).

18
Duncan's Classification
  • In 1989, Duncan modified Flynn's scheme to
    include some other architectures and to cover
    lower-level features of parallelism.

19
Effectiveness of Parallel Processing
  • Throughout this course, we will use certain
    measures to compare the effectiveness of various
    parallel algorithms. The definitions of these
    measures are given below.
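The slide's definition list did not survive transcription. The following definitions are a reconstruction, hedged but consistent with the worked example two slides later (there T(1) = 15, T(8) = 4, W(8) = 15 yield S = 3.75, E ≈ 47%, R = 1, Q ≈ 1.76):

```latex
% p    : number of processors
% W(p) : total number of unit operations performed (work); W(1) = T(1)
% T(p) : execution time with p processors
\begin{align*}
  S(p) &= \frac{T(1)}{T(p)}              && \text{speed-up}\\
  E(p) &= \frac{T(1)}{p\,T(p)}           && \text{efficiency}\\
  R(p) &= \frac{W(p)}{W(1)}              && \text{redundancy}\\
  U(p) &= \frac{W(p)}{p\,T(p)}           && \text{utilization}\\
  Q(p) &= \frac{T^{3}(1)}{p\,T^{2}(p)\,W(p)} && \text{quality}
\end{align*}
```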

20
The Equations
  • It is not difficult to establish the following
    equations (left as an exercise).
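The equation slide itself is missing from the transcript; under the standard definitions of speed-up S, efficiency E, redundancy R, utilization U, and quality Q, the relations referred to are presumably:

```latex
\begin{align*}
  1 \le S(p) \le p, \qquad
  E(p) = \frac{S(p)}{p}, \qquad
  U(p) = R(p)\,E(p), \qquad
  Q(p) = E(p)\,\frac{S(p)}{R(p)} \le S(p) \le p.
\end{align*}
```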

21
An Example
  • Finding the sum of 16 numbers can be represented
    by the following binary tree.
  • Excluding communication overhead
  • W(8) = 15, T(8) = 4, E(8) = 47%, S(8) = 3.75,
    R(8) = 1, Q(8) = 1.76
  • Assuming each data transfer requires one unit of
    time
  • W(8) = 22, T(8) = 7, E(8) = 27%, S(8) = 2.14,
    R(8) = 1.47, Q(8) = 0.39
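The figures above can be checked directly. A small script (assuming the standard definitions S = T(1)/T(p), E = T(1)/(pT(p)), R = W(p)/W(1), U = W(p)/(pT(p)), Q = T³(1)/(pT²(p)W(p)), with T(1) = W(1) = 15 additions to sum 16 numbers sequentially):

```python
def measures(t1, tp, wp, p):
    """Speed-up, efficiency, redundancy, utilization, quality for one run."""
    s = t1 / tp                 # speed-up
    e = t1 / (p * tp)           # efficiency
    r = wp / t1                 # redundancy; W(1) = T(1) for the sequential sum
    u = wp / (p * tp)           # utilization
    q = t1 ** 3 / (p * tp ** 2 * wp)  # quality
    return s, e, r, u, q

p, t1 = 8, 15                   # 15 additions to sum 16 numbers sequentially
print(measures(t1, 4, 15, p))   # no communication: S = 3.75, E ~ 0.47, R = 1, Q ~ 1.76
print(measures(t1, 7, 22, p))   # one unit per transfer: S ~ 2.14, E ~ 0.27, R ~ 1.47, Q ~ 0.39
```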

22
Exercise 1
  • Consider the data-parallel implementation of the
    sieve of Eratosthenes algorithm for n = 1,000,000.
    Assume that marking each cell takes one time
    unit and broadcasting a value to all processors
    takes b time units.
  • Plot 3 speed-up curves for b = 1, 10, and 100.
  • Repeat part (a), this time assuming that the
    broadcasting time b = 5p + 1, 5p + 10, and
    5p + 100.
  • Consider two versions of the task graph in the
    following figure. In version 1, each node
    requires one unit of computation time. In version
    2, each odd-numbered node requires one unit of
    time and each even-numbered node takes twice as
    long.
  • Find the maximum speed-up for each of the two
    versions and show the minimum number of processors
    needed to achieve that speed-up.
  • What is the maximum speed-up in each case with 3
    processors?

23
Exercise 1 (Cont.)
  • HINT for Question 1
  • The number of items per processor is ⌈1,000,000/p⌉.
  • There are a total of 168 primes in [1, 1000]:
    2, 3, …, 997. The computing time is roughly
    (the sum of 10⁶/q over those primes q)/p
    ≈ 2.2 × 10⁶/p,
  • the communication time is 168(ap + b), and
  • the running time is T ≈ 2.2 × 10⁶/p + 168(ap + b).
  • The optimal number of processors that minimizes
    the running time can be computed from the equation
    dT/dp = 0, which gives p = √(2.2 × 10⁶/(168a)).
  • For a = 5, the optimal number of processors is 51.
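The hint can be checked numerically. The sketch below uses an assumed cost model matching the hint: one time unit per cell marked (starting from each prime's square) and a broadcast cost of a·p + b per prime; it finds the optimal p by direct search instead of solving dT/dp = 0.

```python
def primes_upto(n):
    """Plain sieve used to list the primes up to sqrt(1,000,000)."""
    mark = [True] * (n + 1)
    for m in range(2, int(n ** 0.5) + 1):
        if mark[m]:
            for x in range(m * m, n + 1, m):
                mark[x] = False
    return [m for m in range(2, n + 1) if mark[m]]

n, a, b = 10 ** 6, 5, 1               # a, b as in the hint (assumed roles)
small = primes_upto(int(n ** 0.5))    # the 168 primes in [2, 997]
work = sum(n // q for q in small)     # total cell-marking operations, ~2.2e6

def running_time(p):
    # computation shared by p processors + 168 broadcasts of cost a*p + b each
    return work / p + len(small) * (a * p + b)

best_p = min(range(1, 201), key=running_time)
print(len(small), best_p)             # 168 primes; optimum near p = 51
```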