A Survey of Parallel Computer Architectures - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
A Survey of Parallel Computer Architectures
  • CSC521 Advanced Computer Architecture
  • Dr. Craig Reinhart
  • The General and Logical Theory of Automata
  • Lin, Shu-Hsien (ANDY)
  • 4/24/2008

2
What is parallel computer architecture?
  • Instruction pipelining
  • Multiple CPU Functional Units
  • Separate CPU and I/O processors

3
Instruction Pipelining
  • Pipelining decomposes instruction execution into a
    linear series of autonomous stages, allowing each
    stage to simultaneously perform a portion of the
    execution process (such as decoding, calculating
    the effective address, fetching the operand,
    executing, and storing).

4
Pipelining vs. Single-Cycle Processors
  • A single-cycle processor takes 16 nanoseconds to
    execute four instructions, while a pipelined
    processor takes only 7 nanoseconds.
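
Where those numbers can come from, as a minimal sketch (assuming a four-stage pipeline with 1 ns stages and a single-cycle datapath that spends the full 4 ns per instruction; the slide's figure is not reproduced here):

    # Hedged sketch: 4 pipeline stages at 1 ns each vs. a 4 ns single cycle.
    stages, stage_ns, instructions = 4, 1, 4
    single_cycle_total = (stages * stage_ns) * instructions   # 4 * 4 = 16 ns
    # Pipelined: the first instruction fills the pipe, then one
    # instruction completes per cycle.
    pipelined_total = (stages + instructions - 1) * stage_ns  # 4 + 3 = 7 ns
    print(single_cycle_total, pipelined_total)                # 16 7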

5
Multiple CPU Functional Units (1)
  • This approach provides independent functional
    units for arithmetic and Boolean operations that
    execute concurrently.

6
Multiple CPU Functional Units (2)
  • A parallel computer has three types of parts:
  • Processors
  • Memory modules
  • Communication / synchronization network

7
Single Processor vs. Multiprocessor
Energy-Efficient Performance
  • The two figures show the difference in
    energy-efficient performance between a single
    processor and multiple processors.
  • The upper figure shows that increasing the clock
    frequency of a single processor by 20 percent
    delivers a 13 percent performance gain but
    requires 73 percent greater power.
  • The lower figure shows that adding a second
    processor in the under-clocking experiment
    effectively delivers 73 percent more performance.
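
A common back-of-the-envelope account of these figures (an assumption here, not stated on the slide): dynamic power scales roughly as V^2 * f, and supply voltage scales with clock frequency, so relative power goes as the cube of the frequency scaling:

    # Hedged sketch: power ~ C * V^2 * f, with V assumed proportional to f.
    def relative_power(freq_scale, cores=1):
        return cores * freq_scale ** 3

    print(relative_power(1.2))     # ~1.73: +20% clock costs ~73% more power
    print(relative_power(0.8, 2))  # ~1.02: two cores at -20% clock, same power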

8
Separate CPU and I/O processors
  • Dedicated I/O processors free the CPU from I/O
    control responsibilities; solutions range from
    relatively simple I/O controllers to complex
    peripheral processing units.

9
Example: Intel IXP2800
10
High-Level Taxonomy of Parallel Computer Architectures
  • A parallel architecture provides an explicit,
    high-level framework for developing parallel
    programming solutions by providing multiple
    processors, whether simple or complex, that
    cooperate to solve problems through concurrent
    execution.

11
Flynn's Taxonomy Classifies Architectures (1)
  • SISD (single instruction, single data stream)
  • -- Defines serial computers.
  • -- An ordinary sequential computer.
  • MISD (multiple instruction, single data stream)
  • -- Would involve multiple processors applying
    different instructions to a single datum; this
    hypothetical possibility is generally deemed
    impractical.

12
Flynn's Taxonomy Classifies Architectures (2)
  • SIMD (single instruction, multiple data streams)
  • -- Involves multiple processors simultaneously
    executing the same instruction on different data.
  • -- Massively parallel "army-of-ants" approach:
    processors execute the same sequence of
    instructions (or else NO-OP) in lockstep (TMC
    CM-2).
  • MIMD (multiple instruction, multiple data streams)
  • -- Involves multiple processors autonomously
    executing diverse instructions on diverse data.
  • -- True, symmetric parallel computing (Sun
    Enterprise).

13
Pipelined Vector Processors
  • Vector processors are characterized by multiple,
    pipelined functional units.
  • The architecture provides parallel vector
    processing by sequentially streaming vector
    elements through a functional-unit pipeline and
    by streaming the output of one unit into the
    pipeline of another as input.

14
Register-to-Register Vector Architecture Operation
  • Each pipeline stage in the hypothetical
    architecture has a cycle time of 20 nanoseconds.
  • Thus 120 ns elapse from the time operands a1 and
    b1 enter stage 1 until result c1 is available.
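
Those timings imply a six-stage pipeline (120 ns / 20 ns per stage); once the pipeline is full, a new result emerges every cycle. A minimal sketch of the result timing (the stage count is inferred, not stated on the slide):

    # Hedged sketch: 6-stage vector pipeline, 20 ns per stage.
    stages, cycle_ns = 6, 20
    for i in range(1, 4):                        # results c1, c2, c3
        ready = (stages + i - 1) * cycle_ns      # 120, 140, 160 ns
        print(f"c{i} ready at {ready} ns")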

15
SIMD Architectures
  • SIMD architectures employ:
  • a central control unit
  • multiple processors
  • an interconnection network (IN)
  • The IN is used for either processor-to-processor
    or processor-to-memory communication.

16
SIMD Architecture Computation Example
  • A SIMD vector computation example: adding two
    real arrays A and B, as shown in the figure below.
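
A minimal sketch of the lockstep idea (plain Python standing in for the hardware; conceptually, each processing element holds one A[i], B[i] pair in local memory and executes the same broadcast ADD at once):

    # Hedged sketch: one ADD instruction broadcast to N processing elements.
    A = [1.5, 2.0, 3.25, 4.0]
    B = [0.5, 1.0, 0.75, 2.0]
    C = [a + b for a, b in zip(A, B)]   # all PEs add "simultaneously"
    print(C)                            # [2.0, 3.0, 4.0, 6.0]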

17
SIMD Architecture Problems
  • SIMD has several problems, e.g.
  • SIMD cannot use commodity processors
  • SIMD cannot support multiple users
  • SIMD is less efficient in conditionally executed
    parallel code

18
Bit-Plane Array Processing
  • Processor arrays structured for numerical SIMD
    execution have been employed for large-scale
    scientific calculations, for example in image
    processing and nuclear energy modeling.
  • In bit-plane architectures, the array of
    processors is arranged in a symmetrical grid
    (such as 64x64) and associated with multiple
    planes of memory bits that correspond to the
    dimensions of the processor grid.
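
A minimal sketch of bit-plane arithmetic (assumptions: each bit-plane is modeled as one big Python integer whose bits correspond to the grid's processing elements, and addition proceeds bit-serially, one full-adder step per plane):

    # Hedged sketch: plane k holds bit k of every PE's operand, so one
    # full-adder step operates on all PEs of the grid at once.
    def plane_add(a_planes, b_planes):            # least-significant first
        carry, out = 0, []
        for a, b in zip(a_planes, b_planes):
            out.append(a ^ b ^ carry)             # sum plane
            carry = (a & b) | (carry & (a ^ b))   # carry plane
        return out

    # Example: PE0 computes 3 + 1, PE1 computes 2 + 2 (bit i of each mask
    # belongs to PE i); a zero plane is appended to catch the final carry.
    a = [0b01, 0b11]                      # planes of the values [3, 2]
    b = [0b01, 0b10]                      # planes of the values [1, 2]
    print(plane_add(a + [0], b + [0]))    # [0, 0, 3] = planes of [4, 4]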

19
Associative Memory Processing Organization
  • The figure on the right shows the characteristic
    functional units of an associative memory
    processor.
  • A program controller (a serial computer) reads and
    executes instructions, invoking a specialized
    array controller when associative memory
    instructions are encountered.
  • Special registers enable the program controller
    and associative memory to share data.

20
Associative Memory Comparison Operation
  • The right figure shows a row-oriented comparison
    operation for a generic bit-serial architecture.
  • All of the associative processing elements start
    at a specified memory column and compare the
    contents of four consecutive bits in their row
    against the comparison register contents, setting
    a bit in the A register to indicate whether or
    not their row contains a match.

21
Associative Memory Logical OR Operation
  • The figure on the right shows a logical OR
    operation performed on a bit-column and the
    bit-vector in register A, with register B
    receiving the result.
  • A zero in the mask register indicates that the
    associated word is not to be included in the
    current operation.
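
A minimal sketch of this step (same bit-vector conventions as above; the assumption that masked-off words simply keep their old B bit is mine, since the slide does not say):

    # Hedged sketch: B <- A OR bit-column for every word enabled by the mask.
    def masked_or(bit_column, A, mask, B):
        return [(a | c) if m else b
                for c, a, m, b in zip(bit_column, A, mask, B)]

    print(masked_or([1, 0, 0], A=[0, 0, 1], mask=[1, 1, 0], B=[0, 0, 0]))
    # -> [1, 0, 0]; the third word is masked off and keeps its old B bit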

22
Systolic Flow of Data From and to Memory
  • Systolic architectures (systolic arrays) are
    pipelined multiprocessors in which data is pulsed
    in rhythmic fashion from memory and through a
    network of processors before returning to memory.

23
Systolic Matrix Multiplication
  • The figure on the right shows a simple systolic
    array multiplying A = [a b; c d] by B = [e f; g h].
  • The zero inputs shown moving through the array
    are used for synchronization.
  • Each processor begins with an accumulator set to
    zero and, during each cycle, adds the product of
    its two inputs to the accumulator.
  • After five cycles the matrix product is complete.

See also slide 34, where the same example illustrates wavefront arrays.
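
A minimal simulation of the scheme above (assumed dataflow, since the figure is not reproduced: A values stream rightward, B values stream downward, and the zero padding provides the skew; the numbers stand in for a-h):

    # Hedged sketch: 2x2 systolic matrix multiply, C = A x B.
    n = 2
    A = [[1, 2], [3, 4]]                  # stands in for [a b; c d]
    B = [[5, 6], [7, 8]]                  # stands in for [e f; g h]
    T = 3 * n - 1                         # five cycles for the 2x2 case
    a_in = [[0] * T for _ in range(n)]    # a_in[i][t] enters row i at cycle t
    b_in = [[0] * T for _ in range(n)]    # b_in[j][t] enters column j at cycle t
    for x in range(n):
        for k in range(n):
            a_in[x][x + k] = A[x][k]      # skewed, zero-padded input streams
            b_in[x][x + k] = B[k][x]
    acc = [[0] * n for _ in range(n)]     # each accumulator starts at zero
    a_reg = [[0] * n for _ in range(n)]   # operands latched in PE(i,j)
    b_reg = [[0] * n for _ in range(n)]
    for t in range(T):
        na = [[0] * n for _ in range(n)]
        nb = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                a = a_in[i][t] if j == 0 else a_reg[i][j - 1]  # from the left
                b = b_in[j][t] if i == 0 else b_reg[i - 1][j]  # from above
                acc[i][j] += a * b        # add the product of the two inputs
                na[i][j], nb[i][j] = a, b
        a_reg, b_reg = na, nb             # synchronous (pulsed) register update
    print(acc)                            # [[19, 22], [43, 50]] = A x B

With this input skew the accumulators hold the full product after the five simulated cycles, matching the slide's count.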
24
MIMD Architectures
  • MIMD architectures employ multiple instruction
    streams, each using local data.
  • MIMD computers support parallel solutions that
    require processors to operate in a largely
    autonomous manner.

25
MIMD Distributed Memory Architecture Structure
  • Distributed memory architectures connect
    processing nodes (each consisting of an autonomous
    processor and its local memory) with a
    processor-to-processor interconnection network.

26
Example: Distributed Memory Architecture
  • The figure on the right shows the IBM RS/6000 SP,
    a distributed memory machine.

27
Interconnection Network Topologies
  • Various interconnection network topologies have
    been proposed to support architectural
    expandability and provide efficient performance
    for parallel programs with differing
    interprocessor communication patterns.
  • Figures a-e on the right depict these topologies:
    a) ring, b) mesh, c) tree, d) hypercube, and e) a
    tree mapped to a reconfigurable mesh.
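
As a concrete instance of one of these topologies, hypercube neighbors follow from flipping address bits (a standard property of the topology, not detailed on the slide):

    # Hedged sketch: in a d-dimensional hypercube, nodes have d-bit
    # addresses and link to the d nodes differing in exactly one bit.
    def hypercube_neighbors(node, d):
        return [node ^ (1 << bit) for bit in range(d)]

    print(hypercube_neighbors(0b101, d=3))   # [4, 7, 1]: nodes 100, 111, 001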

28
Shared-Memory Architecture
  • Shared memory architectures accomplish
    interprocessor coordination by providing a
    global, shared memory that each processor can
    address.
  • The figure on the right shows some major
    alternatives for connecting multiple processors
    to shared memory.
  • Figure a) shows a bus interconnection, b) a 2x2
    crossbar, and c) an 8x8 omega MIN routing a P3
    request to M3.
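
A minimal sketch of how an omega MIN routes that request (standard destination-tag routing, assumed to be what the figure shows): each of the log2(8) = 3 switch stages examines one bit of the destination address, 0 selecting the upper switch output and 1 the lower:

    # Hedged sketch: destination-tag routing in an 8x8 omega network.
    # Only the destination's bits steer the request, one bit per stage.
    def omega_route(dest, stages=3):
        return ["upper" if (dest >> (stages - 1 - s)) & 1 == 0 else "lower"
                for s in range(stages)]

    print(omega_route(3))   # M3 = 011 -> ['upper', 'lower', 'lower']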

29
MIMD/SIMD Operations
  • A MIMD architecture can be controlled in SIMD
    fashion.
  • The master/slave relation of a SIMD architecture's
    controller and processors can be mapped onto the
    node/descendants relation of a subtree.
  • When the root processor node of a subtree
    operates as a SIMD controller, it transmits
    instructions to descendant nodes, which execute
    the instructions on local memory data.

30
Dataflow Architectures
  • The fundamental feature of dataflow architectures
    is an execution paradigm in which instructions are
    enabled for execution as soon as all of their
    operands become available.
  • The dataflow graph of a program fragment is shown
    in the figure on the right.

31
Dataflow Token-Matching Example
  • At step 1, the execution of (3a) produces the
    result 15, and the instruction at node 3 still
    requires an operand.
  • At step 2, the matching unit matches this token
    with the result token of (5b) destined for the
    node 3 instruction.
  • At step 3, the matching unit creates the
    instruction token (template).
  • At step 4, the node store unit obtains the
    relevant instruction opcode from memory.
  • At step 5, the node store unit then fills in the
    relevant token fields and assigns the instruction
    to a processor. Executing the instruction will
    create a new result token to be used as input to
    the node 4 instruction.
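
A minimal sketch of the underlying firing rule (hypothetical node ids and operand values, loosely echoing the 15 above; this shows the general execute-when-operands-arrive idea rather than the slide's exact machine):

    # Hedged sketch: an instruction fires as soon as its operand tokens
    # have all been matched. Tokens are (destination node, value) pairs.
    from collections import defaultdict

    nodes = {1: ("mul", 3),    # e.g. 3 * a; with a = 5 it emits the token 15
             2: ("mul", 3),    # e.g. 5 * b; with b = 7 it emits the token 35
             3: ("add", None)} # fires once both result tokens arrive
    ops = {"mul": lambda x, y: x * y, "add": lambda x, y: x + y}

    waiting = defaultdict(list)                # matching store
    tokens = [(1, 3), (1, 5), (2, 5), (2, 7)]  # initial operand tokens
    while tokens:
        dest, value = tokens.pop()
        waiting[dest].append(value)
        if len(waiting[dest]) == 2:            # both operands present: execute
            opcode, nxt = nodes[dest]
            result = ops[opcode](*waiting.pop(dest))
            if nxt is None:
                print("final token:", result)  # 15 + 35 = 50
            else:
                tokens.append((nxt, result))   # result re-enters matching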

32
Reduction Architecture Demand Token Production (1)
  • The figure shows a simplified version of a
    graph-reduction architecture that maps the
    program below onto tree-structured processors
    and passes tokens that demand or return results.
  • The figure on the right depicts all the demand
    tokens produced by the program, as demands for
    the values of references propagate down the tree.
  • The example program is shown below:
  • a = b + c
  • b = d + e
  • c = f × g
  • d = 1, e = 3, f = 5, g = 7

33
Reduction Architecture Demand Token Production (2)
  • The figure on the right depicts the last two
    result tokens produced as they are passed to the
    root node.
  • The example program is the same as before:
  • a = b + c
  • b = d + e
  • c = f × g
  • d = 1, e = 3, f = 5, g = 7
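
A minimal sketch of the demand-driven evaluation these figures depict (the + and × operators below are an assumption, since they were garbled in the slide text; the leaf values follow the listing):

    # Hedged sketch: demanding "a" propagates demand tokens down the
    # definition tree; results return upward as they are reduced.
    defs = {
        "a": ("+", "b", "c"),       # operators assumed, not from the slide
        "b": ("+", "d", "e"),
        "c": ("*", "f", "g"),
        "d": 1, "e": 3, "f": 5, "g": 7,
    }

    def demand(name):
        expr = defs[name]
        if not isinstance(expr, tuple):        # leaf: return a result token
            return expr
        op, left, right = expr
        l, r = demand(left), demand(right)     # demands propagate downward
        return l + r if op == "+" else l * r   # results propagate upward

    print(demand("a"))   # b = 1 + 3 = 4, c = 5 * 7 = 35, a = 4 + 35 = 39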

34
Wavefront Array Matrix Multiplication (1)
  • Figures a-c on the following slides depict the
    wavefront array concept, using the matrix
    multiplication example that illustrated systolic
    operation on slide 23.
  • Figure (a), shown on the right, depicts the
    situation after memory input has initially
    filled the buffers.

35
Wavefront Array Matrix Multiplication (2)
  • In figure (b), PE(1,1) adds the product ae to its
    accumulator and transmits operands a and e to its
    neighbors; thus, the first computational wavefront
    propagates from PE(1,1) to PE(1,2) and PE(2,1).

36
Wavefront Array Matrix Multiplication (3)
  • Figure (c) shows the first computational
    wavefront continuing to propagate, while a second
    wavefront is propagated by PE(1,1).

37
Conclusion
  • What is parallel computer architecture? In R.W.
    Hockney's words, "a confusing menagerie of
    computer designs."