Title: Extending the Unified Parallel Processing Speedup Model
1. Extending the Unified Parallel Processing Speedup Model
- Computer architectures take advantage of low-level parallelism, such as multiple pipelines.
- The next generations of integrated circuits will 
 continue to support increasing numbers of
 transistors.
- How to make efficient use of the additional 
 transistors?
- Answer: parallelism beyond multiple pipelines, adding multiple processors or processing components in a single chip or single package.
- At each level of parallelism, performance suffers from the law of diminishing returns outlined by Amdahl.
- Incorporating multiple levels of parallelism 
 results in higher overall performance and
 efficiency.
2. Presentation Content
- A discussion of practical and theoretical parallel speedup, alternative methods, and the efficient use of hardware/processing resources in capturing speedup.
-  Parallel Speedup/Amdahl's Law, Scaled Speedup 
-  Pipelined Processors 
-  Multiprocessors and Multicomputers 
-  Multiple concurrent threads 
-  Multiple concurrent processes 
-  Multiple levels of parallelism with integrated 
 chips/packages that combine microcontrollers with
 Digital Signal Processing chips
3. Presentation Summary
- Architects/Chip-Manufacturers are integrating 
 additional levels of parallelism.
- Multiple levels of speedup achieve higher 
 speedups and greater efficiencies than increasing
 hardware at a single parallel level.
- A balanced approach allocates hardware so that each level of parallelism delivers its speedup at about the same efficiency, in terms of the cost of the hardware resources allocated.
- Numerous architectural approaches are possible, 
 each with different trade-offs and performance
 returns.
- Current technology is integrating DSP processing 
 with microcontroller functionality - achieving up
 to three levels of parallelism.
4. Classic Model of Parallel Processing
- Multiple processors available (4) 
- A process can be divided into serial and parallel portions 
- The parallel parts are executed concurrently 
- Serial time: 10 time units 
- Parallel time: 4 time units 
An example parallel process of time 10:
- S: the serial (non-parallel) portion
- A: all A parts can be executed concurrently
- B: all B parts can be executed concurrently
- All A parts must be completed prior to executing the B parts
[Diagram: the process executed on a single processor, and executed in parallel on 4 processors]
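The slide's timing can be checked with a short sketch. The specific decomposition used below (2 serial units, four 1-unit A parts, four 1-unit B parts) is an assumption; it is chosen only to be consistent with the stated totals of 10 serial and 4 parallel time units on 4 processors.

```python
# Assumed decomposition (not stated explicitly on the slide):
# 2 serial units, four 1-unit A parts, four 1-unit B parts.
S, A_parts, B_parts, procs = 2, [1] * 4, [1] * 4, 4

serial_time = S + sum(A_parts) + sum(B_parts)  # everything on one processor

# A parts run concurrently, then B parts; with parts <= processors,
# each phase takes as long as its longest part.
parallel_time = S + max(A_parts) + max(B_parts)

print(serial_time, parallel_time)   # 10 4
print(serial_time / parallel_time)  # speedup 2.5
```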
5. Amdahl's Law (Analytical Model)
- Analytical model of parallel speedup from the 1960s 
- Parallel fraction (α) is run over n processors, taking α/n time 
- The part that must be executed serially (1 - α) gets no speedup 
- Overall performance is limited by the fraction of the work that cannot be done in parallel (1 - α): Speedup = 1 / ((1 - α) + α/n)
- Diminishing returns with increasing processors (n)
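The diminishing returns are easy to tabulate. A minimal sketch of Amdahl's law, using f for the parallel fraction:

```python
def amdahl_speedup(f, n):
    """Amdahl's law: parallel fraction f spread over n processors,
    serial fraction (1 - f) unaffected."""
    return 1.0 / ((1.0 - f) + f / n)

# Even with f = 0.9, speedup is capped at 1 / (1 - 0.9) = 10,
# no matter how many processors are added.
for n in (2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```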
6. Pipelined Processing
- Single Processor enhanced with discrete stages 
- Instructions flow through pipeline stages 
- Parallel Speedup with multiple instructions being 
 executed (by parts) simultaneously
- Realized speedup is partly determined by the number of stages: 5 stages, at most 5 times faster
[Diagram: instructions advancing through the pipeline stages, one stage per cycle]
F - Instruction Fetch, D - Instruction Decode, OF - Operand Fetch, EX - Execute, WB - Write Back (Result Store). The processor clock/cycle is divided into sub-cycles; each stage takes one sub-cycle.
7. Pipeline Performance
- Speedup is serial time (nS) over parallel time 
- Performance is limited by the number of pipeline flushes (n) due to jumps 
- Speculative execution and branch prediction can minimize pipeline flushes 
- Performance is also reduced by pipeline stalls (s) due to bus-access conflicts, data-not-ready delays, and other sources
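A hedged sketch of pipeline performance. The cost model here (ideal pipelined time of k + (n - 1) sub-cycles for n instructions on k stages, one extra sub-cycle per stall, and a k - 1 sub-cycle refill per flush) is a common textbook simplification, not the slide's exact formula, which is not shown:

```python
def pipeline_speedup(k, n, stalls=0, flushes=0):
    """k-stage pipeline running n instructions.

    Serial time: n * k sub-cycles (each instruction uses every stage
    in turn). Pipelined time: k sub-cycles to fill, then one
    instruction completes per sub-cycle, plus stall and flush penalties.
    """
    serial = n * k
    pipelined = k + (n - 1) + stalls + flushes * (k - 1)
    return serial / pipelined

print(round(pipeline_speedup(5, 1000), 2))               # approaches 5
print(round(pipeline_speedup(5, 1000, flushes=100), 2))  # flushes cut speedup
```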
8. Super-Scalar: Multiple Pipelines
- Concurrent Execution of Multiple sets of 
 instructions
- Example: simultaneous execution of instructions through an integer pipeline while processing instructions through a floating-point pipeline
- Compiler identifies and specifies separate 
 instruction sets for concurrent execution through
 different pipes
9. Algorithm/Thread-Level Parallelism
- Example: algorithms to compute the Fast Fourier Transform (FFT), used in Digital Signal Processing (DSP)
-  Many separate computations in parallel (High 
 Degree Of Parallelism)
-  Large exchange of data - much communication 
 between processors
-  Fine-Grained Parallelism 
-  Communication time (latency) may be a consideration if multiple processors are combined on a board or motherboard 
-  Large communication load (fine-grained 
 parallelism) can force the algorithm to become
 bandwidth-bound rather than computation-bound.
10. Simple Algorithm/Thread Parallelism Model
[Diagram: two programs, P1 and P2, executing concurrently]
- Parallel threads of execution 
- could be a separate process 
- could be a multi-thread process 
- Each thread of execution obeys Amdahl's parallel speedup model
- Multiple concurrently executing processes 
 resulting in
-  Multiple serial components executing 
 concurrently - another level of parallelism
Observe that the serial parts of Program 1 and Program 2 are now running in parallel with each other. Each program would take 6 time units on a uniprocessor, for a total workload serial time of 12. Each has a speedup of 1.5. The total speedup is 12/4 = 3, which is also the sum of the program speedups.
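The arithmetic above can be checked directly; nothing below is assumed beyond the slide's own numbers:

```python
# Two programs of 6 time units each; each runs in 4 units in parallel
# (speedup 1.5), and the two run concurrently, so the whole workload
# finishes in 4 units.
t_uni, t_par, n_programs = 6, 4, 2

per_program_speedup = t_uni / t_par            # 6 / 4 = 1.5
total_speedup = (n_programs * t_uni) / t_par   # 12 / 4 = 3.0

# The total equals the sum of the individual program speedups.
print(per_program_speedup, total_speedup)
```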
11. Multiprocess Speedup
- Concurrent Execution of Multiple Processes 
- Each process is limited by Amdahl's parallel speedup
- Multiple concurrently executing processes 
 resulting in
-  Multiple serial components executing 
 concurrently - another level of parallelism
- Avoid Degree of Parallelism (DOP) speedup limitations 
- Linear scaling up to machine limits of processors and memory: n × single-process speedup
[Timing diagram: no speedup (uniprocessor): 12 t; single process: 8 t, speedup = 1.5; two processes: 4 t, speedup = 3]
12. Algorithm/Thread Parallelism - Analytical Model
Multi-process/thread speedup: α = fraction of work that can be done in parallel, n = number of processors, N = number of concurrent (assumed similar) processes or threads.

The same model applies to N concurrent (assumed dissimilar) processes or threads, with each process i having its own parallel fraction α_i and processor count n_i.
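The formula itself appears only as an image on the slide. One plausible reading, consistent with the earlier worked example (two concurrent programs with speedup 1.5 each combining to a total speedup of 3), is N times the Amdahl speedup of a single process:

```python
def multi_process_speedup(alpha, n, N):
    """N similar concurrent processes, each an Amdahl workload with
    parallel fraction alpha on n processors. A plausible reconstruction
    of the slide's image-only formula: N / ((1 - alpha) + alpha / n)."""
    return N / ((1.0 - alpha) + alpha / n)

# alpha = 2/3 and n = 2 are assumed values chosen so a single process
# gets speedup 1.5; two concurrent processes then get 3, matching the
# earlier worked example.
print(multi_process_speedup(2 / 3, 2, 1))  # 1.5
print(multi_process_speedup(2 / 3, 2, 2))  # 3.0
```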
13. (Simple) Unified Model with Scaled Speedup
Adds a scaling factor on the parallel work, while holding serial work constant: k1 = scaling factor on the parallel portion, α = fraction of work that can be done in parallel, n = number of processors, N = number of concurrent (assumed dissimilar) processes or threads.
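The scaled formula is also image-only on the slide. A plausible reconstruction, in the spirit of scaled (Gustafson-style) speedup applied to the multi-process model, grows the parallel work by k1 in both numerator and denominator:

```python
def unified_scaled_speedup(alpha, n, N, k1):
    """Scaled-speedup variant: parallel work is scaled by k1 while
    serial work stays constant. Reconstruction (assumed, not the
    slide's verbatim formula):
        N * ((1 - alpha) + k1 * alpha) / ((1 - alpha) + k1 * alpha / n)
    With k1 = 1 this reduces to the unscaled multi-process model."""
    return N * ((1 - alpha) + k1 * alpha) / ((1 - alpha) + k1 * alpha / n)

# Same assumed parameters as before: alpha = 2/3, n = 2.
print(unified_scaled_speedup(2 / 3, 2, 1, 1))   # 1.5, the unscaled value
print(unified_scaled_speedup(2 / 3, 2, 1, 10))  # scaling raises speedup
```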
14. Capturing Multiple Levels of Parallelism
- Most parallelism suffers from diminishing returns, resulting in limited scalability.
- Allocating hardware resources to capture multiple levels of parallelism lets each level operate at the efficient end of its speedup curve.
- Manufacturers of microcontrollers are integrating 
 multiple levels of parallelism on a single chip
15. Trend in Microprocessor Architectures
- Architectural Variations 
- DSP and microcontroller cores on same chip 
- DSP that also functions as the microprocessor 
-  Microprocessor that also performs DSP 
-  Multiprocessor 
- Each variation captures some speedup from all 
 three levels
- Varying amounts of speedup from each level 
- Each parallel level operates at a more efficient 
 level than if all hardware resources were
 allocated to a single parallel level
- 1. Intra-Instruction Parallelism: Pipelines 
- 2. Instruction-Level Parallelism: Super-Scalar, Multiple Pipelines
- 3. Algorithm/Thread Parallelism 
-  Multiple processing elements 
-  Integrated DSP with microcontroller 
-  Enhanced microcontroller to do DSP 
-  Enhanced DSP processor that also functions as a 
 microcontroller
16. More Levels of Parallelism Outside the Chip
- Multiple Processors in a box 
- on a motherboard 
-  on back-plane with daughter-boards 
- Shared-Memory Multiprocessors 
-  communication is through shared memory 
- Clustered Multiprocessors 
-  another hierarchical level 
-  processors are grouped into clusters 
-  intra-cluster can be bus or network 
-  inter-cluster can be bus or network 
- Distributed Multicomputers 
-  multiple computers loosely coupled through a 
 network
- n-tiered Architectures 
-  modern client/server architectures 
17. Speedup of Client-Server, 2-Tier Systems
- α - workload balance, % of workload on the clients 
-  α = 1 (100% on clients): completely distributed 
-  α = 0 (100% on servers): completely centralized 
- n clients, m servers
[Diagram: n clients connected over LAN/Internet/LAN to m servers]
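The slide's speedup formula is image-only. One plausible form, assuming the fraction α of the workload is spread evenly over the n clients and the remainder over the m servers (an assumption, not the slide's verbatim model):

```python
def two_tier_speedup(alpha, n, m):
    """Hypothetical 2-tier model: fraction alpha of the workload runs
    on n clients, the remaining (1 - alpha) on m servers, each share
    divided evenly, so
        speedup = 1 / (alpha / n + (1 - alpha) / m).
    alpha = 1: fully distributed; alpha = 0: fully centralized."""
    return 1.0 / (alpha / n + (1.0 - alpha) / m)

print(two_tier_speedup(1.0, 100, 1))  # fully distributed over 100 clients
print(two_tier_speedup(0.0, 100, 4))  # fully centralized on 4 servers
```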
18. Speedup of Client-Server, n-Tier Systems
- m1 level-1 machines (clients) 
- m2 level-2 servers, m3 level-3 servers, etc. 
- α1 - workload balance, % of workload on the clients 
- α2 - % of workload on level-2 servers, α3 - % on level-3 servers, etc.
[Diagram: m1 clients connected over LAN/Internet to server tiers m2, m3, m4, linked by LAN and SAN]
19. Hierarchy of Embedded Parallelism
- 1. N-tiered Client-Server Distributed Systems 
- 2. Clustered Multi-computers 
- 3. Clustered-Multiprocessor 
- 4. Multiple Processors on a Chip 
- 5. Multiple Processing Elements 
-  6. Multiple Pipelines 
-  7. Multiple Stages per Pipeline 
- Goals 
-  Single analytical model that captures 
 parallelism from all levels
-  Simulator that allows exploration 
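As a first approximation of the single-model goal: if each level's speedup were independent of the others, the overall speedup would be the product of the per-level speedups, which is why several modest, efficiently-captured speedups can beat one large, inefficient one. The level values below are illustrative assumptions, not measurements:

```python
# Illustrative (assumed) realized speedups at three of the levels.
levels = {
    "pipeline": 3.5,       # realized, out of an ideal 5 for 5 stages
    "superscalar": 1.6,    # second pipeline, partly utilized
    "multiprocessor": 2.8, # Amdahl-limited thread-level speedup
}

# Under the independence assumption, levels compose multiplicatively.
overall = 1.0
for name, s in levels.items():
    overall *= s
print(round(overall, 2))  # 15.68
```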
20. References
- K. Hoganson, "Alternative Mechanisms to Achieve Parallel Speedup", First IEEE Online Symposium for Electronics Engineers, IEEE Society, August 2000.
- K. Hoganson, "Mapping Parallel Application Communication Topology to Rhombic Overlapping-Cluster Multiprocessors", The Journal of Supercomputing, Vol. 17, No. 1, August 2000 (to appear).
- K. Hoganson, "Workload Execution Strategies and Parallel Speedup on Clustered Computers", IEEE Transactions on Computers, Vol. 48, No. 11, November 1999.
- Undergraduate Research Project: Unified Parallel System Modeling project, Directed Study, Summer-Fall 2000