Title: Extending the Unified Parallel Processing Speedup Model
1. Extending the Unified Parallel Processing Speedup Model
- Computer architectures take advantage of low-level parallelism: multiple pipelines.
- The next generations of integrated circuits will continue to support increasing numbers of transistors.
- How can the additional transistors be used efficiently?
- Answer: parallelism beyond multiple pipelines, adding multiple processors or processing components in a single chip or single package.
- At each level of parallelism, performance suffers from the law of diminishing returns outlined by Amdahl.
- Incorporating multiple levels of parallelism results in higher overall performance and efficiency.
2. Presentation Content
- A discussion of practical and theoretical parallel speedup, alternative methods, and the efficient use of hardware/processing resources in capturing speedup:
- Parallel Speedup / Amdahl's Law, Scaled Speedup
- Pipelined Processors
- Multiprocessors and Multicomputers
- Multiple concurrent threads
- Multiple concurrent processes
- Multiple levels of parallelism with integrated chips/packages that combine microcontrollers with Digital Signal Processing (DSP) chips
3. Presentation Summary
- Architects and chip manufacturers are integrating additional levels of parallelism.
- Multiple levels of speedup achieve higher speedups and greater efficiencies than increasing hardware at a single parallel level.
- A balanced approach would achieve about the same level of efficiency, in the cost of hardware resources allocated, in delivering parallel speedup at each level of parallelism.
- Numerous architectural approaches are possible, each with different trade-offs and performance returns.
- Current technology integrates DSP processing with microcontroller functionality, achieving up to three levels of parallelism.
4. Classic Model of Parallel Processing
- Multiple processors available (4)
- A process can be divided into serial and parallel portions; the parallel parts are executed concurrently
- Serial time: 10 time units
- Parallel time: 4 time units
An example parallel process of time 10:
- S: the serial (non-parallel) portion
- A: all A parts can be executed concurrently
- B: all B parts can be executed concurrently
- All A parts must be completed prior to executing the B parts
[Figure: the process executed on a single processor, and executed in parallel on 4 processors.]
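The stated times can be checked with a minimal sketch. It assumes one particular decomposition consistent with the slide's numbers: 2 serial units, four A units, and four B units (the exact split is not given in the text):

```python
import math

def phase_time(units, processors):
    """Time for a phase of equal one-unit parts spread over `processors` processors."""
    return math.ceil(units / processors)

# Assumed decomposition of the 10-unit process: 2 serial, 4 A, 4 B.
serial_units, a_units, b_units = 2, 4, 4

# Single processor: every unit runs one after another.
uniprocessor_time = serial_units + a_units + b_units

# Four processors: A parts run together, then B parts, serial part unchanged.
parallel_time = serial_units + phase_time(a_units, 4) + phase_time(b_units, 4)
```

Under this assumed split the totals come out to 10 and 4 time units, matching the slide.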
5. Amdahl's Law (Analytical Model)
- Analytical model of parallel speedup from the 1960s.
- The parallel fraction (α) is run over n processors, taking α/n time.
- The part that must be executed in serial (1 - α) gets no speedup.
- Speedup(n) = 1 / ((1 - α) + α/n)
- Overall performance is limited by the fraction of the work that cannot be done in parallel (1 - α): diminishing returns with increasing processors (n).
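The diminishing returns are easy to see numerically; a direct implementation of Amdahl's formula:

```python
def amdahl_speedup(alpha, n):
    """Amdahl's Law: speedup of a workload with parallel fraction alpha on n processors."""
    return 1.0 / ((1.0 - alpha) + alpha / n)

# With alpha = 0.9, no number of processors can exceed 1 / (1 - alpha) = 10x:
# quadrupling from 4 to 16 processors gains far less than 4x.
s4 = amdahl_speedup(0.9, 4)
s16 = amdahl_speedup(0.9, 16)
s_large = amdahl_speedup(0.9, 10**6)
```

Here s4 is about 3.1 and s16 about 6.4, while even a million processors stay below the 10x ceiling.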
6. Pipelined Processing
- Single processor enhanced with discrete stages.
- Instructions flow through the pipeline stages.
- Parallel speedup comes from multiple instructions being executed (by parts) simultaneously.
- Realized speedup is partly determined by the number of stages: with 5 stages, at most 5 times faster.
[Figure: instructions advancing through the stages F, D, OF, EX, WB over cycles 1-5.]
- F = Instruction Fetch, D = Instruction Decode, OF = Operand Fetch, EX = Execute, WB = Write Back (Result Store).
- The processor clock/cycle is divided into sub-cycles; each stage takes one sub-cycle.
7. Pipeline Performance
- Speedup is serial time (nS) over parallel time.
- Performance is limited by the number of pipeline flushes (n) due to jumps.
- Speculative execution and branch prediction can minimize pipeline flushes.
- Performance is also reduced by pipeline stalls (s) due to conflicts over bus access, data-not-ready delays, and other sources.
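These effects can be put into a small model. This is a sketch under assumed definitions (k pipeline stages, a stream of instructions, each flush assumed to cost k - 1 sub-cycles to refill the pipeline, each stall one sub-cycle), not the slide's exact formula:

```python
def pipeline_speedup(k, instructions, flushes=0, stalls=0):
    """Speedup of a k-stage pipeline over unpipelined serial execution.

    Serial time: k sub-cycles per instruction.
    Pipelined time: k sub-cycles to fill, then one instruction per sub-cycle,
    plus assumed penalties of (k - 1) per flush and 1 per stall.
    """
    serial_time = k * instructions
    pipelined_time = k + (instructions - 1) + flushes * (k - 1) + stalls
    return serial_time / pipelined_time

# Over a long run the ideal speedup approaches k; hazards erode it.
ideal = pipeline_speedup(5, 10_000)
with_hazards = pipeline_speedup(5, 10_000, flushes=500, stalls=1_000)
```

With no hazards the 5-stage pipeline approaches its 5x ceiling; the flush and stall penalties pull the realized speedup well below it.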
8. Super-Scalar: Multiple Pipelines
- Concurrent execution of multiple sets of instructions.
- Example: simultaneous execution of instructions through an integer pipeline while processing instructions through a floating-point pipeline.
- The compiler identifies and specifies separate instruction sets for concurrent execution through different pipes.
9. Algorithm/Thread-Level Parallelism
- Example: algorithms to compute the Fast Fourier Transform (FFT), used in Digital Signal Processing (DSP).
- Many separate computations in parallel (a high Degree Of Parallelism).
- Large exchanges of data and much communication between processors: fine-grained parallelism.
- Communication time (latency) may be a consideration if multiple processors are combined on a board or motherboard.
- A large communication load (fine-grained parallelism) can force the algorithm to become bandwidth-bound rather than computation-bound.
10. Simple Algorithm/Thread Parallelism Model
[Figure: two programs, P1 and P2, each with serial and parallel portions, executing concurrently.]
- Parallel threads of execution:
- each could be a separate process
- or a thread of a multi-threaded process
- Each thread of execution obeys Amdahl's parallel speedup model.
- Multiple concurrently executing processes result in multiple serial components executing concurrently: another level of parallelism.
Observe that the serial parts of Program 1 and Program 2 are now running in parallel with each other. Each program would take 6 time units on a uniprocessor, for a total workload serial time of 12. Each has a speedup of 1.5. The total speedup is 12/4 = 3, which is also the sum of the program speedups.
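The worked example can be verified in a few lines. This sketch assumes each 6-unit program splits into 2 serial and 4 parallel units, and that the two programs share 4 processors, 2 each (an assumed split consistent with the stated times):

```python
def program_time(serial, parallel, processors):
    """Run time of one program: serial part plus parallel part spread over processors."""
    return serial + parallel / processors

serial, parallel = 2, 4                      # assumed split of each 6-unit program
uniprocessor_each = serial + parallel        # 6 time units per program
workload_serial_time = 2 * uniprocessor_each # 12 time units for both

# One program on its 2 processors: 2 + 4/2 = 4 time units, speedup 6/4.
single_speedup = uniprocessor_each / program_time(serial, parallel, 2)

# Both programs run concurrently and finish together at time 4: speedup 12/4.
total_speedup = workload_serial_time / program_time(serial, parallel, 2)
```

This reproduces the slide's numbers: each program speeds up by 1.5, and the concurrent workload by 3.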
11. Multiprocess Speedup
- Concurrent execution of multiple processes.
- Each process is limited by Amdahl's parallel speedup.
- Multiple concurrently executing processes result in multiple serial components executing concurrently: another level of parallelism.
- Avoids Degree of Parallelism (DOP) speedup limitations.
- Linear scaling up to the machine limits of processors and memory: n times the single-process speedup.
[Figure: no speedup on a uniprocessor, 12 t; a single process, 8 t, speedup 1.5; two concurrent processes, 4 t, speedup 3.]
12. Algorithm/Thread Parallelism: Analytical Model
Multi-process/thread speedup (similar case):
- α = fraction of work that can be done in parallel
- n = number of processors
- N = number of concurrent (assumed similar) processes or threads
Multi-process/thread speedup (general case): the same parameters, with the N concurrent processes or threads assumed dissimilar.
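The slide's formula images are not reproduced in this text. A plausible reconstruction for the similar-process case, assuming each of the N processes follows Amdahl's model and the n processors are shared evenly among them, is:

```latex
S_{N} = \frac{N}{(1-\alpha) + \dfrac{\alpha N}{n}}
```

With N = 2, α = 2/3, and n = 4, this gives S = 3, matching the two-program example on the previous slides; with N = 1 it reduces to Amdahl's Law. Treat it as a reconstruction, not the author's exact formula.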
13. (Simple) Unified Model with Scaled Speedup
Adds a scaling factor on the parallel work, while holding the serial work constant:
- k ≥ 1 = scaling factor on the parallel portion
- α = fraction of work that can be done in parallel
- n = number of processors
- N = number of concurrent (assumed dissimilar) processes or threads
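As a sketch of how these parameters could combine, assuming the scaled model multiplies each process's parallel work by k while its serial work stays fixed (an analogy with scaled speedup, not a reproduction of the slide's formula image):

```python
def scaled_multiprocess_speedup(alpha, n, N, k=1.0):
    """Assumed unified model: N Amdahl-style processes sharing n processors,
    with the parallel portion of each process scaled by k >= 1."""
    scaled_work = (1.0 - alpha) + k * alpha       # work per process after scaling
    time = (1.0 - alpha) + (k * alpha * N) / n    # overlapped serial + shared parallel time
    return N * scaled_work / time

# k = 1 reduces to the plain multi-process model (speedup 3 in the earlier example);
# scaling the parallel work (k > 1) raises the achievable speedup.
base = scaled_multiprocess_speedup(2/3, 4, 2)
scaled = scaled_multiprocess_speedup(2/3, 4, 2, k=4.0)
```

The useful property of this form is that the serial term is diluted as the parallel workload grows, which is the usual motivation for scaled-speedup models.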
14. Capturing Multiple Levels of Parallelism
- Most parallelism suffers from diminishing returns, resulting in limited scalability.
- Allocating hardware resources to capture multiple levels of parallelism keeps each level operating at the efficient end of its speedup curve.
- Manufacturers of microcontrollers are integrating multiple levels of parallelism on a single chip.
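The efficiency argument can be illustrated with the two models above. The parameter values here (α = 0.9, 16 processors, 4 processes) are hypothetical, chosen only to show the effect, not taken from the slides:

```python
def amdahl_speedup(alpha, n):
    """Single process with parallel fraction alpha on n processors."""
    return 1.0 / ((1.0 - alpha) + alpha / n)

def multiprocess_speedup(alpha, n, N):
    """N similar Amdahl-style processes sharing n processors (assumed model)."""
    return N / ((1.0 - alpha) + alpha * N / n)

alpha, n = 0.9, 16
single_level = amdahl_speedup(alpha, n)        # all 16 processors on one process
two_level = multiprocess_speedup(alpha, n, 4)  # 4 concurrent processes sharing 16
```

With the same 16 processors, the single-level speedup is 6.4, while splitting the hardware across two levels of parallelism nearly doubles it, because the serial parts of the four processes overlap.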
15. Trend in Microprocessor Architectures
- Architectural variations:
- DSP and microcontroller cores on the same chip
- a DSP that also functions as a microprocessor
- a microprocessor that also does DSP
- a multiprocessor
- Each variation captures some speedup from all three levels, with varying amounts of speedup from each level.
- Each parallel level operates at a more efficient point than if all hardware resources were allocated to a single parallel level.
- The three levels:
- 1. Intra-instruction parallelism: pipelines
- 2. Instruction-level parallelism: super-scalar, multiple pipelines
- 3. Algorithm/thread parallelism: multiple processing elements
- integrated DSP with microcontroller
- an enhanced microcontroller that does DSP
- an enhanced DSP processor that also functions as a microcontroller
16. More Levels of Parallelism Outside the Chip
- Multiple processors in a box
- on a motherboard
- on a back-plane with daughter-boards
- Shared-memory multiprocessors
- communication is through shared memory
- Clustered multiprocessors
- another hierarchical level
- processors are grouped into clusters
- intra-cluster communication can be bus or network
- inter-cluster communication can be bus or network
- Distributed multicomputers
- multiple computers loosely coupled through a network
- n-tiered architectures
- modern client/server architectures
17. Speedup of Client-Server, 2-Tier Systems
- β = workload balance: the fraction of the workload on the clients
- β = 1 (100% on clients): completely distributed
- β = 0 (workload entirely on the servers): completely centralized
- n clients, m servers
[Figure: n clients connected through a LAN and the Internet to a LAN of m servers.]
18. Speedup of Client-Server, n-Tier Systems
- m1 level-1 machines (clients)
- m2 level-2 servers, m3 level-3 servers, etc.
- β1 = workload balance: the fraction of the workload on the clients
- β2 = fraction of the workload on the level-2 servers, β3 = fraction on the level-3 servers, etc.
[Figure: m1 clients connected through the Internet, LANs, and a SAN to tiers of m2, m3, and m4 servers.]
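The tier parameters suggest a simple speedup model. This is one plausible reading, assuming each tier's workload fraction is split evenly over that tier's machines and the tiers contribute additively to total time; the slide's own formula image is not reproduced in the text:

```python
def n_tier_speedup(betas, machines):
    """Assumed n-tier model: workload fraction betas[i] is handled by
    machines[i] identical machines at tier i; tier times add up
    (a simplifying assumption, not the slide's exact formula)."""
    assert abs(sum(betas) - 1.0) < 1e-9, "workload fractions must sum to 1"
    return 1.0 / sum(b / m for b, m in zip(betas, machines))

# 2-tier special case: fraction beta on n clients, (1 - beta) on m servers.
two_tier = n_tier_speedup([0.8, 0.2], [100, 4])   # 80% on 100 clients, 20% on 4 servers
three_tier = n_tier_speedup([0.6, 0.3, 0.1], [100, 10, 2])
```

In this reading, a heavily distributed workload (large β on many clients) scales well, while any fraction concentrated on a small server tier dominates the total time, the same serial-bottleneck behavior as Amdahl's Law.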
19. Hierarchy of Embedded Parallelism
- 1. N-tiered client-server distributed systems
- 2. Clustered multicomputers
- 3. Clustered multiprocessors
- 4. Multiple processors on a chip
- 5. Multiple processing elements
- 6. Multiple pipelines
- 7. Multiple stages per pipeline
- Goals:
- a single analytical model that captures parallelism from all levels
- a simulator that allows exploration
20. References
- K. Hoganson, "Alternative Mechanisms to Achieve Parallel Speedup", First IEEE Online Symposium for Electronics Engineers, IEEE Society, August 2000.
- K. Hoganson, "Mapping Parallel Application Communication Topology to Rhombic Overlapping-Cluster Multiprocessors", The Journal of Supercomputing, Vol. 17, No. 1, August 2000 (accepted, to appear).
- K. Hoganson, "Workload Execution Strategies and Parallel Speedup on Clustered Computers", IEEE Transactions on Computers, Vol. 48, No. 11, November 1999.
- Undergraduate Research Project: Unified Parallel System Modeling project, Directed Study, Summer-Fall 2000.