Title: Extending the Unified Parallel Processing Speedup Model
1. Extending the Unified Parallel Processing Speedup Model
- Computer architectures take advantage of low-level parallelism, such as multiple pipelines.
- The next generations of integrated circuits will 
 continue to support increasing numbers of
 transistors.
- How to make efficient use of the additional 
 transistors?
- Answer: parallelism beyond multiple pipelines, adding multiple processors or processing components in a single chip or single package.
- At each level of parallelism, performance suffers from the law of diminishing returns outlined by Amdahl.
- Incorporating multiple levels of parallelism 
 results in higher overall performance and
 efficiency.
2. Presentation Content
- A discussion of practical and theoretical parallel speedup, alternative methods, and the efficient use of hardware/processing resources in capturing speedup.
-  Parallel Speedup/Amdahl's Law, Scaled Speedup 
-  Pipelined Processors 
-  Multiprocessors and Multicomputers 
-  Multiple concurrent threads 
-  Multiple concurrent processes 
-  Multiple levels of parallelism with integrated 
 chips/packages that combine microcontrollers with
 Digital Signal Processing chips
3. Presentation Summary
- Architects/Chip-Manufacturers are integrating 
 additional levels of parallelism.
- Multiple levels of speedup achieve higher 
 speedups and greater efficiencies than increasing
 hardware at a single parallel level.
- A balanced approach allocates hardware so that each level of parallelism delivers its speedup at about the same efficiency, in terms of the cost of the hardware resources allocated.
- Numerous architectural approaches are possible, 
 each with different trade-offs and performance
 returns.
- Current technology is integrating DSP processing 
 with microcontroller functionality - achieving up
 to three levels of parallelism.
4. Classic Model of Parallel Processing
- Multiple processors available (4) 
- A process can be divided into serial and parallel portions 
- The parallel parts are executed concurrently 
- Serial time: 10 time units 
- Parallel time: 4 time units 
An example parallel process of time 10:
- S: the serial (non-parallel) portion
- A: all A parts can be executed concurrently
- B: all B parts can be executed concurrently
- All A parts must be completed prior to executing the B parts
[Diagram: the process executed on a single processor, and executed in parallel on 4 processors]
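The slide's timing can be checked with a short sketch. The specific decomposition used below (2 serial units, four 1-unit A parts, four 1-unit B parts) is an assumption; it is chosen only to be consistent with the stated totals of 10 serial and 4 parallel time units on 4 processors.

```python
# Assumed decomposition (not stated explicitly on the slide):
# 2 serial units, four 1-unit A parts, four 1-unit B parts.
S, A_parts, B_parts, procs = 2, [1] * 4, [1] * 4, 4

serial_time = S + sum(A_parts) + sum(B_parts)  # everything on one processor

# A parts run concurrently, then B parts; with parts <= processors,
# each phase takes as long as its longest part.
parallel_time = S + max(A_parts) + max(B_parts)

print(serial_time, parallel_time)   # 10 4
print(serial_time / parallel_time)  # speedup 2.5
```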
5. Amdahl's Law (Analytical Model)
- Analytical model of parallel speedup from the 1960s 
- Parallel fraction (α) is run over n processors, taking α/n time 
- The part that must be executed serially (1 - α) gets no speedup 
- Overall performance is limited by the fraction of the work that cannot be done in parallel (1 - α): Speedup = 1 / ((1 - α) + α/n)
- Diminishing returns with increasing processors (n)
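The diminishing returns are easy to tabulate. A minimal sketch of Amdahl's law, using f for the parallel fraction:

```python
def amdahl_speedup(f, n):
    """Amdahl's law: parallel fraction f spread over n processors,
    serial fraction (1 - f) unaffected."""
    return 1.0 / ((1.0 - f) + f / n)

# Even with f = 0.9, speedup is capped at 1 / (1 - 0.9) = 10,
# no matter how many processors are added.
for n in (2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```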
6. Pipelined Processing
- Single Processor enhanced with discrete stages 
- Instructions flow through pipeline stages 
- Parallel Speedup with multiple instructions being 
 executed (by parts) simultaneously
- Realized speedup is partly determined by the number of stages: 5 stages, at most 5 times faster
[Diagram: instructions advancing through the pipeline stages, one stage per cycle]
F - Instruction Fetch, D - Instruction Decode, OF - Operand Fetch, EX - Execute, WB - Write Back (Result Store). The processor clock/cycle is divided into sub-cycles; each stage takes one sub-cycle.
7. Pipeline Performance
- Speedup is serial time (nS) over parallel time 
- Performance is limited by the number of pipeline flushes (n) due to jumps 
- Speculative execution and branch prediction can minimize pipeline flushes 
- Performance is also reduced by pipeline stalls (s) due to bus-access conflicts, data-not-ready delays, and other sources
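A hedged sketch of pipeline performance. The cost model here (ideal pipelined time of k + (n - 1) sub-cycles for n instructions on k stages, one extra sub-cycle per stall, and a k - 1 sub-cycle refill per flush) is a common textbook simplification, not the slide's exact formula, which is not shown:

```python
def pipeline_speedup(k, n, stalls=0, flushes=0):
    """k-stage pipeline running n instructions.

    Serial time: n * k sub-cycles (each instruction uses every stage
    in turn). Pipelined time: k sub-cycles to fill, then one
    instruction completes per sub-cycle, plus stall and flush penalties.
    """
    serial = n * k
    pipelined = k + (n - 1) + stalls + flushes * (k - 1)
    return serial / pipelined

print(round(pipeline_speedup(5, 1000), 2))               # approaches 5
print(round(pipeline_speedup(5, 1000, flushes=100), 2))  # flushes cut speedup
```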
8. Super-Scalar: Multiple Pipelines
- Concurrent Execution of Multiple sets of 
 instructions
- Example: simultaneous execution of instructions through an integer pipeline while processing instructions through a floating-point pipeline
- Compiler identifies and specifies separate 
 instruction sets for concurrent execution through
 different pipes
9. Algorithm/Thread-Level Parallelism
- Example: algorithms to compute the Fast Fourier Transform (FFT), used in Digital Signal Processing (DSP)
-  Many separate computations in parallel (High 
 Degree Of Parallelism)
-  Large exchange of data - much communication 
 between processors
-  Fine-Grained Parallelism 
-  Communication time (latency) may be a consideration if multiple processors are combined on a board or motherboard 
-  Large communication load (fine-grained 
 parallelism) can force the algorithm to become
 bandwidth-bound rather than computation-bound.
10. Simple Algorithm/Thread Parallelism Model
[Diagram: two programs, P1 and P2, executing concurrently]
- Parallel threads of execution 
- could be a separate process 
- could be a multi-thread process 
- Each thread of execution obeys Amdahl's parallel speedup model
- Multiple concurrently executing processes 
 resulting in
-  Multiple serial components executing 
 concurrently - another level of parallelism
Observe that the serial parts of Program 1 and Program 2 are now running in parallel with each other. Each program would take 6 time units on a uniprocessor, for a total workload serial time of 12. Each has a speedup of 1.5. The total speedup is 12/4 = 3, which is also the sum of the program speedups.
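The arithmetic above can be checked directly; nothing below is assumed beyond the slide's own numbers:

```python
# Two programs of 6 time units each; each runs in 4 units in parallel
# (speedup 1.5), and the two run concurrently, so the whole workload
# finishes in 4 units.
t_uni, t_par, n_programs = 6, 4, 2

per_program_speedup = t_uni / t_par            # 6 / 4 = 1.5
total_speedup = (n_programs * t_uni) / t_par   # 12 / 4 = 3.0

# The total equals the sum of the individual program speedups.
print(per_program_speedup, total_speedup)
```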
11. Multiprocess Speedup
- Concurrent Execution of Multiple Processes 
- Each process is limited by Amdahl's parallel speedup
- Multiple concurrently executing processes 
 resulting in
-  Multiple serial components executing 
 concurrently - another level of parallelism
- Avoid Degree of Parallelism (DOP) speedup limitations 
- Linear scaling up to machine limits of processors and memory: n × single-process speedup
[Timing diagram: no speedup (uniprocessor): 12 t; single process: 8 t, speedup = 1.5; two processes: 4 t, speedup = 3]
12. Algorithm/Thread Parallelism - Analytical Model
Multi-process/thread speedup: α = fraction of work that can be done in parallel, n = number of processors, N = number of concurrent (assumed similar) processes or threads.

The same model applies to N concurrent (assumed dissimilar) processes or threads, with each process i having its own parallel fraction α_i and processor count n_i.
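The formula itself appears only as an image on the slide. One plausible reading, consistent with the earlier worked example (two concurrent programs with speedup 1.5 each combining to a total speedup of 3), is N times the Amdahl speedup of a single process:

```python
def multi_process_speedup(alpha, n, N):
    """N similar concurrent processes, each an Amdahl workload with
    parallel fraction alpha on n processors. A plausible reconstruction
    of the slide's image-only formula: N / ((1 - alpha) + alpha / n)."""
    return N / ((1.0 - alpha) + alpha / n)

# alpha = 2/3 and n = 2 are assumed values chosen so a single process
# gets speedup 1.5; two concurrent processes then get 3, matching the
# earlier worked example.
print(multi_process_speedup(2 / 3, 2, 1))  # 1.5
print(multi_process_speedup(2 / 3, 2, 2))  # 3.0
```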
13. (Simple) Unified Model with Scaled Speedup
Adds a scaling factor on the parallel work, while holding serial work constant: k1 = scaling factor on the parallel portion, α = fraction of work that can be done in parallel, n = number of processors, N = number of concurrent (assumed dissimilar) processes or threads.
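The scaled formula is also image-only on the slide. A plausible reconstruction, in the spirit of scaled (Gustafson-style) speedup applied to the multi-process model, grows the parallel work by k1 in both numerator and denominator:

```python
def unified_scaled_speedup(alpha, n, N, k1):
    """Scaled-speedup variant: parallel work is scaled by k1 while
    serial work stays constant. Reconstruction (assumed, not the
    slide's verbatim formula):
        N * ((1 - alpha) + k1 * alpha) / ((1 - alpha) + k1 * alpha / n)
    With k1 = 1 this reduces to the unscaled multi-process model."""
    return N * ((1 - alpha) + k1 * alpha) / ((1 - alpha) + k1 * alpha / n)

# Same assumed parameters as before: alpha = 2/3, n = 2.
print(unified_scaled_speedup(2 / 3, 2, 1, 1))   # 1.5, the unscaled value
print(unified_scaled_speedup(2 / 3, 2, 1, 10))  # scaling raises speedup
```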
14. Capturing Multiple Levels of Parallelism
- Most parallelism suffers from diminishing returns, resulting in limited scalability.
- Allocating hardware resources to capture multiple levels of parallelism lets each level operate at the efficient end of its speedup curve.
- Manufacturers of microcontrollers are integrating 
 multiple levels of parallelism on a single chip
15. Trend in Microprocessor Architectures
- Architectural Variations 
- DSP and microcontroller cores on same chip 
- DSP that also functions as the microprocessor 
-  Microprocessor that also performs DSP 
-  Multiprocessor 
- Each variation captures some speedup from all 
 three levels
- Varying amounts of speedup from each level 
- Each parallel level operates at a more efficient 
 level than if all hardware resources were
 allocated to a single parallel level
- 1. Intra-Instruction Parallelism: Pipelines 
- 2. Instruction-Level Parallelism: Super-Scalar, Multiple Pipelines
- 3. Algorithm/Thread Parallelism 
-  Multiple processing elements 
-  Integrated DSP with microcontroller 
-  Enhanced microcontroller to do DSP 
-  Enhanced DSP processor that also functions as a 
 microcontroller
16. More Levels of Parallelism Outside the Chip
- Multiple Processors in a box 
- on a motherboard 
-  on back-plane with daughter-boards 
- Shared-Memory Multiprocessors 
-  communication is through shared memory 
- Clustered Multiprocessors 
-  another hierarchical level 
-  processors are grouped into clusters 
-  intra-cluster can be bus or network 
-  inter-cluster can be bus or network 
- Distributed Multicomputers 
-  multiple computers loosely coupled through a 
 network
- n-tiered Architectures 
-  modern client/server architectures 
17. Speedup of Client-Server, 2-Tier Systems
- α - workload balance, % of workload on the clients 
-  α = 1 (100% on clients): completely distributed 
-  α = 0 (100% on servers): completely centralized 
- n clients, m servers
[Diagram: n clients connected over LAN/Internet/LAN to m servers]
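The slide's speedup formula is image-only. One plausible form, assuming the fraction α of the workload is spread evenly over the n clients and the remainder over the m servers (an assumption, not the slide's verbatim model):

```python
def two_tier_speedup(alpha, n, m):
    """Hypothetical 2-tier model: fraction alpha of the workload runs
    on n clients, the remaining (1 - alpha) on m servers, each share
    divided evenly, so
        speedup = 1 / (alpha / n + (1 - alpha) / m).
    alpha = 1: fully distributed; alpha = 0: fully centralized."""
    return 1.0 / (alpha / n + (1.0 - alpha) / m)

print(two_tier_speedup(1.0, 100, 1))  # fully distributed over 100 clients
print(two_tier_speedup(0.0, 100, 4))  # fully centralized on 4 servers
```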
18. Speedup of Client-Server, n-Tier Systems
- m1 level-1 machines (clients) 
- m2 level-2 servers, m3 level-3 servers, etc. 
- α1 - workload balance, % of workload on the clients 
- α2 - % of workload on level-2 servers, α3 - % on level-3 servers, etc.
[Diagram: m1 clients connected over LAN/Internet to server tiers m2, m3, m4, linked by LAN and SAN]
19. Hierarchy of Embedded Parallelism
- 1. N-tiered Client-Server Distributed Systems 
- 2. Clustered Multi-computers 
- 3. Clustered-Multiprocessor 
- 4. Multiple Processors on a Chip 
- 5. Multiple Processing Elements 
-  6. Multiple Pipelines 
-  7. Multiple Stages per Pipeline 
- Goals 
-  Single analytical model that captures 
 parallelism from all levels
-  Simulator that allows exploration 
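As a first approximation of the single-model goal: if each level's speedup were independent of the others, the overall speedup would be the product of the per-level speedups, which is why several modest, efficiently-captured speedups can beat one large, inefficient one. The level values below are illustrative assumptions, not measurements:

```python
# Illustrative (assumed) realized speedups at three of the levels.
levels = {
    "pipeline": 3.5,       # realized, out of an ideal 5 for 5 stages
    "superscalar": 1.6,    # second pipeline, partly utilized
    "multiprocessor": 2.8, # Amdahl-limited thread-level speedup
}

# Under the independence assumption, levels compose multiplicatively.
overall = 1.0
for name, s in levels.items():
    overall *= s
print(round(overall, 2))  # 15.68
```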
20. References
- K. Hoganson, "Alternative Mechanisms to Achieve Parallel Speedup", First IEEE Online Symposium for Electronics Engineers, IEEE Society, August 2000.
- K. Hoganson, "Mapping Parallel Application Communication Topology to Rhombic Overlapping-Cluster Multiprocessors", The Journal of Supercomputing, Vol. 17, No. 1, August 2000 (to appear).
- K. Hoganson, "Workload Execution Strategies and Parallel Speedup on Clustered Computers", IEEE Transactions on Computers, Vol. 48, No. 11, November 1999.
- Undergraduate Research Project: Unified Parallel System Modeling project, Directed Study, Summer-Fall 2000