Title: Parallel Programming Platforms
1 Parallel Programming Platforms
Chapter 2
References: http://www-users.cs.umn.edu/karypis/parbook/ , http://www.eel.tsint.edu.tw/teacher/ttsu/teach01.htm
2 Introduction
- The traditional logical view of a sequential computer consists of a memory connected to a processor via a datapath.
- All three of these components (processor, memory, and datapath) present bottlenecks to the overall processing rate of a computer system.
3 Introduction
- A number of architectural innovations over the years have addressed these bottlenecks. One of the most important innovations is multiplicity in
  - processor units,
  - datapaths, and
  - memory units.
- This multiplicity is either entirely hidden from the programmer, as in the case of implicit parallelism, or exposed to the programmer in different forms.
4 Introduction
- Learning objectives in this chapter:
  - An overview of important architectural concepts as they relate to parallel processing.
  - Sufficient detail for programmers to be able to write efficient code on a variety of platforms.
  - Cost models and abstractions for quantifying the performance of various parallel algorithms and for identifying bottlenecks resulting from various programming constructs.
5 Introduction
- Parallelizing sub-optimal serial code often has the undesirable effects of unreliable speedups and misleading runtimes.
- The chapter therefore advocates optimizing the serial performance of code before attempting parallelization.
- The tasks of serial and parallel optimization often have very similar characteristics.
6 Outline
- Implicit Parallelism
- Limitations of Memory System Performance
- Dichotomy of Parallel Computing Platforms
- Physical Organization of Parallel Platforms
- Communication Costs in Parallel Machines
- Routing Mechanisms for Interconnection Networks
- Impact of Process-Processor Mapping and Mapping Techniques
- Case Studies
7 Implicit Parallelism
- Trends in Microprocessor Architecture
- Pipelining and Superscalar Execution
- Very Long Instruction Word (VLIW) Processors
8 Trends in Microprocessor Architecture
- Clock speeds of microprocessors have posted impressive gains, two to three orders of magnitude over the past 20 years.
- However, these gains are severely diluted by the limitations of memory technology.
- Consequently, techniques that enable the execution of multiple instructions in a single clock cycle have become popular.
9 Trends in Microprocessor Architecture
- Mechanisms used by various processors for supporting multiple instruction execution:
  - Pipelining and Superscalar Execution
  - Very Long Instruction Word (VLIW) Processors
10 Pipelining and Superscalar Execution
- By overlapping various stages of instruction execution, pipelining enables faster execution.
- To increase the speed of a single pipeline, one would break the tasks down into smaller and smaller units, thus lengthening the pipeline and increasing the overlap in execution.
11 Pipelining and Superscalar Execution
- For example, the Pentium 4, which operates at 2.0 GHz, has a 20-stage pipeline.
- Long instruction pipelines therefore need effective techniques for predicting branch destinations so that pipelines can be speculatively filled.
- An obvious way to improve the instruction execution rate beyond this level is to use multiple pipelines.
- During each clock cycle, multiple instructions are piped into the processor in parallel.
12 Superscalar Execution: Example 2.1
Example of a two-way superscalar execution of instructions.
13 Consider the first code fragment in Fig. 2.1(a):
- t0: The first and second instructions are independent and can therefore be issued concurrently.
- t1: The next two instructions (rows 3 and 4) are also mutually independent, although they must execute after the first two instructions issued at t0. They can be issued concurrently at t1 since the processor is pipelined.
- t2: Only the add instruction is issued.
- t3: Only the store instruction is issued.
- The last two instructions (rows 5 and 6) cannot be executed concurrently, since the result of the former is used by the latter.
14 Superscalar Execution
- The scheduling of instructions is determined by a number of factors:
  - True Data Dependency: The result of one operation is an input to the next.
  - Resource Dependency: Two operations require the same resource.
  - Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
15 Superscalar Execution
- Scheduling of instructions is determined by the factors above.
- The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
- The complexity of this hardware is an important constraint on superscalar processors.
16 Dependency: True Data Dependency
- The results of one instruction are required by subsequent instructions.
- Consider the second code fragment: there is a true data dependency between load R1, @1000 and add R1, @1004.
- Since dependency resolution is done at runtime, it must be supported in hardware. The complexity of this hardware can be high.
- The amount of instruction-level parallelism in a program is often limited and is a function of coding technique.
17 Dependency: True Data Dependency
- In the second code fragment, there can be no simultaneous issue, leading to poor resource utilization.
- The third code fragment illustrates that in many cases it is possible to extract more parallelism by reordering the instructions and by altering the code.
- The code reorganization corresponds to exposing parallelism in a form that can be used by the instruction issue mechanism; a C-level sketch follows below.
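As a loose C-level illustration (an assumption for this slide set, not the book's exact assembly fragments), the sketch below contrasts a dependent chain of additions, which serializes issue, with a reordered version whose two partial sums are independent and could be issued together on a two-way superscalar processor:

    /* Dependent chain: each add must wait for the previous result. */
    double sum_dependent(const double *x) {
        double s = x[0];
        s += x[1];
        s += x[2];
        s += x[3];
        return s;
    }

    /* Reordered: s0 and s1 are independent and can issue in the same cycle. */
    double sum_reordered(const double *x) {
        double s0 = x[0] + x[1];
        double s1 = x[2] + x[3];
        return s0 + s1;
    }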
18 Dependency: Resource Dependency
- The form of dependency in which two instructions compete for a single processor resource.
- As an example, consider the co-scheduling of two floating-point operations on a dual-issue machine with a single floating-point unit.
- Although there might be no data dependencies between the instructions, they cannot be scheduled together since both need the floating-point unit.
19 Dependency: Branch or Procedural Dependencies
- Since the branch destination is known only at the point of execution, scheduling instructions a priori across branches may lead to errors.
- These dependencies are referred to as branch or procedural dependencies and are typically handled by speculatively scheduling across branches and rolling back in case of errors.
20 Dependency: Branch or Procedural Dependencies
- On average, a branch instruction is encountered every five to six instructions.
- Therefore, just as in populating instruction pipelines, accurate branch prediction is critical for efficient superscalar execution.
- The ability of a processor to detect and execute concurrent instructions is critical to superscalar performance.
21 Dependency: Branch or Procedural Dependencies
- The third code fragment is merely a semantically equivalent reordering of the first code fragment. However, there is a data dependency between load R1, @1000 and add R1, @1004.
- Therefore, these instructions cannot be issued simultaneously. However, if the processor has the ability to look ahead, it would realize that it is possible to schedule the third instruction with the first instruction.
- In this way, the same execution schedule can be derived for the first and third code fragments. However, the processor needs the ability to issue instructions out of order to accomplish the desired ordering.
22 Dependency: Branch or Procedural Dependencies
- Most current microprocessors are capable of out-of-order issue and completion.
- This model, also referred to as dynamic instruction issue, exploits maximum instruction-level parallelism. The processor uses a window of instructions from which it selects instructions for simultaneous issue. This window corresponds to the look-ahead of the scheduler (dynamic dependency analysis).
23 Dependency: Branch or Procedural Dependencies
- In Fig. 2.1(c):
  - These are essentially wasted cycles from the point of view of the execution unit. If, during a particular cycle, no instructions are issued on the execution units, it is referred to as vertical waste. If only part of the execution units are used during a cycle, it is termed horizontal waste.
  - In all, only three of the eight available cycles are used for computation. This implies that the code fragment will yield no more than three-eighths of the peak rated FLOP count of the processor.
24 Dependency: Branch or Procedural Dependencies
- Often, due to limited parallelism, resource dependencies, or the limited ability of a processor to extract parallelism, the resources of superscalar processors are heavily under-utilized.
- Current microprocessors typically support up to four-issue superscalar execution.
25 Very Long Instruction Word (VLIW) Processors
- The parallelism extracted by superscalar processors is often limited by the instruction look-ahead.
- The hardware logic for dynamic dependency analysis is typically in the range of 5-10% of the total logic on conventional microprocessors.
- The complexity grows roughly quadratically with the number of issues and can become a bottleneck.
26 Very Long Instruction Word (VLIW) Processors
- An alternate concept for exploiting instruction-level parallelism, used in very long instruction word (VLIW) processors, relies on the compiler to resolve dependencies and resource availability at compile time.
27 Very Long Instruction Word (VLIW) Processors
- Instructions that can be executed concurrently are packed into groups and parceled off to the processor as a single long instruction word, to be executed on multiple functional units at the same time.
28 Very Long Instruction Word (VLIW) Processors
- VLIW advantages:
  - Since scheduling is done in software, the decoding and instruction issue mechanisms are simpler in VLIW processors.
  - The compiler has a larger context from which to select instructions and can use a variety of transformations to optimize parallelism, compared to a hardware issue unit.
  - Additional parallel instructions are typically made available to the compiler to control parallel execution.
29 Very Long Instruction Word (VLIW) Processors
- VLIW disadvantages:
  - Compilers do not have the dynamic program state (e.g. the branch history buffer) available to make scheduling decisions. This reduces the accuracy of branch and memory prediction, but allows the use of more sophisticated static prediction schemes.
  - Other runtime situations are extremely difficult to predict accurately. This limits the scope and performance of static compiler-based scheduling.
30 Very Long Instruction Word (VLIW) Processors
- VLIW is very sensitive to the compiler's ability to detect data and resource dependencies and read/write hazards, and to schedule instructions for maximum parallelism. Loop unrolling, branch prediction, and speculative execution all play important roles in the performance of VLIW processors; a loop-unrolling sketch follows below.
- While superscalar and VLIW processors have been successful in exploiting implicit parallelism, they are generally limited to smaller scales of concurrency, in the range of four- to eight-way parallelism.
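A minimal sketch of the loop unrolling mentioned above (assumed code, not from the text): the unrolled body exposes four independent multiply-adds per iteration that a compiler can pack into a wide instruction word. It assumes n is a multiple of 4.

    double dot_unrolled(const double *a, const double *b, int n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < n; i += 4) {
            /* The four multiply-adds below are mutually independent. */
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }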
31 Limitations of Memory System Performance
32 Limitations of Memory System Performance
- The memory system, and not processor speed, is often the bottleneck for many applications.
- Memory system performance is largely captured by two parameters: latency and bandwidth.
33 Limitations of Memory System Performance
- Latency is the time from the issue of a memory request to the time the data is available at the processor.
- Bandwidth is the rate at which data can be pumped to the processor by the memory system.
34 Example 2.2: Effect of memory latency on performance
- Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches), i.e. 100 cycles.
- Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The peak processor rating is therefore 4 GFLOPS (4 FLOPs/cycle x 10^9 cycles/s = 4 x 10^9 FLOPS).
35 Example 2.2: Effect of memory latency on performance
- Since the memory latency is 100 cycles and the block size is one word, every time a memory request is made the processor must wait 100 cycles before it can process the data.
- It is easy to see that the peak speed of this computation is limited to one floating-point operation every 100 ns (100 x 10^-9 s = 10^-7 s), or a speed of 10 MFLOPS (10 x 10^6 = 10^7 FLOPS).
36 Limitations of Memory System Performance
- Improving Effective Memory Latency Using Caches
- Impact of Memory Bandwidth
- Alternate Approaches for Hiding Memory Latency
  - Multithreading for Latency Hiding
  - Prefetching for Latency Hiding
  - Tradeoffs of Multithreading and Prefetching
37 Improving Effective Memory Latency Using Caches
- One innovation addresses the speed mismatch by placing a smaller and faster memory between the processor and the DRAM.
- The fraction of data references satisfied by the cache is called the cache hit ratio.
- The notion of repeated references to a data item in a small time window is called temporal locality.
38 Improving Effective Memory Latency Using Caches
- The effective computation rate of many applications is bounded not by the processing rate of the CPU, but by the rate at which data can be pumped into the CPU. Such computations are referred to as being memory bound.
39 Example 2.3: Impact of caches on memory system performance
- As in Example 2.2, consider a 1 GHz processor with a 100 ns latency DRAM, and introduce a cache of size 32 KB with a latency of 1 ns (one cycle). We use this setup to multiply two matrices A and B of dimension 32 x 32.
- A: 32 x 32 = 2^10 = 1K words; B: 32 x 32 = 2^10 = 1K words; together 1K + 1K = 2K words, i.e. about 2000 words.
- Fetching the two matrices into the cache takes about 2000 x 100 ns = 200 µs.
- Multiplying two n x n matrices takes 2n^3 operations = 2 x 32^3 = 64K operations.
40 Example 2.3: Impact of caches on memory system performance
- Because the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns,
- the 64K operations take 64K / 4 = 16K cycles (or 16 µs) at four instructions per cycle.
- Total time = 200 µs + 16 µs = 216 µs.
- Peak computation rate = 64K operations / 216 µs, approximately 303 x 10^6 FLOPS, i.e. about 303 MFLOPS.
- Compared with Example 2.2, the improvement ratio is 303 / 10, i.e. about a thirty-fold improvement. A sketch of the multiply loop follows below.
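A minimal sketch of the computation behind Example 2.3 (the layout and naming are assumptions): a straightforward 32 x 32 matrix multiply. Each 1K-word matrix fits comfortably in the 32 KB cache, so after the initial 200 µs fetch every operand access is a cache hit.

    #define N 32

    void matmul(double A[N][N], double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];   /* 2*N^3 = 64K operations in total */
                C[i][j] = sum;
            }
    }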
41 Impact of Memory Bandwidth
- Memory bandwidth is the rate at which data can be moved between the processor and memory.
- It is determined by the memory bus as well as the memory units.
- A single memory request returns a contiguous block of four words. The single unit of four words in this case is also referred to as a cache line.
42 Impact of Memory Bandwidth
- In the following example, the data layout is assumed to be such that consecutive data words in memory are used by successive instructions. In other words, if we take a computation-centric view, there is spatial locality of memory access.
43 Example 2.4: Effect of block size - dot-product of two vectors
- A peak speed of 10 MFLOPS, as illustrated in Example 2.2.
- Suppose the block size is increased to four words, i.e., the processor can fetch a four-word cache line every 100 cycles.
- For each pair of words, the dot-product performs one multiply-add, i.e., 2 FLOPs,
- so the 8 FLOPs for four word pairs can be performed in the 2 x 100 = 200 cycles it takes to fetch the two cache lines.
- This corresponds to one FLOP every 200/8 = 25 ns, for a peak speed of 1/(25 ns) = 10^9/25 = 40 MFLOPS. A sketch of the loop follows below.
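The dot-product of Example 2.4, written out as a simple C loop (an illustrative sketch): each pair of words fetched from memory supports exactly one multiply and one add (2 FLOPs), so the achievable rate is bounded by how quickly word pairs can be delivered.

    double dot_product(const double *a, const double *b, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];   /* one multiply-add per pair of words fetched */
        return sum;
    }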
44 Impact of Memory Bandwidth
- If we take a data-layout-centric point of view, the computation is ordered so that successive computations require contiguous data.
- If the computation (or access pattern) does not have spatial locality, then the effective bandwidth can be much smaller than the peak bandwidth.
45 Row-Major vs. Column-Major Traversal
- Row-major traversal:

    for (i = 0; i < 100; i++)
      for (j = 0; j < 100; j++)
        a[i][j] = b[i][j] + c[i][j];

- Column-major traversal:

    for (j = 0; j < 100; j++)
      for (i = 0; i < 100; i++)
        a[i][j] = b[i][j] + c[i][j];
46 Impact of strided access: Example 2.5
- Consider the following code fragment:

    for (i = 0; i < 1000; i++) {
      column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
        column_sum[i] += A[j][i];
    }

- The code fragment sums the columns of the matrix A into a vector column_sum.
- Assumption: the matrix is stored in a row-major fashion in memory.
47 Impact of strided access
- Figure: Example 2.5 (column sum) and Example 2.6 (column sum II).
48 Eliminating strided access: Example 2.6
- We can fix the above code as follows:

    for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
      for (i = 0; i < 1000; i++)
        column_sum[i] += A[j][i];

- In this case, the matrix is traversed in row order and performance can be expected to be significantly better, as the timing sketch below suggests.
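A rough timing harness (an assumed setup, not part of the text) for comparing the strided column sum of Example 2.5 with the reordered version of Example 2.6 on a row-major 1000 x 1000 matrix:

    #include <stdio.h>
    #include <time.h>

    #define N 1000
    static double A[N][N], column_sum[N];

    int main(void) {
        clock_t t0 = clock();
        for (int i = 0; i < N; i++) {            /* Example 2.5: strided access */
            column_sum[i] = 0.0;
            for (int j = 0; j < N; j++)
                column_sum[i] += A[j][i];
        }
        clock_t t1 = clock();
        for (int i = 0; i < N; i++)              /* Example 2.6: contiguous access */
            column_sum[i] = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                column_sum[i] += A[j][i];
        clock_t t2 = clock();
        printf("strided:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("contiguous: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }

On most cache-based machines the contiguous version can be expected to run noticeably faster, for the reasons summarized on the next slide.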
49 Memory System Performance: Summary
- The series of examples presented in this section illustrates the following concepts:
  - Exploiting spatial and temporal locality in applications is critical for amortizing memory latency and increasing effective memory bandwidth.
  - The ratio of the number of operations to the number of memory accesses is a good indicator of anticipated tolerance to memory bandwidth.
  - Memory layouts and appropriately organizing computation can make a significant impact on spatial and temporal locality.
50 Alternate Approaches for Hiding Memory Latency
- Imagine sitting at your computer browsing the web during peak network traffic hours. The lack of response from your browser can be alleviated in several ways:
  - Multithreading for latency hiding: like opening multiple browsers and accessing different pages in each browser; while we are waiting for one page to load, we can be reading others.
  - Prefetching for latency hiding: like anticipating which pages we are going to browse ahead of time and issuing requests for them in advance.
  - Spatial locality in accessing memory words: like accessing a whole bunch of pages in one go.
51 Multithreading for Latency Hiding
- A thread is a single stream of control in the flow of a program.
- We illustrate threads with a simple example (Example 2.7):

    for (i = 0; i < n; i++)
      c[i] = dot_product(get_row(a, i), b);

- Each dot-product is independent of the others and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:

    for (i = 0; i < n; i++)
      c[i] = create_thread(dot_product, get_row(a, i), b);
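The create_thread call above is pseudocode from the text. One way it might be realized with POSIX threads is sketched below; the one-thread-per-row strategy, the fixed problem size, and the helper names are assumptions made only for illustration (in practice a thread pool would be used rather than 1000 threads):

    #include <pthread.h>
    #include <stdlib.h>

    #define N 1000                     /* hypothetical problem size */

    static double a[N][N], b[N], c[N];

    typedef struct { int row; } task_t;

    /* Each thread computes one dot-product: c[row] = a[row] . b */
    static void *dot_product_task(void *arg) {
        int row = ((task_t *)arg)->row;
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += a[row][j] * b[j];
        c[row] = sum;
        free(arg);
        return NULL;
    }

    int main(void) {
        pthread_t tid[N];
        for (int i = 0; i < N; i++) {
            task_t *t = malloc(sizeof *t);
            t->row = i;
            pthread_create(&tid[i], NULL, dot_product_task, t);
        }
        for (int i = 0; i < N; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }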
52 Multithreading for Latency Hiding: Example 2.7
- In the code, the first instance of this function accesses a pair of vector elements and waits for them.
- In the meantime, the second instance of this function can access two other vector elements in the next cycle, and so on.
- After l units of time, where l is the latency of the memory system, the first function instance gets the requested data from memory and can perform the required computation.
- In the next cycle, the data items for the next function instance arrive, and so on.
- In this way, in every clock cycle, we can perform a computation.
53 Multithreading for Latency Hiding
- The execution schedule in the previous example is predicated upon two assumptions:
  - the memory system is capable of servicing multiple outstanding requests, and
  - the processor is capable of switching threads at every cycle.
54 Multithreading for Latency Hiding
- It also requires the program to have an explicit specification of concurrency in the form of threads.
- Machines such as the HEP and Tera rely on multithreaded processors that can switch the context of execution in every cycle.
- Consequently, they are able to hide latency effectively.
55 Prefetching for Latency Hiding
- Misses on loads cause programs to stall.
- Why not advance the loads so that by the time the data is actually needed, it is already there?
- The only drawback is that you might need more space to store the advanced loads.
- However, if the advanced loads are overwritten, we are no worse off than before.
56 Example 2.8: Hiding latency with prefetching
- Consider the problem of adding two vectors a and b using a single loop.
- In the first iteration of the loop:
  - The processor requests a[0] and b[0].
  - Since these are not in the cache, the processor must pay the memory latency.
  - While these requests are being serviced, the processor also requests a[1] and b[1].
- Assuming that each request is generated in one cycle (1 ns) and memory requests are satisfied in 100 ns:
  - After 100 such requests, the first set of data items is returned by the memory system.
  - Subsequently, one pair of vector components will be returned every cycle. A source-level sketch follows below.
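A sketch of how such prefetching might be expressed in source code, assuming a GCC/Clang-style __builtin_prefetch and a hypothetical prefetch distance of 16 elements (both are assumptions, not part of the example):

    #define DIST 16   /* hypothetical prefetch distance, in elements */

    void vector_add(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i++) {
            if (i + DIST < n) {
                __builtin_prefetch(&a[i + DIST], 0, 0);  /* request future operands early */
                __builtin_prefetch(&b[i + DIST], 0, 0);
            }
            c[i] = a[i] + b[i];   /* by now a[i] and b[i] should already be in cache */
        }
    }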
57 Example 2.9: Impact of bandwidth on multithreaded programs
- Consider a computation running on a machine with a 1 GHz clock, a 4-word cache line, single-cycle access to the cache, and 100 ns latency to DRAM. The computation has a cache hit ratio of 25% at 1 KB and 90% at 32 KB.
- Case 1: a single-threaded execution in which the entire cache (32 KB) is available to the serial context.
- Case 2: a multithreaded execution with 32 threads, where each thread has a cache residency of 1 KB.
- Assume the computation makes one data request in every cycle of 1 ns.
58 Example 2.9
- First case: a single thread with the full 32 KB cache, i.e. a 90% hit ratio.
  - DRAM latency = 100 ns; the computation makes one data request every cycle (1 ns).
  - 10% of the requests miss the cache and must be served from DRAM, i.e. one word every 10 ns = 10^8 words/s.
  - With 4-byte words, the required DRAM bandwidth is about 10^8 x 4 bytes = 400 MB/s.
59 Example 2.9
- Second case: 32 threads, each with a cache residency of 1 KB, i.e. a 25% hit ratio per thread.
  - Each thread still makes one data request per cycle, but now 75% of the requests must be served from DRAM.
  - Across the threads this amounts to 0.75 words/ns = 7.5 x 10^8 words/s.
  - With 4-byte words, the required DRAM bandwidth is about 3 GB/s.
60 Tradeoffs of Multithreading and Prefetching
- Bandwidth requirements of a multithreaded system may increase very significantly because of the smaller cache residency of each thread.
- Multithreaded systems become bandwidth bound instead of latency bound.
- Multithreading and prefetching only address the latency problem and may often exacerbate the bandwidth problem.
- Multithreading and prefetching also require significantly more hardware resources in the form of storage.
61 Dichotomy of Parallel Computing Platforms
62 Dichotomy of Parallel Computing Platforms
- Logical organization:
  - Control Structure of Parallel Platforms (the former)
  - Communication Model of Parallel Platforms (Chap. 10) (the latter)
    - Shared-Address-Space Platforms (Chap. 7)
    - Message-Passing Platforms (Chap. 6)
- Physical organization:
  - Architecture of an Ideal Parallel Computer
  - Interconnection Networks for Parallel Computers
  - Network Topologies
  - Evaluating Static Interconnection Networks
  - Evaluating Dynamic Interconnection Networks
  - Cache Coherence in Multiprocessor Systems
63 Control Structure of Parallel Programs
- Parallelism can be expressed at various levels of granularity, from the instruction level to processes.
- Between these extremes exists a range of models, along with corresponding architectural support.
64 Example 2.10: Parallelism from a single instruction on multiple processors
- Consider the following code segment that adds two vectors:

    for (i = 0; i < 1000; i++)
      c[i] = a[i] + b[i];

- The statements c[0] = a[0] + b[0], c[1] = a[1] + b[1], etc. can be executed independently of each other.
- If there is a mechanism for executing the same instruction on all the processors with appropriate data, we could execute this loop much faster; an OpenMP sketch follows below.
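Today this kind of data parallelism is often expressed with a compiler directive rather than dedicated SIMD hardware. A hedged sketch using OpenMP (an assumption, not part of the example):

    void vector_add(const double *a, const double *b, double *c, int n) {
        /* Every iteration is independent, so the loop can be divided
           among the available processing units. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }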
65 SIMD and MIMD
66 SIMD
- SIMD (single instruction stream, multiple data stream) architecture:
  - A single control unit dispatches instructions to each processing unit.
  - In an SIMD parallel computer, the same instruction is executed synchronously by all processing units.
- These architectural enhancements rely on the highly structured (regular) nature of the underlying computations, for example in image processing and graphics, to deliver improved performance.
67 MIMD
- MIMD (multiple instruction stream, multiple data stream) architecture:
  - Computers in which each processing element can execute a different program independently of the other processing elements.
- A simple variant of this model, called single program multiple data (SPMD), relies on multiple instances of the same program executing on different data.
- The SPMD model is widely used by many parallel platforms and requires minimal architectural support. Examples of such platforms include the Sun Ultra Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
68 SIMD vs. MIMD
- SIMD computers require less hardware than MIMD computers because they have only one global control unit.
- Furthermore, SIMD computers require less memory because only one copy of the program needs to be stored.
- Platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.
69 SIMD Disadvantages
- Since the underlying serial processors change so rapidly, SIMD computers suffer from fast obsolescence.
- The irregular nature of many applications makes SIMD architectures less suitable.
- Example 2.11 illustrates a case in which SIMD architectures yield poor resource utilization under conditional execution.
70 Example 2.11: Conditional Execution in SIMD Processors
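The figure for this example is not reproduced here. The kind of computation it illustrates is sketched below (an assumed reconstruction): on an SIMD machine, all processing elements first execute the "then" branch while those with b[i] != 0 sit idle, and then the roles reverse for the "else" branch, so utilization can drop to roughly half.

    for (int i = 0; i < n; i++) {
        if (b[i] == 0)
            c[i] = a[i];          /* executed while the other PEs are idle */
        else
            c[i] = a[i] / b[i];   /* executed while the first group is idle */
    }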
71 Communication Model of Parallel Platforms
- There are two primary forms of data exchange between parallel tasks:
  - Shared-Address-Space Platforms (Chap. 7)
  - Message-Passing Platforms (Chap. 6)
72 Shared-Address-Space Platforms
- Part (or all) of the memory is accessible to all processors.
- Processors interact by modifying data objects stored in this shared address space.
- If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
73 Shared-Address-Space Platforms
74 Shared-Address-Space Platforms
- The shared-address-space view of a parallel platform supports a common data space that is accessible to all processors.
- Processors interact by modifying data objects stored in this shared address space.
- Memory in shared-address-space platforms can be local or global.
- Shared-address-space platforms supporting SPMD programming are also referred to as multiprocessors.
75 Shared Address Space vs. Shared Memory Machines
- It is important to note the difference between the terms shared address space and shared memory.
- We refer to the former as a programming abstraction and to the latter as a physical machine attribute.
- It is possible to provide a shared address space using a physically distributed memory.
76 Message-Passing Platforms
- These platforms comprise a set of processors, each with its own (exclusive) memory.
- Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers.
- These platforms are programmed using (variants of) send and receive primitives.
- Libraries such as MPI and PVM provide such primitives; a minimal MPI sketch follows below.
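A minimal sketch of the send/receive style of programming using MPI (the payload and the process roles are arbitrary choices for illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;   /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }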
77 Message Passing vs. Shared Address Space Platforms
- Message passing requires little hardware support, other than a network.
- Shared-address-space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).