Title: Thread Level Parallelism
1. Thread Level Parallelism
- Since ILP has inherent limitations, can we exploit multithreading?
  - a thread is defined here as a separate process with its own instructions and data
  - this is unlike the traditional (OS) definition of a thread, which shares instructions with other threads but has its own stack and data (a thread in that sense is multiple versions of the same process)
  - a thread may be a traditional thread, a separate process, or a single program executing in parallel
- The idea is that each thread offers different instructions and data, so that when the processor has to stall, it can switch to another thread and continue execution rather than incurring time-consuming stalls
- TLP exploits a different kind of parallelism than ILP
2. Approaches to TLP
- We want to enhance our current processor
  - a superscalar with dynamic scheduling
- Fine-grained multithreading
  - switches between threads at each clock cycle, so threads are executed in an interleaved fashion
  - as the processor switches from one thread to the next, a thread that is currently stalled is skipped over
  - the CPU must be able to switch between threads every clock cycle, so it needs extra hardware support
- Coarse-grained multithreading
  - switches between threads only when the current thread is likely to stall for some time (e.g., a level 2 cache miss)
  - the switching process can be more time consuming since we are not switching nearly as often, and therefore does not need extra hardware support
3. Advantages/Disadvantages
- Fine-grained
  - Adv: less susceptible to stalling situations
  - Adv: throughput costs can be hidden because stalls often go unnoticed
  - Disadv: slows down the execution of each individual thread
  - Disadv: requires a switching process that does not cost any cycles; this can be done at the expense of more hardware (we will require, at a minimum, a PC for every thread)
- Coarse-grained
  - Adv: more natural flow for any given thread
  - Adv: easier to implement the switching process
  - Adv: can take advantage of current processors to implement coarse-grained, but not fine-grained, MT
  - Disadv: limited in its ability to overcome throughput losses from short stalls because the cost of starting the pipeline on a new thread is expensive (in comparison to fine-grained)
4. Simultaneous Multithreading (SMT)
- SMT uses multiple issue and dynamic scheduling on our superscalar architecture but adds multithreading
  - (a) is the traditional approach, with idle slots caused by stalls and a lack of ILP
  - (b) and (c) are fine-grained and coarse-grained MT respectively
  - (d) shows the potential payoff for SMT
  - (e) goes one step further to illustrate multiprocessing
5. Four Approaches
- Superscalar on a single thread (a)
  - we are limited to ILP; or, if we switch threads when one is going to stall, the switch is equivalent to a context switch, which takes many (dozens or hundreds of) cycles
- Superscalar coarse-grained MT (c)
  - fairly easy to implement and a performance increase over no MT support, but still contains empty instruction slots due to short stalling situations (as opposed to the lengthier stalls associated with a cache miss)
- Superscalar fine-grained MT (b)
  - requires switching between threads at each cycle, which requires more complex and expensive hardware, but eliminates most stalls; the only problem is that a thread that lacks ILP or cannot use all of the instruction issue slots will not take full advantage of the hardware
- Superscalar SMT (d)
  - the most efficient way to use the hardware and multithreading, so that as many functional units as possible are kept occupied
6. Superscalar Limitations for SMT
- In spite of the performance increase from combining our superscalar hardware and SMT, there are still inherent limitations
  - how many active threads can be considered at one time?
  - we will be limited by resources such as the number of PCs available to keep track of each thread, the size of the bus needed to accommodate multiple threads fetching instructions at the same time, how many threads can be stored in main memory, etc.
  - finite limits on the buffers used to support the superscalar: reorder buffer, instruction queue, issue buffer
  - limitations on bandwidth between the CPU and cache/memory
  - limitations on the combination of instructions that can be issued at the same time
  - consider four threads, each of which contains an abnormally large number of FP multiplies but no FP adds: the multiplier functional unit(s) will be very busy while the adder remains idle
7. SMT Design Challenges
- Superscalars perform best on lengthier pipelines
- We will implement SMT using fine-grained MT only, so we need
  - a large register file to accommodate multiple threads
  - a per-thread renaming table and more registers for renaming
  - separate PCs for each thread
  - the ability to commit instructions of multiple threads in the same cycle
  - added logic that does not require an increase in clock cycle time
  - cache and TLB setups that can handle simultaneous thread access without a degradation in their performance (miss rate, hit time)
- In spite of the design challenges, we will find
  - performance on each individual thread will decrease (this is natural since every thread will be interrupted as the CPU switches to other threads, cycle by cycle)
- One alternative strategy is to have a preferred thread whose instructions are issued every cycle as far as possible
  - the functional unit slots not used are filled by alternate threads
  - if the preferred thread reaches a substantial stall, other threads fill in until the stall ends
8. SMT Example Design
- The IBM Power5 was built on top of the Power4 pipeline
  - but in this case, the Power5 implements SMT
  - simple design choices were made wherever possible
  - increase the associativity of the L1 instruction cache and TLB to offset the impact of multithreaded access to the cache and TLB
  - add per-thread load/store queues
  - increase the size of the L2 and L3 caches to permit more threads to be represented in these caches
  - add separate instruction prefetch and buffering hardware
  - increase the number of virtual registers for renaming
  - increase the size of the instruction issue queues
- The cost of these enhancements is not extreme (although they do take up more space on the chip); are the performance payoffs worthwhile?
9. Performance Improvement of SMT
- As it turns out, the improvement of SMT over a single-threaded processor is only modest
  - in part this is because multi-issue processors have not increased their issue size over the past few years; to best take advantage of SMT, issue size should increase from maybe 4 to 8 or more, but this is not practical
- The Pentium IV Extreme had improvements of
  - 1.01 and 1.07 on the SPEC int and SPEC FP benchmarks respectively over the Pentium IV (the Extreme is a Pentium IV with SMT support)
  - when running 2 SPEC benchmarks at the same time in SMT mode, improvements ranged from 0.9 to 1.58 with an average improvement of 1.20
- Conclusions
  - SMT has benefits, but the costs do not necessarily pay for the improvement
  - another option: use multiple CPU cores on a single processor (see (e) from the figure on slide 4)
  - another factor discussed in the text (but skipped here) is the increasing demand on power consumption as we continue to add support for ILP/TLP/SMT
10. Advanced Multi-Issue Processors
- Here, we wrap up chapter 3 with a brief comparison of multi-issue superscalar processors
11. Comparison on Integer Benchmarks
12. Comparison on FP Benchmarks
13. Introduction to Multiprocessors
- Taking parallelism to the next level, we have entirely independent processors
  - unlike the superscalar, which permitted parallel processing of one or more threads but had the overhead of keeping track of which thread it was executing and how to switch between threads
- The multiprocessor can execute
  - multiple programs simultaneously
  - one program distributed across processors using shared memory or interprocessor communication
- Flynn defined the classification of architecture types in the 1960s, and it has largely remained the same
  - SISD: single instruction on single data, the traditional processor, whether pipelined or not
  - SIMD: single instruction on multiple data; this achieves data-level parallelism, useful for vector/array-based operations
  - MISD: multiple instructions on a single datum, never developed
  - MIMD: multiple instructions on multiple data, the true multiprocessor
14. SIMD Architectures
- Bit-slice processors
  - processors execute bit-level operations on bit-slices from memory
  - a group of processors performs one word's worth of operations by distributing the operation bit by bit
- Processor arrays
  - operate on 1-D or 2-D data so that each processing element executes the current instruction on one datum
  - two flavors: vector machines and matrix machines
- Hypercube processors
  - similar to a processor array except that processors are connected to their nearest-neighbor processors for communication purposes
  - in a machine with 2^n processors, each processor connects to n neighbors (see the sketch after this slide)
- A vector machine has one control unit sending a single instruction to multiple processing elements, each executing the instruction on one datum from the array
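
As a concrete illustration of the hypercube connectivity described above (my own sketch, not from the text): if nodes are numbered 0 to 2^n - 1, two nodes are neighbors exactly when their IDs differ in one bit, so a node's n neighbors can be listed by flipping each bit in turn.

#include <stdio.h>

/* Print the n neighbors of a node in an n-dimensional hypercube.
   Nodes are numbered 0 .. 2^n - 1; two nodes are neighbors when
   their IDs differ in exactly one bit position. */
void print_neighbors(unsigned node, unsigned n)
{
    for (unsigned bit = 0; bit < n; bit++)
        printf("node %u <-> node %u\n", node, node ^ (1u << bit));
}

int main(void)
{
    print_neighbors(5, 4);   /* node 5 in a 16-node (2^4) hypercube */
    return 0;
}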
15. Bit-Slice Example
- We have an array of boolean values
  - we want to know if any of the bits is equal to 1 (do a parallel OR)
  - assume that processors can read the single datum in parallel but only one processor can write a result
  - on a tie, the processor that writes the result is the processor with the lowest ID
  - show a parallel algorithm and determine its complexity assuming that you have n processors for an n-bit value
- The following solution takes O(1) time (instead of O(n) time on a normal processor)
  - only processors whose bit is 1 will write to result, so if there are none, result stays 0; otherwise the processor with the smallest ID writes 1 to result (a sequential C simulation follows the pseudocode)

  Processor 0 writes 0 to result
  For each processor j (in parallel):
      read datum x
      extract bit j, storing it in location temp
      if temp == 1 then write 1 to result
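
The algorithm above assumes a machine where all processors act in one step. The following C sketch (purely illustrative, not code from the slides) simulates those steps sequentially; the loop stands in for the n processors acting at once, which is why the parallel running time is O(1).

#include <stdio.h>

/* Sequential simulation of the parallel-OR algorithm: conceptually all n
   "processors" run their step at the same time. */
int parallel_or(unsigned x, int n)
{
    int result = 0;                  /* processor 0 writes 0 to result */
    for (int j = 0; j < n; j++) {    /* each processor j, in parallel  */
        int temp = (x >> j) & 1;     /* extract bit j of the datum     */
        if (temp == 1)
            result = 1;              /* concurrent write: every writer writes 1 */
    }
    return result;
}

int main(void)
{
    printf("%d\n", parallel_or(0x0008, 16));  /* 1: bit 3 is set  */
    printf("%d\n", parallel_or(0x0000, 16));  /* 0: no bit is set */
    return 0;
}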
16. Vector-Array Example
- An array stores n int values
  - using a vector processor with n/2 processing elements, show how we can find the largest value in O(log n) time

  Let incr = 1
  Each processor j operates on the following (until j is no longer needed):
      while (incr < n)
          if (a[j + incr] > a[j]) then a[j] = a[j + incr]
          Processor 0: incr = incr * 2
  The answer is in a[0]

- This is known as a tournament algorithm (a sequential C simulation follows)
  - with 16 elements, we find the largest in 4 iterations
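
A sequential C simulation of the tournament (an illustrative sketch, not code from the slides): each pass of the outer loop is one parallel step, and the inner loop stands in for the processors that are still needed in that step, so only log2(n) parallel steps occur.

#include <stdio.h>

/* Sequential simulation of the tournament-maximum reduction.
   In each round, "processor" j combines a[j] with a[j + incr]. */
int tournament_max(int a[], int n)       /* n assumed to be a power of two */
{
    for (int incr = 1; incr < n; incr *= 2)
        for (int j = 0; j + incr < n; j += 2 * incr)   /* these run in parallel */
            if (a[j + incr] > a[j])
                a[j] = a[j + incr];
    return a[0];                         /* the maximum ends up in a[0] */
}

int main(void)
{
    int a[16] = {3, 41, 7, 12, 9, 28, 5, 16, 2, 33, 19, 8, 40, 11, 6, 25};
    printf("max = %d\n", tournament_max(a, 16));   /* prints 41, after 4 rounds */
    return 0;
}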
17. Development of Multiprocessors
- Multiprocessor systems have been developed since the 1960s
  - but it wasn't until the 1980s, when processors and memories became more affordable, that systems could largely be built using off-the-shelf components attached to a single bus
- Two flavors of multiprocessor systems
  - tightly coupled, or shared memory
  - loosely coupled, or distributed memory
  - the latter category is similar to a network of computers
18. Multiprocessor Problems
- There are two inherent problems with multiprocessors
  - accessing remote memory is very time consuming
  - processes are often sequential in nature, making them hard to parallelize
- Examples (both calculations are written out in a short program below)
  - we have 100 processors with which we want to achieve an 80 times speedup over a uniprocessor on a given application; how much of the time can the application be executing sequentially?
  - by Amdahl's Law, 80 = 1 / ((1 - x) + x / 100); solving for x gives approximately .9975, so our application must be running in parallel 99.75% of the time, or it can be sequential only 1 - 99.75% = 0.25% of the time!
  - a 32-processor system has a 200 ns remote access time, a 2 GHz clock and a computation CPI of 0.5; assuming all local memory accesses are hits, how much faster is the machine if all memory accesses are local as opposed to 0.2% of them being remote?
  - remote memory access time = 200 ns / 0.5 ns (clock cycle time) = 400 clock cycles
  - CPI with remote accesses = 0.5 + 0.2% * 400 = 1.3
  - CPI without remote accesses = 0.5
  - the application with no remote accesses is 1.3 / 0.5 = 2.6 times faster!
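
For reference, the two calculations above can be written out as a small C program (an illustrative sketch; the only inputs are the numbers quoted on this slide).

#include <stdio.h>

int main(void)
{
    /* Amdahl's Law: speedup = 1 / ((1 - f) + f / p) with p = 100 and a
       target speedup of 80. Solving for the parallel fraction f: */
    double f = (1.0 - 1.0 / 80.0) / (1.0 - 1.0 / 100.0);
    printf("parallel fraction f = %.4f\n", f);           /* ~0.9975 */

    /* Remote-access example: 200 ns remote access, 2 GHz clock (0.5 ns/cycle),
       base CPI 0.5, 0.2% of references go to remote memory. */
    double remote_cycles = 200.0 / 0.5;                  /* 400 cycles       */
    double cpi_remote    = 0.5 + 0.002 * remote_cycles;  /* 0.5 + 0.8 = 1.3  */
    double cpi_local     = 0.5;
    printf("speedup if all accesses are local = %.1f\n", cpi_remote / cpi_local);
    return 0;
}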
19. Symmetric Shared Memory Architecture
- The typical form of a tightly coupled architecture is one where we expect memory access to be symmetric
  - that is, where we expect all memory accesses to take about the same amount of time
- SSM architectures limit the number of processors because the shared memory becomes a bottleneck
  - this can be alleviated with memories distributed across chips, high-order interleaving, and multiple buses (but that gets expensive)
- One way to reduce the impact of the bottleneck is large, local multi-level caches
  - although caching seems an obvious way to reduce memory contention and the bottleneck, in addition to improving processor CPI, it comes with a cost in a shared memory system: cache coherence
20. Cache Coherence
- A memory system is considered coherent if any read obtains the most recently written version of the datum
- And two writes to the same datum are serialized so that all processors see the writes in the same order
- Example (table below)
  - cache coherence is required so that this problem does not arise; how can we prevent it?
  - two solutions: directory-based and snooping caches

  Action                 Cache for Proc 1   Cache for Proc 2   Main memory value of X
  (initially)            ---                ---                0
  Proc 1 reads X         0                  ---                0
  Proc 2 reads X         0                  0                  0
  Proc 1 writes 1 to X   1                  0                  1
  Proc 2 reads X         1                  0                  1

- Processor 2 now has an obsolete value for X, but since it is in its local cache, the access is a hit and so it continues to read the wrong value!
21. Snooping Protocols
- All caches monitor some centralized communication mechanism
  - typically a single bus that connects each cache to memory
- Upon any write, the processor signals that the particular datum (denoted by its memory address) has been modified
  - all other caches are snooping this line (listening for such a message)
  - upon receiving a write message, each cache checks to see if the updated value is stored locally (that value will now be invalid)
- Two common versions of snooping are
  - write invalidate, where each cache marks the updated address as invalid so that a future read is treated as a cache miss
  - write update, where the updated datum itself is distributed so that all caches that have that datum can update their values immediately
  - note: in either case, it is easier to implement this if the cache write policy is write through rather than write back
22. Write Invalidate
- Write update sounds like the better approach
  - no data will ever become obsolete
- But in fact it is more expensive to implement in terms of bandwidth usage, and so is less common
- Write invalidate only costs in terms of increased miss rates
  - to ensure that the write can proceed, the cache must invalidate all other caches' copies of the datum prior to the write
  - the processor must first gain control of the bus
  - in case of a tie (two processors wanting to invalidate a datum simultaneously), the bus arbiter will cause both processors to wait and will randomly select between them
- Example (table below)
  Action                 Cache for Proc 1   Cache for Proc 2    Main memory value of X
  (initially)            ---                ---                 0
  Proc 1 reads X*        0                  ---                 0
  Proc 2 reads X*        0                  0                   0
  Proc 1 writes 1 to X   1                  --- (invalidated)   1
  Proc 2 reads X*        1                  1                   1

  * denotes a cache miss
23. MESI Protocol
- The common implementation of a snoopy cache is the MESI protocol
- M: modified
  - the data (block) has been modified, so other cached versions are invalidated; continued local reads or writes to this block cause no further changes
- E: exclusive
  - the datum is stored exclusively in this cache, so it can be written to without invalidation elsewhere
- S: shared
  - the datum cannot be written to directly since it is shared; a write requires an invalidation first
- I: invalid
  - same as a miss, the datum has been flagged as invalid
- The table on the next slide (figure 4.5, page 213) demonstrates how the MESI protocol works
- A variation of the MESI protocol is the MOESI protocol, where O stands for "ownership": a given cache block can be shared, but the cache denoted as the owner is responsible for updating other processors when the block has been modified (a simplified sketch of the basic MESI transitions follows)
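
A minimal sketch of the MESI transitions in C (my own simplification for illustration; a real controller also moves the data and arbitrates for the bus): one function covers the local processor's reads and writes, the other covers requests snooped from other caches.

#include <stdio.h>
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* State of this cache's copy after the local processor reads or writes it.
   (Writes from S or I must broadcast an invalidate on the bus first.) */
mesi_t on_local_access(mesi_t s, bool is_write, bool other_sharers)
{
    if (is_write)
        return MODIFIED;
    if (s == INVALID)                       /* read miss */
        return other_sharers ? SHARED : EXCLUSIVE;
    return s;                               /* read hit: state unchanged */
}

/* State of this cache's copy after snooping another processor's request.
   (A MODIFIED block must be written back before being shared or invalidated.) */
mesi_t on_snooped_request(mesi_t s, bool remote_is_write)
{
    if (remote_is_write)
        return INVALID;
    if (s == MODIFIED || s == EXCLUSIVE)
        return SHARED;
    return s;
}

int main(void)
{
    mesi_t s = on_local_access(INVALID, false, false);  /* read miss, alone: E */
    s = on_local_access(s, true, false);                /* local write:      M */
    s = on_snooped_request(s, false);                   /* remote read:      S */
    printf("final state = %d (SHARED = %d)\n", s, SHARED);
    return 0;
}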
24. (figure 4.5: the MESI protocol state table; no transcript)
25. Implementing the MESI Protocol
26. Performance of SSM Multiprocessors
- Overall cache performance is a combination of
  - the uniprocessor cache miss rate
  - the traffic caused by communication (which includes cache invalidations)
- These are complicated by
  - true sharing misses, caused by invalidations
  - false sharing misses, caused when a block is invalidated but the requested word in that block has not been modified
- Changing the cache size, block size, or number of processors can impact performance in complex ways
- Here, we consider an AlphaServer
  - 4 Alpha 21164 processors with a 300 MHz clock
  - 3 levels of cache (the 1st level is a write-through cache to permit easier snooping; the 2nd and 3rd level caches are write-back); latencies are 7 cycles, 21 cycles and 80 cycles for misses at each level respectively
  - a cache-to-cache transfer of shared data takes 125 cycles
  - the percentage of time the CPU is idle because of cache misses is < 1% when run on 3 server benchmark programs (online transaction processing, decision support system, web index search)
27. Benchmark Performance
28. Distributed Shared Memory
- A 16-processor multiprocessor with 64-byte blocks and 512 KB data caches can require as much as 170 GB/sec of bus bandwidth!
  - this is obviously a huge problem when modern processors might have a memory/bus bandwidth capable of supporting 12 GB/sec
- In order to get around this problem, we must use a distributed memory layout
  - this lowers the bandwidth demand to an acceptable amount because most of the traffic is local, between a processor and its own memory
- But a distributed memory makes snooping much more difficult
  - all changes to data would have to be broadcast over the interconnection network, and all caches would have to snoop there
  - instead, we turn to a different cache coherence mechanism: the directory-based approach
- A directory keeps track of every block that might be in a cache
  - which blocks are in which caches, whether the block has been modified or not, and other useful information
29. Implementing Directory-Based Caches
- We store every memory block as an entry in the directory
  - this method works for a reasonably large number of processors (e.g., 200 or less), as we would expect a decent amount of overhead for the directory
  - for extremely large systems with enormous memories, we might need better data structures (such as storing fewer bits per block entry)
  - the directory is actually distributed, or interleaved, so that it does not become a bottleneck
  - a directory on a single site would quickly become a bottleneck
  - as an example, a portion of the directory might be placed with every processor's local memory (see figure 4.19)
- The directory has to handle two operations
  - handling a read miss
  - handling a write to shared data
  - and it stores the status of every block: shared, uncached, or modified
- Communication between processors and between instances of the directory is performed by message passing
30. (figure showing the directory protocol messages)
P = requesting processor, A = requested address, D = data contents
"Local" means the local cache, "Remote" means a remote cache, and "Home" means the home directory
31. Implementation Details
- When a block is currently uncached
  - the copy in memory is the current value
  - read miss: the requesting processor is sent the block from memory, and the requestor is made the only sharing node
  - write miss: the requesting processor is sent the datum and made the owner; the block is made exclusive, although it is indicated as shared
- When a block is shared
  - the memory value is up to date, but there is at least one cached copy, maybe more
  - read miss: the requesting processor is sent the block from memory and is added to the sharing list
  - write miss: the requesting processor is sent the block from memory and added to the sharers list, all other sharers are sent invalidation messages, and the state of the block is made exclusive
32. Continued
- When a block is exclusive
  - the copy of the block held by the owner (the cache for which the block was made exclusive) is the up-to-date value
  - read miss: the owner transmits the copy to the requestor, the directory adds the requestor to the sharers list, and the owner also updates memory at the same time, changing the status from exclusive to shared
  - data write back: updates the memory copy, the home directory becomes the owner again, the block is now uncached, and the sharers set is emptied
  - write miss: block ownership is given to the requestor, a message is sent to the old owner to invalidate the block and send the value to the requestor, which updates the value; the sharers set becomes just the new owner and the block remains exclusive
- See figures 4.21 and 4.22 for more details (a simplified sketch of the read-miss handling follows)
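
The read-miss handling described above can be sketched as follows in C (an illustrative sketch only; the directory entry layout and the "message" functions are assumptions, not the text's implementation; they just print what a real directory would send over the interconnection network).

#include <stdio.h>
#include <stdint.h>

typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } dir_state_t;

typedef struct {
    dir_state_t state;
    uint32_t    sharers;   /* bit i set => processor i holds a copy */
    int         owner;     /* meaningful only in EXCLUSIVE_ST       */
} dir_entry_t;

static void send_block_from_memory(int to, int block)
{ printf("send block %d from memory to P%d\n", block, to); }

static void fetch_and_downgrade_owner(int owner, int block)
{ printf("P%d writes block %d back; its copy becomes shared\n", owner, block); }

void handle_read_miss(dir_entry_t *e, int block, int requestor)
{
    if (e->state == EXCLUSIVE_ST) {           /* owner holds the current value */
        fetch_and_downgrade_owner(e->owner, block);
        e->sharers |= 1u << e->owner;
    }
    send_block_from_memory(requestor, block); /* requestor receives the block  */
    e->sharers |= 1u << requestor;            /* and joins the sharing list    */
    e->state = SHARED_ST;
}

int main(void)
{
    dir_entry_t e = { EXCLUSIVE_ST, 1u << 3, 3 };
    handle_read_miss(&e, 42, 0);              /* P0 read-misses on block 42    */
    return 0;
}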
33. Sun T1 Server
- We wrap up this chapter with a brief look at the Sun T1 multiprocessor server
- The T1 uses an 8-core processor (1.2 GHz)
  - that is, there are 8 pipelines
- The T1 is a single-issue pipeline where each pipeline performs fine-grained multithreading on up to 4 threads
  - the limitation of 4 threads per core allows multithreading to be supported directly in hardware by using 4 PCs per pipeline
- The T1 pipeline has 6 stages
  - this is the same as the MIPS 5-stage pipeline with 1 added stage to perform thread switching
  - branches and loads each have a 3-cycle penalty, which can be hidden if the other 3 threads are active (not idle or stalled)
  - as this is a server, FP operations are not emphasized, and therefore all 8 pipelines share the same floating point unit
  - cache coherence is enforced using a directory approach where directories are distributed across the L2 caches, each keeping track of which L1 caches have copies of data that are stored in its L2 cache
34. The T1 Architecture
- The crossbar is an interconnection network
- Each core has its own L1 cache
- Notice that there are only 4 L2 caches (instead of 8)
- The single FP unit is used by all cores as needed
35. T1 Performance
- The 4-thread core hides some of the latency that arises from limited ILP
  - this manifests itself as a lower effective L1 cache miss penalty (latency)
  - figure 4.26 shows that a single thread will see 1.1 to 1.2 times the latency from cache misses compared with the 4-thread core
- Similarly, larger L2 caches with bigger blocks hide latency
  - the increased size has the obvious effect, but the impact is not as much as one would expect; compare for instance the decrease from a 3 MB L2 cache with 32-byte blocks to a 6 MB L2 cache with 32-byte blocks
  - increasing the block size causes additional message traffic in the interconnection network, resulting in larger latencies, so caches with smaller block sizes (fewer words per block) are preferred
  - figure 4.28 shows these impacts
- For a 4-thread core, the ideal per-thread CPI is 4
  - the T1 averages between 5.6 and 6.6
  - but the per-core CPI (ideal is 1) ranges from 1.4 to 1.8
- For all 8 cores, the effective CPI is between .175 and .225
  - compare this to multi-issue processors that might have CPIs of .3 or .4
  - so while the per-thread performance is not impressive, the overall throughput and effective CPI of the entire processor are (the arithmetic relating these figures is sketched below)
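
Assuming the per-thread, per-core, and whole-chip CPI figures relate simply through the 4 threads per core and 8 cores per chip (an approximation; the quoted ranges do not line up exactly at the high end), the arithmetic looks like this in C.

#include <stdio.h>

int main(void)
{
    double per_thread_cpi = 5.6;                     /* 5.6 .. 6.6 on the T1 */
    double per_core_cpi   = per_thread_cpi / 4.0;    /* 4 threads per core   */
    double per_chip_cpi   = per_core_cpi / 8.0;      /* 8 cores per chip     */
    printf("per-core CPI = %.2f\n", per_core_cpi);   /* 1.40                 */
    printf("per-chip CPI = %.3f\n", per_chip_cpi);   /* 0.175                */
    return 0;
}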
36. Comparison
- The T1 as just described is compared to the
  - AMD Opteron (2 cores, 3 instruction issues per cycle, 2.4 GHz, does not support multithreading)
  - Intel Pentium D (2 cores, 3 instruction issues per cycle, 3.2 GHz, supports SMT)
  - IBM Power5 (2 cores, 4 instruction issues per cycle, 1.9 GHz, supports SMT)
  - we don't compare the T1 on FP benchmarks because, being a server, its only FP hardware is shared and therefore FP benchmarks would perform poorly
- Details are given in figure 4.33, where the T1 clearly outperforms the other processors on all non-FP benchmarks
  - except for the SPEC integer benchmarks, where it is only marginally better than the Power5 and Opteron
  - the moral of this story appears to be that proper support for fine-grained threads, coupled with multiple non-FP threads, provides better performance than superscalar pipelines
37. Sample Problem 1
- How many processors does it take a processor array to find the maximum value of an array of n values in O(1) time?
- Solution: n^2, as follows
  - denote each processor as p(i,j), where processor p(i,j) is assigned the value of a[i] (so that each array value is held by n processors)
  - each processor p(i,j) compares the two array values a[i] and a[j]
  - if a[i] < a[j] then write 1 to array location b[i], else write 1 to b[j]
  - there may be multiple concurrent writes, in which case the processor with the lowest ID writes to b[i] (or b[j]), but since every writer writes 1, the effect is simply that a 1 is written to some elements of b
  - using n of the original processors, assign a processor to each element of b
  - if b[i] == 0 then write i to datum k
  - a[k] will be the maximum item
  - can you figure out why?
- This algorithm executes in parallel taking a total of two loads, two comparisons and two writes no matter the size of n, so it is O(1) (a sequential simulation is sketched below)
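
A sequential C simulation of the idea (an illustrative sketch; it uses only the first half of the comparison rule, marking b[i] whenever a[i] loses a comparison, which is sufficient). The two nested loops stand in for the n*n processors p(i,j), all of which would run their single comparison at the same time.

#include <stdio.h>

int constant_time_max(const int a[], int b[], int n)
{
    for (int i = 0; i < n; i++)
        b[i] = 0;                       /* n processors clear b in parallel  */

    for (int i = 0; i < n; i++)         /* processor p(i,j): one comparison  */
        for (int j = 0; j < n; j++)
            if (a[i] < a[j])
                b[i] = 1;               /* a[i] lost at least one comparison */

    for (int i = 0; i < n; i++)         /* only a maximum never loses;       */
        if (b[i] == 0)                  /* lowest index wins on ties         */
            return a[i];
    return a[0];                        /* not reached for n >= 1            */
}

int main(void)
{
    int a[] = {12, 7, 41, 3, 28, 41}, b[6];
    printf("max = %d\n", constant_time_max(a, b, 6));   /* prints 41 */
    return 0;
}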
38. Sample Problem 2
- Assume the memory setup as shown to the right
- Determine the resulting state of the caches and memory after each operation, using write invalidate
- P0 read 120
  - P0 B0 = (S, 120, 00, 20)
  - reads 20
- P0 write 120 ← 80
  - P0 B0 is modified to (M, 120, 00, 80)
  - 120 ← 80 is broadcast over the bus
  - P15 invalidates B0: (I, 120, 00, 20)
39. Continued
- P15 write 120 ← 80
  - P15 B0 = (M, 120, 00, 80)
  - P0 is unchanged since it holds the same value as in P0 B0
- P1 read 110
  - P0 B2 = (S, 110, 00, 30)
  - P1 B2 = (S, 110, 00, 30)
  - M[110] = (00, 30); the read returns 30
- P0 write 108 ← 48
  - P0 B1 = (M, 108, 00, 48)
  - P15 B1 becomes invalid: (I, 108, 00, 08)
- P0 write 130 ← 78
  - P0 B2 = (M, 130, 00, 78)
  - M[110] = (00, 30)
- P15 write 130 ← 78
  - P0 B2 = (M, 130, 00, 78)
40. Sample Problem 3
- We use the figure from the previous example as our memory/cache layout
- Use the MESI protocol, where a memory access takes 100 cycles, a remote cache access takes 70 cycles, an invalidate takes 15 cycles and a write back takes 10 cycles (the totals are recomputed in the sketch below)

  1) P0 read 100          read miss, satisfied in memory
     P0 write 100 ← 40    send out an invalidate signal
     100 + 15 = 115 cycles
  2) P0 read 100          read miss (100 is invalid in P0), satisfied in memory
     P0 read 120          read miss, satisfied in memory
     100 + 100 = 200 cycles
  3) P0 read 100          read miss, satisfied in memory
     P1 write 100 ← 60    write miss, satisfied in memory
     100 + 100 = 200 cycles
  4) P0 read 100          read miss, satisfied in memory
     P0 write 100 ← 60    write hit, send out an invalidate
     P1 write 100 ← 40    write miss, satisfied by P0's cache, write back
     100 + 15 + 70 + 10 = 195 cycles
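
The four cycle totals can be checked with a trivial C program (nothing here beyond the latencies quoted above).

#include <stdio.h>

int main(void)
{
    const int mem = 100, remote = 70, inval = 15, wb = 10;
    printf("1) %d cycles\n", mem + inval);               /* 115 */
    printf("2) %d cycles\n", mem + mem);                 /* 200 */
    printf("3) %d cycles\n", mem + mem);                 /* 200 */
    printf("4) %d cycles\n", mem + inval + remote + wb); /* 195 */
    return 0;
}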