Title: Thread Level Parallelism
1. Thread Level Parallelism
- Since ILP has inherent limitations, can we exploit multithreading?
  - a thread is defined here as a separate process with its own instructions and data
  - this is unlike the traditional (OS) definition of a thread, which shares instructions with other threads but has its own stack and data (a thread in that sense is multiple versions of the same process)
  - a thread may be a traditional thread, a separate process, or a single program executing in parallel
- The idea is that each thread offers different instructions and data, so that when the processor has to stall, it can switch to another thread and continue execution rather than incurring time-consuming stalls
- TLP exploits a different kind of parallelism than ILP
2. Approaches to TLP
- We want to enhance our current processor
  - a superscalar with dynamic scheduling
- Fine-grained multithreading
  - switches between threads at each clock cycle, so threads are executed in an interleaved fashion
  - as the processor switches from one thread to the next, a thread that is currently stalled is skipped over
  - the CPU must be able to switch between threads every clock cycle, so it needs extra hardware support
- Coarse-grained multithreading
  - switches between threads only when the current thread is likely to stall for some time (e.g., a level 2 cache miss)
  - the switching process can be more time consuming since we are not switching nearly as often, and therefore does not need extra hardware support
3. Advantages/Disadvantages
- Fine-grained
  - Adv: less susceptible to stalling situations
  - Adv: throughput costs can be hidden because stalls often go unnoticed
  - Disadv: slows down the execution of each individual thread
  - Disadv: requires a switching process that does not cost any cycles; this can be done at the expense of more hardware (we will require, at a minimum, a PC for every thread)
- Coarse-grained
  - Adv: more natural flow for any given thread
  - Adv: easier to implement the switching process
  - Adv: can take advantage of current processors to implement coarse-grained, but not fine-grained, MT
  - Disadv: limited in its ability to overcome throughput losses from short stalls because the cost of starting the pipeline on a new thread is expensive (in comparison to fine-grained)
4. Simultaneous Multithreading (SMT)
- SMT uses multiple issue and dynamic scheduling on our superscalar architecture but adds multithreading
  - (a) is the traditional approach, with idle slots caused by stalls and a lack of ILP
  - (b) and (c) are fine-grained and coarse-grained MT respectively
  - (d) shows the potential payoff for SMT
  - (e) goes one step further to illustrate multiprocessing
5. Four Approaches
- Superscalar on a single thread (a)
  - we are limited to ILP; or, if we switch threads when one is going to stall, the switch is equivalent to a context switch, which takes many (dozens or hundreds of) cycles
- Superscalar coarse-grained MT (c)
  - fairly easy to implement and a performance increase over no MT support, but still contains empty instruction slots due to short stalling situations (as opposed to the lengthier stalls associated with a cache miss)
- Superscalar fine-grained MT (b)
  - requires switching between threads at each cycle, which requires more complex and expensive hardware, but eliminates most stalls; the only problem is that a thread that lacks ILP or cannot use all of the instruction issue slots will not take full advantage of the hardware
- Superscalar SMT (d)
  - the most efficient way to use the hardware and multithreading, so that as many functional units as possible are kept occupied
6. Superscalar Limitations for SMT
- In spite of the performance increase from combining our superscalar hardware and SMT, there are still inherent limitations
  - how many active threads can be considered at one time?
  - we will be limited by resources such as the number of PCs available to keep track of each thread, the size of the bus needed to accommodate multiple threads fetching instructions at the same time, how many threads can be stored in main memory, etc.
  - finite limits on the buffers used to support the superscalar: reorder buffer, instruction queue, issue buffer
  - limitations on bandwidth between the CPU and cache/memory
  - limitations on the combination of instructions that can be issued at the same time
  - consider four threads, each of which contains an abnormally large number of FP multiplies but no FP adds: the multiplier functional unit(s) will be very busy while the adder remains idle
7. SMT Design Challenges
- Superscalars perform best on lengthier pipelines
- We will implement SMT using fine-grained MT only, so we need
  - a large register file to accommodate multiple threads
  - a per-thread renaming table and more registers for renaming
  - separate PCs for each thread
  - the ability to commit instructions of multiple threads in the same cycle
  - added logic that does not require an increase in clock cycle time
  - cache and TLB setups that can handle simultaneous thread access without a degradation in their performance (miss rate, hit time)
- In spite of the design challenges, we will find
  - performance on each individual thread will decrease (this is natural since every thread will be interrupted as the CPU switches to other threads, cycle by cycle)
- One alternative strategy is to have a preferred thread whose instructions are issued every cycle as far as possible
  - the functional unit slots not used are filled by alternate threads
  - if the preferred thread reaches a substantial stall, other threads fill in until the stall ends
8. SMT Example Design
- The IBM Power5 was built on top of the Power4 pipeline
  - but in this case, the Power5 implements SMT
  - simple design choices were made wherever possible
  - increase the associativity of the L1 instruction cache and TLB to offset the impact of multithreaded access to the cache and TLB
  - add per-thread load/store queues
  - increase the size of the L2 and L3 caches to permit more threads to be represented in these caches
  - add separate instruction prefetch and buffering hardware
  - increase the number of virtual registers for renaming
  - increase the size of the instruction issue queues
- The cost of these enhancements is not extreme (although they do take up more space on the chip); are the performance payoffs worthwhile?
9. Performance Improvement of SMT
- As it turns out, the improvement of SMT over a single-threaded processor is only modest
  - in part this is because multi-issue processors have not increased their issue size over the past few years; to best take advantage of SMT, issue size should increase from maybe 4 to 8 or more, but this is not practical
- The Pentium IV Extreme had improvements of
  - 1.01 and 1.07 on the SPEC int and SPEC FP benchmarks respectively over the Pentium IV (the Extreme is a Pentium IV with SMT support)
  - when running 2 SPEC benchmarks at the same time in SMT mode, improvements ranged from 0.9 to 1.58 with an average improvement of 1.20
- Conclusions
  - SMT has benefits, but the costs do not necessarily pay for the improvement
  - another option: use multiple CPU cores on a single processor (see (e) from the figure on slide 4)
  - another factor discussed in the text (but skipped here) is the increasing demand on power consumption as we continue to add support for ILP/TLP/SMT
10. Advanced Multi-Issue Processors
- Here, we wrap up chapter 3 with a brief comparison of multi-issue superscalar processors
11. Comparison on Integer Benchmarks
12. Comparison on FP Benchmarks
13. Introduction to Multiprocessors
- Taking parallelism to the next level, we have entirely independent processors
  - unlike the superscalar, which permitted parallel processing of one or more threads but had the overhead of keeping track of which thread it was executing and how to switch between threads
- The multiprocessor can execute
  - multiple programs simultaneously
  - one program distributed across processors using shared memory or interprocessor communication
- Flynn defined the classification of architecture types in the 1960s, and it has largely remained the same
  - SISD: single instruction on single data, the traditional processor, whether pipelined or not
  - SIMD: single instruction on multiple data; this achieves data-level parallelism, useful for vector/array-based operations
  - MISD: multiple instructions on a single datum, never developed
  - MIMD: multiple instructions on multiple data, the true multiprocessor
14. SIMD Architectures
- Bit-slice processors
  - processors execute bit-level operations on bit-slices from memory
  - a group of processors performs one word's worth of operations by distributing the operation bit by bit
- Processor arrays
  - operate on 1-D or 2-D data so that each processing element executes the current instruction on one datum
  - two flavors: vector machines and matrix machines
- Hypercube processors
  - similar to a processor array except that processors are connected to their nearest-neighbor processors for communication purposes
  - in a machine with 2^n processors, each processor connects to n neighbors (see the sketch after this slide)
- A vector machine has one control unit sending a single instruction to multiple processing elements, each executing the instruction on one datum from the array
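
As a concrete illustration of the hypercube connectivity described above (my own sketch, not from the text): if nodes are numbered 0 to 2^n - 1, two nodes are neighbors exactly when their IDs differ in one bit, so a node's n neighbors can be listed by flipping each bit in turn.

#include <stdio.h>

/* Print the n neighbors of a node in an n-dimensional hypercube.
   Nodes are numbered 0 .. 2^n - 1; two nodes are neighbors when
   their IDs differ in exactly one bit position. */
void print_neighbors(unsigned node, unsigned n)
{
    for (unsigned bit = 0; bit < n; bit++)
        printf("node %u <-> node %u\n", node, node ^ (1u << bit));
}

int main(void)
{
    print_neighbors(5, 4);   /* node 5 in a 16-node (2^4) hypercube */
    return 0;
}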
15. Bit-Slice Example
- We have an array of boolean values
  - we want to know if any of the bits is equal to 1 (do a parallel OR)
  - assume that processors can read the single datum in parallel but only one processor can write a result
  - on a tie, the processor that writes the result is the processor with the lowest ID
  - show a parallel algorithm and determine its complexity assuming that you have n processors for an n-bit value
- The following solution takes O(1) time (instead of O(n) time on a normal processor)
  - only processors whose bit is 1 will write to result, so if there are none, result stays 0; otherwise the processor with the smallest ID writes 1 to result (a sequential C simulation follows the pseudocode)

  Processor 0 writes 0 to result
  For each processor j (in parallel):
      read datum x
      extract bit j, storing it in location temp
      if temp == 1 then write 1 to result
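
The algorithm above assumes a machine where all processors act in one step. The following C sketch (purely illustrative, not code from the slides) simulates those steps sequentially; the loop stands in for the n processors acting at once, which is why the parallel running time is O(1).

#include <stdio.h>

/* Sequential simulation of the parallel-OR algorithm: conceptually all n
   "processors" run their step at the same time. */
int parallel_or(unsigned x, int n)
{
    int result = 0;                  /* processor 0 writes 0 to result */
    for (int j = 0; j < n; j++) {    /* each processor j, in parallel  */
        int temp = (x >> j) & 1;     /* extract bit j of the datum     */
        if (temp == 1)
            result = 1;              /* concurrent write: every writer writes 1 */
    }
    return result;
}

int main(void)
{
    printf("%d\n", parallel_or(0x0008, 16));  /* 1: bit 3 is set  */
    printf("%d\n", parallel_or(0x0000, 16));  /* 0: no bit is set */
    return 0;
}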
16. Vector-Array Example
- An array stores n int values
  - using a vector processor with n/2 processing elements, show how we can find the largest value in O(log n) time

  Let incr = 1
  Each processor j operates on the following (until j is no longer needed):
      while (incr < n)
          if (a[j + incr] > a[j]) then a[j] = a[j + incr]
          Processor 0: incr = incr * 2
  The answer is in a[0]

- This is known as a tournament algorithm (a sequential C simulation follows)
  - with 16 elements, we find the largest in 4 iterations
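
A sequential C simulation of the tournament (an illustrative sketch, not code from the slides): each pass of the outer loop is one parallel step, and the inner loop stands in for the processors that are still needed in that step, so only log2(n) parallel steps occur.

#include <stdio.h>

/* Sequential simulation of the tournament-maximum reduction.
   In each round, "processor" j combines a[j] with a[j + incr]. */
int tournament_max(int a[], int n)       /* n assumed to be a power of two */
{
    for (int incr = 1; incr < n; incr *= 2)
        for (int j = 0; j + incr < n; j += 2 * incr)   /* these run in parallel */
            if (a[j + incr] > a[j])
                a[j] = a[j + incr];
    return a[0];                         /* the maximum ends up in a[0] */
}

int main(void)
{
    int a[16] = {3, 41, 7, 12, 9, 28, 5, 16, 2, 33, 19, 8, 40, 11, 6, 25};
    printf("max = %d\n", tournament_max(a, 16));   /* prints 41, after 4 rounds */
    return 0;
}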
17. Development of Multiprocessors
- Multiprocessor systems have been developed since the 1960s
  - but it wasn't until the 1980s, when processors and memories became more affordable, that systems could largely be built using off-the-shelf components attached to a single bus
- Two flavors of multiprocessor systems
  - tightly coupled, or shared memory
  - loosely coupled, or distributed memory
  - the latter category is similar to a network of computers
18. Multiprocessor Problems
- There are two inherent problems with multiprocessors
  - accessing remote memory is very time consuming
  - processes are often sequential in nature, making them hard to parallelize
- Examples (both calculations are written out in a short program below)
  - we have 100 processors with which we want to achieve an 80 times speedup over a uniprocessor on a given application; how much of the time can the application be executing sequentially?
  - by Amdahl's Law, 80 = 1 / ((1 - x) + x / 100); solving for x gives approximately .9975, so our application must be running in parallel 99.75% of the time, or it can be sequential only 1 - 99.75% = 0.25% of the time!
  - a 32-processor system has a 200 ns remote access time, a 2 GHz clock and a computation CPI of 0.5; assuming all local memory accesses are hits, how much faster is the machine if all memory accesses are local as opposed to 0.2% of them being remote?
  - remote memory access time = 200 ns / 0.5 ns (clock cycle time) = 400 clock cycles
  - CPI with remote accesses = 0.5 + 0.2% * 400 = 1.3
  - CPI without remote accesses = 0.5
  - the application with no remote accesses is 1.3 / 0.5 = 2.6 times faster!
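
For reference, the two calculations above can be written out as a small C program (an illustrative sketch; the only inputs are the numbers quoted on this slide).

#include <stdio.h>

int main(void)
{
    /* Amdahl's Law: speedup = 1 / ((1 - f) + f / p) with p = 100 and a
       target speedup of 80. Solving for the parallel fraction f: */
    double f = (1.0 - 1.0 / 80.0) / (1.0 - 1.0 / 100.0);
    printf("parallel fraction f = %.4f\n", f);           /* ~0.9975 */

    /* Remote-access example: 200 ns remote access, 2 GHz clock (0.5 ns/cycle),
       base CPI 0.5, 0.2% of references go to remote memory. */
    double remote_cycles = 200.0 / 0.5;                  /* 400 cycles       */
    double cpi_remote    = 0.5 + 0.002 * remote_cycles;  /* 0.5 + 0.8 = 1.3  */
    double cpi_local     = 0.5;
    printf("speedup if all accesses are local = %.1f\n", cpi_remote / cpi_local);
    return 0;
}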
19. Symmetric Shared Memory Architecture
- The typical form of a tightly coupled architecture is one where we expect memory access to be symmetric
  - that is, where we expect all memory accesses to take about the same amount of time
- SSM architectures limit the number of processors because the shared memory becomes a bottleneck
  - this can be alleviated with memories distributed across chips, high-order interleaving, and multiple buses (but that gets expensive)
- One way to reduce the impact of the bottleneck is large, local multi-level caches
  - although caching seems an obvious way to reduce memory contention and the bottleneck, in addition to improving processor CPI, it comes with a cost in a shared memory system: cache coherence
20. Cache Coherence
- A memory system is considered coherent if any read obtains the most recently written version of the datum
- And two writes to the same datum are serialized so that all processors see the writes in the same order
- Example (table below)
  - cache coherence is required so that this problem does not arise; how can we prevent it?
  - two solutions: directory-based and snooping caches

  Action                 Cache for Proc 1   Cache for Proc 2   Main memory value of X
  (initially)            ---                ---                0
  Proc 1 reads X         0                  ---                0
  Proc 2 reads X         0                  0                  0
  Proc 1 writes 1 to X   1                  0                  1
  Proc 2 reads X         1                  0                  1

- Processor 2 now has an obsolete value for X, but since it is in its local cache, the access is a hit and so it continues to read the wrong value!
21. Snooping Protocols
- All caches monitor some centralized communication mechanism
  - typically a single bus that connects each cache to memory
- Upon any write, the processor signals that the particular datum (denoted by its memory address) has been modified
  - all other caches are snooping this line (listening for such a message)
  - upon receiving a write message, each cache checks to see if the updated value is stored locally (that value will now be invalid)
- Two common versions of snooping are
  - write invalidate, where each cache marks the updated address as invalid so that a future read is treated as a cache miss
  - write update, where the updated datum itself is distributed so that all caches that have that datum can update their values immediately
  - note: in either case, it is easier to implement this if the cache write policy is write through rather than write back
22. Write Invalidate
- Write update sounds like the better approach
  - no data will ever become obsolete
- But in fact it is more expensive to implement in terms of bandwidth usage, and so is less common
- Write invalidate only costs in terms of increased miss rates
  - to ensure that the write can proceed, the cache must invalidate all other caches' copies of the datum prior to the write
  - the processor must first gain control of the bus
  - in case of a tie (two processors wanting to invalidate a datum simultaneously), the bus arbiter will cause both processors to wait and will randomly select between them
- Example (table below)
  Action                 Cache for Proc 1   Cache for Proc 2    Main memory value of X
  (initially)            ---                ---                 0
  Proc 1 reads X*        0                  ---                 0
  Proc 2 reads X*        0                  0                   0
  Proc 1 writes 1 to X   1                  --- (invalidated)   1
  Proc 2 reads X*        1                  1                   1

  * denotes a cache miss
23. MESI Protocol
- The common implementation of a snoopy cache is the MESI protocol
- M: modified
  - the data (block) has been modified, so other cached versions are invalidated; continued local reads or writes to this block cause no further changes
- E: exclusive
  - the datum is stored exclusively in this cache, so it can be written to without invalidation elsewhere
- S: shared
  - the datum cannot be written to directly since it is shared; a write requires an invalidation first
- I: invalid
  - same as a miss, the datum has been flagged as invalid
- The table on the next slide (figure 4.5, page 213) demonstrates how the MESI protocol works
- A variation of the MESI protocol is the MOESI protocol, where O stands for "ownership": a given cache block can be shared, but the cache denoted as the owner is responsible for updating other processors when the block has been modified (a simplified sketch of the basic MESI transitions follows)
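
A minimal sketch of the MESI transitions in C (my own simplification for illustration; a real controller also moves the data and arbitrates for the bus): one function covers the local processor's reads and writes, the other covers requests snooped from other caches.

#include <stdio.h>
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* State of this cache's copy after the local processor reads or writes it.
   (Writes from S or I must broadcast an invalidate on the bus first.) */
mesi_t on_local_access(mesi_t s, bool is_write, bool other_sharers)
{
    if (is_write)
        return MODIFIED;
    if (s == INVALID)                       /* read miss */
        return other_sharers ? SHARED : EXCLUSIVE;
    return s;                               /* read hit: state unchanged */
}

/* State of this cache's copy after snooping another processor's request.
   (A MODIFIED block must be written back before being shared or invalidated.) */
mesi_t on_snooped_request(mesi_t s, bool remote_is_write)
{
    if (remote_is_write)
        return INVALID;
    if (s == MODIFIED || s == EXCLUSIVE)
        return SHARED;
    return s;
}

int main(void)
{
    mesi_t s = on_local_access(INVALID, false, false);  /* read miss, alone: E */
    s = on_local_access(s, true, false);                /* local write:      M */
    s = on_snooped_request(s, false);                   /* remote read:      S */
    printf("final state = %d (SHARED = %d)\n", s, SHARED);
    return 0;
}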
24. (figure 4.5: the MESI protocol state table; no transcript)
25. Implementing the MESI Protocol
26. Performance of SSM Multiprocessors
- Overall cache performance is a combination of
  - the uniprocessor cache miss rate
  - the traffic caused by communication (which includes cache invalidations)
- These are complicated by
  - true sharing misses, caused by invalidations
  - false sharing misses, caused when a block is invalidated but the requested word in that block has not been modified
- Changing the cache size, block size, or number of processors can impact performance in complex ways
- Here, we consider an AlphaServer
  - 4 Alpha 21164 processors with a 300 MHz clock
  - 3 levels of cache (the 1st level is a write-through cache to permit easier snooping; the 2nd and 3rd level caches are write-back); latencies are 7 cycles, 21 cycles and 80 cycles for misses at each level respectively
  - a cache-to-cache transfer of shared data takes 125 cycles
  - the percentage of time the CPU is idle because of cache misses is < 1% when run on 3 server benchmark programs (online transaction processing, decision support system, web index search)
27. Benchmark Performance
28. Distributed Shared Memory
- A 16-processor multiprocessor with 64-byte blocks and 512 KB data caches can require as much as 170 GB/sec of bus bandwidth!
  - this is obviously a huge problem when modern processors might have a memory/bus bandwidth capable of supporting 12 GB/sec
- In order to get around this problem, we must use a distributed memory layout
  - this lowers the bandwidth demand to an acceptable amount because most of the traffic is local, between a processor and its own memory
- But a distributed memory makes snooping much more difficult
  - all changes to data would have to be broadcast over the interconnection network, and all caches would have to snoop there
  - instead, we turn to a different cache coherence mechanism: the directory-based approach
- A directory keeps track of every block that might be in a cache
  - which blocks are in which caches, whether the block has been modified or not, and other useful information
29. Implementing Directory-Based Caches
- We store every memory block as an entry in the directory
  - this method works for a reasonably large number of processors (e.g., 200 or less), as we would expect a decent amount of overhead for the directory
  - for extremely large systems with enormous memories, we might need better data structures (such as storing fewer bits per block entry)
  - the directory is actually distributed, or interleaved, so that it does not become a bottleneck
  - a directory on a single site would quickly become a bottleneck
  - as an example, a portion of the directory might be placed with every processor's local memory (see figure 4.19)
- The directory has to handle two operations
  - handling a read miss
  - handling a write to shared data
  - and it stores the status of every block: shared, uncached, or modified
- Communication between processors and between instances of the directory is performed by message passing
30. (figure showing the directory protocol messages)
P = requesting processor, A = requested address, D = data contents
"Local" means the local cache, "Remote" means a remote cache, and "Home" means the home directory
31. Implementation Details
- When a block is currently uncached
  - the copy in memory is the current value
  - read miss: the requesting processor is sent the block from memory, and the requestor is made the only sharing node
  - write miss: the requesting processor is sent the datum and made the owner; the block is made exclusive, although it is indicated as shared
- When a block is shared
  - the memory value is up to date, but there is at least one cached copy, maybe more
  - read miss: the requesting processor is sent the block from memory and is added to the sharing list
  - write miss: the requesting processor is sent the block from memory and added to the sharers list, all other sharers are sent invalidation messages, and the state of the block is made exclusive
32. Continued
- When a block is exclusive
  - the copy of the block held by the owner (the cache for which the block was made exclusive) is the up-to-date value
  - read miss: the owner transmits the copy to the requestor, the directory adds the requestor to the sharers list, and the owner also updates memory at the same time, changing the status from exclusive to shared
  - data write back: updates the memory copy, the home directory becomes the owner again, the block is now uncached, and the sharers set is emptied
  - write miss: block ownership is given to the requestor, a message is sent to the old owner to invalidate the block and send the value to the requestor, which updates the value; the sharers set becomes just the new owner and the block remains exclusive
- See figures 4.21 and 4.22 for more details (a simplified sketch of the read-miss handling follows)
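
The read-miss handling described above can be sketched as follows in C (an illustrative sketch only; the directory entry layout and the "message" functions are assumptions, not the text's implementation; they just print what a real directory would send over the interconnection network).

#include <stdio.h>
#include <stdint.h>

typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } dir_state_t;

typedef struct {
    dir_state_t state;
    uint32_t    sharers;   /* bit i set => processor i holds a copy */
    int         owner;     /* meaningful only in EXCLUSIVE_ST       */
} dir_entry_t;

static void send_block_from_memory(int to, int block)
{ printf("send block %d from memory to P%d\n", block, to); }

static void fetch_and_downgrade_owner(int owner, int block)
{ printf("P%d writes block %d back; its copy becomes shared\n", owner, block); }

void handle_read_miss(dir_entry_t *e, int block, int requestor)
{
    if (e->state == EXCLUSIVE_ST) {           /* owner holds the current value */
        fetch_and_downgrade_owner(e->owner, block);
        e->sharers |= 1u << e->owner;
    }
    send_block_from_memory(requestor, block); /* requestor receives the block  */
    e->sharers |= 1u << requestor;            /* and joins the sharing list    */
    e->state = SHARED_ST;
}

int main(void)
{
    dir_entry_t e = { EXCLUSIVE_ST, 1u << 3, 3 };
    handle_read_miss(&e, 42, 0);              /* P0 read-misses on block 42    */
    return 0;
}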
33. Sun T1 Server
- We wrap up this chapter with a brief look at the Sun T1 multiprocessor server
- The T1 uses an 8-core processor (1.2 GHz)
  - that is, there are 8 pipelines
- The T1 is a single-issue pipeline where each pipeline performs fine-grained multithreading on up to 4 threads
  - the limitation of 4 threads per core allows multithreading to be supported directly in hardware by using 4 PCs per pipeline
- The T1 pipeline has 6 stages
  - this is the same as the MIPS 5-stage pipeline with 1 added stage to perform thread switching
  - branches and loads each have a 3-cycle penalty, which can be hidden if the other 3 threads are active (not idle or stalled)
  - as this is a server, FP operations are not emphasized, and therefore all 8 pipelines share the same floating point unit
  - cache coherence is enforced using a directory approach where directories are distributed across the L2 caches, each keeping track of which L1 caches have copies of data that are stored in its L2 cache
34. The T1 Architecture
- The crossbar is an interconnection network
- Each core has its own L1 cache
- Notice that there are only 4 L2 caches (instead of 8)
- The single FP unit is used by all cores as needed
35. T1 Performance
- The 4-thread core hides some of the latency that arises from limited ILP
  - this manifests itself as a lower effective L1 cache miss penalty (latency)
  - figure 4.26 shows that a single thread will see 1.1 to 1.2 times the latency from cache misses compared with the 4-thread core
- Similarly, larger L2 caches with bigger blocks hide latency
  - the increased size has the obvious effect, but the impact is not as much as one would expect; compare for instance the decrease from a 3 MB L2 cache with 32-byte blocks to a 6 MB L2 cache with 32-byte blocks
  - increasing the block size causes additional message traffic in the interconnection network, resulting in larger latencies, so caches with smaller block sizes (fewer words per block) are preferred
  - figure 4.28 shows these impacts
- For a 4-thread core, the ideal per-thread CPI is 4
  - the T1 averages between 5.6 and 6.6
  - but the per-core CPI (ideal is 1) ranges from 1.4 to 1.8
- For all 8 cores, the effective CPI is between .175 and .225
  - compare this to multi-issue processors that might have CPIs of .3 or .4
  - so while the per-thread performance is not impressive, the overall throughput and effective CPI of the entire processor are (the arithmetic relating these figures is sketched below)
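
Assuming the per-thread, per-core, and whole-chip CPI figures relate simply through the 4 threads per core and 8 cores per chip (an approximation; the quoted ranges do not line up exactly at the high end), the arithmetic looks like this in C.

#include <stdio.h>

int main(void)
{
    double per_thread_cpi = 5.6;                     /* 5.6 .. 6.6 on the T1 */
    double per_core_cpi   = per_thread_cpi / 4.0;    /* 4 threads per core   */
    double per_chip_cpi   = per_core_cpi / 8.0;      /* 8 cores per chip     */
    printf("per-core CPI = %.2f\n", per_core_cpi);   /* 1.40                 */
    printf("per-chip CPI = %.3f\n", per_chip_cpi);   /* 0.175                */
    return 0;
}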
36. Comparison
- The T1 as just described is compared to the
  - AMD Opteron (2 cores, 3 instruction issues per cycle, 2.4 GHz, does not support multithreading)
  - Intel Pentium D (2 cores, 3 instruction issues per cycle, 3.2 GHz, supports SMT)
  - IBM Power5 (2 cores, 4 instruction issues per cycle, 1.9 GHz, supports SMT)
  - we don't compare the T1 on FP benchmarks because, being a server, its only FP hardware is shared and therefore FP benchmarks would perform poorly
- Details are given in figure 4.33, where the T1 clearly outperforms the other processors on all non-FP benchmarks
  - except for the SPEC integer benchmarks, where it is only marginally better than the Power5 and Opteron
  - the moral of this story appears to be that proper support for fine-grained threads, coupled with multiple non-FP threads, provides better performance than superscalar pipelines
37. Sample Problem 1
- How many processors does it take a processor array to find the maximum value of an array of n values in O(1) time?
- Solution: n^2, as follows
  - denote each processor as p(i,j), where processor p(i,j) is assigned the value of a[i] (so that each array value is held by n processors)
  - each processor p(i,j) compares the two array values a[i] and a[j]
  - if a[i] < a[j] then write 1 to array location b[i], else write 1 to b[j]
  - there may be multiple concurrent writes, in which case the processor with the lowest ID writes to b[i] (or b[j]), but since every writer writes 1, the effect is simply that a 1 is written to some elements of b
  - using n of the original processors, assign a processor to each element of b
  - if b[i] == 0 then write i to datum k
  - a[k] will be the maximum item
  - can you figure out why?
- This algorithm executes in parallel taking a total of two loads, two comparisons and two writes no matter the size of n, so it is O(1) (a sequential simulation is sketched below)
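
A sequential C simulation of the idea (an illustrative sketch; it uses only the first half of the comparison rule, marking b[i] whenever a[i] loses a comparison, which is sufficient). The two nested loops stand in for the n*n processors p(i,j), all of which would run their single comparison at the same time.

#include <stdio.h>

int constant_time_max(const int a[], int b[], int n)
{
    for (int i = 0; i < n; i++)
        b[i] = 0;                       /* n processors clear b in parallel  */

    for (int i = 0; i < n; i++)         /* processor p(i,j): one comparison  */
        for (int j = 0; j < n; j++)
            if (a[i] < a[j])
                b[i] = 1;               /* a[i] lost at least one comparison */

    for (int i = 0; i < n; i++)         /* only a maximum never loses;       */
        if (b[i] == 0)                  /* lowest index wins on ties         */
            return a[i];
    return a[0];                        /* not reached for n >= 1            */
}

int main(void)
{
    int a[] = {12, 7, 41, 3, 28, 41}, b[6];
    printf("max = %d\n", constant_time_max(a, b, 6));   /* prints 41 */
    return 0;
}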
38. Sample Problem 2
- Assume the memory setup as shown to the right
- Determine the resulting state of the caches and memory after each operation, using write invalidate
- P0 read 120
  - P0 B0 = (S, 120, 00, 20)
  - reads 20
- P0 write 120 ← 80
  - P0 B0 is modified to (M, 120, 00, 80)
  - 120 ← 80 is broadcast over the bus
  - P15 invalidates B0: (I, 120, 00, 20)
39. Continued
- P15 write 120 ← 80
  - P15 B0 = (M, 120, 00, 80)
  - P0 is unchanged since it holds the same value as in P0 B0
- P1 read 110
  - P0 B2 = (S, 110, 00, 30)
  - P1 B2 = (S, 110, 00, 30)
  - M[110] = (00, 30); the read returns 30
- P0 write 108 ← 48
  - P0 B1 = (M, 108, 00, 48)
  - P15 B1 becomes invalid: (I, 108, 00, 08)
- P0 write 130 ← 78
  - P0 B2 = (M, 130, 00, 78)
  - M[110] = (00, 30)
- P15 write 130 ← 78
  - P0 B2 = (M, 130, 00, 78)
40. Sample Problem 3
- We use the figure from the previous example as our memory/cache layout
- Use the MESI protocol, where a memory access takes 100 cycles, a remote cache access takes 70 cycles, an invalidate takes 15 cycles and a write back takes 10 cycles (the totals are recomputed in the sketch below)

  1) P0 read 100          read miss, satisfied in memory
     P0 write 100 ← 40    send out an invalidate signal
     100 + 15 = 115 cycles
  2) P0 read 100          read miss (100 is invalid in P0), satisfied in memory
     P0 read 120          read miss, satisfied in memory
     100 + 100 = 200 cycles
  3) P0 read 100          read miss, satisfied in memory
     P1 write 100 ← 60    write miss, satisfied in memory
     100 + 100 = 200 cycles
  4) P0 read 100          read miss, satisfied in memory
     P0 write 100 ← 60    write hit, send out an invalidate
     P1 write 100 ← 40    write miss, satisfied by P0's cache, write back
     100 + 15 + 70 + 10 = 195 cycles
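
The four cycle totals can be checked with a trivial C program (nothing here beyond the latencies quoted above).

#include <stdio.h>

int main(void)
{
    const int mem = 100, remote = 70, inval = 15, wb = 10;
    printf("1) %d cycles\n", mem + inval);               /* 115 */
    printf("2) %d cycles\n", mem + mem);                 /* 200 */
    printf("3) %d cycles\n", mem + mem);                 /* 200 */
    printf("4) %d cycles\n", mem + inval + remote + wb); /* 195 */
    return 0;
}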