Thread Level Parallelism
1
Thread Level Parallelism
  • Since ILP has inherent limitations, can we
    exploit multithreading?
  • a thread is defined as a separate process with
    its own instructions and data
  • this is unlike the traditional (OS) definition of
    a thread, which shares instructions with other
    threads while each thread has its own stack and
    data (a thread in that case is one of several
    instances of the same process)
  • a thread may be a traditional thread or a
    separate process or a single program executing in
    parallel
  • the idea here is that each thread offers different
    instructions and data, so that when the processor
    would otherwise stall, it can switch to another
    thread and continue execution, avoiding
    time-consuming stalls
  • TLP exploits a different kind of parallelism than
    ILP

2
Approaches to TLP
  • We want to enhance our current processor:
  • a superscalar with dynamic scheduling
  • Fine-grained multi-threading
  • switches between threads at each clock cycle
  • thus, threads are executed in an interleaved
    fashion
  • as the processor switches from one thread to the
    next, a thread that is currently stalled is
    skipped over
  • the CPU must be able to switch between threads at
    every clock cycle, so it needs extra hardware
    support (a toy sketch of this per-cycle switching
    appears after this list)
  • Coarse-grained multi-threading
  • switches between threads only when current thread
    is likely to stall for some time (e.g., level 2
    cache miss)
  • the switching process can be more time consuming
    since we are not switching nearly as often, and
    therefore it does not need extra hardware support
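Below is a toy, sequential C sketch of the fine-grained switching policy just described: every cycle the next non-stalled thread is chosen in round-robin order, and stalled threads are simply skipped. The thread_t fields, the NTHREADS value and the stall bookkeeping are illustrative assumptions, not details from the slides.

#include <stdio.h>

#define NTHREADS 4

/* Toy model: each thread has a program counter and a cycle number
   before which it is stalled (standing in for an outstanding miss). */
typedef struct { int pc; int stall_until; } thread_t;

/* Pick the next thread after "last" that is not stalled this cycle. */
int pick_thread(const thread_t t[], int last, int cycle) {
    for (int i = 1; i <= NTHREADS; i++) {
        int cand = (last + i) % NTHREADS;
        if (t[cand].stall_until <= cycle)   /* skip threads still stalled */
            return cand;
    }
    return -1;                              /* every thread is stalled    */
}

int main(void) {
    /* thread 1 is stalled until cycle 3, thread 3 until cycle 6 */
    thread_t t[NTHREADS] = {{0, 0}, {0, 3}, {0, 0}, {0, 6}};
    int last = NTHREADS - 1;
    for (int cycle = 0; cycle < 8; cycle++) {
        int cur = pick_thread(t, last, cycle);
        if (cur >= 0) { t[cur].pc++; last = cur; }
        printf("cycle %d: issue from thread %d\n", cycle, cur);
    }
    return 0;
}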

3
Advantages/Disadvantages
  • Fine-grained
  • Adv: less susceptible to stalling situations
  • Adv: throughput costs can be hidden because
    stalls often go unnoticed
  • Disadv: slows down execution of each individual
    thread
  • Disadv: requires a switching process that does
    not cost any cycles; this can be done at the
    expense of more hardware (we will require, at a
    minimum, a PC for every thread)
  • Coarse-grained
  • Adv: more natural flow for any given thread
  • Adv: easier to implement the switching process
  • Adv: can take advantage of current processors to
    implement coarse-grained, but not fine-grained
  • Disadv: limited in its ability to overcome
    throughput losses from short stalls because the
    cost of restarting the pipeline on a new thread
    is high (in comparison to fine-grained)

4
Simultaneous Multi-threading (SMT)
  • SMT uses multiple issue and dynamic scheduling on
    our superscalar architecture but adds
    multi-threading
  • (a) is the traditional approach with idle slots
    caused by stalls and a lack of ILP
  • (b) and (c) are fine-grained and coarse-grained
    MT respectively
  • (d) shows the potential payoff for SMT
  • (e) goes one step further to illustrate
    multiprocessing

5
Four Approaches
  • Superscalar on a single thread (a)
  • we are limited to ILP or, if we switch threads
    when one is going to stall, then the switch is
    equivalent to a context switch, which takes many
    (dozens or hundreds) of cycles
  • Superscalar coarse-grained MT (c)
  • fairly easy to implement, performance increase
    over no MT support, but still contains empty
    instruction slots due to short stalling
    situations (as opposed to lengthier stalls
    associated with cache miss)
  • Superscalar fine-grained MT (b)
  • requires switching between threads at each cycle
    which requires more complex and expensive
    hardware, but eliminates most stalls, the only
    problem is that a thread that lacks ILP or cannot
    make use of all instruction issue slots will not
    take full advantage of the hardware
  • Superscalar SMT (d)
  • most efficient way to use hardware and
    multithreading so that as many functional units
    as possible can be occupied

6
Superscalar Limitations for SMT
  • In spite of the performance increase by combining
    our superscalar hardware and SMT, there are still
    inherent limitations
  • how many active threads can be considered at one
    time?
  • we will be limited by resources such as number of
    PCs available to keep track of each thread, size
    of bus to accommodate multiple threads having
    instruction fetches at the same time, how many
    threads can be stored in main memory, etc
  • finite limitation on buffers used to support the
    superscalar
  • reorder buffer, instruction queue, issue buffer
  • limitations on bandwidth between CPU and
    cache/memory
  • limitation on the combination of instructions
    that can be issued at the same time
  • consider four threads, each of which contains an
    abnormally large number of FP multiplies but no
    FP adds; then the multiplier functional unit(s)
    will be very busy while the adder remains idle

7
SMT Design Challenges
  • Superscalars perform best on lengthier pipelines
  • We will only implement SMT using fine-grained MT
    so we need
  • large register file to accommodate multiple
    threads
  • per-thread renaming table and more registers for
    renaming
  • separate PCs for each thread
  • ability to commit instructions of multiple
    threads in the same cycle
  • added logic that does not require an increase in
    clock cycle time
  • cache and TLB setups that can handle simultaneous
    thread access without a degradation in their
    performance (miss rate, hit time)
  • In spite of the design challenges, we will find
    that performance on each individual thread
    decreases (this is natural since every thread
    will be interrupted as the CPU switches to other
    threads, cycle-by-cycle)
  • One alternative strategy is to have a preferred
    thread whose instructions are issued every cycle
    whenever possible
  • the functional unit slots not used are filled by
    alternate threads
  • if the preferred thread reaches a substantial
    stall, other threads fill in until the stall ends

8
SMT Example Design
  • The IBM Power5 was built on top of the Power4
    pipeline
  • but in this case, the Power5 implements SMT
  • simple design choices whenever possible
  • increase associativity of L1 instruction cache
    and TLB to offset the impact that might arise
    because of multithreading access to the cache and
    TLB
  • add per-thread load/store queues
  • increase size of L2 and L3 caches to permit more
    threads to be represented in these caches
  • add separate instruction prefetch and buffering
    hardware
  • increase number of virtual registers for renaming
  • increase size of instruction issue queues
  • the cost of these enhancements is not extreme
    (although they do take up more space on the
    chip); are the performance payoffs worthwhile?

9
Performance Improvement of SMT
  • As it turns out, the improvement gains of SMT
    over a single-threaded processor are only modest
  • in part this is because multi-issue processors
    have not increased their issue width over the
    past few years; to best take advantage of SMT,
    issue width should increase from maybe 4 to 8 or
    more, but this is not practical
  • Pentium IV Extreme had improvements of
  • 1.01 and 1.07 for SPEC int and SPEC FP benchmarks
    respectively over the Pentium IV (Extreme =
    Pentium IV + SMT support)
  • When running 2 SPEC benchmarks at the same time
    in SMT mode, improvements ranged from
  • 0.9 to 1.58 with an average improvement of 1.20
  • Conclusions
  • SMT has benefits but the costs do not necessarily
    pay for the improvement
  • another option: use multiple CPU cores on a
    single processor (see (e) from the figure on
    slide 4)
  • another factor discussed in the text (but skipped
    here) is the increasing demands on power
    consumption as we continue to add support for
    ILP/TLP/SMT

10
Advanced Multi-Issue Processors
  • Here, we wrap up chapter 3 with a brief
    comparison of multi-issue superscalar processors

11
Comparison on Integer Benchmarks
12
Comparison on FP Benchmarks
13
Introduction to Multiprocessors
  • Taking parallelism to the next level, we have
    entirely independent processors
  • unlike the superscalar, which permitted parallel
    processing on one or more threads but had the
    overhead of keeping track of which thread it was
    executing and how to switch between threads
  • The multiprocessor can execute
  • multiple programs simultaneously
  • one program distributed across processors using
    shared memory or interprocessor communication
  • Flynn defined the classification of architecture
    types in the 1960s and this has largely remained
    the same
  • SISD: single instruction on single data, the
    traditional processor whether it is pipelined or
    not
  • SIMD: single instruction on multiple data; this
    achieves data-level parallelism, useful for
    vector/array-based operations
  • MISD: multiple instructions on a single datum,
    never developed
  • MIMD: multiple instructions on multiple data;
    this is the true multiprocessor

14
SIMD Architectures
  • Bit-slice processors
  • processors execute bit-level operations on
    bit-slices from memory
  • a group of processors performs one word's worth
    of operations by distributing the operation
    bit-by-bit
  • Processor arrays
  • operate on 1-D or 2-D data so that each
    processing element executes the current
    instruction on one datum
  • two flavors: vector machines and matrix machines
  • Hypercube processors
  • similar to a processor array except that
    processors are connected to nearest neighbor
    processors for communication purposes
  • in a machine with 2^n processors, each processor
    connects to n neighbors

A vector machine has one control unit sending a
single instruction to multiple processing
elements, each executing the instruction on one
datum from the array
15
Bit-Slice Example
  • We have an array of boolean values
  • we want to know if any of the bits is equal to 1
    (do a parallel OR)
  • assume that processors can read the single datum
    in parallel but only one processor can write a
    result
  • on a tie, the processor to write the result will
    be the processor with the lowest ID
  • show a parallel algorithm and determine the
    complexity of the algorithm assuming that you
    have n processors for an n-bit value
  • The following solution takes O(1) time (instead
    of O(n) time for a normal processor)

Only processors whose bit is 1 will write to
result, so if there are none, result stays 0;
otherwise the processor with the smallest ID
writes 1 to result
Processor 0 writes 0 to result
For each processor j (in parallel):
    read datum x
    extract bit j, storing it in location temp
    if temp == 1 then write 1 to result
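A sequential C sketch of this parallel OR; the loop over j stands in for the n processors all running at once, and the function and variable names are mine.

#include <stdio.h>

/* Sequential sketch of the O(1) parallel OR.  On the bit-slice machine
   all n processors run the body for their own bit j at the same time;
   concurrent writes of 1 are resolved by the lowest-numbered processor
   winning (the value written is the same either way).                   */
int parallel_or(unsigned x, int n) {
    int result = 0;                      /* processor 0 writes 0 to result    */
    for (int j = 0; j < n; j++) {        /* "for each processor j" (parallel) */
        int temp = (x >> j) & 1;         /* extract bit j into temp           */
        if (temp == 1)
            result = 1;                  /* write 1 to result                 */
    }
    return result;
}

int main(void) {
    printf("%d\n", parallel_or(0x0008u, 16));  /* 1: at least one bit is set */
    printf("%d\n", parallel_or(0x0000u, 16));  /* 0: no bit is set           */
    return 0;
}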
16
Vector-Array Example
  • An array stores n int values
  • using a vector processor with n/2 processing
    elements, show how we can find the largest value
    in O(log n) time

Let incr = 1
Each processor j operates on the following (until j is no longer needed):
    while (incr < n)
        if (a[j] < a[j + incr]) then a[j] = a[j + incr]
        Processor 0: incr = incr * 2
The answer is in a[0]
This is known as a tournament algorithm
With 16 elements, we find the largest in 4 iterations
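A sequential C sketch of this tournament reduction; on the vector machine each active j would execute its comparison in parallel, while here the inner loop simulates one round. The array contents and function names are illustrative.

#include <stdio.h>

/* Sequential sketch of the tournament (log n) maximum reduction.
   Each pass doubles incr; the surviving (larger) values migrate
   toward index 0, where the maximum ends up.                      */
int tournament_max(int a[], int n) {
    for (int incr = 1; incr < n; incr *= 2) {         /* processor 0 doubles incr  */
        for (int j = 0; j + incr < n; j += 2 * incr)  /* each active processor j   */
            if (a[j] < a[j + incr])
                a[j] = a[j + incr];                   /* keep the larger value     */
    }
    return a[0];                                      /* answer ends up in a[0]    */
}

int main(void) {
    int a[16] = {3, 9, 1, 14, 7, 2, 11, 5, 8, 13, 0, 6, 15, 4, 10, 12};
    printf("max = %d\n", tournament_max(a, 16));      /* 4 passes for 16 elements  */
    return 0;
}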
17
Development of Multiprocessors
  • Multiprocessor systems have been developed since
    the 1960s
  • but it wasn't until the 1980s, when processors
    and memories became more affordable, that systems
    could largely be built using off-the-shelf
    components attached to a single bus
  • Two flavors of multiprocessor systems
  • tightly coupled or shared memory
  • loosely coupled or distributed memory
  • this category is similar to a network of
    computers

18
Multiprocessor Problems
  • There are two inherent problems with
    multiprocessors
  • accessing remote memory is very time consuming
  • processes are often sequential in nature making
    them hard to parallelize
  • Examples
  • we have 100 processors from which we want to
    achieve an 80 times speedup over a uniprocessor
    on a given application; how much of the time can
    the application be executing sequentially?
  • 80 = 1 / ((1 - x) + x / 100); solving for x gives
    .9975, so our application must be running in
    parallel 99.75% of the time, or it can be
    sequential only 1 - 99.75% = 0.25% of the time!
  • a 32-processor system has a 200 ns remote access
    time, a 2 GHz clock and a computation CPI of 0.5;
    assuming all local memory accesses are hits, how
    much faster is the machine if all memory accesses
    are local as opposed to 0.2% of them being
    remote?
  • remote memory access time = 200 ns / 0.5 ns
    (clock cycle time) = 400 clock cycles
  • CPI with remote accesses = 0.5 + 0.2% * 400 = 1.3
  • CPI without remote accesses = 0.5
  • the application with no remote accesses is
    1.3 / 0.5 = 2.6 times faster!
  • (a short C sketch checking both examples follows
    this list)
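A minimal C check of the two calculations above, using the figures from the bullets (80x speedup target, 100 processors, 0.2% remote accesses costing 400 cycles each).

#include <stdio.h>

int main(void) {
    /* Example 1: Amdahl's law.  Speedup = 1 / ((1 - x) + x/p) where x is
       the parallel fraction.  Solve 80 = 1 / ((1 - x) + x/100) for x.    */
    double p = 100.0, target = 80.0;
    double x = (1.0 - 1.0 / target) / (1.0 - 1.0 / p);
    printf("parallel fraction x = %.4f\n", x);              /* ~0.9975   */

    /* Example 2: CPI when 0.2% of accesses are remote at 400 cycles each. */
    double base_cpi = 0.5;
    double remote_cycles = 200.0 / 0.5;                     /* 200 ns / 0.5 ns */
    double cpi_remote = base_cpi + 0.002 * remote_cycles;
    printf("CPI with remote accesses = %.1f\n", cpi_remote);        /* 1.3 */
    printf("speedup if all local = %.1f\n", cpi_remote / base_cpi); /* 2.6 */
    return 0;
}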

19
Symmetric Shared Memory Architecture
  • The typical form of a tightly coupled
    architecture is one where we expect memory access
    to be symmetric
  • that is, where we expect all memory accesses to
    take about the same amount of time
  • SSM architectures limit the number of processors
    because the shared memory becomes a bottleneck
  • this can be alleviated with memories distributed
    across chips and high order interleaving, and
    multiple buses (but that gets expensive)
  • One way to reduce the impact of the bottleneck is
    large, local multi-level caches
  • although caching seems an obvious way to reduce
    memory contention and the bottleneck, in addition
    to improving processor CPI, it comes with a cost
    in a shared memory system: cache coherence

20
Cache Coherence
  • A memory system is considered coherent if any
    read obtains the most recently written version of
    the datum
  • and two writes to the same datum are serialized
    so that all processors see the writes in the same
    order
  • Example
  • cache coherence is required so that this problem
    does not arise; how can we prevent it?
  • two solutions: directory-based protocols and
    snooping caches

Action                 Cache for    Cache for    Main memory
                       Processor 1  Processor 2  value of X
(initially)            ---          ---          0
Proc 1 reads X         0            ---          0
Proc 2 reads X         0            0            0
Proc 1 writes 1 to X   1            0            1
Proc 2 reads X         1            0            1
Processor 2 now has an obsolete value for X, but
since the value is in its local cache, the access
is a hit and it continues to read the wrong value!
(a toy illustration of this follows)
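A toy C illustration of the stale-copy problem above: each processor keeps a private copy of X and never re-reads memory, so P2 keeps returning 0 after P1 writes 1. The variable names are mine; no coherence mechanism is modeled.

#include <stdio.h>

int main(void) {
    int mem_X = 0;                /* main memory copy of X                    */
    int p1_cache_X, p2_cache_X;   /* private cached copies, no invalidation   */

    p1_cache_X = mem_X;           /* Proc 1 reads X -> caches 0               */
    p2_cache_X = mem_X;           /* Proc 2 reads X -> caches 0               */
    p1_cache_X = 1; mem_X = 1;    /* Proc 1 writes 1 (write-through to memory) */

    /* Proc 2's read hits in its cache and returns the stale 0 */
    printf("P1 reads X = %d, P2 reads X = %d (memory holds %d)\n",
           p1_cache_X, p2_cache_X, mem_X);
    return 0;
}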
21
Snooping Protocols
  • All caches monitor some centralized communication
    mechanism
  • typically a single bus that connects each cache
    to memory
  • Upon any write, the processor signals that the
    particular datum (denoted by memory address) has
    been modified
  • all other caches are snooping this line
    (listening for such a message)
  • upon receiving a write message, each cache checks
    to see if the updated value is stored locally
    (this value will now be invalid)
  • Two common versions of snooping are
  • write invalidate, where each cache marks the
    updated address as invalid so that a future read
    of it is treated as a cache miss
  • write update, where the updated datum itself is
    distributed so that all caches that have that
    datum can update their values immediately
  • note: in either case, it is easier to implement
    this if the cache write policy is write through
    rather than write back

22
Write Invalidate
  • Write update sounds like the better approach
  • no data will have to become obsolete
  • But in fact it is more expensive to implement in
    terms of bandwidth usage and so is less common
  • Write invalidate only costs in terms of increased
    miss rates
  • to ensure that the write can proceed, the cache
    must invalidate all other caches' copies of the
    datum prior to the write
  • the processor must first gain control of the bus
  • in case of a tie (two processors wanting to
    invalidate a datum simultaneously), the bus
    arbiter will cause both processors to wait and
    will randomly select between them
  • example

Action                 Cache for    Cache for    Main memory
                       Processor 1  Processor 2  value of X
(initially)            ---          ---          0
Proc 1 reads X         0            ---          0
Proc 2 reads X         0            0            0
Proc 1 writes 1 to X   1            ---          1
Proc 2 reads X         1            1*           1

* denotes a cache miss
23
MESI Protocol
  • The common implementation for a snoopy cache is
    to use the MESI Protocol
  • M (modified)
  • the data (or block) has been modified, thus other
    cached versions are invalidated; continued reads
    or writes to this block cause no further state
    changes
  • E (exclusive)
  • the datum is stored exclusively in this cache, so
    it can be written to without invalidation
    elsewhere
  • S (shared)
  • the datum cannot be written to while it is
    shared; a write first requires an invalidation
  • I (invalid)
  • same as a miss, the datum has been flagged as
    invalid
  • The table on the next slide (figure 4.5 page 213)
    demonstrates how the MESI protocol works
  • A variation of the MESI protocol is the MOESI
    protocol, where O stands for ownership: a given
    cache block can be shared, but the cache denoted
    as the owner is responsible for updating other
    processors when the block has been modified (a
    simplified state-transition sketch follows this
    list)
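Below is a minimal C sketch of MESI state transitions for a single cache line. It is a simplification of the protocol, not a transcription of the controller in figure 4.5; the event names and the other_copies_exist flag are assumptions made for illustration.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

/* Next-state sketch for one cache line in this cache, given a local
   processor event or a snooped bus event from another cache.            */
mesi_t mesi_next(mesi_t s, event_t e, int other_copies_exist) {
    switch (e) {
    case LOCAL_READ:   /* a miss (INVALID) fetches the block; hits keep state  */
        return (s == INVALID) ? (other_copies_exist ? SHARED : EXCLUSIVE) : s;
    case LOCAL_WRITE:  /* I/S must broadcast an invalidate first; E/M need not */
        return MODIFIED;
    case BUS_READ:     /* another cache reads: M writes back, M/E demote to S  */
        return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
    case BUS_WRITE:    /* another cache writes: our copy is invalidated        */
        return INVALID;
    }
    return s;
}

int main(void) {
    mesi_t s = mesi_next(INVALID, LOCAL_READ, 0);   /* no other copies -> E */
    s = mesi_next(s, LOCAL_WRITE, 0);               /* silent upgrade  -> M */
    s = mesi_next(s, BUS_READ, 1);                  /* snooped read    -> S */
    printf("final state = %d (SHARED = %d)\n", s, SHARED);
    return 0;
}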

24
(Figure only: MESI protocol state table, figure 4.5)
25
Implementing the MESI Protocol
26
Performance of SSM Multiprocessors
  • Overall cache performance is a combination of
  • uniprocessor cache miss rate
  • traffic caused by communication (which includes
    cache invalidations)
  • these are complicated by
  • true sharing misses, caused by invalidations
  • false sharing misses, caused when a block is
    invalidated but the requested word in that block
    has not been modified
  • Changing cache size, block size, number of
    processors can impact the performance in complex
    ways
  • here, we consider an AlphaServer
  • 4 Alpha 21164 processors with a 300 MHz clock
  • 3 levels of cache (1st level is a write-through
    cache to permit easier snooping, 2nd and 3rd
    level caches are write-back), latencies are 7
    cycles, 21 cycles and 80 cycles for misses at
    each level respectively
  • cache to cache transfer of shared data takes 125
    cycles
  • the % of time the CPU is idle because of cache
    misses is < 1%
  • when run on 3 server benchmark programs (online
    transaction-processing, decision support system,
    web index search)

27
Benchmark Performance
28
Distributed Shared Memory
  • A 16 processor multiprocessor with 64 byte block
    sizes and 512 KB data caches can require as much
    as 170 GB/sec of bus bandwidth!
  • this is obviously a huge problem where modern
    processors might have a memory/bus bandwidth
    capable of supporting 12 GB/sec
  • In order to get around this problem, we must use
    a distributed memory layout
  • this will lower the local bandwidth to an
    acceptable amount because most of the traffic
    will be local between a processor and its own
    memory
  • But a distributed memory makes snooping much more
    difficult
  • all changes to data would have to be broadcast
    over the interconnection network and all caches
    would have to snoop there
  • instead, we turn to a different cache coherence
    mechanism: the directory-based approach
  • A directory will keep track of every block that
    might be in a cache
  • what blocks are in which caches, whether the
    block has been modified or not, and other useful
    information

29
Implementing Directory-Based Caches
  • We store every memory block as an entry in the
    directory
  • this method will work for a reasonably large
    number of processors (e.g., 200 or fewer),
    although we should expect a decent amount of
    overhead for the directory
  • for extremely large systems with enormous
    memories, we might need better data structures
    (such as storing fewer bits per block entry)
  • the directory is actually distributed or
    interleaved so that it does not become a
    bottleneck
  • a directory on a single site would quickly become
    a bottleneck
  • as an example, a portion of the directory might
    be placed with every processor's local memory
    (see figure 4.19)
  • the directory will have to handle two operations
  • handling a read miss
  • handling a write to shared data
  • and store the status of every block
  • shared, uncached, modified
  • communication between processors and between
    instances of the directory is performed by
    message passing

30
P = requesting processor, A = requested address,
D = data contents
"Local" means the local cache, "Remote" means a
remote cache, "Home" means the home directory
31
Implementation Details
  • When a block is currently uncached
  • the copy in memory is the current value
  • read miss: the requesting processor is sent the
    block from memory and the requestor is made the
    only sharing node
  • write miss: the requesting processor is sent the
    datum and made the owner; the block is made
    exclusive, though the sharers list still records
    the requestor
  • When a block is shared
  • the memory value is up to date but there is at
    least one cached copy, maybe more
  • read miss: the requesting processor is sent the
    block from memory and the processor is added to
    the sharing list
  • write miss: the requesting processor is sent the
    block from memory and added to the sharers list,
    all other sharers are sent invalidation messages,
    and the state of the block is made exclusive

32
Continued
  • When a block is exclusive
  • the copy held by the owner (the cache that made
    the block exclusive) is the up-to-date value
  • read miss: the owner transmits its copy to the
    requestor, the directory adds the requestor to
    the sharers list, and the owner also updates
    memory at the same time, changing the status from
    exclusive to shared
  • data write back: updates the memory copy, the
    home directory becomes the owner again, the block
    is now uncached and the sharers list is set to
    empty
  • write miss: block ownership is given to the
    requestor, a message is sent to the old owner to
    invalidate the block and send its value to the
    requestor, which updates the value; sharers is
    set to the new owner and the block remains
    exclusive
  • See figures 4.21 and 4.22 for more details (a
    sketch of the directory's miss handling follows
    this list)
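A minimal C sketch of the directory's reaction to read and write misses, simplifying the protocol of figures 4.21 and 4.22; the type names, the bit-vector sharers field, and the message handling left in comments are assumptions made for illustration.

#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;     /* bit i set => processor i has a copy */
} dir_entry_t;

/* Directory reaction to a read miss for one block; data messages to the
   owner or to memory are implied by the comments.                        */
void dir_read_miss(dir_entry_t *d, int requester) {
    if (d->state == EXCLUSIVE_ST)
        d->state = SHARED_ST;            /* owner forwards data and writes back */
    else if (d->state == UNCACHED)
        d->state = SHARED_ST;            /* memory supplies the block           */
    d->sharers |= (1ULL << requester);   /* add requester to the sharing list   */
}

/* Directory reaction to a write miss: invalidate all other copies and make
   the requester the sole (exclusive) owner.                                */
void dir_write_miss(dir_entry_t *d, int requester) {
    d->state   = EXCLUSIVE_ST;
    d->sharers = (1ULL << requester);
}

int main(void) {
    dir_entry_t block = { UNCACHED, 0 };
    dir_read_miss(&block, 2);            /* P2 reads: block becomes shared      */
    dir_read_miss(&block, 5);            /* P5 reads: added to sharers          */
    dir_write_miss(&block, 5);           /* P5 writes: exclusive, sharers = P5  */
    printf("state=%d sharers=0x%llx\n", block.state,
           (unsigned long long)block.sharers);
    return 0;
}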

33
Sun T1 Server
  • We wrap up this chapter with a brief look at the
    Sun T1 multiprocessor server
  • T1 uses an 8-core processor (1.2 GHz)
  • that is, there are 8 pipelines
  • the T1 uses single-issue pipelines, where each
    pipeline performs fine-grained multithreading on
    up to 4 threads
  • the limitation of 4 threads per core allows us to
    support multithreading directly in hardware by
    using 4 PCs per pipeline
  • T1 pipeline is 6 stage
  • this is the same as the MIPS 5 stage pipeline
    with 1 added stage to perform thread switching
  • branches and loads each have a 3 cycle penalty,
    which can be hidden if the other 3 threads are
    active (not idle or stalled)
  • as this is a server, FP operations are not
    emphasized and therefore all 8 pipelines share
    the same floating point unit
  • cache coherency is enforced using a directory
    approach where directories are distributed across
    the L2 caches, keeping track of which L1 caches
    have copies of data that are stored in their L2
    cache

34
The T1 Architecture
The crossbar is an interconnection network Each
core has its own L1 cache Notice that there
are only 4 L2 caches (instead of 8) The
single FP unit is used by all cores as needed
35
T1 Performance
  • A 4-thread core hides some of the latency that
    arises from limited ILP
  • this manifests itself as a lower effective L1
    cache miss penalty (latency)
  • figure 4.26 shows that a single thread will see
    1.1 to 1.2 times the latency from cache misses
    compared to the 4-thread core
  • Similarly, larger L2 caches with bigger blocks
    hide latency
  • the increased size should have an obvious result,
    but the impact is not as much as one would expect
  • compare for instance the decrease from a 3 MB L2
    cache with 32 Byte blocks versus a 6 MB L2 cache
    with 32 Byte blocks
  • increasing the block size causes additional
    message traffic in the interconnection network
    resulting in larger latencies
  • so caches with smaller block sizes (fewer words
    per block) are preferred
  • figure 4.28 shows these impacts
  • For a 4 thread core, the ideal per-thread CPI is
    4
  • the T1 averages between 5.6 and 6.6
  • but the per core CPI (ideal is 1) ranges from 1.4
    to 1.8
  • For all 8 cores, the effective CPI is between
    .175 and .225
  • compare this to multi-issue processors that might
    have CPIs of .3 or .4
  • so while the per-thread performance is not
    impressive, the overall throughput and effective
    CPI of the entire processor are (the arithmetic
    is sketched below)
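A few lines of C showing the chip-level CPI arithmetic from the bullets above: with 8 single-issue cores retiring instructions in parallel, the effective CPI is the per-core CPI divided by 8. The per-core range used is the one quoted above.

#include <stdio.h>

int main(void) {
    double per_core[2] = {1.4, 1.8};      /* per-core CPI range (ideal is 1)   */
    for (int i = 0; i < 2; i++)
        printf("per-core CPI %.1f -> effective chip CPI %.3f\n",
               per_core[i], per_core[i] / 8.0);   /* 0.175 and 0.225           */
    return 0;
}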

36
Comparison
  • The T1 as just described is compared to the
  • AMD Opteron (2 cores, 3 instruction issue per
    cycle, 2.4 GHz, does not support multithreading)
  • Intel Pentium D (2 cores, 3 instruction issue per
    cycle, 3.2 GHz, supports SMT)
  • IBM Power 5 (2 cores, 4 instruction issue per
    cycle, 1.9 GHz, supports SMT)
  • we don't compare the T1 on FP benchmarks because,
    being a server chip, its only FP hardware is
    shared and therefore FP benchmarks would perform
    poorly
  • Details are given in figure 4.33 where the T1
    clearly outperforms the other processors in all
    non-FP benchmarks
  • except for the Spec integer benchmarks where it
    is marginally better than the Power5 and Opteron
  • the moral of this story appears to be that proper
    support for fine-grained multithreading, coupled
    with multiple non-FP threads, provides better
    performance than superscalar pipelines

37
Sample Problem 1
  • How many processors does it take for a processor
    array to find the maximum value of an array of n
    values in O(1) time?
  • Solution: n^2, as follows
  • Denote each processor as p(i,j), where processor
    p(i,j) is assigned the value a[i] (so that n
    processors hold each array value)
  • each processor p(i,j) compares the two array
    values a[i] and a[j]
  • if a[i] < a[j] then write 1 to array location
    b[i], else write 1 to b[j]
  • there may be multiple writes, in which case the
    processor with the lowest ID writes to b[i] (or
    b[j]); but since we only ever write 1, a 1 will
    simply be written to some elements of b
  • Using n of the original processors, assign a
    processor to each element of b
  • if b[i] == 0 then write i to datum k
  • a[k] will be the maximum item
  • can you figure out why?
  • This algorithm executes in parallel, taking a
    total of two loads, two comparisons and two
    writes no matter the size of n, so it is O(1) (a
    sequential C sketch follows)
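A sequential C sketch of this O(1) algorithm; the nested loops stand in for the n^2 processors all comparing at once, and the concurrent-write tie is resolved differently than on the real machine (last writer instead of lowest ID), which does not change the returned maximum. The function name and the n <= 64 bound are assumptions.

#include <stdio.h>

/* Sequential sketch of the O(1) maximum using n*n (virtual) processors.
   Processor (i, j) marks a[i] as "not the maximum" in b[] if it loses to
   a[j]; an element left unmarked in b[] is the maximum.                   */
int parallel_max(const int a[], int n) {
    int b[64] = {0};                    /* assumes n <= 64 for this sketch */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (a[i] < a[j])
                b[i] = 1;               /* a[i] lost, so it is not the max */
    int k = 0;
    for (int i = 0; i < n; i++)         /* "processor i": if b[i] == 0 ... */
        if (b[i] == 0)
            k = i;                      /* ... write i to k                */
    return a[k];
}

int main(void) {
    int a[] = {12, 99, 7, 45, 3, 88, 91, 23};
    printf("max = %d\n", parallel_max(a, 8));   /* prints 99 */
    return 0;
}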

38
Sample Problem 2
  • Assume the memory setup as shown to the right
  • Determine the resulting state of cache and memory
    of each operation using write invalidate
  • P0 read 120
  • P0 B0 (S, 120, 00, 20)
  • reads 20
  • P0 write 120 ← 80
  • P0 B0 modified to (M, 120, 00, 80)
  • "120 ← 80" is broadcast over the bus
  • P15 invalidates B0 (I, 120, 00, 20)

39
Continued
  • P15 write 120 ← 80
  • P15 B0 (M, 120, 00, 80)
  • P0 is unchanged since it is the same value as in
    P0 B0
  • P1 read 110
  • P0 B2 (S, 110, 00, 30)
  • P1 B2 (S, 110, 00, 30)
  • M110 (00, 30), read (returns 30)
  • P0 write 108 ← 48
  • P0 B1 (M, 108, 00, 48)
  • P15 B1 becomes invalid (I, 108, 00, 08)
  • P0 write 130 ← 78
  • P0 B2 (M, 130, 00, 78)
  • M110 (00, 30)
  • P15 write 130 ← 78
  • P0 B2 (M, 130, 00, 78)

40
Sample Problem 3
  • We use the figure from the previous example as
    our memory/cache layout
  • use MESI protocol where memory access takes 100
    cycles, remote cache access takes 70 cycles,
    invalidate takes 15 cycles and write back takes
    10 cycles

1) P0 read 100          read miss, satisfied in memory
   P0 write 100 ← 40    send out invalidate signal
   100 + 15 = 115 cycles
2) P0 read 100          read miss (100 invalid in P0), satisfied in memory
   P0 read 120          read miss, satisfied in memory
   100 + 100 = 200 cycles
3) P0 read 100          read miss, satisfied in memory
   P1 write 100 ← 60    write miss, satisfied in memory
   100 + 100 = 200 cycles
4) P0 read 100          read miss, satisfied in memory
   P0 write 100 ← 60    write hit, send out invalidate
   P1 write 100 ← 40    write miss, satisfied by P0's cache, write back
   100 + 15 + 70 + 10 = 195 cycles
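The cycle counts above can be reproduced with a trivial C sketch using the stated costs (memory access 100, remote cache access 70, invalidate 15, write back 10).

#include <stdio.h>

int main(void) {
    const int MEM = 100, REMOTE = 70, INV = 15, WB = 10;
    printf("1) %d cycles\n", MEM + INV);                /* 115 */
    printf("2) %d cycles\n", MEM + MEM);                /* 200 */
    printf("3) %d cycles\n", MEM + MEM);                /* 200 */
    printf("4) %d cycles\n", MEM + INV + REMOTE + WB);  /* 195 */
    return 0;
}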