A Scalable Approach to Thread-Level Speculation

1
A Scalable Approach to Thread-Level Speculation
  • J. Gregory Steffan,
  • Christopher B. Colohan,
  • Antonia Zhai, and
  • Todd C. Mowry
  • Carnegie Mellon University

2
Outline
  • Motivation
  • Thread-level speculation (TLS)
  • Coherence scheme
  • Optimizations
  • Methodology
  • Results
  • Conclusion

3
Motivation
  • Leading chip manufacturers are moving to multi-core
    architectures
  • Multiple cores are usually used to increase throughput
  • To exploit these parallel resources for higher
    performance, programs need to be parallelized
  • Integer programs are hard to parallelize
  • Solution: use speculation, i.e., thread-level speculation (TLS)

4
Thread-level speculation (TLS)

5
Scalable Approach
  • The paper aims to design a scalable approach
    that applies to a wide variety of
    multiprocessor-like architectures
  • The only requirement is that the architecture
    be shared-memory based
  • TLS is implemented on top of an
    invalidation-based cache coherence protocol

6
Example
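  • Each cache line has two special bits
  • SL: a speculative load has accessed the line
  • SM: the line has been speculatively modified
  • A thread is squashed when an invalidation arrives
    and all of the following hold:
  • The line is present in the cache
  • SL is set
  • The epoch number in the invalidation indicates a
    logically earlier thread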
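The squash condition on this slide can be written as a single predicate. The following is a minimal sketch, not the hardware logic from the paper: the `CacheLine` class and `is_violation` function are hypothetical names, and epochs are assumed to be plain integers where a smaller value means a logically earlier thread.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    present: bool = False  # the line is currently held in the cache
    sl: bool = False       # SL: a speculative load has accessed the line
    sm: bool = False       # SM: the line has been speculatively modified


def is_violation(line: CacheLine, sender_epoch: int, my_epoch: int) -> bool:
    """Return True when an incoming invalidation should squash this thread.

    Assumption: epochs are integers and a smaller value is logically earlier.
    """
    return line.present and line.sl and sender_epoch < my_epoch
```

An invalidation from a logically later thread, or one that hits a line that was never speculatively loaded, does not trigger a squash in this model.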

7
Speculation level
  • We are concerned only with the speculation level:
    the level in the cache hierarchy where the
    coherence protocol begins
  • All other levels can be ignored

8
Cache line states
  • Apart from the usual cache state bits, each line
    needs SL and SM bits
  • A cache line with speculative bits set cannot be
    replaced
  • If such a line has to be replaced, the thread is
    either squashed or the operation is delayed

9
Basic cache coherence protocol
  • When a processor wants to load a value, it needs
    at least shared access to the line
  • When it wants to write, it needs exclusive access
  • The coherence mechanism issues an invalidation
    message when it receives a request for exclusive
    access
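The three rules above can be illustrated with a toy model of one cache line. This is only a sketch under simplifying assumptions: `LineDirectory`, `load`, and `store` are hypothetical names, processors are identified by integers, and message latency and data transfer are ignored.

```python
class LineDirectory:
    """Tracks which processors hold a valid copy of a single cache line."""

    def __init__(self):
        self.holders = set()    # processors with a valid copy
        self.exclusive = False  # True when the single holder has exclusive access

    def load(self, cpu):
        # A load needs at least shared access to the line.
        self.holders.add(cpu)
        if len(self.holders) > 1:
            self.exclusive = False

    def store(self, cpu):
        # A store needs exclusive access, so every other copy is invalidated.
        invalidated = sorted(self.holders - {cpu})
        self.holders = {cpu}
        self.exclusive = True
        return invalidated  # processors that receive an invalidation message
```

In the TLS scheme these invalidation messages are what a speculative thread inspects to detect a dependence violation.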

10
Coherence mechanism

11
Commit
  • Once the homefree token arrives, no further
    squashes are possible
  • SpE is changed to E and SpS to S
  • Lines with the SM bit set must have the D (dirty)
    bit set
  • If a line is speculatively modified and shared,
    exclusive access has to be obtained for that line
  • The ownership required buffer (ORB) is used to
    track such lines
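The commit steps can be sketched as a walk over the cache. This is an illustrative model only: the `Line` class, the string state names, and the `commit` function are assumptions made for the example, and the exclusive-access request is reduced to a state change.

```python
from dataclasses import dataclass


@dataclass
class Line:
    state: str           # one of "I", "S", "E", "SpS", "SpE"
    sl: bool = False     # speculatively loaded
    sm: bool = False     # speculatively modified
    dirty: bool = False  # D bit


def commit(cache: dict, orb: set) -> list:
    """Commit an epoch that holds the homefree token.

    Returns the addresses for which exclusive access had to be requested.
    """
    upgraded = []
    # The ORB lists lines that were speculatively modified while only shared.
    for addr in sorted(orb):
        line = cache[addr]
        if line.sm and line.state == "SpS":
            line.state = "SpE"  # stands in for an exclusive-access request
            upgraded.append(addr)
    orb.clear()
    for line in cache.values():
        if line.state == "SpE":
            line.state = "E"
        elif line.state == "SpS":
            line.state = "S"
        if line.sm:
            line.dirty = True  # speculative writes become ordinary dirty data
        line.sl = line.sm = False
    return upgraded
```

Because the ORB already names the lines that need an upgrade, this sketch does not have to scan the whole cache to find them.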

12
Squash
  • All speculatively modified lines must be
    invalidated
  • For the remaining lines, SpE is changed to E and
    SpS to S
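A squash can be sketched in the same style as a pass over the cache. The `Line` class, the string state names, and the `squash` function are hypothetical and serve only to illustrate the two rules on this slide.

```python
from dataclasses import dataclass


@dataclass
class Line:
    state: str        # one of "I", "S", "E", "SpS", "SpE"
    sl: bool = False  # speculatively loaded
    sm: bool = False  # speculatively modified


def squash(cache: dict) -> None:
    """Discard the speculative state of a squashed epoch."""
    for line in cache.values():
        if line.sm:
            line.state = "I"  # speculative writes must not become visible
        elif line.state == "SpE":
            line.state = "E"
        elif line.state == "SpS":
            line.state = "S"
        line.sl = line.sm = False
```

Lines that were only speculatively loaded keep their data, since nothing in them was changed by the squashed epoch.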

13
Performance Optimizations
  • Forwarding data between epochs: predictable data
    dependences are synchronized
  • Dirty and speculatively loaded state: usually, a
    dirty line that is speculatively loaded is
    flushed; this can be avoided
  • Suspending violations: when a speculative line
    has to be evicted, the epoch can be suspended
    rather than squashed

14
Multiple writers
  • If two epochs write to the same line, one of them
    has to be squashed to avoid the multiple-writer
    problem
  • This can be avoided by maintaining fine-grained
    disambiguation bits
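One way to picture fine-grained disambiguation is a per-word modified mask that lets two versions of the same line be combined instead of squashing one writer. The sketch below is an assumption-laden illustration, not the paper's hardware design: `merge_line` is a hypothetical helper, epochs are integers, and when two epochs wrote the same word the logically later one is assumed to win.

```python
def merge_line(base, versions):
    """Merge speculative versions of one cache line.

    base:     list of words currently in memory
    versions: list of (epoch, words, modified_mask) tuples, in any order
    """
    merged = list(base)
    # Apply versions in epoch order so that later epochs overwrite earlier ones.
    for _, words, mask in sorted(versions, key=lambda v: v[0]):
        for i, modified in enumerate(mask):
            if modified:
                merged[i] = words[i]
    return merged
```

With one bit per line instead of one bit per word, the same two writers would be indistinguishable and one of them would have to be squashed.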

15
Implementation

16
Epoch numbers
  • An epoch number has two parts: a TID and a
    sequence number
  • To avoid costly comparisons on every access,
    the difference is precomputed and a
    logically-later mask is formed
  • Epoch numbers are maintained in one place per
    chip
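The idea of precomputing comparisons can be sketched as follows. Everything here is an assumption made for illustration: the `Epoch` tuple, the rule that only epochs with the same TID are ordered, the absence of sequence-number wraparound, and the mask layout with one bit per speculative context.

```python
from typing import NamedTuple, Optional, Sequence


class Epoch(NamedTuple):
    tid: int  # thread identifier
    seq: int  # sequence number within that thread


def is_later(a: Epoch, b: Epoch) -> bool:
    """True if a is logically later than b (assumed: same TID, larger seq)."""
    return a.tid == b.tid and a.seq > b.seq


def logically_later_mask(local: Epoch,
                         contexts: Sequence[Optional[Epoch]]) -> int:
    """Precompute one bit per context: set if that context's epoch is later."""
    mask = 0
    for i, other in enumerate(contexts):
        if other is not None and is_later(other, local):
            mask |= 1 << i
    return mask
```

Once the mask exists, an access only has to test a single bit instead of comparing two full epoch numbers.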

17
Speculative state implementation

18
Multiple writers - implementation
  • False violations are also handled in the same way

19
Correctness considerations
  • Speculation fails if the speculative state is
    lost
  • Exceptions are handled only once the homefree
    token has been received
  • System calls are likewise postponed until then

20
Methodology
  • A detailed out-of-order simulation based on the
    MIPS R10000 is used
  • Fork and other synchronization overhead is 10
    cycles

21
Results
  • Normalized execution cycles

22
Results
  • For buk and equake, memory performance is a
    bottleneck
  • With more than 4 processors, ijpeg performance
    degrades for two reasons:
  • Too few threads are available
  • Some conflicts occur in the cache

23
Overheads
  • Violations
  • Cache locality is important
  • ORB size can be further reduced through early
    release of the ORB

24
Communication overhead
  • Buk is insensitive to communication latency

25
Multiprocessor performance
  • Advantage: more cache storage
  • Disadvantage: increased communication latency

26
Conclusion
  • With TLS, even integer programs can be
    parallelized to obtain a speedup
  • The approach is scalable and can be applied to
    various other architectures that support
    multiple threads
  • Some applications are insensitive to
    communication latency, so large-scale parallel
    architectures using TLS are possible

27
Thanks!