Title: Redundant Multithreading Techniques for Transient Fault Detection
Slide 1: Redundant Multithreading Techniques for Transient Fault Detection
- Shubu Mukherjee (Intel)
- Michael Kontz (HP)
- Steve Reinhardt (Intel Consultant, U. of Michigan)
Versions of this work have been presented at ISCA 2000 and ISCA 2002.
Slide 2: Transient Faults from Cosmic Rays & Alpha Particles
- Decreasing feature size
- Decreasing voltage (exponential dependence?)
- Increasing number of transistors (Moore's Law)
- Increasing system size (number of processors)
- No practical absorbent for cosmic rays
Slide 3: Fault Detection via Lockstepping (HP Himalaya)
- Replicated microprocessors with cycle-by-cycle lockstepping
Slide 4: Fault Detection via Simultaneous Multithreading
[Figure: two redundant threads each execute R1 ← (R2); input replication feeds both copies, output comparison checks their results]
- Memory covered by ECC; RAID array covered by parity; ServerNet covered by CRC
Slide 5: Simultaneous Multithreading (SMT)
- Examples: Alpha 21464, Intel Northwood
Slide 6: Redundant Multithreading (RMT)
RMT = Multithreading + Fault Detection

  Base machine               | Multithreading (MT)               | Redundant Multithreading (RMT)
  Multithreaded uniprocessor | Simultaneous Multithreading (SMT) | Simultaneous & Redundant Threading (SRT)
  Chip multiprocessor (CMP)  | Multiple threads running on CMP   | Chip-Level Redundant Threading (CRT)
Slide 7: Outline
- SRT concepts & design
- Preferential Space Redundancy
- SRT performance analysis
  - Single- & multi-threaded workloads
- Chip-Level Redundant Threading (CRT)
  - Concept
  - Performance analysis
- Summary
- Current & Future Work
Slide 8: Overview
- SRT = SMT + Fault Detection
- Advantages
  - Piggybacks on an SMT processor with little extra hardware
  - Better performance than complete replication
  - Lower cost due to market volume of SMT & SRT
- Challenges
  - Lockstepping very difficult with SRT
  - Must carefully fetch/schedule instructions from redundant threads
Slide 9: Sphere of Replication
[Figure: leading and trailing threads inside the sphere of replication; input replication on the way in, output comparison on the way out; the memory system (incl. L1 caches) lies outside the sphere]
- Two copies of each architecturally visible thread
- Co-scheduled on an SMT core
- Compare results; signal fault if they differ
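The sphere-of-replication idea can be sketched as a small software analogy (hypothetical names; the real mechanism is hardware at the store/load boundary, not a function call):

```python
def run_redundant(program, inputs):
    """Run two copies of a thread on replicated inputs and compare
    their outputs at the sphere-of-replication boundary (a sketch)."""
    leading_out = program(inputs)    # leading thread
    trailing_out = program(inputs)   # trailing thread; input replication
                                     # guarantees both see identical inputs
    # Output comparison: any value leaving the sphere must match.
    if leading_out != trailing_out:
        raise RuntimeError("transient fault detected")
    return leading_out               # only checked values exit the sphere

# Fault-free run: both copies agree, one checked result leaves the sphere.
result = run_redundant(lambda xs: sum(xs), [1, 2, 3])
```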
Slide 10: Basic Pipeline
Fetch → Decode → Dispatch → Execute → Data Cache → Commit
Slide 11: Load Value Queue (LVQ)
- Keeps threads on the same path despite I/O or MP writes
- Out-of-order load issue possible
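A minimal software model of the LVQ (a sketch with hypothetical names, not the 21464 implementation): the leading thread performs the real load and forwards the address/value pair; the trailing thread consumes the forwarded value instead of re-reading memory, so an intervening write cannot make the threads diverge.

```python
from collections import deque

class LoadValueQueue:
    """Sketch of the LVQ: forwards (address, value) pairs from the
    leading thread's loads to the trailing thread."""
    def __init__(self):
        self.q = deque()

    def leading_load(self, memory, addr):
        value = memory[addr]          # only the leading thread touches memory
        self.q.append((addr, value))  # forward the pair through the LVQ
        return value

    def trailing_load(self, addr):
        lead_addr, value = self.q.popleft()
        assert lead_addr == addr      # address mismatch would signal divergence
        return value                  # same value, even if memory changed

memory = {0x120: 5}
lvq = LoadValueQueue()
a = lvq.leading_load(memory, 0x120)
memory[0x120] = 7                     # I/O or MP write between the two loads
b = lvq.trailing_load(0x120)          # trailing thread still sees 5
```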
Slide 12: Store Queue Comparator (STQ)
- Compares the two threads' store outputs before they reach the data cache
- Catches faults before they propagate to the rest of the system
Slide 13: Store Queue Comparator (contd)
[Figure: leading-thread store queue entries (e.g. st 5 → 0x120) matched against the corresponding trailing-thread stores; address and data are compared before the store goes to the data cache]
- Extends residence time of leading-thread stores
  - Size constrained by cycle-time goal
- Base CPU statically partitions a single queue among threads
  - Potential solution: per-thread store queues
- Deadlock if a matching trailing store cannot commit
  - Several small but crucial changes to avoid this
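The comparator's matching rule can be sketched in software (hypothetical names; the hardware compares entries in the store queue itself): a leading store waits until its trailing counterpart arrives, and only a matching address/data pair is released to the cache.

```python
from collections import deque

class StoreComparator:
    """Sketch of the store-queue comparator: leading-thread stores wait
    in the queue; each is checked against the matching trailing-thread
    store before anything reaches the data cache."""
    def __init__(self):
        self.pending = deque()   # leading stores awaiting their match
        self.cache = {}          # stands in for the data cache

    def leading_store(self, addr, data):
        self.pending.append((addr, data))   # extends residence time

    def trailing_store(self, addr, data):
        lead_addr, lead_data = self.pending.popleft()
        if (lead_addr, lead_data) != (addr, data):
            raise RuntimeError("fault caught before leaving the sphere")
        self.cache[addr] = data             # only matched stores commit

sq = StoreComparator()
sq.leading_store(0x120, 5)
sq.trailing_store(0x120, 5)   # match: the store reaches the cache
```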
Slide 14: Branch Outcome Queue (BOQ)
- Forwards leading-thread branch targets to trailing-thread fetch
- 100% prediction accuracy in the absence of faults
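A sketch of why the BOQ gives perfect "prediction" (hypothetical names; in hardware the queue feeds the trailing thread's fetch stage): the leading thread's resolved outcome simply becomes the trailing thread's next fetch address.

```python
from collections import deque

boq = deque()   # branch outcome queue: leading thread -> trailing fetch

def leading_branch(taken, target, fallthrough):
    """Leading thread resolves the branch and forwards the outcome."""
    next_pc = target if taken else fallthrough
    boq.append(next_pc)
    return next_pc

def trailing_fetch():
    """Trailing fetch uses the forwarded outcome as its 'prediction' --
    100% accurate as long as no fault corrupted the leading thread."""
    return boq.popleft()

lead_pc = leading_branch(taken=True, target=0x280, fallthrough=0x204)
trail_pc = trailing_fetch()
```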
Slide 15: Line Prediction Queue (LPQ)
- Alpha 21464 fetches chunks using line predictions
- Chunk = contiguous block of 8 instructions
Slide 16: Line Prediction Queue (contd)
- Generate a stream of chunked line predictions
  - Every leading-thread instruction carries its I-cache coordinates
  - Commit logic merges these into fetch chunks for the LPQ
  - Independent of leading-thread fetch chunks
- Commit-to-fetch dependence raised deadlock issues
[Figure: example chunk — 1F8 add | 1FC load R1←(R2) | 200 beq 280 | 204 and | 208 bne 200 | 200 add]
Slide 17: Line Prediction Queue (contd)
- Read-out on trailing-thread fetch is also complex
  - Base CPU thread chooser gets multiple line predictions, ignores all but one
  - Fetches must be retried on an I-cache miss
- Tricky to keep the queue in sync with thread progress
  - Add handshake to advance the queue head
  - Roll back head on I-cache miss
  - Track both last attempted & last successful chunks
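The head-pointer handshake described above can be sketched as follows (hypothetical names; a software model of the retry/rollback behavior, not the actual fetch logic):

```python
class LinePredictionQueue:
    """Sketch of the LPQ read-out handshake: the head advances only when
    the trailing thread's fetch succeeds, and effectively rolls back on
    an I-cache miss so the same chunk is retried."""
    def __init__(self, chunks):
        self.chunks = chunks
        self.last_attempted = -1
        self.last_successful = -1

    def next_prediction(self):
        # Always retry from just past the last *successful* chunk.
        self.last_attempted = self.last_successful + 1
        return self.chunks[self.last_attempted]

    def fetch_done(self, hit):
        if hit:
            self.last_successful = self.last_attempted  # advance head
        # on a miss, last_successful is unchanged: the head rolls back

lpq = LinePredictionQueue(["0x1F8", "0x200"])
first = lpq.next_prediction()   # first chunk
lpq.fetch_done(hit=False)       # I-cache miss: must retry
retry = lpq.next_prediction()   # same chunk again
lpq.fetch_done(hit=True)        # handshake: advance the head
second = lpq.next_prediction()  # now the next chunk
```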
Slide 18: Outline (repeated; next section: Preferential Space Redundancy)
Slide 19: Preferential Space Redundancy
- SRT combines two types of redundancy
  - Time: same physical resource, different time
  - Space: different physical resource
- Space redundancy preferable
  - Better coverage of permanent/long-duration faults
- Bias towards space redundancy where possible
Slide 20: PSR Example: Clustered Execution
[Figure: pipeline (Fetch → Decode → Dispatch → Commit) with the LPQ and two execution clusters, each with its own instruction queue (IQ 0/Exec 0, IQ 1/Exec 1); "add r1,r2,r3" is steered into one cluster]
- Base CPU has two execution clusters
  - Separate instruction queues, function units
  - Steered in dispatch stage
Slide 21: PSR Example: Clustered Execution (contd)
[Figure: leading-thread "add r1,r2,r3" instructions tagged with cluster bit 0 as they flow through the pipeline and LPQ]
- Leading-thread instructions record their cluster
  - Bit carried with fetch chunk through LPQ
  - Attached to trailing-thread instruction
- Dispatch sends to opposite cluster if possible
Slide 22: PSR Example: Clustered Execution (contd)
[Figure: trailing-thread "add r1,r2,r3" instructions, carrying cluster bit 0, steered to the opposite cluster]
- 99.94% of instruction pairs use different clusters
  - Full spatial redundancy for execution
  - No performance impact (occasional slight gain)
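The steering bias can be sketched as a tiny dispatch rule (hypothetical interface; the real hardware decides per slot in the dispatch stage): prefer the cluster opposite the leading copy's, falling back to the same cluster (time redundancy only) when the preferred one is full.

```python
def steer(trailing_instrs, free_slots):
    """Preferential space redundancy (sketch): each trailing instruction
    carries the cluster bit its leading copy used; dispatch sends it to
    the opposite cluster when a slot is free there."""
    assignments = []
    for instr, leading_cluster in trailing_instrs:
        preferred = 1 - leading_cluster            # bias toward space redundancy
        chosen = preferred if free_slots[preferred] > 0 else leading_cluster
        free_slots[chosen] -= 1
        assignments.append((instr, chosen))
    return assignments

# Leading copies of both instructions ran on cluster 0, so both prefer
# cluster 1; only one slot is free there, so the second falls back.
out = steer([("add", 0), ("sub", 0)], free_slots={0: 1, 1: 1})
```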
Slide 23: Outline (repeated; next section: SRT Performance Analysis)
Slide 24: SRT Evaluation
- Used SPEC CPU95, 15M instrs/thread
  - Constrained by simulation environment
  - ≈ 120M instrs for 4 redundant thread pairs
- Eight-issue, four-context SMT CPU
  - 128-entry instruction queue
  - 64-entry load and store queues
    - Default: statically partitioned among active threads
  - 22-stage pipeline
  - 64KB 2-way assoc. L1 caches
  - 3MB 8-way assoc. L2
Slide 25: SRT Performance: One Thread
- One logical thread → two hardware contexts
- Performance degradation ≈ 30%
- Per-thread store queue buys an extra 4%
Slide 26: SRT Performance: Two Threads
- Two logical threads → four hardware contexts
- Average slowdown increases to 40%
- Only 32% with per-thread store queues
Slide 27: Outline (repeated; next section: Chip-Level Redundant Threading)
Slide 28: Chip-Level Redundant Threading
- SRT typically more efficient than splitting one processor into two half-size CPUs
- What if you already have two CPUs?
  - IBM Power4, HP PA-8800 (Mako)
  - Conceptually easy to run these in lockstep
  - Benefit: full physical redundancy
  - Costs
    - Latency through centralized checker logic
    - Overheads (misspeculation etc.) incurred twice
- CRT combines best of SRT & lockstepping
  - Requires multithreaded CMP cores
Slide 29: Chip-Level Redundant Threading (contd)
[Figure: CPU A and CPU B cross-coupled; each core runs the leading thread of one program and the trailing thread of the other, with LVQ, LPQ, and store comparisons crossing between the cores]
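The cross-coupled arrangement in the figure can be sketched in software (hypothetical names; a toy model of two cores exchanging values over cross-core queues, not a cycle-accurate CMP):

```python
from collections import deque

class Core:
    """Toy CRT core (sketch): runs the leading thread of one program and
    the trailing thread of the other, checking results across cores."""
    def __init__(self, name):
        self.name = name
        self.outbox = deque()   # leading-thread results -> partner core

    def run_leading(self, program, data):
        self.outbox.append(program(data))   # forward to the trailing copy

    def check_trailing(self, program, data, partner):
        result = program(data)
        if result != partner.outbox.popleft():
            raise RuntimeError("fault detected across cores")
        return result

prog_a, prog_b = (lambda x: x * 2), (lambda x: x + 1)
cpu_a, cpu_b = Core("A"), Core("B")
cpu_a.run_leading(prog_a, 21)                  # leading thread A on CPU A
cpu_b.run_leading(prog_b, 41)                  # leading thread B on CPU B
ra = cpu_b.check_trailing(prog_a, 21, cpu_a)   # trailing thread A on CPU B
rb = cpu_a.check_trailing(prog_b, 41, cpu_b)   # trailing thread B on CPU A
```

Because checking is done by the partner core's trailing thread rather than a central checker, the synchronization is looser than lockstepping.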
Slide 30: CRT Performance
- With per-thread store queues, 13% improvement over lockstepping with 8-cycle checker latency
Slide 31: Summary & Conclusions
- SRT is applicable in a real-world SMT design
  - ≈30% slowdown, slightly worse with two threads
  - Store queue capacity can limit performance
  - Preferential space redundancy improves coverage
- Chip-Level Redundant Threading: SRT for CMPs
  - Looser synchronization than lockstepping
  - Frees up resources for other application threads
Slide 32: More Information
- Publications
  - S.K. Reinhardt and S.S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," International Symposium on Computer Architecture (ISCA), 2000
  - S.S. Mukherjee, M. Kontz, and S.K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," International Symposium on Computer Architecture (ISCA), 2002
- Papers available from
  - http://www.cs.wisc.edu/~shubu
  - http://www.eecs.umich.edu/~stever
- Patents
  - Compaq/HP filed eight patent applications on SRT