Title: Redundant Multithreading Techniques for Transient Fault Detection
Slide 1: Redundant Multithreading Techniques for Transient Fault Detection
- Shubu Mukherjee (Intel)
- Michael Kontz (HP)
- Steve Reinhardt (Intel Consultant, U. of Michigan)
Versions of this work have been presented at ISCA 2000 and ISCA 2002.
Slide 2: Transient Faults from Cosmic Rays & Alpha Particles
- Decreasing feature size
- Decreasing voltage (exponential dependence?)
- Increasing number of transistors (Moore's Law)
- Increasing system size (number of processors)
- No practical absorbent for cosmic rays
Slide 3: Fault Detection via Lockstepping (HP Himalaya)
- Replicated microprocessors with cycle-by-cycle lockstepping
Slide 4: Fault Detection via Simultaneous Multithreading
[Figure: two redundant threads each execute R1 ← (R2); input replication feeds both copies, output comparison checks their results]
- Memory covered by ECC; RAID array covered by parity; ServerNet covered by CRC
Slide 5: Simultaneous Multithreading (SMT)
- Examples: Alpha 21464, Intel Northwood
Slide 6: Redundant Multithreading (RMT)
RMT = Multithreading + Fault Detection

  Base machine               | Multithreading (MT)               | Redundant Multithreading (RMT)
  Multithreaded uniprocessor | Simultaneous Multithreading (SMT) | Simultaneous & Redundant Threading (SRT)
  Chip multiprocessor (CMP)  | Multiple threads running on CMP   | Chip-Level Redundant Threading (CRT)
Slide 7: Outline
- SRT concepts & design
- Preferential Space Redundancy
- SRT performance analysis
  - Single- & multi-threaded workloads
- Chip-Level Redundant Threading (CRT)
  - Concept
  - Performance analysis
- Summary
- Current & Future Work
Slide 8: Overview
- SRT = SMT + Fault Detection
- Advantages
  - Piggybacks on an SMT processor with little extra hardware
  - Better performance than complete replication
  - Lower cost due to market volume of SMT & SRT
- Challenges
  - Lockstepping very difficult with SRT
  - Must carefully fetch/schedule instructions from redundant threads
Slide 9: Sphere of Replication
[Figure: leading and trailing threads inside the sphere of replication; input replication on the way in, output comparison on the way out; the memory system (incl. L1 caches) lies outside the sphere]
- Two copies of each architecturally visible thread
- Co-scheduled on an SMT core
- Compare results; signal fault if they differ
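The sphere-of-replication idea can be sketched as a small software analogy (hypothetical names; the real mechanism is hardware at the store/load boundary, not a function call):

```python
def run_redundant(program, inputs):
    """Run two copies of a thread on replicated inputs and compare
    their outputs at the sphere-of-replication boundary (a sketch)."""
    leading_out = program(inputs)    # leading thread
    trailing_out = program(inputs)   # trailing thread; input replication
                                     # guarantees both see identical inputs
    # Output comparison: any value leaving the sphere must match.
    if leading_out != trailing_out:
        raise RuntimeError("transient fault detected")
    return leading_out               # only checked values exit the sphere

# Fault-free run: both copies agree, one checked result leaves the sphere.
result = run_redundant(lambda xs: sum(xs), [1, 2, 3])
```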
Slide 10: Basic Pipeline
Fetch → Decode → Dispatch → Execute → Data Cache → Commit
Slide 11: Load Value Queue (LVQ)
- Keeps threads on the same path despite I/O or MP writes
- Out-of-order load issue possible
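A minimal software model of the LVQ (a sketch with hypothetical names, not the 21464 implementation): the leading thread performs the real load and forwards the address/value pair; the trailing thread consumes the forwarded value instead of re-reading memory, so an intervening write cannot make the threads diverge.

```python
from collections import deque

class LoadValueQueue:
    """Sketch of the LVQ: forwards (address, value) pairs from the
    leading thread's loads to the trailing thread."""
    def __init__(self):
        self.q = deque()

    def leading_load(self, memory, addr):
        value = memory[addr]          # only the leading thread touches memory
        self.q.append((addr, value))  # forward the pair through the LVQ
        return value

    def trailing_load(self, addr):
        lead_addr, value = self.q.popleft()
        assert lead_addr == addr      # address mismatch would signal divergence
        return value                  # same value, even if memory changed

memory = {0x120: 5}
lvq = LoadValueQueue()
a = lvq.leading_load(memory, 0x120)
memory[0x120] = 7                     # I/O or MP write between the two loads
b = lvq.trailing_load(0x120)          # trailing thread still sees 5
```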
Slide 12: Store Queue Comparator (STQ)
- Compares the two threads' store outputs before they reach the data cache
- Catches faults before they propagate to the rest of the system
Slide 13: Store Queue Comparator (contd)
[Figure: leading-thread store queue entries (e.g. st 5 → 0x120) matched against the corresponding trailing-thread stores; address and data are compared before the store goes to the data cache]
- Extends residence time of leading-thread stores
  - Size constrained by cycle-time goal
- Base CPU statically partitions a single queue among threads
  - Potential solution: per-thread store queues
- Deadlock if a matching trailing store cannot commit
  - Several small but crucial changes to avoid this
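The comparator's matching rule can be sketched in software (hypothetical names; the hardware compares entries in the store queue itself): a leading store waits until its trailing counterpart arrives, and only a matching address/data pair is released to the cache.

```python
from collections import deque

class StoreComparator:
    """Sketch of the store-queue comparator: leading-thread stores wait
    in the queue; each is checked against the matching trailing-thread
    store before anything reaches the data cache."""
    def __init__(self):
        self.pending = deque()   # leading stores awaiting their match
        self.cache = {}          # stands in for the data cache

    def leading_store(self, addr, data):
        self.pending.append((addr, data))   # extends residence time

    def trailing_store(self, addr, data):
        lead_addr, lead_data = self.pending.popleft()
        if (lead_addr, lead_data) != (addr, data):
            raise RuntimeError("fault caught before leaving the sphere")
        self.cache[addr] = data             # only matched stores commit

sq = StoreComparator()
sq.leading_store(0x120, 5)
sq.trailing_store(0x120, 5)   # match: the store reaches the cache
```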
Slide 14: Branch Outcome Queue (BOQ)
- Forwards leading-thread branch targets to trailing-thread fetch
- 100% prediction accuracy in the absence of faults
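A sketch of why the BOQ gives perfect "prediction" (hypothetical names; in hardware the queue feeds the trailing thread's fetch stage): the leading thread's resolved outcome simply becomes the trailing thread's next fetch address.

```python
from collections import deque

boq = deque()   # branch outcome queue: leading thread -> trailing fetch

def leading_branch(taken, target, fallthrough):
    """Leading thread resolves the branch and forwards the outcome."""
    next_pc = target if taken else fallthrough
    boq.append(next_pc)
    return next_pc

def trailing_fetch():
    """Trailing fetch uses the forwarded outcome as its 'prediction' --
    100% accurate as long as no fault corrupted the leading thread."""
    return boq.popleft()

lead_pc = leading_branch(taken=True, target=0x280, fallthrough=0x204)
trail_pc = trailing_fetch()
```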
Slide 15: Line Prediction Queue (LPQ)
- Alpha 21464 fetches chunks using line predictions
- Chunk = contiguous block of 8 instructions
Slide 16: Line Prediction Queue (contd)
- Generate a stream of chunked line predictions
  - Every leading-thread instruction carries its I-cache coordinates
  - Commit logic merges these into fetch chunks for the LPQ
  - Independent of leading-thread fetch chunks
- Commit-to-fetch dependence raised deadlock issues
[Figure: example chunk — 1F8 add | 1FC load R1←(R2) | 200 beq 280 | 204 and | 208 bne 200 | 200 add]
Slide 17: Line Prediction Queue (contd)
- Read-out on trailing-thread fetch is also complex
  - Base CPU thread chooser gets multiple line predictions, ignores all but one
  - Fetches must be retried on an I-cache miss
- Tricky to keep the queue in sync with thread progress
  - Add handshake to advance the queue head
  - Roll back head on I-cache miss
  - Track both last attempted & last successful chunks
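The head-pointer handshake described above can be sketched as follows (hypothetical names; a software model of the retry/rollback behavior, not the actual fetch logic):

```python
class LinePredictionQueue:
    """Sketch of the LPQ read-out handshake: the head advances only when
    the trailing thread's fetch succeeds, and effectively rolls back on
    an I-cache miss so the same chunk is retried."""
    def __init__(self, chunks):
        self.chunks = chunks
        self.last_attempted = -1
        self.last_successful = -1

    def next_prediction(self):
        # Always retry from just past the last *successful* chunk.
        self.last_attempted = self.last_successful + 1
        return self.chunks[self.last_attempted]

    def fetch_done(self, hit):
        if hit:
            self.last_successful = self.last_attempted  # advance head
        # on a miss, last_successful is unchanged: the head rolls back

lpq = LinePredictionQueue(["0x1F8", "0x200"])
first = lpq.next_prediction()   # first chunk
lpq.fetch_done(hit=False)       # I-cache miss: must retry
retry = lpq.next_prediction()   # same chunk again
lpq.fetch_done(hit=True)        # handshake: advance the head
second = lpq.next_prediction()  # now the next chunk
```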
Slide 18: Outline (repeated; next section: Preferential Space Redundancy)
Slide 19: Preferential Space Redundancy
- SRT combines two types of redundancy
  - Time: same physical resource, different time
  - Space: different physical resource
- Space redundancy preferable
  - Better coverage of permanent/long-duration faults
- Bias towards space redundancy where possible
Slide 20: PSR Example: Clustered Execution
[Figure: pipeline (Fetch → Decode → Dispatch → Commit) with the LPQ and two execution clusters, each with its own instruction queue (IQ 0/Exec 0, IQ 1/Exec 1); "add r1,r2,r3" is steered into one cluster]
- Base CPU has two execution clusters
  - Separate instruction queues, function units
  - Steered in dispatch stage
Slide 21: PSR Example: Clustered Execution (contd)
[Figure: leading-thread "add r1,r2,r3" instructions tagged with cluster bit 0 as they flow through the pipeline and LPQ]
- Leading-thread instructions record their cluster
  - Bit carried with fetch chunk through LPQ
  - Attached to trailing-thread instruction
- Dispatch sends to opposite cluster if possible
Slide 22: PSR Example: Clustered Execution (contd)
[Figure: trailing-thread "add r1,r2,r3" instructions, carrying cluster bit 0, steered to the opposite cluster]
- 99.94% of instruction pairs use different clusters
  - Full spatial redundancy for execution
  - No performance impact (occasional slight gain)
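The steering bias can be sketched as a tiny dispatch rule (hypothetical interface; the real hardware decides per slot in the dispatch stage): prefer the cluster opposite the leading copy's, falling back to the same cluster (time redundancy only) when the preferred one is full.

```python
def steer(trailing_instrs, free_slots):
    """Preferential space redundancy (sketch): each trailing instruction
    carries the cluster bit its leading copy used; dispatch sends it to
    the opposite cluster when a slot is free there."""
    assignments = []
    for instr, leading_cluster in trailing_instrs:
        preferred = 1 - leading_cluster            # bias toward space redundancy
        chosen = preferred if free_slots[preferred] > 0 else leading_cluster
        free_slots[chosen] -= 1
        assignments.append((instr, chosen))
    return assignments

# Leading copies of both instructions ran on cluster 0, so both prefer
# cluster 1; only one slot is free there, so the second falls back.
out = steer([("add", 0), ("sub", 0)], free_slots={0: 1, 1: 1})
```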
Slide 23: Outline (repeated; next section: SRT Performance Analysis)
Slide 24: SRT Evaluation
- Used SPEC CPU95, 15M instrs/thread
  - Constrained by simulation environment
  - ≈ 120M instrs for 4 redundant thread pairs
- Eight-issue, four-context SMT CPU
  - 128-entry instruction queue
  - 64-entry load and store queues
    - Default: statically partitioned among active threads
  - 22-stage pipeline
  - 64KB 2-way assoc. L1 caches
  - 3MB 8-way assoc. L2
Slide 25: SRT Performance: One Thread
- One logical thread → two hardware contexts
- Performance degradation ≈ 30%
- Per-thread store queue buys an extra 4%
Slide 26: SRT Performance: Two Threads
- Two logical threads → four hardware contexts
- Average slowdown increases to 40%
- Only 32% with per-thread store queues
Slide 27: Outline (repeated; next section: Chip-Level Redundant Threading)
Slide 28: Chip-Level Redundant Threading
- SRT typically more efficient than splitting one processor into two half-size CPUs
- What if you already have two CPUs?
  - IBM Power4, HP PA-8800 (Mako)
  - Conceptually easy to run these in lockstep
  - Benefit: full physical redundancy
  - Costs
    - Latency through centralized checker logic
    - Overheads (misspeculation etc.) incurred twice
- CRT combines best of SRT & lockstepping
  - Requires multithreaded CMP cores
Slide 29: Chip-Level Redundant Threading (contd)
[Figure: CPU A and CPU B cross-coupled; each core runs the leading thread of one program and the trailing thread of the other, with LVQ, LPQ, and store comparisons crossing between the cores]
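The cross-coupled arrangement in the figure can be sketched in software (hypothetical names; a toy model of two cores exchanging values over cross-core queues, not a cycle-accurate CMP):

```python
from collections import deque

class Core:
    """Toy CRT core (sketch): runs the leading thread of one program and
    the trailing thread of the other, checking results across cores."""
    def __init__(self, name):
        self.name = name
        self.outbox = deque()   # leading-thread results -> partner core

    def run_leading(self, program, data):
        self.outbox.append(program(data))   # forward to the trailing copy

    def check_trailing(self, program, data, partner):
        result = program(data)
        if result != partner.outbox.popleft():
            raise RuntimeError("fault detected across cores")
        return result

prog_a, prog_b = (lambda x: x * 2), (lambda x: x + 1)
cpu_a, cpu_b = Core("A"), Core("B")
cpu_a.run_leading(prog_a, 21)                  # leading thread A on CPU A
cpu_b.run_leading(prog_b, 41)                  # leading thread B on CPU B
ra = cpu_b.check_trailing(prog_a, 21, cpu_a)   # trailing thread A on CPU B
rb = cpu_a.check_trailing(prog_b, 41, cpu_b)   # trailing thread B on CPU A
```

Because checking is done by the partner core's trailing thread rather than a central checker, the synchronization is looser than lockstepping.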
Slide 30: CRT Performance
- With per-thread store queues, 13% improvement over lockstepping with 8-cycle checker latency
Slide 31: Summary & Conclusions
- SRT is applicable in a real-world SMT design
  - ≈30% slowdown, slightly worse with two threads
  - Store queue capacity can limit performance
  - Preferential space redundancy improves coverage
- Chip-Level Redundant Threading: SRT for CMPs
  - Looser synchronization than lockstepping
  - Frees up resources for other application threads
Slide 32: More Information
- Publications
  - S.K. Reinhardt and S.S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," International Symposium on Computer Architecture (ISCA), 2000
  - S.S. Mukherjee, M. Kontz, and S.K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," International Symposium on Computer Architecture (ISCA), 2002
- Papers available from
  - http://www.cs.wisc.edu/~shubu
  - http://www.eecs.umich.edu/~stever
- Patents
  - Compaq/HP filed eight patent applications on SRT