Title: Dynamic Verification of End-to-End Multiprocessor Invariants
1. Dynamic Verification of End-to-End Multiprocessor Invariants
- Daniel J. Sorin (1), Mark D. Hill (2), David A. Wood (2)
- (1) Department of Electrical and Computer Engineering, Duke University
- (2) Computer Sciences Department, University of Wisconsin-Madison
2. My Talk in One Slide
- Commercial server availability is important
- System model: Symmetric Multiprocessor (SMP)
- Fault model: Mostly transient, some permanent
- Recent work developed efficient checkpoint/recovery
- But we can only recover from hardware errors we detect
- Many hardware errors are hard to detect
- Proposal: Dynamic verification of invariants
- Online checking of end-to-end system invariants
- Checking performed with distributed signature analysis
- Triggers recovery if invariant is violated
3. Outline
- Background
- SMPs and availability
- Existing hardware error detection
- Invariant checking with distributed signature analysis
- Two invariant checkers
- Evaluation
- Conclusions
4. Symmetric Multiprocessor (SMP) System Model
[Figure: Cache Coherence Transaction. Labels: states I and M; steps: issue request, wait for response, receive response.]
5. Symmetric Multiprocessor (SMP) System Model
[Figure: Cache Coherence Transaction. Labels: states I and M; steps: issue request, wait for response, receive response.]
- Broadcast request not delivered to subset of nodes
- Broadcast requests delivered out of order to subset of nodes
6. Symmetric Multiprocessor (SMP) System Model
[Figure: Cache Coherence Transaction. Labels: states I and M; times t1, t2, t3; events: issue request, request arrives, response arrives.]
- More chances for incorrect state transitions
7. Backward Error Recovery
- Can improve availability with backward error recovery
- If error detected, then recover to pre-fault state
- Backward error recovery (BER) requires:
- Checkpoint/recovery mechanism
- Error detection mechanisms
8. SafetyNet Checkpoint/Recovery
- SafetyNet: all-hardware scheme [ISCA 2002]
- Periodically take logical checkpoint of multiprocessor
- MP State: processor registers, caches, memory
- Incrementally log changes to caches and memory
- Consistent checkpointing performed in logical time
- E.g., every 3000 broadcast cache coherence requests
- Can tolerate >100,000 cycles of error detection latency
[Figure: timeline of checkpoints CP 1 through CP 4 along a time axis, with regions labeled "Validated execution", "Pending validation (still detecting errors)", and "Active execution".]
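The incremental logging idea on this slide can be illustrated with a toy software model. This is only a sketch under stated assumptions, not SafetyNet's hardware design: `LoggedMemory`, `take_checkpoint`, and `recover_to` are hypothetical names, memory is modeled as a Python dict, and checkpoint validation is left out.

```python
class LoggedMemory:
    """Toy model: memory with one undo log per checkpoint interval (hypothetical)."""

    def __init__(self):
        self.data = {}          # address -> value
        self.undo_logs = [{}]   # undo_logs[k] covers writes made after checkpoint k

    def write(self, addr, value):
        log = self.undo_logs[-1]
        # Log the old value only on the first write to addr in this interval.
        # None is used here to mean "address was not present before".
        if addr not in log:
            log[addr] = self.data.get(addr)
        self.data[addr] = value

    def take_checkpoint(self):
        """Open a new interval and return the new checkpoint's number."""
        self.undo_logs.append({})
        return len(self.undo_logs) - 1

    def recover_to(self, checkpoint):
        """Undo all writes made after the given checkpoint, newest interval first."""
        while len(self.undo_logs) > checkpoint:
            log = self.undo_logs.pop()
            for addr, old in log.items():
                if old is None:
                    self.data.pop(addr, None)
                else:
                    self.data[addr] = old
        self.undo_logs.append({})   # fresh interval after the recovery point


mem = LoggedMemory()
mem.write(0x10, 1)
cp = mem.take_checkpoint()
mem.write(0x10, 2)      # would be rolled back on recovery
mem.write(0x20, 7)      # would be rolled back on recovery
mem.recover_to(cp)      # an error was detected: return to the checkpointed state
```

Because only the first write per address per interval is logged, the log grows with the number of distinct locations touched rather than with the number of writes.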
9. Error Detection
- Error model: mostly due to transient faults
- Example error detection mechanisms
- Parity bit on cache line
- Checksum on incoming message
- Timeout on cache coherence transaction
- But error detection for servers is still weak
- Why?
- Error detection is often on critical path and must be fast
- Fast error detection can't incorporate info from other nodes
10. Why Local Information Isn't Sufficient
[Figure labels: Shared, Owned]
11. Why Local Information Isn't Sufficient
[Figure labels: Broadcast Request for Exclusive, fault!, Shared, Owned]
12. Why Local Information Isn't Sufficient
[Figure labels: Broadcast Request for Exclusive, fault!, Shared, Owned, Invalid, Data Response]
13. Why Local Information Isn't Sufficient
[Figure labels: Shared, Modified]
Neither P1 nor P2 can detect that an error has occurred!
14. Outline
- Background
- End-to-end invariant checking
- Two invariant checkers
- Evaluation
- Conclusions
15. Distributed Signature Analysis
- Reduces long history of events into small signature
- Signatures map almost-uniquely to event histories
[Figure: P1 reduces its event history (Event 1 at P1, Event 2 at P1, ..., Event N at P1) into a signature, and P2 does the same for its own events; P1's signature and P2's signature are sent to a Checker.]
Check periodically in logical time (every 3000 requests)
16. Designing Signature Analysis Schemes
- Must devise two functions: Update and Check
- Signature(Pi) = Update(Signature(Pi), Event)
- Check(Signature(P1), ..., Signature(PN)) = true if error
- Simple example: check that message inflow = outflow
- Assume only unicast messages
- Update: +1 for receive, -1 for send
- Check: true if sum of all signatures doesn't equal 0
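The inflow/outflow example can be written out directly. The sketch below is a software illustration only; the event names "receive" and "send" and the two-node scenario are assumptions made for the example.

```python
def update(signature, event):
    """Update: +1 for a receive, -1 for a send."""
    if event == "receive":
        return signature + 1
    if event == "send":
        return signature - 1
    raise ValueError(f"unknown event: {event!r}")


def check(signatures):
    """Check: True (error) if the signatures do not sum to 0."""
    return sum(signatures) != 0


# Fault-free run: P1 sends one unicast message and P2 receives it.
sig_p1 = update(0, "send")
sig_p2 = update(0, "receive")
no_error = check([sig_p1, sig_p2])      # False: inflow equals outflow

# Faulty run: the message is dropped, so P2's signature is never updated.
dropped = check([sig_p1, 0])            # True: one send has no matching receive
```

Neither node can see the drop from its own signature alone; the error only shows up when the signatures are combined.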
17. Implementing Distributed Signature Analysis
- All components cooperate to perform checking
- Component: cache controller or memory controller
- Each component contains
- Local signature register
- Logic to compute signature updates
- System contains
- System controller that performs check function
- Use distributed signature analysis for dynamic verification
- Verify end-to-end invariants
18. Outline
- Background
- End-to-end invariant checking
- Two invariant checkers
- Message invariant
- Cache coherence invariant
- Evaluation
- Conclusions
19. A Message-Level Invariant Checker
- Context: symmetric multiprocessor (SMP)
- Cache coherence with broadcast snooping protocol
- Invariant: all nodes see same total order of broadcast cache coherence requests
- Update: for each incoming broadcast, add Address
- Not quite this simple (e.g., doesn't detect reorderings)
- Check: error if all signatures aren't equal
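A minimal sketch of this simplified scheme follows. The 16-bit signature width, the helper `signature_of`, and the example addresses are assumptions made for illustration; the slide gives only the add-Address update and the all-equal check.

```python
def update(signature, address, bits=16):
    """Update: add the address of an incoming broadcast into a b-bit signature."""
    return (signature + address) & ((1 << bits) - 1)


def signature_of(addresses):
    """Fold a node's whole history of incoming broadcasts into one signature."""
    sig = 0
    for address in addresses:
        sig = update(sig, address)
    return sig


def check(signatures):
    """Check: True (error) if the signatures are not all equal."""
    return len(set(signatures)) > 1


requests = [0x1000, 0x2040, 0x3080]

# Both nodes see the same broadcasts: no error reported.
ok = check([signature_of(requests), signature_of(requests)])

# One node misses the last broadcast: the signatures differ.
dropped = check([signature_of(requests), signature_of(requests[:2])])

# One node sees the broadcasts in reverse order: addition is commutative,
# so this simplified scheme does not notice.
reordered = check([signature_of(requests), signature_of(requests[::-1])])
```

Here `ok` and `reordered` are both False while `dropped` is True, which is the limitation the slide points out.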
20. Aliasing
- Aliasing occurs if two histories have same signature
- 3 possible sources of aliasing
- Finite resources: b bits can only distinguish 2^b histories
- Fault in signature analysis hardware itself
- Inherent flaw in scheme
- Examples of inherent aliasing in previous scheme
- Arrival of message with Address = 0 doesn't change signature
- Reordering of messages doesn't change signature
- We solve aliasing issues in paper
- Tricks: hash more than 1 field of message, use LFSRs, etc.
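To show how the tricks listed here can remove the two inherent aliasing examples, the sketch below steps a 16-bit Galois LFSR before folding in each message and mixes two message fields. It is not the scheme from the paper: the tap constant `0xB400`, the `msg_type` field, and the mixing in `update` are all assumptions chosen for the example.

```python
MASK = 0xFFFF
TAPS = 0xB400   # taps of a commonly used maximal-length 16-bit Galois LFSR


def lfsr_step(state):
    """Advance the 16-bit Galois LFSR by one step."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAPS
    return state


def update(signature, address, msg_type):
    """Step the LFSR, then XOR in a mix of two message fields (hypothetical mix)."""
    # The constant 1 ensures that even an all-zero message contributes a nonzero value.
    mixed = (address ^ (address >> 16) ^ (msg_type << 12) ^ 1) & MASK
    return lfsr_step(signature) ^ mixed


def signature_of(messages):
    sig = 0
    for address, msg_type in messages:
        sig = update(sig, address, msg_type)
    return sig


a = (0x1000, 0)
b = (0x2040, 0)

zero_changes_signature = signature_of([(0, 0)]) != 0
order_matters = signature_of([a, b]) != signature_of([b, a])
```

Both flags are True for these inputs. Aliasing from finite signature width still remains: with 16 bits, distinct histories can collide.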
21. A Cache Coherence Invariant Checker
- Invariant: all coherence upgrades cause downgrades
- Upgrade: increase permissions to block (e.g., none -> read)
- Downgrade: decrease permissions (e.g., write -> read)
- Update: add Address for upgrade, subtract Address for downgrade
- Check: error if sum of all signatures doesn't equal 0
- Challenges
- Can be more than one downgrade per upgrade
- Upgrader doesn't know how many downgraders exist
- See paper for solutions to these challenges
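The basic update and check rules, along with the first challenge, can be seen in a small sketch. It covers only the simple rules stated on the slide, not the paper's solutions; the 16-bit width and the example address are assumptions.

```python
BITS = 16
MASK = (1 << BITS) - 1


def update(signature, address, kind):
    """Update: add Address for an upgrade, subtract it for a downgrade."""
    if kind == "upgrade":
        return (signature + address) & MASK
    if kind == "downgrade":
        return (signature - address) & MASK
    raise ValueError(f"unknown kind: {kind!r}")


def check(signatures):
    """Check: True (error) if the signatures do not sum to 0 (mod 2^BITS)."""
    return (sum(signatures) & MASK) != 0


block = 0x2040
upgrader = update(0, block, "upgrade")        # e.g., none -> write at one node
downgrader = update(0, block, "downgrade")    # e.g., write -> none at another node

# One upgrade matched by one downgrade: the signatures cancel.
no_error = check([upgrader, downgrader])

# Fault: the downgrade never happens, so the upgrade is unmatched.
missed_downgrade = check([upgrader, 0])

# Challenge from the slide: two legitimate downgraders for one upgrade
# make the naive sum nonzero even though nothing went wrong.
false_alarm = check([upgrader, downgrader, downgrader])
```

The last case shows why the naive rule is not enough on its own: `false_alarm` is True without any fault having occurred.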
22. Outline
- Background
- End-to-end invariant checking
- Two invariant checkers
- Evaluation
- Conclusions
23. Methodology
- Full-system simulation of 16-processor machine
- Simics provides functional simulation of everything
- We added timing simulation for memory system and SafetyNet
- Commercial workloads running on Solaris 8
- Database: IBM's DB2 running online transaction processing
- Static web server: Apache
- Dynamic web server: Slashdot
- Java middleware
24. Detection Coverage
- How do we know if our checkers work?
- Inject errors periodically
- Corrupt messages
- Drop messages
- Reorder messages
- Improperly process cache coherence messages
- Global invariant checkers detected all errors
25. Performance
- Error bars represent +/- one standard deviation
26. Conclusions
- Goal: improve multiprocessor availability
- How? Dynamic verification of end-to-end invariants
- Implemented with distributed signature analysis
- Results
- Detects previously undetectable hardware errors
- Negligible performance overhead for error-free execution
- Duke FaultFinder Project
- http://www.ee.duke.edu/sorin/faultfinder
- Wisconsin Multifacet Project
- http://www.cs.wisc.edu/multifacet/