Title: Formal Verification of Shared Memory Systems During their Design
1Formal Verification of Shared Memory Systems During their Design
- Ganesh Gopalakrishnan
- Department of Computer Science
- University of Utah
- http://www.cs.utah.edu/ganesh
2FM and shared-memory system design
- Processor speed increasing at 55% per year; memory speeds at 7%
- Mismatch exacerbated by shared memory multiprocessors
- Complex protocols employed to hide memory latencies
- Need for formal verification techniques that can be employed during design
3Our Project: Utah Verifier
4A Shared Memory Multiprocessor (a shared memory system)
[Figure: CPUs (CPU, CPU, ...) connected through an Interconnect to memories (Memory, Memory, ...)]
5Classification: 1. Symmetric Multi-Processors (SMP)
[Figure: three CPUs and a Memory on a coherent snooping bus]
- Potential bugs in complex bus designs
- Deadlocks, lack of forward progress
- Lack of coherency
- Incorrect shared memory consistency model
6Classification: 2. Distributed Shared Memory (DSM) systems
[Figure: SMP nodes connected by a high-speed network]
- Problems due to complex DSM protocols
- Deadlocks, lack of forward progress
- Incorrect shared memory consistency models
7Formal Methods for Shared Memory System Design
Verification
Provably-correct Synthesis
Theorem-proving
Finite-state Reachability
Model-checking
Protocol
Low-level concerns (e.g. deadlocks, progress,...)
Higher-level concerns (e.g. shared memory consistency models)
8Results of the UV group
- New Partial Order reduction algorithm
- Realized in verifier called PV
- Outperforms SPIN 10 to 1 on most examples
- Selective state-caching is available for free
- A DSM Protocol synthesis algorithm
- Safety of synthesis proved correct using PVS
- Derives realistic (hand-quality) DSM protocols
- Incorporates a scalable buffer-reservation scheme
- Verifying Formal Memory Models
9Protocol Refinement
10Motivations
- Distributed directory based coherence protocols are difficult to understand and debug
- low-level requests / acks / nacks don't reveal what is being implemented
- transient states are introduced and handled in an ad-hoc way
- buffer allocation is not tied to desired high-level properties (e.g. progress)
- verification is tedious
11Example of problems due to unexpected msgs
Cache Ctrlr
Directory Ctrlr
12Our approach
- Based on synthesis
- Transient states introduced automatically
- Buffer allocation is tied to desired high-level properties (e.g. progress)
- Verification becomes much easier
- Synthesized protocols seem efficient
13Overview of Synthesis Method
[Figure: Cache Ctrlr with states I and E, Dir Ctrlr with states F and E, exchanging Req and (N)ack messages]
14Model-checking Efficiency
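15An Illustration: Migratory Protocol (i)
[Figure: state machines for Process h and Process r(i)]
Process h
- States: F, E, I1, I2, I3
- Transition labels: r(i)?req, r(i)!gr(data), r(j)?req, r(o)!inv, r(o)?LR(data), r(o)?ID(data), r(j)!gr(data)
Process r(i)
- States: I, V, V1, V2
- Transition labels: h!req, h?gr(data), rw, evict, h!LR(data), h?inv, h!ID(data)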
16An Illustration: Migratory Protocol (ii)
[Figure: state machines for Process h and Process r(i), with the same state and transition labels as in part (i)]
17A Generic Example
[Figure: processes P, Q, R with transition labels P?x, R!b, Q!a, Q!c]
18Async Implementation of Example (i)
[Figure: processes P, Q, R with labels R!b, Q!a, R!!b, Q!!a]
1 msg buffer location for Ack/Nack
19Async Implementation of Example (ii)
[Figure: processes P, Q, R and a Progress Buffer, with labels R!b, Q!a, Q!!c, R!!b, Q!!a, P!!ack]
20Organization of Protocol - per Cache Line
Remote Nodes
Home Node
- Remote nodes (cache ctrlrs) communicate w. home directory controller only
- If Remote and Home requests cross in medium,
- Remote request treated as Nack by Home
- Home request is dropped by Remote
- Pt-to-pt order-preserving error-free communication
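The crossing rule on this slide can be sketched in a few lines. This is only an illustrative reading of the two bullets, not the synthesized protocol itself: the function names, the message names "req" and "home_req", and the idea of tracking whom the home is waiting on are all assumptions made for the example.

```python
def home_handles(waiting_on, sender, msg):
    """Home is in a transient state, waiting for a reply from `waiting_on`.

    Assumption: a request arriving from that same remote node has crossed
    the home's own request in the medium, so it is read as a nack.
    """
    if sender == waiting_on and msg == "req":
        return "nack"
    return msg


def remote_handles(has_outstanding_request, msg):
    """Remote node with (or without) a request already sent to the home.

    Assumption: a home request that crossed the remote's own request is
    dropped (returned as None); every other message is delivered as-is.
    """
    if has_outstanding_request and msg == "home_req":
        return None
    return msg
```

Because the channels are point-to-point and order-preserving, each side can decide locally that a crossing happened, which is what lets the rule be stated per node.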
21General Nature of Communication States
[Figure: (Remote) transient state T with labels h!msg, h?m1, h?m2; (Home) transient state T with labels r(i)?m1, r(j)!m2]
22Summary Remote node rules
23Summary Home node (i)
24Summary Home node (ii)
25Status of Work
- Correctness of Protocol Synthesis Proved in PVS
- Write-invalidate protocol also synthesized
- Offers a general synthesis method for protocols (not necessarily for DSM)
- Related work: Buckley and Silberschatz, Chandra et al., Park and Dill, Gribomont, ...
26Verifying Conformance to Formal Memory Models
27FM and shared-memory system design
- Shared-memory systems are complex!
- Designers need safety net when exploring optimizations: formal verification
- We focus on verifying that a (finite-state model of a) shared memory system provides the required memory model (mainly Sequential Consistency)
- E.g. Verify a Cache Coherence Protocol for SC
- Our approach: finite-state reachability analysis
28Importance of Memory Models -- An Example
Peterson's algorithm for mutex under a memory model called TSO
P1: A = 1; turn = 2; while (B /\ turn == 2) ; ..CS..
P2: B = 1; turn = 1; while (A /\ turn == 1) ; ..CS..
Must Specify Synchronization Routines and the Shared Memory Consistency Model(s) under which they work!
29Impact on CPU design -- Do Read-Speculation Right!
[Figure: CPU1 and CPU2 on a bus to MEM; bus traffic ..wr(a,2).. wr(b,3)..]
CPU1: wr(a,2) - Miss; rd(b, 0) - Speculate; Snoop wr(a) - Spec OK
CPU2: wr(b,3) - Miss; rd(a, 0) - Speculate; Snoop wr(a) - Spec not OK; reissue rd(a, 2)
Without reissue, results are inconsistent with SC
30Basis for our work: ARCHTEST (Collier)
- Multi-threaded C programs
- Used to debug actual multiprocessor machines
- unavailable at design-time
- Based on the theory of graph-sets
- used in our work also
- Our CAV98 work: adapt Collier's tests for model-checking
- incomplete
- This work: a complete verification method (sound too!)
31What is a shared memory model?
Captured by the set of all executions of a
concurrent program!
Execution 1
Execution 2
SC
TSO
TSO allows more executions than SC (hence
weaker)
32An Operational Definition of SC and TSO
[Figure: SC: cpu1 and cpu2 access Memory through a MUX. TSO: cpu1 and cpu2 each access Memory through a fifo]
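A minimal sketch of the two operational models follows, assuming a simplified reading of the figure: under SC every write goes straight to memory, while under TSO each CPU's writes pass through its own FIFO and a CPU's reads see its own buffered writes first. The program encoding, the function name `outcomes`, and the register names are hypothetical. The test program is the flag-write-then-flag-read core of the Peterson example from the earlier slide (write your own flag, then read the other one).

```python
# Each thread is a tuple of ops: ("w", addr, value) or ("r", addr, register).
PROGRAM = (
    (("w", "A", 1), ("r", "B", "r1")),   # CPU 1
    (("w", "B", 1), ("r", "A", "r2")),   # CPU 2
)


def outcomes(program, use_store_buffers):
    """Explore all interleavings; return the set of final register values."""
    addrs = sorted({op[1] for thread in program for op in thread})
    start = (
        tuple(0 for _ in program),        # program counters
        tuple(() for _ in program),       # per-CPU FIFO store buffers
        tuple((a, 0) for a in addrs),     # memory, initially all zero
        (),                               # registers read so far
    )
    seen, stack, finals = {start}, [start], set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        memory = dict(mem)
        successors = []
        for cpu, thread in enumerate(program):
            if pcs[cpu] < len(thread):    # CPU executes its next instruction
                kind, addr, arg = thread[pcs[cpu]]
                new_pcs = pcs[:cpu] + (pcs[cpu] + 1,) + pcs[cpu + 1:]
                if kind == "w" and use_store_buffers:
                    new_buf = bufs[cpu] + ((addr, arg),)
                    new_bufs = bufs[:cpu] + (new_buf,) + bufs[cpu + 1:]
                    successors.append((new_pcs, new_bufs, mem, regs))
                elif kind == "w":
                    new_mem = dict(memory)
                    new_mem[addr] = arg
                    successors.append(
                        (new_pcs, bufs, tuple(sorted(new_mem.items())), regs))
                else:                     # read: own buffer first, then memory
                    pending = [v for a, v in bufs[cpu] if a == addr]
                    value = pending[-1] if pending else memory[addr]
                    new_regs = dict(regs)
                    new_regs[arg] = value
                    successors.append(
                        (new_pcs, bufs, mem, tuple(sorted(new_regs.items()))))
            if bufs[cpu]:                 # oldest buffered write reaches memory
                (addr, value), rest = bufs[cpu][0], bufs[cpu][1:]
                new_mem = dict(memory)
                new_mem[addr] = value
                new_bufs = bufs[:cpu] + (rest,) + bufs[cpu + 1:]
                successors.append(
                    (pcs, new_bufs, tuple(sorted(new_mem.items())), regs))
        if not successors:                # all threads done, buffers drained
            finals.add(regs)
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return finals


sc = outcomes(PROGRAM, use_store_buffers=False)
tso = outcomes(PROGRAM, use_store_buffers=True)
```

In this sketch `tso` contains every outcome in `sc` plus the one where both reads return 0, which is the sense in which TSO allows more executions and is weaker. That extra outcome is the one that lets both Peterson processes pass their entry test.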
33How are allowed executions specified?
As constraints on events generated by the
execution!
Constraints are expressed in terms of ordering rules
- RO - Read Ordering
- ROA - RO over the same address
- WOS - Write Ordering by Storage
- POS - Program Ordering by Storage
- CMP - Computational Ordering
- WA - Write Atomicity
Ordering rules specify constraints on EVENTS
Memory Model = Collier Cocktail! - e.g. (CMP, RO, WOS)
34Definition of POS (and also RO and WOS)
PO includes RR, RW, WR, and WW orders
35Definition of CMP (defined per CPU per address)
CPU_j
STORE_j
36Assumptions in defining CMP... and in the rest of this talk
- We are interested in more than SC
- We would like to set up a general framework for defining and verifying memory models
- Assume that RO is obeyed by every memory model of interest to us
- We Assume
- Projectability,
- Data Independence
- Unambiguous executions
37Assume Projectability, Data Independence, and consider only Unambiguous executions
Projectable
Data independent
Same datum never written twice (so we can uniquely trace source of data!)
Unambiguous
38Definition of CMP for CPU i for address d
CMP includes ROA also is an implied edge
[Figure: CPU_j and STORE_j event orderings over address d, with events R1(d,T), R2(d,2), R3(d,2), R4(d,5), W2(d,4), W3(d,5), W4(d,2) and ROA edges]
39Let's study (CMP, RO, WOS) - a useful drosophila!
Initially a = 0; R1(a,1); W2(a,1)
Even this execution is possible under (CMP,RO,WOS)
..no writes to a..
CPU_j
STORE_j
40An execution satisfying (CMP, RO, WOS)
Execution satisfies (CMP, RO, WOS) as there
are no cycles created by adding their arcs!
41An execution that violates (CMP,RO,WOS)
rd(A,3) rd(A,2)
wr(A,2) wr(A,3)
wr(A,2)
rd(A,2)
WOS
ROA
wr(A,3)
rd(A,3)
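The cycle check behind this slide can be sketched as an ordinary graph search. The edge list below is an assumed reading of the figure (one CPU reads 3 then 2, the other writes 2 then 3); in particular the CMP edges, including the one from rd(A,2) to the write that overwrites the value it read, are this sketch's interpretation rather than something spelled out on the slide.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph of events.

    `edges` is a list of (source, destination, rule_label) triples.
    """
    graph = {}
    for src, dst, _label in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on stack, finished
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:       # back edge: a cycle exists
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


# Assumed arcs for the violating execution on this slide.
violating = [
    ("rd(A,3)", "rd(A,2)", "ROA"),   # reads in program order, same address
    ("wr(A,2)", "wr(A,3)", "WOS"),   # writes as ordered at the store
    ("wr(A,2)", "rd(A,2)", "CMP"),   # a read follows the write it reads from
    ("wr(A,3)", "rd(A,3)", "CMP"),
    ("rd(A,2)", "wr(A,3)", "CMP"),   # read of 2 precedes the overwrite by 3
]

# Same writes, but the reads arrive in the order the writes were stored.
consistent = [
    ("rd(A,2)", "rd(A,3)", "ROA"),
    ("wr(A,2)", "wr(A,3)", "WOS"),
    ("wr(A,2)", "rd(A,2)", "CMP"),
    ("wr(A,3)", "rd(A,3)", "CMP"),
    ("rd(A,2)", "wr(A,3)", "CMP"),
]
```

With these arcs the violating execution contains the cycle wr(A,3), rd(A,3), rd(A,2), wr(A,3), while the consistent variant has none.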
42Verification Techniques for Memory Models
- Consider all possible executions
- involving all possible addresses A
- and all possible data D
- for all possible concurrent programs P
- Introduce the arcs due to ordering rules
- Look for cycles
- Impractical!
- So, look for ways to limit A, D, and P
43Our approach
- Assume address projectability (or projectability)
- and data independence
- Prove limited address theorems (helps limit A)
- Characterize all violating executions E_i over A
- Come up with finite-state abstractions for each E_i
- using data independence to limit D, and
- using non-determinism
- to arrive at a finite number of test automata aut_i
- Explore state-space of each aut_i || memory-system
- Look for entry into error-states
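The last two steps (explore the test automaton running alongside the memory system, look for entry into an error state) can be sketched as a breadth-first reachability search. Everything in the model is a toy assumption made for illustration: one address, a writer that stores the distinct data 1 then 2, a reader-side cache fed by pending updates, and a test automaton that flags a read of 1 after a read of 2. It is not one of the automata from the talk.

```python
from collections import deque


def automaton_step(aut, value):
    """Toy test automaton: error if old datum 1 is read after new datum 2."""
    if aut == "start" and value == 2:
        return "saw_new"
    if aut == "saw_new" and value == 1:
        return "error"
    return aut


def successors(state, in_order):
    """Non-deterministic moves of the toy memory system plus the automaton."""
    written, cache, pending, aut = state
    if written < 2:                       # writer stores the next datum
        yield (written + 1, cache, pending + (written + 1,), aut)
    # Updates reach the reader's cache in FIFO order, or in any order
    # when the (deliberately buggy) out-of-order variant is selected.
    choices = range(min(1, len(pending))) if in_order else range(len(pending))
    for i in choices:
        yield (written, pending[i], pending[:i] + pending[i + 1:], aut)
    # Reader reads its cache; the automaton observes the value returned.
    yield (written, cache, pending, automaton_step(aut, cache))


def error_reachable(in_order):
    """Breadth-first search of the combined state space for an error state."""
    start = (0, 0, (), "start")
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state[3] == "error":
            return True
        for nxt in successors(state, in_order):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Using distinct data for the two writes is what lets a finite automaton recognize the violation, and leaving the interleaving of writes, updates, and reads non-deterministic is what makes the search cover every execution of this small model.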
44Use of data abstraction and non-determinism
45Limited Address Theorem for (CMP,RO,WOS)
Two addresses suffice!
46PowerPoint proof of the limited address theorem
for (CMP,RO,WOS)
RO
RO
R
Involves two addrs!
47Exhaustive characterization of violations of (CMP, RO, WOS) over one address, a
48Test automata for 1-address (CMP,RO,WOS) violations
Error states E1, E2
49Exhaustive characterization of two-address violations of (CMP, RO, WOS)
50Test automata for 2-address (CMP,RO,WOS) violations
Error states E1, E2
51Limited Address Theorem for (CMP,POS)
521-address (CMP,POS) verification
Error states E1, E2
532-address (CMP,POS) verification
Error states E1, E2
54SC = (CMP, POS, WA)
55Definition of WA - by showing what is not WA!
56The limited-address theorem for SC = (CMP, POS, WA)
- In an N-processor system, N addresses are
- sufficient
- IF concurrent program P using M > N addresses shows a violation
- THEN there exists a subset A of N addresses
- such that P projected onto A yields concurrent program P' that also shows a violation.
- PowerPoint proof to follow
- and necessary
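Projecting a program onto an address subset, the operation the theorem relies on, can be sketched as a simple filter. The tuple encoding of operations and the name `project` are assumptions made for this illustration.

```python
def project(program, addresses):
    """Keep only the operations that touch an address in `addresses`.

    `program` is a list of threads; each thread is a list of
    (kind, address, value) tuples, kept in program order.
    """
    keep = set(addresses)
    return [[op for op in thread if op[1] in keep] for thread in program]


# Hypothetical 2-processor program using three addresses (M = 3 > N = 2).
program = [
    [("w", "a", 1), ("w", "c", 7), ("r", "b", 0)],
    [("w", "b", 1), ("r", "c", 7), ("r", "a", 0)],
]

projected = project(program, {"a", "b"})
```

Here `projected` keeps each thread's operations on a and b in their original order and drops the accesses to c, which is the P' that the theorem says still exhibits a violation for a suitable choice of A.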
57PowerPoint proof of the limited address theorem for SC = (CMP, POS, WA)
- Suppose C is the cycle containing the smallest number of events that involves more than N <pos edges.
- Then two <pos edges connect events generated by the same processor, say g, and observed by a and b.
- If a = b, we can eliminate one of these POS edges
- if a <> b, consider g <> a, and possibly equal to b.
- a0 and a1 are writes. Find corresponding events in b.
[Figure: events a0, a1, b0, b2, b3 connected by Pos(g) and wa edges, with one linearization shown]
58All N-address (CMP, POS, WA) violations (2)
Two processors see two writes w1 and w2 in
different orders
(CMP, POS) violations
59Complete test for SC for 1-address programs
Error states
- <P14, Q41>
- P41a, P41b x Q14a, Q14b
60Complete test for SC for 2-address programs
Error states
- <P14, Q41>
- P41a, P41b x Q14a, Q14b
61Case Studies
- Runway/PA system model
- Bus based design
- An aggressive split transaction protocol
- Out-of-order (speculative) completion of transactions on Runway for high-performance
- not modeled in current experiments
- In-order completion of instructions in PA for sequential consistency
62SC verification of the HP/Runway model
63Conclusions
- Promising
- Violations caught very quickly
- Need to try larger examples
- Currently studying weaker memory models
- Future work
- Combatting state-explosion
- Symmetries
- Better automata
- Integrate into design cycle of CPUs
- Support performance optimizations
- and verification regressions
64Related Work
- Graf (CAV94)
- for more than SC (hence unsound for SC)
- properties depend on design
- Alur, McMillan, Peled (LICS96)
- undecidable if data can be compared
- Nalumasu, Ghughal, Mokkedem, Gopalakrishnan (CAV98)
- incomplete
- Henzinger, Qadeer, Rajamani (CAV99)
- needs invariants
- invariants depend on design
- assumes address-symmetry
- Collier (80s)
- not available at design-time