Title: CHESS: Systematic Testing of Concurrent Programs
1. CHESS: Systematic Testing of Concurrent Programs
- Madan Musuvathi
- Shaz Qadeer
- Microsoft Research
2. Testing multithreaded programs is HARD
- Specific thread interleavings expose subtle errors
- Testing often misses these errors
- Even when found, errors are hard to debug
  - No repeatable trace
  - Source of the bug is far away from where it manifests
3. Concurrency is a real problem
- Windows 2000 hot fixes
  - Concurrency errors most common defects among detectable errors
  - Incorrect synchronization and protocol errors most common defects among all coding errors
- Windows Server 2003 late cycle defects
  - Synchronization errors second in the list, next to buffer overruns
- Race conditions can result in security exploits
4. Current practice
- Concurrency testing = Stress testing
- Example: testing a concurrent queue
  - Create 100 threads performing queue operations
  - Run for days/weeks
  - Pepper the code with sleep( random() )
- Stress increases the likelihood of rare interleavings
  - Makes any error found hard to debug
5. CHESS: Unit testing for concurrency
- Example: testing a concurrent queue
  - Create 1 reader thread and 1 writer thread
  - Exhaustively try all thread interleavings
- Run the test repeatedly on a specialized scheduler
  - Explore a different thread interleaving each time
  - Use model checking techniques to avoid redundancy
- Check for assertions and deadlocks in every run
  - The error-trace is repeatable
6. Systematic Stress Testing Using CHESS
- Tester provides a test scenario
- CHESS runs the scenario in a loop: While(not done) TestScenario()
  - Every run takes a different interleaving
  - Every run is repeatable
[Diagram: Program (TestScenario()) on top of CHESS, on top of the Win32 API, on top of kernel threads, scheduler, and synchronization objects]
7. Conditions on Test Scenario
- Test scenario should terminate in all interleavings
- Test scenario should be idempotent
  - Free all resources (handles, memory, ...)
  - Clear the hardware state
- Key observation
  - Existing stress tests already have these properties
  - Because they repeatedly run forever
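A minimal sketch of a test scenario that meets these conditions, assuming a hypothetical `BlockingQueue` class and standard C++ threads rather than the Win32 API. The names `BlockingQueue`, `test_scenario`, and `run_scenarios` are illustrative and not part of CHESS. The scenario terminates in every interleaving and leaves no state behind, so it can be run in a loop.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical blocking queue, used only for this illustration.
class BlockingQueue {
public:
    void push(int value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(value);
        }
        not_empty_.notify_one();
    }

    // Blocks until an item is available, then removes and returns it.
    int pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        int value = items_.front();
        items_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::queue<int> items_;
};

// One scenario: 1 writer thread and 1 reader thread.
// It terminates in every interleaving (the reader pops exactly as many
// items as the writer pushes) and is idempotent (all state is local and
// released on return).
bool test_scenario() {
    const int kItems = 3;
    BlockingQueue queue;
    std::vector<int> received;

    std::thread writer([&] {
        for (int i = 0; i < kItems; ++i) queue.push(i);
    });
    std::thread reader([&] {
        for (int i = 0; i < kItems; ++i) received.push_back(queue.pop());
    });
    writer.join();
    reader.join();

    // FIFO order must hold no matter how the threads interleave.
    return received == std::vector<int>{0, 1, 2};
}

// Plain loop standing in for "While(not done) TestScenario()".
// Unlike CHESS, this loop does not control which interleaving runs.
int run_scenarios(int runs) {
    int failures = 0;
    for (int i = 0; i < runs; ++i) {
        if (!test_scenario()) ++failures;
    }
    return failures;
}
```

Calling `run_scenarios(20)` runs the scenario 20 times and returns the number of failed runs. With the operating system's scheduler each run takes whatever interleaving happens to occur; the point of CHESS is to choose a different one each time.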
8. Perturb the System as Little as Possible
- Run the system as is
  - On the actual OS, hardware
  - Using system threads, synchronization
- Detour Win32 API calls
  - To control and introduce nondeterminism
- Advantages
  - Avoid reporting false errors
  - Easy to add to existing test frameworks
  - Use existing debuggers
[Diagram: Program (While(not done) TestScenario()) on top of CHESS, on top of the Win32 API, on top of kernel threads, scheduler, and synchronization objects]
9. Implementation details
- Handle all the Win32 synchronization mechanisms
  - Critical sections, locks, semaphores, events, ...
  - Threadpools
  - Asynchronous procedure calls
  - Timers
  - IO Completions
- No modification to the kernel scheduler / Win32 library
- CHESS drives the system along a desired interleaving by hijacking the scheduler
10. Controlling the Scheduling Nondeterminism
- Nondeterministic choices for the scheduler
  - Determine when to context switch
  - On context switch, pick the next runnable thread to run
  - On resource release, wake up one of the waiting threads
- Hijack these choices from the scheduler
  - Ensure at most one thread is runnable
  - No thread is waiting on a resource
  - At chosen schedule points, block the current thread while waking the next thread
- Emulate program execution on a uniprocessor with context switches only at synchronization points
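The idea of keeping at most one thread runnable and handing control over at schedule points can be sketched with standard C++ primitives. This is an illustration only, not how CHESS intercepts Win32 calls: `TurnScheduler`, `wait_for_turn`, `end_step`, and `run_with_schedule` are hypothetical names, and the schedule is assumed to list each thread id exactly as many times as that thread has steps (otherwise the threads would wait forever).

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Lets exactly one thread run at a time, in the order given by `schedule`.
class TurnScheduler {
public:
    explicit TurnScheduler(std::vector<int> schedule)
        : schedule_(std::move(schedule)), position_(0) {}

    // Schedule point: block the caller until the schedule names it next.
    void wait_for_turn(int thread_id) {
        std::unique_lock<std::mutex> lock(mutex_);
        turn_changed_.wait(lock, [this, thread_id] {
            return position_ < schedule_.size() &&
                   schedule_[position_] == thread_id;
        });
    }

    // Finish the current step and wake whichever thread is scheduled next.
    void end_step() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++position_;
        }
        turn_changed_.notify_all();
    }

private:
    std::vector<int> schedule_;
    std::size_t position_;
    std::mutex mutex_;
    std::condition_variable turn_changed_;
};

// Runs two threads with two steps each under the given schedule and
// returns the order in which the steps actually executed.
std::vector<std::string> run_with_schedule(const std::vector<int>& schedule) {
    TurnScheduler scheduler(schedule);
    std::vector<std::string> trace;

    auto worker = [&](int id, int steps) {
        for (int s = 0; s < steps; ++s) {
            scheduler.wait_for_turn(id);
            // Only the scheduled thread reaches this point, so the
            // shared trace needs no further locking.
            trace.push_back("T" + std::to_string(id) + ".step" + std::to_string(s));
            scheduler.end_step();
        }
    };

    std::thread t0(worker, 0, 2);
    std::thread t1(worker, 1, 2);
    t0.join();
    t1.join();
    return trace;
}
```

Calling `run_with_schedule({0, 1, 1, 0})` replays the same interleaving on every run, which is the property that makes an error trace repeatable.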
11. Partial-order reduction
- Many thread interleavings are equivalent
  - Accesses to separate memory locations by different threads can be reordered
- Avoid exploring equivalent thread interleavings
[Diagram: the order "T1: x = 1; T2: y = 2" is equivalent to the order "T2: y = 2; T1: x = 1"]
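The equivalence in the diagram can be checked mechanically. The sketch below is a toy model, assuming each step is a single write of a constant to one of two variables; `State`, `Write`, `apply`, and `commute` are hypothetical names introduced for the illustration.

```cpp
#include <cassert>

// Shared state touched by the two threads in the example.
struct State {
    int x;
    int y;
};

// A single write step, e.g. {'x', 1} stands for "x = 1".
struct Write {
    char variable;  // 'x' or 'y'
    int value;
};

// Returns the state that results from performing one write.
State apply(State state, Write write) {
    assert(write.variable == 'x' || write.variable == 'y');
    if (write.variable == 'x') {
        state.x = write.value;
    } else {
        state.y = write.value;
    }
    return state;
}

// Two steps commute if running them in either order from the same
// starting state produces the same final state.
bool commute(Write first, Write second) {
    const State start = {0, 0};
    const State ab = apply(apply(start, first), second);
    const State ba = apply(apply(start, second), first);
    return ab.x == ba.x && ab.y == ba.y;
}
```

Here `commute({'x', 1}, {'y', 2})` is true, so only one of the two orders needs to be explored, while `commute({'x', 1}, {'x', 2})` is false because both writes target the same location.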
12. Partial-order reduction in CHESS
- Algorithm
  - Assume the program is data-race free
  - Context switch only at synchronization points
  - Check for data-races in each execution
- Theorem
  - If the algorithm terminates without reporting races, then the program has no assertion failures
13. Executions on Multi-cores
- CHESS checks for data-races
- If a Test Scenario manifests a bug on a multi-core machine, then CHESS will
  - Either report a data-race
  - Or the bug
- CHESS systematically enumerates all sequentially consistent executions
- Any data-race free multi-core execution is equivalent to a sequentially consistent execution
14. State space explosion
[Diagram: tree of all interleavings of Thread 1 (x = 1; y = 1) and Thread 2 (x = 2; y = 2), starting from the state (x, y) = (0,0); each node is labeled with the resulting (x, y) values]
15. State space explosion
- n threads (Thread 1, ..., Thread n), k steps each
- Number of executions
  - $O(n^{nk})$
  - Exponential in both n and k
  - Typically $n < 10$, $k > 100$
- Limits scalability to large programs (large k)
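To get a feel for the growth, the exact number of interleavings of n threads with k steps each can be computed for small inputs. The sketch assumes the steps are atomic and that every thread is always enabled; under those assumptions the count is the multinomial coefficient (nk)! / (k!)^n, which never exceeds n^(nk). The function names are hypothetical, and the 64-bit arithmetic overflows for larger inputs.

```cpp
#include <cassert>
#include <cstdint>

// Binomial coefficient C(n, r). After iteration i the running value is
// C(n - r + i, i), so every intermediate result is an exact integer.
std::uint64_t binomial(std::uint64_t n, std::uint64_t r) {
    assert(r <= n);
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= r; ++i) {
        result = result * (n - r + i) / i;
    }
    return result;
}

// Number of ways to interleave `threads` threads of `steps` steps each:
// (threads * steps)! / (steps!)^threads, built up one thread at a time
// by choosing which slots of the merged sequence the new thread occupies.
std::uint64_t count_interleavings(std::uint64_t threads, std::uint64_t steps) {
    std::uint64_t total = 1;
    for (std::uint64_t t = 1; t <= threads; ++t) {
        total *= binomial(t * steps, steps);
    }
    return total;
}
```

For two threads of two steps each, `count_interleavings(2, 2)` gives 6. Keeping two threads but raising the step count to 10 already gives 184756, which is why a large k limits scalability.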
16. Bounding execution depth
- Works very well for message-passing programs
  - Limit the number of message exchanges
  - Message processing code executed atomically
  - Can go deep in the state space
- Does not work for multithreaded programs
  - Even toy programs can have large number of steps (shared-variable accesses)
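17. Iterative context bounding
- Prioritize executions with small number of preemptions
- Two kinds of context switches
  - Preemptions: forced by the scheduler
    - e.g. time-slice expiration
  - Non-preemptions: a thread voluntarily yields
    - e.g. blocking on an unavailable lock, thread end
[Diagram: Thread 1 runs "x = 1; if (p != 0) { x = p->f; }" and Thread 2 runs "p = 0". In the execution shown, Thread 1 executes "x = 1; if (p != 0)", a preemption switches to Thread 2, which executes "p = 0", and a non-preemption switches back to Thread 1, which executes "x = p->f".]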
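The distinction between the two kinds of context switch can be made concrete with a small classifier. This is a simplified model, assuming a thread can keep running exactly when it still has steps left (blocking on locks is ignored); `SwitchCounts` and `classify_switches` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct SwitchCounts {
    int preemptions;
    int non_preemptions;
};

// `schedule` lists the thread id that executes each step, in order.
// `remaining` holds the number of steps each thread has to execute.
// A switch away from a thread that still has steps left is a preemption;
// a switch away from a finished thread is a non-preemption.
SwitchCounts classify_switches(const std::vector<int>& schedule,
                               std::vector<int> remaining) {
    SwitchCounts counts = {0, 0};
    int current = -1;  // no thread has run yet
    for (int next : schedule) {
        assert(next >= 0 && next < static_cast<int>(remaining.size()));
        assert(remaining[next] > 0);
        if (current != -1 && next != current) {
            if (remaining[current] > 0) {
                ++counts.preemptions;
            } else {
                ++counts.non_preemptions;
            }
        }
        --remaining[next];
        current = next;
    }
    return counts;
}
```

Modeling the example with Thread 1 as id 0 (three steps) and Thread 2 as id 1 (one step), the schedule `{0, 0, 1, 0}` contains one preemption and one non-preemption, matching the diagram.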
18. Iterative context-bounding algorithm
- The scheduler has a budget of c preemptions
  - Nondeterministically choose the preemption points
- Resort to non-preemptive scheduling after c preemptions
- Once all executions explored with c preemptions
  - Try with c+1 preemptions
- Iterative context-bounding has desirable properties
  - Property 0: Easy to implement
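A rough sketch of the search, under the same simplifying assumption that a thread is enabled exactly while it has steps left. It only counts schedules rather than running a program, and it restarts the search for each bound instead of reusing earlier work, so it should be read as an illustration of the budget rule and not as the CHESS implementation. All function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Depth-first enumeration of schedules. `current` is the thread that ran
// the previous step (-1 at the start) and `budget` is the number of
// preemptions still allowed.
std::uint64_t explore(std::vector<int>& remaining, int current, int budget) {
    bool finished = true;
    for (int steps_left : remaining) {
        if (steps_left > 0) finished = false;
    }
    if (finished) return 1;  // one complete execution

    std::uint64_t total = 0;
    for (std::size_t t = 0; t < remaining.size(); ++t) {
        if (remaining[t] == 0) continue;
        const int next = static_cast<int>(t);
        // Switching away from a thread that could continue is a preemption.
        const bool preempts =
            current != -1 && next != current && remaining[current] > 0;
        if (preempts && budget == 0) continue;  // out of budget: no preemption
        --remaining[t];
        total += explore(remaining, next, preempts ? budget - 1 : budget);
        ++remaining[t];
    }
    return total;
}

// Number of schedules of `threads` threads with `steps` steps each that
// use at most `max_preemptions` preemptions.
std::uint64_t count_schedules(int threads, int steps, int max_preemptions) {
    std::vector<int> remaining(threads, steps);
    return explore(remaining, -1, max_preemptions);
}

// Iterative context bounding: search with bound 0, then 1, then 2, ...
std::vector<std::uint64_t> iterative_context_bounding(int threads, int steps,
                                                      int max_bound) {
    std::vector<std::uint64_t> counts;
    for (int c = 0; c <= max_bound; ++c) {
        counts.push_back(count_schedules(threads, steps, c));
    }
    return counts;
}
```

For two threads of two steps each, `iterative_context_bounding(2, 2, 2)` returns 2, 4, 6: two schedules need no preemption, two more need one, and the last two need two. Non-preempting switches are never charged against the budget, so even a bound of zero produces complete executions.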
19. Property 1: Polynomial state space
- Terminating program with fixed inputs and deterministic threads
  - n threads, k steps each, c preemptions
- Number of executions $\le \binom{nk}{c} \cdot (n+c)!$
  - $= O\big((n^2 k)^c \cdot n!\big)$
  - Exponential in n and c, but not in k
- Choose c preemption points
[Diagram: Thread 1 (x = 1; y = 1) and Thread 2 (x = 2; y = 2), with preemption points chosen among their steps, for example the interleaving x = 1; x = 2; y = 1; y = 2]
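One way to see why the bound is polynomial in k is to expand it step by step. The derivation below is a sketch that reads "nkCc" as the binomial coefficient "nk choose c" and additionally assumes $c \le n$ in the third line; neither the intermediate steps nor that assumption appear on the slide.

```latex
\begin{align*}
\binom{nk}{c} \cdot (n+c)!
  &\le (nk)^c \cdot n! \cdot \prod_{i=1}^{c} (n+i) \\
  &\le (nk)^c \cdot n! \cdot (n+c)^c \\
  &\le (nk)^c \cdot n! \cdot (2n)^c && \text{assuming } c \le n \\
  &= 2^c \cdot (n^2 k)^c \cdot n!
\end{align*}
```

For a fixed bound c the factor $2^c$ is a constant, which gives $O\big((n^2 k)^c \cdot n!\big)$: k appears only as $k^c$, so the count grows polynomially in the number of steps per thread.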
20. Property 2: Deep exploration possible with small bounds
- A context-bounded execution has unbounded depth
  - A thread may execute unbounded number of steps within each context
- Even a context-bound of zero yields complete terminating executions
21. Property 3: Finds the simplest error trace
- Finds smallest number of preemptions to the error
- Number of preemptions better metric of error complexity than execution length
22. Property 4: Coverage metric
- If search terminates with context-bound of c, then any remaining error must require at least c+1 preemptions
- Intuitive estimate for
  - The complexity of the bugs remaining in the program
  - The chance of their occurrence in practice
23. Property 5: Lots of bugs with small number of preemptions
- A non-blocking implementation of the work-stealing queue algorithm
  - Bounded circular buffer accessed concurrently by readers and stealers
- Developer provided
  - test harness
  - three buggy variations of the program
- Each bug found with at most 2 preemptions
  - executions with 35 preemptions are possible!
24. Context-bounding + Partial-order reduction
- Algorithm
  - Assume the program is data-race free
  - Context switch only at synchronization points
  - Explore executions with c preemptions
  - Check for data-races in each execution
- Theorem
  - If the algorithm terminates without reporting races, then the program has no assertion failures reachable with c preemptions
- Requires that a thread can block only at synchronization points
- Proof (Musuvathi-Q, PLDI 2007)
25. Bugs found
26.
    // Function called by a worker thread
    // of RChannelReaderImpl
    void RChannelReaderImpl::AlertApplication(RChannelItem* item)
    {
        // Notify Application
        // XXX: Preempt here for the bug
        EnterCriticalSection(&m_baseCS);
        // process before exit
        LeaveCriticalSection(&m_baseCS);
    }

    // Function called by the main thread
    void TestChannel(WorkQueue* workQueue, ...)
    {
        // Creating a channel allocates worker threads
        RChannelReader* channel = new RChannelReaderImpl(..., workQueue);
        // ... do work here
        channel->Close();
        // wrong assumption that channel->Close()
        // waits for worker threads to be finished
        delete channel;
        // BUG: deleting the channel when
        // worker threads still have a valid
        // reference to the channel
    }
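One common way to avoid this class of use-after-free is to make closing the object wait for its worker before the memory is released. The sketch below is not Dryad code and is not claimed to be the fix that was applied there; `Channel` and its members are hypothetical, and standard C++ threads stand in for the Win32 primitives.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical channel that owns a single worker thread.
class Channel {
public:
    Channel() : alerts_(0), worker_([this] { alert_application(); }) {}

    // Close() really does wait for the worker here, so the assumption
    // made by the caller in the buggy code would hold for this class.
    void Close() {
        if (worker_.joinable()) worker_.join();
    }

    // The destructor closes first, so the worker can never touch a
    // deleted object even if the caller forgets to call Close().
    ~Channel() { Close(); }

    int alerts() const { return alerts_.load(); }

private:
    // Runs on the worker thread and uses members of this object.
    void alert_application() {
        std::lock_guard<std::mutex> lock(base_cs_);
        ++alerts_;
    }

    std::mutex base_cs_;
    std::atomic<int> alerts_;
    std::thread worker_;  // declared last: starts after the other members exist
};
```

With this shape, `channel->Close(); delete channel;` is safe in every interleaving, because the worker has finished by the time `Close()` returns.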
31. Facts about Dryad error trace
- Long error trace but requires only one preemption
  - Depth-bounding cannot find it without a lot of luck
- The error trace has 6 non-preempting context switches
  - It is important to leave unbounded the number of non-preempting context switches
- This (and the other 6 errors) in Dryad remained in spite of careful regression testing and months of production use
32. Bugs found
33. Coverage vs. Context-bound
34. Dryad (coverage vs. time)
35. Current CHESS applications (work in progress)
- Dryad (library for distributed dataflow programming)
- Singularity/Midori (OS in managed code)
- User-mode drivers
- Cosmos (distributed file system)
- SQL database
36. Conclusion
- Concurrency is important
  - Building robust concurrent software is still a challenge
  - Lack of debugging and testing tools
- CHESS: Concurrency unit-testing
  - Exhaustively try all interleavings
  - Attempt to seamlessly integrate with existing test frameworks
  - Provide replay capability
- Iterative context-bounding algorithm key to the design