CHESS: Systematic Testing of Concurrent Programs

1
CHESS: Systematic Testing of Concurrent Programs
  • Madan Musuvathi
  • Shaz Qadeer
  • Microsoft Research

2
Testing multithreaded programs is HARD
  • Specific thread interleavings expose subtle
    errors
  • Testing often misses these errors
  • Even when found, errors are hard to debug
      • No repeatable trace
      • Source of the bug is far away from where it
        manifests

3
Concurrency is a real problem
  • Windows 2000 hot fixes
      • Concurrency errors most common defects among
        "detectable errors"
      • Incorrect synchronization and protocol errors
        most common defects among all coding errors
  • Windows Server 2003 late cycle defects
      • Synchronization errors second in the list, next
        to buffer overruns
  • Race conditions can result in security exploits

4
Current practice
  • Concurrency testing = Stress testing
  • Example: testing a concurrent queue
      • Create 100 threads performing queue operations
      • Run for days/weeks
      • Pepper the code with sleep(random())
  • Stress increases the likelihood of rare
    interleavings
      • Makes any error found hard to debug

5
CHESS: Unit testing for concurrency
  • Example: testing a concurrent queue
      • Create 1 reader thread and 1 writer thread
      • Exhaustively try all thread interleavings
  • Run the test repeatedly on a specialized
    scheduler
      • Explore a different thread interleaving each time
      • Use model checking techniques to avoid redundancy
  • Check for assertion failures and deadlocks in every run
      • The error-trace is repeatable

6
Systematic Stress Testing Using CHESS
  • Tester provides a test scenario
      • Program: while (not done) TestScenario()
  • CHESS runs the scenario in a loop
      • Every run takes a different interleaving
      • Every run is repeatable
  • [Diagram: the program's TestScenario() runs on top of
    CHESS, which sits on top of the Win32 API and the
    kernel (threads, scheduler, synchronization objects)]
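
The two properties above (a different interleaving per run, and every run repeatable) can be illustrated with a small model. This is only a sketch under stated assumptions: the threads are modeled as step sequences rather than real Win32 threads, and the names `Shared`, `run_step`, and `run_scenario` are hypothetical, not part of CHESS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Shared state of a toy test scenario (hypothetical, not from the slides).
struct Shared {
    int counter = 0;
    int local[2] = {0, 0};  // each thread's private copy of the counter
};

// Each thread performs a non-atomic increment in two steps:
// step 0 reads the counter, step 1 writes back the incremented copy.
void run_step(Shared& s, int tid, std::size_t pc) {
    if (pc == 0) {
        s.local[tid] = s.counter;
    } else {
        s.counter = s.local[tid] + 1;
    }
}

// Run the scenario once under an explicit schedule: schedule[i] names the
// thread that executes the next step. The state is rebuilt on every call,
// so the scenario is idempotent and a given schedule always produces the
// same result.
int run_scenario(const std::vector<int>& schedule) {
    Shared s;
    std::size_t pc[2] = {0, 0};
    for (int tid : schedule) {
        assert(tid == 0 || tid == 1);
        assert(pc[tid] < 2);
        run_step(s, tid, pc[tid]);
        ++pc[tid];
    }
    return s.counter;
}
```

Under the schedule {0, 0, 1, 1} the counter ends at 2, while {0, 1, 0, 1} loses an update and ends at 1. Replaying either schedule reproduces the same outcome, which is the sense in which a recorded interleaving makes the error trace repeatable.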
7
Conditions on Test Scenario
  • Test scenario should terminate in all
    interleavings
  • Test scenario should be idempotent
      • Free all resources (handles, memory, ...)
      • Clear the hardware state
  • Key observation
      • Existing stress tests already have these
        properties
      • Because they repeatedly run forever

8
Perturb the System as Little as Possible
  • Run the system as is
      • On the actual OS, hardware
      • Using system threads, synchronization
  • Detour Win32 API calls
      • To control and introduce nondeterminism
  • Advantages
      • Avoid reporting false errors
      • Easy to add to existing test frameworks
      • Use existing debuggers
  • [Diagram: the program (while (not done) TestScenario())
    runs on top of CHESS, which sits on top of the Win32
    API and the kernel (threads, scheduler,
    synchronization objects)]
9
Implementation details
  • Handle all the Win32 synchronization mechanisms
      • Critical sections, locks, semaphores, events, ...
      • Threadpools
      • Asynchronous procedure calls
      • Timers
      • IO Completions
  • No modification to the kernel scheduler / Win32
    library
  • CHESS drives the system along a desired
    interleaving by hijacking the scheduler

10
Controlling the Scheduling Nondeterminism
  • Nondeterministic choices for the scheduler
      • Determine when to context switch
      • On context switch, pick the next runnable thread
        to run
      • On resource release, wake up one of the waiting
        threads
  • Hijack these choices from the scheduler
      • Ensure at most one thread is runnable
      • No thread is waiting on a resource
      • At chosen schedule points, block the current
        thread while waking the next thread
  • Emulate program execution on a uniprocessor with
    context switches only at synchronization points

11
Partial-order reduction
  • Many thread interleavings are equivalent
  • Accesses to separate memory locations by
    different threads can be reordered
  • Avoid exploring equivalent thread interleavings

[Diagram: the order "T1: x = 1, then T2: y = 2" is equivalent
to the order "T2: y = 2, then T1: x = 1"]
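
To make the notion of equivalent interleavings concrete, the following sketch enumerates every interleaving of two threads and groups them by the relative order of their conflicting writes. It is a model, not the CHESS implementation: the `Op` type, the "same variable, different thread" dependence rule, and all function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

struct Op {
    int id;      // unique identifier of the operation
    int thread;  // thread that performs it
    char var;    // shared variable it writes
};

// Two writes conflict when different threads write the same variable.
bool dependent(const Op& a, const Op& b) {
    return a.thread != b.thread && a.var == b.var;
}

// Enumerate every interleaving of t1 and t2, preserving program order.
void interleave(const std::vector<Op>& t1, const std::vector<Op>& t2,
                std::size_t i, std::size_t j, std::vector<Op>& prefix,
                std::vector<std::vector<Op>>& out) {
    if (i == t1.size() && j == t2.size()) {
        out.push_back(prefix);
        return;
    }
    if (i < t1.size()) {
        prefix.push_back(t1[i]);
        interleave(t1, t2, i + 1, j, prefix, out);
        prefix.pop_back();
    }
    if (j < t2.size()) {
        prefix.push_back(t2[j]);
        interleave(t1, t2, i, j + 1, prefix, out);
        prefix.pop_back();
    }
}

// An interleaving's equivalence class is determined by which operation
// comes first in each conflicting pair.
using Signature = std::set<std::pair<int, int>>;

Signature signature(const std::vector<Op>& run) {
    Signature sig;
    for (std::size_t a = 0; a < run.size(); ++a)
        for (std::size_t b = a + 1; b < run.size(); ++b)
            if (dependent(run[a], run[b]))
                sig.emplace(run[a].id, run[b].id);
    return sig;
}

struct Counts {
    std::size_t interleavings;
    std::size_t classes;
};

Counts count_classes(const std::vector<Op>& t1, const std::vector<Op>& t2) {
    std::vector<std::vector<Op>> runs;
    std::vector<Op> prefix;
    interleave(t1, t2, 0, 0, prefix, runs);
    assert(!runs.empty());  // even two empty threads yield one (empty) run
    std::set<Signature> classes;
    for (const auto& run : runs) classes.insert(signature(run));
    return {runs.size(), classes.size()};
}
```

For the slide's example (one thread writes x, the other writes y) the two interleavings fall into a single class, so exploring one of them is enough. If each thread instead writes x and then y, the 6 interleavings collapse into 4 classes.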
12
Partial-order reduction in CHESS
  • Algorithm
      • Assume the program is data-race free
      • Context switch only at synchronization points
      • Check for data-races in each execution
  • Theorem
      • If the algorithm terminates without reporting
        races, then the program has no assertion
        failures

13
Executions on Multi-cores
  • CHESS checks for data-races
  • If a Test Scenario manifests a bug on a
    multi-core machine, then CHESS will
      • Either report a data-race
      • Or report the bug
  • CHESS systematically enumerates all sequentially
    consistent executions
  • Any data-race free multi-core execution is
    equivalent to a sequentially consistent execution

14
State space explosion
[Figure: state-space tree for Thread 1 (x = 1; y = 1) and
Thread 2 (x = 2; y = 2). Nodes are (x, y) values starting at
0,0; edges are the four assignments; each root-to-leaf path
is one interleaving.]
15
State space explosion
  • n threads (Thread 1 ... Thread n), k steps each
  • Number of executions
      • $O(n^{nk})$
      • Exponential in both n and k
      • Typically n < 10, k > 100
  • Limits scalability to large programs (large k)
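
As a rough illustration of this growth, the sketch below counts interleavings exactly under simplifying assumptions that are not stated on the slide: every thread is always enabled and each of its k steps is a distinct scheduling point. In that model the count is (nk)! / (k!)^n, which is bounded above by n^(nk). The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Binomial coefficient C(n, r) in exact integer arithmetic.
// This sketch does no overflow checking, so keep the inputs small.
std::uint64_t binomial(std::uint64_t n, std::uint64_t r) {
    assert(r <= n);
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= r; ++i) {
        // result holds C(n - r + i - 1, i - 1), so the division is exact.
        result = result * (n - r + i) / i;
    }
    return result;
}

// Number of interleavings of n threads with k steps each:
// (nk)! / (k!)^n, computed as C(k, k) * C(2k, k) * ... * C(nk, k).
std::uint64_t interleavings(std::uint64_t n, std::uint64_t k) {
    std::uint64_t total = 1;
    for (std::uint64_t i = 1; i <= n; ++i) {
        total *= binomial(i * k, k);
    }
    return total;
}
```

With two threads the count goes from 6 at k = 2 to 184756 at k = 10, which shows why large k is what limits scalability.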
16
Bounding execution depth
  • Works very well for message-passing programs
      • Limit the number of message exchanges
      • Message processing code executed atomically
      • Can go deep in the state space
  • Does not work for multithreaded programs
      • Even toy programs can have a large number of
        steps (shared-variable accesses)

17
Iterative context bounding
  • Prioritize executions with a small number of
    preemptions
  • Two kinds of context switches
      • Preemptions: forced by the scheduler
          • e.g. time-slice expiration
      • Non-preemptions: a thread voluntarily yields
          • e.g. blocking on an unavailable lock, thread end

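[Diagram: Thread 1 runs "x = 1; if (p != 0) x = p->f;" and
Thread 2 runs "p = 0". Thread 1 executes "x = 1; if (p != 0)",
a preemption switches to Thread 2, which executes "p = 0";
when Thread 2 ends, a non-preemption switches back to
Thread 1, which executes "x = p->f".]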
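
The distinction between the two kinds of context switch can be sketched as a small classifier over a recorded schedule. This is a simplified model, assuming threads never block, so the only non-preempting switch is one that happens after the previous thread has finished; `SwitchCounts` and `classify_switches` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SwitchCounts {
    int preemptions = 0;
    int non_preemptions = 0;
};

// schedule[i] is the thread that runs step i; steps[t] is the total number
// of steps thread t executes. A switch away from a thread that still has
// steps left counts as a preemption; a switch away from a finished thread
// counts as a non-preemption.
SwitchCounts classify_switches(const std::vector<int>& schedule,
                               const std::vector<int>& steps) {
    std::vector<int> remaining = steps;
    SwitchCounts counts;
    for (std::size_t i = 0; i < schedule.size(); ++i) {
        const int t = schedule[i];
        assert(t >= 0 && static_cast<std::size_t>(t) < remaining.size());
        assert(remaining[t] > 0);
        if (i > 0 && schedule[i - 1] != t) {
            const int prev = schedule[i - 1];
            if (remaining[prev] > 0) {
                ++counts.preemptions;
            } else {
                ++counts.non_preemptions;
            }
        }
        --remaining[t];
    }
    return counts;
}
```

Modeling the slide's example with three steps for the first thread and one for the second, the schedule {0, 0, 1, 0} contains one preemption (before "p = 0") and one non-preemption (after the second thread ends), whereas {0, 0, 0, 1} contains no preemption at all.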
18
Iterative context-bounding algorithm
  • The scheduler has a budget of c preemptions
      • Nondeterministically choose the preemption points
  • Resort to non-preemptive scheduling after c
    preemptions
  • Once all executions are explored with c preemptions
      • Try with c+1 preemptions
  • Iterative context-bounding has desirable
    properties
      • Property 0: Easy to implement
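
A minimal sketch of the search, assuming a toy model in which each thread is a list of atomic steps and never blocks: it explores every schedule that uses at most c preemptions and raises c until a bug is reached. The program modeled here is a null-pointer race (one thread checks and then dereferences p while another clears it); `State`, `Program`, `bug_reachable`, and `min_preemptions_to_bug` are hypothetical names, not the CHESS API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct State {
    bool p_valid = true;   // models "p != 0"
    bool checked = false;  // first thread saw p != 0 in its if-test
    bool crashed = false;  // first thread dereferenced p after it was cleared
};

using Step = std::function<void(State&)>;
using Program = std::vector<std::vector<Step>>;

Program null_deref_example() {
    std::vector<Step> t1 = {
        [](State&) {},                             // x = 1
        [](State& s) { s.checked = s.p_valid; },   // if (p != 0)
        [](State& s) {                             // x = p->f
            if (s.checked && !s.p_valid) s.crashed = true;
        },
    };
    std::vector<Step> t2 = {
        [](State& s) { s.p_valid = false; },       // p = 0
    };
    return {t1, t2};
}

// Depth-first search over schedules. Switching away from a thread that can
// still run costs one unit of the preemption budget; switching after a
// thread has finished is free.
bool bug_reachable(const Program& prog, std::vector<std::size_t> pc,
                   State s, int current, int budget) {
    if (s.crashed) return true;
    const bool current_enabled =
        current >= 0 && pc[current] < prog[current].size();
    for (std::size_t t = 0; t < prog.size(); ++t) {
        if (pc[t] >= prog[t].size()) continue;  // thread t has finished
        const bool preempt =
            current_enabled && static_cast<int>(t) != current;
        if (preempt && budget == 0) continue;   // budget exhausted
        State next = s;
        std::vector<std::size_t> next_pc = pc;
        prog[t][next_pc[t]](next);
        ++next_pc[t];
        if (bug_reachable(prog, next_pc, next, static_cast<int>(t),
                          budget - (preempt ? 1 : 0))) {
            return true;
        }
    }
    return false;
}

// Try bounds 0, 1, 2, ... and return the first bound that exposes the bug,
// or -1 if none does up to max_bound.
int min_preemptions_to_bug(const Program& prog, int max_bound) {
    assert(max_bound >= 0);
    for (int c = 0; c <= max_bound; ++c) {
        std::vector<std::size_t> pc(prog.size(), 0);
        if (bug_reachable(prog, pc, State{}, -1, c)) return c;
    }
    return -1;
}
```

On this example the search finds nothing with a budget of 0 and reports the bug at a budget of 1, the single preemption between the if-test and the dereference. For simplicity the sketch re-explores the executions of smaller bounds at each iteration; a real tool would avoid that redundancy.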

19
Property 1: Polynomial state space
  • Terminating program with fixed inputs and
    deterministic threads
      • n threads, k steps each, c preemptions
  • Number of executions $< \binom{nk}{c} \cdot (n+c)!$
      • $O((n^2 k)^c \cdot n!)$
      • Exponential in n and c, but not in k
  • Choose c preemption points
  • Permute n+c atomic blocks
  • [Diagram: Thread 1 (x = 1; y = 1) and Thread 2
    (x = 2; y = 2) split at the chosen preemption points
    into atomic blocks]
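
The bound on this slide is easy to evaluate for concrete values. The sketch below computes C(nk, c) · (n+c)! in 64-bit integer arithmetic; it is only an illustration, performs no overflow checking, and uses hypothetical function names.

```cpp
#include <cassert>
#include <cstdint>

// n! in exact integer arithmetic (no overflow check; keep n small).
std::uint64_t factorial(std::uint64_t n) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 2; i <= n; ++i) result *= i;
    return result;
}

// Binomial coefficient C(n, r); each intermediate division is exact.
std::uint64_t binomial(std::uint64_t n, std::uint64_t r) {
    assert(r <= n);
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= r; ++i) {
        result = result * (n - r + i) / i;
    }
    return result;
}

// Upper bound on the number of executions with c preemptions:
// choose c preemption points among the nk steps, then permute the
// resulting n + c atomic blocks.
std::uint64_t context_bound_limit(std::uint64_t n, std::uint64_t k,
                                  std::uint64_t c) {
    assert(c <= n * k);
    return binomial(n * k, c) * factorial(n + c);
}
```

For n = 2, k = 100, and c = 2 the bound is 477600. By comparison, the number of unrestricted interleavings of two 100-step threads is C(200, 100), which is on the order of 10^58.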
20
Property 2: Deep exploration possible with small
bounds
  • A context-bounded execution has unbounded depth
      • A thread may execute an unbounded number of
        steps within each context
  • Even a context-bound of zero yields complete
    terminating executions

21
Property 3: Finds the simplest error trace
  • Finds the smallest number of preemptions to the error
  • Number of preemptions is a better metric of error
    complexity than execution length

22
Property 4: Coverage metric
  • If search terminates with context-bound of c,
    then any remaining error must require at least
    c+1 preemptions
  • Intuitive estimate for
      • The complexity of the bugs remaining in the
        program
      • The chance of their occurrence in practice

23
Property 5: Lots of bugs with small number of
preemptions
  • A non-blocking implementation of the
    work-stealing queue algorithm
      • Bounded circular buffer accessed concurrently by
        readers and stealers
  • Developer provided
      • Test harness
      • Three buggy variations of the program
  • Each bug found with at most 2 preemptions
      • Executions with 35 preemptions are possible!

24
Context-bounding + Partial-order reduction
  • Algorithm
      • Assume the program is data-race free
      • Context switch only at synchronization points
      • Explore executions with c preemptions
      • Check for data-races in each execution
  • Theorem
      • If the algorithm terminates without reporting
        races, then the program has no assertion
        failures reachable with c preemptions
      • Requires that a thread can block only at
        synchronization points
  • Proof (Musuvathi-Qadeer, PLDI 2007)

25
Bugs found
26
// Function called by a worker thread
// of RChannelReaderImpl
void RChannelReaderImpl::AlertApplication(RChannelItem* item)
{
    // Notify Application
    // XXX: Preempt here for the bug
    EnterCriticalSection(&m_baseCS);
    // process before exit
    LeaveCriticalSection(&m_baseCS);
}

// Function called by the main thread
void TestChannel(WorkQueue* workQueue, ...)
{
    // Creating a channel
    // allocates worker threads
    RChannelReader* channel =
        new RChannelReaderImpl(..., workQueue);
    // ... do work here
    channel->Close();
    // wrong assumption that channel->Close()
    // waits for worker threads to be finished
    delete channel;
    // BUG: deleting the channel when
    // worker threads still have a valid
    // reference to the channel
}
31
Facts about Dryad error trace
  • Long error trace but requires only one preemption
      • Depth-bounding cannot find it without a lot of
        luck
  • The error trace has 6 non-preempting context
    switches
      • It is important to leave unbounded the number of
        non-preempting context switches
  • This error (and the other 6 errors) in Dryad remained
    in spite of careful regression testing and months
    of production use

32
Bugs found
33
Coverage vs. Context-bound
34
Dryad (coverage vs. time)
35
Current CHESS applications (work in progress)
  • Dryad (library for distributed dataflow
    programming)
  • Singularity/Midori (OS in managed code)
  • User-mode drivers
  • Cosmos (distributed file system)
  • SQL database

36
Conclusion
  • Concurrency is important
  • Building robust concurrent software is still a
    challenge
      • Lack of debugging and testing tools
  • CHESS: Concurrency unit-testing
      • Exhaustively try all interleavings
      • Attempt to seamlessly integrate with existing
        test frameworks
      • Provide replay capability
  • Iterative context-bounding algorithm is key to the
    design