Title: Non-intrusive on-the-fly data race detection using execution replay
1. Non-intrusive on-the-fly data race detection
using execution replay
AADEBUG 2000 - MUNCHEN
- Michiel Ronsse - Koen De Bosschere
- Ghent University - Belgium
2. Contents
- Introduction
- Non-determinism and data races
- RecPlay
- Method
- Implementation
- Example
- Experimental Evaluation
- Conclusions
3. Introduction
- Developing parallel programs for multiprocessors with shared memory is considered difficult:
  - number of threads running simultaneously
  - co-operation and synchronisation through shared memory
  - too much synchronisation: deadlock
  - too little synchronisation: race condition
- Cyclic debugging is impossible due to the non-deterministic nature of most parallel programs ⇒ program execution is not repeatable
4. Causes of non-determinism
- Sequential programs: input (keyboard, disk, network), signals, interrupts, certain system calls (gettimeofday(), ...)
- Parallel programs: race conditions
  - two threads
  - accessing the same shared variable (memory location)
  - in an unsynchronised way
  - and at least one thread modifies the variable
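5. Example code
#include <pthread.h>
#include <stdio.h>
unsigned global = 5;  /* shared and deliberately unprotected */
void *thread1(void *arg) { (void)arg; global = global + 6; return NULL; }
void *thread2(void *arg) { (void)arg; global = global + 7; return NULL; }
int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("global=%u\n", global);  /* result depends on the interleaving */
    return 0;
}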
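6. Possible executions
Three interleavings of the two threads (L = load, A = add, S = store; the value loaded or stored is in parentheses):
- thread1: L(5), A, S(11); then thread2: L(11), A, S(18): global=18
- both threads: L(5), A; thread2: S(12); then thread1: S(11): global=11
- both threads: L(5), A; thread1: S(11); then thread2: S(12): global=12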
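The three outcomes can be reproduced deterministically by replaying a fixed interleaving of the load, add and store steps. The following is a minimal sketch, not part of the original slides: `sim_thread`, `simulate` and the schedule encoding are hypothetical names chosen for illustration, and each thread is assumed to execute exactly one load, one add and one store.

```c
#include <stddef.h>

/* One thread of the example: load global, add a constant, store the result. */
typedef struct {
    unsigned addend;  /* 6 for thread1, 7 for thread2 */
    unsigned reg;     /* private register holding the loaded value */
    int step;         /* 0 = load next, 1 = add next, 2 = store next, 3 = done */
} sim_thread;

/* Run a schedule: schedule[k] names the thread (0 or 1) that performs its
 * next step. Returns the final value of the shared variable. */
unsigned simulate(const int *schedule, size_t len)
{
    unsigned global = 5;
    sim_thread t[2] = { { 6, 0, 0 }, { 7, 0, 0 } };

    for (size_t k = 0; k < len; k++) {
        sim_thread *cur = &t[schedule[k]];
        switch (cur->step) {
        case 0: cur->reg = global;       break;  /* L: load  */
        case 1: cur->reg += cur->addend; break;  /* A: add   */
        case 2: global = cur->reg;       break;  /* S: store */
        default: continue;                       /* thread already finished */
        }
        cur->step++;
    }
    return global;
}
```

With the schedule 0,0,0,1,1,1 thread1 finishes before thread2 starts and the result is 18; the schedules 0,1,0,1,1,0 and 0,1,0,1,0,1 let both threads load 5 first, so one update is lost and the result is 11 or 12.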
7. Race conditions
- Two types:
  - synchronisation races
    - do not allow us to use cyclic debugging
    - are not a bug: desired non-determinism
  - data races
    - do not allow us to use cyclic debugging
    - are a bug: undesired non-determinism
- The distinction is a matter of abstraction
- Automatic detection of data races is possible:
  - collect all memory references
  - check parallel references
8. Detecting data races
- Static methods:
  - checking the source code for all possible executions with all possible input
  - NP-complete ⇒ not feasible
- Dynamic methods:
  - during an actual execution ⇒ only detects data races during this execution
- Removal requires cyclic debugging
9. Dynamic data race detection
- The piece of code between two consecutive synchronisation operations is a segment
- For every segment i of every thread we collect two sets, L(i) and S(i), with the addresses of all load and store operations
- For all parallel segments i and j, $((L(i) \cup S(i)) \cap S(j)) \cup ((L(j) \cup S(j)) \cap S(i))$ gives the list of conflicting addresses.
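The conflict check for two parallel segments can be sketched with plain bit operations, assuming the usual rule that two accesses to the same address conflict when at least one of them is a store. The 64-address toy sets and the names `segment_sets` and `conflicts` are illustrative assumptions, not RecPlay's actual data structures.

```c
#include <stdint.h>

/* Toy address sets: bit a is set when address a (0..63) was accessed. */
typedef struct {
    uint64_t loads;   /* L(i): addresses read in the segment    */
    uint64_t stores;  /* S(i): addresses written in the segment */
} segment_sets;

/* Addresses on which two parallel segments conflict:
 * ((L(i) | S(i)) & S(j)) | ((L(j) | S(j)) & S(i)).
 * Load/load pairs are excluded because no store is involved. */
uint64_t conflicts(const segment_sets *i, const segment_sets *j)
{
    return ((i->loads | i->stores) & j->stores) |
           ((j->loads | j->stores) & i->stores);
}
```

A non-zero result means the two segments race on every address whose bit is set; an address that both segments only load never shows up.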
10. Existing race detection methods
- Huge overhead, causing probe effect and Heisenbugs
- Only detect the existence of a data race (and the variable), not the instructions involved
- It is a bug: we need cyclic debugging!
11. RecPlay
- Synchronisation races: execution replay
- Data races:
  - detect
  - also enables cyclic debugging
- Allows you to detect/remove the first data race
- Three phases:
  - record the order of the synchronisation operations
  - replay the synchronisation operations and check for data races
  - normal replay, without checking for data races
12. Overview
[Flowchart of the RecPlay phases. Boxes: Choose input, Record, Replay detect, Replay ident., Replay debug, Choose new input, The end. Legend: Automatic; Requires user intervention.]
13. Instrumentation
- JiTI (Just in Time Instrumentation) was developed especially for RecPlay, but it is a generic instrumentation tool
- Instruments memory and synchronisation operations
- Deals correctly with data in code, code in data, self-modifying code
- Clones processes: the original process is used for the data and the instrumented clone is used for the code
- No need for recompilation, relinking or instrumentation of files.
14. Execution replay
- ROLT (Reconstruction of Lamport Timestamps) is used for tracing/replaying the synchronisation operations
- Attaches a scalar Lamport timestamp to each synchronisation operation
- Delaying each synchronisation operation until the operations with a smaller timestamp have executed suffices for a correct replay
- We only need to log a small subset of all operations
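The slides do not spell out how the timestamps are assigned. The sketch below assumes the classic Lamport rule (new timestamp = maximum of the thread's clock and the synchronisation object's clock, plus one) and a replay rule that lets an operation proceed only when no pending operation carries a smaller timestamp. All names (`lamport_thread`, `lamport_object`, `lamport_stamp`, `may_proceed`) are hypothetical, and the logging optimisation mentioned in the last bullet is not modelled.

```c
/* Scalar Lamport clocks. The update rule is the classic one and is an
 * assumption here, not taken from the slides. */
typedef struct { unsigned long clock; } lamport_thread;
typedef struct { unsigned long clock; } lamport_object;  /* e.g. a semaphore */

/* Record side: timestamp a synchronisation operation of thread t on object o. */
unsigned long lamport_stamp(lamport_thread *t, lamport_object *o)
{
    unsigned long ts = (t->clock > o->clock ? t->clock : o->clock) + 1;
    t->clock = ts;  /* the thread has now seen everything up to ts   */
    o->clock = ts;  /* the next operation on o is ordered after this */
    return ts;
}

/* Replay side: pending_min is the smallest timestamp among all operations
 * that have not been replayed yet. An operation stamped ts may run once no
 * pending operation has a smaller timestamp. */
int may_proceed(unsigned long ts, unsigned long pending_min)
{
    return ts <= pending_min;
}
```

For a V(S1) in one thread followed by a P(S1) in another, the two operations receive timestamps 1 and 2, so during replay the P is held back until the V has executed.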
15. Collecting memory operations
- We need two lists of addresses per segment i: L(i) and S(i)
- A multilevel bitmap is used:
  - low memory consumption
  - comparing two bitmaps is easy
- We lose information: two accesses to the same variable are counted once. This is, however, no problem for data race detection
16. Memory bitmap
[Figure: multilevel bitmap; the address is split into fields of 9 bits, 9 bits and 14 bits]
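A three-level bitmap over 32-bit addresses could look roughly like the sketch below. It assumes that the top 9 bits index the first level, the next 9 bits the second level and the low 14 bits select a bit in a leaf, and that tables are allocated lazily on first use; the field order, the one-bit-per-address granularity and all names (`addr_bitmap`, `bitmap_set`, `bitmap_test`, `bitmap_free`) are assumptions, not details taken from RecPlay.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOP_BITS    9u
#define MID_BITS    9u
#define LOW_BITS   14u
#define TOP_SIZE   (1u << TOP_BITS)
#define MID_SIZE   (1u << MID_BITS)
#define LEAF_BYTES ((1u << LOW_BITS) / 8u)  /* one bit per low-field value */

typedef struct { uint8_t bits[LEAF_BYTES]; } leaf;
typedef struct { leaf *leaves[MID_SIZE]; } mid_table;
typedef struct { mid_table *mids[TOP_SIZE]; } addr_bitmap;

/* Record an access to addr. Returns 0 on success, -1 if out of memory.
 * Setting the same address twice leaves the bitmap unchanged. */
int bitmap_set(addr_bitmap *bm, uint32_t addr)
{
    uint32_t top = addr >> (MID_BITS + LOW_BITS);
    uint32_t mid = (addr >> LOW_BITS) & (MID_SIZE - 1u);
    uint32_t low = addr & ((1u << LOW_BITS) - 1u);

    if (!bm->mids[top] && !(bm->mids[top] = calloc(1, sizeof(mid_table))))
        return -1;
    mid_table *m = bm->mids[top];
    if (!m->leaves[mid] && !(m->leaves[mid] = calloc(1, sizeof(leaf))))
        return -1;
    m->leaves[mid]->bits[low / 8u] |= (uint8_t)(1u << (low % 8u));
    return 0;
}

/* Returns 1 if addr was recorded, 0 otherwise. Untouched regions cost no
 * leaf memory and are rejected at the first missing table. */
int bitmap_test(const addr_bitmap *bm, uint32_t addr)
{
    uint32_t top = addr >> (MID_BITS + LOW_BITS);
    uint32_t mid = (addr >> LOW_BITS) & (MID_SIZE - 1u);
    uint32_t low = addr & ((1u << LOW_BITS) - 1u);

    const mid_table *m = bm->mids[top];
    if (!m || !m->leaves[mid])
        return 0;
    return (m->leaves[mid]->bits[low / 8u] >> (low % 8u)) & 1;
}

/* Release all lazily allocated tables. */
void bitmap_free(addr_bitmap *bm)
{
    for (uint32_t t = 0; t < TOP_SIZE; t++) {
        if (!bm->mids[t])
            continue;
        for (uint32_t m = 0; m < MID_SIZE; m++)
            free(bm->mids[t]->leaves[m]);
        free(bm->mids[t]);
        bm->mids[t] = NULL;
    }
}
```

Because missing tables are simply null pointers, comparing two such bitmaps can skip every region that either segment never touched, which is one plausible reading of why comparison is cheap.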
17. Detecting parallel segments
- A vector clock is attached to each segment
- All segment information (two bitmaps + vector timestamp) is kept on a list L.
- Each new segment is compared against the segments on list L.
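Comparing a new segment against a segment on list L comes down to a vector clock comparison. The sketch below uses the standard vector clock ordering (one clock is before another when it is smaller or equal in every component and differs in at least one); the fixed thread count and the names `vc_before` and `vc_parallel` are illustrative assumptions.

```c
#include <stddef.h>

#define NTHREADS 3  /* illustrative thread count */

/* Returns 1 when clock a is ordered before clock b: a <= b in every
 * component and a != b. */
int vc_before(const unsigned a[NTHREADS], const unsigned b[NTHREADS])
{
    int strictly_less = 0;
    for (size_t k = 0; k < NTHREADS; k++) {
        if (a[k] > b[k])
            return 0;
        if (a[k] < b[k])
            strictly_less = 1;
    }
    return strictly_less;
}

/* Two distinct segments are parallel when neither is ordered before the
 * other. Distinct segments are assumed to carry distinct clocks. */
int vc_parallel(const unsigned a[NTHREADS], const unsigned b[NTHREADS])
{
    return !vc_before(a, b) && !vc_before(b, a);
}
```

Only pairs for which `vc_parallel` holds need their load and store bitmaps compared; ordered pairs cannot race.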
18. Detecting obsolete segments
- Obsolete segments should be removed from list L.
- We use a snooped matrix clock to detect these segments
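One way to read the obsolescence test: a finished segment can be dropped once the latest known vector clock of every thread already dominates the segment's clock, because every segment started afterwards is then ordered after it and can no longer be parallel with it. The sketch below implements only that check. How the matrix rows are obtained ("snooped") is not described on the slide, so the rows are simply taken as given, and `segment_obsolete` is a hypothetical name.

```c
#include <stddef.h>

#define NTHREADS 3  /* illustrative thread count */

/* matrix[k] is the latest vector clock known for thread k (assumed to be
 * available; the snooping mechanism itself is not modelled). A segment with
 * clock seg is treated as obsolete once every row dominates it, i.e. is
 * greater than or equal to it in every component. */
int segment_obsolete(const unsigned seg[NTHREADS],
                     const unsigned matrix[NTHREADS][NTHREADS])
{
    for (size_t k = 0; k < NTHREADS; k++)
        for (size_t j = 0; j < NTHREADS; j++)
            if (matrix[k][j] < seg[j])
                return 0;  /* thread k may still run a parallel segment */
    return 1;
}
```

The check is conservative: it may keep a segment on list L longer than strictly necessary, but under the stated assumption it never discards one that could still take part in a race.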
19. Detecting obsolete segments
[Figure: thread timelines. Legend: obsolete segment; segment on list L; segment in execution; point of execution; the future.]
20. Identification phase
- If a data race is detected, we know:
  - the address involved
  - the type of operations involved (load or store)
  - the threads involved
  - the segments containing the racing instructions
- We need another replayed execution to find the racing instructions themselves (call stack, ...)
- This replay executes at full speed until the racing segments start executing.
21-40. An Example
[Animated figure, built up step by step over slides 21 to 40. Operations shown in the final frame: A ← 1, B ← 2, V(S1), C ← 4, P(S1), C ← A+B, A ← 3, V(S2), P(S2), V(S3), P(S3).]
41. Experimental Evaluation
- RecPlay has been implemented for Solaris running on SPARC multiprocessors
- Tested on a SUN SparcServer 1000 with 4 processors
- SPLASH-2 was used as a benchmark:
  - a number of multithreaded numeric applications, such as fast Fourier transform, a raytracer, ...
- Several data races were found, including in SPLASH-2
42. Basic performance of RecPlay
43. Segments with memory accesses
44. Efficiency of the ROLT mechanism
45. Conclusions
- RecPlay is a practical and efficient tool for detecting and removing data races
- RecPlay also makes cyclic debugging possible
- Three types of clocks (scalar, vector and matrix) are used to enable a fast and memory-efficient implementation
- Data races have been found