Deterministic Execution of Nondeterministic Shared-Memory Programs - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Deterministic Execution of Nondeterministic
Shared-Memory Programs
  • Dan Grossman
  • University of Washington
  • Dagstuhl Seminar on
  • Design and Validation of Concurrent Systems
  • August 2009

2
What if
  • What if you could run the same multithreaded
    program on the same inputs twice and know you
    would get the same results?
  • What exactly does that mean?
  • Why might you want that?
  • How can we do that (semi-efficiently)?
  • But first
  • Some background on me and the talks I'm not
    giving
  • Key terminology and perspectives
  • More important than technical details at this
    event

3
Biography / group names
  • Me
  • Programming-languages person
  • Type systems, compilers for memory-safe C
    dialect, 2000-2004
  • 30% → 80% focus on multithreading, 2005-
  • Co-advising 3-4 students with computer architect
    Luis Ceze, 2007-
  • Two groups for marketing purposes
  • WASP, wasp.cs.washington.edu
  • SAMPA, sampa.cs.washington.edu

4
The talk you won't see
void transferFrom(int amt, Acct other) {
  atomic {
    other.withdraw(amt);
    this.deposit(amt);
  }
}
  • "Transactions are to shared-memory concurrency as
    garbage collection is to memory management"
    [OOPSLA '07]
  • Semantic problems with nontransactional accesses
    worse than locks!
  • Fix with stronger guarantees and compiler opts
    [PLDI '07]
  • Or static type system, formal semantics, and
    proof [POPL '08]
  • Or more dynamic approach adapting to Haskell
    [submitted]
  • Prototypes for OCaml, Java, Scheme, and Haskell

5
This talk
  • Take an arbitrary C/C++ program with POSIX
    threads
  • Locks, barriers, condition variables, data races,
    whatever
  • Compile it funny
  • Link it against a funny run-time system
  • Get deterministic behavior
  • Well, as deterministic as a sequential C program
  • Joint work Luis Ceze, Tom Bergan, Joe Devietti,
    Owen Anderson

6
Terminology
  • Essential perspectives, not just definitions
  • Parallelism vs. concurrency
  • Or different terms if you prefer
  • Sequential semantics vs. determinism vs.
    nondeterminism
  • What is an input?
  • Level of abstraction
  • Which one do you care about?

7
Concurrency
  • Working definition
  • Software is concurrent if a primary intellectual
    challenge is responding to external events from
    multiple sources in a timely manner.
  • Examples: operating system, shared hashtable,
    version control
  • Key challenge is responsiveness
  • often leads to threads or asynchrony
  • Correctness usually requires synchronization
    (e.g., locks)

8
Parallelism
  • Working definition
  • Software is parallel if a primary intellectual
    challenge is using extra computational resources
    to do more useful work per unit time.
  • Examples: scientific computing, most graphics, a
    lot of servers
  • Key challenge is Amdahl's Law
  • No sequential bottlenecks, no imbalanced load
  • When pure fork-join isn't correct, need
    synchronization

9
The confusion
  • First, this use of terms isn't standard
  • Many systems are both
  • And it's really a matter of degree
  • Similar lower-level mechanisms, such as threads
    and locks
  • And similar errors (race conditions, deadlocks,
    etc.)
  • Our work determinizes these lower-level
    mechanisms, so we determinize concurrent and
    parallel applications
  • But purely parallel ones probably benefit less

10
Terminology
  • Essential perspectives, not just definitions
  • Parallelism vs. concurrency
  • Or different terms if you prefer
  • Sequential semantics vs. determinism vs.
    nondeterminism
  • What is an input?
  • Level of abstraction
  • Which one do you care about?

11
Sequential semantics
  • Some languages can have results defined purely
    sequentially, but are designed to have better
    parallel-performance guarantees (thanks to a cost
    model)
  • Examples: DPJ, Cilk, NESL, …
  • For correctness, reason sequentially
  • For performance, reason in parallel
  • Really designed for parallelism, not concurrency
  • Not our work

12
Sequential isn't always deterministic
  • Surprisingly easy to forget this

int f1() { print("A"); print("B"); return 0; }
int f2() { print("C"); print("D"); return 0; }
int g()  { return f1() + f2(); }
  • Must g() print ABCD?
  • Java: yes
  • C/C++: no; CDAB is allowed (either operand of +
    may be evaluated first), but not ACBD, ACDB, etc.
    (the two calls themselves are not interleaved)

13
Another example
  • Dijkstra's guarded-command conditionals

if x % 2 == 1 -> y := x - 1
[] x < 10     -> y := 7
[] x > 10     -> y := 0
fi
  • We might still expect a particular language
    implementation (compiler) to be deterministic
  • May choose any deterministic result consistent
    with the nondeterministic semantics
  • Presumably doesn't change choice across
    executions, but may across compiles (including
    butterfly effects)
  • Our work does this

14
Why helpful?
  • So the programmer gets a deterministic
    executable, but doesn't know which one
  • Key degree of freedom for automated performance
  • Still helpful for
  • Whole-program testing and debugging
  • Automated replicas
  • In general, repeatability and reducing possible
    executions

15
Define deterministic, part 1
  • Deterministic: outputs depend only on inputs
  • That's right, but it means we must clearly
    specify what is an input (and an output)
  • Can define away anything you want
  • Example: all syscall results are inputs, so
    seeding the pseudorandom number generator with
    time-of-day is "deterministic"
  • We mean what you think we mean
  • Inputs: command-line, I/O, syscalls
  • Not inputs: cache state, hardware timing, thread
    scheduler

16
Terminology
  • Essential perspectives, not just definitions
  • Parallelism vs. concurrency
  • Or different terms if you prefer
  • Sequential semantics vs. determinism vs.
    nondeterminism
  • What is an input?
  • Level of abstraction
  • Which one do you care about?

17
Define deterministic, part 2
  • "Is it deterministic?" depends crucially on your
    abstraction level
  • Another obvious, easy-to-forget thing
  • Examples
  • File systems
  • Memory-allocation (Java vs. C)
  • Set implemented as a list
  • Quantum mechanics
  • Our work
  • The language level: state of logical memory,
    program output
  • Application may care only about a higher level
    (future work)

18
Okay how?
  • Trade-off between complexity and performance

  • Performance
  • Overhead (single-thread slowdown)
  • Scalability (minimize extra synchronization,
    waiting)

19
Starting serial
  • Determinization is easy!
  • Run one thread at a time in round-robin order
  • Context-switch after N basic blocks, for a
    deterministic N
  • Cannot use a timer; use the compiler and run-time
  • Races in the source program are irrelevant; locks
    are still respected
  • Example with 3 threads running (time moves with
    arrows)

T1
T2
T3
1 quantum
1 round
20
Parallel quanta
  • The quanta in a round can start to run in
    parallel provided they stop before any
    communication occurs (see how next)
  • So each round has two stages, parallel then serial

T1
T2
T3
Parallel stage ends with global barrier
load A
load A
Serial stage ends; next round starts
store B
store C


21
Is that legal?
T1
T2
T3
load A
load A
store B
store C
  • Can produce a different result than serial
    execution
  • In fact, the execution is not necessarily
    equivalent to any serialization of quanta
  • But it doesn't matter as long as we are
    deterministic! We just need:
  • Parallel stages do no communication
  • Parallel stages end at deterministic points

22
Performance
T1
T2
T3
load A
load A
store B
store C
  • Keys to scalability:
  • (1) Run almost everything in the parallel stage
  • (2) Keep quanta balanced
  • For (2): assuming (1), use rough instruction costs

23
Memory ownership
  • To avoid communication during the parallel stage:
  • Every memory location is shared or owned by one
    thread T
  • A dynamic table is checked and updated during
    execution
  • Can read only memory that is shared or
    owned-by-you
  • Can write only memory owned-by-you
  • Locks are just like memory locations; blocking
    ends the quantum
  • In our example, perhaps A is shared, B and C are
    owned by T2

T1
T2
T3
load A
load A
store B
store C
24
Changing ownership
  • Policy
  • For each location (any deterministic granularity
    is correct),
  • First owner is first thread to allocate in the
    location
  • On read in serial stage, if owned-by-other, set
    to shared
  • On write in serial stage, set to owned-by-self
  • Correctness
  • Ownership immutable in parallel stages (so no
    communication)
  • Serial-stage changes are deterministic
  • So many, many policies are correct
  • Chose the obvious one for temporal locality and
    read-sharing
  • Must have good locality for scalability!

25
Overhead
  • Significant overhead
  • All reads/writes consult ownership information
  • All basic blocks subtract from a thread-local
    quantum counter
  • Reduce via
  • Lots of run-time engineering and data structures
    (not too much magic, but most important)
  • Obvious compiler optimizations like escape
    analysis and hoisting counter-subtractions
  • Specialized compiler optimizations like
    Subsequent Access Optimization: don't recheck
    the same ownership unless a quantum boundary
    might intervene
  • Correctness of this is a subtle argument and
    slightly affects the ownership-change policy
    (deterministically!)

26
Brittle
  • Change any line of code, command-line argument,
    environment variable, etc. and you can get a
    different deterministic program
  • We are mostly robust to memory-safety errors,
  • except:
  • Bounds errors that corrupt ownership information
  • Bounds errors that write to another thread's
    allegedly-thread-local data

27
Results
  • Overhead: varies a lot, but about 3x at 8 threads
  • Scalability: varies a lot, but on average with
    the PARSEC suite (*):
  • nondet 8 threads vs. nondet 2 threads: 2.4x
    (linear would be 4x)
  • det 8 threads vs. det 2 threads: 2.0x
  • det 8 threads vs. nondet 2 threads: 0.91x
    (range 0.41 - 2.75)
  • How do you want to spend Moore's Dividend?
  • (*) subset runnable: no MPI, no C++ exceptions, no
    32-bit assumptions

28
Buffering
  • Actually, ownership is only one approach
  • Second approach relies on buffering and a commit
    stage
  • Even higher overhead (to consult buffers)
  • Even better scalability (block only for
    synchronization commits)
  • And a third hybrid approach
  • Hopefully more details soon

29
Conclusion
  • The fundamental assumption that nondeterministic
    shared-memory programs must be run
    nondeterministically is false
  • A fun problem to throw principled compiler and
    run-time optimizations at.
  • Could dramatically change how we test and debug
    parallel and concurrent programs
  • Most-related work:
  • Kendo from MIT: done concurrently (in parallel?),
    requires knowing about data races statically,
    different approach
  • Colleagues' ASPLOS '09 work: hardware support for
    ownership
  • Record-and-replay systems: we can replay without
    the record