Title: Memoizing Multi-Threaded Transactions
1Memoizing Multi-Threaded Transactions
joint work with Suresh Jagannathan and Jeremy
Orlow
2Motivation
- Advent of multi-core architectures encourages development of new applications and abstractions.
  - Concurrent stream-based programs
    - Avoid recomputing outputs for previously seen inputs
  - Speculative computation
    - Reuse results computed by completed applications within a failed speculation
  - Transactions
    - Substantial amount of wasted work when a (long-lived) transaction aborts
- Reduce this overhead by avoiding re-execution of computation not affected by the reasons for commit failure
- Apply well-known sequential optimizations: memoization
3Programming Model
- Pure CML
  - First-class threads
  - Message-based synchronous communication
  - First-class synchronous events
  - All effects manifest through channel communication
  - No polling
- Augmented with atomic
  - No constraints on atomic regions
  - Allow thread creation: multi-threaded transactions
  - Nested atomic regions: nested transactions
- Strong atomicity
4Memoization
- Consider a pure function
    fun f(x) = e
- When applied to v, f evaluates its body e and yields a result
- If we apply f to v again, we can simply return the previous result, without having to re-evaluate f's body
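As a minimal sketch of the idea on this slide, the Standard ML fragment below wraps a pure function on integers in a memo table. The names `memoize`, `calls`, `square`, and `msquare` are hypothetical and exist only for illustration; the counter is instrumentation to make the skipped evaluation observable and is not part of the technique.

```sml
(* Sketch only: memoize a pure int -> int function with an association list. *)
fun memoize (f : int -> int) : int -> int =
  let
    (* memo table: argument paired with the previously computed result *)
    val table : (int * int) list ref = ref []
  in
    fn v =>
      case List.find (fn (k, _) => k = v) (!table) of
        SOME (_, r) => r                       (* hit: skip f's body *)
      | NONE =>
          let val r = f v
          in table := (v, r) :: !table; r end  (* miss: evaluate and record *)
  end

(* Instrumentation: count how often the body is actually evaluated. *)
val calls = ref 0
fun square x = (calls := !calls + 1; x * x)

val msquare = memoize square
val r1 = msquare 7   (* evaluates the body *)
val r2 = msquare 7   (* served from the memo table *)
```

After both applications, `!calls` is 1: the second application returns the stored result without re-evaluating the body.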
5Memoization
- What if f has side-effects?
- Cannot simply elide the second call
val x = ref 0
fun f(y) = ... x := !x + y ...
f(v)   -> v    (value of x is v)
f(v)   -> v    (value of x is now 2v)
6Observation potential solution
- Can split function body into two parts
  - Expressions that have no effect, nor depend on any effectful expression
  - Effectful expressions
- Record effectful computations (and any expression that depends on them) in the memo table
- When applying a memoized function
  - Avoid execution of effect-free expressions
    - Such expressions have no side-effect, nor depend on expressions that do
  - Execute all effectful expressions
- Return value is
  - the value stored in the memo table if the return expression was effect-free
  - the value yielded by evaluating the return expression otherwise
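The split described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the talk: the memo entry is modelled as a list of effect thunks plus a return descriptor, and the names `ret`, `entry`, `replay`, `expensive`, and `firstCall` are hypothetical. The entry is built by hand here; a real system would record it automatically.

```sml
(* Sketch only: a memo entry keeps the effectful part of a call. *)
datatype 'a ret =
    Stored of 'a                 (* return expression was effect-free *)
  | Recompute of unit -> 'a      (* return expression depends on effects *)

type 'a entry = { effects : (unit -> unit) list, result : 'a ret }

(* Re-apply a memoized call: run the recorded effects in order and
   skip all effect-free work. *)
fun replay ({effects, result} : 'a entry) : 'a =
  ( List.app (fn eff => eff ()) effects
  ; case result of
      Stored v => v
    | Recompute k => k () )

val x = ref 0
val pureEvals = ref 0   (* instrumentation: counts effect-free evaluations *)
fun expensive y = (pureEvals := !pureEvals + 1; y * y)

(* First call: evaluate everything and build the entry by hand. *)
fun firstCall y =
  let
    val p = expensive y                 (* effect-free part *)
    val eff = fn () => x := !x + y      (* effectful part *)
    val () = eff ()
  in
    (p, { effects = [eff], result = Stored p })
  end

val (a, entry) = firstCall 3
val b = replay entry     (* effect re-executed, expensive not re-run *)

(* An entry whose return value depends on the effect must recompute it. *)
val entry2 = { effects = [fn () => x := !x + 4],
               result = Recompute (fn () => !x) }
val c = replay entry2
```

Replaying `entry` updates `x` a second time but leaves `pureEvals` at 1, which is the behavior the slide asks for: effects are re-executed, effect-free work is elided.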
7Challenges
- References provide a form of implicit, non-local communication
- Should only re-evaluate effect-dependent computation when necessary
- Does the problem become more tractable when communication is explicit?
let val x = ref true
    fun f y = ... if (!x) then e else e' ...
in f v;
   ...
   f v
end
Re-evaluation is necessary only if x is modified between the two calls
8Memoization and Concurrency
Scheduling decisions introduce non-determinism of
thread interleavings, making it non-trivial to
draw such conclusions
9Example
[Figure: thread diagram showing thread T3 and the call f(v)]
10Tracking Communication
let val (c1,c2) = (mkCh(), mkCh())
    fun f() = (...; send(c1,v1); ...)
    fun g() = (recv(c1); send(c2,v2); ...; g())
in spawn(g); f(); recv(c2); f() end
What if there is no waiting receiver for the
send performed by f?
Should enforce a schedule that allows the thread
computing g() to proceed to the recv on c1
11An Approach
- Maintain a memo store that records, as constraints, the communication actions performed within a procedure.
- Constraints ensure communication and synchronization take place in a specific order.
- At a call, consult the memo store.
  - If all constraints are satisfiable in the current global state, elide the call.
  - Otherwise, explore the state space of possible interleavings to discover a global state in which remaining constraints can be satisfied.
  - Fail if no such state exists.
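A simplified sketch of the lookup step follows. It is not the implementation from the talk: channels are plain integers, the global state is approximated by a list of `offer` values describing what other threads are currently blocked on, and the search over interleavings is left out, so an unsatisfiable entry simply re-executes the body. All names (`cstr`, `offer`, `complements`, `takeFirst`, `satisfiable`, `call`) are hypothetical.

```sml
(* Sketch only: elide a call when every recorded constraint can be
   discharged, in order, against what other threads currently offer. *)
datatype action = Send | Recv
type cstr = { chan : int, act : action, value : int }

datatype offer =
    Sender of int * int      (* channel, value: a blocked sender *)
  | Receiver of int          (* channel: a blocked receiver *)

(* Does an offer complement a constraint? *)
fun complements ({chan, act = Send, ...} : cstr) (Receiver ch) = chan = ch
  | complements {chan, act = Recv, value} (Sender (ch, v)) =
      chan = ch andalso value = v
  | complements _ _ = false

(* Remove the first element satisfying p, if there is one. *)
fun takeFirst p [] = NONE
  | takeFirst p (x :: xs) =
      if p x then SOME xs
      else Option.map (fn rest => x :: rest) (takeFirst p xs)

fun satisfiable [] _ = true
  | satisfiable (c :: cs) offers =
      (case takeFirst (complements c) offers of
         SOME rest => satisfiable cs rest
       | NONE => false)

(* Memo store: (procedure id, argument) -> (constraints, return value). *)
type store = ((string * int) * (cstr list * int)) list

fun call (store : store) offers (id, arg) (body : int -> int) =
  case List.find (fn (k, _) => k = (id, arg)) store of
    SOME (_, (cs, ret)) =>
      if satisfiable cs offers then ret   (* elide the call *)
      else body arg                       (* simplification: re-execute *)
  | NONE => body arg

val store : store =
  [(("f", 1), ([{chan = 1, act = Send, value = 10},
                {chan = 2, act = Recv, value = 20}], 99))]

val elided   = call store [Receiver 1, Sender (2, 20)] ("f", 1) (fn _ => 0)
val executed = call store [Receiver 1] ("f", 1) (fn _ => 0)
```

In the first application both constraints are matched and the stored result 99 is returned; in the second no sender is available on channel 2, so the body runs.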
12Challenges
- Finding an interleaving that satisfies all dependencies may involve unbounded search
- Even when a schedule is discovered, starvation may be introduced
  - Similar situation arises for recv
- Want to utilize memoization without compromising fairness guarantees
let val c = mkCh()
    fun p1() = (send(c,1); p1())
    fun p2() = (send(c,2); p2())
    fun f() = (recv(c); ...)
    fun g() = (f(); ...; g())
in spawn(p1); spawn(p2); spawn(g) end
13Partial Memoization
- Utilizing memoized information requires
discovering a path in the state space in which
memoization constraints can be discharged.
let val (c1,c2) = (mkCh(), mkCh())
    fun f() = (send(c1,v1); ...; recv(c2))
    fun g() = (recv(c1); ...; recv(c2); g())
    fun h() = (...; send(c2,v2); send(c2,v3); ...; h())
    fun i() = (recv(c2); i())
in spawn(g); spawn(h); spawn(i);
   f(); ...; send(c2,v3); ...;
   f()
end
Instead, match the send constraint, elide pure computation up to recv(c2), and resume execution
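The "match what you can, then resume" step can be sketched as below. This is an illustration only: the predicate `matches` stands in for whatever check the runtime performs against the current global state, and the names `outcome`, `discharge`, `memoF`, and `matchesNow` are hypothetical. The values 1, 2, and 3 play the roles of v1, v2, and v3.

```sml
(* Sketch only: discharge a prefix of the recorded constraints and
   report where ordinary execution has to resume. *)
datatype action = Send | Recv
type cstr = { chan : string, act : action, value : int }

datatype outcome =
    Elided of int                  (* all constraints matched: use stored result *)
  | ResumeFrom of int * cstr list  (* stall index and the unmatched suffix *)

fun discharge (matches : cstr -> bool) (cs : cstr list, ret : int) : outcome =
  let
    fun go (_, []) = Elided ret
      | go (n, c :: rest) =
          if matches c then go (n + 1, rest)
          else ResumeFrom (n, c :: rest)
  in
    go (0, cs)
  end

(* Memoized f(): send(c1,v1) ... recv(c2), where the recv read v2 = 2. *)
val memoF = ([{chan = "c1", act = Send, value = 1},
              {chan = "c2", act = Recv, value = 2}], 0)

(* Assumed current state: a receiver waits on c1, but only v3 = 3 is
   available on c2, so the recorded receive cannot be matched. *)
fun matchesNow ({chan = "c1", act = Send, ...} : cstr) = true
  | matchesNow {chan = "c2", act = Recv, value} = (value = 3)
  | matchesNow _ = false

val result = discharge matchesNow memoF
```

Here `result` is `ResumeFrom (1, ...)`: the send constraint is discharged, the computation before the receive is skipped, and execution continues from the receive on c2.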
14Implementation
- Incorporated within MLton
  - insertion of barriers to monitor function arguments and returns
  - hooks into CML to monitor channel communication and to record constraints
- Constraint matching can fail on a receive constraint
  - Receive constraints are obligated to read a specific value
- Send constraints can only fail if
  - there are no matching receive constraints on the sending channel, or
  - no receive operations on the same channel
  - A receive operation (not constraint) is indifferent to the value it reads
- When an application of a memoized function is stalled, we can fail, and resume execution from the stall point
  - Heuristic: record the number of context switches to a thread attempting to discharge a constraint.
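The matching rules in the bullets above can be read as a small decision function. The sketch below is one such reading, with assumptions marked: `pending` lists what is waiting on the other side of a single channel, and treating any pending send of the right value as a match for a receive constraint is an assumption of this sketch, not a statement from the slides. The names `pending`, `cstr`, and `canDischarge` are hypothetical.

```sml
(* Sketch only: can a constraint be discharged against what is pending
   on the same channel? *)
datatype pending =
    RecvConstraint of int   (* memoized receive: must read this value *)
  | RecvOperation           (* ordinary receive: accepts any value *)
  | Sending of int          (* assumed: any pending send of this value *)

datatype cstr = MustSend of int | MustRecv of int

fun canDischarge (MustSend v) pend =
      (* a send is matched by a receive constraint on the same value
         or by any ordinary receive operation *)
      List.exists (fn RecvConstraint w => v = w
                    | RecvOperation => true
                    | _ => false) pend
  | canDischarge (MustRecv v) pend =
      (* a receive constraint must read exactly the recorded value *)
      List.exists (fn Sending w => v = w
                    | _ => false) pend

val s1 = canDischarge (MustSend 5) [RecvConstraint 4]
val s2 = canDischarge (MustSend 5) [RecvConstraint 4, RecvOperation]
val r1 = canDischarge (MustRecv 5) [Sending 4]
val r2 = canDischarge (MustRecv 5) [Sending 5]
```

The asymmetry is visible in the examples: the send constraint succeeds as soon as any ordinary receiver is present, while the receive constraint succeeds only when the exact recorded value is on offer.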
15Case Study
- STMBench7
  - A tunable multi-threaded benchmark designed to compare different software transactional memory (STM) implementations and designs.
  - Simulates data storage and access patterns of a CAD/CAM application
- Benchmark builds a tree of assemblies
  - leaves contain bags of components
  - components form highly-connected graphs of atomic parts
- Roughly 1.5K lines of CML
- Nodes in the graph are represented as message-passing servers
  - Receiving channel for input
  - Output channels to connect to adjacent nodes in the tree
16Example
Traverses the graph, and changes a component's
height
Establishes a transaction
Searching the graph for different components can
be performed concurrently
Memoization helps avoid unnecessary re-traversal
of the graph if the transaction fails.
17Results
- Consider two configurations of the benchmark
  - Transactional: use STM without memoization
  - Memoized: use STM with memoization of atomic sections
- Goal
  - Measure performance improvement as a function of transaction aborts
- Parameters
  - A graph of 1M nodes
  - 280K complex assemblies
  - 140 assemblies
  - bags reference one of 100 components, each containing 100 nodes
- Execution creates roughly 500K threads, and 1M channels
- Each transaction performs 7 channel operations on average, and traverses roughly 20 nodes of the parts graph
18Runtime Improvement
19Runtime Improvement
20Related Work
- Self-adjusting computation and change propagation
  - Leverages memoization to automatically adapt a program's execution to a change of inputs, given an initial execution run.
  - Key distinction: no maintenance of dynamic dependence graphs.
  - Effectiveness of memoization depends only on the values stored in constraints, not on where those values came from.
- Transactional Events
  - Require arbitrary look-ahead to determine if a complex transactional event can commit.
  - A similar property is necessary to determine if a call can be elided based on communication actions performed by its memoized version.
- Selective memoization
  - Addresses a complementary problem that can be used to improve memoization efficiency.
21Conclusions
- Memoizing communication can be an effective dynamic optimization to reduce re-execution overheads for optimistic or speculative concurrency abstractions.
- Partial memoization allows these techniques to be useful in practice.
- Future Work
  - Opportunities for static analysis
    - Detect communication patterns to aggregate (bundle) constraints
    - Identify partial memoization points
  - Runtime profiling
22Questions?
23STM
- We implement an eager-versioning, lazy conflict detection STM protocol.
  - Isolation and atomicity guarantees within a transaction
- Shared references implemented in terms of channel-based communication
  - Track updates to channels in the same way that updates to shared memory are tracked by a typical STM
  - Build an STM-aware shared-memory server abstraction on top of channel communication
- The STM supports nested, multi-threaded transactions
  - Multiple threads within a transaction must join at the commit point before the transaction can complete
  - Memoization helps reduce abort overheads in the presence of communication among threads within the transaction
24Schedulability
- Reasoning about whether a feasible schedule
exists is typically more difficult.
let val (c1,c2) = (mkCh(), mkCh())
    fun f() = (...; send(c1,v1); ...; recv(c2))
    fun g() = (recv(c1); recv(c2); ...; g())
    fun h() = (send(c2,v2); send(c2,v3); h())
in (spawn(g); spawn(h); f(); ...; f()) end
25Utilization
26Example
let val ch = mkCh()
    fun f() = let val _ = recv(ch)
                  val _ = recv(ch)
              in () end
    fun g() = let val _ = send(ch,1)
                  val _ = send(ch,2)
              in () end
    fun f'() = (spawn(f); f'())
    fun g'() = (spawn(g); g'())
in (spawn(f'); spawn(g')) end
[Figure: interleaved calls to g(), each performing send(ch,1) and send(ch,2)]
- Four possible memoized versions of f, one for each pair of values it may receive.
- Force a thread schedule that guarantees calls to g supply the values recorded in the memoized version of f.
27Evaluation Rules
28Evaluation Rules
29Safety
- Using memo information to elide calls only yields states realizable under non-memoized evaluation.
- Introduce two auxiliary operators
  - The first transforms process states (and terms) defined under memo evaluation to process states and terms defined under non-memoized evaluation.
  - The second translates constraints in the memo store to core language terms.
[Safety theorem: If ... then ... (formal statement omitted)]
30Program States
- s: Memo store
  - Given an id (for a procedure) and an argument value, returns a set of constraints and a return value.
- T: Memo state
  - Associates a set of constraints with a call.
- C: Constraint
  - for a communication operation
    - channel location, action (Send/Recv), value sent or received, continuation
  - for a spawn operation
    - thunk spawned
  - for a channel creation operation
    - channel location
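The components listed on this slide map naturally onto Standard ML types. The declarations below are a sketch under assumptions: channel locations are integers, continuations and spawned thunks are modelled as `unit -> unit`, procedure ids are strings, and the "set" of constraints is kept as an ordered list. The names `loc`, `cstr`, `store`, `memoState`, `emptyStore`, and `extend` are hypothetical.

```sml
(* Sketch only: one possible typing of the program-state components. *)
type loc = int                        (* channel location (assumed int) *)
datatype action = Send | Recv

datatype 'v cstr =
    Comm of { chan : loc, act : action, value : 'v, cont : unit -> unit }
  | Spawn of unit -> unit             (* thunk spawned *)
  | NewChan of loc                    (* channel created *)

(* Memo store: procedure id and argument -> constraints and return value. *)
type 'v store = (string * 'v) -> ('v cstr list * 'v) option

(* Memo state: the constraints associated with a call. *)
type 'v memoState = { call : string * 'v, pending : 'v cstr list }

val emptyStore : int store = fn _ => NONE

(* Functional update of an int-valued store. *)
fun extend (s : int store) (key : string * int) entry : int store =
  fn k => if k = key then SOME entry else s k

val s1 = extend emptyStore ("f", 1) ([NewChan 0, Spawn (fn () => ())], 42)

val hit  = Option.map (fn (_, r) => r) (s1 ("f", 1))
val miss = Option.isSome (s1 ("g", 1))
```

Representing the store as a function keeps the sketch short; an implementation would more likely use a hash table keyed on the procedure id and argument.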
31Correspondence
- Partial memoization is sound with respect to full memoization.
- There exists a transition sequence from the global state yielded by a Fail transition to a global state representing successful discharge of all memoization constraints.
[Correspondence theorem: If ... and ... then ... (formal statement omitted)]
32Utilization
33Example
Memoized information about sclHgt can help elide the first call on re-execution if
- arguments remain the same
- the object yielded by the traversal has not changed
Can elide the second call if
- communication via channel c2 is consistent with the previous execution, as determined by the behavior of the first call to sclHgt