Title: Optimizing memory transactions
1. Optimizing memory transactions
- Tim Harris, Mark Plesko, Avi Shinnar, David Tarditi
2. The Big Question: are atomic blocks feasible?
- Atomic blocks may be great for the programmer, but can they be implemented with acceptable performance?
- At first, atomic blocks look insanely expensive. A recent implementation (Harris & Fraser, OOPSLA '03), sketched below:
  - Every load and store instruction logs information into a thread-local log
  - A store instruction writes only to the log
  - A load instruction consults the log first
  - At the end of the block, validate the log and atomically commit it to shared memory
- Assumptions throughout this talk:
  - Reads outnumber writes (3:1 or more)
  - Conflicts are rare
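To make the logging scheme above concrete, here is a minimal C# sketch. TxCell, WstmTx and the global commit lock are illustrative stand-ins for WSTM's word-based metadata and CAS-based commit, not the paper's API.

  // Minimal sketch of the deferred-update ("write the log only") scheme above.
  // TxCell/WstmTx are invented names; the global lock simulates the atomic commit.
  using System.Collections.Generic;

  class TxCell
  {
      public int Value;     // the shared datum
      public int Version;   // bumped on every committed update
  }

  class WstmTx
  {
      static readonly object commitLock = new object();   // stand-in for the CAS-based commit
      readonly Dictionary<TxCell, int> readLog = new();    // cell -> version seen
      readonly Dictionary<TxCell, int> writeLog = new();   // cell -> tentative new value

      public int ReadCell(TxCell c)
      {
          // A load consults the thread-local log first...
          if (writeLog.TryGetValue(c, out int pending)) return pending;
          // ...otherwise it reads shared memory and records the version it saw.
          if (!readLog.ContainsKey(c)) readLog[c] = c.Version;
          return c.Value;
      }

      public void WriteCell(TxCell c, int v)
      {
          // A store writes only to the log; the heap is untouched until commit.
          writeLog[c] = v;
      }

      public bool Commit()
      {
          lock (commitLock)
          {
              // Validate: every cell we read must still be at the version we logged.
              foreach (var (cell, ver) in readLog)
                  if (cell.Version != ver) return false;   // conflict: caller re-runs the block
              // Publish the buffered writes.
              foreach (var (cell, val) in writeLog)
              {
                  cell.Value = val;
                  cell.Version++;
              }
              return true;
          }
      }
  }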
3. State of the art, 2003
[Chart: normalised execution time, sequential baseline = 1.00x]
- Coarse-grained locking: 1.13x
- Fine-grained locking: 2.57x
- Harris & Fraser WSTM: 5.69x
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535
4. Our new techniques, prototyped in Bartok
- Direct-update STM
  - Allow transactions to make updates in place in the heap
  - Avoids reads needing to search the log to see earlier writes that the transaction has made
  - Makes successful commit operations faster at the cost of extra work on contention or when a transaction aborts
- Compiler integration
  - Decompose the transactional memory operations into primitives
  - Expose the primitives to compiler optimization (e.g. to hoist concurrency control operations out of a loop)
- Runtime system integration
  - Integration with the garbage collector and runtime system components to scale to atomic blocks containing 100M memory accesses
5. Results: concurrency control overhead
[Chart: normalised execution time, sequential baseline = 1.00x]
- Coarse-grained locking: 1.13x
- Fine-grained locking: 2.57x
- Harris & Fraser WSTM: 5.69x
- Direct-update STM: 2.04x
- Direct-update STM + compiler integration: 1.46x
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535
Scalable to multicore?
6. Results: scalability
[Chart: microseconds per operation vs. number of threads, comparing coarse-grained locking, fine-grained locking, WSTM (atomic blocks), DSTM (API), OSTM (API), and direct-update STM with compiler integration]
7. Direct-update STM
- Augment objects with (i) a lock and (ii) a version number
- Transactional write:
  - Lock objects before they are written to (abort if another thread holds that lock)
  - Log the overwritten data: we need it to restore the heap in case of retry, transaction abort, or a conflict with a concurrent thread
  - Make the update in place to the object
- Transactional read:
  - Log the object's version number
  - Read from the object itself
- Commit:
  - Check the version numbers of the objects we've read
  - Increment the version numbers of the objects we've written, unlocking them
(A minimal sketch of these operations follows.)
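The steps above can be made concrete with a small sketch. This is not Bartok's code: TxObject, StmTx, Write and TxAbortedException are illustrative names, the per-object lock is modelled as an owner reference, and re-running an aborted block is left to the caller.

  // Minimal sketch of a direct-update STM object and transaction.
  using System;
  using System.Collections.Generic;
  using System.Threading;

  class TxObject
  {
      public int Value;        // the transactional field, for illustration
      public int Version = 1;  // bumped when a writing transaction commits
      public StmTx Owner;      // null when unlocked
  }

  class TxAbortedException : Exception { }

  class StmTx
  {
      readonly Dictionary<TxObject, int> readLog = new();  // object -> version seen
      readonly Dictionary<TxObject, int> undoLog = new();  // object -> overwritten value

      public int OpenForRead(TxObject o)
      {
          if (!readLog.ContainsKey(o)) readLog[o] = o.Version;  // log the version number
          return o.Value;                                       // read from the object itself
      }

      public void OpenForUpdate(TxObject o)
      {
          // Lock the object before writing; abort if another transaction holds it.
          var prev = Interlocked.CompareExchange(ref o.Owner, this, null);
          if (prev != null && prev != this) Abort();
      }

      public void Write(TxObject o, int newValue)
      {
          OpenForUpdate(o);
          if (!undoLog.ContainsKey(o)) undoLog[o] = o.Value;    // log the overwritten data
          o.Value = newValue;                                   // update in place
      }

      public void Commit()
      {
          // Check the version numbers of the objects we've read: each must be
          // unchanged and not locked by some other transaction.
          foreach (var (obj, ver) in readLog)
              if (obj.Version != ver || (obj.Owner != null && obj.Owner != this))
                  Abort();
          // Increment the version numbers of the objects we've written, unlocking them.
          foreach (var obj in undoLog.Keys)
          {
              obj.Version++;
              obj.Owner = null;   // a real release would also publish with a fence
          }
      }

      void Abort()
      {
          // Restore the heap from the undo log, release our locks, and signal a re-run.
          foreach (var (obj, oldValue) in undoLog)
          {
              obj.Value = oldValue;
              obj.Owner = null;
          }
          throw new TxAbortedException();
      }
  }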
8. Example: contention between transactions
[Figure: threads T1 and T2 start with empty transaction logs]
9. Example: contention between transactions
T1 reads from c1 and logs that it saw version 100.
[T1's log: c1.ver=100 | T2's log: empty]
10. Example: contention between transactions
T2 also reads from c1 and logs that it saw version 100.
[T1's log: c1.ver=100 | T2's log: c1.ver=100]
11. Example: contention between transactions
Suppose T1 now reads from c2 and sees it at version 200.
[T1's log: c1.ver=100, c2.ver=200 | T2's log: c1.ver=100]
12. Example: contention between transactions
Before updating c1, thread T2 must lock it and record the old version number.
[T1's log: c1.ver=100, c2.ver=200 | T2's log: c1.ver=100, lock c1 @ 100]
13. Example: contention between transactions
(1) Before updating c1.val, thread T2 must log the data it is going to overwrite.
(2) After logging the old value, T2 makes its update in place to c1.
[T1's log: c1.ver=100, c2.ver=200 | T2's log: c1.ver=100, lock c1 @ 100, old c1.val=10]
14. Example: contention between transactions
(1) Check that the version we locked matches the version we previously read.
(2) T2's transaction commits successfully: unlock the object, installing the new version number.
[T1's log: c1.ver=100, c2.ver=200 | T2's log: c1.ver=100, lock c1 @ 100, old c1.val=10]
15. Example: contention between transactions
(1) T1 attempts to commit: check that the versions it read are still up to date.
(2) Object c1 was updated from version 100 to 101, so T1's transaction is aborted and re-run.
[T1's log: c1.ver=100, c2.ver=200]
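Replaying this interleaving against the illustrative StmTx/TxObject sketch from slide 7 (the field values and the missing re-run loop are placeholders):

  // Hypothetical replay of slides 8-15; a real runtime would automatically
  // re-execute T1's atomic block after the abort.
  var c1 = new TxObject { Value = 10, Version = 100 };
  var c2 = new TxObject { Value = 20, Version = 200 };
  var t1 = new StmTx();
  var t2 = new StmTx();

  t1.OpenForRead(c1);             // T1 logs c1 at version 100
  t2.OpenForRead(c1);             // T2 logs c1 at version 100
  t1.OpenForRead(c2);             // T1 logs c2 at version 200

  t2.Write(c1, 42);               // T2 locks c1, logs the old value (10), updates in place
  t2.Commit();                    // c1 moves to version 101 and is unlocked

  try { t1.Commit(); }            // T1's validation sees c1 at version 101, not 100...
  catch (TxAbortedException) { }  // ...so T1 aborts and would be re-run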
16. Compiler integration
- We expose decomposed log-writing operations in the compiler's internal intermediate code (no change to MSIL):
  - OpenForRead before the first time we read from an object (e.g. c1 or c2 in the examples)
  - OpenForUpdate before the first time we update an object
  - LogOldValue before the first time we write to a given field

Source code:
  atomic {
    t = n.value;
    n = n.next;
  }

Basic intermediate code:
  OpenForRead(n);
  t = n.value;
  OpenForRead(n);
  n = n.next;

Optimized intermediate code:
  OpenForRead(n);
  t = n.value;
  n = n.next;
17. Compiler integration: avoiding upgrades

Source code:
  atomic {
    c1.val++;
  }

Compiler's intermediate code:
  OpenForRead(c1);
  temp1 = c1.val;
  temp1++;
  OpenForUpdate(c1);
  LogOldValue(c1.val);
  c1.val = temp1;

Optimized intermediate code (open for update up front, avoiding the read-to-update upgrade):
  OpenForUpdate(c1);
  temp1 = c1.val;
  temp1++;
  LogOldValue(c1.val);
  c1.val = temp1;
18. Compiler integration: other optimizations
- Hoist OpenFor and Log operations out of loops (sketched below)
- Avoid OpenFor and Log operations on objects allocated inside atomic blocks (these objects must be thread-local)
- Move OpenFor operations from methods to their callers
- Further decompose the operations so that logging-space checks can be combined
- Expose the implementation of the OpenFor and Log operations to inlining and further optimization
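A hedged sketch of the first optimization; Stm, Accumulator and the loop shape are invented stand-ins for Bartok's decomposed IR operations, just to show where the calls move:

  // Illustrative hoisting of loop-invariant OpenFor*/Log* operations.
  static class Stm
  {
      public static void OpenForRead(object o)                 { /* record o's version in the read log */ }
      public static void OpenForUpdate(object o)               { /* lock o and note it in the update log */ }
      public static void LogOldValue(object o, ref int field)  { /* save the field's old value for rollback */ }
  }

  class Accumulator { public int Total; }

  static class Example
  {
      // Naive decomposition: acc is re-opened and its field re-logged on every iteration.
      static void SumNaive(Accumulator acc, int[] data)
      {
          Stm.OpenForRead(data);
          for (int i = 0; i < data.Length; i++)
          {
              Stm.OpenForUpdate(acc);                // same object every time
              Stm.LogOldValue(acc, ref acc.Total);   // same field every time
              acc.Total += data[i];
          }
      }

      // After hoisting: the loop-invariant open/log operations run once, outside the loop.
      static void SumHoisted(Accumulator acc, int[] data)
      {
          Stm.OpenForRead(data);
          Stm.OpenForUpdate(acc);
          Stm.LogOldValue(acc, ref acc.Total);
          for (int i = 0; i < data.Length; i++)
              acc.Total += data[i];
      }
  }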
19. What about version wrap-around?
[Timeline: a transaction opens obj1 for read and sees version 17; over time, other transactions repeatedly open obj1 for update and commit, eventually wrapping the version number around until a commit sets obj1 back to version 17; the reader's validation now succeeds even though obj1 has changed: oops]
- Solution: validate read logs at each GC, and force a GC at least once every #versions transactions, so a version number cannot wrap all the way past an entry still sitting in a read log
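A hedged sketch of how such a guard could look; the constant, the counter and the GC.Collect() call are illustrative stand-ins for Bartok's internal hooks, and the read-log validation itself is assumed to happen inside the collector:

  // Illustrative wrap-around guard: count commits and force a collection
  // before the version space can be exhausted.
  using System;
  using System.Threading;

  static class VersionGuard
  {
      // With N bits of version number there are 2^N distinct versions;
      // N = 20 here is purely for illustration.
      const long VersionSpace = 1L << 20;
      static long committedSinceGc;

      public static void OnTransactionCommit()
      {
          // After VersionSpace commits, an object's version number could have
          // wrapped past a value still held in some read log, so force a
          // collection (which validates every thread's read log) before then.
          if (Interlocked.Increment(ref committedSinceGc) >= VersionSpace)
          {
              Interlocked.Exchange(ref committedSinceGc, 0);
              GC.Collect();   // stands in for the runtime's "force a GC" hook
          }
      }
  }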
20-23. Runtime integration: garbage collection
[Figure: a thread part-way through an atomic block, its log holding an undo entry for obj1.field (the old value), a read entry for obj2 at version 100, and an update entry for obj3, locked at version 100]
1. The GC runs while some threads are in atomic blocks
2. The GC visits the heap as normal, retaining objects that are needed if the blocks succeed
3. The GC also visits objects reachable from references overwritten in LogForUndo entries, retaining objects needed if any block rolls back
4. Log entries for unreachable objects are discarded: they're dead whether or not the block succeeds
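A hedged sketch of the extra tracing work, with invented types (UndoEntry, TxLog, StmGc) standing in for the real collector interface:

  // Illustrative only: shows which references the collector must treat as
  // roots while transactions are in flight.
  using System;
  using System.Collections.Generic;

  class UndoEntry
  {
      public object Target;         // object whose field was overwritten in place
      public object OverwrittenRef; // old reference value, needed only on rollback
  }

  class TxLog
  {
      public List<object> ReadSet = new();     // objects opened for read
      public List<UndoEntry> UndoLog = new();  // LogForUndo entries
  }

  static class StmGc
  {
      // Called from the collector's marking phase for each in-flight transaction.
      public static void MarkTransactionRoots(TxLog log, Action<object> markObject)
      {
          // Objects the block has read or updated are needed if it commits.
          foreach (var obj in log.ReadSet) markObject(obj);

          // Old values saved in undo entries are needed only if the block rolls
          // back, so the collector traces them as well.
          foreach (var e in log.UndoLog)
          {
              markObject(e.Target);
              if (e.OverwrittenRef != null) markObject(e.OverwrittenRef);
          }
          // (Not shown) entries whose Target proves unreachable from any root can
          // be discarded: the object is dead whether or not the block commits.
      }
  }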
24. Results: long-running tests
[Chart: normalised execution time for the benchmarks tree, go, skip, merge-sort and xlisp, comparing direct-update STM, direct-update STM with run-time filtering, and with compile-time optimizations, against the original application (no tx); the largest bars are labelled 10.8, 73 and 162]
25. Conclusions
- Atomic blocks and transactional memory provide a fundamental improvement to the abstractions for building shared-memory concurrent data structures
- They are only one part of the concurrency landscape
  - But we believe no one solution will solve all the challenges there
- Our experience with compiler integration shows that a pure software implementation can perform well for short transactions and scale to vast transactions (many accesses, many locations)
- We still need a better understanding of realistic workload distributions
27. Backup slides
- Backup slides beyond this point
28. Parallelism-preserving design
[Figure: four cores with a cache hierarchy (L1, L2, L3); a cache line held in shared (S) mode can be cached by multiple cores at once, while a line held in exclusive (E) mode lives in a single cache]
- Any update must pull the data into the local cache in exclusive mode
- Even if an operation is read-only, acquiring a multi-reader lock involves fetching a line in exclusive mode
- Our optimistic design lets data shared by multiple cores remain cached by all of them
- Scalability comes at the cost of wasted work when optimism does not pay off
29. Compilation
[Pipeline: MSIL with atomic block boundaries → IR with cloned atomic code → IR with explicit STM operations → IR with low-level STM operations]
30. Some examples (xlisp garbage collector)

  /* follow the left sublist if there is one */
  if (livecar(xl_this)) {
      xl_this.n_flags |= (byte)LEFT;
      tmp = prev;
      prev = xl_this;
      xl_this = prev.p;
      prev.p = tmp;
  }

Open prev for update here to avoid an inevitable upgrade.
31. Some examples (othello)

  public int PlayerPos(int xm, int ym, int opponent, bool upDate)
  {
      int rotate;   // 8 degrees of rotation
      int x, y;
      bool endTurn;
      bool plotPos;
      int turnOver = 0;
      // initial checking!
      if (this.Board[ym, xm] != BInfo.Blank)
          return turnOver;   // can't overwrite player
      ...

Calls to PlayerPos must open this for read; do this in the caller.
32. Basic design: open for read
[Figure: the object's header word holds the transactional version number; opening the object for read (1) stores the object reference into the read-objects log and (2) copies the metadata word into the log entry]
33. Basic design: open for update
[Figure: opening an object for update (1) stores the object reference into the updated-objects log, (2) copies the metadata word (the transactional version number) into the log entry, and (3) uses a CAS on the header word to acquire the object]
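A hedged sketch of step (3); the header-word encoding here (a positive version number, or the negated id of the owning transaction) is invented for illustration, and the real multi-use header word (next slide) packs more states:

  // Illustrative open-for-update: log, then CAS on the header word to acquire.
  using System.Threading;

  class ObjHeader
  {
      public int Word = 1;   // > 0: version number; < 0: -(owner transaction id)
  }

  class UpdateLogEntry
  {
      public ObjHeader Obj;
      public int VersionSeen;
  }

  class Tx
  {
      readonly int txId;   // assumed positive
      readonly System.Collections.Generic.List<UpdateLogEntry> updateLog = new();

      public Tx(int id) { txId = id; }

      // Returns false if another transaction owns the object (caller aborts).
      public bool OpenForUpdate(ObjHeader o)
      {
          int seen = Volatile.Read(ref o.Word);
          if (seen < 0) return seen == -txId;   // already locked: ok only if by us
          // (1) store the obj ref and (2) copy its metadata into the updated-objects log
          updateLog.Add(new UpdateLogEntry { Obj = o, VersionSeen = seen });
          // (3) CAS to acquire: swing the header from the version we saw to our id
          return Interlocked.CompareExchange(ref o.Word, -txId, seen) == seen;
      }
  }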
34. Multi-use header word
[Figure: object layout showing the multi-use header word alongside the vtable and fields]
35. Filtering duplicate log entries
- Per-thread table of recently logged objects / addresses
- Fast O(1) logical clearing by embedding transaction sequence numbers in entries (sketched below)
[Figure: each table entry tags a hash value with the sequence number of the transaction that wrote it]
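A hedged sketch of such a filter; the table size, hash function and field layout are invented, only the sequence-number trick is from the slide:

  // Illustrative per-thread filter for duplicate log entries.
  using System;

  class LogFilter
  {
      const int Size = 1024;                       // power of two, illustrative
      readonly IntPtr[] slotAddr = new IntPtr[Size];
      readonly uint[] slotSeq = new uint[Size];
      uint currentSeq = 1;                         // sequence number of the running transaction

      // Returns true if addr was already logged by the current transaction,
      // and records it otherwise. A collision merely evicts the previous entry,
      // causing at worst a harmless duplicate log write.
      public bool AlreadyLogged(IntPtr addr)
      {
          int slot = (int)((addr.ToInt64() >> 3) & (Size - 1));
          if (slotSeq[slot] == currentSeq && slotAddr[slot] == addr)
              return true;
          slotAddr[slot] = addr;
          slotSeq[slot] = currentSeq;
          return false;
      }

      // O(1) logical clear at the start of each transaction: stale entries are
      // recognised by their old sequence number instead of being overwritten.
      // (Wrap-around of the sequence number would force a real clear; ignored here.)
      public void BeginTransaction() => currentSeq++;
  }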
36. Semantics of atomic blocks
37. Challenges & opportunities
- Moving away from locks as a programming abstraction lets us re-think much of the concurrent programming landscape
[Figure: a layered stack - application software and data-parallel libraries at the top, using atomic blocks for synchronization and shared-state access, explicit threads only for explicit concurrency, and CILK-style parallel loops and calls; below that, a managed runtime STM implementation that re-uses STM mechanisms for TLS and optimistic (in the technical sense) parallel loops etc.; at the bottom, multi-core / many-core hardware with H/W TM or TM-acceleration]