Title: Optimizing memory transactions
1. Optimizing memory transactions
- Tim Harris, Mark Plesko, Avi Shinnar, David Tarditi
2. The Big Question: are atomic blocks feasible?
- Atomic blocks may be great for the programmer, but can they be implemented with acceptable performance?
- At first, atomic blocks look insanely expensive. A recent implementation (Harris & Fraser, OOPSLA '03), sketched below:
  - Every load and store instruction logs information into a thread-local log
  - A store instruction writes only to the log
  - A load instruction consults the log first
  - At the end of the block, validate the log and atomically commit it to shared memory
- Assumptions throughout this talk:
  - Reads outnumber writes (3:1 or more)
  - Conflicts are rare
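To make the logging scheme above concrete, here is a minimal C# sketch. TxCell, WstmTx and the global commit lock are illustrative stand-ins for WSTM's word-based metadata and CAS-based commit, not the paper's API.

  // Minimal sketch of the deferred-update ("write the log only") scheme above.
  // TxCell/WstmTx are invented names; the global lock simulates the atomic commit.
  using System.Collections.Generic;

  class TxCell
  {
      public int Value;     // the shared datum
      public int Version;   // bumped on every committed update
  }

  class WstmTx
  {
      static readonly object commitLock = new object();   // stand-in for the CAS-based commit
      readonly Dictionary<TxCell, int> readLog = new();    // cell -> version seen
      readonly Dictionary<TxCell, int> writeLog = new();   // cell -> tentative new value

      public int ReadCell(TxCell c)
      {
          // A load consults the thread-local log first...
          if (writeLog.TryGetValue(c, out int pending)) return pending;
          // ...otherwise it reads shared memory and records the version it saw.
          if (!readLog.ContainsKey(c)) readLog[c] = c.Version;
          return c.Value;
      }

      public void WriteCell(TxCell c, int v)
      {
          // A store writes only to the log; the heap is untouched until commit.
          writeLog[c] = v;
      }

      public bool Commit()
      {
          lock (commitLock)
          {
              // Validate: every cell we read must still be at the version we logged.
              foreach (var (cell, ver) in readLog)
                  if (cell.Version != ver) return false;   // conflict: caller re-runs the block
              // Publish the buffered writes.
              foreach (var (cell, val) in writeLog)
              {
                  cell.Value = val;
                  cell.Version++;
              }
              return true;
          }
      }
  }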
3. State of the art, 2003
[Chart: normalised execution time, sequential baseline = 1.00x]
- Coarse-grained locking: 1.13x
- Fine-grained locking: 2.57x
- Harris & Fraser WSTM: 5.69x
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535
4. Our new techniques, prototyped in Bartok
- Direct-update STM
  - Allow transactions to make updates in place in the heap
  - Avoids reads needing to search the log to see earlier writes that the transaction has made
  - Makes successful commit operations faster at the cost of extra work on contention or when a transaction aborts
- Compiler integration
  - Decompose the transactional memory operations into primitives
  - Expose the primitives to compiler optimization (e.g. to hoist concurrency control operations out of a loop)
- Runtime system integration
  - Integration with the garbage collector and runtime system components to scale to atomic blocks containing 100M memory accesses
5. Results: concurrency control overhead
[Chart: normalised execution time, sequential baseline = 1.00x]
- Coarse-grained locking: 1.13x
- Fine-grained locking: 2.57x
- Harris & Fraser WSTM: 5.69x
- Direct-update STM: 2.04x
- Direct-update STM + compiler integration: 1.46x
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535
Scalable to multicore?
6. Results: scalability
[Chart: microseconds per operation vs. number of threads, comparing coarse-grained locking, fine-grained locking, WSTM (atomic blocks), DSTM (API), OSTM (API), and direct-update STM with compiler integration]
7. Direct-update STM
- Augment objects with (i) a lock and (ii) a version number
- Transactional write:
  - Lock objects before they are written to (abort if another thread holds that lock)
  - Log the overwritten data: we need it to restore the heap in case of retry, transaction abort, or a conflict with a concurrent thread
  - Make the update in place to the object
- Transactional read:
  - Log the object's version number
  - Read from the object itself
- Commit:
  - Check the version numbers of the objects we've read
  - Increment the version numbers of the objects we've written, unlocking them
(A minimal sketch of these operations follows.)
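The steps above can be made concrete with a small sketch. This is not Bartok's code: TxObject, StmTx, Write and TxAbortedException are illustrative names, the per-object lock is modelled as an owner reference, and re-running an aborted block is left to the caller.

  // Minimal sketch of a direct-update STM object and transaction.
  using System;
  using System.Collections.Generic;
  using System.Threading;

  class TxObject
  {
      public int Value;        // the transactional field, for illustration
      public int Version = 1;  // bumped when a writing transaction commits
      public StmTx Owner;      // null when unlocked
  }

  class TxAbortedException : Exception { }

  class StmTx
  {
      readonly Dictionary<TxObject, int> readLog = new();  // object -> version seen
      readonly Dictionary<TxObject, int> undoLog = new();  // object -> overwritten value

      public int OpenForRead(TxObject o)
      {
          if (!readLog.ContainsKey(o)) readLog[o] = o.Version;  // log the version number
          return o.Value;                                       // read from the object itself
      }

      public void OpenForUpdate(TxObject o)
      {
          // Lock the object before writing; abort if another transaction holds it.
          var prev = Interlocked.CompareExchange(ref o.Owner, this, null);
          if (prev != null && prev != this) Abort();
      }

      public void Write(TxObject o, int newValue)
      {
          OpenForUpdate(o);
          if (!undoLog.ContainsKey(o)) undoLog[o] = o.Value;    // log the overwritten data
          o.Value = newValue;                                   // update in place
      }

      public void Commit()
      {
          // Check the version numbers of the objects we've read: each must be
          // unchanged and not locked by some other transaction.
          foreach (var (obj, ver) in readLog)
              if (obj.Version != ver || (obj.Owner != null && obj.Owner != this))
                  Abort();
          // Increment the version numbers of the objects we've written, unlocking them.
          foreach (var obj in undoLog.Keys)
          {
              obj.Version++;
              obj.Owner = null;   // a real release would also publish with a fence
          }
      }

      void Abort()
      {
          // Restore the heap from the undo log, release our locks, and signal a re-run.
          foreach (var (obj, oldValue) in undoLog)
          {
              obj.Value = oldValue;
              obj.Owner = null;
          }
          throw new TxAbortedException();
      }
  }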
8. Example: contention between transactions
[Figure: threads T1 and T2 start with empty transaction logs]
9. Example: contention between transactions
T1 reads from c1 and logs that it saw version 100.
[T1's log: c1.ver=100 | T2's log: empty]
10. Example: contention between transactions
T2 also reads from c1 and logs that it saw version 100.
[T1's log: c1.ver=100 | T2's log: c1.ver=100]
11. Example: contention between transactions
Suppose T1 now reads from c2 and sees it at version 200.
[T1's log: c1.ver=100, c2.ver=200 | T2's log: c1.ver=100]
12. Example: contention between transactions
Before updating c1, thread T2 must lock it and record the old version number.
[T1's log: c1.ver=100, c2.ver=200 | T2's log: c1.ver=100, lock c1 @ 100]
13. Example: contention between transactions
(1) Before updating c1.val, thread T2 must log the data it is going to overwrite.
(2) After logging the old value, T2 makes its update in place to c1.
[T1's log: c1.ver=100, c2.ver=200 | T2's log: c1.ver=100, lock c1 @ 100, old c1.val=10]
14. Example: contention between transactions
(1) Check that the version we locked matches the version we previously read.
(2) T2's transaction commits successfully: unlock the object, installing the new version number.
[T1's log: c1.ver=100, c2.ver=200 | T2's log: c1.ver=100, lock c1 @ 100, old c1.val=10]
15. Example: contention between transactions
(1) T1 attempts to commit: check that the versions it read are still up to date.
(2) Object c1 was updated from version 100 to 101, so T1's transaction is aborted and re-run.
[T1's log: c1.ver=100, c2.ver=200]
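Replaying this interleaving against the illustrative StmTx/TxObject sketch from slide 7 (the field values and the missing re-run loop are placeholders):

  // Hypothetical replay of slides 8-15; a real runtime would automatically
  // re-execute T1's atomic block after the abort.
  var c1 = new TxObject { Value = 10, Version = 100 };
  var c2 = new TxObject { Value = 20, Version = 200 };
  var t1 = new StmTx();
  var t2 = new StmTx();

  t1.OpenForRead(c1);             // T1 logs c1 at version 100
  t2.OpenForRead(c1);             // T2 logs c1 at version 100
  t1.OpenForRead(c2);             // T1 logs c2 at version 200

  t2.Write(c1, 42);               // T2 locks c1, logs the old value (10), updates in place
  t2.Commit();                    // c1 moves to version 101 and is unlocked

  try { t1.Commit(); }            // T1's validation sees c1 at version 101, not 100...
  catch (TxAbortedException) { }  // ...so T1 aborts and would be re-run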
16. Compiler integration
- We expose decomposed log-writing operations in the compiler's internal intermediate code (no change to MSIL):
  - OpenForRead before the first time we read from an object (e.g. c1 or c2 in the examples)
  - OpenForUpdate before the first time we update an object
  - LogOldValue before the first time we write to a given field

Source code:
  atomic {
    t = n.value;
    n = n.next;
  }

Basic intermediate code:
  OpenForRead(n);
  t = n.value;
  OpenForRead(n);
  n = n.next;

Optimized intermediate code:
  OpenForRead(n);
  t = n.value;
  n = n.next;
17. Compiler integration: avoiding upgrades

Source code:
  atomic {
    c1.val++;
  }

Compiler's intermediate code:
  OpenForRead(c1);
  temp1 = c1.val;
  temp1++;
  OpenForUpdate(c1);
  LogOldValue(c1.val);
  c1.val = temp1;

Optimized intermediate code (open for update up front, avoiding the read-to-update upgrade):
  OpenForUpdate(c1);
  temp1 = c1.val;
  temp1++;
  LogOldValue(c1.val);
  c1.val = temp1;
18. Compiler integration: other optimizations
- Hoist OpenFor and Log operations out of loops (sketched below)
- Avoid OpenFor and Log operations on objects allocated inside atomic blocks (these objects must be thread-local)
- Move OpenFor operations from methods to their callers
- Further decompose the operations so that logging-space checks can be combined
- Expose the implementation of the OpenFor and Log operations to inlining and further optimization
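A hedged sketch of the first optimization; Stm, Accumulator and the loop shape are invented stand-ins for Bartok's decomposed IR operations, just to show where the calls move:

  // Illustrative hoisting of loop-invariant OpenFor*/Log* operations.
  static class Stm
  {
      public static void OpenForRead(object o)                 { /* record o's version in the read log */ }
      public static void OpenForUpdate(object o)               { /* lock o and note it in the update log */ }
      public static void LogOldValue(object o, ref int field)  { /* save the field's old value for rollback */ }
  }

  class Accumulator { public int Total; }

  static class Example
  {
      // Naive decomposition: acc is re-opened and its field re-logged on every iteration.
      static void SumNaive(Accumulator acc, int[] data)
      {
          Stm.OpenForRead(data);
          for (int i = 0; i < data.Length; i++)
          {
              Stm.OpenForUpdate(acc);                // same object every time
              Stm.LogOldValue(acc, ref acc.Total);   // same field every time
              acc.Total += data[i];
          }
      }

      // After hoisting: the loop-invariant open/log operations run once, outside the loop.
      static void SumHoisted(Accumulator acc, int[] data)
      {
          Stm.OpenForRead(data);
          Stm.OpenForUpdate(acc);
          Stm.LogOldValue(acc, ref acc.Total);
          for (int i = 0; i < data.Length; i++)
              acc.Total += data[i];
      }
  }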
19. What about version wrap-around?
[Timeline: a transaction opens obj1 for read and sees version 17; over time, other transactions repeatedly open obj1 for update and commit, eventually wrapping the version number around until a commit sets obj1 back to version 17; the reader's validation now succeeds even though obj1 has changed: oops]
- Solution: validate read logs at each GC, and force a GC at least once every #versions transactions, so a version number cannot wrap all the way past an entry still sitting in a read log
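A hedged sketch of how such a guard could look; the constant, the counter and the GC.Collect() call are illustrative stand-ins for Bartok's internal hooks, and the read-log validation itself is assumed to happen inside the collector:

  // Illustrative wrap-around guard: count commits and force a collection
  // before the version space can be exhausted.
  using System;
  using System.Threading;

  static class VersionGuard
  {
      // With N bits of version number there are 2^N distinct versions;
      // N = 20 here is purely for illustration.
      const long VersionSpace = 1L << 20;
      static long committedSinceGc;

      public static void OnTransactionCommit()
      {
          // After VersionSpace commits, an object's version number could have
          // wrapped past a value still held in some read log, so force a
          // collection (which validates every thread's read log) before then.
          if (Interlocked.Increment(ref committedSinceGc) >= VersionSpace)
          {
              Interlocked.Exchange(ref committedSinceGc, 0);
              GC.Collect();   // stands in for the runtime's "force a GC" hook
          }
      }
  }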
20-23. Runtime integration: garbage collection
[Figure: a thread part-way through an atomic block, its log holding an undo entry for obj1.field (the old value), a read entry for obj2 at version 100, and an update entry for obj3, locked at version 100]
1. The GC runs while some threads are in atomic blocks
2. The GC visits the heap as normal, retaining objects that are needed if the blocks succeed
3. The GC also visits objects reachable from references overwritten in LogForUndo entries, retaining objects needed if any block rolls back
4. Log entries for unreachable objects are discarded: they're dead whether or not the block succeeds
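A hedged sketch of the extra tracing work, with invented types (UndoEntry, TxLog, StmGc) standing in for the real collector interface:

  // Illustrative only: shows which references the collector must treat as
  // roots while transactions are in flight.
  using System;
  using System.Collections.Generic;

  class UndoEntry
  {
      public object Target;         // object whose field was overwritten in place
      public object OverwrittenRef; // old reference value, needed only on rollback
  }

  class TxLog
  {
      public List<object> ReadSet = new();     // objects opened for read
      public List<UndoEntry> UndoLog = new();  // LogForUndo entries
  }

  static class StmGc
  {
      // Called from the collector's marking phase for each in-flight transaction.
      public static void MarkTransactionRoots(TxLog log, Action<object> markObject)
      {
          // Objects the block has read or updated are needed if it commits.
          foreach (var obj in log.ReadSet) markObject(obj);

          // Old values saved in undo entries are needed only if the block rolls
          // back, so the collector traces them as well.
          foreach (var e in log.UndoLog)
          {
              markObject(e.Target);
              if (e.OverwrittenRef != null) markObject(e.OverwrittenRef);
          }
          // (Not shown) entries whose Target proves unreachable from any root can
          // be discarded: the object is dead whether or not the block commits.
      }
  }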
24. Results: long-running tests
[Chart: normalised execution time for the benchmarks tree, go, skip, merge-sort and xlisp, comparing direct-update STM, direct-update STM with run-time filtering, and with compile-time optimizations, against the original application (no tx); the largest bars are labelled 10.8, 73 and 162]
25. Conclusions
- Atomic blocks and transactional memory provide a fundamental improvement to the abstractions for building shared-memory concurrent data structures
- They are only one part of the concurrency landscape
  - But we believe no one solution will solve all the challenges there
- Our experience with compiler integration shows that a pure software implementation can perform well for short transactions and scale to vast transactions (many accesses, many locations)
- We still need a better understanding of realistic workload distributions
27. Backup slides
- Backup slides beyond this point
28. Parallelism-preserving design
[Figure: four cores with a cache hierarchy (L1, L2, L3); a cache line held in shared (S) mode can be cached by multiple cores at once, while a line held in exclusive (E) mode lives in a single cache]
- Any update must pull the data into the local cache in exclusive mode
- Even if an operation is read-only, acquiring a multi-reader lock involves fetching a line in exclusive mode
- Our optimistic design lets data shared by multiple cores remain cached by all of them
- Scalability comes at the cost of wasted work when optimism does not pay off
29. Compilation
[Pipeline: MSIL with atomic block boundaries → IR with cloned atomic code → IR with explicit STM operations → IR with low-level STM operations]
30. Some examples (xlisp garbage collector)

  /* follow the left sublist if there is one */
  if (livecar(xl_this)) {
      xl_this.n_flags |= (byte)LEFT;
      tmp = prev;
      prev = xl_this;
      xl_this = prev.p;
      prev.p = tmp;
  }

Open prev for update here to avoid an inevitable upgrade.
31. Some examples (othello)

  public int PlayerPos(int xm, int ym, int opponent, bool upDate)
  {
      int rotate;   // 8 degrees of rotation
      int x, y;
      bool endTurn;
      bool plotPos;
      int turnOver = 0;
      // initial checking!
      if (this.Board[ym, xm] != BInfo.Blank)
          return turnOver;   // can't overwrite player
      ...

Calls to PlayerPos must open this for read; do this in the caller.
32. Basic design: open for read
[Figure: the object's header word holds the transactional version number; opening the object for read (1) stores the object reference into the read-objects log and (2) copies the metadata word into the log entry]
33. Basic design: open for update
[Figure: opening an object for update (1) stores the object reference into the updated-objects log, (2) copies the metadata word (the transactional version number) into the log entry, and (3) uses a CAS on the header word to acquire the object]
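A hedged sketch of step (3); the header-word encoding here (a positive version number, or the negated id of the owning transaction) is invented for illustration, and the real multi-use header word (next slide) packs more states:

  // Illustrative open-for-update: log, then CAS on the header word to acquire.
  using System.Threading;

  class ObjHeader
  {
      public int Word = 1;   // > 0: version number; < 0: -(owner transaction id)
  }

  class UpdateLogEntry
  {
      public ObjHeader Obj;
      public int VersionSeen;
  }

  class Tx
  {
      readonly int txId;   // assumed positive
      readonly System.Collections.Generic.List<UpdateLogEntry> updateLog = new();

      public Tx(int id) { txId = id; }

      // Returns false if another transaction owns the object (caller aborts).
      public bool OpenForUpdate(ObjHeader o)
      {
          int seen = Volatile.Read(ref o.Word);
          if (seen < 0) return seen == -txId;   // already locked: ok only if by us
          // (1) store the obj ref and (2) copy its metadata into the updated-objects log
          updateLog.Add(new UpdateLogEntry { Obj = o, VersionSeen = seen });
          // (3) CAS to acquire: swing the header from the version we saw to our id
          return Interlocked.CompareExchange(ref o.Word, -txId, seen) == seen;
      }
  }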
34. Multi-use header word
[Figure: object layout showing the multi-use header word alongside the vtable and fields]
35. Filtering duplicate log entries
- Per-thread table of recently logged objects / addresses
- Fast O(1) logical clearing by embedding transaction sequence numbers in entries (sketched below)
[Figure: each table entry tags a hash value with the sequence number of the transaction that wrote it]
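A hedged sketch of such a filter; the table size, hash function and field layout are invented, only the sequence-number trick is from the slide:

  // Illustrative per-thread filter for duplicate log entries.
  using System;

  class LogFilter
  {
      const int Size = 1024;                       // power of two, illustrative
      readonly IntPtr[] slotAddr = new IntPtr[Size];
      readonly uint[] slotSeq = new uint[Size];
      uint currentSeq = 1;                         // sequence number of the running transaction

      // Returns true if addr was already logged by the current transaction,
      // and records it otherwise. A collision merely evicts the previous entry,
      // causing at worst a harmless duplicate log write.
      public bool AlreadyLogged(IntPtr addr)
      {
          int slot = (int)((addr.ToInt64() >> 3) & (Size - 1));
          if (slotSeq[slot] == currentSeq && slotAddr[slot] == addr)
              return true;
          slotAddr[slot] = addr;
          slotSeq[slot] = currentSeq;
          return false;
      }

      // O(1) logical clear at the start of each transaction: stale entries are
      // recognised by their old sequence number instead of being overwritten.
      // (Wrap-around of the sequence number would force a real clear; ignored here.)
      public void BeginTransaction() => currentSeq++;
  }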
36. Semantics of atomic blocks
37. Challenges & opportunities
- Moving away from locks as a programming abstraction lets us re-think much of the concurrent programming landscape
[Figure: a layered stack - application software and data-parallel libraries at the top, using atomic blocks for synchronization and shared-state access, explicit threads only for explicit concurrency, and CILK-style parallel loops and calls; below that, a managed runtime STM implementation that re-uses STM mechanisms for TLS and optimistic (in the technical sense) parallel loops etc.; at the bottom, multi-core / many-core hardware with H/W TM or TM-acceleration]