Optimizing memory transactions

1
Optimizing memory transactions
  • Tim Harris, Mark Plesko, Avi Shinnar, David
    Tarditi

2
The Big Question: are atomic blocks feasible?
  • Atomic blocks may be great for the programmer,
    but can they be implemented with acceptable
    performance?
  • At first, atomic blocks look insanely expensive.
    A recent implementation (Harris & Fraser, OOPSLA
    '03):
  • Every load and store instruction logs information
    into a thread-local log
  • A store instruction writes the log only
  • A load instruction consults the log first
  • At the end of the block, validate the log and
    atomically commit it to shared memory
  • Assumptions throughout this talk:
  • Reads outnumber writes (3:1 or more)
  • Conflicts are rare
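As a concrete illustration of this log-based scheme, here is a minimal single-threaded sketch (the names `Var`, `Tx`, `store`, `load`, `commit` are ours, not the OOPSLA '03 implementation's): stores go only to a thread-local log, loads consult the log first, and commit validates the versions read before publishing the writes.

```python
class Var:
    """A transactional variable with a version number."""
    def __init__(self, value):
        self.value = value
        self.version = 0

class Tx:
    def __init__(self):
        self.reads = {}      # var -> version observed
        self.writes = {}     # var -> pending value

    def store(self, var, value):
        self.writes[var] = value          # a store writes the log only

    def load(self, var):
        if var in self.writes:            # a load consults the log first
            return self.writes[var]
        self.reads.setdefault(var, var.version)
        return var.value

    def commit(self):
        for var, seen in self.reads.items():
            if var.version != seen:       # validation failed: conflict
                return False
        for var, value in self.writes.items():
            var.value = value             # publish and bump versions
            var.version += 1
        return True
```

A writer that commits between a transaction's load and its commit makes validation fail, so the block would be re-executed.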

3
State of the art, 2003
[Chart: normalised execution time against a sequential baseline
(1.00x): coarse-grained locking 1.13x, fine-grained locking 2.57x,
Harris & Fraser WSTM 5.69x]
Workload: operations on a red-black tree, 1
thread, 6:1:1 lookup:insert:delete mix with keys
0..65535
4
Our new techniques, prototyped in Bartok
  • Direct-update STM
  • Allow transactions to make updates in place in
    the heap
  • Avoids reads needing to search the log to see
    earlier writes that the transaction has made
  • Makes successful commit operations faster at the
    cost of extra work on contention or when a
    transaction aborts
  • Compiler integration
  • Decompose the transactional memory operations
    into primitives
  • Expose the primitives to compiler optimization
    (e.g. to hoist concurrency control operations out
    of a loop)
  • Runtime system integration
  • Integration with the garbage collector or runtime
    system components to scale to atomic blocks
    containing 100M memory accesses

5
Results: concurrency control overhead
[Chart: normalised execution time against a sequential baseline
(1.00x): coarse-grained locking 1.13x, fine-grained locking 2.57x,
Harris & Fraser WSTM 5.69x, direct-update STM 2.04x, direct-update
STM + compiler integration 1.46x]
Workload: operations on a red-black tree, 1
thread, 6:1:1 lookup:insert:delete mix with keys
0..65535
Scalable to multicore
6
Results: scalability
[Chart: microseconds per operation against number of threads,
comparing coarse-grained locking, fine-grained locking, WSTM
(atomic blocks), DSTM (API), OSTM (API), and direct-update STM +
compiler integration]
7
Direct-update STM
  • Augment objects with (i) a lock, (ii) a version
    number
  • Transactional write:
  • Lock objects before they are written to (abort if
    another thread holds that lock)
  • Log the overwritten data: we need it to restore
    the heap in case of retry, transaction abort, or a
    conflict with a concurrent thread
  • Make the update in place to the object
  • Transactional read:
  • Log the object's version number
  • Read from the object itself
  • Commit:
  • Check the version numbers of objects we've read
  • Increment the version numbers of objects we've
    written, unlocking them
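A minimal sketch of this direct-update design (illustrative names, not Bartok's actual API): each object carries a lock and a version number, writes update in place after logging the old value, and commit validates the versions read. Real commit also has to check locks held by concurrent writers; this toy is single-threaded.

```python
class Obj:
    """A heap object augmented with (i) a lock, (ii) a version number."""
    def __init__(self, val):
        self.val = val
        self.version = 0
        self.owner = None            # lock: the transaction holding it

class Abort(Exception):
    pass

class Tx:
    def __init__(self):
        self.read_log = {}           # obj -> version number logged
        self.undo_log = []           # (obj, overwritten value)

    def open_for_read(self, obj):
        self.read_log.setdefault(obj, obj.version)
        return obj.val               # read from the object itself

    def write(self, obj, val):
        if obj.owner is None:        # lock before the first write
            obj.owner = self
            self.undo_log.append((obj, obj.val))
        elif obj.owner is not self:
            raise Abort()            # another transaction holds the lock
        obj.val = val                # make the update in place

    def commit(self):
        for obj, seen in self.read_log.items():
            if obj.version != seen:  # a concurrent commit intervened
                self.rollback()
                return False
        for obj, _ in self.undo_log:
            obj.version += 1         # install new version, unlock
            obj.owner = None
        return True

    def rollback(self):              # restore the heap, release locks
        for obj, old in reversed(self.undo_log):
            obj.val = old
            obj.owner = None
```

Because updates land in the heap itself, a read never has to search the undo log for the transaction's own earlier writes.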

8
Example: contention between transactions
T1's log: (empty)
T2's log: (empty)
9
Example: contention between transactions
T1's log: c1.ver:100
T2's log: (empty)
T1 reads from c1, logs that it saw version 100
10
Example: contention between transactions
T1's log: c1.ver:100
T2's log: c1.ver:100
T2 also reads from c1, logs that it saw version
100
11
Example: contention between transactions
T1's log: c1.ver:100, c2.ver:200
T2's log: c1.ver:100
Suppose T1 now reads from c2, sees it at version
200
12
Example: contention between transactions
T1's log: c1.ver:100, c2.ver:200
T2's log: c1.ver:100, lock: c1 @ 100
Before updating c1, thread T2 must lock it and
record the old version number
13
Example contention between transactions
(2) After logging the old value, T2 makes its
update in place to c1
T1s log
T2s log
c1.ver100 c2.ver200
c1.ver100 lock c1, 100 c1.val10
(1) Before updating c1.val, thread T2 must log
the data its going to overwrite
14
Example: contention between transactions
T1's log: c1.ver:100, c2.ver:200
T2's log: c1.ver:100, lock: c1 @ 100, c1.val:10
(1) Check that the version we locked matches the
version we previously read
(2) T2's transaction commits successfully.
Unlock the object, installing the new version
number
15
Example: contention between transactions
T1's log: c1.ver:100, c2.ver:200
T2's log: (committed, now empty)
(1) T1 attempts to commit. Check that the versions
it read are still up-to-date.
(2) Object c1 was updated from version 100 to
101, so T1's transaction is aborted and re-run.
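The interleaving on these slides can be replayed as a small self-contained script. Plain dicts stand in for objects; c1's old value 10 comes from the slides, while c2's value and T2's new value are invented for illustration.

```python
# Objects carry a value, a version number, and a lock field.
c1 = {"val": 10, "ver": 100, "locked_by": None}
c2 = {"val": 40, "ver": 200, "locked_by": None}

t1_read_log = {"c1": c1["ver"]}        # T1 reads c1 at version 100
t2_read_log = {"c1": c1["ver"]}        # T2 also reads c1 at 100
t1_read_log["c2"] = c2["ver"]          # T1 reads c2 at version 200

# T2 updates c1: lock it, log the old value, update in place.
c1["locked_by"] = "T2"
t2_undo = {"c1.val": c1["val"]}        # old value 10, kept for abort
c1["val"] = 11

# T2 commits: the version it locked must match the version it read.
assert c1["ver"] == t2_read_log["c1"]
c1["ver"] += 1                         # install version 101 ...
c1["locked_by"] = None                 # ... and unlock

# T1 attempts to commit: its read of c1 is now stale.
objects = {"c1": c1, "c2": c2}
t1_valid = all(objects[n]["ver"] == v for n, v in t1_read_log.items())
assert not t1_valid                    # T1 is aborted and re-run
```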
16
Compiler integration
  • We expose decomposed log-writing operations in
    the compiler's internal intermediate code (no
    change to MSIL)
  • OpenForRead: before the first time we read from
    an object (e.g. c1 or c2 in the examples)
  • OpenForUpdate: before the first time we update
    an object
  • LogOldValue: before the first time we write to a
    given field

Source code:
  atomic { t = n.value; n = n.next; }
Basic intermediate code:
  OpenForRead(n); t = n.value;
  OpenForRead(n); n = n.next;
Optimized intermediate code:
  OpenForRead(n); t = n.value; n = n.next;
17
Compiler integration: avoiding upgrades
Source code:
  atomic { c1.val++; }
Compiler's intermediate code:
  OpenForRead(c1); temp1 = c1.val; temp1++;
  OpenForUpdate(c1); LogOldValue(c1.val);
  c1.val = temp1;
Optimized intermediate code:
  OpenForUpdate(c1); temp1 = c1.val; temp1++;
  LogOldValue(c1.val); c1.val = temp1;
18
Compiler integration: other optimizations
  • Hoist OpenFor and Log operations out of loops
  • Avoid OpenFor and Log operations on objects
    allocated inside atomic blocks (these objects
    must be thread-local)
  • Move OpenFor operations from methods to their
    callers
  • Further decompose operations to allow
    logging-space checks to be combined
  • Expose the implementation of OpenFor and Log to
    inlining and further optimization
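As a sketch of the kind of rewrite involved (not the actual Bartok pass), redundant-open elimination over straight-line IR can be expressed as a single scan. It ignores reassignment of locals, which a real pass must track.

```python
def remove_redundant_opens(ops):
    """ops: list of (opcode, operand) tuples from one basic block."""
    opened = set()
    out = []
    for opcode, operand in ops:
        if opcode == "OpenForRead" and operand in opened:
            continue                  # object already open: drop the op
        if opcode in ("OpenForRead", "OpenForUpdate"):
            opened.add(operand)
        out.append((opcode, operand))
    return out

# The earlier example: two reads of fields of n need only one open.
block = [
    ("OpenForRead", "n"), ("load", "t = n.value"),
    ("OpenForRead", "n"), ("load", "n = n.next"),
]
assert remove_redundant_opens(block) == [
    ("OpenForRead", "n"), ("load", "t = n.value"),
    ("load", "n = n.next"),
]
```

The same scan also subsumes an OpenForRead that follows an OpenForUpdate of the same object, which is what the upgrade-avoidance slide exploits.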

19
What about version wrap-around?
[Timeline: a long-running reader opens obj1 for read and sees
version 17; meanwhile writers repeatedly open obj1 for update and
commit (version 16 to 17, 17 to 18, ...), until the version number
wraps around and a commit sets obj1 back to version 17. Oops: the
reader's validation can no longer tell that obj1 has changed]
  • Solution: validate the read log at each GC, and
    force a GC at least once every #versions
    transactions

20
Runtime integration: garbage collection
1. GC runs while some threads are in atomic blocks
[Diagram: a thread inside an atomic block whose logs record
obj1.field: old value, obj2.ver: 100, obj3: locked @ 100]
21
Runtime integration: garbage collection
1. GC runs while some threads are in atomic blocks
[Diagram: logs record obj1.field: old value, obj2.ver: 100,
obj3: locked @ 100]
2. GC visits the heap as normal, retaining
objects that are needed if the blocks succeed
22
Runtime integration garbage collection
3. GC visits objects reachable from refs
overwritten in LogForUndo entries retaining
objects needed if any block rolls back
1. GC runs while some threads are in atomic blocks
atomic
obj1.field old obj2.ver 100 obj3.locked _at_ 100
2. GC visits the heap as normal retaining
objects that are needed if the blocks succeed
23
Runtime integration: garbage collection
1. GC runs while some threads are in atomic blocks
[Diagram: logs record obj1.field: old value, obj2.ver: 100,
obj3: locked @ 100]
2. GC visits the heap as normal, retaining
objects that are needed if the blocks succeed
3. GC visits objects reachable from refs
overwritten in LogForUndo entries, retaining
objects needed if any block rolls back
4. Discard log entries for unreachable objects:
they're dead whether or not the block succeeds
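A toy sketch of steps 1-4 (assumed data structures, not Bartok's collector): mark from the roots, additionally mark whatever the undo entries of live objects could reinstall, then drop entries whose object is itself unreachable.

```python
class Node:
    """Minimal heap object: just a list of outgoing references."""
    def __init__(self, refs=None):
        self.refs = refs if refs is not None else []

def collect(roots, undo_log):
    """undo_log: list of (object, field name, overwritten reference)."""
    live = set()

    def mark(obj):                      # standard transitive marking
        stack = [obj]
        while stack:
            o = stack.pop()
            if o is None or id(o) in live:
                continue
            live.add(id(o))
            stack.extend(o.refs)

    for r in roots:                     # 2. visit the heap as normal
        mark(r)
    for obj, _field, old_ref in undo_log:
        if id(obj) in live:             # 3. refs overwritten in undo
            mark(old_ref)               #    entries keep their targets
    # 4. entries for unreachable objects are dead either way
    return [e for e in undo_log if id(e[0]) in live]

# A live object's undo entry (and the old value it could restore)
# survives; an unreachable object's entry is discarded.
old, updated = Node(), Node()
root = Node([updated])
kept = collect([root], [(updated, "f", old), (Node(), "f", Node())])
assert kept == [(updated, "f", old)]
```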
24
Results: long-running tests
[Chart: normalised execution time on five benchmarks (tree, go,
skip, merge-sort, xlisp), comparing direct-update STM, run-time
filtering, and compile-time optimizations against the original
application (no tx); the tallest unoptimized bars are labelled
10.8, 73 and 162]
25
Conclusions
  • Atomic blocks and transactional memory provide a
    fundamental improvement to the abstractions for
    building shared-memory concurrent data structures
  • Only one part of the concurrency landscape
  • But we believe no one solution will solve all the
    challenges there
  • Our experience with compiler integration shows
    that a pure software implementation can perform
    well for short transactions and scale to vast
    transactions (many accesses, many locations)
  • We still need a better understanding of realistic
    workload distributions

27
Backup slides
  • Backup slides beyond this point

28
Parallelism-preserving design
[Diagram: four cores, each with a private L1 cache, L2 caches
below them, and a shared L3; a line held in shared (S) mode is
cached by all four cores, while a line held in exclusive (E) mode
lives in a single cache]
  • Any updates must pull data into the local cache
    in exclusive mode
  • Even if an operation is read-only, acquiring a
    multi-reader lock will involve fetching a line in
    exclusive mode
  • Our optimistic design lets data shared by
    multiple cores remain cached by all of them
  • Scalability at the cost of wasted work when
    optimism does not pay off

29
Compilation
MSIL: atomic block boundaries
IR: cloned atomic code
IR: explicit STM operations
IR: low-level STM operations
30
Some examples (xlisp garbage collector)
  /* follow the left sublist if there is one */
  if (livecar(xl_this)) {
      xl_this.n_flags |= (byte)LEFT;
      tmp = prev;
      prev = xl_this;
      xl_this = prev.p;
      prev.p = tmp;
  }

Open prev for update here to avoid an
inevitable upgrade
31
Some examples (othello)
  public int PlayerPos(int xm, int ym, int opponent, bool upDate) {
      int rotate;               // 8 degrees of rotation
      int x, y;
      bool endTurn;
      bool plotPos;
      int turnOver = 0;
      // initial checking!
      if (this.Board[ym, xm] != BInfo.Blank)
          return turnOver;      // can't overwrite player
      // ...
  }

Calls to PlayerPos must open "this" for read; do
this in the caller
32
Basic design: open for read
[Diagram: filling an entry in the read-objects log:
(1) store the object reference, (2) copy the object's metadata,
its transactional version number, from the header word that sits
alongside the vtable and fields]
33
Basic design: open for update
[Diagram: filling an entry in the updated-objects log:
(1) store the object reference, (2) copy the metadata
(transactional version number), (3) CAS on the header word to
acquire the object]
34
Multi-use header word
[Diagram: object layout showing the vtable pointer, the multi-use
header word, and the fields]
35
Filtering duplicate log entries
  • Per-thread table of recently logged objects /
    addresses
  • Fast O(1) logical clearing by embedding
    transaction sequence numbers in entries

[Diagram: each table entry packs a hash value and tag with the
sequence number of the transaction that inserted it]
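A sketch of such a table (the layout and names are illustrative): each slot pairs the logged address with the sequence number of the transaction that inserted it, so clearing between transactions is a single increment.

```python
class FilterTable:
    """Per-thread filter of recently logged addresses."""
    def __init__(self, size=256):
        self.size = size
        self.slots = [(None, 0)] * size   # (tag, seq) per entry
        self.seq = 1                      # current transaction number

    def seen(self, addr):
        """True if addr was already logged in this transaction;
        otherwise record it and return False."""
        i = hash(addr) % self.size
        tag, seq = self.slots[i]
        if seq == self.seq and tag == addr:
            return True                   # duplicate: skip the log write
        # On a miss (or a hash collision) just overwrite the slot;
        # re-logging an address is safe, merely redundant.
        self.slots[i] = (addr, self.seq)
        return False

    def clear(self):
        self.seq += 1                     # O(1) logical clear: every
                                          # stale entry now reads empty
```

Starting a new transaction calls `clear()`; entries written under an older sequence number are treated as empty without ever touching the table's slots.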
36
Semantics of atomic blocks
  • I/O
  • Details
  • Workloads

37
Challenges and opportunities
  • Moving away from locks as a programming
    abstraction lets us re-think much of the
    concurrent programming landscape

[Stack diagram, top to bottom:]
Application software
  • Atomic blocks for synchronization and shared-state
    access; explicit threads only for explicit
    concurrency; CILK-style parallel loops and calls
Data-parallel libraries
Managed runtime / STM implementation
  • Re-use STM mechanisms for TLS and optimistic (in
    the technical sense) parallel loops etc.
Multi-core / many-core hardware
  • H/W TM or TM-acceleration