Title: Compiler and Runtime Support for Efficient Software Transactional Memory
1. Compiler and Runtime Support for Efficient Software Transactional Memory
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
2. Motivation
- Multi-core architectures are mainstream
  - Software concurrency needed for scalability
- Concurrent programming is hard
  - Difficult to reason about shared data
- Traditional mechanism: lock-based synchronization
  - Hard to use
  - Must be fine-grain for scalability
  - Deadlocks
  - Not easily composable
- New solution: Transactional Memory (TM)
  - Simpler programming model: atomicity, consistency, isolation
  - No deadlocks
  - Composability
  - Optimistic concurrency
  - Analogy
    - GC is to memory allocation as TM is to mutual exclusion
3. Composability
    class Bank {
        ConcurrentHashMap accounts;
        ...
        void deposit(String name, int amount) {
            synchronized (accounts) {
                int balance = accounts.get(name);  // Get the current balance
                balance = balance + amount;        // Increment it
                accounts.put(name, balance);       // Set the new balance
            }
        }
        ...
    }
- Thread-safe, but no scaling
  - ConcurrentHashMap (Java 5/JSR 166) does not help
  - Performance requires redesign from scratch and fine-grain locking
4. Transactional solution
    class Bank {
        HashMap accounts;
        ...
        void deposit(String name, int amount) {
            atomic {
                int balance = accounts.get(name);  // Get the current balance
                balance = balance + amount;        // Increment it
                accounts.put(name, balance);       // Set the new balance
            }
        }
        ...
    }
- The underlying system provides
  - isolation (thread safety)
  - optimistic concurrency
5. Transactions are Composable
[Figure: scalability on a 16-way 2.2 GHz Xeon system]
6. Our System
- A Java Software Transactional Memory (STM) system
  - Pure software implementation
  - Language extensions in Java
  - Integrated with JVM & JIT
- Novel features
  - Rich transactional language constructs in Java
  - Efficient, first-class nested transactions
  - RISC-like STM API
  - Compiler optimizations
  - Per-type word- and object-level conflict detection
  - Complete GC support
7. System Overview
[Diagram: system pipeline; labeled components include Polyglot and StarJIT]
8. Transactional Java
- Java + new language constructs
  - Atomic: execute block atomically
    - atomic {S}
  - Retry: block until alternate path possible
    - atomic {... retry; ...}
  - Orelse: compose alternate atomic blocks
    - atomic {S1} orelse {S2} ... orelse {Sn}
  - Tryatomic: atomic with escape hatch
    - tryatomic {S} catch (TxnFailed e) {...}
  - When: conditionally atomic region
    - when (condition) {S}
- Builds on prior research
  - Concurrent Haskell, CAML, CILK, Java
  - HPCS languages: Fortress, Chapel, X10
9. Transactional Java → Java
- Transactional Java
    atomic {
        S;
    }
- Standard Java + STM API
    while (true) {
        TxnHandle th = txnStart();
        try {
            S;
            break;
        } finally {
            if (!txnCommit(th))
                continue;
        }
    }
- STM API
  - txnStartNested
  - txnCommitNested
  - txnAbortNested
  - txnUserRetry
  - ...
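The retry loop on this slide relies on a Java subtlety: a `continue` inside `finally` discards the pending `break`, so a failed commit re-executes the block. The sketch below makes that observable. `ToyTxnRuntime` is a hypothetical stand-in whose commit simply fails a fixed number of times; it is not the STM API from the talk, and it performs no rollback.

```java
// Hypothetical stand-in for the STM runtime: commit fails a fixed number of
// times so the retry loop can be observed. It does NOT undo side effects.
final class ToyTxnRuntime {
    static final class TxnHandle {
        final int attempt;
        TxnHandle(int attempt) { this.attempt = attempt; }
    }

    private final int failuresBeforeSuccess;
    private int attempts = 0;

    ToyTxnRuntime(int failuresBeforeSuccess) {
        this.failuresBeforeSuccess = failuresBeforeSuccess;
    }

    TxnHandle txnStart() {
        attempts++;
        return new TxnHandle(attempts);
    }

    // Pretend that validation fails for the first few attempts.
    boolean txnCommit(TxnHandle th) {
        return th.attempt > failuresBeforeSuccess;
    }

    int attempts() { return attempts; }
}

final class AtomicLoweringSketch {
    // Same loop shape as the slide; returns how many times the body ran.
    static int runAtomic(ToyTxnRuntime rt, Runnable body) {
        int executions = 0;
        while (true) {
            ToyTxnRuntime.TxnHandle th = rt.txnStart();
            try {
                executions++;
                body.run();
                break;
            } finally {
                // 'continue' here cancels the pending 'break' and retries.
                if (!rt.txnCommit(th))
                    continue;
            }
        }
        return executions;
    }
}
```

With `new ToyTxnRuntime(2)` the body runs three times before the loop exits. A real STM would undo the effects of the two failed attempts; this toy does not.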
10. JVM STM support
- On-demand cloning of methods called inside transactions
- Garbage collection support
  - Enumeration of refs in read set, write set & undo log
- Extra transaction record field in each object
  - Supports both word & object granularity
- Native method invocation throws exception inside transaction
  - Some intrinsic functions allowed
- Runtime STM API
  - Wrapper around McRT-STM API
  - Polyglot / StarJIT automatically generates calls to API
11. Background: McRT-STM
- STM for
  - C / C++ (PPoPP 2006)
  - Java (PLDI 2006)
- Writes
  - strict two-phase locking
  - update in place
  - undo on abort
- Reads
  - versioning
  - validation before commit
- Granularity per type
  - Object-level: small objects
  - Word-level: large arrays
- Benefits
  - Fast memory accesses (no buffering / object wrapping)
  - Minimal copying (no cloning for large objects)
  - Compatible with existing types & libraries
12. Ensuring Atomicity: Novel Combination
- Reads: optimistic concurrency
  - Caching effects
  - Avoids lock operations
- Writes: pessimistic concurrency
  - In-place updates
  - Fast commits
  - Fast reads
- Quantitative results in PPoPP'06
13. McRT-STM Example
    atomic {
        B = A + 5;
    }
compiles to:
    stmStart();
    temp = stmRd(A);
    stmWr(B, temp + 5);
    stmCommit();
- STM read & write barriers before accessing memory inside transactions
- STM tracks accesses & detects data conflicts
14. Transaction Record
- Pointer-sized record per object / word
- Two states
  - Shared (low bit is 1)
    - Read-only / multiple readers
    - Value is version number (odd)
  - Exclusive
    - Write-only / single owner
    - Value is thread transaction descriptor (4-byte aligned)
- Mapping
  - Object: slot in object
  - Field: hashed index into global record table
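The two record states can be told apart by the low bit alone, because version numbers are odd and descriptor pointers are 4-byte aligned. A minimal sketch of that encoding follows, modeling the pointer-sized word as a Java `long`. The class and method names are hypothetical, and stepping versions by 2 is an assumption that merely keeps them odd.

```java
// Illustrative encoding of a transaction record word. Names are hypothetical.
final class TxnRecordWord {
    private TxnRecordWord() {}

    // Shared state: low bit set, value is an odd version number.
    static boolean isShared(long word) {
        return (word & 1L) == 1L;
    }

    // Exclusive state: low bit clear, value is a descriptor address.
    static boolean isExclusive(long word) {
        return (word & 1L) == 0L;
    }

    // Assumption: versions advance by 2 so that they stay odd.
    static long nextVersion(long version) {
        if (!isShared(version)) {
            throw new IllegalArgumentException("not a version number");
        }
        return version + 2;
    }

    // A 4-byte-aligned descriptor address has its two low bits clear, so it
    // can never be mistaken for a version number.
    static long exclusiveFor(long descriptorAddress) {
        if ((descriptorAddress & 3L) != 0L) {
            throw new IllegalArgumentException("descriptor must be 4-byte aligned");
        }
        return descriptorAddress;
    }
}
```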
15. Transaction Record Example
- Every data item has an associated transaction record
- Object granularity: extra transaction record field in each object
    class Foo { int x; int y; }
- Word granularity: object words hash into a table of TxRs; hash is f(obj.hash, offset)
    class Foo { int x; int y; }
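For word granularity, the lookup could look roughly like the sketch below. The slide only says the hash is f(obj.hash, offset), so the mixing function, the power-of-two table size, and all names here are assumptions.

```java
import java.util.Arrays;

// Hypothetical global table of transaction records for word-level detection.
final class TxnRecordTable {
    private final long[] records;
    private final int mask;

    TxnRecordTable(int size) {
        if (size <= 0 || Integer.bitCount(size) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        records = new long[size];
        Arrays.fill(records, 1L); // every record starts shared, at odd version 1
        mask = size - 1;
    }

    // Assumed f(obj.hash, offset): a simple multiplicative mix, then a mask.
    int indexFor(int objHash, int offset) {
        int h = objHash * 31 + offset;
        h ^= (h >>> 16);
        return h & mask;
    }

    // Distinct words may share a record (false conflicts), which is safe
    // but can cause unnecessary aborts.
    long recordFor(Object obj, int offset) {
        return records[indexFor(System.identityHashCode(obj), offset)];
    }
}
```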
16. Transaction Descriptor
- Descriptor per thread
  - Info for version validation, lock release, undo on abort, ...
- Read and write set: <Ti, Ni>
  - Ti: transaction record
  - Ni: version number
- Undo log: <Ai, Oi, Vi, Ki>
  - Ai: field / element address
  - Oi: containing object (or null for static)
  - Vi: original value
  - Ki: type tag (for garbage collection)
- In atomic region
  - Read operation appends read set
  - Write operation appends write set and undo log
  - GC enumerates read/write/undo logs
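A rough sketch of how the undo log supports in-place updates follows. Java has no field addresses, so an `int[]` plus an index stands in for <Ai, Oi>, and the type tag Ki is omitted. All names are hypothetical, and locking is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical undo log: writes go straight to memory, and the old value is
// recorded so that an abort can restore it.
final class UndoLogSketch {
    private static final class Entry {
        final int[] container; // stands in for Oi, the containing object
        final int index;       // stands in for Ai, the field / element address
        final int oldValue;    // Vi, the original value

        Entry(int[] container, int index, int oldValue) {
            this.container = container;
            this.index = index;
            this.oldValue = oldValue;
        }
    }

    private final Deque<Entry> undoLog = new ArrayDeque<>();

    // Log the old value first, then update in place.
    void write(int[] container, int index, int newValue) {
        undoLog.push(new Entry(container, index, container[index]));
        container[index] = newValue;
    }

    // Restore in reverse order so the oldest value of each slot wins.
    void abort() {
        while (!undoLog.isEmpty()) {
            Entry e = undoLog.pop();
            e.container[e.index] = e.oldValue;
        }
    }

    // Updates are already in place, so commit only discards the log.
    void commit() {
        undoLog.clear();
    }

    int pendingEntries() {
        return undoLog.size();
    }
}
```

Because updates are already in place, commit does no copying; only the abort path pays for restoring values.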
17. McRT-STM Example
    class Foo { int x; int y; }
    Foo bar, foo;
T2:
    atomic {
        t1 = bar.x;
        t2 = bar.y;
    }
- T1 copies foo into bar
- T2 reads bar, but should not see intermediate values
18. McRT-STM Example
T2:
    stmStart();
    t1 = stmRd(bar.x);
    t2 = stmRd(bar.y);
    stmCommit();
- T1 copies foo into bar
- T2 reads bar, but should not see intermediate values
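19. McRT-STM Example
[Animation: T1 copies foo into bar while T2 reads bar]
- Objects, each with a transaction record in its header (hdr)
  - foo: x = 9, y = 7; record version 3
  - bar: x = 0, y = 0; record version 5, later 7
- T2:
    stmStart();
    t1 = stmRd(bar.x);
    t2 = stmRd(bar.y);
    stmCommit();
- T1 logs
  - Reads: <foo, 3>
  - Writes: <bar, 5>
  - Undo: <bar.x, 0>, <bar.y, 0>
- T2 logs
  - Reads: <bar, 5>, later <bar, 7>
- Events shown: T2 waits, Abort, Commit
- T2 should read (0, 0) or (9, 7)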
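The version numbers in this example (bar at 5, then 7 once the writer commits) suggest how read validation catches the conflict. Below is a single-threaded sketch of that check. It models object-level versioning only, leaves out locking and waiting, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical optimistic reader: record the version seen at each read and
// re-check every recorded version before commit.
final class ReadValidationSketch {
    static final class Versioned {
        long version;
        int x;
        int y;

        Versioned(long version, int x, int y) {
            this.version = version;
            this.x = x;
            this.y = y;
        }
    }

    private static final class ReadEntry {
        final Versioned object;
        final long seenVersion;

        ReadEntry(Versioned object, long seenVersion) {
            this.object = object;
            this.seenVersion = seenVersion;
        }
    }

    private final List<ReadEntry> readSet = new ArrayList<>();

    int readX(Versioned o) { log(o); return o.x; }
    int readY(Versioned o) { log(o); return o.y; }

    private void log(Versioned o) {
        readSet.add(new ReadEntry(o, o.version));
    }

    // The transaction may commit only if nothing it read has changed since.
    boolean validate() {
        for (ReadEntry e : readSet) {
            if (e.object.version != e.seenVersion) return false;
        }
        return true;
    }

    // Stand-in for a committing writer: update in place, then bump the version.
    static void commitWrite(Versioned o, int x, int y) {
        o.x = x;
        o.y = y;
        o.version += 2;
    }
}
```

If the reader logs <bar, 5>, the writer commits, and the reader then logs <bar, 7>, validation fails and the reader must abort and re-execute. On retry it sees (9, 7) consistently.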
20. Early Results: Overhead breakdown
- Time breakdown on single processor
- STM read & validation overheads dominate
  - → Good optimization targets
21. System Overview
[Diagram: system pipeline; labeled components include Polyglot and StarJIT]
22. Leveraging the JIT
- StarJIT: high-performance dynamic compiler
  - Identifies transactional regions in Java+STM code
  - Differentiates top-level and nested transactions
  - Inserts read/write barriers in transactional code
  - Maps STM API to first-class opcodes in STIR
- Good compiler representation →
  - greater optimization opportunities
23. Representing Read/Write Barriers
Traditional barriers hide redundant locking/logging
- Source
    atomic {
        a.x = t1;
        a.y = t2;
        if (a.z != 0) {
            a.x = 0;
            a.z = t3;
        }
    }
- With traditional barriers
    stmWr(a.x, t1);
    stmWr(a.y, t2);
    if (stmRd(a.z) != 0) {
        stmWr(a.x, 0);
        stmWr(a.z, t3);
    }
24. An STM IR for Optimization
Redundancies exposed
- Source
    atomic {
        a.x = t1;
        a.y = t2;
        if (a.z != 0) {
            a.x = 0;
            a.z = t3;
        }
    }
- STM IR
    txnOpenForWrite(a);
    txnLogObjectInt(a.x, a);
    a.x = t1;
    txnOpenForWrite(a);
    txnLogObjectInt(a.y, a);
    a.y = t2;
    txnOpenForRead(a);
    if (a.z != 0) {
        txnOpenForWrite(a);
        txnLogObjectInt(a.x, a);
        a.x = 0;
        txnOpenForWrite(a);
        txnLogObjectInt(a.z, a);
        a.z = t3;
    }
25. Optimized Code
Fewer & cheaper STM operations
- Source
    atomic {
        a.x = t1;
        a.y = t2;
        if (a.z != 0) {
            a.x = 0;
            a.z = t3;
        }
    }
- Optimized STM IR
    txnOpenForWrite(a);
    txnLogObjectInt(a.x, a);
    a.x = t1;
    txnLogObjectInt(a.y, a);
    a.y = t2;
    if (a.z != 0) {
        a.x = 0;
        txnLogObjectInt(a.z, a);
        a.z = t3;
    }
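The effect of the optimization can be imitated with a toy pass over a flattened list of IR operations. This is only an illustration: the real compiler works on a control-flow graph, whereas this sketch ignores branches. That is acceptable for this example only because the first open and log operations dominate everything after them. The string-based IR and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy redundancy elimination over straight-line STM operations.
final class BarrierDedupSketch {
    private static String argsOf(String op) {
        return op.substring(op.indexOf('(') + 1, op.lastIndexOf(')'));
    }

    static List<String> eliminateRedundant(List<String> ops) {
        Set<String> openForWrite = new HashSet<>();
        Set<String> openForRead = new HashSet<>();
        Set<String> logged = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String op : ops) {
            if (op.startsWith("txnOpenForWrite(")) {
                // Already open for write: the second open is redundant.
                if (!openForWrite.add(argsOf(op))) continue;
            } else if (op.startsWith("txnOpenForRead(")) {
                // Open-for-write subsumes open-for-read.
                String obj = argsOf(op);
                if (openForWrite.contains(obj) || !openForRead.add(obj)) continue;
            } else if (op.startsWith("txnLogObjectInt(")) {
                // The original value of a field needs to be logged only once.
                if (!logged.add(argsOf(op))) continue;
            }
            out.add(op); // plain loads and stores pass through unchanged
        }
        return out;
    }
}
```

Applied to the operations of the unoptimized IR in order, the pass keeps one `txnOpenForWrite(a)` and one log per field, matching the shape of the optimized code above.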
26. Compiler Optimizations for Transactions
- Standard optimizations
  - CSE, dead-code elimination, ...
  - Careful IR representation exposes opportunities and enables optimizations with almost no modifications
  - Subtle in presence of nesting
- STM-specific optimizations
  - Immutable field / class detection & barrier removal (vtable/String)
  - Transaction-local object detection & barrier removal
  - Partial inlining of STM fast paths to eliminate call overhead
27. Experiments
- 16-way 2.2 GHz Xeon with 16 GB shared memory
  - L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)
- Workloads
  - Hashtable, binary tree, OO7 (OODBMS)
    - Mix of gets, in-place updates, insertions, and removals
  - Object-level conflict detection by default
    - Word / mixed where beneficial
28. Effect of Compiler Optimizations
- 1P overheads over thread-unsafe baseline
- Prior STMs typically incur 2x on 1P
- With compiler optimizations
  - < 40% over no concurrency control
  - < 30% over synchronization
29. Scalability: Java HashMap Shootout
- Unsafe (java.util.HashMap)
  - Thread-unsafe w/o concurrency control
- Synchronized
  - Coarse-grain synchronization via SynchronizedMap wrapper
- Concurrent (java.util.concurrent.ConcurrentHashMap)
  - Multi-year effort: JSR 166 -> Java 5
  - Optimized for concurrent gets (no locking)
  - For updates, divides bucket array into 16 segments (size / locking)
- Atomic
  - Transactional version via AtomicMap wrapper
- Atomic Prime
  - Transactional version with minor hand optimization
  - Tracks size per segment à la ConcurrentHashMap
- Execution
  - 10,000,000 operations / 200,000 elements
  - Defaults: load factor, threshold, concurrency level
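The three non-transactional baselines can be built from standard library classes, as in the sketch below. The workload driver is a guess at the general shape of such a test (seeded random keys, a configurable percentage of gets, the rest split between puts and removes). It is not the harness used in the talk, and the Atomic variants cannot be expressed in standard Java.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical single-threaded driver over the three standard baselines.
final class MapShootoutSketch {
    static Map<Integer, Integer> unsafe() {
        return new HashMap<>(); // no concurrency control
    }

    static Map<Integer, Integer> synchronizedMap() {
        // Coarse-grain: one lock around every operation.
        return Collections.synchronizedMap(new HashMap<Integer, Integer>());
    }

    static Map<Integer, Integer> concurrent() {
        return new ConcurrentHashMap<>(); // fine-grain, hand-tuned
    }

    // Assumed operation mix: getPercent% gets, the rest half puts, half removes.
    static void runMix(Map<Integer, Integer> map, int operations,
                       int keySpace, int getPercent, long seed) {
        Random rnd = new Random(seed);
        for (int i = 0; i < operations; i++) {
            int key = rnd.nextInt(keySpace);
            int roll = rnd.nextInt(100);
            if (roll < getPercent) {
                map.get(key);
            } else if ((roll & 1) == 0) {
                map.put(key, i);
            } else {
                map.remove(key);
            }
        }
    }
}
```

Run from one thread with the same seed, all three maps end with identical contents; the differences the talk measures only appear when several threads share one map.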
30. Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales
31. Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments
Atomic still scales
32. 20% Inserts and Removes
Atomic conflicts on entire bucket array
- The array is an object
33. 20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap
34. 20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size per segment, lowering the bottleneck
No degradation, modest performance gain
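The idea of tracking size per segment can be sketched as a striped counter: each insert or remove touches only the counter for its key's segment, so updates to different segments no longer collide on one shared field. The 16 segments follow the ConcurrentHashMap default mentioned earlier; the segment hash and all names are assumptions, and this is not the hand-optimized code from the talk.

```java
// Hypothetical per-segment size tracking. Not thread-safe on its own; under
// an STM with word-level detection for arrays, each element would be a
// separate conflict location.
final class SegmentedSizeSketch {
    private static final int SEGMENTS = 16;
    private final int[] segmentSizes = new int[SEGMENTS];

    // Assumed segment selection: mix the hash, then take the low 4 bits.
    private static int segmentFor(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return h & (SEGMENTS - 1);
    }

    void onInsert(Object key) {
        segmentSizes[segmentFor(key)]++;
    }

    void onRemove(Object key) {
        segmentSizes[segmentFor(key)]--;
    }

    // The total is computed on demand by summing all segments.
    int size() {
        int total = 0;
        for (int s : segmentSizes) total += s;
        return total;
    }

    int nonEmptySegments() {
        int count = 0;
        for (int s : segmentSizes) if (s != 0) count++;
        return count;
    }
}
```

The trade-off is that `size()` now reads every segment counter, so it becomes more expensive and conflicts with any concurrent update.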
35. 20% Inserts and Removes: Mixed-Level
- Mixed-level preserves wins & reduces overheads
  - word-level for arrays
  - object-level for non-arrays
36. Scalability: java.util.TreeMap
[Charts: 100% Gets; 80% Gets]
Results similar to HashMap
37. Scalability: OO7, 80% Reads
Operations & traversal over synthetic database
Coarse atomic is competitive with medium-grain synchronization
38. Key Takeaways
- Optimistic reads + pessimistic writes is a nice sweet spot
- Compiler optimizations significantly reduce STM overhead
  - 20-40% over thread-unsafe
  - 10-30% over synchronized
- Simple atomic wrappers sometimes good enough
- Minor modifications give competitive performance to complex fine-grain synchronization
- Word-level contention is crucial for large arrays
- Mixed contention provides best of both
39. Research challenges
- Performance
  - Compiler optimizations
  - Hardware support
  - Dealing with contention
- Semantics
  - I/O & communication
  - Strong atomicity
  - Nested parallelism
  - Open transactions
- Debugging & performance analysis tools
- System integration
40. Conclusions
- Rich transactional language constructs in Java
- Efficient, first-class nested transactions
- RISC-like STM API
- Compiler optimizations
- Per-type word- and object-level conflict detection
- Complete GC support