Compiler and Runtime Support for Efficient Software Transactional Memory

1
Compiler and Runtime Support for
Efficient Software Transactional Memory

Vijay Menon

Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R.
Murphy, Bratin Saha, Tatiana Shpeisman
2
Motivation

Multi-core architectures are mainstream
Software concurrency needed for scalability
Concurrent programming is hard
Difficult to reason about shared data
Traditional mechanism: lock-based synchronization
Hard to use
Must be fine-grain for scalability
Deadlocks
Not easily composable
New solution: Transactional Memory (TM)
Simpler programming model: atomicity, consistency, isolation
No deadlocks
Composability
Optimistic concurrency
Analogy
GC : memory allocation :: TM : mutual exclusion

3
Composability

class Bank {
  ConcurrentHashMap<String, Integer> accounts;
  void deposit(String name, int amount) {
    synchronized (accounts) {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
}
Thread-safe but no scaling
ConcurrentHashMap (Java 5/JSR 166) does not help
Performance requires redesign from scratch and fine-grain locking
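
The point of this slide can be checked with a minimal sketch. The class `BankSketch` and its helper `hammer` are hypothetical names introduced here, not part of the talk. Each `get` and `put` on a `ConcurrentHashMap` is thread-safe on its own, but the read-modify-write in `deposit` still needs the outer lock, so every deposit serializes on it.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration class; names are not from the talk.
final class BankSketch {
    private final ConcurrentHashMap<String, Integer> accounts = new ConcurrentHashMap<>();

    // Coarse lock: makes get + put one indivisible step, but serializes all deposits.
    void deposit(String name, int amount) {
        synchronized (accounts) {
            int balance = accounts.getOrDefault(name, 0); // Get the current balance
            balance = balance + amount;                    // Increment it
            accounts.put(name, balance);                   // Set the new balance
        }
    }

    int balanceOf(String name) {
        return accounts.getOrDefault(name, 0);
    }

    // Runs `threads` workers that each make `perThread` unit deposits to one account.
    static int hammer(int threads, int perThread) {
        BankSketch bank = new BankSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    bank.deposit("alice", 1);
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return bank.balanceOf("alice");
    }
}
```

With the lock in place no deposit is lost, whatever the thread count. Removing the `synchronized` block keeps each map call safe but allows two threads to read the same balance and overwrite each other's update.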

4
Transactional solution

class Bank {
  HashMap<String, Integer> accounts;
  void deposit(String name, int amount) {
    atomic {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
}
Underlying system provides:
isolation (thread safety)
optimistic concurrency

5
Transactions are Composable
Scalability on 16-way 2.2 GHz Xeon System
6
Our System

A Java Software Transactional Memory (STM) System
Pure software implementation
Language extensions in Java
Integrated with JVM JIT
Novel Features
Rich transactional language constructs in Java
Efficient, first class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word and object level conflict detection
Complete GC support

7
System Overview
Polyglot
StarJIT
8
Transactional Java

Java + new language constructs
Atomic: execute block atomically
  atomic {S}
Retry: block until alternate path possible
  atomic {... retry; ...}
Orelse: compose alternate atomic blocks
  atomic {S1} orelse {S2} ... orelse {Sn}
Tryatomic: atomic with escape hatch
  tryatomic {S} catch (TxnFailed e) {...}
When: conditionally atomic region
  when (condition) {S}
Builds on prior research
Concurrent Haskell, CAML, CILK, Java
HPCS languages: Fortress, Chapel, X10

9
Transactional Java → Java

Transactional Java
  atomic {
    S;
  }

Standard Java + STM API
  while (true) {
    TxnHandle th = txnStart();
    try {
      S;
      break;
    } finally {
      if (!txnCommit(th))
        continue;
    }
  }

STM API
txnStartNested
txnCommitNested
txnAbortNested
txnUserRetry
...
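
The retry loop on this slide can be exercised with a toy runtime. This is only a sketch: `ToyStm` and `AtomicLowering` are hypothetical names, the handle is a plain `Object` rather than the real `TxnHandle`, and commit failures are injected artificially instead of coming from conflict detection. Only the loop shape follows the slide.

```java
// Toy stand-in for the STM runtime; only the call shape follows the slide.
final class ToyStm {
    private int failuresLeft;

    ToyStm(int failuresToInject) {
        this.failuresLeft = failuresToInject;
    }

    Object txnStart() {
        return new Object(); // placeholder handle
    }

    // Fails the first `failuresToInject` commits, then succeeds.
    boolean txnCommit(Object handle) {
        if (failuresLeft > 0) {
            failuresLeft--;
            return false;
        }
        return true;
    }
}

final class AtomicLowering {
    // Runs `body` as the slide's lowered atomic block.
    // Returns how many times the body executed before a commit succeeded.
    static int runAtomic(ToyStm stm, Runnable body) {
        int executions = 0;
        while (true) {
            Object th = stm.txnStart();
            try {
                body.run();
                executions++;
                break;
            } finally {
                // A failed commit overrides the pending break and re-runs the loop.
                if (!stm.txnCommit(th)) {
                    continue;
                }
            }
        }
        return executions;
    }
}
```

The `continue` inside `finally` is what turns a failed commit into a re-execution: it discards the pending `break`. The toy does not roll back side effects of the body, which a real STM would do through its undo log.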

10
JVM STM support

On-demand cloning of methods called inside transactions
Garbage collection support
Enumeration of refs in read set, write set & undo log
Extra transaction record field in each object
Supports both word & object granularity
Native method invocation throws exception inside transaction
Some intrinsic functions allowed
Runtime STM API
Wrapper around McRT-STM API
Polyglot / StarJIT automatically generates calls to API

11
Background: McRT-STM

STM for
C / C++ (PPoPP 2006)
Java (PLDI 2006)
Writes
strict two-phase locking
update in place
undo on abort
Reads
versioning
validation before commit
Granularity per type
Object-level: small objects
Word-level: large arrays
Benefits
Fast memory accesses (no buffering / object wrapping)
Minimal copying (no cloning for large objects)
Compatible with existing types & libraries

12
Ensuring Atomicity: Novel Combination
[Table: memory ops (reads, writes) against mode (pessimistic concurrency, optimistic concurrency)]
Noted benefits: in-place updates, fast commits, fast reads; caching effects, avoids lock operations
Quantitative results in PPoPP'06
13
McRT-STM Example
atomic { B = A + 5; }
stmStart(); temp = stmRd(A); stmWr(B, temp + 5); stmCommit();

STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts

14
Transaction Record

Pointer-sized record per object / word
Two states
Shared (low bit is 1)
Read-only / multiple readers
Value is version number (odd)
Exclusive
Write-only / single owner
Value is thread transaction descriptor (4-byte aligned)
Mapping
Object: slot in object
Field: hashed index into global record table
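
The two states on this slide can be told apart by the low bit alone, because a 4-byte-aligned descriptor address always has its low bits clear. The sketch below models the record as a `long`; the class name `TxnRecordSketch` and the `nextVersion` rule (add 2 to stay odd) are assumptions made for illustration, not the McRT-STM implementation.

```java
// Hypothetical model of a transaction record word; not the McRT-STM code.
final class TxnRecordSketch {
    // Shared state: low bit set, so the word is an odd version number.
    static boolean isShared(long word) {
        return (word & 1L) == 1L;
    }

    // Exclusive state: the word is an aligned descriptor address, low bit clear.
    static boolean isExclusive(long word) {
        return !isShared(word);
    }

    // Assumed release rule: publish the next odd number as the new version.
    static long nextVersion(long version) {
        if (!isShared(version)) {
            throw new IllegalArgumentException("not a version number: " + version);
        }
        return version + 2;
    }
}
```

A reader that sees an odd word can log it as the observed version; a reader that sees an even word knows a writer owns the data.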

15
Transaction Record Example

Every data item has an associated transaction record

Object granularity
  class Foo { int x; int y; }
  Extra transaction record field
Word granularity
  class Foo { int x; int y; }
  Object words hash into table of TxRs
  Hash is f(obj.hash, offset)
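
For word granularity, the slide only says the slot is some function f(obj.hash, offset). The sketch below picks one arbitrary f to make the idea concrete; the table size, the multiplier 31, and the names `TxnRecordTable` and `slotFor` are all assumptions, not the function used by the system described in the talk.

```java
// Hypothetical mapping from (object hash, field offset) to a global record slot.
final class TxnRecordTable {
    static final int TABLE_SIZE = 1024; // assumed size

    // One possible f(obj.hash, offset); floorMod keeps the index non-negative.
    static int slotFor(int objHash, int fieldOffset) {
        return Math.floorMod(objHash * 31 + fieldOffset, TABLE_SIZE);
    }
}
```

Different fields of the same object usually land in different slots, so writers to `x` and `y` need not conflict. Unrelated words can still collide on one slot, which produces false conflicts but not incorrect results.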
16
Transaction Descriptor

Descriptor per thread
Info for version validation, lock release, undo on abort, ...
Read and write set: <Ti, Ni>
Ti: transaction record
Ni: version number
Undo log: <Ai, Oi, Vi, Ki>
Ai: field / element address
Oi: containing object (or null for static)
Vi: original value
Ki: type tag (for garbage collection)
In atomic region
Read operation appends read set
Write operation appends write set and undo log
GC enumerates read/write/undo logs
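
The undo log is what makes in-place updates safe to abort. The following sketch keeps only the address and original value from the slide's tuple and omits the containing object, type tag, locking, and read/write sets. The class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mutable location standing in for a field or array element.
final class IntCell {
    int value;

    IntCell(int value) {
        this.value = value;
    }
}

// One undo-log entry: where the write went and what was there before.
final class UndoEntry {
    final IntCell target;
    final int original;

    UndoEntry(IntCell target, int original) {
        this.target = target;
        this.original = original;
    }
}

// Simplified per-thread descriptor holding only the undo log.
final class TxnDescriptorSketch {
    private final List<UndoEntry> undoLog = new ArrayList<>();

    // Write barrier: log the old value, then update in place.
    void write(IntCell cell, int newValue) {
        undoLog.add(new UndoEntry(cell, cell.value));
        cell.value = newValue;
    }

    // Abort: restore old values, newest entry first.
    void abort() {
        for (int i = undoLog.size() - 1; i >= 0; i--) {
            UndoEntry e = undoLog.get(i);
            e.target.value = e.original;
        }
        undoLog.clear();
    }

    // Commit: the in-place values stand, so the log is simply dropped.
    void commit() {
        undoLog.clear();
    }

    int undoLogSize() {
        return undoLog.size();
    }
}
```

Replaying in reverse matters when one location is written more than once: the oldest entry holds the pre-transaction value and must be applied last.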

17
McRT-STM Example
class Foo { int x; int y; }
Foo bar, foo;
T2
atomic { t1 = bar.x; t2 = bar.y; }

T1 copies foo into bar
T2 reads bar, but should not see intermediate values

18
McRT-STM Example
T2
stmStart(); t1 = stmRd(bar.x); t2 = stmRd(bar.y); stmCommit();

T1 copies foo into bar
T2 reads bar, but should not see intermediate values
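
A reader such as T2 avoids seeing intermediate values through versioning plus validation before commit. The sketch below shows just that check. `VersionedObject` and `ReadSetSketch` are hypothetical names, the starting version 5 is arbitrary, and the exclusive (owned) state, in which a reader would wait, is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical object carrying a transaction record in its header.
final class VersionedObject {
    final AtomicLong txnRecord = new AtomicLong(5); // odd value = shared version
    int x;
    int y;
}

// Simplified read set: pairs of (record, version seen at first read).
final class ReadSetSketch {
    private final List<VersionedObject> objects = new ArrayList<>();
    private final List<Long> versions = new ArrayList<>();

    // Read barrier: remember which version of the record was observed.
    void openForRead(VersionedObject o) {
        objects.add(o);
        versions.add(o.txnRecord.get());
    }

    // Commit-time validation: every record must still hold the logged version.
    boolean validate() {
        for (int i = 0; i < objects.size(); i++) {
            if (objects.get(i).txnRecord.get() != versions.get(i)) {
                return false;
            }
        }
        return true;
    }
}
```

If a writer commits a new version of `bar` between T2's read and T2's commit, validation fails and T2 re-executes, so it can only commit after reading one consistent version.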

19
McRT-STM Example
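[Animated figure: T1 copies foo into bar while T2 reads bar]
Objects: foo (hdr, x = 9, y = 7), bar (hdr, x = 0, y = 0)
Version numbers shown: 3 (foo); 5 and 7 (bar)
T2
stmStart(); t1 = stmRd(bar.x); t2 = stmRd(bar.y); stmCommit();
Read sets shown: <foo, 3>, <bar, 5>, <bar, 7>
Write set shown: <bar, 5>
Undo log shown: <bar.x, 0>, <bar.y, 0>
Step labels: T2 waits, Abort, Commit

T2 should read either [0, 0] or [9, 7]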
20
Early Results: Overhead Breakdown

Time breakdown on single processor
STM read & validation overheads dominate
→ Good optimization targets

21
System Overview
Polyglot
StarJIT
22
Leveraging the JIT

StarJIT: high-performance dynamic compiler
Identifies transactional regions in Java+STM code
Differentiates top-level and nested transactions
Inserts read/write barriers in transactional code
Maps STM API to first-class opcodes in STIR
Good compiler representation → greater optimization opportunities

23
Representing Read/Write Barriers
Traditional barriers hide redundant locking/logging

stmWr(a.x, t1);
stmWr(a.y, t2);
if (stmRd(a.z) != 0) {
  stmWr(a.x, 0);
  stmWr(a.z, t3);
}

atomic {
  a.x = t1;
  a.y = t2;
  if (a.z != 0) {
    a.x = 0;
    a.z = t3;
  }
}

24
An STM IR for Optimization

txnOpenForWrite(a);
txnLogObjectInt(a.x, a);
a.x = t1;
txnOpenForWrite(a);
txnLogObjectInt(a.y, a);
a.y = t2;
txnOpenForRead(a);
if (a.z != 0) {
  txnOpenForWrite(a);
  txnLogObjectInt(a.x, a);
  a.x = 0;
  txnOpenForWrite(a);
  txnLogObjectInt(a.z, a);
  a.z = t3;
}

Redundancies exposed
atomic {
  a.x = t1;
  a.y = t2;
  if (a.z != 0) {
    a.x = 0;
    a.z = t3;
  }
}

25
Optimized Code
Fewer & cheaper STM operations

atomic {
  a.x = t1;
  a.y = t2;
  if (a.z != 0) {
    a.x = 0;
    a.z = t3;
  }
}

txnOpenForWrite(a);
txnLogObjectInt(a.x, a);
a.x = t1;
txnLogObjectInt(a.y, a);
a.y = t2;
if (a.z != 0) {
  a.x = 0;
  txnLogObjectInt(a.z, a);
  a.z = t3;
}
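
The step from the exposed IR to the optimized code can be imitated on a flat list of operations. This is a deliberately naive sketch: it treats the code as straight-line and ignores the `if`, whereas a real JIT such as StarJIT would rely on its existing analyses (for example common-subexpression elimination over dominating barriers). The class `BarrierElimSketch` and the string encoding of operations are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical straight-line barrier eliminator over textual IR operations.
final class BarrierElimSketch {
    private static final String OPEN_W = "txnOpenForWrite(";
    private static final String OPEN_R = "txnOpenForRead(";
    private static final String LOG = "txnLogObjectInt(";

    static List<String> eliminate(List<String> ops) {
        Set<String> emitted = new HashSet<>();       // barriers already kept
        Set<String> ownedForWrite = new HashSet<>(); // objects already opened for write
        List<String> out = new ArrayList<>();
        for (String op : ops) {
            boolean barrier = op.startsWith(OPEN_W) || op.startsWith(OPEN_R) || op.startsWith(LOG);
            if (!barrier) {
                out.add(op); // plain loads, stores and branches are kept as-is
                continue;
            }
            if (op.startsWith(OPEN_R)) {
                String obj = op.substring(OPEN_R.length(), op.length() - 1);
                if (ownedForWrite.contains(obj)) {
                    continue; // write ownership already covers reads
                }
            }
            if (!emitted.add(op)) {
                continue; // identical barrier was already executed earlier
            }
            if (op.startsWith(OPEN_W)) {
                ownedForWrite.add(op.substring(OPEN_W.length(), op.length() - 1));
            }
            out.add(op);
        }
        return out;
    }
}
```

Fed the thirteen barrier and store operations from the IR slide (without the branch), this keeps one `txnOpenForWrite(a)` and one log call per field, which matches the barrier count in the optimized code.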

26
Compiler Optimizations for Transactions

Standard optimizations
CSE, dead-code elimination, ...
Careful IR representation exposes opportunities and enables optimizations with almost no modifications
Subtle in presence of nesting
STM-specific optimizations
Immutable field / class detection & barrier removal (vtable/String)
Transaction-local object detection & barrier removal
Partial inlining of STM fast paths to eliminate call overhead

27
Experiments

16-way 2.2 GHz Xeon with 16 GB shared memory
L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)
Workloads
Hashtable, Binary tree, OO7 (OODBMS)
Mix of gets, in-place updates, insertions, and removals
Object-level conflict detection by default
Word / mixed where beneficial

28
Effect of Compiler Optimizations

1P overheads over thread-unsafe baseline

Prior STMs typically incur 2x on 1P
With compiler optimizations
- < 40% over no concurrency control
- < 30% over synchronization
29
Scalability: Java HashMap Shootout

Unsafe (java.util.HashMap)
Thread-unsafe w/o concurrency control
Synchronized
Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
Multi-year effort: JSR 166 -> Java 5
Optimized for concurrent gets (no locking)
For updates, divides bucket array into 16 segments (size / locking)
Atomic
Transactional version via AtomicMap wrapper
Atomic Prime
Transactional version with minor hand optimization
Tracks size per segment à la ConcurrentHashMap
Execution
10,000,000 operations / 200,000 elements
Defaults: load factor, threshold, concurrency level

30
Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales
31
Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments
Atomic still scales
32
20% Inserts and Removes
Atomic conflicts on entire bucket array: the array is an object
33
20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap
34
20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size per segment, lowering the bottleneck
No degradation, modest performance gain
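
Tracking size per segment removes the single hot `size` field that every insert and remove would otherwise touch. A minimal sketch of the idea follows. `SegmentedSize` is a hypothetical class, the 16 segments mirror the ConcurrentHashMap default mentioned in the talk, and `AtomicIntegerArray` stands in for fields that the transactional version would update inside an atomic block.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical per-segment size counter; not the talk's Atomic Prime code.
final class SegmentedSize {
    private static final int SEGMENTS = 16; // power of two, so masking selects a segment
    private final AtomicIntegerArray counts = new AtomicIntegerArray(SEGMENTS);

    // Inserts touch only the counter of the key's segment.
    void entryAdded(int keyHash) {
        counts.incrementAndGet(keyHash & (SEGMENTS - 1));
    }

    void entryRemoved(int keyHash) {
        counts.decrementAndGet(keyHash & (SEGMENTS - 1));
    }

    // The total is computed on demand by summing all segments.
    int size() {
        int total = 0;
        for (int i = 0; i < SEGMENTS; i++) {
            total += counts.get(i);
        }
        return total;
    }
}
```

Updates to keys in different segments no longer write the same word, so they stop conflicting with each other. The cost moves to `size()`, which now reads every segment.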
35
20% Inserts and Removes: Mixed-Level

Mixed-level preserves wins & reduces overheads
word-level for arrays
object-level for non-arrays
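
The mixed-level rule stated here is a per-type decision, which can be sketched as a one-line policy. `GranularityPolicy` and its `Level` enum are hypothetical names; how the real system records or overrides the choice per type is not described in the talk.

```java
// Hypothetical per-type choice of conflict-detection granularity.
final class GranularityPolicy {
    enum Level { WORD, OBJECT }

    // Arrays get word-level records; every other type gets one record per object.
    static Level levelFor(Class<?> type) {
        return type.isArray() ? Level.WORD : Level.OBJECT;
    }
}
```

Under this rule a hash map's bucket array is tracked per element, while its entries and the map object itself keep the cheaper object-level record.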

36
Scalability: java.util.TreeMap
100% Gets
80% Gets
Results similar to HashMap
37
Scalability: OO7, 80% Reads
Operations & traversal over synthetic database
Coarse atomic is competitive with medium-grain synchronization
38
Key Takeaways

Optimistic reads + pessimistic writes is a nice sweet spot
Compiler optimizations significantly reduce STM overhead
- 20-40% over thread-unsafe
- 10-30% over synchronized
Simple atomic wrappers sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level contention is crucial for large arrays
Mixed contention provides best of both

39
Research challenges