Transcript and Presenter's Notes

Title: Compiler and Runtime Support for Efficient Software Transactional Memory


1
Compiler and Runtime Support for
Efficient Software Transactional Memory
  • Vijay Menon

Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R.
Murphy, Bratin Saha, Tatiana Shpeisman
2
Motivation
  • Multi-core architectures are mainstream
  • Software concurrency needed for scalability
  • Concurrent programming is hard
  • Difficult to reason about shared data
  • Traditional mechanism: lock-based synchronization
  • Hard to use
  • Must be fine-grain for scalability
  • Deadlocks
  • Not easily composable
  • New solution: Transactional Memory (TM)
  • Simpler programming model: Atomicity,
    Consistency, Isolation
  • No deadlocks
  • Composability
  • Optimistic concurrency
  • Analogy
  • GC : memory allocation :: TM : mutual exclusion

3
Composability
  class Bank {
    ConcurrentHashMap accounts;
    void deposit(String name, int amount) {
      synchronized (accounts) {
        int balance = accounts.get(name);   // Get the current balance
        balance = balance + amount;         // Increment it
        accounts.put(name, balance);        // Set the new balance
      }
    }
  }
  • Thread-safe but no scaling
  • ConcurrentHashMap (Java 5/JSR 166) does not help
  • Performance requires redesign from scratch:
    fine-grain locking

4
Transactional solution
  class Bank {
    HashMap accounts;
    void deposit(String name, int amount) {
      atomic {
        int balance = accounts.get(name);   // Get the current balance
        balance = balance + amount;         // Increment it
        accounts.put(name, balance);        // Set the new balance
      }
    }
  }
  • Underlying system provides
  • isolation (thread safety)
  • optimistic concurrency

5
Transactions are Composable
Scalability on 16-way 2.2 GHz Xeon System
6
Our System
  • A Java Software Transactional Memory (STM) System
  • Pure software implementation
  • Language extensions in Java
  • Integrated with JVM & JIT
  • Novel Features
  • Rich transactional language constructs in Java
  • Efficient, first class nested transactions
  • RISC-like STM API
  • Compiler optimizations
  • Per-type word and object level conflict detection
  • Complete GC support

7
System Overview
[System diagram: Polyglot, StarJIT]
8
Transactional Java
  • Java + new language constructs (examples sketched
    after this slide)
  • atomic: execute block atomically
  • atomic {S}
  • retry: block until alternate path possible
  • atomic {... retry ...}
  • orelse: compose alternate atomic blocks
  • atomic {S1} orelse {S2} ... orelse {Sn}
  • tryatomic: atomic with escape hatch
  • tryatomic {S} catch (TxnFailed e) {...}
  • when: conditionally atomic region
  • when (condition) {S}
  • Builds on prior research
  • Concurrent Haskell, CAML, CILK, Java
  • HPCS languages: Fortress, Chapel, X10
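For instance (an illustrative sketch only, not code from the talk; Item,
buffer, and backup are hypothetical), retry and orelse compose naturally
with atomic:

  Item take() {
    Item item;
    atomic {
      if (buffer.isEmpty()) retry;      // block until another transaction adds an item
      item = buffer.removeFirst();
    }
    return item;
  }

  Item takeFromEither() {
    Item item;
    atomic { item = buffer.take(); }    // try the primary buffer first...
    orelse { item = backup.take(); }    // ...fall back if the first block retries
    return item;
  }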

9
Transactional Java → Java
  • Transactional Java
      atomic {
        S;
      }
  • Standard Java + STM API
      while (true) {
        TxnHandle th = txnStart();
        try {
          S;
          break;
        } finally {
          if (!txnCommit(th))
            continue;
        }
      }
  • STM API
  • txnStartNested
  • txnCommitNested
  • txnAbortNested
  • txnUserRetry
  • ...

10
JVM STM support
  • On-demand cloning of methods called inside
    transactions
  • Garbage collection support
  • Enumeration of refs in read set, write set & undo
    log
  • Extra transaction record field in each object
  • Supports both word & object granularity
  • Native method invocation inside a transaction
    throws an exception
  • Some intrinsic functions allowed
  • Runtime STM API
  • Wrapper around McRT-STM API
  • Polyglot / StarJIT automatically generates calls
    to API

11
Background: McRT-STM
  • STM for
  • C / C++ (PPoPP 2006)
  • Java (PLDI 2006)
  • Writes
  • strict two-phase locking
  • update in place
  • undo on abort
  • Reads
  • versioning
  • validation before commit
  • Granularity: per type
  • Object-level: small objects
  • Word-level: large arrays
  • Benefits
  • Fast memory accesses (no buffering / object
    wrapping)
  • Minimal copying (no cloning for large objects)
  • Compatible with existing types & libraries
    (protocol sketch below)
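A minimal, single-threaded sketch of this protocol in Java. All names here
(ToySTM, Record, Txn) are hypothetical simplifications, not the McRT-STM
API; the real system uses atomic operations on raw transaction-record words
and is far more involved. The even/odd encoding of the record word
anticipates the transaction-record layout described on the later slides.

  final class ToySTM {
    // Transaction record word: odd value = version (shared), even value = owner id (exclusive).
    static final class Record { long word = 1; }

    static final class Txn {
      final long id;                                            // even number, stands in for a descriptor pointer
      final java.util.Map<Record, Long> readSet = new java.util.HashMap<>();   // <record, version seen>
      final java.util.Map<Record, Long> locked  = new java.util.HashMap<>();   // <record, version before locking>
      final java.util.ArrayDeque<Runnable> undoLog = new java.util.ArrayDeque<>();
      Txn(long id) { this.id = id; }
    }

    // Read barrier: version the location, then read in place (no buffering).
    static int stmRd(Record rec, int[] slot, Txn txn) {
      txn.readSet.putIfAbsent(rec, rec.word);
      return slot[0];
    }

    // Write barrier: strict two-phase locking, undo logging, update in place.
    static void stmWr(Record rec, int[] slot, int value, Txn txn) {
      if (!txn.locked.containsKey(rec)) {
        if ((rec.word & 1) == 0) throw new IllegalStateException("conflict: locked by another txn");
        txn.locked.put(rec, rec.word);
        rec.word = txn.id;                                      // take exclusive ownership
      }
      final int old = slot[0];
      txn.undoLog.push(() -> { slot[0] = old; });               // remember old value for abort
      slot[0] = value;
    }

    // Commit: validate read versions, then bump versions on written records (releasing them).
    static boolean stmCommit(Txn txn) {
      for (java.util.Map.Entry<Record, Long> e : txn.readSet.entrySet()) {
        long seen = e.getValue();
        boolean unchanged = e.getKey().word == seen;
        boolean lockedByMe = Long.valueOf(seen).equals(txn.locked.get(e.getKey()));
        if (!unchanged && !lockedByMe) { stmAbort(txn); return false; }
      }
      for (java.util.Map.Entry<Record, Long> e : txn.locked.entrySet())
        e.getKey().word = e.getValue() + 2;                     // next odd version
      return true;
    }

    // Abort: roll back in-place updates, restore versions, release ownership.
    static void stmAbort(Txn txn) {
      while (!txn.undoLog.isEmpty()) txn.undoLog.pop().run();
      for (java.util.Map.Entry<Record, Long> e : txn.locked.entrySet())
        e.getKey().word = e.getValue();
    }
  }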

12
Ensuring Atomicity: Novel Combination
  • Writes: pessimistic concurrency
  • In-place updates, fast commits
  • Reads: optimistic concurrency
  • Fast reads, caching effects, avoids lock operations
Quantitative results in PPoPP 2006
13
McRT-STM Example
atomic { B = A + 5; }

stmStart();  temp = stmRd(A);  stmWr(B, temp + 5);  stmCommit();
  • STM read & write barriers before accessing memory
    inside transactions
  • STM tracks accesses & detects data conflicts

14
Transaction Record
  • Pointer-sized record per object / word
  • Two states
  • Shared (low bit is 1)
  • Read-only / multiple readers
  • Value is version number (odd)
  • Exclusive
  • Write-only / single owner
  • Value is the owning thread's transaction descriptor
    (4-byte aligned, so the low bit is 0; see the sketch below)
  • Mapping
  • Object: slot in object
  • Field: hashed index into global record table
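A hypothetical decoding of such a record word (a sketch of the two states
described above, not the actual McRT-STM layout):

  final class TxnRecordWord {
    static boolean isShared(long rec)    { return (rec & 1L) != 0; }  // odd: a version number
    static boolean isExclusive(long rec) { return (rec & 1L) == 0; }  // 4-byte-aligned descriptor pointer
    static long    version(long rec)     { return rec; }              // meaningful only when shared
  }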

15
Transaction Record Example
  • Every data item has an associated transaction
    record

[Figure: two layouts of class Foo { int x; int y; }]
  • Object granularity: extra transaction record field in the
    object
  • Word granularity: object words hash into a table of TxRs;
    hash is f(obj.hash, offset)
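A sketch of the word-granularity mapping (the table size and hash mixing
are illustrative assumptions; the slide only states that the index is
f(obj.hash, offset)):

  final class TxnRecordTable {
    static final long[] RECORDS = new long[1 << 20];                 // global table of TxRs

    static int recordIndex(Object obj, int fieldOffset) {
      int h = System.identityHashCode(obj) ^ (fieldOffset >>> 2);    // f(obj.hash, offset)
      return h & (RECORDS.length - 1);                               // power-of-two table
    }
  }
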
16
Transaction Descriptor
  • Descriptor per thread
  • Info for version validation, lock release, undo
    on abort, ...
  • Read and write set: <Ti, Ni>
  • Ti: transaction record
  • Ni: version number
  • Undo log: <Ai, Oi, Vi, Ki>
  • Ai: field / element address
  • Oi: containing object (or null for static)
  • Vi: original value
  • Ki: type tag (for garbage collection)
  • In an atomic region
  • Read operation appends to read set
  • Write operation appends to write set and undo log
  • GC enumerates read/write/undo logs
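A hypothetical Java rendering of the descriptor contents listed above (the
field types are illustrative; the real logs are native McRT-STM structures):

  final class TxnDescriptor {
    static final class ReadEntry { long record; long version; }                      // <Ti, Ni>
    static final class UndoEntry { long addr; Object obj; long value; int typeTag; } // <Ai, Oi, Vi, Ki>

    final java.util.List<ReadEntry> readSet  = new java.util.ArrayList<>();   // appended by reads
    final java.util.List<ReadEntry> writeSet = new java.util.ArrayList<>();   // appended by writes
    final java.util.List<UndoEntry> undoLog  = new java.util.ArrayList<>();   // appended by writes; GC walks all three
  }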

17
McRT-STM Example
class Foo { int x; int y; }    Foo bar, foo;

T2:  atomic { t1 = bar.x;  t2 = bar.y; }
  • T1 copies foo into bar
  • T2 reads bar, but should not see intermediate
    values

18
McRT-STM Example
T2:  stmStart();  t1 = stmRd(bar.x);  t2 = stmRd(bar.y);  stmCommit();
  • T1 copies foo into bar
  • T2 reads bar, but should not see intermediate
    values

19
McRT-STM Example
[Figure: T1 executes an atomic block copying foo (x = 9, y = 7) into bar
(initially x = 0, y = 0). T1's read set holds <foo, 3>, its write set
<bar, 5>, and its undo log <bar.x, 0>, <bar.y, 0>. On abort the undo log
restores bar and its record reverts to version 5; on commit bar's record
advances to version 7.]

T2:  stmStart();  t1 = stmRd(bar.x);  t2 = stmRd(bar.y);  stmCommit();
T2 waits while T1 holds bar's transaction record.
  • T2 should read (0, 0) or (9, 7), never an intermediate value
20
Early Results Overhead breakdown
  • Time breakdown on single processor
  • STM read & validation overheads dominate
  • → Good optimization targets

21
System Overview
[System diagram: Polyglot, StarJIT]
22
Leveraging the JIT
  • StarJIT: high-performance dynamic compiler
  • Identifies transactional regions in Java + STM code
  • Differentiates top-level and nested transactions
  • Inserts read/write barriers in transactional code
  • Maps STM API to first-class opcodes in STIR
  • Good compiler representation →
    greater optimization opportunities

23
Representing Read/Write Barriers
Traditional barriers hide redundant
locking/logging
  stmWr(a.x, t1)
  stmWr(a.y, t2)
  if (stmRd(a.z) != 0) {
    stmWr(a.x, 0)
    stmWr(a.z, t3)
  }

  atomic {
    a.x = t1
    a.y = t2
    if (a.z != 0) {
      a.x = 0
      a.z = t3
    }
  }

24
An STM IR for Optimization
  txnOpenForWrite(a)
  txnLogObjectInt(a.x, a)
  a.x = t1
  txnOpenForWrite(a)
  txnLogObjectInt(a.y, a)
  a.y = t2
  txnOpenForRead(a)
  if (a.z != 0) {
    txnOpenForWrite(a)
    txnLogObjectInt(a.x, a)
    a.x = 0
    txnOpenForWrite(a)
    txnLogObjectInt(a.z, a)
    a.z = t3
  }
  • Redundancies exposed

  atomic {
    a.x = t1
    a.y = t2
    if (a.z != 0) {
      a.x = 0
      a.z = t3
    }
  }

25
Optimized Code
Fewer & cheaper STM operations

  atomic {
    a.x = t1
    a.y = t2
    if (a.z != 0) {
      a.x = 0
      a.z = t3
    }
  }

  txnOpenForWrite(a)
  txnLogObjectInt(a.x, a)
  a.x = t1
  txnLogObjectInt(a.y, a)
  a.y = t2
  if (a.z != 0) {
    a.x = 0
    txnLogObjectInt(a.z, a)
    a.z = t3
  }

26
Compiler Optimizations for Transactions
  • Standard optimizations
  • CSE, dead-code elimination, ...
  • Careful IR representation exposes opportunities
    and enables optimizations with almost no
    modifications
  • Subtle in presence of nesting
  • STM-specific optimizations (illustrated below)
  • Immutable field / class detection & barrier
    removal (vtable/String)
  • Transaction-local object detection & barrier
    removal
  • Partial inlining of STM fast paths to eliminate
    call overhead
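As an illustration of transaction-local detection (a hypothetical example,
with Point standing in for any freshly allocated class; this is not
compiler output from the talk):

  atomic {
    Point p = new Point();   // p cannot escape before the transaction commits
    p.x = t1;                // so txnOpenForWrite(p) and txnLogObjectInt(p.x, p)
    p.y = t2;                // can both be removed: if the transaction aborts,
  }                          // p is unreachable anyway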

27
Experiments
  • 16-way 2.2 GHz Xeon with 16 GB shared memory
  • L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four processors)
  • Workloads
  • Hashtable, Binary tree, OO7 (OODBMS)
  • Mix of gets, in-place updates, insertions, and
    removals
  • Object-level conflict detection by default
  • Word / mixed where beneficial

28
Effectiveness of Compiler Optimizations
  • 1P overheads over thread-unsafe baseline

Prior STMs typically incur 2x on 1P
With compiler optimizations:
  - < 40% over no concurrency control
  - < 30% over synchronization
29
Scalability: Java HashMap Shootout
  • Unsafe (java.util.HashMap)
  • Thread-unsafe w/o Concurrency Control
  • Synchronized
  • Coarse-grain synchronization via SynchronizedMap
    wrapper
  • Concurrent (java.util.concurrent.ConcurrentHashMap
    )
  • Multi-year effort: JSR 166 → Java 5
  • Optimized for concurrent gets (no locking)
  • For updates, divides bucket array into 16
    segments (size / locking)
  • Atomic
  • Transactional version via AtomicMap wrapper
    (sketch after this list)
  • Atomic Prime
  • Transactional version with minor hand
    optimization
  • Tracks size per segment à la ConcurrentHashMap
  • Execution
  • 10,000,000 operations / 200,000 elements
  • Defaults: load factor, threshold, concurrency
    level
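A hypothetical sketch of such an AtomicMap-style wrapper in Transactional
Java (the actual wrapper class is not shown in the talk; the method set and
generics are illustrative):

  class AtomicMap<K, V> {
    private final java.util.HashMap<K, V> map = new java.util.HashMap<K, V>();

    V get(K key) {
      V result;
      atomic { result = map.get(key); }          // whole lookup is one transaction
      return result;
    }

    V put(K key, V value) {
      V previous;
      atomic { previous = map.put(key, value); }
      return previous;
    }
    // remove, size, etc. are wrapped the same way
  }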

30
Scalability: 100% Gets
Atomic wrapper is competitive with
ConcurrentHashMap
Effects of compiler optimizations scale
31
Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments
Atomic still scales
32
20% Inserts and Removes
Atomic conflicts on the entire bucket array - the
array is a single object
33
20% Inserts and Removes: Word-Level
We still conflict on the single size field in
java.util.HashMap
34
20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size per segment, lowering the
bottleneck: no degradation, modest performance gain
35
20% Inserts and Removes: Mixed-Level
  • Mixed-level preserves wins & reduces overheads
  • word-level for arrays
  • object-level for non-arrays

36
Scalability: java.util.TreeMap
100% Gets
80% Gets
Results similar to HashMap
37
Scalability: OO7, 80% Reads
Operations: traversals over a synthetic database
Coarse-grain atomic is competitive with medium-grain
synchronization
38
Key Takeaways
  • Optimistic reads & pessimistic writes is a nice
    sweet spot
  • Compiler optimizations significantly reduce STM
    overhead
  • 20-40% over thread-unsafe
  • 10-30% over synchronized
  • Simple atomic wrappers sometimes good enough
  • Minor modifications give competitive performance
    to complex fine-grain synchronization
  • Word-level contention is crucial for large arrays
  • Mixed contention provides best of both

39
Research challenges
  • Performance
  • Compiler optimizations
  • Hardware support
  • Dealing with contention
  • Semantics
  • I/O & communication
  • Strong atomicity
  • Nested parallelism
  • Open transactions
  • Debugging & performance analysis tools
  • System integration

40
Conclusions
  • Rich transactional language constructs in Java
  • Efficient, first class nested transactions
  • RISC-like STM API
  • Compiler optimizations
  • Per-type word and object level conflict detection
  • Complete GC support

41
(No Transcript)