Log-Based Transactional Memory

1
Log-Based Transactional Memory
  • Kevin E. Moore

2
Motivation
  • Chip-multiprocessors/Multi-core/Many-core are
    here
  • Intel has 10 projects in the works that contain
    four or more computing cores per chip -- Paul
    Otellini, Intel CEO, Fall 05
  • We must effectively program these systems
  • But programming with locks is challenging
  • Blocking on a mutex is a surprisingly delicate
    dance -- OpenSolaris, mutex.c

3
Locks are Hard
// WITH LOCKS
void move(T s, T d, Obj key) {
  LOCK(s);
  LOCK(d);
  tmp = s.remove(key);
  d.insert(key, tmp);
  UNLOCK(d);
  UNLOCK(s);
}
Moreover:
  • Coarse-grain locking limits concurrency
  • Fine-grain locking difficult
DEADLOCK!
4
Transactional Memory (TM)
void move(T s, T d, Obj key) {
  atomic {
    tmp = s.remove(key);
    d.insert(key, tmp);
  }
}
  • Programmer says
  • I want this atomic
  • TM system
  • Makes it so
  • Software TM (STM) Implementations
  • Currently slower than locks
  • Always slower than hardware?
  • Hardware TM (HTM) Implementations
  • Leverage cache coherence speculation
  • Fast
  • But hardware overheads and virtualization
    challenges

5
Goals for Transactional Memory
  • Efficient Implementation
  • Make the common case fast
  • Can't justify expensive HW (yet)
  • Virtualizing TM
  • Don't limit programming model
  • Allow transactions of any size and duration

6
Implementing TM
  • Version Management
  • new values for commit
  • old values for abort
  • Must keep both
  • Conflict Detection
  • Find read-write, write-read or write-write
    conflicts among concurrent transactions
  • Allows multiple readers OR one writer

Large state (must be precise)
Checked often (must be fast)
7
LogTM: Log-Based Transactional Memory
  • Combined Hardware/Software Transactional Memory
  • Conservative hardware conflict detection
  • Software version management (with some hardware
    support)
  • Eager Version Management
  • Stores new values in place
  • Stores old values in user virtual memory (the
    transaction log)
  • Eager Conflict Detection
  • Detects transaction conflicts on each load and
    store

8
LogTM Publications
  • HPCA 2006: LogTM: Log-based Transactional
    Memory
  • ASPLOS 2006: Supporting Nested Transactional
    Memory in LogTM
  • HPCA 2007: LogTM-SE: Decoupling Hardware
    Transactional Memory from Caches
  • ISCA 2007: Performance Pathologies in
    Hardware Transactional Memory

9
Outline
  • Introduction
  • Background
  • LogTM
  • Implementing LogTM
  • Evaluation
  • Extending LogTM
  • Related Work
  • Conclusion

10
LOGTM
11
LogTM: Log-Based Transactional Memory
  • Eager Software-Based Version Management
  • Store new values in place
  • Store old values in the transaction log
  • Undo failed transactions in software
  • Eager All-Hardware Conflict Detection
  • Isolate new values
  • Fast conflict detection for all transactions

12
LogTM's Eager Version Management
  • New values stored in place
  • Old values stored in the transaction log
  • A per-thread linear (virtual) address space (like
    the stack)
  • Filled by hardware (during transactions)
  • Read by software (on abort)
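
The bullets above can be sketched as a toy software model. This is only an illustration under stated assumptions: the class name `UndoLogMemory`, the dictionary-based memory and the `logged` set (standing in for a per-block W bit) are hypothetical and are not LogTM's hardware interface.

```python
class UndoLogMemory:
    """Toy model of eager version management with a per-thread undo log."""

    def __init__(self, memory):
        self.memory = dict(memory)  # address -> value, updated in place
        self.log = []               # (address, old value) undo records
        self.logged = set()         # addresses already logged (stand-in for the W bit)

    def store(self, addr, value):
        # Log the old value only on the first transactional write to addr.
        if addr not in self.logged:
            self.log.append((addr, self.memory[addr]))
            self.logged.add(addr)
        self.memory[addr] = value   # the new value goes in place

    def commit(self):
        # Commit copies nothing: it just discards the log.
        self.log.clear()
        self.logged.clear()

    def abort(self):
        # Walk the log from newest to oldest, restoring old values.
        while self.log:
            addr, old = self.log.pop()
            self.memory[addr] = old
        self.logged.clear()
```

Commit only resets bookkeeping, while abort does work proportional to the number of log records, which is the trade-off the deck describes as making the common case fast.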

[Figure: cached data blocks with their virtual addresses (VA) and R/W bits, next to the transaction log and the Log Base, Log Ptr and TM count registers]
13
Eager Version Management Discussion
  • Advantages
  • No extra indirection (unlike STM)
  • Fast Commits
  • No copying
  • Common case
  • Disadvantages
  • Slow/Complex Aborts
  • Undo aborting transaction
  • Relies on Eager Conflict Detection/Prevention

14
LogTM's Eager Conflict Detection
  • Requirements for Conflict Detection in LogTM
  • Transactions Must Be Well Formed
  • Each thread must obtain read isolation on all
    memory locations read and write isolation on all
    locations written
  • Isolation Must be Strict Two-Phase
  • Any thread that acquires read or write isolation
    on a memory location in a transaction must
    maintain that isolation until the end of the
    transaction
  • Isolation Must Be Released at the End of a
    Transaction
  • Because conflicts may prevent transactions from
    making progress, a thread completing a
    transaction must release isolation when it aborts
    or commits a transaction

15
LogTM's Conflict Detection in Practice
  • LogTM detects conflicts using coherence
  • Requesting processor issues coherence request to
    memory system
  • Coherence mechanism forwards to other
    processor(s)
  • Responding processor detects conflict using local
    state and informs requesting processor of conflict
  • Requesting processor resolves conflict (discussed
    later)
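
As a rough sketch of the responder-side check described above, assuming per-block R and W bits: the function name `check_request` and the "ACK" reply are illustrative assumptions; GETS, GETX and NACK are the message names used in this deck.

```python
def check_request(request, r_bit, w_bit):
    """Decide how a responding processor answers a forwarded coherence request.

    request is "GETS" (another processor wants to read) or "GETX" (wants to write).
    r_bit / w_bit say whether the local transaction has read / written the block.
    """
    if request == "GETX":
        # A remote write conflicts with any local transactional access.
        conflict = r_bit or w_bit
    elif request == "GETS":
        # A remote read conflicts only with a local transactional write.
        conflict = w_bit
    else:
        raise ValueError(f"unknown request type: {request}")
    return "NACK" if conflict else "ACK"
```

This enforces the multiple-readers-or-one-writer rule: two readers never conflict, and any pairing that involves a writer does.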

16
Example Implementation (LogTM-Dir)
  • P0 store
  • P0 sends get exclusive (GETX) request
  • Directory responds with data (old)
  • P0 executes store

[Figure: directory state and the cache states and R/W metadata of P0 and P1, before and after P0's store]
17
Example Implementation (LogTM-Dir)
  • In-cache transaction conflict
  • P1 sends get shared (GETS) request
  • Directory forwards to P0
  • P0 detects conflict and sends NACK

[Figure: P1's GETS goes to the directory, which forwards it (Fwd_GETS) to P0; P0, holding the block in M with its W bit set, replies with a NACK]
18
Conflict Resolution
  • Conflict Resolution
  • Can wait risking deadlock
  • Can abort risking livelock
  • Wait/abort transaction at requesting or
    responding proc?
  • LogTM resolves conflicts at requesting processor
  • Requesting can wait (using coherence
    nacks/retries)
  • But must abort if deadlock is possible
  • Requester Stalls Policy
  • Logically order transactions with timestamps
  • On conflict notification, wait unless already
    causing an older transaction to wait
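
The requester-stalls rule on this slide can be sketched as follows. This is a simplified model that follows the slide's wording only; the names `Transaction`, `possible_cycle`, `on_incoming_request` and `on_nack` are assumptions made for illustration, as is the "ACK" reply.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    timestamp: int                # smaller timestamp = logically older
    possible_cycle: bool = False  # set once this transaction has stalled an older one


def on_incoming_request(responder, requester_timestamp, conflicts):
    """Responder side: NACK a conflicting request and note if the requester is older."""
    if not conflicts:
        return "ACK"
    if requester_timestamp < responder.timestamp:
        # We are now making an older transaction wait.
        responder.possible_cycle = True
    return "NACK"


def on_nack(requester):
    """Requester side: wait, unless we are already making an older transaction wait."""
    return "ABORT" if requester.possible_cycle else "WAIT"
```

For example, if a younger transaction has NACKed an older one and is then NACKed itself, it aborts rather than waits, which breaks a potential wait cycle.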

19
LogTM API
  • User
  • begin_transaction()
  • commit_transaction()
  • abort_transaction()
  • System/Library
  • Initialize_logtm_transactions()
  • Register_abort_handler(void () handler)
  • Low-Level
  • Undo_log_entry()
  • Complete_abort_with_restart()
  • Complete_abort_wo_restart()
20
IMPLEMENTING LOGTM
21
Version Management Trade-offs
  • Hardware vs. Software Register Checkpointing
  • Implicit vs. Explicit Logging
  • Buffered vs. Direct Logging
  • Logging Granularity
  • Logging Location

22
Compiler-Supported Software Logging
  • Software Register Checkpointing
  • Compiler generates instructions to save registers
    to transaction log
  • Software-only logging
  • Compiler generates instructions to save old
    values and addresses to the transaction log
  • Lowest implementation cost
  • All-software version management
  • High overhead
  • Slow to start transactions (save registers)
  • Slow writes (extra load instructions)

23
In-Cache Hardware Logging
  • Hardware Register Checkpointing
  • Bulk save architectural registers (like USIII)
  • Hardware Logging
  • Hardware saves old values and virtual address to
    memory at the first level of writeback cache
  • Best Performance
  • Little or no logging-induced delay
  • Single-cycle transaction begin/commit
  • Complex implementation
  • Shadow register file
  • Buffering and forwarding logic in caches

24
In-Cache Hardware Logging
[Figure: in-cache hardware logging datapath. CPU store buffers feed the L1 D cache and banked L2 cache, which carry the store target (data, ECC) alongside the log target (VA, ECC)]
25
Hardware/Software Hybrid Buffered Logging
  • Hardware Register Checkpointing
  • Bulk save architectural registers (like USIII)
  • Buffered Logging
  • Hardware saves old values and virtual address to
    a small buffer
  • Good Performance
  • Little or no logging-induced delay for small
    transactions
  • Single-cycle transaction begin/commit
  • Reduces processor-to-cache memory traffic
  • Less-complex implementation
  • Shadow register file
  • No changes to caches

26
Hardware/Software Hybrid Buffered Logging
[Figure: hybrid buffered logging datapath during transaction execution and on buffer spill. CPU register file, store buffer and log buffer feed the cache's store target and log target (VA)]
27
Implementing Conflict Detection
  • Existing cache coherence mechanisms can support
    conflict detection for cached data by adding R
    (read) and W (write) bits to each cache line
  • Challenges for detecting conflicts on un-cached
    data differ for broadcast and directory systems
  • Broadcast
  • Easy to find all possible conflicts
  • Hard to filter false conflicts
  • Directory
  • Hard to find all possible conflicts
  • Easy to filter false conflicts

28
LogTM-Bcast
  • Adds a Bloom Filter to track memory blocks
    touched in a transaction, then evicted from the
    cache
  • Allows any number of addresses to be added to the
    filter
  • Detects all true conflicts
  • Allows some false conflicts
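
A minimal Bloom-filter sketch illustrates why the overflow filter catches every true conflict but may report false ones. The class name, the 64-bit size and the SHA-256-based hashing are illustrative assumptions; the slide does not specify the hardware hash functions.

```python
import hashlib


class OverflowFilter:
    """Toy Bloom filter for blocks touched in a transaction and then evicted."""

    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored as an integer

    def _positions(self, block_addr):
        # Derive num_hashes bit positions from the block address.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{block_addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, block_addr):
        for pos in self._positions(block_addr):
            self.bits |= 1 << pos

    def may_contain(self, block_addr):
        # True for every added address; may also be True for others (false conflict).
        return all((self.bits >> pos) & 1 for pos in self._positions(block_addr))

    def clear(self):
        # Reset at transaction end (commit or abort).
        self.bits = 0
```

Adding an address only sets bits, so the filter never forgets an evicted block, and it accepts any number of addresses at the cost of a rising false-conflict rate.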

[Figure: CPU with L1 I and L1 D caches and an L2 cache (tag, data, R/W bits), plus an overflow filter with R and W parts]
29
LogTM-Dir
  • Extends a standard MESI directory with sticky
    states
  • The directory continues to forward coherence
    traffic for a memory location to processors that
    touch that location in a transaction then evict
    it from the cache
  • Removes most false conflicts with a single
    overflow bit per cache

30
Sticky States
                 Directory State
Cache State      M           S           I
M                M
E                E
S                            S
I                Sticky-M    Sticky-S    I
31
LogTM-Dir Conflict Detection w/ Cache Overflow
  • At overflow at processor P0
  • Set P0's overflow bit (1 bit per processor)
  • Allow writeback, but set directory state to
    Sticky@P0
  • At (potential) conflicting request by processor
    P1
  • Directory forwards P1's request to P0.
  • P0 tells P1 no conflict if overflow is reset
  • But asserts conflict if set (w/ small chance of
    false positive)
  • At transaction end (commit or abort) at processor
    P0
  • Reset P0's overflow bit
  • Clean sticky states lazily on next access
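
The overflow-bit behaviour above can be modelled roughly as below. This is a sketch under simplifying assumptions: the `Processor` class and its method names are hypothetical, the directory is not modelled, and "ACK" is an assumed reply name (NACK and CLEAN appear in this deck).

```python
class Processor:
    """Toy model of one processor's response logic with a single overflow bit."""

    def __init__(self):
        self.overflow = False
        self.cached_rw = {}  # block -> (r_bit, w_bit) for transactional blocks in cache

    def evict_transactional(self, block):
        # The block leaves the cache; the directory would mark it sticky here.
        self.cached_rw.pop(block, None)
        self.overflow = True

    def respond(self, block, is_write):
        """Answer a request that the directory forwarded to this processor."""
        if block in self.cached_rw:
            r_bit, w_bit = self.cached_rw[block]
            return "NACK" if (w_bit or (is_write and r_bit)) else "ACK"
        # Not cached: one bit cannot tell which blocks overflowed, so answer conservatively.
        return "NACK" if self.overflow else "CLEAN"

    def end_transaction(self):
        # Commit or abort: drop transactional bits and reset the overflow bit.
        self.cached_rw.clear()
        self.overflow = False
```

Because a single bit stands for all overflowed blocks, a request for an unrelated sticky block is also refused while the bit is set, which is the small chance of a false positive the slide mentions.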

32
LogTM-Dir
  • Cache overflow
  • P0 sends put exclusive (PUTX) request
  • Directory acknowledges
  • P0 writes data back to memory

[Figure: P0 sends PUTX, the directory acknowledges (ACK) and P0 writes the data back (DATA); the directory entry changes from M@P0 to sticky-M@P0 while P0 keeps its TM count and overflow state]
33
LogTM-Dir
  • Out-of-cache conflict
  • P1 sends GETS request
  • Directory forwards to P0
  • P0 detects a (possible) conflict
  • P0 sends NACK

[Figure: P1's GETS reaches the directory, which is in sticky-M@P0 and forwards it (Fwd_GETS) to P0; P0 checks its signature and replies with a NACK]
34
LogTM-Dir
  • Commit
  • P0 clears TM count and signature

[Figure: after commit, P0's TM count and signature are cleared; the directory entry remains sticky-M@P0]
35
LogTM-Dir
  • Lazy cleanup
  • P1 sends GETS request
  • Directory forwards request to P0
  • P0 detects no conflict, sends CLEAN
  • Directory sends Data to P1

[Figure: P1's GETS is forwarded (Fwd_GETS) to P0, P0 replies CLEAN, and the directory sends DATA to P1; the directory entry changes from sticky-M@P0 to S(P1)]
36
EVALUATION
37
System Model
  • LogTM-Dir
  • In-Cache Hardware Logging & Hybrid Buffered
    Logging

Component                 Settings
Processors                32, 1 GHz, single-issue, in-order, non-memory IPC=1
L1 Cache                  16 kB 4-way split, 1-cycle latency
L2 Cache                  4 MB 4-way unified, 12-cycle latency
Memory                    4 GB, 80-cycle latency
Directory                 Full-bit-vector sharers list, directory cache, 6-cycle latency
Interconnection Network   Hierarchical switch topology, 14-cycle link latency
38
Benchmarks
Benchmark            Synchronization               Inputs
Shared Counter       Counter lock                  2500-cycle random think time
B-Tree               Transactions only             9-ary tree, 5 levels deep
Barnes               Locks on tree nodes           512 bodies
Cholesky             Task queue locks              14
Berkeley DB (BkDB)   Locks on object lists         512 operations
MP3D                 Locks                         4096 molecules
Radiosity            Task queue locks              Large room
Raytrace             Work list and counter locks   Car
39
Read Set Size
40
Write Set Size
41
Microbenchmark Scalability
Btree 0, 10 and 20 Updates
Shared Counter LogTM vs. Locks
42
Benchmark Scalability
Barnes
BkDB
43
Benchmark Scalability
Cholesky
MP3D
44
Benchmark Scalability
Radiosity
Raytrace
45
Scalability Summary
  • Benchmarks scale as well as or better using LogTM
    transactions
  • Performance is better for all benchmarks
  • LogTM improves the scalability of some
    benchmarks, but not others
  • Abort rates are low
  • Next
  • Write set prediction
  • Buffered Logging
  • Log Granularity

46
Write Set Prediction
  • Predicts if the target of each load will be
    modified in this transaction
  • Eagerly acquires write isolation
  • Reduces wait-for cycles that force aborts in
    LogTM
  • Four Predictors
  • None -- Never predict
  • 1-Entry -- Remembers a single address
  • Load PC -- History based on PC of load
    instruction
  • Always -- Acquire write isolation for all loads
    and stores
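
One way to picture the Load PC predictor is the sketch below. The slide only says the history is based on the PC of the load instruction, so the data structures and method names here are hypothetical design choices, not the evaluated implementation.

```python
class LoadPCPredictor:
    """Toy write-set predictor keyed by the PC of the load instruction."""

    def __init__(self):
        self.write_pcs = set()  # load PCs whose targets were later stored to
        self.loaded_by = {}     # address -> PC of the first load in this transaction

    def on_load(self, pc, addr):
        """Return True if this load should acquire write isolation eagerly."""
        self.loaded_by.setdefault(addr, pc)
        return pc in self.write_pcs

    def on_store(self, addr):
        # If the stored-to address was loaded earlier, remember that load's PC.
        pc = self.loaded_by.get(addr)
        if pc is not None:
            self.write_pcs.add(pc)

    def end_transaction(self):
        # Per-transaction state is dropped; the learned PCs persist.
        self.loaded_by.clear()
```

In this sketch, a load that is followed by a store to the same address trains the predictor, so the next transaction that executes the same load requests exclusive access up front instead of upgrading later.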

47
Abort Rate with Write Set Prediction
48
Performance Impact of WSP
49
Impact of Buffer-Spill Stalls
50
Log Granularity
51
Modeling Abort Penalty
  • Abort penalty
  • Delays coherence requests
  • Delays transaction restart
  • Penalty consists of
  • Trap overhead (constant)
  • Rollback overhead (per log entry)
  • Measured performance for 3 settings
  • Ideal -- single-cycle abort
  • Medium -- 200-cycle trap, 40 cycles per undo
    record
  • Slow -- 1000-cycle trap, 200 cycles per undo
    record
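
The penalty model above is a constant plus a per-record cost. A small sketch, with the function and parameter names chosen here for illustration:

```python
def abort_penalty(log_entries, trap_cycles, cycles_per_entry):
    """Abort cost in cycles: constant trap overhead plus rollback per undo record."""
    return trap_cycles + cycles_per_entry * log_entries


# The two non-ideal settings from the slide.
MEDIUM = {"trap_cycles": 200, "cycles_per_entry": 40}
SLOW = {"trap_cycles": 1000, "cycles_per_entry": 200}
```

For a transaction with 10 undo records this gives 200 + 40 × 10 = 600 cycles under the Medium setting and 1000 + 200 × 10 = 3000 cycles under Slow.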

52
Sensitivity to Abort Penalty (no WSP)
53
Sensitivity to Abort Penalty (with WSP)
54
EXTENDING LOGTM
55
Extending LogTM
  • Supporting Nesting in LogTM
  • Support nested VM by segmenting the transaction
    log
  • Non-transactional escape actions facilitate OS
    interactions
  • Virtualizing Conflict Detection with Signatures
  • LogTM-Signature Edition (LogTM-SE) tracks read
    and write sets with signatures (like Bloom
    Filters)
  • Supports thread switching and paging by saving,
    restoring and manipulating signatures

56
RELATED WORK
57
Related Work
  • Hardware Support for Database Transactions
  • Early Transactional Memory Systems
  • Hardware TM (HTM)
  • Software TM (STM)
  • Hybrid TM
  • TM Virtualization

58
Early Transactional Memory Systems
  • Hardware Support for Database Transactions
  • 801 Storage System
  • Database-like transactions on 1-level store
    (memory and disk)
  • Transactions are durable
  • Early HTM
  • Knight
  • used transactions to parallelize code written in
    mostly functional languages
  • Herlihy and Moss
  • First HTM
  • Implementation based on a separate transaction
    cache
  • Transactions limited to cached data

59
Unbounded Transactional Memory
  • Uses Eager VM and Eager CD
  • Supports unbounded transactions in hardware
  • Complex hardware
  • Pointer and state bits for each line in memory
  • Hardware state machine for transaction rollback
  • Global virtual address space

60
Transactional Memory Coherence and Consistency (TCC)
  • On-chip interconnect: broadcast-based communication
  • Write buffer: 4 kB, fully associative
  • L2 cache: logically shared
  • L1 cache tracks read set (R bit)
61
Bulk
  • Encodes read and write sets in signatures (like
    bloom filters)
  • Like TCC, uses lazy VM and lazy CD
  • Can detect conflicts for non-cached data

62
Hybrid Transactional Memory
  • Combines HTM and STM
  • Executes small transactions in hardware, large
    transactions in software
  • Allows program execution on existing hardware
    (without HTM support)

63
Transaction Virtualization
  • Virtual Transactional Memory (VTM)
  • Rajwar and Herlihy
  • Adds a virtualization mechanism to limited HTM
    (e.g. Herlihy and Moss TM)
  • Implements CD and VM for transactions that exceed
    hardware capabilities in micro-code
  • Page-granularity Transaction Virtualization
  • PTM -- Chuang et al.
  • XTM -- Chung et al.

64
HTM Virtualization Mechanisms
                 Before Virtualization             After Virtualization
                 Miss  Commit  Abort  Eviction     Miss  Commit  Abort  Eviction   Paging  Thread Switch
UTM              -     -       -      H            H     H       HC     H          H       H
VTM              -     -       -      S            S     SC      S      S          S       SWV
UnrestrictedTM   -     -       -      A            B     B       B      B          AS      AS
XTM              -     -       -      ASC          -     SCV     S      SC         SC      AS
XTM-g            -     -       -      SC           -     SCV     S      SC         SC      AS
PTM-Copy         -     -       -      SC           S     S       SC     SC         S       S
PTM-Select       -     -       -      S            H     S       S      S          S       S
LogTM-SE         -     -       SC     -            -     S       SC     -          S       S

Legend: Shaded = virtualization event; - = handled in simple HW; H = complex hardware; S = handled in software; A = abort transaction; C = copy values; W = walk cache; V = validate read set; B = block other transactions
65
Conclusion
  • TM can make parallel programs faster and easier
    to write
  • LogTM provides
  • Hardware/Software Implementation
  • Simple, flexible hardware
  • Software-Based Eager Version Management
  • Makes the common case (commit) fast
  • Reduces hardware complexity
  • Hardware-Based Eager Conflict Detection
  • Allows blocking to reduce wasted work

66
Thanks to my Collaborators
  • Jayaram Bobba, Mark Hill, Derek Hower, Steve
    Jackson, Nick Kidd, Ben Liblit, Mike Marty,
    Michelle Moravan, Tom Reps, Mike Swift, Haris
    Volos, David Wood, Luke Yen

67
BACKUP SLIDES
68
Database Locks and Cache Coherence States
Database Lock   Cache State
No Lock         I
S               E, O, S
X               M
  • Coherence states are analogous to short database
    locks
  • Most protocols have no provision to hold long
    locks

69
Herlihy and Moss, ISCA 1993
  • Transaction cache
  • Stores all data accessed by transactions
  • 2 copies of each updated cache line
  • Fully associative
  • Acts as a victim cache
  • Long Locks
  • Processors are allowed to refuse coherence
    requests

[Figure: memory, a regular cache (M, S states) and a separate transaction cache holding XCommit and XAbort copies]
70
Transactions Limited by Cache Size and
Associativity
  • Exposes the size of the transaction cache to the
    architecture
  • Requires minimum associativity
  • Difficult for dynamic transactions

71
Transactional Lock Removal (TLR)
  • Uses Speculative Lock Elision (SLE) to elide lock
    operations in short critical sections
  • Extends SLE with lock-based concurrency control
  • Long locks: processors can defer coherence
    responses during speculative transactions

72
LogTM-SE Processor Hardware
  • Segmented log, like LogTM
  • Track R / W sets with R / W signatures
  • Over-approximate R / W sets
  • Tracks physical addresses
  • Summary signature used for virtualization
  • Conflict detection by coherence protocol
  • Check signatures on every memory access for SMT

[Figure: per-SMT-thread context holding registers, register checkpoint, LogFrame, LogPtr, TMcount, Read and Write signatures, and SummaryRead and SummaryWrite signatures; the data caches hold no TM state]
73
Escape Actions
  • Allow non-transactional escapes from a
    transaction
  • (e.g., system calls, I/O)
  • Similar to Zilles's pause/unpause
  • Escape actions never
  • Abort
  • Stall
  • Cause other transactions to abort
  • Cause other transactions to stall
  • Commit and compensating actions
  • similar to open nests

Not recommended for the average programmer!
74
Case Study System Calls in Solaris
Category                                  #    Examples
Read-Only                                 57   getpid, times, stat, access, mincore, sync, pread, gettimeofday
Undoable (without global side effects)    40   chdir, dup, umask, seteuid, nice, seek, mprotect
Undoable (with global side effects)       27   chmod, mkdir, link, mknod, stime
Calls not handled by escape actions       89   kill, fork, exec, umount
75
Escape Actions in LogTM
  • Loads and stores to non-transactional blocks
    behave as normal coherent accesses
  • Loads return the latest value in coherent memory
  • Loads to a transactionally modified cache block
    trigger a writeback (sticky-M state)
  • Memory responds with an uncacheable copy of the
    block
  • Stores modify coherent memory
  • Stores to transactionally modified blocks trigger
    writebacks (sticky-M)
  • Updates the value in memory (non-cacheable write
    through)

76
Thread Switching Support
  • Why?
  • Support long-running transactions
  • What?
  • Conflict Detection for descheduled transactions
  • How?
  • Summary Read / Write signatures
  • If thread t of process P is scheduled to use an
    active signature, the corresponding summary
    signature holds the union of the saved signatures
    from all descheduled threads from process P.
  • Updated using TLB-shootdown-like mechanism
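
A rough sketch of how a summary signature could be formed and checked, assuming signatures are plain bit vectors modelled as Python integers; both function names are illustrative, and the bit patterns in the usage note are arbitrary examples.

```python
def summary_signature(saved_signatures):
    """Union (bitwise OR) of the saved signatures of all descheduled threads."""
    summary = 0
    for signature in saved_signatures:
        summary |= signature
    return summary


def may_conflict(summary, access_bits):
    """True if every bit selected by the access (non-zero mask) is set in the summary."""
    return (summary & access_bits) == access_bits
```

For example, `summary_signature([0b01001000, 0b01010010])` yields `0b01011010`. Because the union only ever adds bits, a running thread that checks the summary cannot miss a conflict with a descheduled transaction, though it may see extra false conflicts.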

77
Handling Thread Switching
[Figure: processors P1 to P4 running threads T1 to T3, each with R and W signatures and summary signatures; the OS holds the per-process summary]
78
Handling Thread Switching
[Figure: a thread with an active transaction is descheduled; its R and W signatures are saved and merged into the summary signatures held by the OS]
79
Handling Thread Switching
[Figure: after the deschedule, the OS summary holds the saved R and W signatures and the remaining processors receive the updated summary]
80
Handling Thread Switching
[Figure: threads T1 to T3 scheduled across P1 to P4 with their R, W and summary signatures after rescheduling]
81
Thread Switching Support Summary
  • Summary Read / Write signatures
  • Summarizes descheduled threads with active
    transactions
  • One OS structure per process
  • Check summary signature on every memory access
  • Updated on transaction deschedule
  • Similar to TLB shootdown

82
Improving LogTM
83
Comparing HTMs
84
Multifacet Group Projects
  • IEEE Computer - Simulating a $2M Commercial
    Server on a $2K PC, Alaa R. Alameldeen, Milo M.K.
    Martin, Carl J. Mauer, Kevin E. Moore, Min Xu,
    Daniel J. Sorin, Mark D. Hill and David A. Wood
  • ASPLOS 2000 - Timestamp Snooping: An Approach for
    Extending SMPs, Milo M. K. Martin, Daniel J.
    Sorin, Anastassia Ailamaki, Alaa R. Alameldeen,
    Ross M. Dickson, Carl J. Mauer, Kevin E. Moore,
    Manoj Plakal, Mark D. Hill, and David A. Wood

85
How Do Transactional Memory Systems Differ?
  • (Data) Version Management
  • Eager: record old values elsewhere; update in
    place
  • Lazy: update elsewhere; keep old values in
    place
  • (Data) Conflict Detection
  • Eager: detect conflict on every read/write
  • Lazy: detect conflict at end (commit/abort)

→ Fast commit
→ Less wasted work
86
Transaction Log Example
Data Block
VA
R W
  • Initial State
  • LogBase = LogPointer
  • R/W bits are clear

[Figure: data blocks at VA 00, 40 and C0 with their R/W bits, and the transaction log with the Log Base, Log Ptr and TM mode registers]
87
Transaction Log Example
Data Block
VA
R W
  • Load r1, (00)  /* r1 gets 12 */
  • Set R bit for block (00) (no changes to log)

[Figure: data blocks at VA 00, 40 and C0 with their R/W bits, and the transaction log with the Log Base, Log Ptr and TM mode registers]
88
Transaction Log Example
Data Block
VA
R W
  • Store r2, (c0)  /* r2 = 56 */
  • Set W bit for block (c0)
  • Store address (c0) and old data on the log
  • Increment Log Ptr to 1048
  • Update memory

[Figure: data blocks at VA 00, 40 and C0 with their R/W bits, and the transaction log with the Log Base, Log Ptr and TM mode registers]
89
Transaction Log Example
Data Block
VA
R W
  • Load r3, (78)
  • Set R bit for block (40)
  • r3 = r3 + 1
  • Store r3, (78)
  • Set W bit for block (40)
  • Store address (40) and old data on the log
  • Increment Log Ptr to 1090
  • Update memory

[Figure: data blocks at VA 00, 40 and C0 with their R/W bits, and the transaction log with the Log Base, Log Ptr and TM mode registers]
90
Transaction Log Example
Data Block
VA
R W
  • Commit transaction
  • Clear R/W bits for all blocks
  • Reset Log Ptr to Log Base (1000)
  • Clear TM mode

[Figure: data blocks at VA 00, 40 and C0 with their R/W bits, and the transaction log with the Log Base, Log Ptr and TM mode registers]
91
Transaction Log Example
Data Block
VA
R W
  • Abort transaction
  • Replay log entries to undo the transaction
  • Reset Log Ptr to Log Base (1000)
  • Clear R/W bits for all blocks
  • Clear TM mode

[Figure: data blocks at VA 00, 40 and C0 with their R/W bits, and the transaction log with the Log Base, Log Ptr and TM mode registers]
92
Primitive Logging
  • Software defined log location (in virtual memory)
  • Based on log pointer register
  • Hardware copies old values and virtual address to
    memory at log pointer
  • Overlaps logging with stores
  • Allows logging with library calls

93
Primitive Address Matching
  • Software creates and activates multiple contexts
  • Not strictly nested
  • Many uses
  • Hand-over-hand locking
  • Pointer alias checks
  • Transactional memory

94
LogTM Interface
  • User-Level
  • Begin/commit/abort
  • System/Library
  • Initialize transactions
  • Register conflict handler
  • Low-Level
  • Undo log entry
  • Complete abort with/without restart

→ Currently, undo log to abort, but conflict
managers in future
95
HTM (in general)
  • Version Management
  • New values in cache
  • Old values in memory
  • Conflict Detection
  • Coherence protocol detects conflicts
  • Invalidate

[Figure: memory and two caches; the block moves between S and I copies and an M copy holding the NEW value]
96
Conflict Detection in Other TM Schemes
  • Cache overflow of transactional data hard for
    (Hardware) TM
  • Prohibit: Herlihy/Moss TM
  • Action at Overflow: LTM, VTM, TCC

97
Outline
  • Background/Motivation
  • Multicores are here
  • We need to program them
  • Need HW/SW solution
  • HW primitives
  • SW Control
  • TM
  • Clear, intuitive model
  • Likely benefits
  • But, all-hw won't work
  • LogTM
  • LogTM Family
  • Eager Version Management
  • Basic Log
  • Segmented Log
  • Eager Conflict Detection
  • Signatures
  • Coherence
  • Sticky States

98
Software Transactional Memory
  • Transactional programming w/o hardware support
  • Atomic swap of pointers to enforce atomicity
  • Adds a level of indirection

99
MSI Coherence 101 (per memory block)
  • States
  • M - one writer
  • S - many readers
  • I - no access
  • Protocol detects & orders data conflicts
  • write-read
  • read-write
  • write-write
  • E.g., Writer seeks M copy, must invalidate S
    copies

100
Why Hardware Transactional Memory (HTM)?
  • Speed: HTMs faster than STMs
  • Leverage cache coherence
  • Mitigate extra indirection & copying
  • Speed: HTMs faster than some lock regimes
  • Auto-magical fine-grain
  • Don't have to get lock
  • Speed: Whole reason for parallelism
  • But HTM virtualization issues
  • Cache size & associativity, OS calls
  • Paging, process switching & migration
  • LogTM helps? Needs work

101
Conflict Detection in HTM
  • Most Hardware TMs
  • Do eager conflict detection (at read/writes)
  • Leveraging invalidation-based cache coherence
  • Most Hardware TMs
  • Add per-processor transactional write (W) & read
    (R) bits
  • Setting W bit requires M state; setting R
    requires S or M
  • Ensures coherence protocol detects transactional
    data conflicts
  • E.g., Writer seeks M copy, seeks S copies,
    finds R bit set

102
The State of the World
  • GHz race is over
  • Frequency increase limited by heat and power
    constraints
  • Size of processor limited by communication delay,
    not transistors
  • Increasing wire delay on chip
  • All high-performance processors will be CMP
  • Software must become parallel

(in computer architecture)
103
Parallel Programming is Hard!
  • Data races cause subtle bugs
  • Locks are a mess
  • Deadlock
  • Granularity problem
  • Not composable
  • Lock-free solutions still challenging
  • We need a better way to write parallel software

104
Solution: Let the hardware help
  • Provide a better interface for parallel software
  • Plenty of transistors
  • Access to run-time information
  • Transactional Memory
  • Intuitive interface -- serial execution
  • High performance -- run transactions in parallel
    when possible
  • Current cache coherence schemes already do much
    of the work

105
LogTM Overview
  • Hardware Transactional Memory promising
  • Most use lazy version management
  • Old values in place
  • New values elsewhere
  • Commits slower than aborts
  • But commits more common
  • New: LogTM: Log-based Transactional Memory
  • Uses eager version management (like most
    databases)
  • Old values to log in thread-private virtual
    memory
  • New values in place
  • Makes common commits fast!
  • Also allows cache overflow & software abort
    handling

106
What is Transactional Memory?
void move(T s, T d, Obj key) {
  atomic {
    tmp = s.remove(key);
    d.insert(key, tmp);
  }
}

void move(T s, T d, Obj key) {
  LOCK(s);
  LOCK(d);
  tmp = s.remove(key);
  d.insert(key, tmp);
  UNLOCK(d);
  UNLOCK(s);
}
  • Atomic and isolated execution
  • Replaces locks for many applications
  • No lock granularity problem
  • No deadlock
  • Composable synchronization

DEADLOCK!
107
Single-CMP System

[Figure: single-CMP system with cores connected by an interconnect to a shared L2 and DRAM]
108
Methods
  • Simulated Machine: 32-way non-CMP
  • 32 SPARC V9 processors running Solaris 9 OS
  • 1 GHz in-order processors w/ ideal IPC=1 &
    private caches
  • 16 kB 4-way split L1 cache, 1-cycle latency
  • 4 MB 4-way unified L2 cache, 12-cycle latency
  • 4 GB main memory, 80-cycle access latency
  • Full-bit vector directory w/ directory cache
  • Hierarchical switch interconnect, 14-cycle
    latency
  • Simulation Infrastructure
  • Virtutech Simics for full-system function
  • Multifacet GEMS for memory system timing (Ruby
    only); GPL release: http://www.cs.wisc.edu/gems/
  • Magic no-op instructions for begin_transaction(),
    etc.

109
Microbenchmark Analysis
  • Shared Counter
  • All threads update the same counter
  • High contention
  • Small Transactions
  • LogTM v. Locks
  • EXP - Test-And-Test-And-Set Locks with
    Exponential Backoff
  • MCS - Software Queue-Based Locks

BEGIN_TRANSACTION();
new_total = total.count + 1;
private_data[id].count++;
total.count = new_total;
COMMIT_TRANSACTION();
110
Shared Counter
  • LogTM (like other HTMs) does not read/write lock
  • LogTM has few aborts despite conflicts

111
SPLASH2 Benchmarks
Benchmark          Input                        Synchronization
Barnes             512 Bodies                   Locks on tree nodes
Cholesky           14                           Task queue locks
Ocean              Contiguous partitions, 258   Barriers
Radiosity          Room                         Task queue and buffer locks
Raytrace           Small image (teapot)         Work list and counter locks
Raytrace-Opt       Small image (teapot)         Work list and counter locks
Water N-Squared    512 Molecules                Barriers
112
SPLASH2 Benchmark Results
113
SPLASH2 Benchmark Results
Benchmark        Transactions   Stalls   Aborts   R-M-W
Barnes           3,067          4.89     15.3     27.9
Cholesky         22,309         4.54     2.07     82.3
Ocean            6,693          .30      .52      100
Radiosity        279,750        3.96     1.03     82.7
Raytrace-Base    48,285         24.7     1.24     99.9
Raytrace-Opt     47,884         2.04     .41      99.9
Water            35,398         0        .11      99.6
? Conflicts Less Common ?
? Aborts ?
Most trans. data read before written
114
LogTM
Virtual Memory
  • No limit on transaction size
  • New values stored in place (even in main memory)
  • All-hardware conflict detection using sticky
    states
  • Aborts processed in software

[Figure: virtual memory holding new values in place and old values in the transaction logs]
HPCA 2006 - LogTM: Log-Based Transactional
Memory, Kevin E. Moore, Jayaram Bobba, Michelle
J. Moravan, Mark D. Hill and David A. Wood
115
Nested LogTM
Transaction Log
  • Supports closed and open nesting by
  • Splitting log into frames (like a stack of
    activation records)
  • Replicating R/W bits
  • Escape actions provide non-transactional
    execution for system calls and I/O
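
The frame-based log can be sketched as a stack of undo-record lists. This toy model covers closed nesting only; the class and method names are hypothetical, and open nesting, frame headers and R/W bit replication are left out.

```python
class NestedLog:
    """Toy segmented undo log: one frame of undo records per nesting level."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.frames = []  # each frame is a list of (address, old value) records

    def begin(self):
        # A nested begin pushes a new frame, like an activation record.
        self.frames.append([])

    def store(self, addr, value):
        self.frames[-1].append((addr, self.memory[addr]))
        self.memory[addr] = value

    def commit_closed(self):
        # Closed commit: the child's records join the parent's frame,
        # so a later parent abort still undoes the child's writes.
        child = self.frames.pop()
        if self.frames:
            self.frames[-1].extend(child)

    def abort_innermost(self):
        # Partial abort: undo only the innermost frame, newest record first.
        for addr, old in reversed(self.frames.pop()):
            self.memory[addr] = old
```

Aborting the inner transaction leaves the outer transaction's writes in place, which is what allows a conflict to be resolved by rolling back only the innermost level.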

[Figure: transaction log split into frames; Level 0 and Level 1 each consist of a header followed by undo records]
ASPLOS 2006 - Supporting Nested Transactional
Memory in LogTM, Michelle J. Moravan, Jayaram
Bobba, Kevin E. Moore, Luke Yen, Mark D. Hill,
Ben Liblit, Michael M. Swift and David A. Wood
116
LogTM-SE: Signature Edition
  • Nested LogTM has several implementation issues
  • Nesting depth limited by hardware
  • Multiple R and W bits per cache block
  • SMT makes this worse
  • Mucks with latency critical L1 cache
  • Not easy to virtualize
  • Decouple conflict detection from L1 cache array
  • Use Signatures to conservatively detect conflicts
  • E.g., Bloom filters
  • Small filters sufficient for most transactions
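
A signature of the kind described above can be modeled as a small Bloom filter. The sketch below is a software stand-in, not the hardware design: the `Signature` class, its 64-bit size and the use of SHA-256 as a hash are assumptions chosen for clarity. The property that matters is that a lookup may report a false conflict but never misses an inserted address.

```python
import hashlib


class Signature:
    """Bloom-filter style signature: false positives possible, no false negatives."""

    def __init__(self, bits=64, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.field = 0  # bit vector stored as a Python int

    def _positions(self, addr):
        # Derive `hashes` bit positions from the address (hash choice is arbitrary).
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def insert(self, addr):
        for pos in self._positions(addr):
            self.field |= 1 << pos

    def may_contain(self, addr):
        # True means "possible conflict"; False means "definitely not present".
        return all((self.field >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        self.field = 0


read_sig = Signature()
for block in (0x1040, 0x1080, 0x30C0):
    read_sig.insert(block)
```

A remote write to any inserted block is always flagged; a write to some other block is flagged only if its bits happen to alias, which costs performance but not correctness.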

117
LogTM-SE and Nesting
  • Single hardware signature
  • Save current signature on nested begin
  • On conflict, abort inner transaction and reload
    signature
  • Check if conflict resolved, if not repeat
  • Closed nested commit
  • No change to hardware signature
  • Child merges with parent
  • Open nested commit
  • Restore saved signature from log

118
Virtualizing LogTM-SE
  • Cache overflow
  • Sticky-states or broadcast coherence
  • Ensures conflict detection
  • Filter (conservatively) checks for conflicts
  • Thread suspension/migration
  • Second hardware signature
  • Summarizes suspended transactions
  • OS manages on scheduling events
  • Paging
  • Pageout checks for (potential) conflict, OS saves
    state
  • Pagein updates filters with new physical address
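
One way to picture the second (summary) signature is as the bitwise OR of the signatures of all suspended transactions, recomputed by the OS on scheduling events. The sketch below assumes exactly that; the helper names, the 64-bit width and the single bit-select hash are illustrative and not taken from the slides.

```python
SIG_BITS = 64
BLOCK_SHIFT = 6  # assumes 64-byte blocks


def sig_bit(addr):
    # Single bit-select hash: block number modulo the signature width.
    return 1 << ((addr >> BLOCK_SHIFT) % SIG_BITS)


def make_signature(addrs):
    sig = 0
    for addr in addrs:
        sig |= sig_bit(addr)
    return sig


def summary_signature(suspended_sigs):
    """OR together the signatures of all descheduled transactions."""
    summary = 0
    for sig in suspended_sigs:
        summary |= sig
    return summary


def may_conflict(addr, summary):
    # Conservative check performed on behalf of threads that are not running.
    return bool(summary & sig_bit(addr))


suspended = [make_signature([0x1040]), make_signature([0x30C0])]
summary = summary_signature(suspended)
```

When a suspended thread is rescheduled, the OS would rebuild the summary from the remaining suspended signatures, so stale bits do not linger.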

119
Characterization of Java Middleware
  • ICPP 2005 - Exploring Processor Design Options
    for Java Based Middleware
  • HPCA 2003 - Memory System Behavior of Java-Based
    Middleware

Martin Karlsson, Kevin E. Moore, Erik Hagersten
and David A. Wood
120
Closed Nesting in LogTM
  • Conflict Detection
  • Nested LogTM replicates R/W bits for each level
  • Flash-Or circuit merges child and parent R/W
    bits
  • Version Management
  • Nested LogTM segments the log into frames
  • (similar to a stack of activation records)

[Figure: Processor with Registers, Register Checkpoint, and LogFrame, LogBase, TMcount and LogPtr registers; Data Caches whose lines carry replicated R and W bits plus Tag and Data]
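
The flash-OR merge on a closed nested commit can be sketched as follows. Hardware would merge all cache lines at once; the loop here is only a software stand-in, and the `CacheLineBits` class and `flash_or` function are hypothetical names for the example.

```python
class CacheLineBits:
    """R/W bits for one cache line, replicated once per nesting level."""

    def __init__(self, levels):
        self.r = [False] * levels
        self.w = [False] * levels


def flash_or(lines, child, parent):
    """Closed commit: fold the child's R/W bits into the parent's, then clear them."""
    for line in lines:
        line.r[parent] = line.r[parent] or line.r[child]
        line.w[parent] = line.w[parent] or line.w[child]
        line.r[child] = False
        line.w[child] = False


lines = [CacheLineBits(levels=2) for _ in range(3)]
lines[0].r[1] = True  # child (level 1) read line 0
lines[1].w[1] = True  # child wrote line 1
lines[2].r[0] = True  # parent (level 0) read line 2
flash_or(lines, child=1, parent=0)
```

After the merge the parent's read and write sets cover everything the child touched, which is why the child's accesses stay isolated until the parent commits.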
121
Hardware State
  • R and W bit per cache line
  • track read and write sets
  • Overflow bit
  • Register checkpoint
  • Fast save/restore
  • Log Base and Log Pointer registers
  • TM nesting count

[Figure: Processor with Registers, Register Checkpoint, and LogBase, TMcount and LogPtr registers; Data Cache lines with R, W, Tag and Data fields, plus an Overflow bit]
122
How Do Transactional Memory Systems Differ?
                           Lazy Version Management       Eager Version Management
Lazy Conflict Detection    Databases with Optimistic     Not done (yet)
                           Conc. Ctrl.,
                           Stanford TCC, UIUC Bulk
Eager Conflict Detection   Herlihy/Moss TM, MIT LTM,     Databases with Conservative
                           Intel/Brown VTM               C. Ctrl., MIT UTM,
                                                         Wisconsin LogTM
123
Virtualization Challenge
  • Hardware TM Implementations
  • Finite Hardware Signatures
  • Multiplexed Thread Switching, Virtual Memory
  • LogTM-SE
  • Version Management
  • Transaction Log
  • Virtual Memory
  • Conflict Detection
  • Signatures
  • Physical Addresses

[Callouts: version management is Already Virtualized; conflict detection is Coming up]
124
Open Nesting
Child transaction exposes state on commit
(i.e., before the parent commits)
  • Raise level of abstraction for isolation and
    abort
  • Eliminates semantically unnecessary conflicts
  • Increases concurrency
  • Higher-level isolation
  • Release memory-level isolation
  • Programmer enforces isolation at a higher level
    (e.g., locks)
  • Use commit action to release isolation at parent
    commit
  • Higher-level abort
  • Child's memory updates not undone if parent
    aborts
  • Use compensating action to undo the child's
    forward action at a higher level of abstraction
  • E.g., malloc() compensated by free()

125
Commit and Compensating Actions
  • Commit Actions
  • Execute when innermost open ancestor commits
  • Outermost transaction is considered open
  • Use to release isolation at a higher level of
    abstraction
  • Compensating Actions
  • Discard when innermost open ancestor commits
  • Execute in LIFO order when ancestor aborts
  • Execute in the state that held when its forward
    action committed [Moss, TRANSACT 06]
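
The rules above can be sketched as a pair of lists kept by the innermost open ancestor. This is a minimal model under stated assumptions: the `OpenAncestor` class is hypothetical, commit actions are assumed to run in registration order, and commit actions are assumed to be dropped on abort (the slides specify neither).

```python
class OpenAncestor:
    """Collects actions registered by committed open-nested children."""

    def __init__(self):
        self.commit_actions = []
        self.compensating_actions = []

    def child_open_commit(self, commit_action=None, compensating_action=None):
        if commit_action is not None:
            self.commit_actions.append(commit_action)
        if compensating_action is not None:
            self.compensating_actions.append(compensating_action)

    def commit(self):
        # Run commit actions (assumed FIFO); compensations are discarded.
        for action in self.commit_actions:
            action()
        self.commit_actions.clear()
        self.compensating_actions.clear()

    def abort(self):
        # Run compensating actions in LIFO order; drop commit actions (assumption).
        while self.compensating_actions:
            self.compensating_actions.pop()()
        self.commit_actions.clear()


trace = []
parent = OpenAncestor()
# Two open-nested "malloc" children, each compensated by a "free".
parent.child_open_commit(compensating_action=lambda: trace.append("free a"))
parent.child_open_commit(compensating_action=lambda: trace.append("free b"))
parent.abort()  # trace now records the frees, most recent allocation first
```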

126
Open Nested Example
insert(int key, int value) {
  open_begin;
  leaf = find_leaf(key);
  entry = insert_into_leaf(key, value);
  // lock entry to isolate node
  entry->lock = 1;
  open_commit(abort_action(delete(key)),
              commit_action(unlock(key)));
}

insert_set(set S) {
  open_begin;
  while ((key, value) = next(S))
    insert(key, value);
  open_commit(abort_action(delete_set(S)));
}

  • Isolate entry at higher level of abstraction
  • Delete entry if ancestor aborts
  • Release high-level isolation on ancestor commit
  • Replace compensating action with higher-level
    action on commit
127
Condition O1
Condition O1: An open nested child transaction
never modifies a memory location that has been
modified by any ancestor.
  • If condition O1 holds, programmers need not reason
    about the interaction between compensation and
    undo
  • All implementations of nesting (so far) agree on
    semantics when O1 holds

128
Timing of Compensating Actions
  • LogTM behaves correctly
  • Compensating action sees the state of the counter
    when the open transaction committed (2)
  • Decrement restores the value to what it was
    before the open nest executed (1)
  • Undo of the parent restores the value back to (0)
  • TCC doesn't
  • Counter ends up at 1

// initialize to 0
counter = 0;
transaction_begin();   // top-level 1
counter++;             // counter gets 1
open_begin();          // level 2
counter++;             // counter gets 2
open_commit(abort_action(counter--));
...
// Abort and run compensating action
// Expect counter to be restored to 0
...
transaction_commit();  // not executed

Condition O1: No writes to blocks written by an
ancestor transaction.
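
The LogTM behavior described on this slide can be traced step by step. The sketch below models the log as a Python list holding undo records and one compensating-action record; the function and record names are invented for the example, and only the ordering (compensation first, then the parent's undo) is taken from the slide.

```python
def run_counter_example():
    memory = {"counter": 0}
    log = []       # LIFO log: ("undo", old_value) or ("comp", callable)
    observed = []  # counter values at the points the slide calls out

    def logged_increment():
        log.append(("undo", memory["counter"]))
        memory["counter"] += 1

    def decrement():
        memory["counter"] -= 1

    logged_increment()               # top-level: counter gets 1
    frame_start = len(log)           # open_begin: child's frame starts here
    logged_increment()               # level 2: counter gets 2
    del log[frame_start:]            # open_commit: discard child's frame ...
    log.append(("comp", decrement))  # ... and record the compensating action
    observed.append(memory["counter"])

    while log:                       # abort handler walks the log in LIFO order
        kind, payload = log.pop()
        if kind == "comp":
            payload()                # compensating action runs first
        else:
            memory["counter"] = payload  # then the parent's undo record
        observed.append(memory["counter"])
    return observed


values = run_counter_example()  # counter after open commit, compensation, undo
```

Because the compensating-action record sits above the parent's undo record in the log, the decrement sees the state that held when the open transaction committed.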
129
Open Nesting in LogTM
  • Conflict Detection
  • R/W bits cleared on open commit
  • (no flash-OR)
  • Version Management
  • Open commit pops the most recent frame off the
    log
  • (Optionally) add commit and compensating action
    records
  • Compensating actions are run by the software
    abort handler
  • Software handler interleaves restoration of
    memory state and compensating action execution

130
Open Nested Commit
  • Discard child's log frame
  • (Optionally) append commit and compensating
    actions to log

[Figure: transaction log at an open nested commit, showing Header and Undo record entries with Commit Action and Comp Action records; LogFrame and LogPtr registers; TM count going from 2 to 1]
131
Paging Support
  • Why?
  • Support Large Transactions.
  • What?
  • Physical Relocation of Virtual Pages
  • How?
  • Update Signatures on paging activity

132
Updating Signatures
Suppose Virtual Page (VP) 0x40000 -> Physical Frame (PP) 0x1000
Signature A: 0x1040, 0x1080, 0x30c0
At Page Out: Remember 0x40000 -> 0x1000
At Page In: Suppose 0x40000 -> 0x2000
Signature A: 0x1040, 0x1080, 0x2040, 0x2080, 0x30c0
133
Paging Support Summary
  • Problem
  • Changing page frames
  • Need to maintain isolation on transactional
    blocks
  • Solution
  • On Page-Out
  • Save Virtual -> Physical mapping
  • On Page-In
  • If different page frame, update signatures with
    physical address of transactional blocks in new
    page frame.
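
The page-in step can be sketched by testing every block of the old frame against the signature and inserting the matching block of the new frame. A Python set stands in for the signature here, and the 4 KB page and 64-byte block sizes are assumptions chosen to match the addresses in the preceding example.

```python
PAGE_SIZE = 0x1000  # assumed 4 KB pages
BLOCK_SIZE = 0x40   # assumed 64-byte blocks


def page_in(signature, old_frame, new_frame):
    """Add new-frame addresses for every transactional block of the old frame."""
    if new_frame == old_frame:
        return  # same physical frame: nothing to update
    for offset in range(0, PAGE_SIZE, BLOCK_SIZE):
        if old_frame + offset in signature:
            signature.add(new_frame + offset)


# Virtual page 0x40000 was paged out of frame 0x1000 and comes back in frame 0x2000.
signature_a = {0x1040, 0x1080, 0x30C0}
page_in(signature_a, old_frame=0x1000, new_frame=0x2000)
```

Only membership tests and inserts are used, so the same loop would work with a Bloom-filter signature that cannot enumerate its contents. The old addresses stay in the signature, which is conservative but safe.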

134
The State of the World
  • Chip-multiprocessors/Multi-core/Many-core are
    here
  • Intel has 10 projects in the works that contain
    four or more computing cores per chip -- Paul
    Otellini, Intel CEO, Fall 05
  • GHz race is over
  • Frequency increase limited by heat and power
    constraints
  • Size of processor limited by communication delay,
    not transistors
  • Increasing wire delay on chip
  • All high-performance processors will be CMP
  • Software must become parallel

(in computer architecture)
135
Parallel Programming is Hard!
  • Data races cause subtle bugs
  • Locks are a mess
  • Deadlock
  • Granularity problem
  • Not composable
  • Lock-free solutions still challenging
  • We need a better way to write parallel software

136
Solution: Let the hardware help
  • Provide a better interface for parallel software
  • Plenty of transistors
  • Access to run-time information
  • Transactional Memory
  • Intuitive interface -- serial execution
  • High performance -- run transactions in parallel
    when possible
  • Current cache coherence schemes already do much
    of the work

137
LogTM: Log-Based Transactional Memory
  • Combined Hardware/Software Implementation
  • Conflicts detected in hardware
  • Aborts processed in software
  • Policy-Free Hardware
  • Simple hardware primitives
  • Software-accessible state
  • Supports Transactions with
  • Large memory footprints
  • Thread switching
  • Unbounded nesting
  • Paging

138
Transactional Memory
  • Promising programming technique
  • begin_transaction atomic execution
    end_transaction
  • Good first step
  • Likely benefits
  • Can be integrated into current hardware and
    programming languages
  • Will not save the world

139
Nested Transactions for Software Composition
  • Modules expose interfaces, NOT implementations
  • Example
  • Insert() calls getID() from within a transaction
  • The getID() transaction is nested inside the
    insert() transaction
int getID() {
  // child TX
  begin_transaction();
  id = global_id++;
  commit_transaction();
  return id;
}

void insert(object o) {
  // parent TX
  begin_transaction();
  t.insert(getID(), o);
  commit_transaction();
}
140
Closed Nesting
Child transactions remain isolated until parent
commits
  • On Commit child transaction is merged with its
    parent
  • Flat
  • Nested transactions flattened into a single
    transaction
  • Only outermost begins/commits are meaningful
  • Any conflict aborts to outermost transaction
  • Partial rollback
  • Child transaction can be aborted independently
  • Can avoid costly re-execution of parent
    transaction
  • But child merges transaction state with parent on
    commit
  • So most conflicts with child end up affecting the
    parent

141
Thesis: We need new hardware and software
  • Architects should devote resources to support
    parallelism
  • Manycore will succeed only if we find a way to
    program it (only if software is parallel)
  • Using resources to facilitate parallelism is less
    risky
  • Hardware Primitives Software Solutions
  • HW Implements difficult functions
  • Coordinated by SW

142
Segmented Transaction Log for Nesting
  • LogTM's log is a stack of frames
  • A frame contains
  • Header (including saved registers and pointer to
    parent's frame)
  • Undo records (block address, old value pairs)
  • Garbage headers (headers of committed closed
    transactions)
  • Commit action records
  • Compensating action records

[Figure: transaction log with two frames, each a Header followed by Undo records; LogFrame and LogPtr registers point into the log; TM count shown with values 0, 1 and 2]
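
The frame structure above, together with the closed and open nested commit rules, can be modeled with a list and two registers. This is a rough sketch: the `SegmentedLog` class and its dict-based records are invented for illustration, and register checkpoints and abort handling are left out.

```python
class SegmentedLog:
    """Toy model of a log split into frames, one per nesting level."""

    def __init__(self):
        self.entries = []      # the log: headers, undo records, action records
        self.log_frame = None  # index of the current frame's header (LogFrame)
        self.tm_count = 0      # nesting depth (TMcount)

    def begin(self):
        # New frame: the header remembers where the parent's frame starts.
        self.entries.append({"kind": "header", "parent": self.log_frame})
        self.log_frame = len(self.entries) - 1
        self.tm_count += 1

    def log_undo(self, addr, old_value):
        self.entries.append({"kind": "undo", "addr": addr, "old": old_value})

    def closed_commit(self):
        # Merge with parent: the child's header becomes a garbage header.
        header = self.entries[self.log_frame]
        header["kind"] = "garbage"
        self.log_frame = header["parent"]
        self.tm_count -= 1

    def open_commit(self, commit_action=None, comp_action=None):
        # Discard the child's frame, then optionally append action records.
        parent = self.entries[self.log_frame]["parent"]
        del self.entries[self.log_frame:]
        self.log_frame = parent
        self.tm_count -= 1
        if commit_action is not None:
            self.entries.append({"kind": "commit", "action": commit_action})
        if comp_action is not None:
            self.entries.append({"kind": "comp", "action": comp_action})


log = SegmentedLog()
log.begin()
log.log_undo(0x10, 1)
log.begin()            # closed child
log.log_undo(0x20, 2)
log.closed_commit()    # child's undo records now belong to the parent
log.begin()            # open child
log.log_undo(0x30, 3)
log.open_commit(comp_action="free")
kinds = [entry["kind"] for entry in log.entries]
```

After a closed commit the child's undo records remain in the log, so a later abort of the parent still restores them; after an open commit they are gone and only the action records remain.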
143
Closed Nested Commit
  • Merge child's log frame with parent's
  • Mark child's header as dummy header
  • Copy pointer from child's header to LogFrame

[Figure: transaction log at a closed nested commit, showing two Headers, each followed by Undo records; LogFrame and LogPtr registers; TM count going from 2 to 1]
144
LogTM-SE Signatures
  • Conflict-detection signatures
  • Summarize read and write sets
  • Similar to Bulk [ISCA 2006]
  • Aliasing is a performance issue
  • Results in false conflicts
  • Rare for current apps
  • Version-management signatures
  • Prevent redundant entries in the log
  • Aliasing is a functional issue
  • Results in incorrect abort
  • Use small full-address filter
  • Some redundant log entries

145
LogTM-SE Unbounded Nesting Support
  • Why?
  • Composability: libraries
  • Software constructs: Retry, OrElse [Harris,
    PPoPP 05]
  • What?
  • Signatures for each nesting level
  • How?
  • One R / W signature set per SMT thread
  • Save / Restore signatures using Transaction Log

146
Nested Begin
[Figure: Nested Begin, step 1. Program: xbegin LD ST xbegin. Processor State: R signature 01001000 and W signature 01010010 (each also shown as 00000000), TMCount 1, Log Frame, Log Ptr. Transaction Log: Xact header, three Undo entries, then a second Xact header]
147
Nested Begin
[Figure: Nested Begin, step 2. Program: xbegin LD ST xbegin. Processor State: R signature 01001000, W signature 01010010, TMCount 2, Log Frame, Log Ptr. Transaction Log: Xact header, three Undo entries, then a second Xact header holding the saved signatures 01001000 and 01010010]
148
Partial Abort
[Figure: Partial Abort. Program: xbegin LD ST xbegin LD ST, then ABORT!. Processor State: R signature values 01001001 and 01001000, W signature values 01110110 and 01010010, TMCount values 2 and 1, Log Frame, Log Ptr. Transaction Log: Xact header, three Undo entries, second Xact header holding saved signatures 01001000 and 01010010, two Undo entries]
149
Nested Commit
[Figure: Nested Commit. Program: xbegin LD ST xbegin LD ST xend. Processor State: R signature values 01001001 and 01001000, W signature values 01110110 and 01010010, TMCount values 2 and 1, Log Frame, Log Ptr. Transaction Log: Xact header, three Undo entries, second Xact header holding saved signatures 01001000 and 01010010, two Undo entries]
150
Unbounded Nesting Support Summary
  • Closed nesting
  • Begin: save signatures
  • Abort: restore signatures
  • Commit: no signature action
  • Open nesting
  • Begin: save signatures
  • Abort: restore signatures
  • Commit: restore signatures
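
The save and restore rules in this summary can be sketched with one R/W signature pair and a stack of saved copies standing in for the transaction log. The `NestedSignatures` class is a hypothetical model; the bit patterns reuse the values from the Nested Begin and Partial Abort figures, and popping the saved copy on a closed commit is a simplification of the garbage-header scheme.

```python
class NestedSignatures:
    """One R/W signature pair per thread; older copies live in the log."""

    def __init__(self):
        self.read_sig = 0
        self.write_sig = 0
        self.saved = []  # stand-in for signatures saved in log headers

    def begin(self):
        # Begin (closed or open): save current signatures.
        self.saved.append((self.read_sig, self.write_sig))

    def load(self, bits):
        self.read_sig |= bits

    def store(self, bits):
        self.write_sig |= bits

    def abort(self):
        # Abort: restore the signatures saved at the matching begin.
        self.read_sig, self.write_sig = self.saved.pop()

    def closed_commit(self):
        # Closed commit: no signature action; child's sets merge with parent's.
        self.saved.pop()

    def open_commit(self):
        # Open commit: restore saved signatures, releasing the child's isolation.
        self.read_sig, self.write_sig = self.saved.pop()


sigs = NestedSignatures()
sigs.begin()              # outer xbegin
sigs.load(0b01001000)     # outer LD
sigs.store(0b01010010)    # outer ST
sigs.begin()              # nested xbegin saves both signatures
sigs.load(0b00000001)     # inner LD: R becomes 01001001
sigs.store(0b00100100)    # inner ST: W becomes 01110110
inner = (sigs.read_sig, sigs.write_sig)
sigs.abort()              # partial abort restores the outer signatures
```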

151
Terminology
  • Transaction: A transformation of state that is
  • Atomic (all or nothing),
  • Consistent,
  • Isolated (serializable) and
  • Durable (permanent)

Commit: Successful completion of a transaction
Abort: Unsuccessful termination of a transaction,
requiring that all updates from the transaction
are undone
Conflict: Two transactions conflict if both access
the same object and at least one of the accesses
is an update
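
The conflict definition above translates directly into a check on read and write sets. The function below is a small illustration using exact Python sets; the name `conflicts` and its parameters are made up for the example.

```python
def conflicts(reads_a, writes_a, reads_b, writes_b):
    """True if A and B access a common object and at least one access is an update."""
    return bool(
        writes_a & (reads_b | writes_b)  # A updates something B reads or updates
        or writes_b & reads_a            # B updates something A reads
    )


# Two readers of x do not conflict; a reader and a writer of x do.
read_read = conflicts({"x"}, set(), {"x"}, set())
read_write = conflicts({"x"}, set(), set(), {"x"})
```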