1
Hardware Transactional Memory
  • Royi Maimon
  • Merav Havuv
  • 27/5/2007

2
References
  • M. Herlihy and J. Moss, "Transactional Memory:
    Architectural Support for Lock-Free Data
    Structures."
  • C. Scott Ananian, Krste Asanovic, Bradley C.
    Kuszmaul, Charles E. Leiserson, and Sean Lie,
    "Unbounded Transactional Memory."
  • Hammond, Wong, Chen, Carlstrom, Davis,
    "Transactional Memory Coherence and
    Consistency," June 2004.

3
Today
  • What are transactions?
  • What is Hardware Transactional Memory?
  • Various implementations of HTM

4
Outline
  • Lock-Free
  • Hardware Transactional Memory (HTM)
  • Transactions
  • Cache coherence protocol
  • General Implementation
  • Simulation
  • UTM
  • LTM
  • TCC (briefly)
  • Conclusions

6
Lock-free
  • A shared data structure is lock-free if its
    operations do not require mutual exclusion.
  • If one process is interrupted in the middle of an
    operation, other processes will not be prevented
    from operating on that object.

7
Lock-free (cont)
  • Lock-free data structures avoid common problems
    associated with conventional locking techniques
    in highly concurrent systems:
  • Priority inversion
  • Convoying: when a process holding a lock is
    descheduled, other processes capable of running
    may be unable to progress.
  • Deadlock

8
Priority inversion
  • Priority inversion occurs when a lower-priority
    process is preempted while holding a lock needed
    by higher-priority processes.

9
Deadlock
  • Deadlock: two or more processes are waiting
    indefinitely for an event that can be caused
    only by one of the waiting processes.
  • Let S and Q be two resources:
      P0         P1
      Lock(S)    Lock(Q)
      Lock(Q)    Lock(S)
  • Each process holds one lock and waits forever
    for the lock held by the other.

11
What is a transaction?
  • A transaction is a sequence of memory loads and
    stores executed by a single process that either
    commits or aborts
  • If a transaction commits, all the loads and
    stores appear to have executed atomically
  • If a transaction aborts, none of its stores take
    effect
  • Transaction operations aren't visible until they
    commit or abort
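
The commit/abort behavior described above can be sketched in software. This is only a single-threaded model, not a hardware mechanism and not real atomicity: stores are buffered inside the transaction and reach memory only on commit. All names here (txn_t, txn_store, memory, and so on) are hypothetical and chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MEM_WORDS  16
#define MAX_WRITES 8

static int memory[MEM_WORDS];      /* hypothetical shared memory */

typedef struct {
    size_t addr[MAX_WRITES];       /* locations written tentatively */
    int    value[MAX_WRITES];      /* their tentative values */
    size_t count;
} txn_t;

void txn_begin(txn_t *t) { t->count = 0; }

/* Tentative store: buffered in the transaction, not yet visible in memory[]. */
bool txn_store(txn_t *t, size_t addr, int value) {
    if (addr >= MEM_WORDS) return false;
    for (size_t i = 0; i < t->count; i++) {
        if (t->addr[i] == addr) { t->value[i] = value; return true; }
    }
    if (t->count == MAX_WRITES) return false;   /* write buffer is full */
    t->addr[t->count]  = addr;
    t->value[t->count] = value;
    t->count++;
    return true;
}

/* A load sees the transaction's own tentative stores first.
   Precondition: addr < MEM_WORDS. */
int txn_load(const txn_t *t, size_t addr) {
    for (size_t i = 0; i < t->count; i++) {
        if (t->addr[i] == addr) return t->value[i];
    }
    return memory[addr];
}

/* Commit: every buffered store takes effect. */
void txn_commit(txn_t *t) {
    for (size_t i = 0; i < t->count; i++) memory[t->addr[i]] = t->value[i];
    t->count = 0;
}

/* Abort: none of the buffered stores take effect. */
void txn_abort(txn_t *t) { t->count = 0; }
```

After txn_store followed by txn_abort, memory is unchanged; after txn_store followed by txn_commit, all buffered values appear in memory together.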

12
Transaction properties
  • A transaction satisfies the following properties:
  • Serializability
  • Atomicity
  • These are a simplified version of the traditional
    database ACID properties (Atomicity, Consistency,
    Isolation, and Durability)

13
Transactional Memory
  • A new multiprocessor architecture
  • The goal: make lock-free synchronization as
    efficient and easy to use as conventional
    techniques based on mutual exclusion
  • Implemented by straightforward extensions to
    multiprocessor cache-coherence protocols.

14
An Example
  • Locks:
      if (i < j) { a = i; b = j; }
      else       { a = j; b = i; }
      Lock(L[a]);
      Lock(L[b]);
      Flow[i] = Flow[i] - X;
      Flow[j] = Flow[j] + X;
      Unlock(L[b]);
      Unlock(L[a]);
  • Transactional Memory:
      StartTransaction;
      Flow[i] = Flow[i] - X;
      Flow[j] = Flow[j] + X;
      EndTransaction;
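
The lock-based half of this example can be written as compilable C. The sketch assumes C11 atomic_flag spinlocks as a stand-in for Lock and Unlock, a fixed array size NFLOWS, and the name transfer_locked; none of these come from the slide. It also handles the case i == j, which the slide leaves out, because taking the same spinlock twice would never return.

```c
#include <stdatomic.h>

#define NFLOWS 4                       /* assumed array size */

static int flow[NFLOWS];
static atomic_flag lock_of[NFLOWS] = {
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

/* Spin until the flag was previously clear, i.e. until we own the lock. */
static void lock(int k)   { while (atomic_flag_test_and_set(&lock_of[k])) { } }
static void unlock(int k) { atomic_flag_clear(&lock_of[k]); }

/* Move x units from flow[i] to flow[j].
   Locks are always taken in ascending index order, so two concurrent
   transfers over the same pair cannot deadlock each other. */
void transfer_locked(int i, int j, int x) {
    int a = (i < j) ? i : j;           /* lower index: locked first   */
    int b = (i < j) ? j : i;           /* higher index: locked second */
    lock(a);
    if (b != a) lock(b);
    flow[i] -= x;
    flow[j] += x;
    if (b != a) unlock(b);
    unlock(a);
}
```

The transactional version needs none of the ordering logic, which is the point the slide makes by placing the two side by side.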

15
Transactional Memory
  • Transactions execute in commit order

[Figure: timeline (axis: Time) of transactions touching address 0xbeef]

17
Cache-Coherence Protocol
  • A protocol for managing the caches of a
    multiprocessor system so that:
  • No data is lost
  • No data is overwritten before it is transferred
    from a cache to the target memory.
  • In a multiprocessor, each processor may have
    its own memory cache that is separate from the
    shared memory

18
The Problem (Cache-Coherence)
  • The problem can be solved in either of two ways:
  • a directory-based system
  • a snooping system

19
Snoopy Cache
  • All caches watch (snoop on) the activity on a
    global bus to determine whether they have a copy
    of the block of data that is requested on the bus.
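
A toy model can make the snooping idea concrete. The sketch assumes a write-invalidate policy and tracks a single block per cache; the slide does not say which policy is used, and every name here (line_state, snoop, cache_read, cache_write) is hypothetical.

```c
#include <stdbool.h>

#define NCACHES 4

typedef enum { LINE_INVALID, LINE_VALID, LINE_DIRTY } line_state;

/* One tracked block per cache keeps the sketch short. */
static line_state line[NCACHES];
static int        cached[NCACHES];
static int        main_memory;         /* backing store for that block */

/* Every cache except the requester observes (snoops) the bus cycle. */
static void snoop(int requester, bool is_write) {
    for (int c = 0; c < NCACHES; c++) {
        if (c == requester || line[c] == LINE_INVALID) continue;
        if (line[c] == LINE_DIRTY) {   /* holder supplies the latest data */
            main_memory = cached[c];
            line[c] = LINE_VALID;
        }
        if (is_write) line[c] = LINE_INVALID;   /* assumed write-invalidate */
    }
}

int cache_read(int c) {
    if (line[c] == LINE_INVALID) {     /* miss: put a read on the bus */
        snoop(c, false);
        cached[c] = main_memory;
        line[c] = LINE_VALID;
    }
    return cached[c];
}

void cache_write(int c, int value) {
    snoop(c, true);                    /* other copies are invalidated */
    cached[c] = value;
    line[c] = LINE_DIRTY;
}
```

A write by one cache invalidates the copies held elsewhere, and a later read by another cache obtains the written value through the bus.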

20
Directory-based
  • The data being shared is placed in a common
    directory that maintains the coherence between
    caches.
  • The directory acts as a filter through which the
    processor must ask permission to load an entry
    from the primary memory to its cache.
  • When an entry is changed the directory either
    updates or invalidates the other caches with that
    entry.

22
How Does It Work?
  • The following primitive instructions for
    accessing memory are provided:
  • Load-transactional (LT): reads the value of a
    shared memory location into a private register.
  • Load-transactional-exclusive (LTX): like LT, but
    hints that the location is likely to be
    modified.
  • Store-transactional (ST): tentatively writes a
    value from a private register to a shared memory
    location.
  • Commit (COMMIT)
  • Abort (ABORT)
  • Validate (VALIDATE): tests the current
    transaction status.

23
Some definitions
  • Read set: the set of locations read by LT in a
    transaction
  • Write set: the set of locations accessed by LTX
    or ST in a transaction
  • Data set (footprint): the union of the read and
    write sets.
  • A set of values in memory is inconsistent if it
    couldn't have been produced by any serial
    execution of transactions

24
Intended Use
  • Instead of acquiring a lock, executing the
    critical section, and releasing the lock, a
    process would:
  • (1) use LT or LTX to read from a set of locations,
  • (2) use VALIDATE to check that the values read
    are consistent,
  • (3) use ST to modify a set of locations,
  • (4) use COMMIT to make the changes permanent.
  • If either the VALIDATE or the COMMIT fails, the
    process returns to step (1).
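
The read, validate, modify, commit loop can be sketched as follows. The real LT, VALIDATE, ST, and COMMIT are hardware instructions; the sim_ functions here are hypothetical single-threaded stand-ins, and injected_conflicts is a test hook invented for this sketch that forces a given number of attempts to fail.

```c
#include <stdbool.h>

static int  shared_counter;        /* the shared location being updated */
static int  tentative_value;       /* value written by the stand-in ST */
static bool has_tentative;
static bool status_ok = true;      /* false once a conflict was observed */
static int  injected_conflicts;    /* test hook: attempts that must fail */

static int sim_LT(void) {
    if (injected_conflicts > 0) { injected_conflicts--; status_ok = false; }
    return shared_counter;
}

static bool sim_VALIDATE(void) {
    bool ok = status_ok;
    if (!ok) { has_tentative = false; status_ok = true; }  /* reset for retry */
    return ok;
}

static void sim_ST(int value) { tentative_value = value; has_tentative = true; }

static bool sim_COMMIT(void) {
    bool ok = status_ok;
    if (ok && has_tentative) shared_counter = tentative_value;
    has_tentative = false;
    status_ok = true;
    return ok;
}

/* Increment the shared counter; returns the number of attempts needed. */
int increment_counter(void) {
    int attempts = 0;
    for (;;) {
        attempts++;
        int v = sim_LT();                  /* step 1: read            */
        if (!sim_VALIDATE()) continue;     /* step 2: check, or retry */
        sim_ST(v + 1);                     /* step 3: tentative write */
        if (sim_COMMIT()) return attempts; /* step 4: make permanent  */
    }
}
```

With two injected conflicts the third attempt succeeds, and the counter is incremented exactly once.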

25
Implementation
  • Transactional memory is implemented by modifying
    standard multiprocessor cache coherence protocols
  • We describe here how to extend a snoopy cache
    protocol for a shared bus to support
    transactional memory
  • Our transactions are short-lived activities with
    relatively small data sets.

26
The basic idea
  • Any protocol capable of detecting accessibility
    conflicts can also detect transaction conflicts
    at no extra cost
  • Once a transaction conflict is detected, it can
    be resolved in a variety of ways

27
Implementation
  • Each processor maintains two caches:
  • A regular cache for non-transactional operations
  • A transactional cache for transactional
    operations. It holds all the tentative writes,
    without propagating them to other processors or
    to main memory (until commit)
  • Why use two caches?

28
Cache line states
  • Each cache line (regular or transactional) has
    one of the following states: INVALID, VALID,
    DIRTY, or RESERVED
  • The transactional cache extends these states
    with a transactional tag: EMPTY, NORMAL,
    XCOMMIT, or XABORT

29
Cleanup
  • When the transactional cache needs space for a
    new entry, it searches for, in order:
  • an EMPTY entry,
  • if none is found, a NORMAL entry,
  • and finally an XCOMMIT entry.
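
The search order above amounts to a small priority scan. In this sketch the function name find_replacement and the array-of-tags representation are assumptions; TAG_XABORT is a tag name taken from the Herlihy and Moss paper rather than from this slide, and stands for an entry that is never chosen.

```c
#include <stddef.h>

typedef enum { TAG_EMPTY, TAG_NORMAL, TAG_XCOMMIT, TAG_XABORT } tx_tag;

/* Returns the index of the entry to reuse, or -1 if no entry qualifies.
   Preference order: EMPTY first, then NORMAL, then XCOMMIT. */
int find_replacement(const tx_tag *tags, size_t n) {
    static const tx_tag preference[] = { TAG_EMPTY, TAG_NORMAL, TAG_XCOMMIT };
    for (size_t p = 0; p < sizeof preference / sizeof preference[0]; p++) {
        for (size_t i = 0; i < n; i++) {
            if (tags[i] == preference[p]) return (int)i;
        }
    }
    return -1;
}
```

For a cache holding XCOMMIT, NORMAL, XABORT, EMPTY in that order, the EMPTY entry at index 3 is selected even though it comes last.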

30
Processor actions
  • Each processor maintains two flags:
  • The transaction active (TACTIVE) flag indicates
    whether a transaction is in progress
  • The transaction status (TSTATUS) flag indicates
    whether that transaction is active (True) or
    aborted (False)
  • Non-transactional operations behave exactly as in
    the original cache-coherence protocol

31
Example LT operation
[Figure: flowchart of an LT operation with "Found?" / "Not found?" branches, a cache-miss step, and two outcomes: successful read and unsuccessful read]

32
Snoopy cache actions
  • Both the regular cache and the transactional
    cache snoop on the bus.
  • A cache ignores any bus cycles for lines not in
    that cache.
  • The transactional cache's behavior:
  • If TSTATUS = False, or if the operation isn't
    transactional, the cache acts just like the
    regular cache, but ignores entries with a state
    other than NORMAL
  • On an LT by another CPU, if the state is VALID,
    the cache returns the value; for all other
    transactional operations it returns BUSY

34
Simulation
  • We'll see example code for the
    producer/consumer algorithm using the
    transactional memory architecture.
  • The simulation runs on both cache coherence
    protocols: snoopy and directory-based.
  • The simulation uses 32 processors
  • The simulation finishes when 2^16 operations
    have completed.

35
Part Of Producer/Consumer Code
typedef struct {
    Word deqs;               /* holds the head's index */
    Word enqs;               /* holds the tail's index */
    Word items[QUEUE_SIZE];
} queue;

/* LT, LTX, ST and COMMIT are the transactional primitives.
   Word, QUEUE_SIZE, QUEUE_EMPTY, BACKOFF_MIN and BACKOFF_MAX
   are defined elsewhere in the program. */
unsigned queue_deq(queue *q) {
    unsigned head, tail, result;
    unsigned backoff = BACKOFF_MIN;
    unsigned wait;
    while (1) {
        result = QUEUE_EMPTY;
        tail = LTX(&q->enqs);
        head = LTX(&q->deqs);
        if (head != tail) {              /* queue not empty? */
            result = LT(&q->items[head % QUEUE_SIZE]);
            ST(&q->deqs, head + 1);      /* advance counter */
        }
        if (COMMIT()) break;
        /* abort => backoff */
        wait = random() % (01 << backoff);
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
    return result;
}

36
The results

37
So Far
  • In both HTM and STM, transactions shouldn't
    touch many memory locations
  • There is a (small) bound on a transaction's
    footprint
  • In addition, there is a duration limit.

39
Unbounded Transactional Memory (UTM)
  • UTM, a new thesis, supports transactions of
    arbitrary footprint and duration.
  • The UTM architecture allows:
  • transactions as large as virtual memory
  • transactions of unlimited duration
  • transactions that can migrate between processors
  • UTM supports a semantics for nested transactions
  • In contrast to previous HTM implementations, UTM
    is optimized for transactions below a certain
    size but still operates correctly for larger
    transactions

40
The Goal of UTM
  • The primary goal: make concurrent programming
    easier.
  • Reduce implementation overhead.
  • Why do we want unbounded TM? Neither programmers
    nor compilers can easily cope with an imposed
    hard limit on transaction size.

41
UTM architecture
  • The transaction log: a data structure that
    maintains bookkeeping information for a
    transaction
  • Why is it needed?
  • It enables transactions to survive time-slice
    interrupts
  • It enables process migration from one processor
    to another.

42
Two new instructions
  • All the programmer must specify is where a
    transaction begins and ends
  • XBEGIN pc
  • Begins a new transaction. The entry point of an
    abort handler is specified by pc.
  • If the transaction must fail, the processor and
    memory state are rolled back to what they were
    when XBEGIN was executed, and execution jumps
    to pc.
  • We can think of an XBEGIN instruction as a
    conditional branch to the abort handler.
  • XEND
  • Ends the current transaction. If XEND completes,
    the transaction is committed and appears atomic.
  • Nested transactions are subsumed into outer
    transaction.
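
The "conditional branch to the abort handler" view of XBEGIN has a rough software analogue in C's setjmp and longjmp. This is only an analogy under stated assumptions: memory is a small array, the snapshot is a full copy, the abort handler simply retries, and forced_aborts is a hypothetical test hook. None of this reflects how UTM hardware actually saves state.

```c
#include <setjmp.h>
#include <string.h>

#define MEM_WORDS 4

static int     mem[MEM_WORDS];        /* stands in for memory state */
static int     saved_mem[MEM_WORDS];  /* snapshot taken at "XBEGIN" */
static jmp_buf abort_pc;              /* stands in for the handler address */

/* Roll memory back to the snapshot and jump to the abort handler. */
static void xabort(void) {
    memcpy(mem, saved_mem, sizeof mem);
    longjmp(abort_pc, 1);
}

/* Stores 2 * value into mem[0] inside a "transaction" that is forced to
   abort forced_aborts times. Returns the number of attempts made. */
int double_and_store(int value, int forced_aborts) {
    volatile int attempts = 0;        /* volatile: survives longjmp */
    volatile int aborts_left = forced_aborts;

    memcpy(saved_mem, mem, sizeof mem);   /* XBEGIN: snapshot the state  */
    setjmp(abort_pc);                     /* abort handler: retry here   */
    attempts++;

    mem[0] = value + value;               /* transaction body            */
    if (aborts_left > 0) {
        aborts_left--;
        xabort();                         /* rolled back, then retried   */
    }
    return attempts;                      /* XEND: changes are kept      */
}
```

Calling double_and_store(21, 2) makes three attempts and leaves 42 in mem[0]; each aborted attempt first restores the value mem[0] had at the start.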

43
Transaction Semantics
  • Transaction A:
      L1: XBEGIN L1
          ADD R1, R1, R1
          ST 1000, R1
          XEND
  • Transaction B:
      L2: XBEGIN L2
          ADD R1, R1, R1
          ST 2000, R1
          XEND
  • Two transactions:
  • A has an abort handler at L1
  • B has an abort handler at L2
  • Here, the abort handler is a very simplistic
    retry.

44
Register renaming
  • A name dependence occurs when two instructions
    Inst1 and Inst2 use the same register (or memory
    location), but there is no data transmitted
    between Inst1 and Inst2.
  • If the register is renamed so that Inst1 and
    Inst2 do not conflict, the two instructions can
    execute simultaneously or be reordered.
  • This technique, which eliminates name
    dependences in registers, is called register
    renaming.
  • Register renaming can be done statically (by the
    compiler) or dynamically (by hardware).

45
Rolling back processor state
  • After an XBEGIN instruction, we take a snapshot
    of the rename table
  • To keep track of busy registers, we maintain an S
    (saved) bit for each physical register to
    indicate which registers are part of the active
    transaction, and we include the S bits in every
    rename-table snapshot
  • An active transaction's abort handler address,
    nesting depth, and snapshot are part of its
    transactional state.

46
Memory State
  • UTM represents the set of active transactions
    with a single data structure held in system
    memory, the x-state (short for transaction
    state).

47
Xstate Implementation
  • The x-state contains a transaction log for each
    active transaction in the system.
  • Each log consists of:
  • A commit record that maintains the transaction's
    status: pending, committed, or aborted
  • A vector of log entries, each corresponding to a
    memory block that the transaction has read or
    written. Each entry provides:
  • a pointer to the block
  • the block's old value (for rollback)
  • a pointer to the commit record
  • pointers that form a linked list of all entries
    in all transaction logs that refer to the same
    block (the reader list)

48
Xstate Implementation (Cont)
  • The final part of the x-state consists of, for
    each memory block:
  • a log pointer
  • a read-write (RW) bit
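
The x-state pieces listed on the last two slides map naturally onto C structs. This is a minimal sketch under several assumptions: blocks hold a single int, a log has a fixed capacity, new values are written in place while the old value is logged, and the reader list is declared but not maintained. All type and function names are hypothetical.

```c
#include <stddef.h>

typedef enum { TX_PENDING, TX_COMMITTED, TX_ABORTED } tx_status;
typedef struct { tx_status status; } commit_record;

struct log_entry;

typedef struct {
    int               value;      /* the block's current contents           */
    struct log_entry *log_ptr;    /* NULL when no pending transaction       */
    char              rw;         /* 'R' or 'W'                             */
} mem_block;

typedef struct log_entry {
    mem_block        *block;        /* pointer to the block                 */
    int               old_value;    /* the block's old value (for rollback) */
    commit_record    *commit;       /* pointer to the commit record         */
    struct log_entry *next_reader;  /* reader list (unused in this sketch)  */
} log_entry;

#define MAX_ENTRIES 8
typedef struct {
    commit_record record;
    log_entry     entries[MAX_ENTRIES];
    size_t        count;
} transaction_log;

void tx_begin(transaction_log *log) {
    log->record.status = TX_PENDING;
    log->count = 0;
}

/* Log the old value, then update the block in place. Returns 0 if full. */
int tx_write(transaction_log *log, mem_block *b, int value) {
    if (log->count == MAX_ENTRIES) return 0;
    log_entry *e = &log->entries[log->count++];
    e->block = b;
    e->old_value = b->value;
    e->commit = &log->record;
    e->next_reader = NULL;
    b->value = value;
    b->log_ptr = e;
    b->rw = 'W';
    return 1;
}

/* Drop the log; on abort (restore != 0) put the old values back, newest first. */
static void release_entries(transaction_log *log, int restore) {
    while (log->count > 0) {
        log_entry *e = &log->entries[--log->count];
        if (restore) e->block->value = e->old_value;
        e->block->log_ptr = NULL;
        e->block->rw = 'R';
    }
}

void tx_commit(transaction_log *log) {
    log->record.status = TX_COMMITTED;   /* a single status change commits */
    release_entries(log, 0);             /* cleanup */
}

void tx_abort(transaction_log *log) {
    log->record.status = TX_ABORTED;
    release_entries(log, 1);             /* rollback, then cleanup */
}
```

In this model, commit is one status change followed by cleanup, while abort walks the log to restore old values.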

49
X-state Data Structure
[Figure: the x-state next to application memory. A transaction log holds a commit record and transaction log entries (old value, block pointer, reader list, commit record pointer), shown for one written (W) and one read (R) block; each application memory block carries a log pointer and an RW bit.]

50
More on x-state
  • When a processor references a block that is
    already part of a pending transaction, the system
    checks the RW bit and log pointer to determine
    the correct action:
  • use the old value
  • use the new value
  • abort the transaction

51
Commit action
[Figure: the x-state data structure diagram from slide 49, illustrating the commit action]

52
Cleanup action
[Figure: the x-state data structure diagram from slide 49, illustrating the cleanup action]

53
Abort action
[Figure: the x-state data structure diagram from slide 49, illustrating the abort action]

54
Transactions Conflict
  • A conflict: two or more pending transactions
    have accessed a block and at least one of the
    accesses is a write.
  • When performing a transactional load:
  • check that the log pointer refers to an entry in
    the current transaction's log or that the RW bit
    is R.
  • When performing a transactional store:
  • check that the log pointer references no other
    transaction
  • In case of a conflict, some of the conflicting
    transactions are aborted.
  • Which transaction should be aborted?
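
The two checks above can be written as predicates over a block's metadata. This sketch simplifies heavily: the log pointer is reduced to the id of the transaction whose log it points into, a null log pointer (NO_TX) is assumed to mean the block is free, and the reader list is not modeled, so only the transaction at the head of the list is visible. The names block_meta, load_allowed, and store_allowed are hypothetical.

```c
#include <stdbool.h>

#define NO_TX (-1)                 /* assumed encoding of a null log pointer */

typedef struct {
    int  log_owner;                /* transaction whose log the log pointer
                                      refers to, or NO_TX                    */
    char rw;                       /* 'R' or 'W'                             */
} block_meta;

/* Load: fine if the block is free, is ours, or is only being read. */
bool load_allowed(const block_meta *b, int tx) {
    return b->log_owner == NO_TX || b->log_owner == tx || b->rw == 'R';
}

/* Store: fine only if no other transaction's log refers to the block. */
bool store_allowed(const block_meta *b, int tx) {
    return b->log_owner == NO_TX || b->log_owner == tx;
}
```

When either predicate returns false there is a conflict, and the policy for picking the transaction to abort is left open, as on the slide.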

55
Caching
  • For a small transaction that fits in the cache,
    UTM, like earlier proposed HTM systems, uses the
    cache coherence protocol to identify conflicts
  • For transactions too big to fit in the cache,
    the x-state for the transaction overflows into
    the ordinary memory hierarchy
  • Most log entries don't need to be created
  • A transaction log is only created when the
    transaction runs out of physical memory.

56
UTM's Goal
  • Support transactions that:
  • run for an indefinite length of time
  • migrate from one processor to another
  • have footprints bigger than the physical memory.
  • The main technique we propose is to treat the
    x-state as a systemwide data structure that uses
    global virtual addresses

57
Benefits and Limits of UTM
  • Limits
  • Complicated implementation
  • Benefits
  • Unlimited footprint
  • Unlimited duration
  • Migration possible
  • Good performance in the common case (small
    transactions)

59
LTM: Visible, Large, Frequent, Scalable
  • Large Transactional Memory
  • Not truly unbounded, but simple and cheap
  • Minimal architectural changes, high performance
  • Small modifications to cache and processor core
  • No changes to main memory, cache coherence
    protocol
  • Can be pin-compatible with conventional
    processors

60
LTM's Restrictions
  • A transaction's footprint is limited to (nearly)
    the size of physical memory.
  • A transaction's duration must be less than a
    time slice.
  • Transactions cannot migrate between processors.
  • With these restrictions, we can implement LTM by
    modifying only the cache and processor core

61
LTM vs UTM
  • Like UTM, LTM maintains data about pending
    transactions in the cache and detects conflicts
    using the cache coherency protocol
  • Unlike UTM, LTM does not treat the transaction as
    a data structure. Instead, it binds a transaction
    to a particular cache.
  • Transactional data overflows from the cache into
    a hash table in main memory
  • LTM and UTM have similar semantics: the XBEGIN
    and XEND instructions are the same
  • In LTM, the cache plays a major part

62
Addition to Cache
  • LTM adds a bit (T) per cache line to indicate
    that the data has been accessed as part of a
    pending transaction.
  • An additional bit (O) is added per cache set to
    indicate that it has overflowed.
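
A small software model can show how the T and O bits might interact with an overflow table. The sketch makes several assumptions that are not from the slides: the cache has one line per set, tags are full addresses, the overflow "hash table" is a small array searched linearly, every access is transactional, and running out of overflow slots is reported as failure. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NSETS          4
#define OVERFLOW_SLOTS 16
#define MEM_WORDS      64

typedef struct { bool valid, t; unsigned tag; int data; } line_t;
typedef struct { bool o; line_t line; } set_t;        /* one way per set */
typedef struct { bool used; unsigned key; int data; } slot_t;

static int    memory[MEM_WORDS];
static set_t  cache[NSETS];
static slot_t overflow[OVERFLOW_SLOTS];

static slot_t *find_slot(unsigned key) {
    for (int i = 0; i < OVERFLOW_SLOTS; i++)
        if (overflow[i].used && overflow[i].key == key) return &overflow[i];
    return NULL;
}

static bool spill(unsigned key, int data) {
    slot_t *s = find_slot(key);
    for (int i = 0; !s && i < OVERFLOW_SLOTS; i++)
        if (!overflow[i].used) s = &overflow[i];
    if (!s) return false;                     /* overflow table is full */
    *s = (slot_t){ true, key, data };
    return true;
}

/* Bring addr into its set as a transactional line; NULL on failure. */
static line_t *tx_line(unsigned addr) {
    if (addr >= MEM_WORDS) return NULL;
    set_t  *set = &cache[addr % NSETS];
    line_t *l   = &set->line;
    if (l->valid && l->tag == addr) { l->t = true; return l; }   /* hit */
    int incoming = memory[addr];
    if (set->o) {                             /* miss in an overflowed set */
        slot_t *s = find_slot(addr);
        if (s) { incoming = s->data; s->used = false; }
    }
    if (l->valid && l->t) {                   /* evicting transactional data */
        if (!spill(l->tag, l->data)) return NULL;
        set->o = true;
    }
    *l = (line_t){ true, true, addr, incoming };
    return l;
}

bool tx_load(unsigned addr, int *out) {
    line_t *l = tx_line(addr);
    if (l) *out = l->data;
    return l != NULL;
}

bool tx_store(unsigned addr, int value) {
    line_t *l = tx_line(addr);
    if (l) l->data = value;
    return l != NULL;
}

/* Commit: transactional lines and spilled entries become permanent. */
void tx_commit(void) {
    for (int i = 0; i < NSETS; i++) {
        line_t *l = &cache[i].line;
        if (l->valid && l->t) { memory[l->tag] = l->data; l->t = false; }
        cache[i].o = false;
    }
    for (int i = 0; i < OVERFLOW_SLOTS; i++)
        if (overflow[i].used) {
            memory[overflow[i].key] = overflow[i].data;
            overflow[i].used = false;
        }
}
```

Addresses 1, 5, and 9 share a set in this model, so touching all three in one transaction forces spills into the overflow table and sets that set's O bit, loosely mirroring the three-address sequence on the following slides.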

63
Cache overflow mechanism
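[Figure: a cache with columns O, T, Tag, Data and an overflow hash table with columns Key, Data, stepping through this instruction sequence:]
    ST 1000, 55
    XBEGIN L1
    LD R1, 1000
    ST 2000, 66
    ST 3000, 77
    LD R1, 1000
    XEND
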
65
Cache overflow recording reads
[Figure: same cache, overflow hash table, and instruction sequence as on slide 63]

66
Cache overflow recording writes
[Figure: same cache, overflow hash table, and instruction sequence as on slide 63]

67
Cache overflow spilling
[Figure: same cache, overflow hash table, and instruction sequence as on slide 63]

68
Cache overflow miss handling
[Figure: same cache, overflow hash table, and instruction sequence as on slide 63]

69
LTM - Summary
  • Transactions as large as physical memory
  • Scalable overflow and commit
  • Easy to implement!
  • Low overhead

71
Transactional Memory Coherence and Consistency (TCC)
  • Hammond, Wong, Chen, Carlstrom, Davis (Jun
    2004). Transactional Memory Coherence and
    Consistency
  • All transactions, all the time! Code partitioned
    into transactions by programmer or tools
  • Possibly at run-time, for legacy code!
  • All writes buffered in caches, CPUs arbitrate
    system-wide for which one gets to commit
  • Updates broadcast to all CPUs. CPUs detect
    conflicts of their transactions and abort
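
The commit-time broadcast can be sketched with bitmask read and write sets. The assumptions here are that cache lines are numbered 0 to 31, that arbitration has already picked the committing CPU, and that another CPU is in conflict exactly when it has read a line the committer wrote. The names cpu_tx, tx_read, tx_write, and tcc_commit are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 4

typedef struct {
    uint32_t read_set;     /* bit k set: line k was read in this transaction */
    uint32_t write_set;    /* bit k set: line k has a buffered write         */
    bool     aborted;      /* set when a broadcast commit caused a restart   */
} cpu_tx;

static cpu_tx cpu[NCPUS];

/* Precondition for both: line < 32. */
void tx_read(int c, unsigned line)  { cpu[c].read_set  |= UINT32_C(1) << line; }
void tx_write(int c, unsigned line) { cpu[c].write_set |= UINT32_C(1) << line; }

/* CPU `winner` has won arbitration and broadcasts its write set.
   Returns the number of other CPUs that detected a conflict and aborted. */
int tcc_commit(int winner) {
    uint32_t written = cpu[winner].write_set;
    int aborted = 0;
    for (int c = 0; c < NCPUS; c++) {
        if (c == winner) continue;
        if (cpu[c].read_set & written) {     /* it read data now overwritten */
            cpu[c] = (cpu_tx){ 0, 0, true }; /* discard buffered work        */
            aborted++;
        }
    }
    cpu[winner] = (cpu_tx){ 0, 0, false };   /* start the next transaction   */
    return aborted;
}
```

A CPU that has only written a line the committer wrote, without reading it, is not aborted in this model, because its own buffered write will simply be broadcast when it commits later.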

72
TCC Implementation
[Figure: a TCC node. Labeled parts: CPU core (loads + stores), local cache hierarchy (line fields r, m, V, tag, data), write buffer (stores only), commit control, and snooping and commits paths to the broadcast bus or network.]

73
Conclusions
  • Unbounded, scalable, and efficient Transactional
    Memory systems can be built.
  • Support large, frequent, and concurrent
    transactions
  • Allow programmers to (finally!) use our parallel
    systems!
  • Three architectures:
  • LTM: easy to realize, almost unbounded
  • UTM: truly unbounded
  • TCC: high performance

74
THE END
  • Royi Maimon
  • Merav Havuv