Title: Techniques to Improve Scalability of Transactional Memory systems


1
Techniques to Improve Scalability of
Transactional Memory systems
  • Salil Pant
  • Advisor: Dr. G. Byrd

2
Introduction
  • Shared memory parallel programs need
    synchronization
  • Lock-based synchronization using atomic
    read-modify-write primitives
  • Problems with locks
  • Solution: transactional memory
  • Speculative and optimistic
  • Relieves the programmer of managing synchronization details

3
  • Issues with TM
  • Scalability
  • Contributions
  • Analysis of TM for scalability
  • Value-Predictor
  • Results
  • Proposed work

4
Conventional Synchronization
  • Conservative, blocking, lock-based
  • Atomic read-modify-write primitives
  • Provide atomicity only for a single address.
  • Sync variables exposed to the programmer
  • Programmer orchestrates synchronization
  • Granularity = (no. of shared R/W variables covered) /
    (no. of lock variables)
  • High (>> 1): coarse; low (= 1): fine
  • Fine granularity → more concurrency → better
    performance, as long as the program still runs correctly
    (a minimal sketch of the two extremes follows below)
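A minimal sketch (not from the slides) of these two extremes, assuming a test-and-set spinlock built from a C++ atomic read-modify-write primitive; the data layout and names are illustrative:

```cpp
#include <array>
#include <atomic>

// Test-and-set spinlock built from an atomic read-modify-write primitive.
struct SpinLock {
    std::atomic<bool> locked{false};
    void lock()   { while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { locked.store(false, std::memory_order_release); }
};

constexpr int N = 1024;
std::array<int, N> shared_data{};          // the shared R/W variables

// Coarse grain: one lock covers all N variables (granularity >> 1).
SpinLock coarse_lock;
void coarse_update(int i, int v) {
    coarse_lock.lock();
    shared_data[i] += v;
    coarse_lock.unlock();
}

// Fine grain: one lock per variable (granularity = 1) -> more concurrency,
// but many more locks for the programmer to manage correctly.
std::array<SpinLock, N> fine_locks;
void fine_update(int i, int v) {
    fine_locks[i].lock();
    shared_data[i] += v;
    fine_locks[i].unlock();
}
```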

5
Problems
  • Software
  • Mapping from locks to shared (conflicting) variables
  • Programmers opt for coarse-grain locks
  • Deadlocks, livelocks, starvation, and other issues
    managed by the programmer
  • Blocking sync not good for fault tolerance
  • Hardware
  • Basic test-and-set locks do not scale
  • Software queue-based locks too heavyweight for the
    common case

Fine granularity → lots of locks → hard to
program/debug
6
Transactional Memory
  • Proposed by Herlihy
  • Transactional abstraction -
  • critical sections become transactions
  • ACI (atomicity, consistency, isolation) properties
  • Optimistic speculative execution of critical
    sections
  • Conflicting accesses detected and execution
    rolled back
  • read-write, write-write, write-read
  • Can be implemented by hardware or software

Locks:
    Lock(X); Update(A); Unlock(X)
    Lock(Y); Update(B); Unlock(Y)
Transactions:
    Begin_transaction; Update(A); End_transaction
    Begin_transaction; Update(B); End_transaction

Locks (coarse):
    Lock(X); Lock(Y); Update(A,B); Unlock(Y); Unlock(X)
Transactions:
    Begin_transaction; Update(A,B); End_transaction
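For concreteness, the same refactoring could look as follows using GCC's experimental transactional-memory extension (compile with -fgnu-tm); this is a compiler/software realization of the abstraction, not the hardware design discussed in these slides, and A and B are illustrative shared variables:

```cpp
// Sketch only: critical sections expressed as transactions with g++ -fgnu-tm.
int A = 0;
int B = 0;

void updates_with_transactions() {
    // Formerly: Lock(X); Update(A); Unlock(X)
    __transaction_atomic { A += 1; }

    // Formerly: Lock(X); Lock(Y); Update(A, B); Unlock(Y); Unlock(X)
    __transaction_atomic { A += 1; B += 1; }
}
```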
7
Hardware-supported TM
  • Special instructions to indicate transactional
    accesses (see the sketch after this list)
  • Initialize buffer to save transactional data
  • Checkpoint at the beginning
  • Buffer to log versions of transactional data
  • Special write buffer
  • In-memory log
  • Conflict detection/resolution mechanism
  • via coherence protocol
  • timestamps = {local logical clock, cpu_id}
  • Mechanism to rollback state
  • Hardware to checkpoint processor state
  • ROB-based
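As one concrete analogue of such special instructions (an illustration only, not the ISA assumed in these slides), Intel's RTM intrinsics expose begin/commit operations with hardware checkpointing and rollback; the fallback path and the shared counter below are assumptions, needed because a hardware transaction may always abort:

```cpp
// Sketch: HTM-style special instructions via Intel RTM intrinsics
// (requires an RTM-capable CPU; compile with -mrtm).
#include <immintrin.h>
#include <atomic>

std::atomic<long> counter{0};

void transactional_increment() {
    unsigned status = _xbegin();               // checkpoint state, start transaction
    if (status == _XBEGIN_STARTED) {
        long v = counter.load(std::memory_order_relaxed);   // transactional read
        counter.store(v + 1, std::memory_order_relaxed);    // transactional write
        _xend();                               // commit: updates become visible atomically
    } else {
        // Conflict/capacity abort: hardware discarded all transactional updates.
        counter.fetch_add(1, std::memory_order_relaxed);     // non-speculative fallback
    }
}
```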

8
Hardware TM
  • Additions to the chip (TLR proposal)

9
Advantages
  • Transfers burden to the designer
  • deadlocks, livelocks, starvation freedom etc.
  • Ease of programming
  • Using more transactions does not make programs harder to write
  • Performs better than locks in the common case
  • More concurrency, less overhead
  • Concurrency now depends on size of transaction
  • Non-blocking advantages
  • Can be implemented in software or in hardware
  • We focus mainly on hardware TM

10
Issues with TM
  • TM is an optimistic, speculative synchronization scheme
  • Works best under mild/medium contention
  • How does HTM deal with:
  • Large transaction sizes
  • System calls or I/O inside transactions
  • Processes/threads getting de-scheduled
  • Thread migration

11
Scalability Issue
  • Scalability of TM with increasing number of
    processors
  • Is optimistic execution still beneficial at 32
    processors?
  • Greater overhead with conflicts/aborts compared
    to lock-based sync
  • Memory and processor rollback
  • Network overhead
  • Serialized commit/abort needed to maintain
    atomicity.
  • Transaction sizes predicted to increase
  • Support I/O, system calls within transactions
  • Integrate TM with higher-level programming language
    models

12
Measuring scalability
  • What are we looking for?
  • Application vs. system scalability
  • TM overhead and conflicts
  • Measure speedup for up to 32 processor systems
  • Tourmaline simulator for TM
  • Simple TM system with a timing model for memory
    accesses.
  • Provides version management and conflict detection.
  • Timestamps for conflict resolution
  • Conflicts always abort the younger transaction
    (a sketch of this policy follows below)
  • No network model
  • Added simple backoff
  • Two SPLASH benchmarks were transactified:
    Cholesky and Raytrace
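A minimal sketch of that conflict-resolution and backoff policy; the interfaces (Txn, requester_aborts, backoff) are illustrative assumptions, not the simulator's actual code:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

struct Txn {
    uint64_t timestamp;   // logical start time: smaller value = older transaction
    int      cpu_id;      // tie-breaker so the ordering is total
    int      aborts = 0;  // consecutive aborts, drives the backoff window
};

// True if the requester must abort, i.e. it is younger than the current holder.
bool requester_aborts(const Txn& requester, const Txn& holder) {
    if (requester.timestamp != holder.timestamp)
        return requester.timestamp > holder.timestamp;
    return requester.cpu_id > holder.cpu_id;
}

// Simple randomized exponential backoff before the aborted transaction restarts.
void backoff(Txn& t) {
    static thread_local std::mt19937 rng{std::random_device{}()};
    int window = 1 << std::min(t.aborts, 10);             // cap the window growth
    std::uniform_int_distribution<int> wait(0, window - 1);
    std::this_thread::sleep_for(std::chrono::microseconds(wait(rng)));
    ++t.aborts;
}
```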

13
Queue Micro-benchmark
  • Queue Micro-benchmark for TM
  • 2^10 insert/delete operations
  • Important structure used in the SPLASH benchmarks

14
Micro-benchmark Results
15
Benchmark results
16
Observations
  • Conflicts increase with increasing CPUs
  • TM overhead can lead to slowdown
  • Situation gets worse with increased transaction
    sizes
  • Effect on speedup might be worse with a network
    model in place.
  • How to make TM resilient to conflicts?

17
Value Predictor Idea
  • TM performance degrades with conflicts
  • Certain data structures hard to parallelize
  • No performance difference with TM for such structures

18
  • Serializing data/operations are predictable
    (a stride-predictor sketch follows this list)
  • Pointers: head, tail, etc.
  • Sizes: constant increments/decrements
  • HTM already includes
  • speculative hardware for buffering
  • checkpoint and rollback capability
  • Still reap benefits of TM
  • Allows running transactions in parallel with
    predicted values
  • Such queues used mainly for task/memory
    management purposes
  • Cholesky, Raytrace, Radiosity
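A rough sketch of a stride-based predictor for such a serializing address (e.g. a queue head/tail pointer or a size counter); the table layout and confidence rule are assumptions for illustration, not the exact hardware described later:

```cpp
#include <cstdint>

struct StrideEntry {
    uint64_t addr      = 0;      // the predictable memory address
    uint64_t last      = 0;      // last real (committed) value observed
    int64_t  stride    = 0;      // constant increment/decrement seen so far
    bool     confident = false;  // stride has repeated at least once
};

// Called with each committed (real) value for the address.
void train(StrideEntry& e, uint64_t value) {
    int64_t new_stride = static_cast<int64_t>(value - e.last);
    e.confident = (new_stride == e.stride);   // same stride twice -> predict
    e.stride = new_stride;
    e.last = value;
}

// Called when a nacked transaction asks for a predicted value; the n-th
// waiting transaction receives the n-th future value in the stride sequence.
bool predict(const StrideEntry& e, int nth_waiter, uint64_t& out) {
    if (!e.confident) return false;
    out = e.last + static_cast<uint64_t>(e.stride) * nth_waiter;
    return true;
}
```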

19
Implementation
  • Stride-based, memory-level
  • Base LogTM model
  • In-memory logging of old values during Xn stores
  • Eager conflict detection, timestamps for conflict
    resolution
  • Uses a Nack-based coherence protocol
  • Deadlock detection mechanism
  • Commits are easy
  • Aborts need memory and processor rollback.
  • Nacks are used to trigger the value predictor

20
Implementation
  • Addresses identified as predictable by the
    programmer/compiler.
  • The value predictor initializes an entry for the
    address
  • VP entry, 1 per VP address:
  • Ordered list of real values (2 in our design)
  • Ordered list of predicted values
  • Ordered list of predicted CPUs
  • Fortunately, at most 3 or 4 VP entries have been
    needed so far (a sketch of the entry follows this list).
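A sketch of one such VP entry as a data structure, assuming the sizes stated above (two real values per entry, a handful of entries in total); field names are illustrative:

```cpp
#include <cstdint>
#include <deque>
#include <vector>

struct VPEntry {
    uint64_t address;                     // VP address registered by programmer/compiler
    std::deque<uint64_t> real_values;     // last 2 committed values (oldest first)
    std::vector<uint64_t> pred_values;    // ordered predictions currently outstanding
    std::vector<int>      pred_cpus;      // CPU that received each prediction
};

// At most 3-4 such entries have been needed for the benchmarks studied,
// so a small fully-associative table suffices.
std::vector<VPEntry> vp_table;
```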

21
  • Need an extra buffer to hold predicted data.
  • Only with LogTM
  • Cannot log predicted load value in memory
  • Predictions are checked at commit time
    (see the sketch after this list)
  • Execution does not advance beyond commit until
    verified
  • Needs changes in the coherence protocol
  • More deadlock scenarios
  • Simplifications
  • Address, VP entries
  • Timing of VP
  • Always generate exclusive requests
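A sketch of the commit-time check, with assumed interfaces (PredictedLoad, committed_value_of, and verify_at_commit are illustrative names):

```cpp
#include <cstdint>
#include <vector>

struct PredictedLoad {
    uint64_t address;     // VP address this transaction loaded speculatively
    uint64_t predicted;   // value supplied by the value predictor
};

// Stand-in for reading the committed memory value; a stub for this sketch.
uint64_t committed_value_of(uint64_t /*address*/) { return 0; }

// Returns true if the transaction may commit; false means abort and retry.
bool verify_at_commit(const std::vector<PredictedLoad>& loads) {
    for (const auto& p : loads)
        if (committed_value_of(p.address) != p.predicted)
            return false;   // misprediction: roll back processor and memory state
    return true;            // all predictions correct: commit may proceed
}
```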

22
Implementation
  • LogTM model
[Protocol diagram: directory (state M, owner CPU 1) and CPUs 1-3 exchanging GetX, FGetX, Data, and Nack messages]
23
Implementation
  • Generating predictions

[Protocol diagram: directory state M-1, S-2; value predictor; GetX, FGetX, Nack, Retry, and Pred messages among CPUs 1-3]
24
State after predictions
[Protocol diagram: directory state M-1, S-2-3; value predictor; FGetX, Nack, and Retry messages among CPUs 1-3]
25
Successful predictions
[Protocol diagram: directory state changes from M-1, S-2-3 to M-2, S-3; value predictor; FGetX, Retry, Nack, Unblock, and Result messages among CPUs 1-3]
26
  • Failed Predictions

[Protocol diagram: directory state changes from M-1, S-2-3 to NP, S-3; value predictor; FGetX, Retry, Unblock, and Result messages among CPUs 1-3]
27
Evaluation
  • Microbenchmarks: loop-based, 2^10 Xn each
    (a sketch follows this list)
  • Shared-counter
  • Simple counter, increment by fixed value
  • Queue-based
  • Insert only
  • Random inserts and deletes
  • Simulation platform
  • SIMICS in-order processors (1,2,4,8,16)
  • GEMS (RUBY) memory system
  • Highly optimized LogTM model for experiments
  • Cholesky and Raytrace benchmarks
  • Both contain a linked-list for memory management.
  • Cholesky could not be completely transactified
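A sketch of the shared-counter microbenchmark, assuming the GCC -fgnu-tm extension from the earlier sketch rather than the simulated HTM; the thread count is an illustrative choice:

```cpp
#include <thread>
#include <vector>

long counter = 0;
constexpr int kTotalTxns = 1 << 10;        // 2^10 transactions in total

void worker(int txns) {
    for (int i = 0; i < txns; ++i)
        __transaction_atomic { counter += 1; }   // one tiny transaction per iteration
}

int main() {
    const int nthreads = 4;                // e.g. one of the 1/2/4/8/16 configurations
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back(worker, kTotalTxns / nthreads);
    for (auto& t : threads) t.join();
    return counter == kTotalTxns ? 0 : 1;  // sanity check
}
```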

28
Results
29
Splash Benchmarks
  • Adding directives to support value prediction

[Figures: Cholesky and Raytrace]
30
Splash benchmarks
Table with TM parameters for 16 processors
31
Observations
  • Value predictor can improve speedup without much
    overhead
  • Performance gains with increasing number of
    processors
  • Aborts increase as the number of processors
    increases
  • Is TM scalable?
  • More benchmarks needed

32
Extending the value predictor
  • Improving the simulation model
  • Exploring other types of value predictors
  • Expanding application scope
  • Controlling aggressiveness
  • Adding confidence mechanisms.
  • Reducing hardware complexity of the value
    predictor entry.

33
Proposed Ideas
  • Value predictor not general enough!
  • Need to reduce conflicts
  • Better backoff schemes
  • Centralized transaction scheduler
  • Intelligent backoff times
  • Expose transactions to the directory
  • begin_Xn and end_Xn messages to the directory?
  • Count the number of memory accesses in each transaction
  • Generate a backoff time based on that count
    (a sketch follows this list)
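A sketch of such a count-based backoff computation; the per-access latency constant and all names are assumptions for illustration:

```cpp
#include <chrono>
#include <cstdint>

struct XnRecord {
    uint64_t accesses = 0;   // memory accesses counted between begin_Xn and end_Xn
};

// Backoff grows with the conflicting transaction's size, so a restarted
// transaction waits roughly long enough for the winner to finish.
std::chrono::nanoseconds backoff_time(const XnRecord& winner, int retries) {
    const uint64_t per_access_ns = 50;               // assumed average memory latency
    uint64_t base = winner.accesses * per_access_ns;
    return std::chrono::nanoseconds(base << (retries < 4 ? retries : 4));
}
```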

34
Proposed Ideas contd.
  • Why is this different from other scalability
    research?
  • Recent work by Bobba shows HTM design choices can
    impact performance by almost 80%.
  • Different data/conflict management schemes needed
    for different applications?
  • STM can help, but performance suffers
  • Can we have both lazy and eager version
    management?
  • Is HTM on large systems a good idea?

35
Proposed Ideas contd.
  • Effectiveness of Nacks/Stalls decreases as number
    of processors increases
  • Need stalling mechanism without the overhead of
    deadlocks
  • Stall transactions after restart
  • Use timestamps to avoid starvation
  • Need to understand hardware requirements
  • Verilog model
  • Proposals need hardware evaluation
  • Value predictor
  • Speculative buffer

36
Experiments/Analysis
  • Need better benchmarks
  • Synchronization intensive
  • SPECjbb, STAMP, Java Grande benchmarks
  • Larger transactions
  • Test up to 64 processors
  • Simulations with SIMICS + GEMS

37
  • Contribution
  • Identify scalability bottleneck with TM
  • Value predictor for certain applications
  • Proposal
  • Extending value predictor work
  • Improved backoff schemes
  • Transaction queuing/stalling
  • Hardware evaluation
  • END
  • Questions?

38
(Backup slide: successful-predictions protocol diagram, repeated from slide 25)
39
(No Transcript)
40
  • Overall, TCC's FPGA implementation
  • adds 14% overhead in the control logic, and 29%
    in on-chip memory as compared to a
    non-speculative incarnation of our cache.

41
(No Transcript)