Title: Hardware Transactional Memory for GPU Architectures*

1
Hardware Transactional Memory for GPU
Architectures
  • Wilson W. L. Fung
  • Inderpreet Singh
  • Andrew Brownsword
  • Tor M. Aamodt
  • University of British Columbia
  • In Proc. of the 2011 ACM/IEEE Int'l Symp. on
    Microarchitecture (MICRO-44)

2
Motivation
  • Lifetime of GPU Application Development

3
Talk Outline
  • What we mean by "GPU" in this work
  • Data synchronization on GPUs
  • What is Transactional Memory (TM)?
  • TM is compatible with OpenCL...
  • ...but is TM compatible with GPU hardware?
  • KILO TM: A Hardware TM for GPUs
  • Results

4
What is a GPU (in this work)?
  • GPU is an NVIDIA/AMD-like compute accelerator
  • SIMD HW + aggressive memory subsystem =>  high
    compute throughput and efficiency
  • Non-graphics APIs: OpenCL, DirectCompute, CUDA
  • Programming model: hierarchy of scalar threads
  • Today: limited communication & synchronization
[Diagram: a kernel is a grid of work groups (thread
blocks) of scalar threads; global memory is shared
by all threads, shared (local) memory is per-block,
and a barrier synchronizes threads within a block;
sketched in code below]
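
To make the hierarchy concrete, here is a minimal
CUDA sketch (the kernel, names, and sizes are
illustrative, not from the talk): a kernel launches
a grid of thread blocks of scalar threads, each
block sharing local memory and synchronizing at a
barrier.

#include <cstdio>

// Illustrative kernel: each thread block reverses its tile of the
// input using shared (local) memory and an intra-block barrier.
__global__ void reverseTile(const int* in, int* out) {
    __shared__ int tile[256];                    // per-block shared memory
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[g];                   // each scalar thread loads one element
    __syncthreads();                             // barrier across the work group
    out[g] = tile[blockDim.x - 1 - threadIdx.x];
}

int main() {
    const int N = 1024;
    int *in, *out;
    cudaMallocManaged(&in, N * sizeof(int));
    cudaMallocManaged(&out, N * sizeof(int));
    for (int i = 0; i < N; ++i) in[i] = i;
    reverseTile<<<N / 256, 256>>>(in, out);      // kernel = grid of blocks of threads
    cudaDeviceSynchronize();
    printf("out[0] = %d\n", out[0]);             // prints 255
    cudaFree(in); cudaFree(out);
    return 0;
}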
5
Baseline GPU Architecture
[Diagram: SIMT cores connect through an
interconnection network to multiple memory
partitions, each with an atomic op. unit, a
last-level cache bank, and an off-chip DRAM
channel]
6
Stack-Based SIMD Reconvergence (SIMT)
(Levinthal SIGGRAPH84, Fung MICRO07)
[Diagram: per-warp reconvergence stack; basic
blocks A through G annotated with active masks,
e.g. A/1111, B/1111, C/1001, D/0110, E/1111,
G/1111]
7
Data Synchronizations on GPUs
  • Motivation: solve a wider range of problems on
    the GPU
  • Data race =>  data synchronization
  • Current solution: atomic read-modify-write
    (32-bit/64-bit)
  • Best solution?
  • Why Transactional Memory?
  • E.g. N-Body with 5M bodies (traditional sync,
    not TM): CUDA SDK O(n^2) version: 1640 s
    (barrier); Barnes-Hut O(n log n) version: 5.2 s
    (atomics, harder to get right)
  • Easier to write/debug efficient algorithms
  • Practical efficiency: want the efficiency of a
    GPU with reasonable (not superhuman) effort and
    time

8
Data Synchronizations on GPUs
  • Writing deadlock-free code with fine-grained
    locks and 10,000 hardware-scheduled threads is
    hard
  • Other general problems with lock-based
    synchronization:
  • Implicit relationship between locks and the
    objects they protect
  • Code is not composable

9
Data Synchronization Problems Specific to GPUs
  • Interaction between locks and SIMT control flow
    can cause deadlocks

A: done = 0;
B: while (!done) {
C:   if (atomicCAS(&lock, 0, 1) == 0) {
D:     // Critical Section
E:     lock = 0;
F:     done = 1;
G:   }
H: }

Under stack-based reconvergence, the warp can get
stuck running the spinning threads: they cannot
proceed until the lock holder releases the lock,
but the lock holder is not scheduled again until
the spinning threads reach the reconvergence
point.  Deadlock.
10
Transactional Memory
  • Program specifies atomic code blocks called
    transactions Herlihy93

TM version:
  atomic {
    X[c] = X[a] + X[b];
  }

Lock version:
  Lock(X[a]); Lock(X[b]); Lock(X[c]);
  X[c] = X[a] + X[b];
  Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);
11
Transactional Memory
Programmer's view: transactions appear to execute
atomically, in some serial order

[Diagram: over time, either TX1 commits before TX2,
OR TX2 commits before TX1]
12
Transactional Memory
  • Each transaction has 3 phases (sketched below):
  • Execution: track all memory accesses (read-set
    and write-set)
  • Validation: detect any conflicting accesses
    between transactions; resolve conflicts if
    needed (abort/stall)
  • Commit: update global memory
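
A minimal sketch of these three phases in
CUDA-flavored C++ (the TxState structure and helper
names are illustrative assumptions, not KILO TM's
interface; in KILO TM, validation and commit are
performed atomically per transaction by commit
units in the memory partitions):

// Word-granularity logs, one TxState per transactional thread.
struct LogEntry { int* addr; int value; };
struct TxState  { LogEntry rd[32]; int nrd; LogEntry wr[32]; int nwr; };

// Execution phase: track every transactional access.
__device__ int txLoad(TxState& tx, int* addr) {
    int v = *addr;
    tx.rd[tx.nrd++] = {addr, v};       // read-set: address + value observed
    return v;
}
__device__ void txStore(TxState& tx, int* addr, int v) {
    tx.wr[tx.nwr++] = {addr, v};       // write-set buffered; global memory untouched
}

// Validation phase: re-read the read-set; a changed value means a
// conflicting access happened, so the transaction must abort.
__device__ bool txValidate(const TxState& tx) {
    for (int i = 0; i < tx.nrd; ++i)
        if (*tx.rd[i].addr != tx.rd[i].value) return false;
    return true;
}

// Commit phase: publish the buffered write-set to global memory.
__device__ void txCommit(const TxState& tx) {
    for (int i = 0; i < tx.nwr; ++i)
        *tx.wr[i].addr = tx.wr[i].value;
}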

13
Transactional Memory on OpenCL
  • A natural extension to the OpenCL programming
    model
  • A program can launch many more threads than the
    hardware can execute concurrently
  • GPU-TM? Threads currently running transactions
    must not have to wait for future, unscheduled
    threads

[Diagram: the GPU HW runs a subset of the launched
threads at any one time]
14
Are TM and GPUs Incompatible?
  • The problem with GPUs (from a TM perspective):
  • 1000s of concurrent threads
  • Inter-thread spatial locality is common
  • No cache coherence
  • No private cache for each thread (buffering?)
  • Tx abort =>  control flow divergence

15
Hardware TM for GPUs Challenge: Conflict Detection

[Diagram: conventional HTMs detect conflicts via
private data caches or signatures per transaction
(TX1..TX4) on top of scalable coherence.  No
coherence on GPUs; would each scalar thread need
its own cache?]
16
Hardware TM for GPUs Challenge: Transaction
Rollback

[Diagram: checkpointing thread state for rollback,
against 2MB of total on-chip storage]
17
Hardware TM for GPUs Challenge: Access Granularity
and Write Buffer

[Diagram: a CPU core can buffer a transaction's
writes in its L1 data cache; a GPU core (SM)
shares its L1 among 1000s of threads]

Problem: 384 lines / 1536 threads = less than 1
line per thread!
18
Hardware TM on GPUs ChallengeSIMT Hardware
  • On GPUs, scalar threads in a warp/wavefront
    execute in lockstep

A warp with 8 scalar threads:
  ...
  TxBegin
  LD   r2, B
  ADD  r2, r2, 2
  ST   r2, A
  TxCommit
  ...

Reconvergence?
19
Goal
  • We take it as a given that most programmers
    trying lock-based programming on a GPU will give
    up before they manage to get their application
    working.
  • Hence, our goal was to find the most efficient
    approach to implementing TM on a GPU.

20
KILO TM
  • Supports 1000s of concurrent transactions
  • Transaction-aware SIMT stack
  • No cache coherence protocol dependency
  • Word-level conflict detection
  • Captures 59% of FG lock performance
  • 128X faster than serialized Tx execution

21
KILO TM Design Highlights
  • Value-based conflict detection
  • Self-validation & abort: simple communication
  • No cache coherence dependence
  • Speculative validation
  • Increases commit parallelism

22
High-Level GPU Architecture: KILO TM
Implementation Overview
23
KILO TM SIMT Core Changes
  • SW register checkpoint
  • Observation: most overwritten registers are not
    used afterwards
  • Compiler analysis can identify what to
    checkpoint
  • Transaction abort
  • Abort = do-while loop around the transaction
    (see the retry sketch below)
  • Extend the SIMT stack with special entries to
    track aborted transactions in each warp

TxBegin
LD   r2, B
ADD  r2, r2, 2
ST   r2, A
TxCommit
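
Reusing the illustrative TxState helpers from the
sketch after slide 12 (still an assumption-laden
sketch: in real KILO TM, validation and commit
happen atomically in the commit units, and the
transaction-aware SIMT stack, not a software flag,
tracks which lanes retry):

// The snippet above (LD/ADD/ST between TxBegin and TxCommit) becomes
// a do-while loop: aborted threads simply re-execute the body.
__device__ void txExample(TxState& tx, int* A, int* B) {
    bool committed;
    do {
        tx.nrd = tx.nwr = 0;           // TxBegin: restart with empty logs
        int r2 = txLoad(tx, B);        // LD  r2, B
        r2 = r2 + 2;                   // ADD r2, r2, 2
        txStore(tx, A, r2);            // ST  r2, A
        committed = txValidate(tx);    // TxCommit: validate...
        if (committed) txCommit(tx);   // ...and publish on success
    } while (!committed);              // abort => retry
}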
24
Transaction-Aware SIMT Stack
25
KILO TM Value-Based Conflict Detection
[Diagram: TX1 executes atomic { B = A + 1 } and TX2
executes atomic { A = B + 2 }, with A=1, B=0 in
global memory initially; each buffers its write in
private memory, then at commit re-reads its
read-set from global memory and compares against
the logged values (replayed in code below)]

  • Self-validation & abort:
  • Only detects the existence of a conflict (not
    its identity)
  • =>  No Tx-to-Tx messages: simple communication
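
To replay the slide's example in plain host code (a
self-contained sketch with illustrative names): TX1
computes B = A + 1 and TX2 computes A = B + 2; each
logs the single value it read, and whichever
validates second sees a changed value and
self-aborts.

#include <cstdio>

int A = 1, B = 0;   // global memory, initial values from the slide

// One read and one buffered write per transaction in this tiny example.
struct Tx { int* rdAddr; int rdVal; int* wrAddr; int wrVal; };

bool validateAndCommit(Tx& t) {
    if (*t.rdAddr != t.rdVal) return false;   // value changed: self-abort
    *t.wrAddr = t.wrVal;                      // commit the buffered write
    return true;
}

int main() {
    Tx tx1 = {&A, A, &B, A + 1};  // TX1 read A(=1), buffered B = 2
    Tx tx2 = {&B, B, &A, B + 2};  // TX2 read B(=0), buffered A = 2
    printf("TX1 %s\n", validateAndCommit(tx1) ? "commits" : "aborts");  // commits; B becomes 2
    printf("TX2 %s\n", validateAndCommit(tx2) ? "commits" : "aborts");  // aborts: B changed under it
    return 0;
}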
26
Parallel Validation?
Data Race!?!

[Diagram: with A=1, B=0 initially, TX1 (B = A + 1)
and TX2 (A = B + 2) validate in parallel; both
pass and both commit, leaving A=2, B=2, an outcome
no serial order of the two transactions produces]
27
Serialize Validation?
[Diagram: TX2's validation (V) and commit (C)
stall until TX1 finishes committing]

  • Benefit 1: No data race
  • Benefit 2: No livelock (a generic lazy-TM
    problem)
  • Drawback: serializes non-conflicting
    transactions (collateral damage)

28
Identifying Non-Conflicting Tx, Step 1: Leverage
Parallelism

[Diagram: each global memory partition has its own
commit unit; TX1 and TX2 touching different
partitions validate and commit in parallel]
29
Solution Speculative Validation
  • Key idea: split validation into two parts
    (sketched below)
  • Part 1: check against recently committed
    transactions
  • Part 2: check against concurrently committing
    transactions
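
A host-side sketch of the split, under assumed data
structures (readLog, writeSet, and the
pipeline-order list "ahead" are all illustrative):
part 1 is the ordinary value check against global
memory, which already reflects recently committed
transactions; part 2 flags a hazard when a
transaction still ahead in the commit pipeline
writes a word this transaction read, in which case
the part-1 result is speculative and must wait.

#include <cstdio>
#include <utility>
#include <vector>

struct InFlightTx { std::vector<int*> writeSet; };   // pending, not yet committed

// Part 1: value-compare the read-set against global memory.
bool part1ValueCheck(const std::vector<std::pair<int*, int>>& readLog) {
    for (const auto& r : readLog)
        if (*r.first != r.second) return false;
    return true;
}

// Part 2: does any transaction ahead of us still intend to write a word
// we read?  If so, stall (validation wait) until it finalizes.
bool part2Hazard(const std::vector<std::pair<int*, int>>& readLog,
                 const std::vector<InFlightTx*>& ahead) {
    for (const InFlightTx* w : ahead)
        for (int* wa : w->writeSet)
            for (const auto& r : readLog)
                if (r.first == wa) return true;
    return false;
}

int main() {
    // Mirrors the slide-31 example: TX2 (ahead) writes D, TX3 reads D.
    int D = 5;
    std::vector<std::pair<int*, int>> tx3ReadLog = {{&D, 5}};  // TX3: R(D)
    InFlightTx tx2{{&D}};                                      // TX2: W(D)
    std::vector<InFlightTx*> ahead = {&tx2};
    printf("part1 pass=%d, hazard(stall)=%d\n",
           part1ValueCheck(tx3ReadLog), part2Hazard(tx3ReadLog, ahead));
    return 0;
}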

30
KILO TM Speculative Validation
  • The memory subsystem is deeply pipelined and
    highly parallel

[Diagram: each commit unit processes transactions
through a pipeline: Validation Queue -> Log
Transfer -> Spec. Validation -> Hazard Detection ->
Validation Wait -> Finalize Outcome -> Commit.
Example: TX1 R(A),W(B); TX2 R(C),W(D); TX3 R(D),W(E)
flow through a memory partition's commit unit]
31
KILO TM Speculative Validation
[Diagram continued, with TX1 R(A),W(B); TX2
R(C),W(D); TX3 R(D),W(E): TX3's read-set (D)
overlaps TX2's pending write-set (D); hazard
detection STALLs TX3 in validation wait until TX2
finalizes and commits]
32
Log Storage
  • Transaction logs are stored in the private
    memory of each thread
  • Located in DRAM, cached in the L1 and L2 caches

[Diagram: each wavefront carries a read-log pointer
and a write-log pointer]
33
Log Transfer
  • Log entries heading to the same memory partition
    can be grouped into a larger packet

[Diagram: read-log and write-log entries are packed
per destination partition before transfer]
34
Distributed Commit / HW Org.
35
ABA Problem?
  • Classic example: linked-list-based stack
  • Thread 0: pop()

while (true) {
  t = top;
  next = t->Next;    // thread 2: pop A, pop B, push A
  if (atomicCAS(&top, t, next) == t) break;  // succeeds!
}
36
ABA Problem?
  • atomicCAS protects only a single word
  • Only part of the data structure
  • Value-based conflict detection protects all
    relevant parts of the data structure

while (true) {
  t = top;
  next = t->Next;
  if (atomicCAS(&top, t, next) == t) break;  // succeeds!
}
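
A self-contained sketch of why (illustrative
structures, not KILO TM's actual log format): the
transaction's read-log covers both top and t->Next,
so after the interleaved pop A, pop B, push A, a
CAS on top alone would succeed, but value-based
validation notices that A->Next changed.

#include <cstdio>

struct Node { Node* Next; };

// Value-based read-log for the two words pop() reads.
struct PopTx { Node** topAddr; Node* topSeen; Node** nextAddr; Node* nextSeen; };

bool validate(const PopTx& tx) {
    return *tx.topAddr == tx.topSeen && *tx.nextAddr == tx.nextSeen;
}

int main() {
    Node C{nullptr}, B{&C}, A{&B};
    Node* top = &A;                 // stack: A -> B -> C

    // Thread 0 starts pop(): logs top (=A) and A->Next (=B).
    PopTx tx{&top, top, &top->Next, top->Next};

    // Thread 2 interleaves: pop A, pop B, push A.
    top = top->Next;                // pop A: top = B
    top = top->Next;                // pop B: top = C
    A.Next = top; top = &A;         // push A: top = A again (the "ABA")

    // atomicCAS(&top, A, B) would succeed here; value-based validation
    // re-checks A->Next (now C, not B) and aborts instead.
    printf("%s\n", validate(tx) ? "commit (unsafe!)" : "abort (ABA detected)");
    return 0;
}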
37
Evaluation Methodology
  • GPGPU-Sim 3.0 (BSD license)
  • Detailed: 0.93 IPC correlation vs. GT200
  • KILO TM (timing-driven memory accesses)
  • GPU TM Applications
  • Hash Table (HT-H, HT-L)
  • Bank Account (ATM)
  • Cloth Physics (CL)
  • Barnes Hut (BH)
  • CudaCuts (CC)
  • Data Mining (AP)

38
  • GPGPU-Sim 3.0.x running SASS (decuda)

0.976 correlation on the subset of the CUDA SDK
that decuda correctly disassembles.  Note: the rest
of the data uses PTX instead of SASS (0.93
correlation).  (We believe GPGPU-Sim is a
reasonable proxy.)
39
Performance (vs. Serializing Tx)
40
Absolute Performance (IPC)
[Chart: absolute performance (IPC) per application]

  • TM on a GPU performs well for applications with
    low contention
  • Performs poorly with memory divergence, low
    parallelism, or a high conflict rate
  • (tackle through algorithm design/tuning?)
  • CPU vs. GPU?
  • CC: the FG-lock version is 400X faster than its
    CPU version
  • BH: the FG-lock version is 2.5X faster than its
    CPU version

41
Performance (Exec. Time)
Captures 59% of FG lock performance; 128X faster
than serialized Tx execution
42
KILO TM Scaling
43
Abort Commit Ratio
Increasing the number of TXs =>  increased
probability of conflict.  Two possible solutions
(future work):
Solution 1: application performance tuning (easier
with TM vs. FG lock)
Solution 2: transaction scheduling
44
Thread Cycle Breakdown
  • Status of a thread at each cycle
  • Categories:
  • TC: in a warp stalled by concurrency control
  • TO: in a warp committing its transactions
  • TW: has passed commit, waiting for other threads
    in the warp to pass
  • TA: executing an eventually aborted transaction
  • TU: executing an eventually committed
    transaction (useful work)
  • AT: acquiring a lock or doing an atomic
    operation
  • BA: waiting at a barrier
  • NL: doing non-transactional (normal) work

45
Thread Cycle Breakdown
[Chart: thread cycle breakdown for the FGL, KL,
KL-UC, and IDEAL configurations across HT-H, HT-L,
ATM, CL, BH, CC, and AP]
46
Core Cycle Breakdown
  • Action performed by a core at each cycle
  • Categories:
  • EXEC: issuing a warp for execution
  • STALL: stalled by a downstream warp
  • SCRB: all warps blocked by the scoreboard, due
    to data hazards, concurrency control, pending
    commits (or any combination thereof)
  • IDLE: none of the warps are ready in the
    instruction buffer

47
Core Cycle Breakdown
[Chart: core cycle breakdown for the FGL, KL,
KL-UC, and IDEAL configurations]
48
Read-Write Buffer Usage
49
In-Flight Buffers
50
Implementation Complexity
  • Logs in private memory @ the L1 data cache
  • Commit unit:
  • 5kB last writer history unit
  • 19kB transaction status
  • 32kB read-set and write-set buffer
  • CACTI 5.3 @ 40nm:
  • 0.40mm² × 6 memory partitions ≈ 2.4mm²
  • 0.5% of a 520mm² die

51
Summary
  • KILO TM
  • 1000s of concurrent transactions
  • Value-based conflict detection
  • Speculative validation for commit parallelism
  • 59% of fine-grained locking performance
  • 0.5% area overhead

52
Backup Slides
53
Logical Stage Organization
54
Execution Time Breakdown