Title: Optimistic Intra-Transaction Parallelism using Thread Level Speculation
1Optimistic Intra-Transaction Parallelism using
Thread Level Speculation
- Chris Colohan (1), Anastassia Ailamaki (1),
- J. Gregory Steffan (2) and Todd C. Mowry (1,3)
- (1) Carnegie Mellon University
- (2) University of Toronto
- (3) Intel Research Pittsburgh
2Chip Multiprocessors are Here!
AMD Opteron
IBM Power 5
Intel Yonah
- 2 cores now, soon will have 4, 8, 16, or 32
- Multiple threads per core
- How do we best use them?
3Multi-Core Enhances Throughput
Database Server
Users
Cores can run concurrent transactions and improve
throughput
4Using Multiple Cores
Database Server
Users
Can multiple cores improve transaction latency?
5Parallelizing transactions
DBMS
SELECT cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock
    quantity--
    UPDATE stock WITH quantity
    INSERT item INTO order_line
- Intra-query parallelism
- Used for long-running queries (decision support)
- Does not work for short queries
- Short queries dominate in commercial workloads
6Parallelizing transactions
DBMS
SELECT cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock
    quantity--
    UPDATE stock WITH quantity
    INSERT item INTO order_line
- Intra-transaction parallelism
- Each thread spans multiple queries
- Hard to add to existing systems!
- Need to change interface, add latches and locks,
worry about correctness of parallel execution
7Parallelizing transactions
DBMS
SELECT cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock
    quantity--
    UPDATE stock WITH quantity
    INSERT item INTO order_line
- Intra-transaction parallelism
- Breaks transaction into threads
- Hard to add to existing systems!
- Need to change interface, add latches and locks,
worry about correctness of parallel execution
Thread Level Speculation (TLS) makes
parallelization easier.
8Thread Level Speculation (TLS)
[Diagram: epochs p and q run one after the other in Sequential execution, and concurrently in Parallel execution]
9Thread Level Speculation (TLS)
- Use epochs
- Detect violations
- Restart to recover
- Buffer state
- Oldest epoch
- Never restarts
- No buffering
- Worst case
- Sequential
- Best case
- Fully parallel
Epoch 1
Epoch 2
[Diagram: Sequential vs. Parallel execution; in Parallel, epoch 2 loads a value that epoch 1 later stores → Violation!; epoch 2 restarts (R2) and re-executes]
Data dependences limit performance.
10A Coordinated Effort
Choose epoch boundaries
TransactionProgrammer
DBMS Programmer
Remove performance bottlenecks
Hardware Developer
Add TLS support to architecture
11So what's new?
- Intra-transaction parallelism
- Without changing the transactions
- With minor changes to the DBMS
- Without having to worry about locking
- Without introducing concurrency bugs
- With good performance
- Halve transaction latency on four cores
12Related Work
- Optimistic Concurrency Control (Kung 82)
- Sagas (Garcia-Molina & Salem 87)
- Transaction chopping (Shasha 95)
13Outline
- Introduction
- Related work
- Dividing transactions into epochs
- Removing bottlenecks in the DBMS
- Results
- Conclusions
14Case Study New Order (TPC-C)
GET cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock WHERE i_id = item
    UPDATE stock WITH quantity-1 WHERE i_id = item
    INSERT item INTO order_line
- Only dependence is the quantity field
- Very unlikely to occur (1/100,000)
15Case Study New Order (TPC-C)
GET cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock WHERE i_id = item
    UPDATE stock WITH quantity-1 WHERE i_id = item
    INSERT item INTO order_line

GET cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
TLS_foreach (item):
    GET quantity FROM stock WHERE i_id = item
    UPDATE stock WITH quantity-1 WHERE i_id = item
    INSERT item INTO order_line
16Outline
- Introduction
- Related work
- Dividing transactions into epochs
- Removing bottlenecks in the DBMS
- Results
- Conclusions
17Dependences in DBMS
18Dependences in DBMS
- Dependences serialize execution!
- Example: statistics gathering
- pages_pinned
- TLS maintains serial ordering of increments
- To remove, use per-CPU counters
- Performance tuning
- Profile execution
- Remove bottleneck dependence
- Repeat
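The per-CPU counter fix can be sketched in a few lines of C. This is a minimal single-threaded illustration of the idea, not BerkeleyDB code; the type names and the cache-line padding constant are assumptions.

```c
#include <stdint.h>

#define NCPUS 4
#define CACHE_LINE 64  /* assumed line size; keeps slots on separate lines */

/* One statistics counter slot per CPU, padded to a cache line so that
 * increments by different CPUs never touch the same line, and TLS never
 * sees a cross-epoch dependence on the counter. */
struct percpu_counter {
    struct {
        uint64_t v;
        char pad[CACHE_LINE - sizeof(uint64_t)];
    } slot[NCPUS];
};

/* Each CPU bumps only its own slot: increments are no longer serialized. */
static void counter_inc(struct percpu_counter *c, int cpu)
{
    c->slot[cpu].v++;
}

/* Statistics are read rarely, so summing all slots on read is acceptable. */
static uint64_t counter_read(const struct percpu_counter *c)
{
    uint64_t total = 0;
    for (int i = 0; i < NCPUS; i++)
        total += c->slot[i].v;
    return total;
}
```

The read side pays a small cost (summing NCPUS slots), which is the right trade for a counter that is written constantly and read only when someone asks for statistics.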
19Buffer Pool Management
CPU
get_page(5)
put_page(5)
Buffer Pool
ref 1
ref 0
20Buffer Pool Management
CPU
get_page(5)
get_page(5)
put_page(5)
put_page(5)
get_page(5)
Buffer Pool
put_page(5)
TLS ensures first epoch gets page first. Who
cares?
ref 0
- TLS maintains original load/store order
- Sometimes this is not needed
21Buffer Pool Management
- Escape speculation
- Invoke operation
- Store undo function
- Resume speculation
CPU
get_page(5)
get_page(5)
get_page(5)
put_page(5)
put_page(5)
put_page(5)
get_page(5)
Buffer Pool
put_page(5)
ref 0
- Isolated: undoing get_page will not affect other transactions
- Undoable: we have an operation (put_page) which returns the system to its initial state
22Buffer Pool Management
CPU
get_page(5)
get_page(5)
get_page(5)
put_page(5)
get_page(5)
Buffer Pool
Not undoable!
ref 0
23Buffer Pool Management
CPU
get_page(5)
get_page(5)
get_page(5)
put_page(5)
Buffer Pool
ref 0
- Delay put_page until end of epoch
- Avoid dependence
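The "delay put_page until the end of the epoch" idea can be sketched as a small deferred-release queue. This is a toy model with made-up types (`epoch_ctx`, a refcount array standing in for the buffer pool), not the real interface.

```c
#define MAX_DEFERRED 32
#define NPAGES 16

static int ref_count[NPAGES];        /* toy buffer pool: pin count per page */

typedef struct {
    int pending[MAX_DEFERRED];       /* pages whose release is postponed */
    int n;
} epoch_ctx;

static void get_page(epoch_ctx *e, int page)
{
    (void)e;
    ref_count[page]++;               /* pin the page */
}

/* put_page is not undoable (once unpinned, the page could be evicted), so
 * a speculative epoch only records the release instead of performing it. */
static void put_page_deferred(epoch_ctx *e, int page)
{
    e->pending[e->n++] = page;
}

/* Once the epoch commits it is non-speculative and the queued releases
 * can safely be applied.  On a violation the queue is simply discarded. */
static void epoch_commit(epoch_ctx *e)
{
    for (int i = 0; i < e->n; i++)
        ref_count[e->pending[i]]--;
    e->n = 0;
}
```

Pinning a page slightly longer than necessary is harmless; releasing it too early (and having it evicted under a speculative epoch) is not, which is why only the release is deferred.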
24Removing Bottleneck Dependences
- We introduce three techniques
- Delay operations until non-speculative
- Mutex and lock acquire and release
- Buffer pool, memory, and cursor release
- Log sequence number assignment
- Escape speculation
- Buffer pool, memory, and cursor allocation
- Traditional parallelization
- Memory allocation, cursor pool, error checks,
false sharing
25Outline
- Introduction
- Related work
- Dividing transactions into epochs
- Removing bottlenecks in the DBMS
- Results
- Conclusions
26Experimental Setup
- Detailed simulation
- Superscalar, out-of-order, 128-entry reorder buffer
- Memory hierarchy modeled in detail
- TPC-C transactions on BerkeleyDB
- In-core database
- Single user
- Single warehouse
- Measure interval of 100 transactions
- Measuring latency not throughput
27Optimizing the DBMS New Order
[Chart: time (normalized) vs. optimization stage, starting from Sequential; annotations: 26% improvement, other CPUs not helping, can't optimize much more, cache misses increase]
28Optimizing the DBMS New Order
[Chart: time (normalized) after each optimization, starting from Sequential]
This process took me 30 days and <1,200 lines of code.
29Other TPC-C Transactions
[Chart: time (normalized) for New Order, Delivery, Stock Level, Payment, and Order Status; bars split into Idle CPU, Failed, Cache Miss, and Busy time]
30Conclusions
- TLS makes intra-transaction parallelism practical
- Reasonable changes to transaction, DBMS, and hardware
- Halve transaction latency
31Needed backup slides (not done yet)
- 2 proc. Results
- Shared caches may change how you want to extract parallelism!
- Just have lots of transactions: no sharing
- TLS may have more sharing
32Any questions?
- For more information, see
- www.colohan.com
33Backup Slides Follow
34LATCHES
35Latches
- Mutual exclusion between transactions
- Cause violations between epochs
- Read-test-write cycle → RAW
- Not needed between epochs
- TLS already provides mutual exclusion!
36Latches Aggressive Acquire
Acquire; latch_cnt++; work; latch_cnt--
latch_cnt++; work; (enqueue release)
latch_cnt++; work; (enqueue release)
Commit; work; latch_cnt--
Commit; work; latch_cnt--; Release
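The counting scheme behind aggressive acquire can be sketched as follows. This is a toy, single-threaded model: `latch_held` stands in for the real latch, so only the bookkeeping is shown.

```c
/* Aggressive acquire: the first epoch to enter actually takes the latch;
 * later epochs just bump a shared count and enqueue their release.  The
 * latch is really released only when the last epoch in the chain commits. */
static int latch_held = 0;   /* stand-in for the real latch */
static int latch_cnt  = 0;   /* epochs currently inside the critical section */

static void aggressive_acquire(void)
{
    if (latch_cnt == 0)
        latch_held = 1;      /* the real acquire happens only once */
    latch_cnt++;
}

static void commit_release(void)
{
    if (--latch_cnt == 0)
        latch_held = 0;      /* the real release, by the last committer */
}
```

TLS already keeps the epochs ordered, so the latch only has to exclude other transactions, not the epochs within this one.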
37Latches Lazy Acquire
Acquire; work; Release
(enqueue acquire); work; (enqueue release)
(enqueue acquire); work; (enqueue release)
Acquire; Commit; work; Release
Acquire; Commit; work; Release
38HARDWARE
39TLS in Database Systems
- Large epochs
- More dependences
- Must tolerate
- More state
- Bigger buffers
Non-Database TLS
TLS in Database Systems
40Feedback Loop
for() do_work()
41Violations Feedback
[Diagram: epoch 2 loads a value that epoch 1 later stores → Violation!; epoch 2 restarts (R2)]
42Eliminating Violations
0x0FD8 → 0xFD20, 0x0FC0 → 0xFC18
43Tolerating Violations Sub-epochs
Violation!
q
Sub-epochs
44Sub-epochs
- Started periodically by hardware
- How many?
- When to start?
- Hardware implementation
- Just like epochs
- Use more epoch contexts
- No need to check violations between sub-epochs
within an epoch
Violation!
q
Sub-epochs
45Old TLS Design
Buffer speculative state in write back L1 cache
CPU
CPU
CPU
CPU
L1
L1
L1
L1
Restart by invalidating speculative lines
Invalidation
Detect violations through invalidations
- Problems
- L1 cache not large enough
- Later epochs only get values on commit
L2
Rest of system only sees committed data
Rest of memory system
46New Cache Design
CPU
CPU
CPU
CPU
Speculative writes immediately visible to L2 (and
later epochs)
L1
L1
L1
L1
Restart by invalidating speculative lines
Buffer speculative and non-speculative state for
all epochs in L2
L2
L2
Invalidation
Detect violations at lookup time
Rest of memory system
Invalidation coherence between L2 caches
47New Features
New!
CPU
CPU
CPU
CPU
Speculative state in L1 and L2 cache
L1
L1
L1
L1
Cache line replication (versions)
L2
L2
Data dependence tracking within cache
Speculative victim cache
Rest of memory system
48Scaling
Time (normalized)
49Evaluating a 4-CPU system
[Chart: time (normalized) on 4 CPUs. Bars: Baseline (original benchmark on 1 CPU), Sequential (parallelized benchmark on 1 CPU), No Sub-epoch (parallel execution without sub-epoch support), TLS (parallel execution), No Speculation (ignore violations: the Amdahl's Law limit)]
50Sub-epochs How many/How big?
- Supporting more sub-epochs is better
- Spacing depends on location of violations
- Even spacing is good enough
51Query Execution
- Actions taken by a query
- Bring pages into buffer pool
- Acquire and release latches and locks
- Allocate/free memory
- Allocate/free and use cursors
- Use B-trees
- Generate log entries
These generate violations.
52Applying TLS
- Parallelize loop
- Run benchmark
- Remove bottleneck
- Go to 2
53Outline
TransactionProgrammer
DBMS Programmer
Hardware Developer
54Violation Prediction
55Violation Prediction
- Predictor problems
- Large epochs → many predictions
- Failed prediction → violation
- Incorrect prediction → large stall
- Two predictors required
- Last store
- Dependent load
Predictor
q
Done
q
Predict Dependences
56TLS Execution
CPU 1
CPU 2
CPU 3
CPU 4
p
L1
L1
L1
L1
Violation!
p
R2
q
L2
Rest of memory system
57TLS Execution
p
Violation!
p
p
R2
q
s
t
58TLS Execution
p
Violation!
p
p
R2
q
s
t
59TLS Execution
p
Violation!
p
p
R2
q
60TLS Execution
p
Violation!
p
p
R2
q
q
61TLS Execution
p
Violation!
p
p
R2
q
q
62Replication
p
Violation!
p
p
R2
q
q
q
Can't invalidate a line if it contains two epochs'
changes
63Replication
p
Violation!
p
p
R2
q
q
q
q
64Replication
p
Violation!
p
p
R2
q
q
q
q
- Makes epochs independent
- Enables sub-epochs
65Sub-epochs
p
1a
q
p
p
1b
q
q
p
1c
q
q
1d
p
- Uses more epoch contexts
- Detection/buffering/rewind is free
- More replication
- Speculative victim cache
66get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ Wraps get_page()
67get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ No violations while calling get_page()
68get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ May get bad input data from speculative thread!
69get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ Only one epoch per transaction at a time
70get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ How to undo get_page()
71get_page() wrapper
- Isolated: undoing this operation does not cause cascading aborts
- Undoable: easy way to return system to initial state
- Can also be used for
- Cursor management
- malloc()
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
72TPC-C Benchmark
Company
Warehouse 1
Warehouse W
District 1
District 2
District 10
Cust 1
Cust 2
Cust 3k
73TPC-C Benchmark
[Diagram: TPC-C schema with cardinalities: Warehouse (W), District (W×10), Customer (W×30k), History (W×30k), Stock (W×100k), New Order (W×9k), Order (W×30k), Order Line (W×300k, 5-15 per order), Item (100k)]
74What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
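The violation detection the next slides walk through can be modeled with per-epoch load sets: a store by an earlier epoch violates any later epoch that has already loaded the same address. A simplified sketch (real hardware tracks this per cache line via SL bits, not per address, and the names here are illustrative):

```c
#include <stdbool.h>

#define NEPOCH 4
#define NADDR  64

/* Per-epoch "speculatively loaded" bits, kept per address here for
 * simplicity (hardware keeps them per cache line). */
static bool loaded[NEPOCH][NADDR];

static void tls_load(int epoch, int addr)
{
    loaded[epoch][addr] = true;
}

/* A store violates any LATER epoch that has already loaded the same
 * address, since that epoch consumed a stale value.  Returns the first
 * violated epoch, or -1 if the store is safe. */
static int tls_store(int epoch, int addr)
{
    for (int later = epoch + 1; later < NEPOCH; later++)
        if (loaded[later][addr])
            return later;
    return -1;
}
```

In the hash example, epoch 1 stores hash[10] after epoch 4 has loaded it, so epoch 4 is violated and must restart; stores to addresses no later epoch has read commit without incident.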
75What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
Thread 1
Thread 2
Thread 3
Thread 4
x = hash[3] ... hash[10] = y ...
x = hash[19] ... hash[21] = y ...
x = hash[33] ... hash[30] = y ...
x = hash[10] ... hash[25] = y ...
76What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
Thread 1
Thread 2
Thread 3
Thread 4
x = hash[3] ... hash[10] = y ...
x = hash[19] ... hash[21] = y ...
x = hash[33] ... hash[30] = y ...
x = hash[10] ... hash[25] = y ...
Violation!
77What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
Thread 1
Thread 2
Thread 3
Thread 4
x = hash[3] ... hash[10] = y ... attempt_commit()
x = hash[19] ... hash[21] = y ... attempt_commit()
x = hash[33] ... hash[30] = y ... attempt_commit()
x = hash[10] ... hash[25] = y ... attempt_commit()
Violation!
?
?
?
?
78What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
Thread 1
Thread 2
Thread 3
Thread 4
x = hash[3] ... hash[10] = y ... attempt_commit()
x = hash[19] ... hash[21] = y ... attempt_commit()
x = hash[33] ... hash[30] = y ... attempt_commit()
x = hash[10] ... hash[25] = y ... attempt_commit()
Violation!
?
?
?
?
Redo
Thread 4
x = hash[10] ... hash[25] = y ... attempt_commit()
?
79TLS Hardware Design
- What's new?
- Large threads
- Epochs will communicate
- Complex control flow
- Huge legacy code base
- How does hardware change?
- Store state in L2 instead of L1
- Reversible atomic operations
- Tolerate dependences
- Aggressive update propagation (implicit forwarding)
- Sub-epochs
80L1 Cache Line
Valid
Data
LRU
Tag
SL
SM
- SL bit
- L2 cache knows this line has been speculatively loaded
- On violation or commit: clear
- SM bit
- This line contains speculative changes
- On commit: clear
- On violation: SM → Invalid
- → Otherwise, just like a normal cache
81Escaping Speculation
Valid
Data
Stale
LRU
Tag
SL
SM
- → Speculative epoch wants to make a system-visible change!
- Ignore SM lines while escaped
- Stale bit
- This line may be outdated by speculative work
- On violation or commit: clear
82L1 to L2 communication
- L2 sees all stores (write through)
- L2 sees first load of an epoch
- NotifySL message
- → L2 can track data dependences!
83L1 Changes Summary
Valid
Data
Stale
LRU
Tag
SL
SM
- Add three bits to each line
- SL
- SM
- Stale
- Modify tag match to recognize bits
- Add queue of NotifySL requests
84L2 Cache Line
Fine Grained SM
CPU1
CPU2
Exclusive
Valid
Data
Dirty
LRU
Tag
SL
SL
SM
SM
- Cache line can be
- Modified by one CPU
- Loaded by multiple CPUs
85Cache Line Conflicts
- Three classes of conflict
- Epoch 2 stores, epoch 1 loads
- Need old version to load
- Epoch 1 stores, epoch 2 stores
- Need to keep changes separate
- Epoch 1 loads, epoch 2 stores
- Need to be able to discard line on violation
- → Need a way of storing multiple conflicting versions in the cache
86Cache line replication
- On conflict, replicate line
- Split line into two copies
- Divide SM and SL bits at split point
- Divide directory bits at split point
87Replication Problems
- Complicates line lookup
- Need to find all replicas and select best
- Best = most recent replica
- Change management
- On write, update all later copies
- Also need to find all more speculative replicas to check for violations
- On commit, must get rid of stale lines
- Invalidation Required Buffer (IRB)
88Victim Cache
- How do you deal with a full cache set?
- Use a victim cache
- Holds evicted lines without losing SM & SL bits
- Must be fast → every cache lookup needs to know
- Do I have the best replica of this line?
- Critical path
- Do I cause a violation?
- Not on critical path
89Summary of Hardware Support
- Sub-epochs
- Violations hurt less!
- Shared cache TLS support
- Faster communication
- More room to store state
- RAOs
- Don't speculate on known operations
- Reduces amount of speculative state
90Summary of Hardware Changes
- Sub-epochs
- Checkpoint register state
- Needs replicas in cache
- Shared cache TLS support
- Speculative L1
- Replication in L1 and L2
- Speculative victim cache
- Invalidation Required Buffer
- RAOs
- Suspend/resume speculation
- Mutexes
- Undo list
91TLS Execution
CPU
CPU
CPU
CPU
p
L1
L1
L1
L1
Invalidation
Violation!
p
p
q
R2
q
L2
Rest of memory system
p
p
p
q
q
95Problems with Old Cache Design
- Database epochs are large
- L1 cache not large enough
- Sub-epochs add more state
- L1 cache not associative enough
- Database epochs communicate
- L1 cache only communicates committed data
96Intro Summary
- TLS makes intra-transaction parallelism easy
- Divide transaction into epochs
- Hardware support
- Detect violations
- Restart to recover
- Sub-epochs mitigate penalty
- Buffer state
- New process
- Modify software → avoid violations → improve performance
97The Many Faces of Ogg
98The Many Faces of Ogg
99Removing Bottlenecks
- Three general techniques
- Partition data structures
- malloc
- Postpone operations until non-speculative
- Latches and locks, log entries
- Handle speculation manually
- Buffer pool
100Bottlenecks Encountered
- Buffer pool
- Latches and locks
- Malloc/free
- Cursor queues
- Error checks
- False sharing
- B-tree performance optimization
- Log entries
101The Many Faces of Ogg
102Performance on 4 CPUs
Unmodified benchmark
Modified benchmark
103Incremental Parallelization
4 CPUs
104Scaling
Unmodified benchmark
Modified benchmark
105Parallelization is Hard
[Chart: performance improvement vs. programmer effort, comparing hand parallelization (with repeated tuning steps) against a parallelizing compiler]
106Case Study New Order (TPC-C)
- Begin transaction
- End transaction
107Case Study New Order (TPC-C)
- Begin transaction
- Read customer info
- Read increment order
- Create new order
- End transaction
108Case Study New Order (TPC-C)
- Begin transaction
- Read customer info
- Read increment order
- Create new order
- For each item in order
- Get item info
- Decrement count in stock
- Record order info
-
- End transaction
109Case Study New Order (TPC-C)
- Begin transaction
- Read customer info
- Read increment order
- Create new order
- For each item in order
- Get item info
- Decrement count in stock
- Record order info
-
- End transaction
80% of transaction execution time
110Case Study New Order (TPC-C)
- Begin transaction
- Read customer info
- Read increment order
- Create new order
- For each item in order
- Get item info
- Decrement count in stock
- Record order info
-
- End transaction
80% of transaction execution time
111The Many Faces of Ogg
112Step 2 Changing the Software
113No problem!
- Loop is easy to parallelize using TLS!
- Not really
- Calls into DBMS invoke complex operations
- Ogg needs to do some work
- Many operations in DBMS are parallel
- Not written with TLS in mind!
114Resource Management
- Mutexes
- acquired and released
- Locks
- locked and unlocked
- Cursors
- pushed and popped from free stack
- Memory
- allocated and freed
- Buffer pool entries
- Acquired and released
115Mutexes Deadlock?
- Problem
- Re-ordered acquire/release operations!
- Possibly introduced deadlock?
- Solutions
- Avoidance
- Static acquire order
- Recovery
- Detect deadlock and violate
116Locks
- Like mutexes, but
- Allows multiple readers
- No memory overhead when not held
- Often held for much longer
- Treat similarly to mutexes
117Cursors
- Used for traversing B-trees
- Pre-allocated, kept in pools
118Maintaining Cursor Pool
head
Get
Use
Release
119Maintaining Cursor Pool
head
Get
Use
Release
120Maintaining Cursor Pool
head
Get
Use
Release
121Maintaining Cursor Pool
Violation!
head
Get
Get
Use
Use
Release
Release
122Parallelizing Cursor Pool
- Use per-CPU pools
- Modify code each CPU gets its own pool
- No sharing → no violations!
- Requires cpuid() instruction
Get
Get
head
head
Use
Use
Release
Release
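The per-CPU pool can be sketched as one free list per CPU. This is an illustrative toy: `cpuid()` is replaced by an explicit `cpu` argument, and the cursor type is a stub.

```c
#include <stddef.h>

#define NCPUS 4
#define POOL_SIZE 8

typedef struct cursor {
    struct cursor *next;
    /* ...real cursor state would live here... */
} cursor;

static cursor  storage[NCPUS][POOL_SIZE];
static cursor *head[NCPUS];          /* one free-list head per CPU */

static void pools_init(void)
{
    for (int c = 0; c < NCPUS; c++) {
        head[c] = NULL;
        for (int i = 0; i < POOL_SIZE; i++) {
            storage[c][i].next = head[c];
            head[c] = &storage[c][i];
        }
    }
}

/* Each CPU touches only its own head pointer, so epochs running on
 * different CPUs never share pool state and never violate on it. */
static cursor *cursor_get(int cpu)
{
    cursor *c = head[cpu];
    if (c != NULL)
        head[cpu] = c->next;
    return c;
}

static void cursor_release(int cpu, cursor *c)
{
    c->next = head[cpu];
    head[cpu] = c;
}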
123Memory Allocation
- Problem
- malloc() metadata causes dependences
- Solutions
- Per-cpu memory pools
- Parallelized free list
124The Log
- Append records to global log
- Appending causes dependence
- Can't parallelize
- Global log sequence number (LSN)
- Generate log records in buffers
- Assign LSNs when homefree
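Buffering log records and assigning LSNs at commit can be sketched like this. The types are toys, and `log_publish` stands in for whatever happens once the epoch is homefree.

```c
#define MAX_RECS 16

typedef struct {
    int  payload;   /* stand-in for the real log record body */
    long lsn;       /* -1 until the epoch commits */
} log_rec;

typedef struct {
    log_rec buf[MAX_RECS];   /* records generated while speculative */
    int     n;
} log_buffer;

static long next_lsn = 0;    /* the global log sequence number */

/* Speculative epochs append to a private buffer, so they never touch the
 * global LSN and never serialize on it. */
static void log_append(log_buffer *lb, int payload)
{
    lb->buf[lb->n].payload = payload;
    lb->buf[lb->n].lsn = -1;
    lb->n++;
}

/* Once the epoch is homefree (oldest, cannot restart), its records get
 * LSNs in commit order and can go to the real log. */
static void log_publish(log_buffer *lb)
{
    for (int i = 0; i < lb->n; i++)
        lb->buf[i].lsn = next_lsn++;
}
```

Because epochs commit in order, publishing at commit time still yields a monotonically increasing LSN sequence, which is all the recovery log requires.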
125B-Trees
- Leaf pages contain free space counts
- Inserts of random records o.k.
- Inserting adjacent records
- Dependence on decrementing count
- Page splits
- Infrequent
126Other Dependences
- Statistics gathering
- Error checks
- False sharing
127Related Work
- Lots of work in TLS
- Multiscalar (Wisconsin)
- Hydra (Stanford)
- IACOMA (Illinois)
- RAW (MIT)
- Hand parallelizing using TLS
- Manohar Prabhu and Kunle Olukotun (PPoPP03)
128Any questions?
129Why is this a problem?
- B-tree insertion into ORDERLINE table
- Key is ol_n
- DBMS does not know that keys will be sequential
- Each insert usually updates the same btree page
130Sequential Btree Inserts
[Figure: B-tree leaf pages; four sequential inserts (1-4) all land in the same leaf page, each consuming that page's free space]
131Improvement SM Versioning
- Blow away all SM lines on a violation?
- May be bad!
- Instead
- On primary violation
- Only invalidate locally modified SM lines
- On secondary violation
- Invalidate all SM lines
- Needs one more bit LocalSM
- May decrease number of misses to L2 on violation
132Outline
- Store state in L2 instead of L1
- Reversible atomic operations
- Tolerate dependences
- Aggressive update propagation (implicit forwarding)
- Sub-epochs
- Results and analysis
133Outline
- Store state in L2 instead of L1
- Reversible atomic operations
- Tolerate dependences
- Aggressive update propagation (implicit forwarding)
- Sub-epochs
- Results and analysis
134Tolerating dependences
- Aggressive update propagation
- Get for free!
- Sub-epochs
- Periodically checkpoint epochs
- Every N instructions?
- Picking N may be interesting
- Perhaps checkpoints could be set before the
location of previous violations?
135Outline
- Store state in L2 instead of L1
- Reversible atomic operations
- Tolerate dependences
- Aggressive update propagation (implicit forwarding)
- Sub-epochs
- Results and analysis
136Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Violations
- Cache effects
- Data dependences
137Why not faster?
- Possible reasons
- Idle cpus
- 9 epochs/region average
- Two bundles of four and one of one
- ¼ of cpu cycles wasted!
- RAO mutexes
- Violations
- Cache effects
- Data dependences
138Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Not implemented yet
- Ooops!
- Violations
- Cache effects
- Data dependences
139Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Violations
- 21/969 epochs violated
- Distance 1 magic synchronized
- 2.2Mcycles (over 4 cpus)
- About 1.5%
- Cache effects
- Data dependences
140Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Violations
- Cache effects
- Deserves its own slide. ?
- Data dependences
141Cache effects of speculation
- Only 20% of references are speculative!
- Speculative references have small impact on non-speculative hit rate (<1%)
- Speculative refs miss a lot in L1
- 9-15% for reads, 2-6% for writes
- L2 saw HUGE increase in traffic
- 152k refs to 3474k refs
- Spec/non spec lines are thrashing from L1s
142Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Violations
- Cache effects
- Data dependences
- Oh yeah!
- Btree item count
- Split up btree insert?
- alloc and write
- Do alloc as RAO
- Needs more thought
143L2 Cache Line
Fine Grained SM
CPU1
CPU2
Exclusive
Valid
Data
Dirty
LRU
Tag
SL
SL
SM
SM
Set 1
Set 2
144Why are you here?
- Want faster database systems
- Have funky new hardware Thread Level
Speculation (TLS) - How can we apply TLS todatabase systems?
- Side question
- Is this a VLDB or an ASPLOS talk?
145How?
- Divide transaction into TLS-threads
- Run TLS-threads in parallel; maintain sequential semantics
- Profit!
146Why parallelize transactions?
- Decrease transaction latency
- Increase concurrency while avoiding the concurrency control bottleneck
- A.k.a. use more CPUs, same # of transactions
- The obvious
- Database performance matters
147Shopping List
- What do we need? (research scope)
- Cheap hardware
- Thread Level Speculation (TLS)
- Minor changes allowed.
- Important database application
- TPC-C
- Almost no changes allowed!
- Modular database system
- BerkeleyDB
- Some changes allowed.
148Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
149Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
150What's new?
- Database operations are
- Large
- Complex
- Large TLS-threads
- Lots of dependences
- Difficult to analyze
- Want
- Programmer optimization effort → faster program
151Hardware changes summary
- Must tolerate dependences
- Prediction?
- Implicit forwarding?
- May need larger caches
- May need larger associativity
152Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
153Parallelization Strategy
- Pick a benchmark
- Parallelize a loop
- Analyze dependences
- Optimize away dependences
- Evaluate performance
- If not satisfied, goto 3
154Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Resource management
- The log
- B-trees
- False sharing
- Results
- Conclusions
155Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
156Results
- Viola simulator
- Single CPI
- Perfect violation prediction
- No memory system
- 4 cpus
- Exhaustive dependence tracking
- Currently working on an out-of-order superscalar simulation (cello)
- 10 transaction warm-up
- Measure 100 transactions
157Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
158Conclusions
- TLS can improve transaction latency
- Violation predictors important
- Iff dependences must be tolerated
- TLS makes hand parallelizing easier
159Improving Database Performance
- How to improve performance
- Parallelize transaction
- Increase number of concurrent transactions
- Both of these require independence of database
operations!
160Case Study New Order (TPC-C)
- Begin transaction
- End transaction
161Case Study New Order (TPC-C)
- Begin transaction
- Read customer info (customer, warehouse)
- Read increment order (district)
- Create new order (orders, neworder)
- End transaction
162Case Study New Order (TPC-C)
- Begin transaction
- Read customer info (customer, warehouse)
- Read increment order (district)
- Create new order (orders, neworder)
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (orderline)
-
- End transaction
163Case Study New Order (TPC-C)
- Begin transaction
- Read customer info (customer, warehouse)
- Read increment order (district)
- Create new order (orders, neworder)
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (orderline)
-
- End transaction
Parallelize this loop
164Case Study New Order (TPC-C)
- Begin transaction
- Read customer info (customer, warehouse)
- Read increment order (district)
- Create new order (orders, neworder)
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (orderline)
-
- End transaction
Parallelize this loop
165Implementing on a Real DB
- Using BerkeleyDB
- Table = Database
- Give database any arbitrary key
- → will return arbitrary data (bytes)
- Use structs for keys and rows
- Database provides ACID through
- Transactions
- Locking (page level)
- Storage management
- Provides indexing using b-trees
166Parallelizing a Transaction
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (order line)
167Parallelizing a Transaction
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (order line)
- Get cursor from pool
- Use cursor to traverse b-tree
- Find row, lock page for row
- Release cursor to pool
168Maintaining Cursor Pool
head
Get
Use
Release
169Maintaining Cursor Pool
head
Get
Use
Release
170Maintaining Cursor Pool
head
Get
Use
Release
171Maintaining Cursor Pool
Violation!
head
Get
Get
Use
Use
Release
Release
172Parallelizing Cursor Pool 1
- Use per-CPU pools
- Modify code each CPU gets its own pool
- No sharing → no violations!
- Requires cpuid() instruction
173Parallelizing Cursor Pool 2
- Dequeue and enqueue atomic and unordered
- Delay enqueue until end of thread
- Forces separate pools
- Avoids modification of data struct
Get
head
Get
Use
Use
Release
Release
174Parallelizing Cursor Pool 3
- Atomic unordered dequeue enqueue
- Cursor struct is TLS unordered
- Struct defined as a byte range in memory
Get
Get
head
Get
Get
Use
Use
Use
Use
Release
Release
Release
Release
175Parallelizing Cursor Pool 4
- Mutex protect dequeue & enqueue; declare pointer to cursor struct to be TLS unordered
- Any access through pointer does not have TLS applied
- Pointer is tainted; any copies of it keep this property
176Problems with 3 & 4
- What exactly is the boundary of a structure?
- How do you express the concept of "object" in a loosely-typed language like C?
- A byte range or a pointer is only an approximation.
- Dynamically allocated sub-components?
177Mutexes in a TLS world
- Two types of threads
- real threads
- TLS threads
- Two types of mutexes
- Inter-real-thread
- Inter-TLS-thread
178Inter-real-thread Mutexes
- Acquire get mutex for all TLS threads
- Release release for current TLS thread
- May still be held by another TLS thread!
179Inter-TLS-thread Mutexes
- Should never interact between two real threads
- Implies no TLS ordering between TLS threads while mutex is held
- But what to do on a violation?
- Can't just throw away changes to memory
- Must undo operations performed in critical section
180Parallelizing Databases using TLS
- Split transactions into threads
- Threads created are large
- 60k instructions
- 16kB of speculative state
- More dependences between threads
- How do we design a machine which can handle these
large threads?
181The Old Way
P
P
P
P
P
P
P
P
Speculative state
L1
L1
L1
L1
L1
L1
L1
L1
L2
L2
Committed state
L3
Memory System
182The Old Way
- Advantages
- Each epoch has its own L1 cache
- Epoch state does not intermix
- Disadvantages
- L1 cache is too small!
- Full cache → dead meat
- No shared speculative memory
183The New Way
- L2 cache is huge!
- State of the art in caches, Power5
- 1.92MB 10-way L2
- 32kB 4-way L1
- Shared speculative memory for free
- Keeps TLS logic off of the critical path
184TLS Shared L2 Design
- L1: write-through, write-no-allocate (Culler, Singh & Gupta 99)
- Easy to understand and reason about
- Writes visible to L2 → simplifies shared speculative memory
- L2 cache: shared cache architecture with replication
- Rest of memory: distributed TLS coherence
185TLS Shared L2 Design
- Explain from the top down
P
P
P
P
P
P
P
P
Cached speculative state
L1
L1
L1
L1
L1
L1
L1
L1
L2
L2
Real speculative state
L3
Memory System
186TLS Shared L2 Design
P
P
P
P
P
P
P
P
Cached speculative state
L1
L1
L1
L1
L1
L1
L1
L1
L2
L2
Real speculative state
L3
Memory System
187Part II Dealing with dependences
188Predictor Design
- How do you design a predictor that
- Identifies violating loads
- Identifies the last store that causes them
- Only triggers when they cause a problem
- Has very very high accuracy
- ???
189Sub-epoch design
- Like checkpointing
- Leave holes in epoch space
- Every 5k instructions start a new epoch
- Uses more cache to buffer changes
- More strain on associativity/victim cache
- Uses more epoch contexts
190Summary
- Supporting large epochs needs
- Buffer state in L2 instead of L1
- Shared speculative memory
- Replication
- Victim cache
- Sub-epochs
191Any questions?