Title: Optimistic Intra-Transaction Parallelism using Thread Level Speculation
1Optimistic Intra-Transaction Parallelism using
Thread Level Speculation
- Chris Colohan (1), Anastassia Ailamaki (1),
- J. Gregory Steffan (2) and Todd C. Mowry (1,3)
- (1) Carnegie Mellon University
- (2) University of Toronto
- (3) Intel Research Pittsburgh
2Chip Multiprocessors are Here!
AMD Opteron
IBM Power 5
Intel Yonah
- 2 cores now, soon will have 4, 8, 16, or 32
- Multiple threads per core
- How do we best use them?
3Multi-Core Enhances Throughput
Database Server
Users
Cores can run concurrent transactions and improve
throughput
4Using Multiple Cores
Database Server
Users
Can multiple cores improve transaction latency?
5Parallelizing transactions
DBMS
SELECT cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock
    quantity--
    UPDATE stock WITH quantity
    INSERT item INTO order_line
- Intra-query parallelism
- Used for long-running queries (decision support)
- Does not work for short queries
- Short queries dominate in commercial workloads
6Parallelizing transactions
DBMS
SELECT cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock
    quantity--
    UPDATE stock WITH quantity
    INSERT item INTO order_line
- Intra-transaction parallelism
- Each thread spans multiple queries
- Hard to add to existing systems!
- Need to change interface, add latches and locks,
worry about correctness of parallel execution
7Parallelizing transactions
DBMS
SELECT cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock
    quantity--
    UPDATE stock WITH quantity
    INSERT item INTO order_line
- Intra-transaction parallelism
- Breaks transaction into threads
- Hard to add to existing systems!
- Need to change interface, add latches and locks,
worry about correctness of parallel execution
Thread Level Speculation (TLS) makes
parallelization easier.
8Thread Level Speculation (TLS)
[Diagram: epochs p and q run one after the other in Sequential execution, and concurrently in Parallel execution]
9Thread Level Speculation (TLS)
- Use epochs
- Detect violations
- Restart to recover
- Buffer state
- Oldest epoch
- Never restarts
- No buffering
- Worst case
- Sequential
- Best case
- Fully parallel
Epoch 1
Epoch 2
[Diagram: Sequential vs. Parallel execution; in Parallel, epoch 2 loads a value that epoch 1 later stores → Violation!; epoch 2 restarts (R2) and re-executes]
Data dependences limit performance.
10A Coordinated Effort
Choose epoch boundaries
TransactionProgrammer
DBMS Programmer
Remove performance bottlenecks
Hardware Developer
Add TLS support to architecture
11So what's new?
- Intra-transaction parallelism
- Without changing the transactions
- With minor changes to the DBMS
- Without having to worry about locking
- Without introducing concurrency bugs
- With good performance
- Halve transaction latency on four cores
12Related Work
- Optimistic Concurrency Control (Kung 82)
- Sagas (Garcia-Molina & Salem 87)
- Transaction chopping (Shasha 95)
13Outline
- Introduction
- Related work
- Dividing transactions into epochs
- Removing bottlenecks in the DBMS
- Results
- Conclusions
14Case Study New Order (TPC-C)
GET cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock WHERE i_id = item
    UPDATE stock WITH quantity-1 WHERE i_id = item
    INSERT item INTO order_line
- Only dependence is the quantity field
- Very unlikely to occur (1/100,000)
15Case Study New Order (TPC-C)
GET cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
foreach (item):
    GET quantity FROM stock WHERE i_id = item
    UPDATE stock WITH quantity-1 WHERE i_id = item
    INSERT item INTO order_line

GET cust_info FROM customer
UPDATE district WITH order_id
INSERT order_id INTO new_order
TLS_foreach (item):
    GET quantity FROM stock WHERE i_id = item
    UPDATE stock WITH quantity-1 WHERE i_id = item
    INSERT item INTO order_line
16Outline
- Introduction
- Related work
- Dividing transactions into epochs
- Removing bottlenecks in the DBMS
- Results
- Conclusions
17Dependences in DBMS
18Dependences in DBMS
- Dependences serialize execution!
- Example: statistics gathering
- pages_pinned
- TLS maintains serial ordering of increments
- To remove, use per-CPU counters
- Performance tuning
- Profile execution
- Remove bottleneck dependence
- Repeat
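The per-CPU counter fix can be sketched in a few lines of C. This is a minimal single-threaded illustration of the idea, not BerkeleyDB code; the type names and the cache-line padding constant are assumptions.

```c
#include <stdint.h>

#define NCPUS 4
#define CACHE_LINE 64  /* assumed line size; keeps slots on separate lines */

/* One statistics counter slot per CPU, padded to a cache line so that
 * increments by different CPUs never touch the same line, and TLS never
 * sees a cross-epoch dependence on the counter. */
struct percpu_counter {
    struct {
        uint64_t v;
        char pad[CACHE_LINE - sizeof(uint64_t)];
    } slot[NCPUS];
};

/* Each CPU bumps only its own slot: increments are no longer serialized. */
static void counter_inc(struct percpu_counter *c, int cpu)
{
    c->slot[cpu].v++;
}

/* Statistics are read rarely, so summing all slots on read is acceptable. */
static uint64_t counter_read(const struct percpu_counter *c)
{
    uint64_t total = 0;
    for (int i = 0; i < NCPUS; i++)
        total += c->slot[i].v;
    return total;
}
```

The read side pays a small cost (summing NCPUS slots), which is the right trade for a counter that is written constantly and read only when someone asks for statistics.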
19Buffer Pool Management
CPU
get_page(5)
put_page(5)
Buffer Pool
ref 1
ref 0
20Buffer Pool Management
CPU
get_page(5)
get_page(5)
put_page(5)
put_page(5)
get_page(5)
Buffer Pool
put_page(5)
TLS ensures first epoch gets page first. Who
cares?
ref 0
- TLS maintains original load/store order
- Sometimes this is not needed
21Buffer Pool Management
- Escape speculation
- Invoke operation
- Store undo function
- Resume speculation
CPU
get_page(5)
get_page(5)
get_page(5)
put_page(5)
put_page(5)
put_page(5)
get_page(5)
Buffer Pool
put_page(5)
ref 0
- Isolated: undoing get_page will not affect other transactions
- Undoable: we have an operation (put_page) which returns the system to its initial state
22Buffer Pool Management
CPU
get_page(5)
get_page(5)
get_page(5)
put_page(5)
get_page(5)
Buffer Pool
Not undoable!
ref 0
23Buffer Pool Management
CPU
get_page(5)
get_page(5)
get_page(5)
put_page(5)
Buffer Pool
ref 0
- Delay put_page until end of epoch
- Avoid dependence
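The "delay put_page until the end of the epoch" idea can be sketched as a small deferred-release queue. This is a toy model with made-up types (`epoch_ctx`, a refcount array standing in for the buffer pool), not the real interface.

```c
#define MAX_DEFERRED 32
#define NPAGES 16

static int ref_count[NPAGES];        /* toy buffer pool: pin count per page */

typedef struct {
    int pending[MAX_DEFERRED];       /* pages whose release is postponed */
    int n;
} epoch_ctx;

static void get_page(epoch_ctx *e, int page)
{
    (void)e;
    ref_count[page]++;               /* pin the page */
}

/* put_page is not undoable (once unpinned, the page could be evicted), so
 * a speculative epoch only records the release instead of performing it. */
static void put_page_deferred(epoch_ctx *e, int page)
{
    e->pending[e->n++] = page;
}

/* Once the epoch commits it is non-speculative and the queued releases
 * can safely be applied.  On a violation the queue is simply discarded. */
static void epoch_commit(epoch_ctx *e)
{
    for (int i = 0; i < e->n; i++)
        ref_count[e->pending[i]]--;
    e->n = 0;
}
```

Pinning a page slightly longer than necessary is harmless; releasing it too early (and having it evicted under a speculative epoch) is not, which is why only the release is deferred.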
24Removing Bottleneck Dependences
- We introduce three techniques
- Delay operations until non-speculative
- Mutex and lock acquire and release
- Buffer pool, memory, and cursor release
- Log sequence number assignment
- Escape speculation
- Buffer pool, memory, and cursor allocation
- Traditional parallelization
- Memory allocation, cursor pool, error checks,
false sharing
25Outline
- Introduction
- Related work
- Dividing transactions into epochs
- Removing bottlenecks in the DBMS
- Results
- Conclusions
26Experimental Setup
- Detailed simulation
- Superscalar, out-of-order, 128-entry reorder buffer
- Memory hierarchy modeled in detail
- TPC-C transactions on BerkeleyDB
- In-core database
- Single user
- Single warehouse
- Measure interval of 100 transactions
- Measuring latency not throughput
27Optimizing the DBMS New Order
[Chart: time (normalized) vs. optimization stage, starting from Sequential; annotations: 26% improvement, other CPUs not helping, can't optimize much more, cache misses increase]
28Optimizing the DBMS New Order
[Chart: time (normalized) after each optimization, starting from Sequential]
This process took me 30 days and <1,200 lines of code.
29Other TPC-C Transactions
[Chart: time (normalized) for New Order, Delivery, Stock Level, Payment, and Order Status; bars split into Idle CPU, Failed, Cache Miss, and Busy time]
30Conclusions
- TLS makes intra-transaction parallelism practical
- Reasonable changes to transaction, DBMS, and hardware
- Halve transaction latency
31Needed backup slides (not done yet)
- 2 proc. Results
- Shared caches may change how you want to extract parallelism!
- Just have lots of transactions: no sharing
- TLS may have more sharing
32Any questions?
- For more information, see
- www.colohan.com
33Backup Slides Follow
34LATCHES
35Latches
- Mutual exclusion between transactions
- Cause violations between epochs
- Read-test-write cycle → RAW
- Not needed between epochs
- TLS already provides mutual exclusion!
36Latches Aggressive Acquire
Acquire; latch_cnt++; work; latch_cnt--
latch_cnt++; work; (enqueue release)
latch_cnt++; work; (enqueue release)
Commit; work; latch_cnt--
Commit; work; latch_cnt--; Release
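The counting scheme behind aggressive acquire can be sketched as follows. This is a toy, single-threaded model: `latch_held` stands in for the real latch, so only the bookkeeping is shown.

```c
/* Aggressive acquire: the first epoch to enter actually takes the latch;
 * later epochs just bump a shared count and enqueue their release.  The
 * latch is really released only when the last epoch in the chain commits. */
static int latch_held = 0;   /* stand-in for the real latch */
static int latch_cnt  = 0;   /* epochs currently inside the critical section */

static void aggressive_acquire(void)
{
    if (latch_cnt == 0)
        latch_held = 1;      /* the real acquire happens only once */
    latch_cnt++;
}

static void commit_release(void)
{
    if (--latch_cnt == 0)
        latch_held = 0;      /* the real release, by the last committer */
}
```

TLS already keeps the epochs ordered, so the latch only has to exclude other transactions, not the epochs within this one.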
37Latches Lazy Acquire
Acquire; work; Release
(enqueue acquire); work; (enqueue release)
(enqueue acquire); work; (enqueue release)
Acquire; Commit; work; Release
Acquire; Commit; work; Release
38HARDWARE
39TLS in Database Systems
- Large epochs
- More dependences
- Must tolerate
- More state
- Bigger buffers
Non-Database TLS
TLS in Database Systems
40Feedback Loop
for() do_work()
41Violations Feedback
[Diagram: epoch 2 loads a value that epoch 1 later stores → Violation!; epoch 2 restarts (R2)]
42Eliminating Violations
0x0FD8 → 0xFD20, 0x0FC0 → 0xFC18
43Tolerating Violations Sub-epochs
Violation!
q
Sub-epochs
44Sub-epochs
- Started periodically by hardware
- How many?
- When to start?
- Hardware implementation
- Just like epochs
- Use more epoch contexts
- No need to check violations between sub-epochs
within an epoch
Violation!
q
Sub-epochs
45Old TLS Design
Buffer speculative state in write back L1 cache
CPU
CPU
CPU
CPU
L1
L1
L1
L1
Restart by invalidating speculative lines
Invalidation
Detect violations through invalidations
- Problems
- L1 cache not large enough
- Later epochs only get values on commit
L2
Rest of system only sees committed data
Rest of memory system
46New Cache Design
CPU
CPU
CPU
CPU
Speculative writes immediately visible to L2 (and
later epochs)
L1
L1
L1
L1
Restart by invalidating speculative lines
Buffer speculative and non-speculative state for
all epochs in L2
L2
L2
Invalidation
Detect violations at lookup time
Rest of memory system
Invalidation coherence between L2 caches
47New Features
New!
CPU
CPU
CPU
CPU
Speculative state in L1 and L2 cache
L1
L1
L1
L1
Cache line replication (versions)
L2
L2
Data dependence tracking within cache
Speculative victim cache
Rest of memory system
48Scaling
Time (normalized)
49Evaluating a 4-CPU system
[Chart: time (normalized) on 4 CPUs. Bars: Baseline (original benchmark on 1 CPU), Sequential (parallelized benchmark on 1 CPU), No Sub-epoch (parallel execution without sub-epoch support), TLS (parallel execution), No Speculation (ignore violations: the Amdahl's Law limit)]
50Sub-epochs How many/How big?
- Supporting more sub-epochs is better
- Spacing depends on location of violations
- Even spacing is good enough
51Query Execution
- Actions taken by a query
- Bring pages into buffer pool
- Acquire and release latches and locks
- Allocate/free memory
- Allocate/free and use cursors
- Use B-trees
- Generate log entries
These generate violations.
52Applying TLS
- Parallelize loop
- Run benchmark
- Remove bottleneck
- Go to 2
53Outline
TransactionProgrammer
DBMS Programmer
Hardware Developer
54Violation Prediction
55Violation Prediction
- Predictor problems
- Large epochs → many predictions
- Failed prediction → violation
- Incorrect prediction → large stall
- Two predictors required
- Last store
- Dependent load
Predictor
q
Done
q
Predict Dependences
56TLS Execution
CPU 1
CPU 2
CPU 3
CPU 4
p
L1
L1
L1
L1
Violation!
p
R2
q
L2
Rest of memory system
57TLS Execution
p
Violation!
p
p
R2
q
s
t
58TLS Execution
p
Violation!
p
p
R2
q
s
t
59TLS Execution
p
Violation!
p
p
R2
q
60TLS Execution
p
Violation!
p
p
R2
q
q
61TLS Execution
p
Violation!
p
p
R2
q
q
62Replication
p
Violation!
p
p
R2
q
q
q
Can't invalidate a line if it contains two epochs'
changes
63Replication
p
Violation!
p
p
R2
q
q
q
q
64Replication
p
Violation!
p
p
R2
q
q
q
q
- Makes epochs independent
- Enables sub-epochs
65Sub-epochs
p
1a
q
p
p
1b
q
q
p
1c
q
q
1d
p
- Uses more epoch contexts
- Detection/buffering/rewind is free
- More replication
- Speculative victim cache
66get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ Wraps get_page()
67get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ No violations while calling get_page()
68get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ May get bad input data from speculative thread!
69get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ Only one epoch per transaction at a time
70get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
→ How to undo get_page()
71get_page() wrapper
- Isolated: undoing this operation does not cause cascading aborts
- Undoable: easy way to return system to initial state
- Can also be used for
- Cursor management
- malloc()
page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();
    check_get_arguments(id);
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);
    tls_resume_speculation();
    return ret;
}
72TPC-C Benchmark
Company
Warehouse 1
Warehouse W
District 1
District 2
District 10
Cust 1
Cust 2
Cust 3k
73TPC-C Benchmark
[Diagram: TPC-C schema with cardinalities: Warehouse (W), District (W×10), Customer (W×30k), History (W×30k), Stock (W×100k), New Order (W×9k), Order (W×30k), Order Line (W×300k, 5-15 per order), Item (100k)]
74What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
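The violation detection the next slides walk through can be modeled with per-epoch load sets: a store by an earlier epoch violates any later epoch that has already loaded the same address. A simplified sketch (real hardware tracks this per cache line via SL bits, not per address, and the names here are illustrative):

```c
#include <stdbool.h>

#define NEPOCH 4
#define NADDR  64

/* Per-epoch "speculatively loaded" bits, kept per address here for
 * simplicity (hardware keeps them per cache line). */
static bool loaded[NEPOCH][NADDR];

static void tls_load(int epoch, int addr)
{
    loaded[epoch][addr] = true;
}

/* A store violates any LATER epoch that has already loaded the same
 * address, since that epoch consumed a stale value.  Returns the first
 * violated epoch, or -1 if the store is safe. */
static int tls_store(int epoch, int addr)
{
    for (int later = epoch + 1; later < NEPOCH; later++)
        if (loaded[later][addr])
            return later;
    return -1;
}
```

In the hash example, epoch 1 stores hash[10] after epoch 4 has loaded it, so epoch 4 is violated and must restart; stores to addresses no later epoch has read commit without incident.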
75What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
Thread 1
Thread 2
Thread 3
Thread 4
x = hash[3] ... hash[10] = y ...
x = hash[19] ... hash[21] = y ...
x = hash[33] ... hash[30] = y ...
x = hash[10] ... hash[25] = y ...
76What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
Thread 1
Thread 2
Thread 3
Thread 4
x = hash[3] ... hash[10] = y ...
x = hash[19] ... hash[21] = y ...
x = hash[33] ... hash[30] = y ...
x = hash[10] ... hash[25] = y ...
Violation!
77What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
Thread 1
Thread 2
Thread 3
Thread 4
x = hash[3] ... hash[10] = y ... attempt_commit()
x = hash[19] ... hash[21] = y ... attempt_commit()
x = hash[33] ... hash[30] = y ... attempt_commit()
x = hash[10] ... hash[25] = y ... attempt_commit()
Violation!
?
?
?
?
78What is TLS?
while (cond) {
    x = hash[i];
    ...
    hash[j] = y;
    ...
}
Thread 1
Thread 2
Thread 3
Thread 4
x = hash[3] ... hash[10] = y ... attempt_commit()
x = hash[19] ... hash[21] = y ... attempt_commit()
x = hash[33] ... hash[30] = y ... attempt_commit()
x = hash[10] ... hash[25] = y ... attempt_commit()
Violation!
?
?
?
?
Redo
Thread 4
x = hash[10] ... hash[25] = y ... attempt_commit()
?
79TLS Hardware Design
- What's new?
- Large threads
- Epochs will communicate
- Complex control flow
- Huge legacy code base
- How does hardware change?
- Store state in L2 instead of L1
- Reversible atomic operations
- Tolerate dependences
- Aggressive update propagation (implicit forwarding)
- Sub-epochs
80L1 Cache Line
Valid
Data
LRU
Tag
SL
SM
- SL bit
- L2 cache knows this line has been speculatively loaded
- On violation or commit: clear
- SM bit
- This line contains speculative changes
- On commit: clear
- On violation: SM → Invalid
- → Otherwise, just like a normal cache
81Escaping Speculation
Valid
Data
Stale
LRU
Tag
SL
SM
- → Speculative epoch wants to make a system-visible change!
- Ignore SM lines while escaped
- Stale bit
- This line may be outdated by speculative work
- On violation or commit: clear
82L1 to L2 communication
- L2 sees all stores (write through)
- L2 sees first load of an epoch
- NotifySL message
- → L2 can track data dependences!
83L1 Changes Summary
Valid
Data
Stale
LRU
Tag
SL
SM
- Add three bits to each line
- SL
- SM
- Stale
- Modify tag match to recognize bits
- Add queue of NotifySL requests
84L2 Cache Line
Fine Grained SM
CPU1
CPU2
Exclusive
Valid
Data
Dirty
LRU
Tag
SL
SL
SM
SM
- Cache line can be
- Modified by one CPU
- Loaded by multiple CPUs
85Cache Line Conflicts
- Three classes of conflict
- Epoch 2 stores, epoch 1 loads
- Need old version to load
- Epoch 1 stores, epoch 2 stores
- Need to keep changes separate
- Epoch 1 loads, epoch 2 stores
- Need to be able to discard line on violation
- → Need a way of storing multiple conflicting versions in the cache
86Cache line replication
- On conflict, replicate line
- Split line into two copies
- Divide SM and SL bits at split point
- Divide directory bits at split point
87Replication Problems
- Complicates line lookup
- Need to find all replicas and select best
- Best = most recent replica
- Change management
- On write, update all later copies
- Also need to find all more speculative replicas to check for violations
- On commit, must get rid of stale lines
- Invalidation Required Buffer (IRB)
88Victim Cache
- How do you deal with a full cache set?
- Use a victim cache
- Holds evicted lines without losing SM & SL bits
- Must be fast → every cache lookup needs to know
- Do I have the best replica of this line?
- Critical path
- Do I cause a violation?
- Not on critical path
89Summary of Hardware Support
- Sub-epochs
- Violations hurt less!
- Shared cache TLS support
- Faster communication
- More room to store state
- RAOs
- Don't speculate on known operations
- Reduces amount of speculative state
90Summary of Hardware Changes
- Sub-epochs
- Checkpoint register state
- Needs replicas in cache
- Shared cache TLS support
- Speculative L1
- Replication in L1 and L2
- Speculative victim cache
- Invalidation Required Buffer
- RAOs
- Suspend/resume speculation
- Mutexes
- Undo list
91TLS Execution
CPU
CPU
CPU
CPU
p
L1
L1
L1
L1
Invalidation
Violation!
p
p
q
R2
q
L2
Rest of memory system
p
p
p
q
q
95Problems with Old Cache Design
- Database epochs are large
- L1 cache not large enough
- Sub-epochs add more state
- L1 cache not associative enough
- Database epochs communicate
- L1 cache only communicates committed data
96Intro Summary
- TLS makes intra-transaction parallelism easy
- Divide transaction into epochs
- Hardware support
- Detect violations
- Restart to recover
- Sub-epochs mitigate penalty
- Buffer state
- New process
- Modify software → avoid violations → improve performance
97The Many Faces of Ogg
98The Many Faces of Ogg
99Removing Bottlenecks
- Three general techniques
- Partition data structures
- malloc
- Postpone operations until non-speculative
- Latches and locks, log entries
- Handle speculation manually
- Buffer pool
100Bottlenecks Encountered
- Buffer pool
- Latches and locks
- Malloc/free
- Cursor queues
- Error checks
- False sharing
- B-tree performance optimization
- Log entries
101The Many Faces of Ogg
102Performance on 4 CPUs
Unmodified benchmark
Modified benchmark
103Incremental Parallelization
4 CPUs
104Scaling
Unmodified benchmark
Modified benchmark
105Parallelization is Hard
[Chart: performance improvement vs. programmer effort, comparing hand parallelization (with repeated tuning steps) against a parallelizing compiler]
106Case Study New Order (TPC-C)
- Begin transaction
- End transaction
107Case Study New Order (TPC-C)
- Begin transaction
- Read customer info
- Read increment order
- Create new order
- End transaction
108Case Study New Order (TPC-C)
- Begin transaction
- Read customer info
- Read increment order
- Create new order
- For each item in order
- Get item info
- Decrement count in stock
- Record order info
-
- End transaction
109Case Study New Order (TPC-C)
- Begin transaction
- Read customer info
- Read increment order
- Create new order
- For each item in order
- Get item info
- Decrement count in stock
- Record order info
-
- End transaction
80% of transaction execution time
110Case Study New Order (TPC-C)
- Begin transaction
- Read customer info
- Read increment order
- Create new order
- For each item in order
- Get item info
- Decrement count in stock
- Record order info
-
- End transaction
80% of transaction execution time
111The Many Faces of Ogg
112Step 2 Changing the Software
113No problem!
- Loop is easy to parallelize using TLS!
- Not really
- Calls into DBMS invoke complex operations
- Ogg needs to do some work
- Many operations in DBMS are parallel
- Not written with TLS in mind!
114Resource Management
- Mutexes
- acquired and released
- Locks
- locked and unlocked
- Cursors
- pushed and popped from free stack
- Memory
- allocated and freed
- Buffer pool entries
- Acquired and released
115Mutexes Deadlock?
- Problem
- Re-ordered acquire/release operations!
- Possibly introduced deadlock?
- Solutions
- Avoidance
- Static acquire order
- Recovery
- Detect deadlock and violate
116Locks
- Like mutexes, but
- Allows multiple readers
- No memory overhead when not held
- Often held for much longer
- Treat similarly to mutexes
117Cursors
- Used for traversing B-trees
- Pre-allocated, kept in pools
118Maintaining Cursor Pool
head
Get
Use
Release
119Maintaining Cursor Pool
head
Get
Use
Release
120Maintaining Cursor Pool
head
Get
Use
Release
121Maintaining Cursor Pool
Violation!
head
Get
Get
Use
Use
Release
Release
122Parallelizing Cursor Pool
- Use per-CPU pools
- Modify code each CPU gets its own pool
- No sharing → no violations!
- Requires cpuid() instruction
Get
Get
head
head
Use
Use
Release
Release
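The per-CPU pool can be sketched as one free list per CPU. This is an illustrative toy: `cpuid()` is replaced by an explicit `cpu` argument, and the cursor type is a stub.

```c
#include <stddef.h>

#define NCPUS 4
#define POOL_SIZE 8

typedef struct cursor {
    struct cursor *next;
    /* ...real cursor state would live here... */
} cursor;

static cursor  storage[NCPUS][POOL_SIZE];
static cursor *head[NCPUS];          /* one free-list head per CPU */

static void pools_init(void)
{
    for (int c = 0; c < NCPUS; c++) {
        head[c] = NULL;
        for (int i = 0; i < POOL_SIZE; i++) {
            storage[c][i].next = head[c];
            head[c] = &storage[c][i];
        }
    }
}

/* Each CPU touches only its own head pointer, so epochs running on
 * different CPUs never share pool state and never violate on it. */
static cursor *cursor_get(int cpu)
{
    cursor *c = head[cpu];
    if (c != NULL)
        head[cpu] = c->next;
    return c;
}

static void cursor_release(int cpu, cursor *c)
{
    c->next = head[cpu];
    head[cpu] = c;
}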
123Memory Allocation
- Problem
- malloc() metadata causes dependences
- Solutions
- Per-cpu memory pools
- Parallelized free list
124The Log
- Append records to global log
- Appending causes dependence
- Can't parallelize
- Global log sequence number (LSN)
- Generate log records in buffers
- Assign LSNs when homefree
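Buffering log records and assigning LSNs at commit can be sketched like this. The types are toys, and `log_publish` stands in for whatever happens once the epoch is homefree.

```c
#define MAX_RECS 16

typedef struct {
    int  payload;   /* stand-in for the real log record body */
    long lsn;       /* -1 until the epoch commits */
} log_rec;

typedef struct {
    log_rec buf[MAX_RECS];   /* records generated while speculative */
    int     n;
} log_buffer;

static long next_lsn = 0;    /* the global log sequence number */

/* Speculative epochs append to a private buffer, so they never touch the
 * global LSN and never serialize on it. */
static void log_append(log_buffer *lb, int payload)
{
    lb->buf[lb->n].payload = payload;
    lb->buf[lb->n].lsn = -1;
    lb->n++;
}

/* Once the epoch is homefree (oldest, cannot restart), its records get
 * LSNs in commit order and can go to the real log. */
static void log_publish(log_buffer *lb)
{
    for (int i = 0; i < lb->n; i++)
        lb->buf[i].lsn = next_lsn++;
}
```

Because epochs commit in order, publishing at commit time still yields a monotonically increasing LSN sequence, which is all the recovery log requires.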
125B-Trees
- Leaf pages contain free space counts
- Inserts of random records o.k.
- Inserting adjacent records
- Dependence on decrementing count
- Page splits
- Infrequent
126Other Dependences
- Statistics gathering
- Error checks
- False sharing
127Related Work
- Lots of work in TLS
- Multiscalar (Wisconsin)
- Hydra (Stanford)
- IACOMA (Illinois)
- RAW (MIT)
- Hand parallelizing using TLS
- Manohar Prabhu and Kunle Olukotun (PPoPP03)
128Any questions?
129Why is this a problem?
- B-tree insertion into ORDERLINE table
- Key is ol_n
- DBMS does not know that keys will be sequential
- Each insert usually updates the same btree page
130Sequential Btree Inserts
[Figure: B-tree leaf pages; four sequential inserts (1-4) all land in the same leaf page, each consuming that page's free space]
131Improvement SM Versioning
- Blow away all SM lines on a violation?
- May be bad!
- Instead
- On primary violation
- Only invalidate locally modified SM lines
- On secondary violation
- Invalidate all SM lines
- Needs one more bit LocalSM
- May decrease number of misses to L2 on violation
132Outline
- Store state in L2 instead of L1
- Reversible atomic operations
- Tolerate dependences
- Aggressive update propagation (implicit forwarding)
- Sub-epochs
- Results and analysis
133Outline
- Store state in L2 instead of L1
- Reversible atomic operations
- Tolerate dependences
- Aggressive update propagation (implicit forwarding)
- Sub-epochs
- Results and analysis
134Tolerating dependences
- Aggressive update propagation
- Get for free!
- Sub-epochs
- Periodically checkpoint epochs
- Every N instructions?
- Picking N may be interesting
- Perhaps checkpoints could be set before the
location of previous violations?
135Outline
- Store state in L2 instead of L1
- Reversible atomic operations
- Tolerate dependences
- Aggressive update propagation (implicit forwarding)
- Sub-epochs
- Results and analysis
136Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Violations
- Cache effects
- Data dependences
137Why not faster?
- Possible reasons
- Idle cpus
- 9 epochs/region average
- Two bundles of four and one of one
- ¼ of cpu cycles wasted!
- RAO mutexes
- Violations
- Cache effects
- Data dependences
138Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Not implemented yet
- Ooops!
- Violations
- Cache effects
- Data dependences
139Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Violations
- 21/969 epochs violated
- Distance 1 magic synchronized
- 2.2Mcycles (over 4 cpus)
- About 1.5%
- Cache effects
- Data dependences
140Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Violations
- Cache effects
- Deserves its own slide. ?
- Data dependences
141Cache effects of speculation
- Only 20% of references are speculative!
- Speculative references have small impact on non-speculative hit rate (<1%)
- Speculative refs miss a lot in L1
- 9-15% for reads, 2-6% for writes
- L2 saw HUGE increase in traffic
- 152k refs to 3474k refs
- Spec/non spec lines are thrashing from L1s
142Why not faster?
- Possible reasons
- Idle cpus
- RAO mutexes
- Violations
- Cache effects
- Data dependences
- Oh yeah!
- Btree item count
- Split up btree insert?
- alloc and write
- Do alloc as RAO
- Needs more thought
143L2 Cache Line
Fine Grained SM
CPU1
CPU2
Exclusive
Valid
Data
Dirty
LRU
Tag
SL
SL
SM
SM
Set 1
Set 2
144Why are you here?
- Want faster database systems
- Have funky new hardware Thread Level
Speculation (TLS) - How can we apply TLS todatabase systems?
- Side question
- Is this a VLDB or an ASPLOS talk?
145How?
- Divide transaction into TLS-threads
- Run TLS-threads in parallel; maintain sequential semantics
- Profit!
146Why parallelize transactions?
- Decrease transaction latency
- Increase concurrency while avoiding the concurrency control bottleneck
- A.k.a. use more CPUs, same # of transactions
- The obvious
- Database performance matters
147Shopping List
- What do we need? (research scope)
- Cheap hardware
- Thread Level Speculation (TLS)
- Minor changes allowed.
- Important database application
- TPC-C
- Almost no changes allowed!
- Modular database system
- BerkeleyDB
- Some changes allowed.
148Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
149Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
150What's new?
- Database operations are
- Large
- Complex
- Large TLS-threads
- Lots of dependences
- Difficult to analyze
- Want
- Programmer optimization effort → faster program
151Hardware changes summary
- Must tolerate dependences
- Prediction?
- Implicit forwarding?
- May need larger caches
- May need larger associativity
152Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
153Parallelization Strategy
- Pick a benchmark
- Parallelize a loop
- Analyze dependences
- Optimize away dependences
- Evaluate performance
- If not satisfied, goto 3
154Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Resource management
- The log
- B-trees
- False sharing
- Results
- Conclusions
155Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
156Results
- Viola simulator
- Single CPI
- Perfect violation prediction
- No memory system
- 4 cpus
- Exhaustive dependence tracking
- Currently working on an out-of-order superscalar simulation (cello)
- 10 transaction warm-up
- Measure 100 transactions
157Outline
- TLS Hardware
- The Benchmark (TPC-C)
- Changing the database system
- Results
- Conclusions
158Conclusions
- TLS can improve transaction latency
- Violation predictors important
- Iff dependences must be tolerated
- TLS makes hand parallelizing easier
159Improving Database Performance
- How to improve performance
- Parallelize transaction
- Increase number of concurrent transactions
- Both of these require independence of database
operations!
160Case Study New Order (TPC-C)
- Begin transaction
- End transaction
161Case Study New Order (TPC-C)
- Begin transaction
- Read customer info (customer, warehouse)
- Read increment order (district)
- Create new order (orders, neworder)
- End transaction
162Case Study New Order (TPC-C)
- Begin transaction
- Read customer info (customer, warehouse)
- Read increment order (district)
- Create new order (orders, neworder)
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (orderline)
-
- End transaction
163Case Study New Order (TPC-C)
- Begin transaction
- Read customer info (customer, warehouse)
- Read increment order (district)
- Create new order (orders, neworder)
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (orderline)
-
- End transaction
Parallelize this loop
164Case Study New Order (TPC-C)
- Begin transaction
- Read customer info (customer, warehouse)
- Read increment order (district)
- Create new order (orders, neworder)
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (orderline)
-
- End transaction
Parallelize this loop
165Implementing on a Real DB
- Using BerkeleyDB
- Table = Database
- Give database any arbitrary key
- → will return arbitrary data (bytes)
- Use structs for keys and rows
- Database provides ACID through
- Transactions
- Locking (page level)
- Storage management
- Provides indexing using b-trees
166Parallelizing a Transaction
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (order line)
167Parallelizing a Transaction
- For each item in order
- Get item info (item)
- Decrement count in stock (stock)
- Record order info (order line)
- Get cursor from pool
- Use cursor to traverse b-tree
- Find row, lock page for row
- Release cursor to pool
168Maintaining Cursor Pool
head
Get
Use
Release
169Maintaining Cursor Pool
head
Get
Use
Release
170Maintaining Cursor Pool
head
Get
Use
Release
171Maintaining Cursor Pool
Violation!
head
Get
Get
Use
Use
Release
Release
172Parallelizing Cursor Pool 1
- Use per-CPU pools
- Modify code each CPU gets its own pool
- No sharing → no violations!
- Requires cpuid() instruction
173Parallelizing Cursor Pool 2
- Dequeue and enqueue atomic and unordered
- Delay enqueue until end of thread
- Forces separate pools
- Avoids modification of data struct
Get
head
Get
Use
Use
Release
Release
174Parallelizing Cursor Pool 3
- Atomic unordered dequeue enqueue
- Cursor struct is TLS unordered
- Struct defined as a byte range in memory
Get
Get
head
Get
Get
Use
Use
Use
Use
Release
Release
Release
Release
175Parallelizing Cursor Pool 4
- Mutex protect dequeue & enqueue; declare pointer to cursor struct to be TLS unordered
- Any access through pointer does not have TLS applied
- Pointer is tainted; any copies of it keep this property
176Problems with 3 & 4
- What exactly is the boundary of a structure?
- How do you express the concept of "object" in a loosely-typed language like C?
- A byte range or a pointer is only an approximation.
- Dynamically allocated sub-components?
177Mutexes in a TLS world
- Two types of threads
- real threads
- TLS threads
- Two types of mutexes
- Inter-real-thread
- Inter-TLS-thread
178Inter-real-thread Mutexes
- Acquire get mutex for all TLS threads
- Release release for current TLS thread
- May still be held by another TLS thread!
179Inter-TLS-thread Mutexes
- Should never interact between two real threads
- Implies no TLS ordering between TLS threads while mutex is held
- But what to do on a violation?
- Can't just throw away changes to memory
- Must undo operations performed in critical section
180Parallelizing Databases using TLS
- Split transactions into threads
- Threads created are large
- 60k instructions
- 16kB of speculative state
- More dependences between threads
- How do we design a machine which can handle these
large threads?
181The Old Way
P
P
P
P
P
P
P
P
Speculative state
L1
L1
L1
L1
L1
L1
L1
L1
L2
L2
Committed state
L3
Memory System
182The Old Way
- Advantages
- Each epoch has its own L1 cache
- Epoch state does not intermix
- Disadvantages
- L1 cache is too small!
- Full cache → dead meat
- No shared speculative memory
183The New Way
- L2 cache is huge!
- State of the art in caches, Power5
- 1.92MB 10-way L2
- 32kB 4-way L1
- Shared speculative memory for free
- Keeps TLS logic off of the critical path
184TLS Shared L2 Design
- L1: write-through, write-no-allocate (Culler, Singh & Gupta 99)
- Easy to understand and reason about
- Writes visible to L2 → simplifies shared speculative memory
- L2 cache: shared cache architecture with replication
- Rest of memory: distributed TLS coherence
185TLS Shared L2 Design
- Explain from the top down
P
P
P
P
P
P
P
P
Cached speculative state
L1
L1
L1
L1
L1
L1
L1
L1
L2
L2
Real speculative state
L3
Memory System
186TLS Shared L2 Design
P
P
P
P
P
P
P
P
Cached speculative state
L1
L1
L1
L1
L1
L1
L1
L1
L2
L2
Real speculative state
L3
Memory System
187Part II Dealing with dependences
188Predictor Design
- How do you design a predictor that
- Identifies violating loads
- Identifies the last store that causes them
- Only triggers when they cause a problem
- Has very very high accuracy
- ???
189Sub-epoch design
- Like checkpointing
- Leave holes in epoch space
- Every 5k instructions start a new epoch
- Uses more cache to buffer changes
- More strain on associativity/victim cache
- Uses more epoch contexts
190Summary
- Supporting large epochs needs
- Buffer state in L2 instead of L1
- Shared speculative memory
- Replication
- Victim cache
- Sub-epochs
191Any questions?