Title: The Single Node B-tree for Highly Concurrent Distributed Data Structures
1 The Single Node B-tree for Highly Concurrent Distributed Data Structures
by Barbara Hohlt
2 Why a B-tree DDS?
- To do range queries (the queries need NOT be degree-3 transaction protected)
- Need only sequential scans for related indexed items (retrieve mail messages 3-50, etc.)
- Performance impact illustrated later
3 Prototype DDS Distributed B-tree
[Figure: example of a distributed B-tree. Clients interact with any service front-end, as all persistent service state is in the DDS and is consistent throughout the entire cluster. The service interacts with the DDS via a library; the library is the 2PC coordinator, handles partitions, replication, etc., and exports the B-tree API. A brick is a durable single-node B-tree plus RPC skels for network access; a brick can be on the same node as the service. Each brick has its own storage, and each partition has 3 replicas in its group.]
4 Architecture
The service interacts with the DDS via a library; the library is the 2PC coordinator, handles partitioning, replication, etc., and exports the B-tree/HT API.
6 Component Layers
The application layer makes search and insert requests to a B-tree instance. The B-tree determines which data blocks it needs and fetches them from the global buffer cache. If the cache does not have the needed blocks, it fetches them from the global I/O core, transparently to the B-tree instance.
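A minimal sketch of this layering, with assumed names (cache_get, io_core_read, btree_search are illustrative, not the actual DDS interfaces): the B-tree asks the cache, and the cache falls through to the I/O core on a miss.

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define CACHE_SLOTS 8

typedef struct {
    long blockno;                /* -1 means the slot is empty */
    char data[BLOCK_SIZE];
} cache_slot_t;

static cache_slot_t cache[CACHE_SLOTS];

/* Global I/O core: the B-tree never calls this directly. */
static void io_core_read(long blockno, char *buf) {
    printf("io core: fetching block %ld from disk\n", blockno);
    memset(buf, 0, BLOCK_SIZE);  /* stand-in for a real disk read */
}

/* Global buffer cache: returns the block, fetching it from the
 * I/O core on a miss -- transparently to the B-tree instance. */
static char *cache_get(long blockno) {
    cache_slot_t *slot = &cache[blockno % CACHE_SLOTS];
    if (slot->blockno != blockno) {           /* miss */
        io_core_read(blockno, slot->data);
        slot->blockno = blockno;
    }
    return slot->data;
}

/* B-tree layer: decides which blocks it needs and asks the cache. */
static void btree_search(long root_blockno) {
    char *node = cache_get(root_blockno);
    (void)node;  /* a real tree would descend through child blocks */
}

int main(void) {
    for (int i = 0; i < CACHE_SLOTS; i++) cache[i].blockno = -1;
    btree_search(7);   /* miss: the cache goes to the I/O core */
    btree_search(7);   /* hit: served from the cache */
    return 0;
}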
8 API Flavor
- SN_BtreeCloseRequest, SN_BtreeCloseComplete
- SN_BtreeCreateRequest, SN_BtreeCreateComplete
- SN_BtreeOpenRequest, SN_BtreeOpenComplete
- SN_BtreeDestroyRequest, SN_BtreeDestroyComplete
- SN_BtreeReadRequest, SN_BtreeReadComplete
- SN_BtreeWriteRequest, SN_BtreeWriteComplete
- SN_BtreeRemoveRequest, SN_BtreeRemoveComplete
9 API Flavor, Cont'd.
- Distributed_BtreeCreateRequest, Distributed_BtreeCreateComplete
- Distributed_BtreeDestroyRequest, Distributed_BtreeDestroyComplete
- Distributed_BtreeReadRequest, Distributed_BtreeReadComplete
- Errors: timeout (even after retries), replica_dead, lockgrab_failed, doesn't exist, etc.
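The slides give only the message names; the sketch below assumes a split-phase calling convention to match them (the struct layout, status codes, and sn_btree_read are hypothetical, not the real interfaces):

#include <stdio.h>

typedef enum { BT_OK, BT_TIMEOUT, BT_REPLICA_DEAD,
               BT_LOCKGRAB_FAILED, BT_DOESNT_EXIST } bt_status_t;

/* Hypothetical request record: the caller supplies a completion
 * upcall, matching the Request/Complete pairing in the API names. */
typedef struct {
    long key;
    void (*complete)(bt_status_t status, const char *value);
} SN_BtreeReadRequest_t;

static void on_read_complete(bt_status_t status, const char *value) {
    if (status != BT_OK) {
        printf("read failed with status %d\n", status);  /* retry or report */
        return;
    }
    printf("read ok: %s\n", value);
}

/* A real brick would enqueue the request and upcall the completion
 * later; here we complete synchronously for illustration. */
static void sn_btree_read(SN_BtreeReadRequest_t *req) {
    req->complete(BT_OK, "value-for-key");
}

int main(void) {
    SN_BtreeReadRequest_t req = { 42, on_read_complete };
    sn_btree_read(&req);   /* issue request; completion is an upcall */
    return 0;
}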
10 Evaluation Metrics
- Speedup: performance versus resources (data size fixed)
- Scaleup: data size versus resources (performance fixed)
- Sizeup: performance versus data size
- Throughput: total number of reads/writes completed per second
- Latency: time to satisfy a single request
11 Single Node B-tree Performance
12 Single Node B-tree Performance
13 FSM-based Data Scheduling
- Scheduling is for:
  - Performance (including fairness and avoiding starvation)
  - Correctness/isolation
- This functionality has traditionally resided in two different modules (the kernel schedules threads, the app/database schedules locks), and each module is optimized individually
- Our claim: there can be significant performance wins from jointly optimizing both
14 How to Achieve Isolation?
- Use threads and locks
- Do careful scheduling (e.g. B-trees)
- Unify all scheduling decisions
  - Problem: such globally optimal scheduling is hard
  - In restricted settings, this is similar to hardware scoreboarding techniques
- A useful lesson for database concurrency: choose the order of operations to avoid conflicts (have a prepare/prefetch phase) so that you never lock across blocking I/O (Lesson: do not lock if you block); see the sketch after this list
- This can be implemented more naturally with asynchronous FSMs than with straight-line threaded code
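A minimal sketch of the prepare/prefetch lesson, with assumed names: every block a request will touch is fetched before any lock is taken, so no lock is ever held across blocking I/O.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 4

static bool in_cache[NBLOCKS];

static void io_fetch(int b) { in_cache[b] = true; }  /* may block */

/* Phase 1: prepare/prefetch -- no locks held, so blocking is harmless. */
static void prefetch_phase(const int *blocks, int n) {
    for (int i = 0; i < n; i++)
        if (!in_cache[blocks[i]])
            io_fetch(blocks[i]);
}

/* Phase 2: execute -- all blocks are resident, so locks are held
 * only across short, non-blocking critical sections. */
static void execute_phase(const int *blocks, int n) {
    for (int i = 0; i < n; i++)
        printf("lock, read, unlock block %d without blocking\n", blocks[i]);
}

int main(void) {
    int needed[] = { 0, 2, 3 };
    prefetch_phase(needed, 3);
    execute_phase(needed, 3);
    return 0;
}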
15 Benefits of Using FSMs/Events for Concurrency Control
- Control-flow based concurrency control, as opposed to lock-based concurrency control
- Can avoid wrong scheduling decisions
- Unnecessary locks can be eliminated
- Locks can be released faster
- More flexibility for concurrency control based on isolation requirements
- Explicit concurrency control also avoids deadlocks, priority inversions, race conditions, and convoy formations
16 Benefits of Using FSMs/Queues for Concurrency Control
- Control-flow based concurrency control using FSMs and queues, as opposed to lock-based concurrency control
- Can avoid wrong scheduling decisions
- Unnecessary locks can be eliminated
- Locks can be released faster
- More flexibility for concurrency control based on isolation requirements
- Explicit scheduling also avoids deadlocks, priority inversions, race conditions, and convoy formations; a queue-based sketch follows
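A sketch of the queue-based alternative, under an assumed single-dispatcher design: each block has a FIFO of pending operations, so two tasks touching the same block are isolated by queue order rather than by locks.

#include <stdio.h>

#define QLEN 8

typedef void (*op_fn)(int task);

typedef struct {           /* per-block FIFO of pending operations */
    op_fn ops[QLEN];
    int tasks[QLEN];
    int head, tail;
} block_queue_t;

static void enqueue(block_queue_t *q, op_fn f, int task) {
    q->ops[q->tail % QLEN] = f;     /* no overflow check in this sketch */
    q->tasks[q->tail % QLEN] = task;
    q->tail++;
}

/* The dispatcher drains each queue in order; two tasks touching
 * the same block are serialized by queue position, never by locks. */
static void dispatch(block_queue_t *q) {
    while (q->head < q->tail) {
        q->ops[q->head % QLEN](q->tasks[q->head % QLEN]);
        q->head++;
    }
}

static void read_op(int task)  { printf("T%d reads the block\n", task); }
static void write_op(int task) { printf("T%d writes the block\n", task); }

int main(void) {
    block_queue_t q = { .head = 0, .tail = 0 };
    enqueue(&q, read_op, 1);    /* T1 */
    enqueue(&q, write_op, 2);   /* T2 runs strictly after T1 */
    dispatch(&q);
    return 0;
}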
17 The Convoy Problem Illustrated
- Most tasks execute code like: lock(b); read(b); lock(b->next); unlock(b)
- Problem: if task T1 blocks on I/O for b4, then task T2 cannot unlock b3 to acquire a lock on b4, task T3 cannot unlock b2 to acquire a lock on b3, and so on, forming a convoy even though most blocks are in cache and each task may require only a finite number of locks
[Figure: blocks b1-b4 in a chain. b4 is locked by T1, which is blocked on I/O; b3 is locked by T2, waiting for the lock on b4; b2 is locked by T3, waiting for the lock on b3; b1 is locked by T4, waiting for the lock on b2.]
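The same chain, simulated in a short sketch (names assumed): one I/O stall at the head of the lock-coupling chain leaves every other task holding one lock and waiting for the next.

#include <stdio.h>

#define NBLOCKS 4

typedef struct {
    int held_by;       /* task holding the lock, or 0 */
    int waited_on_by;  /* task waiting for it, or 0 */
} block_t;

static block_t b[NBLOCKS + 1];   /* b[1]..b[4] */

int main(void) {
    /* T1 holds b4 and is blocked on I/O; every later task holds
     * one block while waiting for the next, as in the figure. */
    b[4].held_by = 1;                          /* T1: blocked on I/O */
    b[3].held_by = 2;  b[4].waited_on_by = 2;  /* T2 waits for b4 */
    b[2].held_by = 3;  b[3].waited_on_by = 3;  /* T3 waits for b3 */
    b[1].held_by = 4;  b[2].waited_on_by = 4;  /* T4 waits for b2 */

    for (int i = 4; i >= 1; i--) {
        printf("b%d: held by T%d", i, b[i].held_by);
        if (b[i].waited_on_by)
            printf(", T%d waiting for it", b[i].waited_on_by);
        printf("\n");
    }
    /* The whole chain stalls behind T1's single I/O even though
     * b1..b3 are in cache: that is the convoy. */
    return 0;
}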
18 Scheduling Based on Data Availability
- Two transactions T1 and T2 request blocks b1, b2 and b1, b3 respectively, and T1 acquires the lock on b1 first
- Problem: if T1 acquires a lock on b2 and blocks, T2 cannot make progress, even though T2 can access both b1 and b3
- Lesson: schedule depending on how data is available, not on how requests enter the system
19 Scheduling Based on Data Availability (Example of Misordering)
- Transferring funds from checking to savings:
- Begin(transaction)
- 1 read(checking_account)
- 2 read(savings_account)
- 3 read(teller) // in cache
- 4 read(bank) // in cache
- 5 update(savings_account)
- 6 update(checking_account)
- 7 update(teller)
- 8 update(bank)
- End(transaction)
If steps 3 and 4 were swapped with 1 and 2, we would block while holding locks on the bank and teller balances. In a global scheduling model, the ordering of reads does not matter, because a request does not start execution unless all the required data on its most probable execution path is available. A sketch of this policy follows.
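A sketch of the data-availability policy (block numbering and names assumed): the transfer request is dispatched only once every block on its probable execution path is resident, so it never blocks while holding locks.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 4

/* Blocks: 0 = checking, 1 = savings, 2 = teller, 3 = bank.
 * Teller and bank start out in cache, as in the example above. */
static bool in_cache[NBLOCKS] = { false, false, true, true };

typedef struct {
    const char *name;
    int blocks[NBLOCKS];
    int nblocks;
} request_t;

/* A request is ready only when every block it needs is resident. */
static bool ready(const request_t *r) {
    for (int i = 0; i < r->nblocks; i++)
        if (!in_cache[r->blocks[i]]) return false;
    return true;
}

int main(void) {
    request_t xfer = { "transfer", { 0, 1, 2, 3 }, 4 };
    if (!ready(&xfer)) {
        printf("%s: not ready; prefetch misses, hold no locks\n", xfer.name);
        in_cache[0] = in_cache[1] = true;   /* async I/O completes */
    }
    if (ready(&xfer))
        printf("%s: all blocks resident; execute without blocking\n",
               xfer.name);
    return 0;
}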
20 Distributed Synchronization
- Conventional lock-based implementations serialize in the lock manager code. In the example above, T1 serializes against T3, although T1 and T3 should ideally execute concurrently. Distributed synchronization on distinct queues is possible with FSMs running on multiprocessors, without requiring a static data partition; a sketch follows.
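A sketch of this idea under an assumed two-queue design: operations on unrelated blocks land on distinct queues, which separate processors drain with no shared lock manager between them.

#include <pthread.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int nops;
} queue_t;

/* Each worker drains one queue; T1's queue never serializes
 * against T3's, because they share no state. */
static void *drain(void *arg) {
    queue_t *q = (queue_t *)arg;
    for (int i = 0; i < q->nops; i++)
        printf("%s: op %d executed in FIFO order\n", q->name, i);
    return NULL;
}

int main(void) {
    queue_t q1 = { "queue-b1", 3 };  /* e.g. T1's blocks */
    queue_t q2 = { "queue-b3", 3 };  /* e.g. T3's blocks */
    pthread_t w1, w2;
    pthread_create(&w1, NULL, drain, &q1);
    pthread_create(&w2, NULL, drain, &q2);
    pthread_join(w1, NULL);
    pthread_join(w2, NULL);
    return 0;
}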
21 Single Node B-tree Brick
22 FSM for Non-blocking Fetch
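The FSM diagram itself is not in the transcript; the sketch below is an assumed reconstruction of a non-blocking fetch machine: the fetch advances on events (hit, miss, I/O completion) and never blocks its caller.

#include <stdbool.h>
#include <stdio.h>

typedef enum { ST_START, ST_CHECK_CACHE, ST_WAIT_IO, ST_DONE } state_t;
typedef enum { EV_RUN, EV_MISS, EV_HIT, EV_IO_COMPLETE } event_t;

typedef struct {
    state_t state;
    long blockno;
} fetch_fsm_t;

/* Advance the FSM by one event; returns true when the fetch is
 * complete. A scheduler would call this from an event loop. */
static bool fetch_step(fetch_fsm_t *f, event_t ev) {
    switch (f->state) {
    case ST_START:
        if (ev == EV_RUN) f->state = ST_CHECK_CACHE;
        break;
    case ST_CHECK_CACHE:
        if (ev == EV_HIT) f->state = ST_DONE;           /* no I/O needed */
        else if (ev == EV_MISS) f->state = ST_WAIT_IO;  /* issue async I/O */
        break;
    case ST_WAIT_IO:
        if (ev == EV_IO_COMPLETE) f->state = ST_DONE;
        break;
    case ST_DONE:
        break;
    }
    return f->state == ST_DONE;
}

int main(void) {
    fetch_fsm_t f = { ST_START, 7 };
    fetch_step(&f, EV_RUN);
    fetch_step(&f, EV_MISS);          /* cache miss: async I/O issued */
    if (fetch_step(&f, EV_IO_COMPLETE))
        printf("block %ld fetched without blocking the FSM\n", f.blockno);
    return 0;
}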
23 Splitting node a into nodes a and b
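The split figure is likewise not in the transcript; this is a generic B-tree leaf split sketch (ORDER and the node layout are assumed, not the deck's actual code), moving the upper half of a full node a into a new sibling b around the median key.

#include <stdio.h>

#define ORDER 4   /* max keys per node, assumed */

typedef struct node {
    int keys[ORDER];
    int nkeys;
} node_t;

/* Split full node a: keep the lower half in a, move the upper half
 * to the new node b, and return the separator key that the parent
 * must insert between them. */
static int split(node_t *a, node_t *b) {
    int mid = a->nkeys / 2;
    b->nkeys = 0;
    for (int i = mid; i < a->nkeys; i++)
        b->keys[b->nkeys++] = a->keys[i];
    a->nkeys = mid;
    return b->keys[0];   /* separator copied up to the parent */
}

int main(void) {
    node_t a = { { 10, 20, 30, 40 }, 4 }, b;
    int sep = split(&a, &b);
    printf("a has %d keys, b has %d keys, separator %d\n",
           a.nkeys, b.nkeys, sep);
    return 0;
}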
24 A Single Node B-tree