1. B-trees - Hashing
2. Review: B-trees and B+-trees
- Multilevel, disk-aware, balanced index methods
- primary or secondary
- dense or sparse
- supports selection and range queries
- B+-trees: the most common indexing structure in databases - all actual values stored on leaf nodes.
- Optimality:
  - space O(N/B), updates O(log_B(N/B)), queries O(log_B(N/B) + K/B) (B is the fan-out of a node, K the number of matching entries)
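For a concrete feel (illustrative numbers, not from the slides): with fan-out B = 100 and N = 10^8 entries, the index occupies about N/B = 10^6 blocks, a point lookup reads about log_100(10^6) = 3 index levels, and a range query returning K matches scans roughly K/B further leaf blocks.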
3. B+-tree Example
[Figure: B+-tree with root (100); internal nodes (30) and (120 150 180); leaves (3 5 11), (30 35), (100 101 110), (120 130), (150 156 179), (180 200).]
4. B+-tree rules: tree of order n
- (1) All leaves at same lowest level (balanced tree)
- (2) Pointers in leaves point to records except for sequence pointer
- (3) Number of pointers/keys for a B+-tree:

                        Max ptrs   Max keys   Min ptrs    Min keys
  Non-leaf (non-root)      n         n-1      ⌈n/2⌉       ⌈n/2⌉ - 1
  Leaf (non-root)          n         n-1      ⌈(n-1)/2⌉   ⌈(n-1)/2⌉
  Root                     n         n-1      2           1
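For instance, plugging n = 4 (the order used in the insertion examples below) into the table: a non-leaf node holds 2-4 pointers and 1-3 keys, a leaf holds ⌈3/2⌉ = 2 to 3 keys, and the root may shrink to 2 pointers and 1 key.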
5. Insert into B+-tree
- (a) simple case: space available in leaf
- (b) leaf overflow
- (c) non-leaf overflow
- (d) new root
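A minimal Python sketch of the leaf-level steps, cases (a) and (b); class and function names are illustrative, and the parent/root updates of cases (c) and (d) are omitted:

```python
import bisect

class Leaf:
    def __init__(self, order):
        self.order = order        # a leaf holds at most order - 1 keys
        self.keys = []
        self.next = None          # sequence pointer to the next leaf

def insert_into_leaf(leaf, key):
    """Insert key; on overflow, split and return (new_leaf, separator)."""
    bisect.insort(leaf.keys, key)
    if len(leaf.keys) < leaf.order:       # case (a): space available
        return None
    mid = len(leaf.keys) // 2             # case (b): leaf overflow -> split
    new = Leaf(leaf.order)
    new.keys, leaf.keys = leaf.keys[mid:], leaf.keys[:mid]
    new.next, leaf.next = leaf.next, new  # keep the leaf chain linked
    return new, new.keys[0]               # separator goes up to the parent
```

The separator returned by a leaf split is copied (not moved) into the parent, since in a B+-tree every actual value must remain on a leaf.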
6.-9. Insertion examples (n = 4)
[Figures: four worked insertions on small B+-trees illustrating cases (a)-(d); visible node contents include root (100) with subtrees (30 -> 3 5 11, 30 31) and (120 150 180 -> 120 130, 150 156 179, 180 200), and a tree (10 20 30 -> 1 2 3, 10 12, 20 25, 30 32 40).]
10. Deletion from B+-tree
- (a) Simple case - no example
- (b) Coalesce with neighbor (sibling)
- (c) Re-distribute keys
- (d) Cases (b) or (c) at non-leaf
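A sketch of the leaf-level decision between cases (b) and (c), assuming the minimum occupancy from the order-n table; `parent.update_separator` and `parent.remove_child` are hypothetical helpers, not the slides' API:

```python
def handle_underflow(leaf, left_sibling, parent, min_keys):
    """After a deletion leaves `leaf` below min_keys (names illustrative)."""
    if len(left_sibling.keys) > min_keys:
        # case (c): sibling can spare a key -- redistribute
        leaf.keys.insert(0, left_sibling.keys.pop())
        parent.update_separator(leaf, leaf.keys[0])   # hypothetical helper
    else:
        # case (b): both nodes fit in one page -- coalesce into the sibling,
        # then drop `leaf` and its separator from the parent
        left_sibling.keys.extend(leaf.keys)
        parent.remove_child(leaf)                     # hypothetical helper
```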
11. (b) Coalesce with sibling
- Delete 50
[Figure: n = 5 tree with parent (10 40 100) over leaves (10 20 30) and (40 50); after deleting 50, the underfull leaf coalesces with its sibling into (10 20 30 40).]
12. (c) Redistribute keys
- Delete 50
[Figure: n = 5 tree with parent (10 40 100) over leaves (10 20 30 35) and (40 50); after deleting 50, key 35 moves from the left sibling into the underfull leaf and the parent separator becomes 35.]
13. (d) Non-leaf coalesce
- Delete 37
[Figure: n = 5 tree with root (25), internal nodes (10 20) and (30 40), and leaves (1 3), (10 14), (20 22), (25 26), (30 37), (40 45); deleting 37 causes an underflow that propagates upward, forcing a coalesce at the non-leaf level.]
14. Bulk Loading of a B+ Tree
- If we have a large collection of records, and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow.
- Bulk Loading can be done much more efficiently.
- Initialization: Sort all data entries, insert pointer to first (leaf) page in a new (root) page.
[Figure: new root page pointing to the first of the sorted pages of data entries not yet in the B+ tree.]
15. Bulk Loading (Contd.)
- Index entries for leaf pages always entered into right-most index page just above leaf level.
- Assume pages in the rightmost path to have double page size.
- Split when double plus one.
[Figure: two snapshots of the build over sorted data entry pages (3 4), (6 9), (10 11), (12 13), (20 22), (23 31), (35 36), (38 41), (44 ...), the rest not yet in the B+ tree. First the root holds (6 10 12 20); after the rightmost index page splits, the root holds (12) over index pages (6 10) and (20 23).]
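A simplified sketch of the idea, assuming the entries are already sorted; it packs leaves left to right and builds each index level bottom-up, rather than maintaining only the rightmost path incrementally as the slides do:

```python
def bulk_load(entries, fanout):
    """Pack sorted data entries into leaves, then build index levels
    bottom-up until a single root remains (simplified sketch)."""
    # leaves are filled sequentially -- this is why bulk loading yields
    # sequentially stored leaves and few I/Os
    nodes = [entries[i:i + fanout] for i in range(0, len(entries), fanout)]
    keys = [node[0] for node in nodes]      # smallest key under each node
    while len(nodes) > 1:
        parents, pkeys = [], []
        for i in range(0, len(nodes), fanout):
            # an index page pairs separator keys with child pointers
            parents.append(list(zip(keys[i:i + fanout], nodes[i:i + fanout])))
            pkeys.append(keys[i])
        nodes, keys = parents, pkeys
    return nodes[0]                          # the root

# e.g. bulk_load([3, 4, 6, 9, 10, 11, 12, 13, 20, 22, 23], 2)
```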
16. Summary of Bulk Loading
- Option 1: multiple inserts.
  - Slow.
  - Does not give sequential storage of leaves.
- Option 2: Bulk Loading
  - Has advantages for concurrency control.
  - Fewer I/Os during build.
  - Leaves will be stored sequentially (and linked, of course).
  - Can control fill factor on pages.
17. A Note on Order
- Order (d) concept replaced by physical space criterion in practice (at least half-full).
- Index pages can typically hold many more entries than leaf pages.
- Variable sized records and search keys mean different nodes will contain different numbers of entries.
- Even with fixed length fields, multiple records with the same search key value (duplicates) can lead to variable-sized data entries (if we use Alternative (3)).
- Many real systems are even sloppier than this: they only reclaim space when a page is completely empty.
18. Selection Queries
- The B+-tree is perfect, but... to answer a selection query (ssn = 10) it needs to traverse a full path. In practice, 3-4 block accesses (depending on the height of the tree, buffering). Any better approach?
- Yes! Hashing
  - static hashing
  - dynamic hashing
19. Hashing
- Hash-based indexes are best for equality selections. Cannot support range searches.
- Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees.
20. Static Hashing
- # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
- h(k) MOD N = bucket to which data entry with key k belongs. (N = # of buckets)
[Figure: key -> h -> h(key) mod N selects one of the primary bucket pages 0 .. N-1; full buckets chain to overflow pages.]
21. Static Hashing (Contd.)
- Buckets contain data entries.
- Hash fn works on search key field of record r. Use its value MOD N to distribute values over range 0 ... N-1.
- h(key) = (a * key + b) usually works well.
  - a and b are constants; lots is known about how to tune h.
- Long overflow chains can develop and degrade performance.
- Extendible and Linear Hashing: dynamic techniques to fix this problem.
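A minimal sketch of a static hash index using the slide's h(key) = (a * key + b) MOD N; the constants are arbitrary, and a growing Python list per bucket stands in for a fixed primary page plus its overflow chain:

```python
N, a, b = 8, 131, 17            # illustrative constants, not from the slides

def h(key):
    return (a * key + b) % N    # bucket number for key

buckets = [[] for _ in range(N)]    # N fixed buckets of (key, rid) entries

def insert(key, rid):
    buckets[h(key)].append((key, rid))   # a long list = a long overflow chain

def lookup(key):
    # equality search: hash once, scan only that bucket's chain
    return [rid for k, rid in buckets[h(key)] if k == key]
```

Note that a range query gains nothing here: hashing scatters adjacent keys across buckets, which is why hash indexes only help equality selections.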
22. Extendible Hashing
- Situation: Bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets?
  - Reading and writing all pages is expensive!
- Idea: Use directory of pointers to buckets, double # of buckets by doubling the directory, splitting just the bucket that overflowed!
  - Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page!
  - Trick lies in how hash function is adjusted!
23. Example
- Directory is array of size 4.
- Bucket for record r has entry with index = `global depth` least significant bits of h(r)
  - If h(r) = 5 = binary 101, it is in bucket pointed to by 01.
  - If h(r) = 7 = binary 111, it is in bucket pointed to by 11.
- (we denote r by h(r))
[Figure: GLOBAL DEPTH = 2; DIRECTORY entries 00, 01, 10, 11. Bucket A (LOCAL DEPTH 2): 4 12 32 16. Bucket B (LOCAL DEPTH 1): 1 5 7 13, shared by directory entries 01 and 11. Bucket C (LOCAL DEPTH 2): 10.]
24. Handling Inserts
- Find bucket where record belongs.
- If there's room, put it there.
- Else, if bucket is full, split it:
  - increment local depth of original page
  - allocate new page with new local depth
  - re-distribute records from original page
  - add entry for the new page to the directory
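A compact sketch of this insert procedure, including the directory-doubling case from the next slides; the bucket capacity, class names, and the use of the hash value itself as the record (as in the slides, which denote r by h(r)) are assumptions:

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth      # local depth
        self.items = []

class ExtendibleHash:
    CAP = 4                     # bucket capacity (assumed)

    def __init__(self):
        self.gd = 1             # global depth
        self.dir = [Bucket(1), Bucket(1)]

    def insert(self, h):        # h = hash value of the record
        b = self.dir[h & ((1 << self.gd) - 1)]   # gd least significant bits
        if len(b.items) < self.CAP:              # room: just put it there
            b.items.append(h)
            return
        if b.depth == self.gd:                   # split needs one more bit:
            self.dir += self.dir                 # double by copying (LSB trick)
            self.gd += 1
        b.depth += 1                             # split the full bucket
        image = Bucket(b.depth)                  # allocate its split image
        mask = 1 << (b.depth - 1)                # the newly used bit
        b.items, image.items = ([x for x in b.items if not x & mask],
                                [x for x in b.items if x & mask])
        for i in range(len(self.dir)):           # fix pointers to split image
            if self.dir[i] is b and i & mask:
                self.dir[i] = image
        self.insert(h)                           # retry; may split again
```

Feeding in the slides' values (4, 12, 32, 16, 1, 5, 7, 13, 10, then 21, 19, 15, then 20) exercises both a plain bucket split and a directory doubling; note that doubling is just `self.dir += self.dir`, after which only the pointers to the overflowed bucket's split image change.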
25. Example: Insert 21, then 19, 15
- 21 = 10101
- 19 = 10011
- 15 = 01111
[Figure: inserting 21 overflows Bucket B (local depth 1), which splits without doubling the directory. Afterwards: Bucket A (local depth 2): 4 12 32 16; Bucket B (2): 1 5 21 13; Bucket C (2): 10; new Bucket D (2): 7 19 15; DIRECTORY entries 00, 01, 10, 11 point to the data pages A, B, C, D.]
26. Insert h(r) = 20 (Causes Doubling)
[Figure: 20 lands in Bucket A, which is full and has local depth = global depth, so the directory doubles to global depth 3. Afterwards: Bucket A (local depth 3): 32 16; Bucket A2 (local depth 3): 4 12 20 (split image of Bucket A); Bucket B (2): 1 5 21 13; Bucket C (2): 10; Bucket D (2): 15 7 19; directory entries 000 .. 111.]
27. Points to Note
- 20 = binary 10100. Last 2 bits (00) tell us r belongs in either A or A2. Last 3 bits needed to tell which.
  - Global depth of directory: max # of bits needed to tell which bucket an entry belongs to.
  - Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
- When does bucket split cause directory doubling?
  - Before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth; directory is doubled by copying it over and fixing pointer to split image page.
28. Directory Doubling
- Why use least significant bits in directory?
  - Allows for doubling via copying!
[Figure: a 2-bit directory growing to 3 bits, traced for an entry with hash 6 = 110. With least significant bits the new directory is simply two copies of the old one; with most significant bits the entries must be interleaved. Least Significant vs. Most Significant.]
29. Comments on Extendible Hashing
- If directory fits in memory, equality search answered with one disk access; else two.
  - 100MB file, 100 bytes/rec, 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory.
- Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large.
- Multiple entries with same hash value cause problems!
- Delete: If removal of data entry makes bucket empty, can be merged with split image. If each directory element points to same bucket as its split image, can halve directory.