1
B-trees - Hashing
2
Review: B-trees and B+-trees
  • Multilevel, disk-aware, balanced index methods
  • primary or secondary
  • dense or sparse
  • supports selection and range queries
  • B+-trees are the most common indexing structure
    in databases
  • all actual values are stored in the leaf nodes
    (a minimal node-layout sketch follows below)
  • Optimality
  • space O(N/B), updates O(log_B(N/B)),
    queries O(log_B(N/B) + K/B)
  • (B is the fan-out of a node, K the number of
    results reported)
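As a concrete picture of the two node kinds these bullets describe, here is a minimal sketch in Python (not from the slides; the class and field names are illustrative): non-leaf nodes hold keys and child pointers, leaf nodes hold the actual values plus a sequence pointer for range scans.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class InteriorNode:
    """Non-leaf node: up to n child pointers separated by n-1 keys."""
    keys: List[int] = field(default_factory=list)          # search keys
    children: List[object] = field(default_factory=list)   # child-node pointers

@dataclass
class LeafNode:
    """Leaf node: the actual values (or record pointers) live here."""
    keys: List[int] = field(default_factory=list)
    records: List[object] = field(default_factory=list)    # pointers to records
    next_leaf: Optional["LeafNode"] = None                 # sequence pointer for range scans
```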

3
B+tree Example
[Figure: a B+tree of height 3. Root key: 100. Second-level nodes: (30) and (120 150 180). Leaves: (3 5 11), (30 35), (100 101 110), (120 130), (150 156 179), (180 200), chained by sequence pointers.]
4
B+tree rules: tree of order n
  • (1) All leaves at same lowest level (balanced
    tree)
  • (2) Pointers in leaves point to records except
    for sequence pointer
  • (3) Number of pointers/keys for a B+tree (see the
    table below; a small helper restating it follows
    the table)
                       Max ptrs   Max keys   Min ptrs     Min keys
Non-leaf (non-root)    n          n-1        ⌈n/2⌉        ⌈n/2⌉ - 1
Leaf (non-root)        n          n-1        ⌈(n-1)/2⌉    ⌈(n-1)/2⌉
Root                   n          n-1        2            1
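The table can be restated as a small helper, shown here as an assumed illustration (the function name is made up); it simply returns the four bounds for a given order n, with ⌈·⌉ written as math.ceil.

```python
import math

def occupancy_bounds(n: int, node_type: str):
    """Return (max_ptrs, max_keys, min_ptrs, min_keys) for a B+tree of order n,
    per the table above. node_type is 'non-leaf', 'leaf', or 'root'."""
    if node_type == "non-leaf":
        return n, n - 1, math.ceil(n / 2), math.ceil(n / 2) - 1
    if node_type == "leaf":
        return n, n - 1, math.ceil((n - 1) / 2), math.ceil((n - 1) / 2)
    if node_type == "root":
        return n, n - 1, 2, 1
    raise ValueError(node_type)

# Example: order-4 tree
print(occupancy_bounds(4, "leaf"))   # (4, 3, 2, 2)
```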
5
Insert into B+tree
  • (a) simple case: space available in leaf
  • (b) leaf overflow
  • (c) non-leaf overflow
  • (d) new root
  (a sketch of the leaf-split step in case (b) follows below)
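A minimal sketch, assuming a plain sorted-list leaf representation, of cases (a) and (b): insert into a leaf and split it when it overflows, returning the key to copy into the parent. The capacity of 3 keys matches the n = 4 examples on the following slides; all names are illustrative.

```python
import bisect

LEAF_CAPACITY = 3  # max keys per leaf: n - 1 for the n = 4 examples below

def insert_into_leaf(leaf: list, key: int):
    """Insert key into a sorted leaf. Return (new_leaf_or_None, key_to_copy_up_or_None)."""
    bisect.insort(leaf, key)
    if len(leaf) <= LEAF_CAPACITY:          # case (a): space available
        return None, None
    mid = len(leaf) // 2                    # case (b): leaf overflow -> split
    new_leaf = leaf[mid:]                   # right half moves to a new page
    del leaf[mid:]
    return new_leaf, new_leaf[0]            # first key of new leaf is copied into the parent

leaf = [3, 5, 11]
new_leaf, up_key = insert_into_leaf(leaf, 7)
print(leaf, new_leaf, up_key)               # [3, 5] [7, 11] 7
```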

6
  • (a) Insert key 32

[Figure: n = 4. Subtree with non-leaf keys 100 and 30 and leaves (3 5 11) and (30 31); key 32 fits in leaf (30 31), the simple case.]
7
  • (b) Insert key 7

[Figure: n = 4. Same subtree; leaf (3 5 11) is full, so inserting 7 splits it into (3 5) and (7 11) and key 7 is copied up into the parent: the leaf-overflow case.]
8
  • (c) Insert key 160

[Figure: n = 4. Node with key 100 above a full non-leaf node (120 150 180) with leaves (150 156 179) and (180 200); inserting 160 splits the leaf, and the key pushed up overflows the non-leaf node, which must split in turn.]
9
  • (d) New root, insert 45

[Figure: n = 4. A full root (10 20 30) with leaves (1 2 3), (10 12), (20 25), (30 32 40); inserting 45 splits the last leaf, the root overflows and splits, and a new root is created.]
10
Deletion from B+tree
  • (a) Simple case - no example
  • (b) Coalesce with neighbor (sibling)
  • (c) Re-distribute keys
  • (d) Cases (b) or (c) at non-leaf
  (a sketch of the coalesce-vs-redistribute decision follows below)
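A minimal sketch (assumed, not the slides' algorithm) of the leaf-level decision between cases (b) and (c): coalesce when the combined entries fit in one page, otherwise redistribute. MAX_LEAF_KEYS = 4 matches the n = 5 examples on the following slides.

```python
MAX_LEAF_KEYS = 4  # n - 1 for the n = 5 examples below (an assumption)

def fix_underflow(leaf: list, sibling: list):
    """Called after a delete leaves `leaf` under-full.
    Coalesce with the sibling when the result fits in one page,
    otherwise redistribute keys between the two pages."""
    combined = sorted(sibling + leaf)
    if len(combined) <= MAX_LEAF_KEYS:                 # case (b): merge into one leaf
        return "coalesce", combined                    # parent drops one key/pointer
    mid = len(combined) // 2                           # case (c): split keys between pages
    return "redistribute", (combined[:mid], combined[mid:])  # parent separator becomes combined[mid]

# Slide 11: delete 50, leaves (10 20 30) and (40) -> coalesce into (10 20 30 40)
print(fix_underflow([40], [10, 20, 30]))
# Slide 12: delete 50, leaves (10 20 30 35) and (40) -> redistribute
# (this sketch rebalances evenly; the slide moves just key 35)
print(fix_underflow([40], [10, 20, 30, 35]))
```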

11
  • (b) Coalesce with sibling
  • Delete 50

[Figure: n = 5. Parent (10 40 100) with leaves (10 20 30) and (40 50). Deleting 50 leaves (40) under-full; it coalesces with its sibling into (10 20 30 40) and key 40 is removed from the parent.]
12
  • (c) Redistribute keys
  • Delete 50

[Figure: n = 5. Parent (10 40 100) with leaves (10 20 30 35) and (40 50). Deleting 50 leaves (40) under-full; the merged leaf would not fit, so key 35 is moved over from the sibling and the parent separator becomes 35.]
13
  • (d) Non-leaf coalesce
  • Delete 37

[Figure: n = 5. Root key 25 with non-leaf children (10 20) and (30 40) and leaves (1 3), (10 14), (20 22), (25 26), (30 37), (40 45). Deleting 37 under-fills a leaf, the resulting merge under-fills its parent, and the non-leaf nodes coalesce (25 is pulled down), shrinking the tree by a level.]
14
Bulk Loading of a B+ Tree
  • If we have a large collection of records, and we
    want to create a B+ tree on some field, doing so
    by repeatedly inserting records is very slow.
  • Bulk Loading can be done much more efficiently.
  • Initialization: sort all data entries, insert
    pointer to first (leaf) page in a new (root) page.

[Figure: a new root page pointing to the first of the sorted pages of data entries not yet in the B+ tree.]
15
Bulk Loading (Contd.)
  • Index entries for leaf pages are always entered
    into the right-most index page just above the leaf
    level.
  • Assume pages in the rightmost path have double
    page size.
  • Split when double plus one.
  (a simplified bulk-loading sketch follows below)
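A simplified bulk-loading sketch (assumed): it builds the tree level by level from the already-sorted entries rather than by inserting into the right-most index page as the slides describe, but it produces the same sequentially packed leaves. The fanout and names are illustrative.

```python
def bulk_load(sorted_keys, fanout=4):
    """Build B+tree levels bottom-up from already-sorted keys: pack leaves
    sequentially, then index the first key of each node at the level above."""
    leaves = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [leaves]
    while len(levels[-1]) > 1:
        firsts = [node[0] for node in levels[-1]]        # one (simplified) separator per child
        levels.append([firsts[i:i + fanout] for i in range(0, len(firsts), fanout)])
    return levels                                        # leaves first, root level last

# the sorted data entries from the figure below
keys = [3, 4, 6, 9, 10, 11, 12, 13, 20, 22, 23, 31, 35, 36, 38, 41, 44]
for height, level in enumerate(reversed(bulk_load(keys, fanout=2))):
    print("level", height, level)
```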

[Figures: two snapshots of the bulk load. Sorted leaf pages hold (3 4), (6 9), (10 11), (12 13), (20 22), (23 31), (35 36), (38 41), (44 ...); index entries 6, 10, 12, 20, 23 are added to the right-most index page above the leaves, which splits so that 12 moves into a new root with children (6, 10) and (20, 23), while the later data-entry pages are still not yet in the B+ tree.]
16
Summary of Bulk Loading
  • Option 1: multiple inserts.
  • Slow.
  • Does not give sequential storage of leaves.
  • Option 2: Bulk Loading.
  • Has advantages for concurrency control.
  • Fewer I/Os during build.
  • Leaves will be stored sequentially (and linked,
    of course).
  • Can control fill factor on pages.

17
A Note on Order
  • Order (d) concept replaced by physical space
    criterion in practice (at least half-full).
  • Index pages can typically hold many more entries
    than leaf pages.
  • Variable sized records and search keys mean
    different nodes will contain different numbers of
    entries.
  • Even with fixed length fields, multiple records
    with the same search key value (duplicates) can
    lead to variable-sized data entries (if we use
    Alternative (3)).
  • Many real systems are even sloppier than this ---
    only reclaim space when a page is completely
    empty.


18
Selection Queries
The B+-tree is perfect, but... to answer a
selection query (ssn = 10) it needs to traverse a
full path. In practice, 3-4 block accesses
(depending on the height of the tree and on
buffering). Any better approach?
  • Yes! Hashing
  • static hashing
  • dynamic hashing

19
Hashing
  • Hash-based indexes are best for equality
    selections. Cannot support range searches.
  • Static and dynamic hashing techniques exist;
    trade-offs are similar to ISAM vs. B+ trees.

20
Static Hashing
  • primary pages fixed, allocated sequentially,
    never de-allocated; overflow pages if needed.
  • h(k) MOD N = bucket to which data entry with key k
    belongs (N = number of buckets).

[Figure: a key is fed to the hash function h; h(key) mod N selects one of the primary bucket pages 0 .. N-1, each of which may have a chain of overflow pages.]
21
Static Hashing (Contd.)
  • Buckets contain data entries.
  • Hash fn works on the search key field of record r.
    Use its value MOD N to distribute values over the
    range 0 ... N-1.
  • h(key) = (a * key + b) usually works well.
  • a and b are constants; lots is known about how to
    tune h.
  • Long overflow chains can develop and degrade
    performance.
  • Extendible and Linear Hashing: dynamic techniques
    to fix this problem.
  (a small static-hashing sketch follows below)
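A minimal static-hashing sketch (assumed): N primary buckets fixed up front, h(key) = (a * key + b) mod N choosing the bucket, and overflow pages appended to a chain when the primary page fills. All constants are illustrative.

```python
N = 4                 # number of primary buckets, fixed at creation time
PAGE_CAPACITY = 2     # data entries per page (unrealistically small, for illustration)
A, B = 37, 11         # constants of the hash function h(key) = (a*key + b)

# each primary bucket is a list of pages; pages after the first form its overflow chain
buckets = [[[]] for _ in range(N)]

def insert(key):
    """Place the data entry for `key` in bucket h(key) mod N, adding
    an overflow page when every page in the chain is full."""
    chain = buckets[(A * key + B) % N]
    for page in chain:
        if len(page) < PAGE_CAPACITY:
            page.append(key)
            return
    chain.append([key])   # long overflow chains like this are what degrade static hashing

for k in [18, 27, 29, 28, 39, 13, 16]:
    insert(k)
print(buckets)
```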

22
Extendible Hashing
  • Situation: bucket (primary page) becomes full.
    Why not re-organize the file by doubling the
    number of buckets?
  • Reading and writing all pages is expensive!
  • Idea: use a directory of pointers to buckets;
    double the number of buckets by doubling the
    directory, splitting just the bucket that
    overflowed!
  • Directory much smaller than file, so doubling it
    is much cheaper. Only one page of data entries
    is split. No overflow page!
  • Trick lies in how the hash function is adjusted!

23
Example
  • Directory is an array of size 4.
  • The bucket for record r is found via the directory
    entry whose index is the last (global depth) bits
    of h(r) (see the one-line check below).
  • If h(r) = 5 = binary 101, it is in the bucket
    pointed to by 01.
  • If h(r) = 7 = binary 111, it is in the bucket
    pointed to by 11.
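The lookup in this example is just "take the last global-depth bits of h(r)"; a one-line check (the helper name is made up):

```python
def bucket_index(h_r: int, global_depth: int) -> str:
    """Last `global_depth` bits of h(r), as the directory entry label."""
    return format(h_r & ((1 << global_depth) - 1), f"0{global_depth}b")

print(bucket_index(5, 2))   # '01' -> 5 = 101 goes to the bucket pointed to by 01
print(bucket_index(7, 2))   # '11' -> 7 = 111 goes to the bucket pointed to by 11
```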

[Figure: directory of global depth 2 with entries 00, 01, 10, 11. Bucket A (local depth 2) holds 32, 16, 4, 12; Bucket B (local depth 1) holds 1, 5, 7, 13 and is pointed to by both 01 and 11; Bucket C (local depth 2) holds 10. We denote record r by h(r).]
24
Handling Inserts
  • Find the bucket where the record belongs.
  • If there's room, put it there.
  • Else, if the bucket is full, split it:
  • increment local depth of original page
  • allocate new page with new local depth
  • re-distribute records from original page
  • add entry for the new page to the directory
  (a sketch of this procedure appears below)
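A minimal sketch (assumed, not the slides' code) of this insert procedure, starting from the directory state of the example two slides back; it also handles the directory-doubling case discussed on the next slides. The page capacity and representation are illustrative, and a full implementation would re-check for overflow after redistributing.

```python
PAGE_CAPACITY = 4

class Bucket:
    def __init__(self, local_depth, items=()):
        self.local_depth = local_depth
        self.items = list(items)

# state from the earlier example: global depth 2, Bucket B shared by entries 01 and 11
A = Bucket(2, [32, 16, 4, 12])
B = Bucket(1, [1, 5, 7, 13])
C = Bucket(2, [10])
global_depth = 2
directory = [A, B, C, B]                       # indexed by the last global_depth bits of h(r)

def insert(h_r):
    global global_depth, directory
    bucket = directory[h_r & ((1 << global_depth) - 1)]
    if len(bucket.items) < PAGE_CAPACITY:      # room: just put it there
        bucket.items.append(h_r)
        return
    bucket.local_depth += 1                    # full: increment local depth of original page
    if bucket.local_depth > global_depth:      # need one more bit: double directory by copying
        directory = directory + directory
        global_depth += 1
    image = Bucket(bucket.local_depth)         # allocate new page (split image) with new local depth
    old, bucket.items = bucket.items + [h_r], []
    mask = 1 << (bucket.local_depth - 1)       # re-distribute records on the newly used bit
    for k in old:
        (image if k & mask else bucket).items.append(k)
    for i, b in enumerate(directory):          # point the matching directory entries at the new page
        if b is bucket and (i & mask):
            directory[i] = image

for h_r in [21, 19, 15, 20]:                   # the inserts worked through on the next slides
    insert(h_r)
print(global_depth, [sorted(b.items) for b in dict.fromkeys(directory)])
```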

25
Example: Insert 21, then 19, 15
  • 21 = 10101
  • 19 = 10011
  • 15 = 01111

[Figure: after the inserts, the global depth is still 2. Bucket B (now local depth 2) holds 1, 5, 21, 13; a new split image of B (local depth 2) holds 7, 19, 15, and directory entry 11 now points to it; Buckets A and C are unchanged.]
26
Insert h(r) = 20 (Causes Doubling)
[Figure: inserting 20 overflows Bucket A, whose local depth equals the global depth, so the directory doubles to global depth 3 (entries 000-111). Bucket A (local depth 3) keeps 32, 16; a new Bucket A2 (local depth 3, the split image of Bucket A) gets 4, 12, 20; Buckets B (1, 5, 21, 13), C (10), and D (15, 7, 19) are unchanged.]
27
Points to Note
  • 20 = binary 10100. The last 2 bits (00) tell us r
    belongs in either A or A2; the last 3 bits are
    needed to tell which.
  • Global depth of directory: max # of bits needed
    to tell which bucket an entry belongs to.
  • Local depth of a bucket: # of bits used to
    determine if an entry belongs to this bucket.
  • When does a bucket split cause directory doubling?
  • Before the insert, local depth of the bucket =
    global depth. The insert causes the local depth to
    become > global depth; the directory is doubled by
    copying it over and fixing the pointer to the
    split image page.

28
Directory Doubling
  • Why use least significant bits in directory?
  • Allows for doubling via copying!

[Figure: directories of global depth 1, 2, and 3 holding an entry for 6 = 110, indexed by least significant bits vs. most significant bits; with least significant bits, each doubling just appends a copy of the existing directory.]
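A tiny illustration (assumed) of the point above: with least-significant-bit indexing, the doubled directory is literally the old directory followed by a copy of itself, whereas most-significant-bit indexing would have to interleave the entries.

```python
old = ["A", "B", "C", "B"]          # directory at global depth 2, indexed by the last 2 bits

# least-significant-bit indexing: the entry for bits 'xyz' starts out equal to the
# entry for bits 'yz', so doubling is just appending a copy of the old directory
new_lsb = old + old

# most-significant-bit indexing would instead duplicate each entry in place
new_msb = [b for b in old for _ in range(2)]

print(new_lsb)   # ['A', 'B', 'C', 'B', 'A', 'B', 'C', 'B']
print(new_msb)   # ['A', 'A', 'B', 'B', 'C', 'C', 'B', 'B']
```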
29
Comments on Extendible Hashing
  • If the directory fits in memory, an equality search
    is answered with one disk access; else two.
  • A 100MB file, 100 bytes/rec, 4K pages contains
    1,000,000 records (as data entries) and 25,000
    directory elements; chances are high that the
    directory will fit in memory (see the arithmetic
    below).
  • Directory grows in spurts, and, if the
    distribution of hash values is skewed, the
    directory can grow large.
  • Multiple entries with the same hash value cause
    problems!
  • Delete: if removal of a data entry makes a bucket
    empty, it can be merged with its split image. If
    each directory element points to the same bucket
    as its split image, the directory can be halved.
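The arithmetic behind the 100MB example above, assuming 8-byte directory pointers (the slides do not give a pointer size):

```python
file_size = 100 * 10**6        # 100 MB file
record_size = 100              # bytes per record, stored as the data entries
page_size = 4 * 10**3          # 4K pages

num_records = file_size // record_size            # 1,000,000 data entries
entries_per_page = page_size // record_size       # 40 entries per bucket page
num_buckets = num_records // entries_per_page     # 25,000 primary pages
directory_elements = num_buckets                  # one pointer per bucket: 25,000
directory_bytes = directory_elements * 8          # assumed 8-byte pointers: ~200 KB

print(num_records, directory_elements, directory_bytes)
```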