Temple University CIS Dept' CIS616 Principles of Data Management - PowerPoint PPT Presentation

1 / 165
About This Presentation
Title:

Temple University CIS Dept' CIS616 Principles of Data Management

Description:

Search Key - attribute or set of attributes used to look up records ... overflow chains may become very long - what to do? ISAM - observations ... – PowerPoint PPT presentation

Number of Views:60
Avg rating:3.0/5.0
Slides: 166
Provided by: Vas111
Category:

less

Transcript and Presenter's Notes

Title: Temple University CIS Dept' CIS616 Principles of Data Management


1
Temple University CIS Dept.CIS616 Principles
of Data Management
  • V. Megalooikonomou
  • Indexing and Hashing
  • (based on notes by Silberchatz, Korth, and
    Sudarshan and notes by C. Faloutsos at CMU)

2
General Overview - rel. model
  • Relational model - SQL
  • Formal commercial query languages
  • Functional Dependencies
  • Normalization
  • Physical Design
  • Indexing

3
Indexing- overview
  • primary / secondary indices
  • index-sequential (ISAM)
  • B - trees, B - trees
  • hashing
  • static hashing
  • dynamic hashing

4
Basic Concepts
  • Indexing mechanisms speed up access to desired
    data
  • E.g., author catalog in library
  • Search Key - attribute or set of attributes used
    to look up records
  • An index file consists of records (called index
    entries) of the form
  • Index files typically much smaller than the
    original file
  • Two basic kinds of indices
  • Ordered indices search keys are stored in
    sorted order
  • Hash indices search keys are distributed
    uniformly across buckets using a hash function

search-key
pointer
5
Indexing
  • once the records are stored in a file, how do you
    search efficiently? (e.g., ssn123?)

6
Indexing
7
Indexing
  • once the records are stored in a file, how do you
    search efficiently?
  • brute force retrieve all records, report the
    qualifying ones
  • better use indices (pointers) to locate the
    records directly

8
Indexing main idea
9
Measuring goodness
  • retrieval time?
  • insertion / deletion?
  • space overhead?
  • reorganization?
  • range queries?

10
Main concepts
  • search keys are sorted in the index file and
    point to the actual records
  • primary vs. secondary indices
  • Clustering (sparse) vs
  • non-clustering (dense) indices

11
Indexing
Primary key index on primary key (no duplicates)
12
Indexing
secondary key index duplicates may exist
Address-index
13
Indexing
secondary key index typically, with postings
lists
Postings lists
14
Main concepts contd
  • Clustering ( sparse) index records are
    physically sorted on that key (and not all key
    values are needed in the index)
  • Non-clustering (dense) index the opposite
  • E.g.

15
Indexing- Sparse index
Clustering/sparse index on ssn
gt123
gt456
16
Sparse Index Files
  • Sparse Index contains index records for only
    some search-key values
  • Applicable when records are sequentially ordered
    on search-key
  • To locate a record with search-key value K we
  • Find index record with largest search-key value lt
    K
  • Search file sequentially starting at the record
    to which the index record points
  • Less space and less maintenance overhead for
    insertions and deletions
  • Generally slower than dense index for locating
    records
  • Good tradeoff sparse index with an index entry
    for every block in file, corresponding to least
    search-key value in the block

17
Indexing Dense Index
Non-clustering / dense index
18
Summary
  • All combinations are possible
  • at most one sparse/clustering index
  • as many as desired dense indices
  • usually one primary-key index (maybe
    clustering) and a few secondary-key indices
    (non-clustering)

19
Indexing- overview
  • primary / secondary indices
  • index-sequential (ISAM)
  • B - trees, B - trees
  • hashing
  • static hashing
  • dynamic hashing

20
ISAM
  • What if index is too large to search
    sequentially?
  • ? use a multilevel index

21
ISAM
22
ISAM - observations
  • if index is too large, store it on disk and keep
    index-on-the-index
  • usually two levels of indices,
  • one first-level entry per disk block (why? )

23
ISAM - Multilevel Index
24
ISAM - observations
  • What about insertions/deletions?

25
ISAM - observations
  • What about insertions/deletions?

overflows
124 peterson fifth ave.
Problems?
26
ISAM - observations
  • What about insertions/deletions?

overflows
124 peterson fifth ave.
  • overflow chains may become very long - what to
    do?

27
ISAM - observations
  • What about insertions/deletions?

overflows
124 peterson fifth ave.
  • overflow chains may become very long - thus
  • shut-down reorganize
  • start with 80 utilization

28
ISAM - observations
  • if index is too large, store it on disk and keep
    index on the index (in memory)
  • usually two levels of indices, one first- level
    entry per disk block
  • typically, blocks 80 full initially (why? what
    are potential problems / inefficiencies?)

29
So far
  • indices (like ISAM) suffer in the presence of
    frequent updates
  • sequential scan using primary index is efficient,
    but a sequential scan using a secondary index is
    expensive
  • each record access may fetch a new block from
    disk
  • alternative indexing structure B - trees

30
Overview
  • primary / secondary indices
  • multilevel (ISAM)
  • B - trees, B - trees
  • hashing
  • static hashing
  • dynamic hashing

31
B-trees
  • the most successful family of index schemes
    (B-trees, B-trees, B-trees)
  • can be used for primary/secondary,
    clustering/non-clustering index
  • they are balanced n-way search trees

32
B-trees
  • Disadvantage of indexed-sequential files
    performance degrades as file grows, since many
    overflow blocks get created. Periodic
    reorganization of entire file is required
  • Advantage of B-tree index files
  • automatic self-reorganization with small, local,
    changes, in the face of insertions and deletions.
    Reorganization of entire file is not required
  • Disadvantage of B-trees
  • extra insertion and deletion overhead, space
    overhead
  • Advantages of B-trees outweigh disadvantages,
    and they are used extensively

33
B-trees
  • E.g., B-tree of order 3 (i.e., at most 3 pointers
    from each node)

34
B-tree properties
  • each node, in a B-tree of order n
  • key order
  • at most n pointers
  • at least n/2 pointers (except root)
  • all leaves at the same level
  • if number of pointers is k, then node has exactly
    k-1 keys

35
Properties
  • block aware nodes each node -gt disk page
  • O(log (N)) for everything! (ins/del/search)
  • typically, if N 50 - 100, then 2 - 3 levels
  • utilization gt 50, guaranteed on average 69

36
Queries
  • Algorithm for exact match query?
  • (e.g., ssn8?)

37
Queries
  • Algorithm for exact match query?
  • (e.g., ssn8?)

6
9
lt6
gt9
lt9
gt6
3
1
7
13
38
Queries
  • Algorithm for exact match query?
  • (e.g., ssn8?)

6
9
lt6
gt9
lt9
gt6
3
1
7
13
39
Queries
  • Algorithm for exact match query?
  • (e.g., ssn8?)

6
9
lt6
gt9
lt9
gt6
3
1
7
13
40
Queries
  • Algorithm for exact match query?
  • (e.g., ssn8?)

6
9
lt6
H steps ( disk accesses)
gt9
lt9
gt6
3
1
7
13
41
Queries
  • Algorithm for exact match query?
  • (e.g., ssn8?)

42
Queries
  • what about range queries? (e.g., 5ltsalarylt8)
  • Proximity/ nearest neighbor searches? (e.g.,
    salary 8 )

43
Queries
  • what about range queries? (e.g., 5ltsalarylt8)
  • Proximity/ nearest neighbor searches? (e.g.,
    salary 8 )

44
Queries
  • what about range queries? (e.g., 5ltsalarylt8)
  • Proximity/ nearest neighbor searches? (eg.,
    salary 8 )

45
B-trees Insertion
  • Insert in leaf
  • on overflow, push middle up (recursively)
  • split preserves B - tree properties

46
B-trees
  • Easy case Tree T0 insert 8

47
B-trees
  • Tree T0 insert 8

8
48
B-trees
  • Hardest case Tree T0 insert 2

2
49
B-trees
  • Hardest case Tree T0 insert 2

6
9
2
1
7
3
push middle up
50
B-trees
  • Hardest case Tree T0 insert 2

Ovf push middle
2
6
9
7
51
B-trees
  • Hardest case Tree T0 insert 2

6
Final state
9
2
7
52
B-trees - insertion
  • Q What if there are two middles? (e.g., order 4)
  • A either one is fine

53
B-trees Insertion
  • Insert in leaf on overflow, push middle up
    (recursively propagate split)
  • split preserves all B - tree properties (!!)
  • notice how it grows height increases when root
    overflows splits
  • Automatic, incremental re-organization (contrast
    with ISAM!)

54
Pseudo-code
INSERTION OF KEY K find the correct leaf
node L if ( L overflows ) split
L, by pushing the middle key upstairs to parent
node P if (P overflows)
repeat the split recursively else
add the key K in node L / maintaining
the key order in L /
55
Overview
  • primary / secondary indices
  • multilevel (ISAM)
  • B trees
  • Dfn, Search, insertion, deletion
  • B - trees
  • hashing

56
Deletion
  • Rough outline of algorithm
  • Delete key
  • on underflow, may need to merge
  • In practice, some implementors just allow
    underflows to happen

57
B-trees Deletion
  • Easiest case Tree T0 delete 3

58
B-trees Deletion
  • Easiest case Tree T0 delete 3

59
B-trees Deletion
  • Case1 delete a key at a leaf no underflow
  • Case2 delete non-leaf key no underflow
  • Case3 delete leaf-key underflow, and rich
    sibling
  • Case4 delete leaf-key underflow, and poor
    sibling

60
B-trees Deletion
  • Case1 delete a key at a leaf no underflow
    (delete 3 from T0)

61
B-trees Deletion
  • Case2 delete a key at a non-leaf no underflow
    (e.g., delete 6 from T0)

Delete promote, i.e
62
B-trees Deletion
  • Case2 delete a key at a non-leaf no underflow
    (e.g., delete 6 from T0)

Delete promote, i.e.
63
B-trees Deletion
  • Case2 delete a key at a non-leaf no underflow
    (eg., delete 6 from T0)

Delete promote, i.e.
3
64
B-trees Deletion
  • Case2 delete a key at a non-leaf no underflow
    (eg., delete 6 from T0)

FINAL TREE
9
3
lt3
gt9
lt9
gt3
1
7
13
65
B-trees Deletion
  • Case2 delete a key at a non-leaf no underflow
    (eg., delete 6 from T0)
  • Q How to promote?
  • A pick the largest key from the left sub-tree
    (or the smallest from the right sub-tree)
  • Observation
  • Every deletion eventually becomes a deletion of
    a leaf key

66
B-trees Deletion
  • Case2 delete a key at a non-leaf no underflow
    (eg., delete 6 from T0)

Delete promote, i.e.
3
67
B-trees Deletion
  • Case1 delete a key at a leaf no underflow
  • Case2 delete non-leaf key no underflow
  • Case3 delete leaf-key underflow, and rich
    sibling
  • Case4 delete leaf-key underflow, and poor
    sibling

68
B-trees Deletion
  • Case3 underflow rich sibling (eg., delete 7
    from T0)

Delete borrow, ie
69
B-trees Deletion
  • Case3 underflow rich sibling (eg., delete 7
    from T0)

Delete borrow, ie
Rich sibling
70
B-trees Deletion
  • Case3 underflow rich sibling
  • rich can give a key, without underflowing
  • borrowing a key always THROUGH the PARENT!

71
B-trees Deletion
  • Case3 underflow rich sibling (eg., delete 7
    from T0)

Delete borrow, ie
Rich sibling
NO!!
72
B-trees Deletion
  • Case3 underflow rich sibling (eg., delete 7
    from T0)

Delete borrow, ie
73
B-trees Deletion
  • Case3 underflow rich sibling (eg., delete 7
    from T0)

Delete borrow, ie
6
74
B-trees Deletion
  • Case3 underflow rich sibling (eg., delete 7
    from T0)

Delete borrow, through the parent
FINAL TREE
3
9
lt3
gt9
lt9
gt3
6
1
13
75
B-trees Deletion
  • Case1 delete a key at a leaf no underflow
  • Case2 delete non-leaf key no underflow
  • Case3 delete leaf-key underflow, and rich
    sibling
  • Case4 delete leaf-key underflow, and poor
    sibling

76
B-trees Deletion
  • Case4 underflow poor sibling (eg., delete 13
    from T0)

77
B-trees Deletion
  • Case4 underflow poor sibling (eg., delete 13
    from T0)

78
B-trees Deletion
  • Case4 underflow poor sibling (eg., delete 13
    from T0)

A merge w/ poor sibling
79
B-trees Deletion
  • Case4 underflow poor sibling (eg., delete 13
    from T0)
  • Merge, by pulling a key from the parent
  • exact reversal from insertion split and push
    up, vs. merge and pull down
  • Ie.

80
B-trees Deletion
  • Case4 underflow poor sibling (eg., delete 13
    from T0)

A merge w/ poor sibling
6
lt6
gt6
3
1
7
9
81
B-trees Deletion
  • Case4 underflow poor sibling (eg., delete 13
    from T0)

FINAL TREE
6
lt6
gt6
3
1
7
9
82
B-trees Deletion
  • Case4 underflow poor sibling
  • -gt pull key from parent, and merge
  • Q What if the parent underflows?
  • A repeat recursively

83
B-tree deletion - pseudocode
  • DELETION OF KEY K
  • locate key K, in node N
  • if( N is a non-leaf node)
  • delete K from N
  • find the immediately largest key K1
  • / which is guaranteed to be on a leaf
    node L /
  • copy K1 in the old position of K
  • invoke this DELETION routine on K1 from
    the leaf node L
  • else
  • / N is a leaf node /
  • ... (next slide..)

84
B-tree deletion - pseudocode
  • / N is a leaf node /
  • if( N underflows )
  • let N1 be the sibling of N
  • if( N1 is "rich") / ie., N1 can
    lend us a key /
  • borrow a key from N1 THROUGH the
    parent node
  • else / N1 is 1 key away from
    underflowing /
  • MERGE pull the key from the parent
    P,
  • and merge it with the keys of N
    and N1 into a new node
  • if( P underflows) repeat
    recursively

85
B-trees in practice
  • In practice
  • no empty leaves
  • pointers to records

theory
86
B-trees in practice
  • In practice
  • no empty leaves
  • pointers to records

6
9
practice
lt6
gt9
lt9
gt6
3
1
7
13
87
B-trees in practice
  • In practice

88
B-trees in practice
  • In practice, the formats are
  • leaf nodes (v1, rp1, v2, rp2, vn, rpn)
  • Non-leaf nodes (p1, v1, rp1, p2, v2, rp2, )

89
Overview
  • primary / secondary indices
  • multilevel (ISAM)
  • B trees
  • B - trees
  • hashing

90
B trees - Motivation
  • B-tree print keys in sorted order

91
B trees - Motivation
  • B-tree needs back-tracking how to avoid it?

92
Solution B - trees
  • Facilitate sequential ops
  • They string all leaf nodes together
  • AND
  • Replicate keys from non-leaf nodes, to make sure
    every key appears at the leaf level !!

93
B trees
6
9
lt6
gt9
lt9
gt6
3
7
13
1
6
9
94
B-Trees (Cont.)
A B-tree is a rooted tree satisfying the
following properties
  • All paths from root to leaf are of the same
    length
  • Each node that is not a root or a leaf has
    between n/2 and n children
  • A leaf node has between (n1)/2 and n1 values
  • Special cases
  • If the root is not a leaf, it has at least 2
    children
  • If the root is a leaf (that is, there are no
    other nodes in the tree), it can have between 0
    and (n1) values

95
B-Tree Node Structure
  • Typical node
  • Ki are the search-key values
  • Pi are pointers to children (for non-leaf nodes)
    or pointers to records or buckets of records (for
    leaf nodes).
  • The search-keys in a node are ordered
  • K1 lt K2 lt K3 lt . . . lt Kn1

96
Leaf Nodes in B-Trees - Properties
  • For i 1, 2, . . ., n1, pointer Pi either
    points to a file record with search-key value Ki,
    or to a bucket of pointers to file records, each
    record having search-key value Ki. Only need
    bucket structure if search-key does not form a
    primary key.
  • If Li, Lj are leaf nodes and i lt j, Lis
    search-key values are less than Ljs search-key
    values
  • Pn points to next leaf node in search-key order

97
Non-Leaf Nodes in B-Trees - Properties
  • Non leaf nodes form a multi-level sparse index on
    the leaf nodes. For a non-leaf node with m
    pointers
  • All the search-keys in the subtree to which P1
    points are less than K1
  • For 2 ? i ? n 1, all the search-keys in the
    subtree to which Pi points have values greater
    than or equal to Ki1 and less than Km1

98
B-Tree vs B-Tree
  • B-tree (above) and B-tree (below) on same data

99
B tree insertion
  • INSERTION OF KEY K
  • insert search-key value to L such that the
    keys are in order
  • if ( L overflows)
  • split L
  • insert (ie., COPY) smallest search-key
    value
  • of new node to parent node P
  • if (P overflows)
  • repeat the B-tree split procedure
    recursively
  • / Notice the B-TREE split NOT the B
    -tree /

100
B-tree insertion contd
  • / ATTENTION
  • a split at the LEAF level is handled by COPYING
    the middle key upstairs
  • A split at a higher level is handled by PUSHING
    the middle key upstairs
  • /

101
B trees - insertion
Eg., insert 8
6
9
lt6
gt9
lt9
gt6
3
7
13
1
6
9
102
B trees - insertion
Eg., insert 8
6
9
lt6
gt9
lt9
gt6
3
7
13
1
6
9
8
103
B trees - insertion
Eg., insert 8
6
9
lt6
gt9
lt9
gt6
8
3
7
13
1
6
9
COPY middle upstairs
104
B trees - insertion
Eg., insert 8
6
9
lt6
gt9
7
lt9
gt6
3
1
6
COPY middle upstairs
105
B trees - insertion
Eg., insert 8
Non-leaf overflow just PUSH the middle
6
9
lt6
gt9
7
lt9
gt6
3
1
6
COPY middle upstairs
106
B trees - insertion
7
lt7
gt7
Eg., insert 8
6
lt6
lt9
gt9
gt6
3
1
6
FINAL TREE
107
B-Trees vs B-Trees
  • Advantages of B-Tree indices
  • May use less tree nodes than a corresponding
    B-Tree.
  • Sometimes possible to find search-key value
    before reaching leaf node.
  • Disadvantages of B-Tree indices
  • Only small fraction of all search-key values are
    found early
  • Non-leaf nodes are larger, so fan-out is reduced.
    Thus B-Trees typically have greater depth than
    corresponding B-Tree
  • Insertion and deletion more complicated than in
    B-Trees
  • Implementation is harder than B-Trees.
  • Typically, advantages of B-Trees do not out weigh
    disadvantages

108
B-tree
  • In B-trees, worst case util. 50, if we have
    just split all the pages
  • how to increase the utilization of B - trees?
  • with B - trees!

109
B-trees and B-trees
  • E.g., Tree T0 insert 2

2
110
B-trees deferred split!
  • Instead of splitting, LEND keys to sibling!
  • (through PARENT, of course!)

2
111
B-trees deferred split!
  • Instead of splitting, LEND keys to sibling!
  • (through PARENT, of course!)

FINAL TREE
7
2
112
B-trees deferred split!
  • Notice shorter, more packed, faster tree
  • Its a rare case, where space utilization and
    speed improve together
  • BUT What if the sibling has no room for our
    lending?

113
B-trees deferred split!
  • BUT What if the sibling has no room for our
    lending?
  • A 2-to-3 split get the keys from the sibling,
    pool them with ours (and a key from the parent),
    and split in 3.
  • Details too messy (and even worse for deletion)

114
Conclusions
  • all B tree variants can be used for any type of
    index primary/secondary, sparse (clustering), or
    dense (non-clustering)
  • All have excellent, O(logN) worst-case
    performance for ins/del/search
  • Its the prevailing indexing method

115
Indexing- overview
  • ISAM and B-trees
  • Hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing

116
(Static) Hashing
  • Problem find EMP record with ssn123
  • Q What if disk space was free, and time was at
    premium?

117
Hashing
  • A Brilliant idea key-to-address transformation

0 page
123 Smith Main str
123 page
999,999,999
118
Hashing
  • Since space is NOT free
  • use M, instead of 999,999,999 slots
  • hash function h(key) slot-id

119
Hashing
  • Typically each hash bucket is a page, holding
    many records

120
Hashing
  • Notice could have clustering, or non-clustering
    versions

121
Hashing
  • Notice could have clustering, or non-clustering
    versions

122
Indexing- overview
  • ISAM and B-trees
  • hashing
  • hashing functions
  • size of hash table
  • collision resolution
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics

123
Design decisions
  • 1) formula h() for hashing function
  • 2) size of hash table M
  • 3) collision resolution method

124
Design decisions - functions
  • Goal
  • uniform spread of keys over hash buckets
  • Popular choices
  • Division hashing
  • Multiplication hashing

125
Division hashing
  • h(x) (axb) mod M
  • eg., h(ssn) (ssn) mod 1,000
  • gives the last three digits of ssn
  • M size of hash table - choose a prime number,
    defensively (why?)

126
Division hashing
  • eg., M2 hash on driver-license number (dln),
    where the last digit is gender (0/1 M/F)
  • in an army unit with predominantly male soldiers
  • Thus avoid cases where M and keys have common
    divisors -- prime M guards against that!

127
Multiplication hashing
  • h(x) fractional-part-of ( x f ) M
  • f golden ratio ( 0.618... ( sqrt(5)-1)/2 )
  • In general, we need an irrational number
  • Advantage M need not be a prime number
  • But f must be irrational

128
Other hashing functions
  • quadratic hashing (bad)
  • ...
  • conclusion use division hashing

129
Design decisions
  • 1) formula h() for hashing function
  • 2) size of hash table M
  • 3) collision resolution method

130
Size of hash table
  • eg., 50,000 employees, 10 employee-records /
    page
  • Q M?? pages/buckets/slots

131
Size of hash table
  • eg., 50,000 employees, 10 employees/page
  • Q M?? pages/buckets/slots
  • A utilization 90 and
  • M prime number
  • Eg., in our case M closest prime to 50,000/10 /
    0.9 5,555

132
Design decisions
  • 1) formula h() for hashing function
  • 2) size of hash table M
  • 3) collision resolution method

133
Collision resolution
  • Q what is a collision?
  • A ??

134
Collision resolution
0 page
FULL
h(123)
123 Smith Main str.
M
135
Collision resolution
  • Q what is a collision?
  • A ??
  • Q why worry about collisions/overflows? (recall
    that buckets are 90 full)
  • A e.g. bank account balances between 0 and
    10,000 and between 90,000 and 100,000

136
Collision resolution
  • open addressing
  • linear probing (ie., put to next slot/bucket)
  • re-hashing
  • separate chaining (ie., put links to overflow
    pages)

137
Collision resolution
linear probing
0 page
FULL
h(123)
123 Smith Main str.
M
138
Collision resolution
re-hashing
0 page
h1()
FULL
h(123)
123 Smith Main str.
h2()
M
139
Collision resolution
separate chaining
FULL
123 Smith Main str.
140
Design decisions - conclusions
  • function division hashing
  • h(x) ( axb ) mod M
  • size M 90 util. prime number.
  • collision resolution separate chaining
  • easier to implement (deletions!)
  • no danger of becoming full

141
Indexing- overview
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing

142
Hashing vs B-trees
  • Hashing offers
  • speed ! ( O(1) avg. search time)
  • ..but B-trees offer

143
Hashing vs B-trees
  • but B-trees offer
  • key ordering
  • range queries
  • proximity queries
  • sequential scan
  • O(log(N)) guarantees for search, ins./del.
  • graceful growing/shrinking

144
Hashing vs B-trees
  • thus
  • B-trees are implemented in most systems
  • footnotes
  • hashing is not (why not?)

145
Indexing- overview
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing

146
Indexing in SQL
  • create index ltindex-namegt on ltrelation-namegt
    (ltattribute-listgt)
  • create unique index ltindex-namegt on
    ltrelation-namegt (ltattribute-listgt)
  • (in the case that the search key is a
    candidate key)
  • drop index ltindex-namegt

147
Indexing in SQL
  • e.g.,
  • create index ssn-index
  • on STUDENT (ssn)
  • or (e.g., on TAKES(ssn,cid, grade) )
  • create index sc-index
  • on TAKES (ssn, c-id)

148
Indexing- overview
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics (theoretical interest)
  • dynamic hashing
  • multi-attribute indexing

149
Problem with static hashing
  • problem overflow?
  • problem underflow? (under-utilization)

150
Solution Dynamic/extendible hashing
  • Idea shrink / expand hash table on demand..
  • ... ? dynamic hashing
  • Details how to grow gracefully, on overflow?
  • Many solutions - One of them extendible hashing

151
Extendible hashing
0 page
FULL
h(123)
123 Smith Main str.
M
152
Extendible hashing
0 page
solution split the bucket in two
FULL
h(123)
123 Smith Main str.
M
153
Extendible hashing
  • in detail
  • keep a directory, with ptrs to hash-buckets
  • Q how to divide contents of bucket in two?
  • A hash each key into a very long bit string
    keep only as many bits as needed
  • Eventually

154
Extendible hashing
directory
0001...
0111...
00...
01...
10101...
10...
10011...
10110...
11...
1101...
101001...
155
Extendible hashing
directory
0001...
0111...
00...
01...
10101...
10...
10011...
10110...
11...
1101...
101001...
156
Extendible hashing
directory
0001...
0111...
00...
01...
10101...
10...
split on 3-rd bit
10011...
10110...
11...
101001...
1101...
157
Extendible hashing
directory
0001...
0111...
00...
01...
new page / bucket
10...
10011...
10101...
11...
101001...
10110...
1101...
158
Extendible hashing
directory (doubled)
new page / bucket
159
Extendible hashing
0001...
0111...
10011...
1101...
BEFORE
AFTER
160
Extendible hashing
  • Summary
  • directory doubles on demand
  • or halves, on shrinking files
  • needs local and global depth (see book)
  • Mainly, of theoretical interest - same for
  • linear hashing of Litwin
  • order preserving
  • perfect hashing (no collisions!)

161
Indexing- overview
  • ISAM and B-trees
  • Hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing

162
multiple-key access
  • How to support queries on multiple attributes,
    like
  • gradegt3 and course415
  • Major motivation Geographic Information systems
    (GIS)

163
multiple-key access
y
x
164
multiple-key access
  • Typical query
  • Find cities within x miles from Philadelphia
  • thus, we want to store nearby cities on the same
    disk page

165
multiple-key access
y
x
166
multiple-key access
y
x
167
multiple-key access - R-trees
y
x
168
multiple-key access - R-trees
  • R-trees very successful for GIS
  • (along with z-ordering)
  • more details at advanced topics, later

169
Indexing- overview
industry workhorse
  • ISAM and B-trees
  • hashing
  • Hashing vs B-trees
  • Indices in SQL
  • Advanced topics
  • dynamic hashing
  • multi-attribute indexing
Write a Comment
User Comments (0)
About PowerShow.com