Title: Indexing Techniques
1. Indexing Techniques
2. The Problem
- What can we introduce to make search more efficient? Indices!
- What is an index?
3. Definitions
- Index: an auxiliary data structure to speed up record retrieval
- Search key: the field(s) of a table which is/are indexed
- Storage: index files that contain index records
- Each entry stores
  - the actual data record,
  - or a search key value k and a record ID: <k, rid>,
  - or a search key value k and a list of record IDs: <k, rid list>
- Types: ordered and unordered (hash) indices
4. Types of Ordered Indices (1/3)
- Assuming ordered data files
- Depending on which field is indexed:
  - Primary index: search key is the ordering key field
    - one pointer per page
  - Secondary index: search key is a non-ordering field
[Figure: a primary index next to a secondary index on the same file]
5. Types of Ordered Indices (2/3)
- Depending on the density of index records:
  - Dense index: an index record for each distinct search key value, i.e. every record
  - Sparse index: index records for only some search key values
    - the search key value of the first record in a page
    - a pointer to that page
[Figure: a dense index next to a sparse index]
6. Types of Ordered Indices (3/3)
- Ordering field is a nonkey (may have duplicates)
  - Clustered index
  - Unclustered index
[Figure: a clustered index next to an unclustered index]
7. Indices Exercise
- 2^15 records
- 128 bytes/record
- 2^10 bytes/page
- ordered file, equality search on the ordering field, unspanned organization (worked in the sketch below)
  - without an index
  - with a primary index
    - on a field of size 12 bytes
    - assume pointers are 4 bytes long
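One way to work the numbers, as a rough sketch: it assumes binary search over both the ordered file and the index, and counts cost in page reads; the variable names are illustrative.

```python
import math

# Exercise parameters from the slide
num_records = 2**15        # records in the file
record_size = 128          # bytes per record
page_size   = 2**10        # bytes per page
key_size    = 12           # bytes, the indexed field
ptr_size    = 4            # bytes, a page pointer

# Without an index: binary search over the ordered data file
bfr_data   = page_size // record_size            # 8 records/page (unspanned)
data_pages = math.ceil(num_records / bfr_data)   # 2^12 pages
cost_no_index = math.ceil(math.log2(data_pages)) # 12 page reads

# With a primary index: one <key, pointer> entry per data page
entry_size  = key_size + ptr_size                # 16 bytes
fan_out     = page_size // entry_size            # 2^6 = 64 entries/page
index_pages = math.ceil(data_pages / fan_out)    # 2^6 = 64 pages
cost_index  = math.ceil(math.log2(index_pages)) + 1  # 6 index reads + 1 data read

print(cost_no_index, cost_index)                 # 12 vs. 7 page accesses
```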
8. Multi-level Indices (1/2)
- If access using the first-level index is still expensive:
  - build a sparse index on the first-level index
  - a multi-level index
- Fan-out: the blocking factor of the index
[Figure: a second-level index built on top of a first-level index]
9. Multi-level Indices (2/2)
- 2^6 index records/page (fan-out)
- 2^15 index records
- 1st level: 2^9 pages
- 2nd level: 2^9 index records, 2^3 pages
- 3rd level: 2^3 index records, 1 page
- keep adding levels while 1 < 2^15 / (2^6)^t
- t = ceil(log_{2^6} 2^15) = 3
- in general, t = ceil(log_{fan-out}(#index records)) (see the sketch below)
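The same rule as a one-liner, sketched under the slide's notation (fan-out = the index blocking factor):

```python
import math

def index_levels(num_index_records: int, fan_out: int) -> int:
    # t = ceil(log_fan_out(#index records)): each extra level divides
    # the number of pages left to search by the fan-out.
    return math.ceil(math.log(num_index_records, fan_out))

print(index_levels(2**15, 2**6))  # 3 levels, as on the slide
```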
10. Dynamic Multi-level Indices
- So far we assumed indices are physically ordered files
  - expensive insertions and deletions
- Dynamic multi-level indices:
  - B-trees
  - B+ trees
11. Tree-structured Indices
- For each node: K1 < K2 < ... < K_{q-1}
- For each value X in the subtree pointed to by Pi:
  - K_{i-1} < X < K_i, for 1 < i < q
  - X < K_i, for i = 1
  - K_{q-1} < X, for i = q
[Figure: node layout <P1, K1, ..., P_i, K_i, ..., K_{q-1}, P_q>, with a value X shown in each subtree]
12. B-tree
- Problems: empty nodes, unbalanced trees
- Solution: B-trees
13. B-tree Definition
- Each node: <P1, <K1, Pr1>, P2, ..., <K_{q-1}, Pr_{q-1}>, Pq>
  - Pi: tree pointer, Ki: search value, Pri: data pointer
- For each node: K1 < K2 < ... < K_{q-1}
- For each value X in the subtree pointed to by Pi:
  - K_{i-1} < X < K_i, for 1 < i < q
  - X < K_i, for i = 1
  - K_{q-1} < X, for i = q
- Each node has at most q tree pointers
  - the B-tree is of order q
- Each node has at least ceil(q/2) tree pointers
  - except for the root
- An internal node with p tree pointers has p - 1 search values
- All leaves are at the same level
  - a balanced tree (see the node sketch below)
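As a rough illustration of the definition (not any particular implementation), a node can be modelled as three parallel lists, with the occupancy rules checked for an internal node:

```python
import math
from dataclasses import dataclass

@dataclass
class BTreeNode:
    # Interleaved layout <P1, <K1, Pr1>, P2, ..., <Kq-1, Prq-1>, Pq>,
    # stored here as three parallel lists (field names are illustrative)
    keys: list        # K1 < K2 < ... < K_{q-1}
    data_ptrs: list   # Pri: one data pointer per key
    children: list    # Pi: tree pointers, len(children) == len(keys) + 1

def check_internal(node: BTreeNode, q: int, is_root: bool) -> bool:
    """Occupancy rules for an internal node of a B-tree of order q:
    at most q tree pointers, at least ceil(q/2) except at the root,
    and p pointers always go with p - 1 search values."""
    p = len(node.children)
    lower = 2 if is_root else math.ceil(q / 2)
    return lower <= p <= q and len(node.keys) == p - 1
```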
14. B-tree Example
[Figure: a B-tree with root keys 5 and 8 and leaf keys 1, 3, 6, 7, 9, 12; the legend distinguishes tree pointers, data pointers, and null pointers (ø)]
15. B+ tree
- Most implementations of the B-tree are actually B+ trees
- Data pointers appear only in the leaves
  - more entries fit in internal nodes than in regular B-trees
  - fewer internal nodes
  - fewer levels
  - faster access
16. B+ tree Definition
- Internal nodes: <P1, K1, P2, ..., P_{q-1}, K_{q-1}, Pq>
- Leaf nodes: <<K1, Pr1>, <K2, Pr2>, ..., <K_{p-1}, Pr_{p-1}>, Pnext>
  - Pri points to a data record, or to a block of pointers to such records
  - p is the leaf order (see the sketch below)
[Figure: an internal node with keys 120, 150, 180 above a chain of leaves holding 120, 130, ..., 180, 200]
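A minimal sketch of the two node kinds, again as parallel lists; the class and field names are illustrative, not from any particular system:

```python
from dataclasses import dataclass

@dataclass
class InternalNode:
    # <P1, K1, P2, ..., P_{q-1}, K_{q-1}, Pq>: keys only guide the search
    keys: list
    children: list                    # len(children) == len(keys) + 1

@dataclass
class LeafNode:
    # <<K1, Pr1>, ..., <K_{p-1}, Pr_{p-1}>, Pnext>: data pointers live here
    keys: list
    data_ptrs: list                   # Pri: record, or block of record pointers
    next: "LeafNode | None" = None    # Pnext chains the leaves together
```

The Pnext chain is what makes range scans cheap: once the first qualifying leaf is found, the scan follows next pointers without revisiting internal nodes.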
17. B+ tree Search
- At each level, find the smallest Ki larger than the search key
- Follow the associated pointer Pi (see the sketch below)
[Figure: a search descending from root key 100 through internal key 30 to the leaves (30, 35) and (120, 130, 180, 200)]
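A sketch of this descent, reusing the InternalNode and LeafNode classes from the previous slide; the convention assumed here is that the pointer left of Ki leads to values strictly below Ki:

```python
def bplus_search(node, k):
    """Descend from the root: at each internal node, follow the pointer
    just left of the smallest Ki larger than the search key."""
    while isinstance(node, InternalNode):
        i = 0
        while i < len(node.keys) and k >= node.keys[i]:
            i += 1
        node = node.children[i]
    # node is now a leaf: return the data pointer for k, if present
    for key, ptr in zip(node.keys, node.data_ptrs):
        if key == k:
            return ptr
    return None
```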
18. B+ tree Insert
- Nodes may overflow or underflow
- Ignoring overflow and underflow:
- Inserting a data record with search key value k (see the sketch below):
  - find the leaf node
  - if k is found:
    - add the record to the file, creating an indirect block if there isn't one
    - add the record pointer to the indirect block
  - if k is not found:
    - add the data record to the file
    - insert the record pointer into the leaf node (keeping all search keys in order)
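A sketch of the two cases, again over the LeafNode class from slide 16; each data pointer is modelled here as a Python list standing in for an indirect block of record pointers:

```python
import bisect

def bplus_insert_simple(leaf, k, rid):
    """Insert ignoring overflow: grow k's indirect block if k exists,
    otherwise place a new <k, rid> entry in search key order."""
    i = bisect.bisect_left(leaf.keys, k)
    if i < len(leaf.keys) and leaf.keys[i] == k:
        leaf.data_ptrs[i].append(rid)      # k found: add to its block
    else:
        leaf.keys.insert(i, k)             # k not found: new index record,
        leaf.data_ptrs.insert(i, [rid])    # keeping all search keys in order
```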
19. B+ tree Delete
- Ignoring overflow and underflow (see the sketch below):
- Find the leaf node with search key value k
- Find the data record pointer; delete the record
- Delete the index record
  - and the indirect block, if any, once it is empty
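The matching delete, as a sketch under the same assumptions:

```python
def bplus_delete_simple(leaf, k, rid):
    """Delete ignoring underflow: remove the record pointer, then drop
    the index record once its indirect block is empty."""
    if k in leaf.keys:
        i = leaf.keys.index(k)
        leaf.data_ptrs[i].remove(rid)      # delete the record pointer
        if not leaf.data_ptrs[i]:          # indirect block now empty
            leaf.keys.pop(i)
            leaf.data_ptrs.pop(i)
```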
20. B+ tree Simple Insert
[Figure: key 42 inserted into a leaf with free space]
21. B+ tree Leaf Overflow (1/2)
22. B+ tree Leaf Overflow (2/2)
- The first ceil(n/2) entries stay in the existing node; the rest go to a new leaf node (see the sketch below)
- n = 3
[Figure: root 100; for k < 100, an internal node with keys 9, 30 over leaves (3, 5), (9, 11), (30, 35, 42); on the right, a leaf (120, 130, 180, 200)]
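A sketch of the split rule, reusing the LeafNode class from slide 16; the returned separator would be inserted into the parent, which may overflow in turn, as the next slides show:

```python
import math

def split_leaf(leaf):
    """Leaf overflow: the first ceil(n/2) entries stay put, the rest move
    to a new leaf, and the new leaf's first key is copied up as the
    separator for the parent."""
    n = len(leaf.keys)
    mid = math.ceil(n / 2)
    new_leaf = LeafNode(keys=leaf.keys[mid:],
                        data_ptrs=leaf.data_ptrs[mid:],
                        next=leaf.next)
    leaf.keys, leaf.data_ptrs = leaf.keys[:mid], leaf.data_ptrs[:mid]
    leaf.next = new_leaf                   # keep the leaf chain intact
    return new_leaf.keys[0], new_leaf      # <separator, pointer> for the parent
```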
23. B+ tree Internal Node Overflow (1/3)
[Figure: the tree of the previous slide with 210 inserted on the right side (keys 120, 130, 180, 200, 210)]
24. B+ tree Internal Node Overflow (2/3)
[Figure: inserting 205 overflows the rightmost leaf (keys 120, 130, 180, 200, 205, 210)]
25. B+ tree Internal Node Overflow (3/3)
[Figure: the split propagates upward; the internal level now holds keys 100 and 150, with keys 120, 130, 180, 200, 205, 210 spread over the resulting leaves]
26. B+ tree New Root (1/2)
[Figure: an internal node overflow reaches the root; the leaves below hold 120, 130, 180, 200, 205, 210]
27. B+ tree New Root (2/2)
[Figure: a new root is created, adding one level above the same keys]
28. Index Insert Exercise
[Figure: starting tree with an internal node holding keys 9 and 30 over the leaves (3, 5), (9, 11), (30, 35, 42)]
29. B+ tree Delete
- Simple delete case
- Underflow case (see the sketch below):
  - redistribute records
  - coalesce with siblings
  - update parents
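A sketch of the two underflow remedies for a leaf, reusing the classes from slide 16; it only tries the right sibling and takes minimum occupancy to be ceil(order/2) entries, both simplifying assumptions:

```python
import math

def handle_leaf_underflow(leaf, right_sibling, parent, sep_index, order):
    """On underflow: borrow an entry from the sibling if it can spare one
    (redistribution), else merge the two leaves (coalesce)."""
    min_keys = math.ceil(order / 2)
    if len(right_sibling.keys) > min_keys:
        # redistribute: move the sibling's smallest entry over
        leaf.keys.append(right_sibling.keys.pop(0))
        leaf.data_ptrs.append(right_sibling.data_ptrs.pop(0))
        parent.keys[sep_index] = right_sibling.keys[0]   # update the parent
    else:
        # coalesce: fold the sibling into the leaf and drop the separator
        leaf.keys.extend(right_sibling.keys)
        leaf.data_ptrs.extend(right_sibling.data_ptrs)
        leaf.next = right_sibling.next
        parent.keys.pop(sep_index)
        parent.children.pop(sep_index + 1)
```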
30. B+ tree Simple Delete (1/2)
[Figure: tree before the delete, with keys 120, 130, 180, 200, 205, 210, 215 in the leaves]
31. B+ tree Simple Delete (2/2)
[Figure: the tree after the simple delete; the leaves hold 100, 101, 120, 130, 180, 200, 205, 210, 215]
32. B+ tree Delete Redistribution (1/2)
[Figure: tree before redistribution; the leaves hold 100, 101, 120, 130, 180, 200, 205, 210, 215]
33. B+ tree Delete Redistribution (2/2)
- Redistribute entries
  - from the left or right sibling
[Figure: after redistribution the separators are 179 and 205; the leaves hold 100, 101, 120, 130, 150, 156, 179, 200, 205, 210]
34. B+ tree Delete Coalesce (1/4)
[Figure: tree with separators 179 and 205; the leaves hold 100, 101, 120, 130, 150, 156, 179, 200, 205, 210, 215]
35. B+ tree Delete Coalesce (2/4)
- Leaf updated
- No redistribution possible
  - coalesce with a sibling
[Figure: same tree with separators 179 and 205; the leaves hold 100, 120, 130, 150, 156, 179, 200, 205, 210, 215]
36. B+ tree Delete Coalesce (3/4)
- Leaf updated
- No redistribution possible
  - coalesce with a sibling
[Figure: the underflowing leaf is merged with its sibling]
37. B+ tree Delete Coalesce (4/4)
[Figure: after coalescing, a single separator 205 remains; the leaves hold 100, 120, 130, 150, 156, 179, 200, 205, 210, 215]
38. Hashing Techniques
39. Static Hashing (1/2)
- Store records in buckets with overflow chains
- Allocate a fixed number of buckets, M
- Problems:
  - M too small: long overflow chains; slow search, insert, and delete
40. Static Hashing (2/2)
- Problems:
  - M too large: wasted space, slow scans (see the sketch below)
[Figure: hash function h mapping keys into buckets, several of which are empty (null)]
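A toy model of the scheme, assuming integer keys, with each bucket a flat list standing in for a primary page plus its overflow chain:

```python
class StaticHashFile:
    """Static hashing sketch: M buckets, fixed forever at creation time."""
    def __init__(self, m):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def insert(self, key):
        # A real system would append to the bucket's last non-full page,
        # allocating overflow pages as the chain grows.
        self.buckets[key % self.m].append(key)

    def search(self, key):
        # With M too small, this scan is the long-overflow-chain problem;
        # with M too large, most buckets sit empty and full scans waste I/O.
        return key in self.buckets[key % self.m]
```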
41. Dynamic Hashing
- Split and coalesce buckets as the database grows and shrinks
- One scheme: Extendible Hashing
  - the hash function generates large values, e.g. 32 bits
  - use i bits, changing i as the database size changes
- On overflow, double the number of buckets
  - use i+1 bits of the hash function
  - but this is expensive: read all M pages and redistribute the records over 2M pages
  - solution: use a directory, and double only the directory
    - split just the bucket that overflowed
42. Extendible Hashing (1/4)
[Figure: key 18 with h(18) = 10010 routed through the directory]
43. Extendible Hashing (2/4)
[Figure: a directory of global depth 2 (00, 01, 10, 11) over buckets A = {16, 20} (LD 2), B (LD 1), C = {18} (LD 2), D = {3, 7} (LD 2); inserting 4 with h(4) = 00100 overflows bucket A]
44. Extendible Hashing (3/4)
[Figure: bucket A splits into A = {16} and A1 = {20, 4}, both with local depth 3, which exceeds the global depth of 2]
45. Extendible Hashing (4/4)
- Global Depth (GD), Local Depth (LD)
- If a bucket is full:
  - split the bucket
  - increment its LD
- If GD = LD:
  - increment GD
  - double the directory (see the sketch below)
[Figure: the directory doubled to global depth 3 (000-111); buckets A = {16}, B, C = {18}, D = {3, 7}, A1 = {20, 4} with their local depths]
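A compact sketch of the whole scheme under the assumptions of these slides: integer keys, the hash taken as the low-order bits of the key itself, and the directory indexed by the low GD bits. Names are illustrative:

```python
class ExtendibleHash:
    """Extendible hashing sketch: a directory of 2^GD slots points into
    shared buckets; each bucket tracks its own local depth LD."""

    def __init__(self, bucket_capacity=2):
        self.gd = 1
        self.cap = bucket_capacity
        self.dir = [{"ld": 1, "keys": []}, {"ld": 1, "keys": []}]

    def _bucket(self, key):
        return self.dir[key & ((1 << self.gd) - 1)]  # low GD bits of h(k)

    def insert(self, key):
        b = self._bucket(key)
        if len(b["keys"]) < self.cap:
            b["keys"].append(key)
            return
        if b["ld"] == self.gd:        # GD == LD: must double the directory
            self.dir = self.dir + self.dir
            self.gd += 1
        b["ld"] += 1                  # split only the overflowed bucket
        image = {"ld": b["ld"], "keys": []}
        pending = b["keys"] + [key]
        b["keys"] = []
        for i, slot in enumerate(self.dir):
            # slots whose index has the new bit set move to the split image
            if slot is b and (i >> (b["ld"] - 1)) & 1:
                self.dir[i] = image
        for k in pending:             # redistribute (may split again)
            self.insert(k)
```

Note the key point of the directory trick: doubling copies pointers only, so every bucket except the one that overflowed is left untouched.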
46. Extendible Hashing Delete
- If a deletion leaves a bucket empty:
  - merge it with its split image
- If every directory pointer points to the same bucket as its split image:
  - halve the directory
47. Extendible Hashing Summary
- Avoids overflow pages
- The directory can get large
- Key search requires just 2 page reads
- Space utilization fluctuates
  - 59%-90% for uniformly distributed records
48. Extendible Hashing Exercise
- Initially GD = LD = 1
- M = 2 buckets
- Hash function: h(k) = k mod 2^i
- Insert 14, 18, 22, 3, 9
- Delete 9, 22, 3 (the insert half is traced in the sketch below)
[Figure: initial state; the directory entries point to a bucket holding 12 and 8 and a bucket holding 5]
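A trace of the insert half of this exercise, reusing the ExtendibleHash sketch from slide 45 (bucket capacity 2; the deletes and directory halving are not modelled by that sketch):

```python
eh = ExtendibleHash(bucket_capacity=2)
for k in [12, 8, 5]:           # reproduce the initial state on the slide
    eh.insert(k)
for k in [14, 18, 22, 3, 9]:   # the inserts from the exercise
    eh.insert(k)

print("GD =", eh.gd)           # the global depth has grown to 3
seen = set()
for i, b in enumerate(eh.dir): # print each distinct bucket once
    if id(b) not in seen:
        seen.add(id(b))
        print(format(i, f"0{eh.gd}b"), "->", b["keys"], "LD =", b["ld"])
```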