INDEXING (CHAPTER 11)

About This Presentation

Title:

INDEXING (CHAPTER 11)

Description:

Title: PowerPoint Presentation Author: Ramin Shahriari Last modified by: shahram Created Date: 8/11/2002 5:04:14 AM Document presentation format: On-screen Show – PowerPoint PPT presentation

Number of Views:55

Avg rating:3.0/5.0

Slides: 60

Provided by: RaminSh1

Category:

more less

Transcript and Presenter's Notes

Title: INDEXING (CHAPTER 11)

1
INDEXING (CHAPTER 11)
2
TOPICS

Basic concepts
Hashing
B-tree

3
INTRODUCTION

Review

E-R data model
Conceptual
Logical
Relational data model SQL
relation a file Org. of records on a disk
page Organization of attributes within a
record Index Files
Physical
4
Software Architecture of a DBMS
Query Parser
Query Optimizer
Query Interpretor
Relational Algebra operators ?, ?, ?, ?, ?, ?,
?, ?, ?
Index structures
Abstraction of records
Buffer Pool Manager
File System
5
Implementation of ?
SS Name Age Salary dno
1 Joe 24 20000 2
2 Mary 20 25000 3
3 Bob 22 27000 4
4 Kathy 30 30000 5
5 Shideh 4 4000 1

Emp table
?Salary30,000(Employee)
Process the select operator using a file scan
(linear scan)
F1 Open the file corresponding to Employee
P read first page of F1
While P is not null
For each record in P, if the record satisfies the
selection predicate then produce as output
P read next page of F1 / P becomes null when
EoF is reached /

SS Name Age Salary dno
4 Kathy 30 30000 5
6
Implementation of ?
SS Name Age Salary dno
1 Joe 24 20000 2
2 Mary 20 25000 3
3 Bob 22 27000 4
4 Kathy 30 30000 5
5 Shideh 4 4000 1

Emp table
?Salary30,000(Employee)
Process the select operator using a file scan
(linear scan)
F1 Open the file corresponding to Employee
P read first page of F1
While P is not null
For each record in P, if the record satisfies the
selection predicate then produce as output
P read next page of F1

SS Name Age Salary dno
4 Kathy 30 30000 5
Fetch the page from disk if not in the buffer
pool
7
Implementation of ?
SS Name Age Salary dno
1 Joe 24 20000 2
2 Mary 20 25000 3
3 Bob 22 27000 4
4 Kathy 30 30000 5
5 Shideh 4 4000 1

Emp table
?Salary30,000(Employee)
Process the select operator using a file scan
(linear scan)
F1 Open the file corresponding to Employee
P read first page of F1
While P is not null
For each record in P, if the record satisfies the
selection predicate then produce as output
P read next page of F1

SS Name Age Salary dno
4 Kathy 30 30000 5
Header

8
TERMINOLOGY

An exact match selection predicate
?Salary30,000(Employee) , ?FirstNameShideh(Emp
loyee)
A range selection predicate ?Salarygt30,000(Employ
ee) , ?Salarylt30,000(Employee), ?Salarygt30,000
and Salary lt 32,000 (Employee)

9
INTRODUCTION (Cont)

Motivation Speed-up those queries that reference
only a small portion of the records in a file.
Analogy Catalog cards in the library (more than
one index).
Evaluation
1. Access time (find)
2. Insertion time (find add)
3. Deletion time (find delete)
4. Space overhead
Search-key The attribute (or set of attributes)
used to lookup records in a file
Primary index The index whose search key
specifies the sequential order of the records
within a file.
Secondary index The index whose search key does
not specify the sequential order of the records
within a file.

10
Indexing

Mapping of records to a finite space using a
function (computation)
E.g., http//www.realmagic.net/dp/1-1.htm

11
K F(K)
10 10-19
11 11-29
12 12-39
13 13-49
14 14-59
15 15-69
16 16-79
17 17-89
18 18-99
19 19-109
20 20-218
21 21-318
22 22-418

29 29-1118
30 30-327
31 31-427

39 39-1227
K F(K)
40 40-436
..
49 49-1336
50 50-545

59 59-1445
.. ..
69 69-1554

79 79-1664

89 89-1772
..
99 99-1881
12
KEY OBSERVATIONS

A deterministic function (analytical) to map a
large set of key values, potentially infinite in
size, to a set that is small and finite.
The deterministic function might be a randomizing
function making it appropriate for processing
exact-match selection predicate.
To process range selection predicates, the
deterministic function should be
order-preserving.

13
INTRODUCTION (Cont)

Example
Assume, size of disk page 2 data records 5
index records.
Indexing or not indexing?
SELECT age SELECT age
FROM personnel FROM personnel
WHERE name Alice WHERE name Don

14
INTRODUCTION (Cont)

Example
Assume, size of disk page 2 data records 5
index records.
Primary vs. Secondary
SELECT name SELECT age
FROM personnel FROM personnel
WHERE state Ohio WHERE name David

15
INTRODUCTION (Cont)

Example (page 2 data 5 index)
Exact match vs. Range
SELECT name SELECT name
FROM personnel FROM personnel
WHERE state California WHERE state gt
Alaska and
state lt
Florida
Speedup by employing binary search (is it
possible?)

16
Dense Index Files

Dense index Index record appears for every
search-key value in the file.

17
Example of Sparse Index Files
18
Multilevel Index
19
HASHING

Hash function
K the set of all search key values
V the set of all bucket address
h(K) K V
K is large (perhaps infinite) but set of
search-key values actually stored in the database
is much smaller than K.
Fast lookup To find Ki, search the bucket with
h(Ki) address.

20
HASHING (Cont)

Example
K salary (set of all 6 digit integers)
V 1000 buckets addressed from 0 to 999
h(k) k mod 1000.
SELECT name
FROM personnel
WHERE salary 120,100
To find a 120,100 salary, we should search bucket
number 100.
Hash is only appropriate for Exact match queries.
A bad hash function maps the value to a subset of
(or a few) buckets (e.g., h(k) k mod 10.

21
HASHING (Cont)

Clustered Hash Index
The index structure and its buckets are
represented as a file (say file.hash)
The relation is stored in file.hash (I.e., each
entry in file.hash corresponds to a record in
relation)
Assuming no duplicates the record can be
accessed in 1 IO.
Non-clustered Hash Index
The index structure and its buckets are
represented as a file (say file.hash)
The relation remains intact
Each entry in file.hash has the following format
(search-key value, RID)
Assuming no duplicates the record can be
accessed in 2 IO.

22
HEAP FILE ORGANIZATION

Assume a student table Student(name, age, gpa,
major)
t(Student) 16
P(Student) 4

Bob, 21, 3.7, CS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Chris, 22, 3.9, CS
Mary, 24, 3, ECE
Lam, 22, 2.8, ME
Martha, 29, 3.8, CS
Chad, 28, 2.3, LS
Tom, 20, 3.2, EE
Chang, 18, 2.5, CS
James, 24, 3.1, ME
Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
Pat, 19, 2.8, EE
Shideh, 16, 4, CS
23
Non-Clustered Hash Index

A non-clustered hash index on the age attribute
with 4 buckets,
h(age) age B

(24, (1, 2))
(20, (4,3))
(32, (3,1))
(16, (4,4))
(21, (1, 1))
(20, (1,3))
(24, (3,3))
(17, (2,4))
(28, (4,2))
(29, (3,2))
0
1
2
(18, (1, 4))
3
(22, (2,2))
(19, (2, 1))
(22, (4,1))
(19, (3, 4))
(18, (2,3))
Bob, 21, 3.7, CS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Chris, 22, 3.9, CS
Mary, 24, 3, ECE
Lam, 22, 2.8, ME
Martha, 29, 3.8, CS
Chad, 28, 2.3, LS
Tom, 20, 3.2, EE
Chang, 18, 2.5, CS
James, 24, 3.1, ME
Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
Pat, 19, 2.8, EE
Shideh, 16, 4, CS
24
Clustered Hash Index

A clustered hash index on the age attribute with
4 buckets,
h(age) age B

Mary, 24, 3, ECE
Shideh, 16, 4, CS
Louis, 32, 4, LS
Leila, 20, 3.5, LS
Bob, 21, 3.7, CS
Tom, 20, 3.2, EE
James, 24, 3.1, ME
Vera, 17, 3.9, EE
Chad, 28, 2.3, LS
Martha, 29, 3.8, CS
0
1
2
Kathy, 18, 3.8, LS
3
Lam, 22, 2.8, ME
Kane, 19, 3.8, ME
Chris, 22, 3.9, CS
Pat, 19, 2.8, EE
Chang, 18, 2.5, CS
25
Non-Clustered Hash Index

A non-clustered hash index on the age attribute
with 4 buckets,
h(age) age B
Pointers are page-ids

500
(24, (1, 2))
(20, (4,3))
1001
(32, (3,1))
(16, (4,4))
(21, (1, 1))
(20, (1,3))
(24, (3,3))
(17, (2,4))
(28, (4,2))
(29, (3,2))
0
500
706
1
1001
2
(18, (1, 4))
706
101
3
101
(22, (2,2))
(19, (2, 1))
(22, (4,1))
(19, (3, 4))
(18, (2,3))
Bob, 21, 3.7, CS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Chris, 22, 3.9, CS
Mary, 24, 3, ECE
Lam, 22, 2.8, ME
Martha, 29, 3.8, CS
Chad, 28, 2.3, LS
Tom, 20, 3.2, EE
Chang, 18, 2.5, CS
James, 24, 3.1, ME
Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
Pat, 19, 2.8, EE
Shideh, 16, 4, CS
26
Clustered Hash Index (SEQUENTIAL LAYOUT)

A clustered hash index on the age attribute with
4 buckets,
h(age) age 4
When the number of buckets are known in advance,
the system may assume a sequentially laid file to
eliminate the need for the hash directory.

Shideh, 16, 4, CS
Leila, 20, 3.5, LS
James, 24, 3.1, ME
Mary, 24, 3, ECE
Bob, 21, 3.7, CS
Kathy, 18, 3.8, LS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Lam, 22, 2.8, ME
Pat, 19, 2.8, EE
Vera, 17, 3.9, EE
Tom, 20, 3.2, EE
Chris, 22, 3.9, CS
Martha, 29, 3.8, CS
Chad, 28, 2.3, LS
Chang, 18, 2.5, CS
27
Clustered Hash Index (SEQUENTIAL LAYOUT)

A clustered hash index on the age attribute with
4 buckets,
h(age) age 4
When the number of buckets are known in advance,
the system may assume a sequentially laid file to
eliminate the need for the hash directory.

Offset (bucket-id 1) times page size is for
bucket-id
Shideh, 16, 4, CS
Leila, 20, 3.5, LS
James, 24, 3.1, ME
Offset 0 is for bucket 0
Offset Page Size is for bucket 1
Mary, 24, 3, ECE
Bob, 21, 3.7, CS
Kathy, 18, 3.8, LS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Lam, 22, 2.8, ME
Pat, 19, 2.8, EE
Vera, 17, 3.9, EE
Tom, 20, 3.2, EE
Chris, 22, 3.9, CS
Martha, 29, 3.8, CS
Chad, 28, 2.3, LS
Chang, 18, 2.5, CS
28
Block address on disk
Bucket Number
0

1
2
M-2
M-1
29
Example of Non-Clustered Hash Index
30
Overflow buckets
Main buckets
340
981
0

Record pointer

460
Record pointer
1
181
Record pointer
2
Record pointer
321

551

Record pointer
761
91
Record pointer
Record pointer
Record pointer
22

72
9
522
Record pointer
31
HASHING (Cont)

Loading factor
B of buckets, S of records per bucket, R
of records in the relation
loading - factor R / (BS)
The loading factor should not exceed 80, if that
happens, double B and re-hash.
Why a bucket might overflow?
Heavy loading of the file
Poor hash functions
Statistical peculiarities
If a bucket overflows?
Chaining chain an empty bucket to the bucket
that overflows.
Open addressing If bucket h(k) is full, store
the record in h(k) 1, if that is also full, try
h(k) 2, and so on.
Two hash functions If bucket h(k) is full, store
the record in h(k).

32
HASHING (Cont)

Problem The file grows and shrinks over time.
Hence, how one should choose the hash function
1. Based on current file size
performance degradation as DB grows
2. Based on anticipated file size waste
space initially (and reduced buffer hits)
3. Periodical reorganization time
consuming
3.1. Choose new hash function
3.2. Recompute hash value on every record
3.3. Generate new bucket assignments
Solution
Dynamic hash functions dynamic modification of h
to accommodate growth and shrinkage of the DB.
(e.g., extendible hashing)

33
HASHING (Cont)

Extendible hashing
Choose a hash function (h) such that it results
in a b (b 32) bit binary number.
The directory has a header that contains its
depth, d.
Each directory entry points to a hash bucket.
Buckets are created on demand, as records are
inserted.
Each bucket contains a local depth used to find
data.

Directory depth
bucket
directory
00
01
10
siblings
11
34
HASHING (Cont)

Extendible hashing (continued)
Every time a bucket overflows, its local depth is
increased. If the local depth is greater
than the depth of the directory, the directorys
depth is increased, causing the directory to
double in size.
Each directory entry has one sibling or buddy.
Two entries are buddies if they have identical
bit patterns except for the dth bit.
Every time a bucket overflows, its local depth is
increased.
If the local depth is greater than the depth of
the directory, then the directorys depth is
increased, causing the directory to double in
size.
A bucket can overflow at any desired loading
factor. That is, a split might happen every time
a bucket is 80 full.

35
HASHING (Cont)

Retrieval with Extendible hashing
Retrieve (K0)
Calculate h h(K0)
Read depth d of the directory
Interpret the d initial bits of h as an integer
base 2, term this r.
Retrieve the bucket pointed to by the rth entry
Find the record in this bucket
5.1. If a hashing technique is used to organize
the records in a bucket, use the d bits defined
on that bucket
5.2. If necessary, follow the collision
resolution scheme within this bucket.

36
HASHING (Cont)

Insertion with Extendible hashing
Insert (K0)
Apply the first four steps of Retrieve (K0) to
find bucket b.
If the insertion of K0 into b result in no
overflow then Insert K0 into b and return
Otherwise, obtain a new bucket b
Set the local depth of b and b to equal (local
depth of b 1)
If the new depth is NOT greater than the depth of
the directory
5.1. Distinguish between b and b using their
new d and set the appropriate entry(ies) of the
directory to point to each
5.2. Rehash the entries in bucket b and
assign each individual entry to the appropriate
bucket b or b
5.3. Insert (K0)
If the new depth is greater than the depth of the
directory
6.1 Increase the depth of the directory,
doubling its size
6.2. Set each entry and its buddy to point to
the old bucket that it was pointing to
6.3. Rehash the entries in bucket b and assign
each individual entry to the appropriate bucket b
or b
6.4. Insert (K0)

37
HASHING (Cont)

Deletion with Extendible hashing
Delete (K0)
Apply the first four steps of Retrieve (K0) to
find bucket b.
If K0 is not b then return with value no found
Otherwise, delete the entry corresponding to K0
If the sum of the number of entries on this page
and its sibling page are below the size of a
bucket then
4.1. Copy the entries in the two buckets
into one bucket b
4.2. Depth of b (depth of b - 1)
4.3. Free bucket b
4.4. Locate the two hash directory
entries pointing to b and its buddy. Set these
two pointers to b
4.5. If every pointer in the directory
equals its sibling pointer then decrease the
depth of the directory by one and set each
entry in an obvious manner.

38
Use of Extendable Hash Structure Example
Initial Hash structure, bucket size 2
39
Example (Cont.)

Hash structure after insertion of one Brighton
and two Downtown records

0
1
40
Example (Cont.)
Hash structure after insertion of Mianus record
00
01
10
11
41
Example (Cont.)
000
001
010
011
100
101
110
111
Hash structure after insertion of three
Perryridge records
42
Example (Cont.)

Hash structure after insertion of Redwood and
Round Hill records

000
001
010
011
100
101
110
111
43
Extendible Hashing

A hash-index structure might be represented as a
file.
The directory might be represented in a disk
page.
Each budget is represented as a disk page
(including the overflow pages).
Assuming the disk page size is 128 Kilobytes, an
index structure with 100 pages would consist of
1 page for the index directory assuming each
entry in the directory is an RID (disk page
number, slot number), it consists of two integers
(4 bytes each) for a total of 8 bytes. Thus, the
directory is 800 bytes long and fits in one disk
page.
100 pages one for each bucket.
Directory occupies less than 1 of the total file
size.
Conclusion The size of the hash-index file is
dominated by the number of buckets.

44
HASHING (Cont)

Extendible hashing
The insertion algorithm of extendible hashing
might crash when

45
HASHING (Cont)

Hashing vs. Indexing
Hashing is appropriate for exact match queries
(cannot support range queries)
SELECT A1, A2,
FROM r
WHERE (Ai c)
Indexing is appropriate for both range and exact
match queries
SELECT A1, A2,
FROM r
WHERE (Ai lt c1) and (Ai gt c2)

46
Example

Suppose that we are using extendable hashing on a
file that contains records with the following
search key values
2, 3, 5, 7, 11, 17, 19, 23, 29, 31
Show the extendible hash structure for a 5 bit
machine assuming hash function is h(x) x mod 8
and buckets can hold three records

47
B-TREE

B-tree is a multi-level tree structured
directory
Clustered Leaf nodes contain the records,
themselves.

48
B-TREE (Cont)

Non-clustered Leaf nodes contain the pairs (P,
K), where P is a pointer to the record in the
file and K is a search-key.

49
B-TREE (Cont)

Leaf nodes
Maintain between to n-1 values per
leaf.
If i lt j then Ki lt Kj
Every search-key value in the file appears in
some leaf node.
Suppose Li and Lj are two leaves and i lt j, then
every search value in Li is less than every
search value in Lj.

P1
K1
P2
. . .
Pn-1
Kn-1
Pn
5
7
10
15
17
18
50
B-TREE (Cont)

Internal nodes
Maintain between to n pointers per
internal node
root is an exception It must have more than one
pointer.
Suppose a node with m pointers and 2lt i lt m
Pi points to subtree containing search-key values
lt Ki and gt Ki-1.
Pm points to subtree containing search-key values
gt Km-1.
P1 points to subtree containing search-key values
lt K1.

51
B-TREE (Cont)

Lookup
Find 7 4 Ios
Find 4-20 5 IOs (assuming primary index), 9 IOs
(assuming secondary index)
More than 10 selection it is more efficient to
do sequential scan (do not use the secondary
index).
Example 10,000 records, select 1000 of them,
1000 records per disk page (Sequential search
10 IOs, Secondary index potentially 1000 IOs)

52
B-TREE (Cont)

Analysis
B in B-tree stands for Balanced. i.e., the
length of every path from the root to a leaf node
is the same.
Hence, good performance for lookup, insertion,
and deletion
K number of search key values in a file, then
the path is lt log (K).
K 1,000,000, and 10 lt n lt 100 then at most 3
to 9 nodes be accessed.
Insertion and Deletion should not destroy the
balance of the tree.
References
J. Jannink, Inplementing Deletion in B-Trees,
SIGMOD RECORD, Volume 24, Number 1 (March 1995),
pages 33-38.
D. Comer, The Ubiquitous B-tree, ACM Computing
Surveys, Volume 11, Number 2 (June 1979), pages
121-137

53
B-TREE (Cont)
n 4 Internal nodes 2 to 4 pointers Leaf
nodes 2 to 3 values
8
25
Insert 41
10
20
4
7
30
40
41
Insert 47
30
40
41
47
8
25
41
10
20
4
7
30
40
41
47
54
B-TREE (Cont)
Insert 50
Insert 52
41
47
50
52
8
25
41
50
41
25
8
50
10
20
4
7
30
40
41
47
50
52
55
B-TREE (Cont)
Delete 20
30
underflow
8
41
50
4
7
30
40
41
47
50
52
10
30
41
50
10
4
7
41
47
50
30
40
52
56
HEAP FILE ORGANIZATION

Assume a student table Student(name, age, gpa,
major)
t(Student) 16
P(Student) 4

A non-clustered secondary B-tree on the gpa
attribute

3.6
(3.7, (1, 1))
(3.9, (4,1))
(2.3, (4, 2))
(3, (1,2))
(3.8, (3,2))
(3.9, (2,4))
(2.5, (2,3))
(3.1, (3,3))
(3.8, (2,1))
(4, (3,1))
(2.8, (2,2))
(3.2, (1,3)
(3.8, (1,4))
(2.8, (3,4))
(4, (4,4))
(3.5, (4,3))
Bob, 21, 3.7, CS
Kane, 19, 3.8, ME
Louis, 32, 4, LS
Chris, 22, 3.9, CS
Mary, 24, 3, ECE
Lam, 22, 2.8, ME
Martha, 29, 3.8, CS
Chad, 28, 2.3, LS
Tom, 20, 3.2, EE
Chang, 18, 2.5, CS
James, 24, 3.1, ME
Leila, 20, 3.5, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
Pat, 19, 2.8, EE
Shideh, 16, 4, CS
58
Non-Clustered Primary B-Tree

A non-clustered primary B-tree on the gpa
attribute

3.6
(3.7, (3, 1))
(3.9, (4,1))
(2.3, (1, 1))
(3, (2,1))
(3.8, (3,2))
(3.9, (4,2))
(2.5, (1,2))
(3.1, (2,2))
(3.8, (3,3))
(4, (4,3))
(2.8, (1,3))
(3.2, (2,3)
(3.8, (3,4))
(2.8, (1,4))
(4, (4,4))
(3.5, (2,4))
Bob, 21, 3.7, CS
Mary, 24, 3, ECE
Chris, 22, 3.9, CS
Chad, 28, 2.3, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
James, 24, 3.1, ME
Chang, 18, 2.5, CS
Kane, 19, 3.8, ME
Lam, 22, 2.8, ME
Louis, 32, 4, LS
Tom, 20, 3.2, EE
Martha, 29, 3.8, CS
Pat, 19, 2.8, EE
Leila, 20, 3.5, LS
Shideh, 16, 4, CS
59
Clustered B-Tree

A clustered B-tree on the gpa attribute

2.9
3.6
3.8
Bob, 21, 3.7, CS
Mary, 24, 3, ECE
Chris, 22, 3.9, CS
Chad, 28, 2.3, LS
Kathy, 18, 3.8, LS
Vera, 17, 3.9, EE
James, 24, 3.1, ME
Chang, 18, 2.5, CS
Kane, 19, 3.8, ME
Lam, 22, 2.8, ME
Louis, 32, 4, LS
Tom, 20, 3.2, EE
Martha, 29, 3.8, CS
Pat, 19, 2.8, EE
Leila, 20, 3.5, LS
Shideh, 16, 4, CS

Write a Comment

User Comments (0)

About PowerShow.com

INDEXING (CHAPTER 11) - PowerPoint PPT Presentation

INDEXING (CHAPTER 11)

Title: PowerPoint Presentation Author: Ramin Shahriari Last modified by: shahram Created Date: 8/11/2002 5:04:14 AM Document presentation format: On-screen Show – PowerPoint PPT presentation