Indexes and B-Trees - PowerPoint PPT Presentation

1 / 50
About This Presentation
Title:

Indexes and B-Trees

Description:

'If I had eight hours to chop down a tree, I'd spend six sharpening my ax. ... It will be in class, at the usual time. Next Tuesday's class will be a review ... – PowerPoint PPT presentation

Number of Views:42
Avg rating:3.0/5.0
Slides: 51
Provided by: RaghuRama129
Category:

less

Transcript and Presenter's Notes

Title: Indexes and B-Trees


1
Indexes and B-Trees
  • Lecture 9
  • R G Chapters 8 9

If I had eight hours to chop down a tree, I'd
spend six sharpening my ax. Abraham
Lincoln
2
Administrivia
  • Homework 1 Due Tonight, 10pm
  • Homework 2 Available Today, Due 3 weeks from
    today
  • Midterm Exam 1 will be a week from Thursday
  • It will be in class, at the usual time
  • Next Tuesdays class will be a review
  • More SQL Exercises on the Class Website

3
Review
  • Last two weeks
  • Formal Query Languages Rel. Algebra Calculus
  • Actual Query Language SQL
  • This week Indexes
  • Tree Indexes (HW2)
  • Hash Indexes
  • Next week
  • Review
  • Midterm 1

4
Review Division A/B
B1
B2
B3
A/B1
A/B2
A/B3
A
5
Review Use NOT Exists for Division
Recall X/Y means only give me X tuples that
have a match in Y.
Find sailors whove reserved all boats. X set
of sailors and Y set of all boats with
reservations.
SELECT S.sname FROM Sailors S WHERE NOT EXISTS
(SELECT B.bid FROM Boats B WHERE NOT
EXISTS (SELECT R.bid FROM
Reserves R WHERE R.bidB.bid
AND R.sidS.sid))
Find Sailors S such that ...
there is no boat B...
without a reservation by Sailor S
6
Division
SELECT S.sname FROM Sailors S WHERE NOT EXISTS
(SELECT B.bid FROM Boats B WHERE NOT
EXISTS (SELECT R.bid FROM
Reserves R WHERE R.bidB.bid
AND R.sidS.sid))
101
103
1
3
2
Reserves
Sailors
sid bid day
1 103 9/12
2 103 9/13
3 103 9/14
3 101 9/12
1 103 9/13
Boats
R
sid sname rating age
1 Frodo 7 22
2 Bilbo 2 39
3 Sam 8 27
bid bname color
101 Nina red
103 Pinta blue
S
R
B
S
R
B
R
S
R
7
Null Values
  • Values are sometimes
  • unknown (e.g., a rating has not been assigned)
    or
  • inapplicable (e.g., no spouses name).
  • SQL provides a special value null for such
    situations.
  • The presence of null complicates many issues.
    E.g.
  • Special operators needed to check if value is/is
    not null.
  • ratinggt8 - true or false when rating is null?
    What about AND, OR and NOT connectives?
  • Need a 3-valued logic (true, false and unknown).
  • Meaning of constructs must be defined carefully.
    (e.g., WHERE clause eliminates rows that dont
    evaluate to true.)
  • New operators (in particular, outer joins)
    possible/needed.

8
Null Values 3 Valued Logic
(null gt 0) (null 1) (null 0) null AND true
is null
is null
is null
is null
AND T F Null
T
F
NULL
OR T F Null
T
F
NULL
T
T
F
Null
T
T
F
Null
F
F
F
T
Null
F
T
Null
Null
Null
9
Null Values in SQL
  • Where clause must evaluate to true
  • IS NULL operator, e.g. where name is null
  • IS NOT NULL operator
  • Outer Joins Left, Right, Full

10
SELECT s.sid, s.name, r.bidFROM Sailors s LEFT
OUTER JOIN Reserves rON s.sid r.sid

11
SELECT r.sid, b.bid, b.nameFROM Reserves r RIGHT
OUTER JOIN Boats bON r.bid b.bid

12
Review Buffer Management and Files
  • Storage of Data
  • Fields, either fixed or variable length...
  • Stored in Records...
  • Stored in Pages...
  • Stored in Files
  • If data wont fit in RAM, store on Disk
  • Need Buffer Pool to hold pages in RAM
  • Different strategies decide what to keep in pool

13
Today File Organization
  • How to keep pages of records on disk
  • but must support operations
  • scan all records
  • search for a record id RID
  • search for record(s) with certain values
  • insert new records
  • delete old records

14
Alternative File Organizations
  • Many alternatives exist, tradeoffs for each
  • Heap files
  • Suitable when typical access is file scan of all
    records.
  • Sorted Files
  • Best for retrieval in search key order
  • Also good for search based on search key
  • Indexes Organize records via trees or hashing.
  • Like sorted files, speed up searches for search
    key fields
  • Updates are much faster than in sorted files.

15
Indexes
  • Often want to get records byvalues in one or more
    fields, e.g.,
  • Find all students in the CS department
  • Find all students with a gpa gt 3
  • An index on a file is a
  • Disk-based data structure
  • Speeds up selections on the search key fields for
    the index.
  • Any subset of the fields of a relation can be
    index search key
  • Search key is not the same as key
  • (e.g. doesnt have to be unique ID).
  • An index
  • Contains a collection of key/data entry pairs
  • Supports efficient retrieval of all records with
    a given search key value k.

16
Index Classification
  • What selections does it support?
  • What does index actually store?
  • 3 alternatives
  • Data record with key value k
  • ltk, rid of data recordgt
  • ltk, list of rids of data recordsgt
  • Clustered vs. Unclustered Indexes
  • Single Key vs. Composite Indexes
  • Tree-based, hash-based, other
  • Can have multiple (different) indexes per file.
  • E.g. file sorted by age, with a hash index on
    salary and a Btree index on name.

17
First Question to Ask About an Index
  • What kinds of selections does it support?
  • Selections of form field ltopgt constant
  • Equality selections (op is )
  • Range selections (op is one of lt, gt, lt, gt,
    BETWEEN)
  • More exotic selections
  • 2-dimensional ranges (east of Berkeley and west
    of Truckee and North of Fresno and South of
    Eureka)
  • Or n-dimensional
  • 2-dimensional distances (within 2 miles of Soda
    Hall)
  • Or n-dimensional
  • Ranking queries (10 restaurants closest to
    VLSB)
  • Regular expression matches, genome string
    matches, etc.
  • One common n-dimensional index R-tree

18
What data is held by the index?
  • Alternative 1 Actual data record (with key
    value k)
  • Index structure is file organization for data
    records (like Heap files or sorted files).
  • At most one index on a table can use Alternative
    1.
  • Saves pointer lookups
  • Can be expensive to maintain with insertions and
    deletions.

19
What data is held by the index? (Contd.)
  • Alternative 2
  • ltk, ridgt
  • and Alternative 3
  • ltk, list of ridsgt
  • Easier to maintain than Alt 1.
  • At most one index can use Alternative 1 any
    others must use Alternatives 2 or 3.
  • Alternative 3 more compact than Alternative 2,
    but leads to variable sized data entries even if
    search keys are of fixed length.
  • Even worse, for large rid lists the data entry
    might have to span multiple pages!

20
Clustered and Unclustered
  • Clustered vs. unclustered
  • If order of data records is the same as, or
    close to, order of index data entries, then
    called clustered index.
  • A file can be clustered on at most one search
    key.
  • Cost to retrieve data records with index varies
    greatly based on whether index clustered or not!
  • Alternative 1 implies clustered, but not
    vice-versa.

21
Clustered vs. Unclustered Index
  • Suppose that Alternative (2) is used for data
    entries, and that the data records are stored in
    a Heap file.
  • To build clustered index, first sort the Heap
    file (with some free space on each block for
    future inserts).
  • Overflow blocks may be needed for inserts.
    (Thus, order of data recs is close to, but not
    identical to, the sort order.)

Index entries
UNCLUSTERED
CLUSTERED
direct search for
data entries
Data entries
Data entries
(Index File)
(Data file)
Data Records
Data Records
22
Unclustered vs. Clustered Indexes
  • What are the tradeoffs????
  • Clustered Pros
  • Efficient for range searches
  • May be able to do some types of compression
  • Possible locality benefits (related data?)
  • Clustered Cons
  • Expensive to maintain (on the fly or sloppy with
    reorganization)

23
Hash-Based Indexes
  • Good for equality selections.
  • Index is a collection of buckets. Bucket
    primary page plus zero or more overflow pages.
  • Hashing function h
  • h(r) bucket in which record r belongs.
  • h looks at the search key fields of r.
  • If Alternative (1) is used, the buckets contain
    the data records otherwise, they contain ltkey,
    ridgt or ltkey, rid-listgt pairs.

24
B Tree Indexes
Non-leaf
Pages
Leaf
Pages
  • Leaf pages contain data entries, and are chained
    (prev next)
  • Non-leaf pages contain index entries and direct
    searches

index entry
P
K
P
K
P
P
K
m
0
1
2
1
m
2
25
Comparing File Organizations
  • Heap files (random order insert at eof)
  • Sorted files, sorted on ltage, salgt
  • Clustered B tree file, Alternative (1), search
    key ltage, salgt
  • Heap file with unclustered B tree index on
    search key ltage, salgt
  • Heap file with unclustered hash index on search
    key ltage, salgt

26
Operations to Compare
  • Scan Fetch all records from disk
  • Fetch all records in sorted order
  • Equality search
  • Range selection
  • Insert a record
  • Delete a record

27
Cost Model for Analysis
  • I/O cost 150,000 times more than hash function
  • We ignore CPU costs, for simplicity
  • B The number of data pages
  • R Number of records per page
  • F Fanout of B-tree
  • Average-case analysis based on several
    simplistic assumptions.
  • Good enough to show the overall trends!

28
Assumptions in Our Analysis
  • Heap Files
  • Equality selection on key exactly one match.
  • Sorted Files
  • Files compacted after deletions.
  • Indexes
  • Alt (2), (3) data entry size 10 size of
    record
  • Hash No overflow buckets.
  • 80 page occupancy gt File size 1.25 data size
  • Tree 67 occupancy (this is typical).
  • Implies file size 1.5 data size

29
I/O Cost of Operations
Heap File
Scan all records B
Get all in sort order 4B
Equality Search 0.5 B
Range Search B
Insert 2
Delete 0.5B 1
B Number of data pages (packed) R Number of
records per page S Time required for equality
search
30
I/O Cost of Operations
Sorted File
Scan all records B
Get all in sort order B
Equality Search log2 B
Range Search S matching pages
Insert S B
Delete S B
B Number of data pages (packed) R Number of
records per page S Time required for equality
search
31
I/O Cost of Operations
Clustered Tree
Scan all records 1.5 B
Get all in sort order 1.5 B
Equality Search logF (1.5 B)
Range Search S matching pages
Insert S 1
Delete 0.5B 1
B Number of data pages (packed) R Number of
records per page F Fanout of B-Tree S Time
required for equality search
32
I/O Cost of Operations
Unclustered Tree
Scan all records B (ignore index)
Get all in sort order 4B (ignore index)
Equality Search logF (.15 B) 1
Range Search S matching records
Insert S 2
Delete S 2
B Number of data pages (packed) R Number of
records per page F Fanout of B-Tree S Time
required for equality search
33
I/O Cost of Operations
Hash Index
Scan all records B (ignore index)
Get all in sort order 4B (ignore index)
Equality Search 2
Range Search B (ignore index)
Insert 4
Delete S 2
B Number of data pages (packed) R Number of
records per page S Time required for equality
search
34
I/O Cost of Operations
B The number of data pages R Number of
records per page F Fanout of B-Tree S Time
required for equality search Dont Use Index
Heap File Sorted File Clustered Tree Unclustered Tree Hash Index
Scan all records B B 1.5 B B B
Get all in sort order 4B B 1.5 B 4B 4B
Equality Search 0.5 B log2 B logF (1.5 B) logF (.15 B) 1 2
Range Search B S matching pages S matching pages S matching records B
Insert 2 S B S 1 S 2 4
Delete 0.5B 1 S B 0.5B 1 S 2 S 2
35
Index Selection Guidelines
  • Attributes in WHERE clause are candidates for
    index keys.
  • Exact match condition suggests hash index.
  • Range query suggests tree index.
  • Clustering is especially useful for range
    queries can also help on equality queries if
    there are many duplicates.
  • Multi-attribute search keys should be considered
    when a WHERE clause contains several conditions.
  • Order of attributes is important for range
    queries.
  • Such indexes sometimes enable index-only
    strategies
  • For index-only strategies, clustering is not
    important!
  • Choose indexes that benefit as many queries as
    possible.
  • Since only one index can be clustered per table,
    choose it based on important queries that would
    benefit the most from clustering.

36
B Tree The Most Widely Used Index
  • Supports equality and range-searches efficiently.
  • Insert/delete at log F N cost keep tree
    height-balanced. (F fanout, N leaf pages)
  • Minimum 50 occupancy (except for root). Each
    node contains d lt m lt 2d entries. The
    parameter d is called the order of the tree.

Index Entries
(Direct search)
Data Entries
("Sequence set")
37
Example B Tree
  • Search begins at root, and key comparisons direct
    it to a leaf (as in ISAM).
  • Search for 5, 15, all data entries gt 24 ...
  • Based on the search for 15, we know it is not
    in the tree!

38
B Trees in Practice
  • Typical order 100. Typical fill-factor 67.
  • average fanout 133
  • Typical capacities
  • Height 4 1334 312,900,700 records
  • Height 3 1333 2,352,637 records
  • Can often hold top levels in buffer pool
  • Level 1 1 page 8 Kbytes
  • Level 2 133 pages 1 Mbyte
  • Level 3 17,689 pages 133 MBytes

39
Inserting a Data Entry into a B Tree
  • Find correct leaf L.
  • Put data entry onto L.
  • If L has enough space, done!
  • Else, must split L (into L and a new node L2)
  • Redistribute entries evenly, copy up middle key.
  • Insert index entry pointing to L2 into parent of
    L.
  • This can happen recursively
  • To split index node, redistribute entries evenly,
    but push up middle key. (Contrast with leaf
    splits.)
  • Splits grow tree root split increases height.
  • Tree growth gets wider or one level taller at
    top.

40
Example B Tree - Inserting 8
Root
24
30
17
13
39
3
5
19
20
22
24
27
38
2
7
14
16
29
33
34
41
Example B Tree - Inserting 8
Root
17
24
30
13
5
2
3
39
19
20
22
24
27
29
33
34
38
7
5
8
14
16
  • Notice that root was split, leading to increase
    in height.
  • In this example, we can avoid split by
    re-distributing entries however,
    this is usually not done in practice.

42
Inserting 8 into Example B Tree
Entry to be inserted in parent node.
  • Observe how minimum occupancy is guaranteed in
    both leaf and index pg splits.
  • Note difference between copy-up and push-up be
    sure you understand the reasons for this.

(Note that 5 is
s copied up and

5
continues to appear in the leaf.)
3
5
2
7
8
Entry to be inserted in parent node.
(Note that 17 is pushed up and only
17
appears once in the index. Contrast

this with a leaf split.)
5
24
30
13
43
Deleting a Data Entry from a B Tree
  • Start at root, find leaf L where entry belongs.
  • Remove the entry.
  • If L is at least half-full, done!
  • If L has only d-1 entries,
  • Try to re-distribute, borrowing from sibling
    (adjacent node with same parent as L).
  • If re-distribution fails, merge L and sibling.
  • If merge occurred, must delete entry (pointing to
    L or sibling) from parent of L.
  • Merge could propagate to root, decreasing height.

44
B-Tree Demo
45
Example Tree (including 8) Delete 19 and 20
...
Root
17
24
30
13
5
39
2
3
19
20
22
24
27
38
7
5
8
29
33
34
14
16
  • Deleting 19 is easy.

46
Example Tree (including 8) Delete 19 and 20
...
Root
17
27
30
13
5
2
3
39
33
34
38
7
5
8
22
24
27
29
14
16
  • Deleting 19 is easy.
  • Deleting 20 is done with re-distribution. Notice
    how middle key is copied up.

47
... And Then Deleting 24
  • Must merge.
  • Observe toss of index entry (on right), and
    pull down of index entry (below).

30
39
22
27
38
29
33
34
Root
13
5
30
17
3
39
2
7
22
38
5
8
27
29
33
34
14
16
48
Summary
  • Alternative file organizations, tradeoffs for
    each
  • If selection queries are frequent, sorting the
    file or building an index is important.
  • Hash-based indexes only good for equality search.
  • Sorted files and tree-based indexes best for
    range search also good for equality search.
    (Files rarely kept sorted in practice B tree
    index is better.)
  • Index is a collection of data entries plus a way
    to quickly find entries with given key values.

49
Summary (Contd.)
  • Data entries can be actual data records, ltkey,
    ridgt pairs, or ltkey, rid-listgt pairs.
  • Choice orthogonal to indexing technique used to
    locate data entries with a given key value.
  • Can have several indexes on a given file of data
    records, each with a different search key.
  • Indexes can be
  • clustered, unclustered
  • B-tree, hash table, etc.

50
Summary (Contd.)
  • Understanding the nature of the workload for the
    application, and the performance goals, is
    essential to developing a good design.
  • What are the important queries and updates? What
    attributes/relations are involved?
  • Indexes must be chosen to speed up important
    queries (and perhaps some updates!).
  • Index maintenance overhead on updates to key
    fields.
  • Choose indexes that can help many queries, if
    possible.
  • Build indexes to support index-only strategies.
  • Clustering is an important decision only one
    index on a given relation can be clustered!
  • Order of fields in composite index key can be
    important.
Write a Comment
User Comments (0)
About PowerShow.com