BTrees and Sorting - PowerPoint PPT Presentation

1 / 29

About This Presentation

Title:

BTrees and Sorting

Description:

... replaced by physical space criterion in practice ('at least half ... if comparison of attribute that's indexed, look up matches in index & return those ... – PowerPoint PPT presentation

Number of Views:32

Avg rating:3.0/5.0

Slides: 30

Provided by: zack4

Category:

Tags: btrees | criteria | matching | of | sorting

more less

Transcript and Presenter's Notes

Title: BTrees and Sorting

1
B-Trees and Sorting

Zachary G. Ives
University of Pennsylvania
CIS 550 Database Information Systems
November 26, 2009

2
Speeding Operations over Data

Three general data organization techniques
Indexing
Sorting
Hashing

3
B Tree The DB Worlds Favorite Index

Insert/delete at log F N cost
(F fanout, N leaf pages)
Keep tree height-balanced
Minimum 50 occupancy (except for root).
Each node contains d lt m lt 2d entries. d is
called the order of the tree.
Supports equality and range searches efficiently.

Index Entries
(Direct search)
Data Entries
("Sequence set")
4
Example B Tree

Search begins at root, and key comparisons direct
it to a leaf.
Search for 5, 15, all data entries gt 24 ...

Root
30
17
24
13
39
3
5
19
20
22
24
27
38
2
7
14
16
29
33
34

Based on the search for 15, we know it is not
in the tree!

5
B Trees in Practice

Typical order 100. Typical fill-factor 67.
average fanout 133
Typical capacities
Height 4 1334 312,900,700 records
Height 3 1333 2,352,637 records
Can often hold top levels in buffer pool
Level 1 1 page 8 Kbytes
Level 2 133 pages 1 Mbyte
Level 3 17,689 pages 133 MBytes

6
Inserting Data into a B Tree

Find correct leaf L.
Put data entry onto L.
If L has enough space, done!
Else, must split L (into L and a new node L2)
Redistribute entries evenly, copy up middle key.
Insert index entry pointing to L2 into parent of
L.
This can happen recursively
To split index node, redistribute entries evenly,
but push up middle key. (Contrast with leaf
splits.)
Splits grow tree root split increases height.
Tree growth gets wider or one level taller at
top.

7
Inserting 8 into Example B Tree

Observe how minimum occupancy is guaranteed in
both leaf and index pg splits.
Recall that all data items are in leaves, and
partition values for keys are in intermediate
nodes
Note difference between copy-up and push-up.

8
Inserting 8 Example Copy up
Root
24
30
17
13
39
3
5
19
20
22
24
27
38
2
7
14
16
29
33
34
Want to insert here no room, so split copy up
8
Entry to be inserted in parent node.
(Note that 5 is copied up and
5
continues to appear in the leaf.)
3
5
2
7
8
9
Inserting 8 Example Push up
Need to split node push up
Root
24
30
17
13
5
39
3
19
20
22
24
27
38
2
14
16
29
33
34
5
7
8
Entry to be inserted in parent node.
(Note that 17 is pushed up and only appears once
in the index. Contrast this with a leaf split.)
17
5
24
30
13
10
Deleting Data from a B Tree

Start at root, find leaf L where entry belongs.
Remove the entry.
If L is at least half-full, done!
If L has only d-1 entries,
Try to re-distribute, borrowing from sibling
(adjacent node with same parent as L).
If re-distribution fails, merge L and sibling.
If merge occurred, must delete entry (pointing to
L or sibling) from parent of L.
Merge could propagate to root, decreasing height.

11
B Tree Summary

B tree and other indices ideal for range
searches, good for equality searches.
Inserts/deletes leave tree height-balanced logF
N cost.
High fanout (F) means depth rarely more than 3 or
4.
Almost always better than maintaining a sorted
file.
Typically, 67 occupancy on average.
Note Order (d) concept replaced by physical
space criterion in practice (at least
half-full).
Records may be variable sized
Index pages typically hold more entries than
leaves

12
Other Kinds of Indices

Multidimensional indices
R-trees, kD-trees,
Text indices
Inverted indices
Structural indices
Object indices access support relations, path
indices
XML and graph indices dataguides, 1-indices,
d(k) indices
Describe parent-child, path relationships

13
Speeding Operations over Data

Three general data organization techniques
Indexing
Sorting
Hashing

14
Technique II Sorting

Pass 1 Read a page, sort it, write it
Can use a single page to do this!
Pass 2, 3, , etc.
Requires a minimum of 3 pages

INPUT 1
OUTPUT
INPUT 2
Disk
Disk
Main memory buffers
15
Two-Way External Merge Sort

Divide and conquer sort into subfiles and merge
Each pass we read write every page
If N pages in the file, we need dlog2(N)e 1
passes to sort the data, yielding a cost of
2Ndlog2(N)e 1

Input file
3,4
6,2
9,4
8,7
5,6
3,1
2
PASS 0
1-page runs
1,3
2
3,4
5,6
2,6
4,9
7,8
PASS 1
4,7
1,3
2,3
2-page runs
8,9
5,6
2
4,6
PASS 2
2,3
4,4
1,2
4-page runs
6,7
3,5
6
8,9
PASS 3
1,2
2,3
3,4
8-page runs
4,5
6,6
7,8
9
16
General External Merge Sort

How can we utilize more than 3 buffer pages?

To sort a file with N pages using B buffer pages
Pass 0 use B buffer pages. Produce dN / Be
sorted runs of B pages each
Pass 2, , etc. merge B-1 runs

INPUT 1
. . .
. . .
INPUT 2
. . .
OUTPUT
INPUT B-1
Disk
Disk
B Main memory buffers
17
Cost of External Merge Sort

Number of passes 1dlogB-1 dN / Bee
Cost 2N ( of passes)
With 5 buffer pages, to sort 108 page file
Pass 0 d108/5e 22 sorted runs of 5 pages each
(last run is only 3 pages)
Pass 1 d22/4e 6 sorted runs of 20 pages each
(final run only uses 8 pages)
Pass 2 d6/4e 2 sorted runs, 80 pages and 28
pages
Pass 3 Sorted file of 108 pages

18
Speeding Operations over Data

Three general data organization techniques
Indexing
Sorting
Hashing

19
Technique 3 Hashing

A familiar idea, which we just saw for hash
files
Requires good hash function (may depend on
data)
Distribute data across buckets
Often multiple items in same bucket (buckets
might overflow)
Hash indices can be built along the same lines as
what we discussed
The difference they may be unclustered as well
as clustered
Types
Static
Extendible (requires directory to buckets can
split)
Linear (two levels, rotate through split bad
with skew)
We wont get into detail because of time, but see
text

20
Making Use of the Data IndicesQuery Execution

Query plans exec strategies
Basic principles
Standard relational operators
Querying XML

21
Making Use of the Data IndicesQuery Execution

Query plans exec strategies
Basic principles
Standard relational operators
Querying XML

22
Query Plans

Data-flow graph of relational algebra operators
Typically determined by optimizer

JoinPressRel.Symbol EastCoast.CoSymbol
JoinPressRel.Symbol Clients.Symbol
ProjectCoSymbol
SelectClient Atkins
SELECT FROM PressRel p, Clients C WHERE
p.Symbol c.Symbol AND c.Client Atkins
AND c.Symbol IN (SELECT CoSymbol FROM EastCoast)
Scan PressRel
ScanEastCoast
Scan Clients
23
Iterator-Based Query Execution

Execution begins at root
open, next, close
Propagate calls to children
May call multiple child nexts
Efficient scheduling resource usage
Can you think of alternatives and their benefits?

JoinPressRel.Symbol EastCoast.CoSymbol
JoinPressRel.Symbol Clients.Symbol
ProjectCoSymbol
SelectClient Atkins
Scan PressRel
Scan Clients
ScanEastCoast
24
Execution Strategy Issues

Granularity parallelism
Pipelining vs. blocking
Materialization

JoinPressRel.Symbol EastCoast.CoSymbol
JoinPressRel.Symbol Clients.Symbol
ProjectCoSymbol
SelectClient Atkins
Scan PressRel
ScanEastCoast
Scan Clients
25
Basic Principles

Many DB operations require reading tuples, tuple
vs. previous tuples, or tuples vs. tuples in
another table
Techniques generally used
Iteration for/while loop comparing with all
tuples on disk
Index if comparison of attribute thats
indexed, look up matches in index return those
Sort/merge iteration against presorted data
(interesting orders)
Hash build hash table of the tuple list, probe
the hash table
Must be able to support larger-than-memory data

26
Basic Operators

One-pass operators
Scan
Select
Project
Multi-pass operators
Join
Various implementations
Handling of larger-than-memory sources
Semi-join
Aggregation, union, etc.

27
1-Pass Operators Scanning a Table

Sequential scan read through blocks of table
Index scan retrieve tuples in index order
May require 1 seek per tuple! When?
Cost in page reads b(T) blocks, r(T) tuples
b(T) pages for sequential scan
Up to r(T) for index scan if unclustered index
Requires memory for one block

28
1-Pass Operators Select (s)

Typically done while scanning a file
If unsorted no index, check against predicate
Read tuple
While tuple doesnt meet predicate
Read tuple
Return tuple
Sorted data can stop after particular value
encountered
Indexed data apply predicate to index, if
possible
If predicate is
conjunction may use indexes and/or scanning loop
above (may need to sort/hash to compute
intersection)
disjunction may use union of index results, or
scanning loop