Integration Wrapup Indexing and Sorting - PowerPoint PPT Presentation

Provided by: zack4

Learn more at: https://www.seas.upenn.edu

1
Integration Wrap-up; Indexing and Sorting

Zachary G. Ives
University of Pennsylvania
CIS 550: Database & Information Systems
November 17, 2005

Some slide content courtesy of Raghu Ramakrishnan
2
Reminders

You should have a good understanding of how you
will be using the database in your project:
How inverted indices (word -> document tables)
will be useful in answering queries
How you will manage user IDs and preferences
Etc.
Homework 5 due on Tuesday

3
An Alternate Integration Approach: The
Information Manifold (Levy et al.)

When you integrate something, you have some
conceptual model of the integrated domain
Define that as a basic frame of reference,
everything else as a view over it
Local as View
May have overlapping/incomplete sources
Define each source as the subset of a query over
the mediated schema
We can use selection or join predicates to
specify that a source contains a range of values:
ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = "Computers"

4
The Local-as-View Model

The basic model is the following:
Local sources are views over the mediated
schema
Sources have the data; the mediated schema is
virtual
Sources may not have all the data from the domain:
open-world assumption
The system must use the sources (views) to answer
queries over the mediated schema

5
Query Answering

Assumption: conjunctive queries, set semantics
Suppose we have a mediated schema: author(aID,
isbn, year), book(isbn, title, publisher)
Suppose we have the query
q(a, t) :- author(a, i, _), book(i, t, p)
and sources
s1(a, t) ⊆ author(a, i, _), book(i, t, p), t = "123"
s5(a, t, p) ⊆ author(a, i, _), book(i, t, p), p = "SAMS"
We want to compose the query with the source
mappings, but they're in the wrong direction!
Yet everything in s1, s5 is an answer to the
query!
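
The claim that everything in s1 and s5 is an answer to the query can be checked on a toy instance. The following is a minimal sketch: the sample tuples and function names are invented for illustration, and the source definitions are evaluated directly over the mediated relations only to show the containment (in a real local-as-view system the mediated schema is virtual and only the sources hold data).

```python
# Toy instance of the mediated schema; every tuple here is made up.
author = {("a1", "i1", 2001), ("a2", "i2", 1999), ("a3", "i3", 2004)}
book = {("i1", "123", "SAMS"), ("i2", "SQL Basics", "SAMS"), ("i3", "123", "MIT")}


def q():
    """q(a, t) :- author(a, i, _), book(i, t, p)"""
    return {(a, t) for (a, i, _) in author for (i2, t, p) in book if i == i2}


def s1():
    """s1(a, t): the same join, restricted to t = "123"."""
    return {(a, t) for (a, i, _) in author for (i2, t, p) in book
            if i == i2 and t == "123"}


def s5():
    """s5(a, t, p): the same join, restricted to p = "SAMS"."""
    return {(a, t, p) for (a, i, _) in author for (i2, t, p) in book
            if i == i2 and p == "SAMS"}


answers = q()
s5_projected = {(a, t) for (a, t, p) in s5()}

# Each source only contributes tuples that are answers to q.
assert s1() <= answers
assert s5_projected <= answers
```

Because each source is defined as a subset of a view, neither containment says the source is complete; that is the open-world assumption at work.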

6
Answering Queries Using Views

Numerous recently-developed algorithms for these:
Inverse rules [Duschka et al.]
Bucket algorithm [Levy et al.]
MiniCon [Pottinger & Halevy]
Also related: "chase and backchase" [Popa,
Tannen, Deutsch]
Requires conjunctive queries

7
Summary of Data Integration

Local-as-view integration has replaced
global-as-view as the standard
More robust way of defining mediated schemas and
sources
Mediated schema is clearly defined, less likely
to change
Sources can be more accurately described
Methods exist for query reformulation, including
inverse rules
Integration requires standardization on a single
schema
Can be hard to get consensus
Today we have peer-to-peer data integration,
e.g., Piazza [Halevy et al.], Orchestra [Ives et
al.], Hyperion [Miller et al.]
Some other aspects of integration were addressed
in related papers:
Overlap between sources; coverage of data at
sources
Semi-automated creation of mappings and wrappers
Data integration capabilities in commercial
products: BEA's Liquid Data, IBM's WebSphere
Information Integrator, numerous packages from
middleware companies

8
Performance: What Governs It?

Speed of the machine, of course!
But also many software-controlled factors that we
must understand:
Caching and buffer management
How the data is stored: physical layout,
partitioning
Auxiliary structures: indices
Locking and concurrency control (we'll talk about
this later)
Different algorithms for operations: query
execution
Different orderings for execution: query
optimization
Reuse of materialized views, merging of query
subexpressions: answering queries using views,
multi-query optimization

9
Our General Emphasis

Goal: cover basic principles that are applied
throughout database system design
Use the appropriate strategy in the appropriate
place
Every (reasonable) algorithm is good somewhere
And a corollary: database people reinvent a lot
of things and add minor tweaks

10
Storing Tuples in Pages

Tuples
Many possible layouts
Dynamic vs. fixed lengths
Ptrs, lengths vs. slots
Tuples grow down, directories grow up
Identity and relocation
Objects and XML are harder
Horizontal, path, vertical partitioning
Generally no algorithmic way of deciding
Generally want to leave some space for insertions

[Figure: a page holding tuples t1, t2, t3]
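
One common way to realize "tuples grow down, directories grow up" is a slotted page. The sketch below is only an illustration under stated assumptions: the class name `SlottedPage`, the 4-byte cost charged per slot, and the in-memory list standing in for the on-page directory are all hypothetical, not a layout taken from the slides.

```python
class SlottedPage:
    """Toy slotted page: tuple bytes fill the page from the end toward the
    front, while the slot directory grows from the front toward the end."""

    SLOT_BYTES = 4  # assumed space charged per directory entry

    def __init__(self, size=64):
        self.data = bytearray(size)
        self.slots = []        # slot number -> (offset, length), or None if deleted
        self.free_end = size   # tuple data occupies data[free_end:]

    def free_space(self):
        # Bytes left between the end of the directory and the start of the tuples.
        return self.free_end - len(self.slots) * self.SLOT_BYTES

    def insert(self, record):
        """Store record and return its slot number, or None if it does not fit."""
        if len(record) + self.SLOT_BYTES > self.free_space():
            return None
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1

    def read(self, slot):
        entry = self.slots[slot]
        if entry is None:
            return None
        offset, length = entry
        return bytes(self.data[offset:offset + length])

    def delete(self, slot):
        # Only the directory entry is cleared, so other slot numbers stay valid.
        self.slots[slot] = None


page = SlottedPage(size=32)
first = page.insert(b"t1-alice")
second = page.insert(b"t2-bob")
too_big = page.insert(b"t3-carol-long-name")  # does not fit in the remaining space
```

Because a record is addressed by (page, slot number) rather than by byte offset, the page can relocate or compact tuple bytes without changing any record's identity.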
11
Alternatives for Organizing Files

Many alternatives, each ideal for some situation,
and poor for others
Heap files: for full file scans or frequent
updates
Data unordered
Write new data at end
Sorted files: if retrieved in sort order or want
range
Need external sort or an index to keep sorted
Hashed files: if selection on equality
Collection of buckets with primary and overflow
pages
Hashing function over search key attributes

12
Model for Analyzing Access Costs

We ignore CPU costs, for simplicity
p(T): The number of data pages in table T
r(T): Number of records in table T
D: (Average) time to read or write disk page
Measuring number of page I/Os ignores gains of
pre-fetching blocks of pages; thus, I/O cost is
only approximated.
Average-case analysis, based on several
simplistic assumptions.

Good enough to show the overall trends!
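
To make the notation concrete, here is a small sketch of how p(T) and D combine into cost estimates. The three formulas are common textbook approximations chosen for illustration (full scan, average-case equality search on a heap file with a unique key, binary search on a sorted file); they are assumptions of this sketch, and the function names are hypothetical.

```python
import math


def scan_cost(p, D):
    """Full scan: read every one of the p data pages once."""
    return p * D


def heap_equality_cost(p, D):
    """Equality search on a heap file, assuming a unique key that is
    found after reading half the pages on average."""
    return 0.5 * p * D


def sorted_equality_cost(p, D):
    """Equality search on a sorted file by binary search over pages."""
    return D * math.log2(p)


# Example: a table of 1024 pages, with D = 10 (say, milliseconds per page I/O).
p_T, D = 1024, 10
costs = {
    "scan": scan_cost(p_T, D),
    "heap equality": heap_equality_cost(p_T, D),
    "sorted equality": sorted_equality_cost(p_T, D),
}
```

Even with such rough estimates, the gap between a linear and a logarithmic number of page I/Os is what drives the choice of file organization.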

13
Approximate Cost of Operations

No overflow buckets, 80% page occupancy

Several assumptions underlie these (rough)
estimates!

14
Speeding Operations over Data

Recall that we're interested in how to get good
performance in answering queries
The first consideration is how the data is made
accessible to the DBMS
We saw different arrangements of the tables: heap
(unsorted) files, sorted files, and hashed files
Today we look further at 3 core concepts that are
used to efficiently support sort- and hash-based
access to data:
Indexing
Sorting
Hashing

15
Technique I: Indexing

An index on a file speeds up selections on the
search key attributes for the index (trade space
for speed).
Any subset of the fields of a relation can be the
search key for an index on the relation.
Search key is not the same as key (minimal set of
fields that uniquely identify a record in a
relation).
An index contains a collection of data entries,
and supports efficient retrieval of all data
entries k* with a given key value k.
Generally the entries of an index are some form
of node in a tree, but should the index contain
the data, or pointers to the data?

16
Alternatives for Data Entry k* in Index

Three alternatives for where to put the data:
1. Data record wherever key value k appears
Clustered ⇒ fast lookup
Index is large; only 1 can exist
2. <k, rid of data record with search key value k>,
OR
3. <k, list of rids of data records with search key
k>
Can have secondary indices
Smaller index may mean faster lookup
Often not clustered ⇒ more expensive to use
Choice of alternative for data entries is
orthogonal to the indexing technique used to
locate data entries with a given key value k

rid = row id, conceptually a pointer
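
The three kinds of data entry can be contrasted on a tiny in-memory example. This is a sketch under assumptions: the `heap_file` list, the use of list positions as rids, and the dictionaries standing in for index structures are all illustrative choices, not how a DBMS stores an index on disk.

```python
# A tiny "heap file"; a record's position in the list plays the role of its rid.
heap_file = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 25},
    {"name": "Cy", "age": 31},
]

# Data entry is the record itself: the index holds the data.
entries_with_records = {}
for record in heap_file:
    entries_with_records.setdefault(record["age"], []).append(record)

# Data entry is one <k, rid> pair per record.
entries_k_rid = sorted((record["age"], rid) for rid, record in enumerate(heap_file))

# Data entry is <k, list of rids>: one entry per distinct key value.
entries_k_ridlist = {}
for rid, record in enumerate(heap_file):
    entries_k_ridlist.setdefault(record["age"], []).append(rid)


def lookup_via_rids(k):
    """Follow each rid for key k back into the heap file."""
    return [heap_file[rid] for rid in entries_k_ridlist.get(k, [])]
```

With rids, the index stays small and several indices can point into the same file, at the price of an extra hop from data entry to record.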
17
Classes of Indices

Primary vs. secondary: primary has the primary
key
Most DBMSs automatically generate a primary index
when you define a primary key
Clustered vs. unclustered: order of records and
index are approximately the same
Alternative 1 implies clustered, but not
vice-versa
A file can be clustered on at most one search key
Dense vs. sparse: dense has an index entry per data
value; sparse may skip some
Alternative 1 always leads to dense index. Why?
Every sparse index is clustered!
Sparse indexes are smaller; however, some useful
optimizations are based on dense indexes
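
The dense/sparse distinction, and the reason a sparse index needs a file sorted on the search key, can be sketched as follows. The page contents and names here are invented for illustration; the sparse index keeps only the first key of each page.

```python
from bisect import bisect_right

# A file sorted on the search key, split into pages (contents are made up).
pages = [[2, 3, 5, 7], [14, 16, 19, 20], [22, 24, 27, 29]]

# Dense index: one entry per key value, pointing at (page, slot).
dense = {key: (p, s) for p, page in enumerate(pages) for s, key in enumerate(page)}

# Sparse index: one entry per page (its first key).
sparse = [page[0] for page in pages]


def sparse_lookup(key):
    """Find the last page whose first key is <= key, then search inside it."""
    p = bisect_right(sparse, key) - 1
    if p < 0:
        return None  # key is smaller than everything in the file
    page = pages[p]
    return (p, page.index(key)) if key in page else None
```

The sparse lookup only works because every key between two consecutive index entries lives on the earlier entry's page, which is exactly the clustered property.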

18
Clustered vs. Unclustered Index

Suppose Index Alternative (2) used, with pointers
to records stored in a heap file
Perhaps initially sort data file, leave some gaps
Inserts may require overflow pages
Consider how these strategies affect disk caching
and access

[Figure: CLUSTERED vs. UNCLUSTERED index. In both, index entries direct the search to data entries (index file), which point to data records (data file).]
19
B+ Tree: The DB World's Favorite Index

Insert/delete at log_F N cost
(F = fanout, N = # leaf pages)
Keep tree height-balanced
Minimum 50% occupancy (except for root).
Each node contains d <= m <= 2d entries. d is
called the order of the tree.
Supports equality and range searches efficiently.

[Figure: index entries (direct search) above data entries (the "sequence set")]
20
Example B+ Tree

Search begins at root, and key comparisons direct
it to a leaf.
Search for 5, 15, all data entries >= 24 ...

[Figure: example B+ tree]
Root: 13 | 17 | 24 | 30
Leaves: [2, 3, 5, 7] [14, 16] [19, 20, 22] [24, 27, 29] [33, 34, 38, 39]

Based on the search for 15, we know it is not
in the tree!
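
The three searches mentioned on this slide can be traced with a short sketch. It assumes a two-level tree with root keys 13, 17, 24, 30 over five leaves, which is how the example figure appears to be laid out; the function names are hypothetical, and a real B+ tree would follow page pointers rather than list indices.

```python
from bisect import bisect_right

# Assumed layout of the example tree: one root node over five leaves.
root_keys = [13, 17, 24, 30]
leaves = [[2, 3, 5, 7], [14, 16], [19, 20, 22], [24, 27, 29], [33, 34, 38, 39]]


def find_leaf(key):
    """Key comparisons at the root pick the child: child i holds keys k
    with root_keys[i-1] <= k < root_keys[i]."""
    return bisect_right(root_keys, key)


def contains(key):
    """Equality search: descend to one leaf and look only there."""
    return key in leaves[find_leaf(key)]


def range_from(low):
    """Range search: descend to the leaf for `low`, then walk the leaf
    sequence to the right."""
    result = []
    for leaf in leaves[find_leaf(low):]:
        result.extend(k for k in leaf if k >= low)
    return result
```

The search for 15 ends in the leaf holding 14 and 16, so its absence there is enough to conclude it is not in the tree.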

21
B+ Trees in Practice

Typical order: 100. Typical fill-factor: 67%.
Average fanout = 133
Typical capacities:
Height 4: 133^4 = 312,900,721 records
Height 3: 133^3 = 2,352,637 records
Can often hold top levels of tree in buffer pool:
Level 1 = 1 page = 8 KB
Level 2 = 133 pages ≈ 1 MB
Level 3 = 17,689 pages ≈ 133 MB
Level 4 = 2,352,637 pages ≈ 18 GB
Nearly O(1) access time to data for equality
or range queries!
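
The capacity figures follow directly from powers of the fanout, as this small sketch shows. It takes the fanout of 133 and the 8 KB page size from the slide; the function names are illustrative.

```python
FANOUT = 133   # average fanout from the slide
PAGE_KB = 8    # page size from the slide


def records_at_height(height):
    """Number of entries reachable through `height` levels of fanout."""
    return FANOUT ** height


def pages_at_level(level):
    """Level 1 is the root (1 page); each level multiplies by the fanout."""
    return FANOUT ** (level - 1)


def level_size_kb(level):
    return pages_at_level(level) * PAGE_KB


for level in range(1, 5):
    print(f"Level {level}: {pages_at_level(level):,} pages, "
          f"{level_size_kb(level):,} KB")
```

The sizes on the slide are rounded (133 pages of 8 KB is 1,064 KB, treated as about 1 MB), which is why the top two or three levels are plausible to keep in the buffer pool.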

22
Inserting Data into a B+ Tree

Find correct leaf L.
Put data entry onto L.
If L has enough space, done!
Else, must split L (into L and a new node L2)
Redistribute entries evenly, copy up middle key.
Insert index entry pointing to L2 into parent of
L.
This can happen recursively
To split index node, redistribute entries evenly,
but push up middle key. (Contrast with leaf
splits.)
Splits "grow" the tree; a root split increases height.
Tree growth: gets wider or one level taller at
top.
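
The difference between copying up and pushing up is easiest to see on the key lists alone. The following is a minimal sketch that ignores child pointers and page storage; the function names are hypothetical, and the sample keys assume a full leaf [2, 3, 5, 7] receiving 8 under a full root [13, 17, 24, 30].

```python
def split_leaf(entries):
    """Split an overfull leaf evenly. The middle key is COPIED up:
    it is returned for the parent and also stays in the right leaf."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right[0], right


def split_index(keys):
    """Split an overfull index node evenly. The middle key is PUSHED up:
    it moves to the parent and appears in neither half."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]


# Inserting 8 overflows the leaf, so it splits and 5 is copied up.
left_leaf, copied_up, right_leaf = split_leaf(sorted([2, 3, 5, 7] + [8]))

# Adding 5 to the full root overflows it, so it splits and 17 is pushed up.
left_node, pushed_up, right_node = split_index(sorted([13, 17, 24, 30] + [copied_up]))
```

A copied-up key must remain in the leaf because leaves hold the actual data entries, whereas a key in an index node only routes searches and so needs to appear once.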

23
Inserting 8 Example: Copy Up

[Figure: the example tree with root 13 | 17 | 24 | 30. We want to insert 8 into leaf [2, 3, 5, 7]; there is no room, so split and copy up. The leaf becomes [2, 3] and [5, 7, 8].]
Entry to be inserted in parent node: 5.
(Note that 5 is copied up and
continues to appear in the leaf.)
24
Inserting 8 Example: Push Up

Need to split node and push up
[Figure: with 5 added, the root would hold 5 | 13 | 17 | 24 | 30, which is too many entries. It splits into [5, 13] and [24, 30], with 17 above them.]
Entry to be inserted in parent node: 17.
(Note that 17 is pushed up and only appears once
in the index. Contrast this with a leaf split.)
25
Deleting Data from a B+ Tree

Start at root, find leaf L where entry belongs.
Remove the entry.
If L is at least half-full, done!
If L has only d-1 entries,
Try to re-distribute, borrowing from sibling
(adjacent node with same parent as L).
If re-distribution fails, merge L and sibling.
If merge occurred, must delete entry (pointing to
L or sibling) from parent of L.
Merge could propagate to root, decreasing height.
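
The choice between redistribution and merging for an underfull leaf can be sketched on plain key lists. This is an illustration under assumptions: the function name is hypothetical, the sibling is taken to be the adjacent right sibling, and updating the parent's separator key is left out.

```python
def fix_underfull_leaf(leaf, right_sibling, d):
    """Repair a leaf that dropped to d-1 entries.

    Returns (action, new_leaf, new_sibling). Redistribution is possible
    only if the sibling has more than the minimum of d entries.
    """
    combined = leaf + right_sibling  # already sorted: sibling keys are larger
    if len(right_sibling) > d:
        mid = len(combined) // 2
        return "redistribute", combined[:mid], combined[mid:]
    # Sibling is at the minimum: merge into one node of at most 2d entries.
    return "merge", combined, None


# Order d = 2: the leaf [22] is underfull in both cases.
borrowed = fix_underfull_leaf([22], [24, 27, 29], d=2)
merged = fix_underfull_leaf([22], [27, 29], d=2)
```

After a merge, the parent loses the entry that separated the two leaves, which is how an underflow can propagate upward and eventually shrink the tree's height.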

26
B+ Tree Summary

B+ tree and other indices ideal for range
searches, good for equality searches.
Inserts/deletes leave tree height-balanced; log_F
N cost.
High fanout (F) means depth rarely more than 3 or
4.
Almost always better than maintaining a sorted
file.
Typically, 67% occupancy on average.
Note: Order (d) concept replaced by physical
space criterion in practice ("at least
half-full").
Records may be variable sized
Index pages typically hold more entries than
leaves

27
There are Many Other Kinds of Indices

Other value indices
Bitmap indices (a bit indicates a value)
Multidimensional indices
R-trees, kD-trees, ...
Text indices
Inverted indices (as you're defining in your
project)
Structural indices
Object indices: access support relations, path
indices
XML and graph indices: dataguides, 1-indices,
d(k) indices
These describe connectivity between nodes or
objects

28
Speeding Operations over Data

Three general data organization techniques
Indexing
Sorting
Hashing

29
Technique II: Sorting

Pass 1: Read a page, sort it, write it
Can use a single page to do this!
Pass 2, 3, ..., etc.:
Requires a minimum of 3 pages

[Figure: pages are read from disk into main memory buffers INPUT 1 and INPUT 2, merged into OUTPUT, and written back to disk]
30
Two-Way External Merge Sort

Divide and conquer: sort into subfiles and merge
Each pass, we read and write every page
If N pages in the file, we need ⌈log2(N)⌉ + 1
passes to sort the data, yielding a cost of
2N(⌈log2(N)⌉ + 1)

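[Figure: two-way external merge sort of a 7-page file; "|" separates runs]
Input file: 3,4  6,2  9,4  8,7  5,6  3,1  2
PASS 0 (1-page runs): 3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs): 2,3  4,6 | 4,7  8,9 | 1,3  5,6 | 2
PASS 2 (4-page runs): 2,3  4,4  6,7  8,9 | 1,2  3,5  6
PASS 3 (8-page runs): 1,2  2,3  3,4  4,5  6,6  7,8  9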
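
The pass structure and the cost formula can be checked with a small in-memory simulation. This is only a sketch: lists stand in for disk pages and runs, the function names are hypothetical, and a real implementation would stream pages through the three buffers rather than hold whole runs in memory.

```python
import math
from heapq import merge


def two_way_external_sort(pages):
    """Simulate two-way external merge sort over a non-empty list of pages.

    Returns (sorted_records, number_of_passes).
    """
    # Pass 0: sort each page on its own, producing 1-page runs.
    runs = [sorted(page) for page in pages]
    passes = 1
    # Each later pass merges runs pairwise, halving the number of runs.
    while len(runs) > 1:
        runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        passes += 1
    return runs[0], passes


def io_cost(n_pages):
    """2N(ceil(log2 N) + 1): every pass reads and writes all N pages."""
    return 2 * n_pages * (math.ceil(math.log2(n_pages)) + 1)


# The 7-page input file used in the example, two records per page.
input_file = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
records, passes = two_way_external_sort(input_file)
```

For this 7-page file the simulation takes 4 passes, matching ⌈log2(7)⌉ + 1, so the formula gives 2 × 7 × 4 = 56 page I/Os.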
31
General External Merge Sort