Title: Overview of Relational Database Systems
1. Overview of Relational Database Systems
- From Physical Storage to Query Optimization
2. The Relational Data Model
- Table
  - attribute: a column of a table
  - tuple: a row of a table
  - superkey, key, primary key, candidate key
3. Two Example Relations
Employee
Department
4. The Core Relational Algebra
- Selection (σ)
- Projection (π)
- Cartesian Product (×)
- Union (∪)
- Set Difference (−)
- Set Intersection (∩)
- Join (⋈_c)
- Division (÷)
5. SQL
select Employee.Name, Department.Phone
from Employee, Department
where Employee.Dept = 'Sales' and Employee.Dept = Department.Name

The equivalent relational algebra expression is
π_{Employee.Name, Department.Phone} (σ_{Employee.Dept='Sales' ∧ Employee.Dept=Department.Name} (Employee × Department))
SQL is more powerful than relational algebra
because it can do many things, such as computing
aggregates for different groups of tuples, that
relational algebra cannot do.
6. Query Processing: An Overview
select * from R, S, T
where R.A > a and R.B = S.B and S.C = T.C

A possible execution plan for this query:
1. Perform the selection σ_{A>a}(R) by a sequential scan of the tuples of R. Let R1 denote the result of σ_{A>a}(R).
2. Perform the join R1 ⋈_{R1.B=S.B} S using the sort merge join algorithm. Let R2 denote the result of the join.
3. Perform the join R2 ⋈_{R2.C=T.C} T using the nested loop join algorithm.
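For concreteness, here is a tiny, purely illustrative Python sketch of this plan over relations represented as lists of dicts; the names R, S, T, A, B, C are just the placeholders from the query. Both join steps are written as naive in-memory loops to show the data flow, while the real plan would use the sort merge and nested loop algorithms described on later slides.

# A minimal sketch of the plan above; not how a real engine executes it.
def run_plan(R, S, T, a):
    # Step 1: selection sigma_{A > a}(R) by a sequential scan of R.
    R1 = [r for r in R if r["A"] > a]
    # Step 2: join R1 with S on B (sort merge join in the real plan).
    R2 = [{**r, **s} for r in R1 for s in S if r["B"] == s["B"]]
    # Step 3: join R2 with T on C (nested loop join in the real plan).
    return [{**r, **t} for r in R2 for t in T if r["C"] == t["C"]]

print(run_plan([{"A": 9, "B": 1}], [{"B": 1, "C": 2}], [{"C": 2, "D": "x"}], a=5))
# -> [{'A': 9, 'B': 1, 'C': 2, 'D': 'x'}]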
7. Query Optimization: An Overview
- The goal of query optimization is to find an execution plan that can be evaluated with minimum cost, i.e., an optimal plan.
- In a centralized database, the query evaluation cost = I/O cost (dominant) + CPU cost.
- The I/O cost can be cut down by employing B+ trees, hashing, etc.
- The # of equivalent execution plans for a given query ≈ (m!)·k^m, where the query has m operations and each operation can be evaluated in k different ways (see the small example after this list).
- The search space for query optimization = the set of all equivalent execution plans.
- We don't want to spend too much time finding a plan, since query processing time = time to find a plan + time to execute the plan.
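As a small worked example of how quickly this grows: a query with only m = 3 operations, each evaluable in k = 2 ways, already has about (3!)·2³ = 6·8 = 48 equivalent plans, and at m = 6 and k = 3 the estimate is (6!)·3⁶ = 720·729 = 524,880 plans.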
8. Two-Level Database Architecture
- Main memory
  - fast access (about 0.01 µs per instruction for a 100 MIPS CPU)
  - small capacity
  - volatile
  - expensive
- Secondary storage (disk)
  - slow access (about 10 ms per disk block)
  - large capacity
  - nonvolatile
  - cheap
9. Secondary Storage
- Typical disk drives
  - spindle
  - actuator
  - read/write heads
  - tracks
  - cylinders
  - pages (sectors / blocks)
  - page address = (cylinder, track, page)
- The cost of transferring one page between memory and disk is the sum of
  - seek time (to reach the track)
  - rotational delay / latency (to reach the page)
  - block transfer time = rotation time × (page size / track size) (why?)
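To make the formula concrete, here is a small Python sketch of the page transfer cost; the drive parameters are illustrative assumptions (a 9 ms average seek, a 7200 RPM disk, i.e., about 8.33 ms per rotation, 8 KB pages, 512 KB tracks), not figures from the slides.

# Cost of one random page I/O = seek + average rotational delay + block transfer.
def page_transfer_ms(seek_ms=9.0, rotation_ms=8.33, page_kb=8, track_kb=512):
    rotational_delay = rotation_ms / 2                   # on average, half a rotation
    block_transfer = rotation_ms * (page_kb / track_kb)  # fraction of one full rotation
    return seek_ms + rotational_delay + block_transfer

print(page_transfer_ms())   # about 13.3 ms per random page I/O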
10. Storage Hierarchy
- A page I/O is sequential if transferring the data incurs neither seek time nor rotational delay; otherwise, the page I/O is random. So try to reduce the # of random page I/Os!
- Other architectures include
  - One-level architecture: all data in a main memory database system are kept in main memory.
  - Three-level architecture: tertiary storage such as optical disks or tape drives is needed for applications such as EOS/DIS (Earth Observation System / Data Information System), which stores as many as 10 petabytes (1 petabyte = 1 million GB) of data.
11. B+ Tree
- Each node is stored as a page
- It is a balanced tree
- Every nonroot internal node is at least half full
- The leaf node pointers provide a way to access the tuples in an ordered manner and are useful for evaluating range conditions
- Acceptable overhead on insertion, deletion, and space
- B+-tree indices
- B+-tree file organization
- B-tree indices
Exercise: Read the algorithms for B/B+-tree search, insertion, and deletion by yourself from a data organization/database book.
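As a companion to the exercise, here is a minimal, purely illustrative Python sketch of B+-tree search only (insertion and deletion are left to the suggested reading); each node is faked as an in-memory dict rather than a disk page.

from bisect import bisect_right

# Internal nodes hold sorted "keys" plus "children"; leaves hold sorted "keys"
# plus parallel "tuples" (standing in for tuple pointers).
def bplus_search(node, key):
    while "children" in node:                    # descend through internal nodes
        i = bisect_right(node["keys"], key)      # child whose range may contain key
        node = node["children"][i]
    i = bisect_right(node["keys"], key) - 1      # node is now a leaf
    if i >= 0 and node["keys"][i] == key:
        return node["tuples"][i]
    return None

leaf1 = {"keys": [3, 7], "tuples": ["t3", "t7"]}
leaf2 = {"keys": [10, 15], "tuples": ["t10", "t15"]}
root = {"keys": [10], "children": [leaf1, leaf2]}
print(bplus_search(root, 15))   # -> 't15'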
12. Hashing
- The idea is to build a hash table that contains an index entry for each tuple in the relation and to use a hash function h( ) to quickly identify each entry in the hash table.
- The hash table consists of many buckets (identified by their bucket numbers), each of which corresponds to one or more disk pages.
- Each bucket consists of a number of index entries of the form (a, P), where a is the attribute value of some tuple and P is a tuple pointer, pointing to the tuple on disk.
Exercise: Read the implementation techniques for static hashing and dynamic hashing from a data organization/database book.
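A minimal Python sketch of the idea, assuming static hashing with a fixed number of buckets and Python's built-in hash() playing the role of h( ); tuple pointers are faked as (page, slot) pairs since there is no disk here.

NUM_BUCKETS = 8

def build_hash_index(entries):
    # entries: iterable of (attribute_value, tuple_pointer) pairs
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for a, p in entries:
        buckets[hash(a) % NUM_BUCKETS].append((a, p))
    return buckets

def lookup(buckets, a):
    # return the tuple pointers of all index entries whose value equals a
    return [p for (v, p) in buckets[hash(a) % NUM_BUCKETS] if v == a]

idx = build_hash_index([("Sales", (1, 0)), ("R&D", (1, 1)), ("Sales", (2, 3))])
print(lookup(idx, "Sales"))   # -> [(1, 0), (2, 3)]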
13. Reminder!!!
B+-trees and hashing are the most fundamental techniques for organizing and retrieving information efficiently. Make sure you understand their details.
14. Evaluating Join
Among the three most frequently used relational operations (i.e., selection, projection, join), join is the most expensive operation.
Consider R ⋈_{R.A=S.B} S. Let n and m be the numbers of tuples in R and S, respectively, and let N and M be the sizes of R and S in pages, respectively. We also assume M ≤ N.
15. Nested Loop Join Algorithm
for each tuple x in R            (R is the outer relation)
  for each tuple y in S          (S is the inner relation)
    if x.A = y.B then output (x, y)

The CPU cost is always O(nm).
To optimize the I/O cost → minimize the # of I/O pages.
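A minimal in-memory Python sketch of the algorithm above, assuming R and S are lists of dicts joined on R.A = S.B; it models only the CPU-side loops, not the page I/O analyzed on the next slide.

def nested_loop_join(R, S):
    result = []
    for x in R:                         # outer relation
        for y in S:                     # inner relation
            if x["A"] == y["B"]:
                result.append((x, y))
    return result

print(nested_loop_join([{"A": 1}, {"A": 2}], [{"B": 2}, {"B": 3}]))
# -> [({'A': 2}, {'B': 2})]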
16. Nested Loop I/O Cost
for each K1 pages P of R
  for each K2 pages Q of S
    for each tuple x in P
      for each tuple y in Q
        if x.A = y.B then output (x, y)

Assume K1 + K2 = K buffer pages are available, with K1 ≤ N and K2 ≤ M. Assume a rocking scan is used to read the pages of S.

With R as the outer relation, the I/O cost is N + M + (⌈N/K1⌉ - 1)(M - K2)   (A)
which is minimized when K1 = min{N, K-1}.

When S is the outer relation, the I/O cost is M + N + (⌈M/K2⌉ - 1)(N - K1)   (B)
which is minimized when K2 = min{M, K-1}.
17. Nested Loop I/O Cost (Cont.)
Min(A) is more likely to be greater than min(B); i.e., to minimize the # of I/O pages, use the smaller relation as the outer relation and let it use as many buffer pages as it needs (min{M, K-1}).

To minimize the # of I/O operations initiated: say R is used as the outer relation; the # of I/O operations needed is
⌈N/K1⌉ + ⌈M/K2⌉ + (⌈N/K1⌉ - 1)(⌈M/K2⌉ - 1) = ⌈N/K1⌉·⌈M/K2⌉ + 1.
Ignoring the ceilings, this reaches its minimum when K1 = K2 = K/2. So to minimize the # of I/O operations, it does not matter which of R and S is used as the outer relation.
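To make the two cost measures concrete, here is a small Python sketch that evaluates them for made-up sizes (pages of R and S, and K buffer pages); the numbers are assumptions for illustration only.

from math import ceil

def io_pages(outer, inner, k_outer, k_inner):
    # formula (A)/(B): outer + inner + (passes - 1) * (inner - k_inner), rocking scan
    passes = ceil(outer / k_outer)
    return outer + inner + (passes - 1) * (inner - k_inner)

def io_operations(outer, inner, k_outer, k_inner):
    a, b = ceil(outer / k_outer), ceil(inner / k_inner)
    return a * b + 1                    # = a + b + (a - 1)(b - 1)

N, M, K = 1000, 200, 102                # assumed: R has 1000 pages, S has 200
print(io_pages(N, M, K - 1, 1), io_pages(M, N, K - 1, 1))
# -> 2991 2199: fewer page I/Os with the smaller relation S as the outer
print(io_operations(N, M, K // 2, K // 2), io_operations(M, N, K // 2, K // 2))
# -> 81 81: the # of I/O operations is the same either way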
18. Sort Merge Join Algorithm
1. Sort the two relations in ascending order of their respective joining attributes, i.e., sort R on A and sort S on B, if they are not already sorted.
2. Perform a merge join. Two cases to consider: a) the values under at least one joining attribute are distinct; b) both attributes have repeating values.

Worst case (neither relation is sorted and nearly all values of the joining attributes are the same):
CPU cost O(n log n + m log m + nm)
I/O cost O(N log N + M log M + C(R,S)), where C(R,S) is the cost of performing R × S, which is the same cost as the nested loop algorithm.

Best case (both sorted, no repeated values):
CPU cost O(n + m)
I/O cost O(N + M)
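A minimal in-memory Python sketch of the merge phase for case (a), assuming the S.B values are distinct; duplicate handling for case (b) and the external (on-disk) sort are omitted.

def sort_merge_join(R, S):
    R = sorted(R, key=lambda x: x["A"])     # step 1: sort both inputs
    S = sorted(S, key=lambda y: y["B"])
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):        # step 2: merge
        if R[i]["A"] < S[j]["B"]:
            i += 1
        elif R[i]["A"] > S[j]["B"]:
            j += 1
        else:
            result.append((R[i], S[j]))
            i += 1                          # S[j] may match further R tuples
    return result

print(sort_merge_join([{"A": 2}, {"A": 1}, {"A": 2}], [{"B": 2}, {"B": 5}]))
# -> [({'A': 2}, {'B': 2}), ({'A': 2}, {'B': 2})]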
19. Hash Join Algorithm
- 1. Build a hash table for the smaller relation S based on the joining attribute. Rather than putting pointers to tuples in the buckets, the tuples themselves are placed in the buckets.
- 2. Use the larger relation R to probe the hash table to perform the join. The probe process is described below:
  - for each tuple x in R
    - hash on the joining attribute, using the same hash function used above, to find a bucket in the hash table
    - if the bucket is nonempty
      - for every tuple y in the found bucket
        - if x.A = y.B then output (x, y)

CPU cost O(m + n·b), where b is the average # of tuples per bucket
I/O cost O(N + M)
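A minimal in-memory Python sketch of the two phases, assuming an equijoin R.A = S.B on lists of dicts; the bucket key here is the join value itself rather than h(value) modulo a fixed number of buckets, which keeps the sketch short.

from collections import defaultdict

def hash_join(R, S):
    buckets = defaultdict(list)
    for y in S:                             # 1. build on the smaller relation S
        buckets[y["B"]].append(y)
    result = []
    for x in R:                             # 2. probe with the larger relation R
        for y in buckets.get(x["A"], []):
            result.append((x, y))
    return result

print(hash_join([{"A": 1}, {"A": 2}], [{"B": 2}, {"B": 3}]))
# -> [({'A': 2}, {'B': 2})]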
20. Comparison of Join Algorithms
- Hash join is very efficient but is only applicable to equijoins.
- Sort merge join performs better than nested loop join when both relations are large; this is especially true if one or both relations are already sorted on the joining attributes.
- Nested loop join performs well when one relation is large and the other is small. A special case is when the smaller relation can be held entirely in main memory; then both relations need to be read only once. When combined with an index on the joining attribute of the (larger) inner relation, it can yield excellent performance.
21. Evaluating Selection
σ_{A op a}(R)
A is an attribute, a is a constant, and op ∈ {=, ≠, <, ≤, >, ≥}.
If op is ≠, then most tuples of R are likely to satisfy the condition → sequential scanning. So for the next few slides, we assume op is not ≠.
22. Selectivity
Defn: The selectivity of A op a on R, denoted S_{A op a}(R), is the percentage of the tuples of R that satisfy A op a.
Most commercial systems maintain detailed statistics about the values of each attribute. These statistics are typically in the form of a histogram. For poor systems without histograms but with just some basic statistics, the estimates are
S_{A=a}(R) ≈ 1 / dist(A)
S_{A>a}(R) ≈ (max(A) - a) / (max(A) - min(A))
where dist(A) is the number of distinct values of A in R.
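A small Python sketch of these two estimates, assuming only dist(A), max(A), and min(A) are known; the attribute values and ranges in the example are made up.

def est_sel_eq(dist_a):
    # estimated selectivity of A = a
    return 1.0 / dist_a

def est_sel_gt(a, min_a, max_a):
    # estimated selectivity of A > a, assuming uniformly spread values
    return (max_a - a) / (max_a - min_a)

# e.g., 50 distinct salaries ranging from 30,000 to 130,000:
print(est_sel_eq(50))                      # 0.02 -> about 2% of tuples match A = a
print(est_sel_gt(105000, 30000, 130000))   # 0.25 -> about 25% of tuples have A > 105000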
23. Selection Cost (without a fast access path)
Let k be n·S_{A op a}(R), i.e., the # of tuples in R that satisfy A op a. Suppose a fast access path is unavailable or not used.
Case 1 (R is sorted on A): use binary search. CPU cost O(log n + k). I/O cost O(log N + ⌈(k/n)·N⌉), where N is the # of pages needed to hold R.
Case 2 (R is not sorted on A): a sequential scan over all tuples is needed. CPU cost O(n). I/O cost O(N).
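A minimal in-memory Python sketch of Case 1 for op being =, assuming R is a list of (A, rest) tuples already sorted on A; a binary search finds the first match and a short scan collects the remaining matches.

from bisect import bisect_left

def select_eq_sorted(R, a):
    keys = [t[0] for t in R]              # in a real system the binary search runs
    i = bisect_left(keys, a)              # over sorted pages, not an in-memory copy
    out = []
    while i < len(R) and R[i][0] == a:    # O(k) scan over the matching tuples
        out.append(R[i])
        i += 1
    return out

print(select_eq_sorted([(1, "x"), (3, "y"), (3, "z"), (7, "w")], 3))
# -> [(3, 'y'), (3, 'z')]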
24. Selection Cost (with a fast access path)
Case 1 (sorted A-values), i.e., the fast access path is a clustered index: the B+ tree takes a constant number of steps to reach the first match, so the CPU cost is O(k) and the I/O cost is O(⌈(k/n)·N⌉).
Case 2 (unsorted A-values): the CPU cost is still O(k). The I/O cost is bounded by O(min{k, N}).
25. Evaluating Projection
π_{A1,...,At}(R), where A1,...,At are attributes of R.
Case 1 (keep duplicate rows): scan each tuple once. The CPU cost is O(n); the I/O cost is O(N).
Case 2 (select distinct): Step 1 as above. Step 2: sort the result. Step 3: remove adjacent duplicates. The CPU cost is dominated by sorting: O(n log n). The I/O cost is dominated by steps 1 and 2. The I/O cost for step 1 is O(N). Let W be the size (in pages) of the step 1 result, W = n·(Σ_{i=1..t} length(Ai)) / PageSize. The I/O cost for step 2 is O(W log W).
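A minimal in-memory Python sketch of Case 2, assuming R is a list of dicts; it performs the three steps (project, sort, remove adjacent duplicates) directly in memory, so it shows the logic but not the external sort.

def project_distinct(R, attrs):
    projected = [tuple(r[a] for a in attrs) for r in R]    # step 1: project
    projected.sort()                                       # step 2: sort
    out = []
    for row in projected:                                  # step 3: drop adjacent dups
        if not out or out[-1] != row:
            out.append(row)
    return out

R = [{"Name": "Ann", "Dept": "Sales"},
     {"Name": "Bob", "Dept": "Sales"},
     {"Name": "Ann", "Dept": "Sales"}]
print(project_distinct(R, ["Name", "Dept"]))   # -> [('Ann', 'Sales'), ('Bob', 'Sales')]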
26. Is the Time to Find an Optimal Plan Too Expensive?
- If a query is submitted many times, spending more effort to find a (nearly) optimal execution plan is worthwhile.
- For some special types of queries (chain queries, star queries, tree queries), an optimal execution plan can be found in a reasonable amount of time.
- For general queries, either heuristics are used to find a reasonably good plan or a reduced search space is used. E.g., in System R, only execution plans corresponding to left-deep join trees are considered.
27. Algebra-Based Optimization (based on heuristic rules)
- Perform selections as early as possible (see the small illustration after this list)
- Replace Cartesian products by joins whenever possible
- If there are several joins, perform the most restrictive join first
- Project out useless attributes early
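As a small illustration of the first rule, using the relations from slide 3 and the conditions from slide 5: instead of evaluating σ_{Employee.Dept='Sales'}(Employee ⋈_{Employee.Dept=Department.Name} Department), push the selection inside the join and evaluate σ_{Employee.Dept='Sales'}(Employee) ⋈_{Employee.Dept=Department.Name} Department, so the join sees only the Sales employees rather than the whole Employee relation.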
28. Cost Estimation-Based Optimization (e.g., IBM DB2)
Idea: For each query, enumerate all possible execution plans. For each plan, estimate its cost. Finally, choose the one with the lowest estimated cost.
- Difficulties
  - There are too many possible execution plans to enumerate.
  - It may be too difficult to estimate the cost of each execution plan accurately, mostly because of chained operations (we need to estimate the size of the result of the previous operation(s)).