Query Execution

About This Presentation

Title:

Query Execution

Description:

Query Execution Optimizing Performance – PowerPoint PPT presentation

Number of Views:80

Avg rating:3.0/5.0

Slides: 51

Provided by: thomog

Category:

more less

Transcript and Presenter's Notes

Title: Query Execution

1
Query Execution

Optimizing Performance

2
Resolving an SQL query

Since our SQL queries are very high level, the
query processor must do a lot of additional
processing to supply all of the missing details.
In practice, an SQL query is translated
internally into a relational algebra expression.
One advantage of using relational algebra is that
it makes alternative forms of a query easier to
explore.
The different algebraic expressions for a query
are called logical query plans.
We will focus first on the methods for executing
the operations of the relational algebra.
Then we will focus on how transform logical query
plans.

3
Preview

Parsing read SQL, output relational algebra tree
Query rewrite Transform tree to a form, which is
more efficient to evaluate
Physical plan generation
select implementation for each operator in
tree,
and for passing results up the tree.
In this chapter we will focus on the
implementation for each operator.

4
Relational Algebra recap

RA union, intersection and difference correspond
to UNION, INTERSECT, and EXCEPT in SQL
Selection ? corresponds to the WHERE-clause in
SQL
Projection ? corresponds to SELECT-clause
Product ? corresponds to FROM-clause
Joins corresponds to JOIN, NATURAL JOIN, and
OUTER JOIN in the SQL2 standard
Duplicate elimination ? corresponds to DISTINCT
in SELECT-clause
Grouping ? corresponds to GROUP BY
Sorting ? corresponds to ORDER BY

5
Expression trees
MovieStar(name,addr,gender,birthdate) StarsIn(titl
e, year, starName) SELECT title, birthdate FROM
MovieStar, StarsIn WHERE year 1996 AND Gender
F AND starName name
6
How to generate such alternative expression trees
will be Chapter 16.

Join method?
Can we pipeline the result of one or both
selections, and avoid storing the result on disk
temporarily?
Are there indexes on MovieStar.gender and/or
StarsIn.year that will make the ?'s efficient?

7
Relational algebra for real SQL

Keep in mind the following fact
A relation in algebra is a set, while a relation
in SQL is probably a bag
In short, a bag allows duplicates.
Not surprisingly, this can effect the cost of
related operations.

8
Bag union, intersection, and difference

Card(t,R) means the number of occurrences of
tuple t in relation R
Card(t, R?S) Card(t,R) Card(t,S)
Card(t,R?S) minCard(t,R), Card(t,S)
Card(t,RS) maxCard(t,R)Card(t,S), 0
Example R A,B,B, S C,A,B,C
R ? S A,A,B,B,B,C,C
R ? S A,B
R S B

9
Beware Bag Laws ! Set Laws

Not all algebraic laws that hold for sets also
hold for bags.
For one example, the commutative law for union
R ? S S ? R does hold for bags.
Since addition is commutative, adding the number
of times that tuple x appears in R and S doesnt
depend on the order of R and S.
Set union is idempotent, meaning that S ? S S.
However, for bags, if x appears n times in S,
then it appears 2n times in S ? S.
Thus S ? S ! S in general.

10
Physical operators

Physical query plans are built from physical
operators.
In most cases, the physical operators are direct
implementations of the relational algebra
operators.
However, there are several other physical
operators for various supporting tasks.
Table-scan (the most basic operation we want to
perform in a physical query plan)
Index-scan (E.g. if we have an index on some
relation R we can retrieve the blocks of R by
using the index)
Sort-scan (takes a relation R and a specification
of the attributes on which the sort is to be
made, and produces R in sorted order)

11
Computational Model

When comparing algorithms for the same operations
we do not consider the cost of writing the
output.
Because the cost of writing the output on the
disk depends on the size of the result, not on
the way the result was computed. In other words,
it is the same for any computational alternative.
Also, we can often pipeline the result to other
operators when the result is constructed in main
memory. So a final output phase may not even be
required.

12
Cost parameters

In order to evaluate or estimate the cost of
query resolution, we need to identify the
relevant parameters.
Typical cost parameters include
R the relation on disk
M number of main memory buffers available
(1buffer 1block)
B(R) number of blocks of R
T(R) number of tuples of R
V(R, a) number of distinct values in column a
of R
V(R, L) number of different tuples in R (where
L is a list of attributes or columns)
Simple cost estimate
Basic scan B(R)
2PMMS 3B(R)
Recall that final output is not counted

13
Algorithms for implementing RA-operators

Classification of algorithms
Sort based methods
Hash based methods
Index based methods
Degree of difficultness of algorithms
One pass (when one relation can fit into main
memory)
Two pass (when no relation can fit in main
memory, but again the relations are not very
extremely large)
Multi pass (when the relations are very extremely
large)
Classification of operators
Tuple-at-a-time, unary operations(s, p)
Full-relation, unary operations (d, g)
Full-relation, binary operations (union, join,)

14
Selection --

The condition C might involve
Arithmetic (,-, ?) or string operators such as
LIKE
Comparison between terms, e.g. a lt b or ab 10.
Boolean connectives AND, OR, and NOT
Example R

a b ---- 2 3 4 5 2 3
a b ---- 0 1 2 3 4 5 2 3
a b ---- 4 5
One pass, tuple-at-a-time
15
Projection --

Argument L of ? is a sequence of elements of the
following form
A single attribute in R, or
An expression x ? y, where x and y are attribute
names, or
An expression E ? z, where E is an expression
involving attributes in R and z is a new
attribute name not in R
Example R

a b c ------ 0 1 2 0 1 2 3 4 5
a x ---- 0 3 0 3 3 9
x y ---- 1 1 1 1 1 1
One pass, tuple-at-a-time
16
One pass, tuple-at-a-time

Selection and projection
Cost B(R) or T(R)
(if the relation is
not clustered)
Space requirement
M ? 1 block
Principle
Read one block (or one tuple if the relation is
not clustered) at a time
Filter in or out the tuples of this block.

17
Duplicate Elimination

R1 ?(R2).
R1 consists of one copy of each tuple that
appears in R2 one or more times.

R A B 1 2 3 4 1 2
One pass, unary full-relation operations
18
One pass, unary full-relation operations

Duplicate elimination
for each tuple decide
seen before ignore
new output
Principle
It is the first time we have
seen this tuple, in which case we copy it to the
output.
We have seen the tuple before,in which case we
must not output this tuple.
We need a Main Memory hash-table to be efficient.
Requirement ?

19
One pass, unary full-relation operations
20
One pass, unary full-relation operations
21
One pass, unary full-relation operations

Grouping Accumulate the information on groups in
main memory.
Need to keep one entry for each value of the
grouping attributes, through a main memory
search structure (hash table).
Then, we need for each group to keep an
aggregated value (or values if the query asks for
more than one aggregation).
For MIN/MAX we keep the min or max value seen so
far for the group.
For COUNT aggregation we keep a counter which is
incremented each time we encounter a tuple
belonging to the group.
For SUM, we add the value if the tuple belongs to
the group.
For AVG?
MM requirement.
Typically, a (group) tuple will be smaller than a
tuple of the input relation,
Typically, the number of groups will be smaller
than the number of tuples in the input relation.
The number of groups is
Requirement ? lt M

22
One pass, binary operators

Requirement min(B(R),B(S)) M
Exception bag union
Cost B(R) B(S)
Assume R is larger than S.
How to perform the operations below
Set union, set intersection, set difference
Bag intersection, bag difference
Cartesian product, natural join
All these operators require reading the smaller
of the relations into main memory using there a
search scheme (e.g. main memory hash table) for
easy search and insertion.

23
Set Union

Let R and S be sets.
We read S into M-1 buffers of main memory.
All these tuples are also copied to the output.
We then read each block of R into the Mth buffer,
one at a time.
For each tuple t of R we see if t is in S, and if
not, we copy t to output.
At the end we output S from main mem.

24
Set Intersection

Let R and S be sets.
The result will be set.
We read S into M-1 buffers of main memory.
We then read each block of R into the M-th
buffer, one at a time.
For each tuple t of R we see if t is in S, and if
so, we copy t to output.

25
Set Difference

Let R and S be sets.
Since difference is not a commutative operator,
we must distinguish between R-S and S-R.
Read S into M-1 buffers of main memory.
Then read each block of R into the Mth buffer,
one at a time.
To compute R-S
for each tuple t of R we see if t is not in S,
and if so, we copy t to output.
To compute S-R
for each tuple t of R we see if t is is in S, we
delete t from S in such a case. At the end we
output those tuples of S that remain.

26
Bag Intersection

Let R and S be bags.
Read S into M-1 buffers of main memory.
Also, associate with each tuple a count, which
initially measures the number of times the tuple
occurs in S.
Then read each block of R into the M-th buffer,
one at a time.
For each tuple t of R we see if t is in S. If not
we ignore it.
Otherwise, if the counter is greater than zero,
we output t and decrement the counter.

27
Bag Difference

We read S into M-1 buffers of main memory.
Also, we associate with each tuple a count, which
initially measures the number of times the tuple
occur in S.
We then read each block of R into the M-th
buffer, one at a time.
To compute S-R
for each tuple t of R we see if t is is in S, we
decrement its counter.
At the end we output those tuples of S that
remain with counter positive.
To compute R-S
we may think of the counter c for tuple t as
having c reasons to not output t.
Now, when we process a tuple of R we check to see
if that tuple appears in S. If not we output t.
Otherwise, we check to see the counter c of t. If
it is 0 we output t.
If not, we dont output t, and we decrement c.

28
Product

We read S into M-1 buffers of main memory. No
special structure is needed.
We then read each block of R into the M-th
buffer, one at a time. And combine each tuple
with all the tuples of S.

29
(No Transcript)
30
Natural Join

We read S into M-1 buffers of main memory and
build a search structure where the search key is
the shared attributes X of R and S.
We then read each block of R into the M-th
buffer, one at a time. For each tuple t of R we
look for tuples t in S, and if tX uX, we
copy t.u to the output.

31
Nested-Loop joins

one-and-a-half pass method, since one relation
will be read just once.
Tuple-Based Nested-loop Join Algorithm
FOR each tuple s in S DO
FOR each tuple r in R DO
IF r and s join to make a tuple t THEN
output t
Improvement to Take Advantage of Disk I/O Model
Instead of retrieving tuples of R, T(S) times,
load memory with as many tuples of S as can fit,
and match tuples of R against all S tuples in
memory.

32
Block-based nested loops

Assume B(S) B(R), and B(S) gt M
Read M-1 blocks of S into main memory and compare
to all of R, block by block
FOR each chunk of M-1 blocks of S DO
FOR each block b of R DO
FOR each tuple t of b DO
find the tuples of S in memory that join
with t
output the join of t with each of these
tuples

33
Example

B(R) 1000, B(S) 500, M 101
Outer loop iterates 5 times
At each iteration we read M-1 (i.e. 100) blocks
of S and all of R (i.e. 1000) blocks.
Total time 5(100 1000) 5500 I/Os
Question What if we reversed the roles of R and
S?
We would iterate 10 times, and in each we would
read 100500 blocks, for a total of 10(100500)
6000 I/Os.
Compare with one-pass join, if it could be done!
We would need 1500 disk I/Os if B(S) ? M-1

34
Analysis of blocks nested loops

Number of disk I/Os
B(S)/(M-1) (M-1 B(R))
or
B(S) (B(S)B(R))/(M-1)
or approximately (B(S)B(R))/M

35
Summary of one-pass algorithms
36
Two-pass algorithms based on sorting

This special case of multi-pass algorithms is
sufficient for most of the relation sizes.
Main idea for unary operations on R
Suppose B(R) ? M (main memory size in blocks)
First pass
Read M blocks of R into Main Memory
Sort the content of Main Memory
Write the sorted result (sublist/run) into M
blocks on disk.
Second pass create final result

37
Duplicate elimination ? using sorting

In the second phase (merging) we dont sort but
copy each tuple just once.
We can do that because the identical tuples will
show up at the same time, i.e. they will be all
the first ones at the buffers (for the sorted
sublists).
As usual, if one buffer gets empty we refill it.

38
Duplicate-Elimination using Sorting Example

Assume M3, each buffer holds 2 records and
relation R consists of the following 17 tuples
2, 5, 2, 1, 2, 2, 4, 5, 4, 3, 4, 2, 1, 5, 2, 1,
3
After the first pass the following sorted
sub-lists are created
1, 2, 2, 2, 2, 5
2, 3, 4, 4, 4, 5
1, 1, 2, 3, 5
In the second pass we dedicate a memory buffer to
each sub-list.

39
Example (Contd)
40
Example (Contd)
41
Example (Contd)
42
Analysis of ?(R)

2B(R) to create sorted sublists, B(R) to read
each sublist in phase 2. Total 3B(R)
How large can R be?
There can be no more than M sublists since we
need one buffer for each one.
So, B(R)/M M, (B(R)/M is the number of
sublists)
i.e. B(R) M2
To compute ?(R) we need at least sqrt(B(R))
blocks of Main Memory.

43
Sort-based ?, ?, -
Example set union.

Create sorted sublists of R and S
Use input buffers for sorted sublists of R and S,
one buffer per sublist.
Output each tuple once.
We can do that since all the identical tuples
appear at the same time.

Analysis 3(B(R) B(S)) disk I/Os
Condition B(R) B(S) M2
Similar algorithms for sort based intersection
and difference (bag or set versions).

44
Simple sort-based join

For R(X,Y) S(Y,Z) with M buffers of memory
Completely sort R on Y, sort S on Y
Merge phase
Use 2 input buffers 1 for R, 1 for S.
Pick tuple t with smallest Y value in the buffer
for R
If t doesnt match with the first tuple in the
buffer for S, then just remove t.
Otherwise, read all the tuples from R with the
same Y value as t and put them in the M-2 part of
the memory.
When the input buffer for R is exhausted fill it
again and again.
Then, read the tuples of S that match. For each
one we produce the join of it with all the tuples
of R in the M-2 part of the memory.

45
Example of sort join

B(R) 1000, B(S) 500, M 101
To sort R, we need 4B(R) I/Os, same for S.
Number of I/Os 4(B(R) B(S))
Doing the join in the merge phase
Number of I/Os B(R) B(S)
Total disk I/Os 5(B(R) B(S)) 7500
Memory Requirement?
To be able to do the sort, we should have B(R)
M2 and B(S) M2
Recall for nested-loop join, we needed 5500 disk
I/Os, but the memory requirement was quadratic
(it is linear, here), i.e., nested-loop join is
not good for joining relations that are much
larger than main memory.

46
Potential problem ...
S(Y, Z) --------- a z1 a z2 ...
a zm
R(X , Y) ----------- x1 a x2 a
xn a
What if Size of n tuples gt M-2 and Size of m
tuplesgt M-2?

If the tuples from R (or S) with the same value y
of Y do not fit in M-2 buffers, then we use all
M-2 buffers to do a nested-loop join on the
tuples with Y-value y from both relations.
Observe that we can smoothly continue with the
nested loop join when we see that the R tuples
with Y-value y do not fit in M-2 buffers.

47
Can We Improve on Sort Join?

Do we really need the fully sorted files?
Suppose we are not worried about many common Y
values

48
A more efficient sort-based join

Suppose we are not worried about many common Y
values
Create Y-sorted sublists of R and S
Bring first block of each sublist into a buffer
(assuming we have at most M sublists)
Find smallest Y-value from heads of buffers. Join
with other tuples in heads of buffers, use other
possible buffers, if there are many tuples with
the same Y values.

Disk I/O 3(B(R) B(S))
Requirement B(R) B(S) M2

49
Example

B(R) 1000, B(S) 500, M 101
Total of 15 sorted sublists
If too many tuples join on a value Y, use the
remaining 86 MM buffers for a one pass join on Y
Total cost 3(1000 500) 4500 disk I/Os
M2 10201 gt B(R) B(S), so the requirement is
satisfied

50
Summary of sort-based algorithms

Write a Comment

User Comments (0)