Gitte Christensen - PowerPoint PPT Presentation

About This Presentation
Title:

Gitte Christensen

Description:

Title: SQL Query Optimisation Subject: Ian Smith's Master Author: Dept of Computing Sciences Last modified by: Gitte Christensen Created Date: 9/15/1997 9:06:54 AM – PowerPoint PPT presentation

Number of Views:90
Avg rating:3.0/5.0
Slides: 65
Provided by: Deptof66
Category:

less

Transcript and Presenter's Notes

Title: Gitte Christensen


1
Managing External Data 3
  • Gitte Christensen
  • Dyalog Ltd

2
Relational Algebra
  • Relational Algebra is
  • the formal description of how a relational
    database operates
  • the mathematics which underpin SQL
    operations.
  • Operators in relational algebra are not
    necessarily the same as SQL operators, even if
    they have the same name.

3
Terminology
  • Relation - a set of tuples.
  • Tuple - a collection of attributes which describe
    some real world entity.
  • Attribute - a real world role played by a named
    domain.
  • Domain - a set of atomic values.
  • Set - a mathematical definition for a collection
    of objects which contains no duplicates.

4
Operators - Write
  • INSERT - provides a list of attribute values for
    a new tuple in a relation. This operator is the
    same as SQL.
  • DELETE - provides a condition on the attributes
    of a relation to determine which tuple(s) to
    remove from the relation. This operator is the
    same as SQL.
  • MODIFY - changes the values of one or more
    attributes in one or more tuples of a relation,
    as identified by a condition operating on the
    attributes of the relation. This is equivalent to
    SQL UPDATE.

5
Operators - Retrieval
  • There are two groups of operations
  • Mathematical set theory based relations UNION,
    INTERSECTION, DIFFERENCE, and CARTESIAN PRODUCT.
  • Special database operations SELECT (not the
    same as SQL SELECT), PROJECT, and JOIN.

6
Relational SELECT
  • SELECT is used to obtain a subset of the tuples
    of a relation that satisfy a select condition.
  • For example, find all employees born after 1st
    Jan 1950
  • SELECT dob gt 01/JAN/1950 (employee)

7
Relational PROJECT
  • The PROJECT operation is used to select a subset
    of the attributes of a relation by specifying the
    names of the required attributes.
  • For example, to get a list of all employees
    surnames and employee numbers
  • PROJECT surname,empno (employee)

8
SELECT and PROJECT
SELECT and PROJECT can be combined together. For
example, to get a list of employee numbers for
employees in department number 1
9
Set Operations - semantics
  • Consider two relations R and S.
  • UNION of R and Sthe union of two relations is a
    relation that includes all the tuples that are
    either in R or in S or in both R and S. Duplicate
    tuples are eliminated.
  • INTERSECTION of R and Sthe intersection of R and
    S is a relation that includes all tuples that are
    both in R and S.
  • DIFFERENCE of R and Sthe difference of R and S
    is the relation that contains all the tuples that
    are in R but that are not in S.

10
SET Operations - requirements
  • For set operations to function correctly the
    relations R and S must be union compatible. Two
    relations are union compatible if
  • they have the same number of attributes
  • the domain of each attribute in column order is
    the same in both R and S.

11
UNION Example
12
INTERSECTION Example
13
DIFFERENCE Example
14
CARTESIAN PRODUCT
  • The Cartesian Product is also an operator which
    works on two sets. It is sometimes called the
    CROSS PRODUCT or CROSS JOIN.
  • It combines the tuples of one relation with all
    the tuples of the other relation.

15
CARTESIAN PRODUCT Example
16
JOIN Operator
  • JOIN is used to combine related tuples from two
    relations
  • In its simplest form the JOIN operator is just
    the cross product of the two relations.
  • As the join becomes more complex, tuples are
    removed within the cross product to make the
    result of the join more meaningful.
  • JOIN allows you to evaluate a join condition
    between the attributes of the relations on which
    the join is undertaken.
  • The notation used is
  • R JOIN join condition S

17
JOIN Example
18
Natural Join
  • Invariably the JOIN involves an equality test,
    and thus is often described as an equi-join. Such
    joins result in two attributes in the resulting
    relation having exactly the same value. A
    natural join will remove the duplicate
    attribute(s).
  • In most systems a natural join will require that
    the attributes have the same name to identify the
    attribute(s) to be used in the join. This may
    require a renaming mechanism.
  • If you do use natural joins make sure that the
    relations do not have two attributes with the
    same name by accident.

19
OUTER JOINs
  • Notice that much of the data is lost when
    applying a join to two relations. In some cases
    this lost data might hold useful information. An
    outer join retains the information that would
    have been lost from the tables, replacing missing
    data with nulls.
  • There are three forms of the outer join,
    depending on which data is to be kept.
  • LEFT OUTER JOIN - keep data from the left-hand
    table
  • RIGHT OUTER JOIN - keep data from the right-hand
    table
  • FULL OUTER JOIN - keep data from both tables

20
OUTER JOIN Example 1
21
OUTER JOIN Example 2
22
SQL query optimisation
  • Optimisation Concept
  • Implementation of Rel. Algebra Operations
  • Oracle Query Execution Plans.
  • Btree Indexing
  • Rtree Indexing

23
SQL query optimisation
Parse and Translate
Relational algebra expression
SQL query
Optimisation using data statistics
Query result
Execution plan
Evaluate against Database
24
Optimisation steps
  • Parse
  • check SQL syntax and columns tables valid
  • Translate
  • SQL into relational algebra expression
  • Select most efficient query execution plan
  • minimisation of the input/output and cpu
    requirements
  • Evaluate expression
  • Call required code modules

25
Example SQL query
S(sno,sname,status,scity) SP(sno,pno,qty)
P(pno,pname,colour,weight,pcity) Get supplier
name for suppliers who supply red
parts. SQL select sname from S,P,SP where S.sno
SP.sno and P.pno Sp.pno and colour red
26
Example SQL query
  • How would you do this?

27
Query as relational algebra expression and graph
(((((( P restrict colour red ) project pno
) join SP pno pno) project sno ) join S
sno sno ) project sname )
project join
project join
project restrict
P SP S
28
Relational algebra transformations
  • The following transformations can be applied
    without regard to the actual data values stored
    in the tables referenced in the SQL query.
  • They are being stated to justify some of the
    common manipulations of the relational algebra
    during optimisation.

29
Distributive transformation
Distributive Law f(A o B) f(A) o
f(B) e.g. restrict(A join B) restrict (A)
join restrict (B) Restrict distributes over
union, intersection, difference and join. For
join the restriction must, at its most complex,
consist of the AND of two separate conditions one
for each operand. Project distributes over union,
intersection, join. For join the join attributes
must be included in the projection. Thus
restricts / projects can often be done before
joins.
30
Commutative transformation
Commutative Law The operator o is commutative if
- A o B B o A for all values of A and
B Union, intersection and join are
commutative. E.g S join SP SP join S Thus
optimiser can choose best order typically the
smallest table is used to drive the join
processing.
31
Associative transformation
Associative Law Operator o is associative if - A
o ( B o C ) ( A o B ) o C for all A,B,C
values e.g. A join ( B join C ) ( A join B )
join C Union, intersection and join are
associative. Thus for a 3 table join the
optimiser can choose any sequence of pairs of
joins.
32
Semantic transformations
Referential Integrity Based. select pno from SP,S
where SP.sno S.sno If declared S.sno as PK
and SP.sno as FK then optimise to- select pno
from SP Application of Negator Functions where
not (status gt 20) this cannot use an index scan
so must use table scan but if transform to
status lt 20 - now can use index scan Assumes
that an index on the status column already exists
33
Will the results be the same?
select sname from s,sp where s,sno sp.sno and
sp.pno p2 select sname from sp,s where
sp.pno p2 and sp.sno s.sno select sname
from s where s.sno in ( select sp.sno from
sp where sp.pno p2) Explain. What is
the implication for performance of the three?
34
Will the results be the same?
35
Yes - all produce the same result rows
  • join s to sp can be transformed into join sp to s
    as join operation is commutative
  • restrict can be distributed past the join thus
    first two order of operations can be ignored
  • set membership can be converted to a join
  • optimisation will convert all 3 queries the same
    query so all must have the same performance -
    user can chose which to use

36
Break
37
SQL query optimisation
38
Algorithm selection
  • The relational algebra operations typically have
    multiple code implementations. It is necessary
    for the optimiser to select the most appropriate
    for the current circumstances.
  • For example the presence of a direct access path
    to the data required will be preferred to a
    complete table scan if only 10 -20 of the table
    is to be accessed.

39
Restriction implementations
Full table scan or table scan with early
termination All blocks must be searched unless
an early termination condition is found e.g row
found and no duplicates or data is ordered and
already passed the required data location. Btree
Non Clustered Index Search Tree search to
starting key then linear scan of sorted
keys. Assume to access each record needs a disk
block read.
40
Multiple restrictions
Conjunction Andd together If only some
columns are indexed then use most selective index
to locate the data record and check out the other
restriction conditions in the buffer. If all
columns are indexed then can get rowids of all
records meeting the restriction conditions and
perform an intersection. Use the result set of
rowids to locate the data records. Disjunction
Ord together If all columns indexed get all
rowids and perform union. Use the result set of
rowids to locate the data records.Else must scan
table.
41
Join implementations
  • All give the same final result set of rows as
    defined by the join operation.
  • Algorithm and cost differ.
  • Nested loop algorithm will function with any join
    condition I.e lt, lt, , gt, gt
  • Hash algorithm only possible with
  • Relation S 500 rows 50 rows per disk block
  • Relation R 100000 rows 50 rows per disk block

42
Join - Block Nested Loop Inner file pass 1
S Outer File Pass.Block Buffers Pass.Block
R Inner File
1.1 2.1 1.2 2.2 1.3 2.3
1.1 1.2
Join col values equal
43
Join - Block Nested LoopInner file pass 2
S Outer File Pass.Block Buffers Pass.Block
R Inner File
1.1 2.1 1.2 2.2 1.3 2.3
1.1 1.2
Join col values equal
44
Join - Block Nested Loop - Cost
  • Cost
  • No. Outer Loop Blocks No. Inner Loop blocks
    No.
    Outer Loop Blocks
  • Use smallest table as outer loop
  • i.e. (500/50 100000/50 ) 500/50 20010
  • Reduce cost of inner loop by having -
  • index built on inner loop join column which is
    also a Primary Key - lookup avoiding full file
    scan assume needs 1 disk read per row in outer
    file
  • (500 / 50 500 ) 510

45
Equi Join - Hash example
Relation R 91 1 40 900
Disk Hash Buckets R 0 1

Disk Hash Buckets S 0 1
Relation S 40 101 90 1
1000 91
Value allocated to 0 if even else allocated to
1 if odd
46
Equi Join - Hash example
Relation R 91 1 40 900
Disk Hash Buckets R 0 40 900
1 1 91
Disk Hash Buckets S 0 40 90 1000 1
101 1 91
Relation S 40 101 90 1
1000 91
Value allocated to 0 if even else allocated to
1 if odd
47
Equi Join - Hash cost
  • Only need to compare record values from matching
    hash buckets to find which records need joining.
    Assumes mechanism to retrieve required data rows.
  • Cost
  • Read files in for hashing Write out to allocated
    hash bucket Read in each pair of buckets for the
    search for matching join values.
  • 3 ( No Blocks of R No of blocks of S)
  • 3 ( 2000 10 ) 6030

48
Join before restrict Example
S(sno,sname,status,scity) 100 rows 20 rows/disk
block SP(sno,pno,qty) 10000 rows 100/disk
block 50 with pno p2 select from S,SP where
pno p2 and S.sno SP.sno Approach 1 Cost
in Block I/O restrict write out after
restrict 1 join nested loop
5 (1005) 505 S SP Total 506
49
Restrict before join Example
Approach 2 Cost in Block I/O write out
after join 1 join nested
loop in RAM full scan of S
5 restrict full scan of
SP 100 S SP Total 106
50
Three table join query
branch(b_name, b_city, manager, b_address) 100
rows account(b_name, a_number, balance)
10 000 rows account_holder(h_name,a_number,h_addre
ss) 50 000 rows select h_name, balance from
account a, account_holder h, branch b where
a.a_number h.a_number and a.b_name
b.b_name and b.city Leicester and balance gt
10000
51
Optimised query plan
project(h_name,balance)
join (a_number) project(a_number,balance)
join (b_name) project (b_name)
project(h_name,a_number) restrict (b_city
Leicester) restrict(balancegt10000) branc
h account account_holder
52
Three table join query plan justification
53
Three table join query plan justification
  • Projects and restricts distributed past joins
  • join branch and account as pair of smallest
    tables and therefore assume that this will
    provide the smallest result set.

54
Oracle query demo tables
Tables arel ( col1 number(3) constraint pk_arel
primary key, col2 number(2), col3
varchar(400)) Btree index col1 col2 900
rows brel (col1 number(3) constraint pk_brel
primary key, col2 number(2)) Btree index col1
col2 900 rows
55
Data distribution arel (1)
Check data distribution of arel by - select
colX , count() from arel group by colX for X
1 i.e col1 get 900 groups with count of 1
thus 900 unique values no duplicates for X 3
i.e. col3 get 1 group with count of 900
thus 1 value duplicated 900 times Both above
have a uniform distribution.
56
Data distribution arel (2)
for X 2 i.e col2 get - COL2 COUNT() COL2
COUNT() 1 120 8 60 2
120 9 30 3 120 10
30 4 120 15 30 5 60 20
30 6 60 29 30 7
60 30 30 A non uniform distribution of
values in col2
57
Analysis of arel table to provide Oracle with
statistics
analyze table arel compute statistics
COLUMN_NAME NUM_DISTINCT NUM_NULLS
NUM_BUCKETS COL1 900 0
1 COL2 14 0 1 COL3
1 0 1 As col2is known
to have a non uniform distribution best to
generate a histogram i.e need more than 1 bucket
58
Analysis of arel col2
analyze table arel compute statistics for columns
col2 Now show endpoint data generated using a
query against a system table NUMBER
VALUE NUMBER VALUE 120 1 720
8 240 2 750 9 360 3 780
10 480 4 810 15 540 5 840
20 600 6 870 29 660
7 900 30
59
Query execution plan
  • Oracle uses the explain plan for select
    statement to generate a query plan
  • Results stored in a table generated in your
    schema by running a script utlxplan
  • Results are displayed from the table by running
    the script utlxpls

60
Oracle query Restrict Project via indexed PK
col1
Restrict Project explain plan for select
col1,col2 from arel where col1 56 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 1 39 2 TABLE
ACCESS BY INDEX ROWID AREL 1 39 2
INDEX UNIQUE SCAN PK_AREL 1 1
Query buffer reads 3
61
Oracle query Restrict Project via non indexed
col3
Restrict Project explain plan for select
col1,col2 from arel where col3 x start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 900 200K 35 TABLE
ACCESS FULL AREL 900 200K 35 Query
buffer reads 229
62
Oracle query Restrict Project via indexed col2
(1)
Restrict Project explain plan for select
col1,col2 from arel where col2 30 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 30 1K 31 TABLE
ACCESS BY INDEX ROWID AREL 30 1K 31
INDEX RANGE SCAN AREL_COL2 30
1 Query buffer reads 34
63
Oracle query Restrict Project via indexed col2
(2)
Restrict Project explain plan for select
col1,col2 from arel where col2 1 start utlxpls
/ Operation Name Rows Bytes Cost SELECT
STATEMENT 120 4K 35 TABLE ACCESS
FULL AREL 120 4K 35 Query buffer
reads 236 Why has it changed just because
using a different row value?
64
Oracle query Project Join all rows arel brel
Project Join explain plan for select count()
from arel a, brel b where a.col1 b.col1 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 1 23 2SORT
AGGREGATE 1 23 NESTED LOOPS
900 20K 2
INDEX FAST FULL SCAN PK_AREL 900 11K
INDEX UNIQUE SCAN PK_BREL 900 8K
Query buffer reads 912
65
Oracle query Restrict Join via indexed cols (1)
Restrict Join Query buffer reads 4 explain
plan for select count() from arel a, brel b
where a.col1 b.col1 and a.col1 500 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 1 36 1 SORT
AGGREGATE 1 36 NESTED LOOPS 1 36 1
INDEX UNIQUE SCAN PK_BREL 1 10 1 INDEX
UNIQUE SCAN PK_AREL 1 26 1
66
Oracle query Restrict Join via indexed cols (2)
explain plan for Query buffer reads 64 select
count() from arel a, brel b where a.col1
b.col1 and a.col2 10 Operation Name
Rows Bytes Cost SELECT STATEMENT 1 36
31 SORT AGGREGATE 1 36 NESTED LOOPS 30
1K 31 TABLE ACCESS BY INDEX
ROWID AREL INDEX RANGE SCAN AREL_COL2
30 1 INDEX UNIQUE SCAN PK_BREL 900
8K
67
Oracle query Restrict Join via indexed cols (3)
Restrict Join Query buffer reads
351 explain plan for select count() from arel
a, brel b where a.col1 b.col1 and a.col2
1 Operation Name Rows Bytes Cost SELECT
STATEMENT 1 36 35 SORT AGGREGATE
1 36 NESTED LOOPS 120 4K 35
TABLE ACCESS FULL AREL 120 3K 35 INDEX
UNIQUE SCAN PK_BREL 900 8K
68
Oracle query Restrict Project via indexed col2
(3)
Restrict Project explain plan for select
col1,col2 from brel where col2 30 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 30 300 2 TABLE
ACCESS FULL BREL 30 300 2 Query buffer
reads 18 Previous query against
arel used indexed access. Why not for brel?
69
Oracle query Restrict Project via indexed col2
(4)
analyze table brel delete statistics explain
plan for select col1,col2 from brel where col2
1 start utlxpls / Operation Name Rows
Bytes Cost SELECT STATEMENT TABLE ACCESS BY
INDEX ROWID BREL INDEX RANGE
SCAN BREL_COL2 Now no statistics
so chooses index access
70
Optimiser estimation methods
  • Hard coded values in the optimiser algorithm
  • e.g restriction 1 rows on , 10 on gt
  • No consideration of actual data values
  • Statistics based on number of different distinct
    values in column
  • assumes uniform distribution
  • Statistics based on histogram of count of unique
    values.

71
Importance of estimates
  • The use of an index is only efficient if less
    than 20 of the data is to be returned.
  • Each row access may require a separate disk read.
    Thus same disk block may be retrieved more than
    once if buffer space limited. More efficient to
    execute a full table scan.
  • Cost of statistics collection needs to be traded
    off against improved performance of queries based
    on more accurate estimates.

72
Database Implementations
73
Relational Databases
  • The Conceptual model has become the Physical
    model
  • Row based storage
  • No ordering of data
  • Row based searches
  • Very many comparisons is required to find and
    manipulate data

74
Relational Databases/APL
  • Current APL/K/J databases
  • all relational
  • all Column stores
  • kdb
  • Vstar
  • flipdb
  • TakeCare

75
Relational Databases/APL
  • The relational model holds
  • Row stores are implemented with great engineering
    skills
  • Use them if you can

76
Relational Databases/APL
  • but
  • The physical laws still apply
  • Some problems might need different solutions

77
Semistructured data
78
title year length filmType studioName starName
Star Wars 1977 124 color Fox Carrie Fisher
Star Wars 1977 124 color Fox Mark Hamill
Star Wars 1977 124 color Fox Harrison Ford
Mighty Ducks 1991 104 color Disney Emilio Estevez
Wayne's World 1992 95 color Paramount Dana Carvey
Wayne's World 1992 95 color Paramount Mike Meyers
79
Root
star
movie
star
starIn
mh
sw
cf
starOf
year
name
title
city
address
address
name
street
Star Wars
1977
Mark Hamill
Oak
street
Brentwood
Carrie Fisher
street
city
city
starOf
Hollywood
Maple
Locust
Malibu
starIn
80
Multidimensional data
81
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com