Title: Gitte Christensen
1Managing External Data 3
- Gitte Christensen
- Dyalog Ltd
2Relational Algebra
- Relational Algebra is
- the formal description of how a relational
database operates - the mathematics which underpin SQL
operations. - Operators in relational algebra are not
necessarily the same as SQL operators, even if
they have the same name.
3Terminology
- Relation - a set of tuples.
- Tuple - a collection of attributes which describe
some real world entity. - Attribute - a real world role played by a named
domain. - Domain - a set of atomic values.
- Set - a mathematical definition for a collection
of objects which contains no duplicates.
4Operators - Write
- INSERT - provides a list of attribute values for
a new tuple in a relation. This operator is the
same as SQL. - DELETE - provides a condition on the attributes
of a relation to determine which tuple(s) to
remove from the relation. This operator is the
same as SQL. - MODIFY - changes the values of one or more
attributes in one or more tuples of a relation,
as identified by a condition operating on the
attributes of the relation. This is equivalent to
SQL UPDATE.
5Operators - Retrieval
- There are two groups of operations
- Mathematical set theory based relations UNION,
INTERSECTION, DIFFERENCE, and CARTESIAN PRODUCT. - Special database operations SELECT (not the
same as SQL SELECT), PROJECT, and JOIN.
6Relational SELECT
- SELECT is used to obtain a subset of the tuples
of a relation that satisfy a select condition. - For example, find all employees born after 1st
Jan 1950 - SELECT dob gt 01/JAN/1950 (employee)
7Relational PROJECT
- The PROJECT operation is used to select a subset
of the attributes of a relation by specifying the
names of the required attributes. - For example, to get a list of all employees
surnames and employee numbers - PROJECT surname,empno (employee)
8SELECT and PROJECT
SELECT and PROJECT can be combined together. For
example, to get a list of employee numbers for
employees in department number 1
9Set Operations - semantics
- Consider two relations R and S.
- UNION of R and Sthe union of two relations is a
relation that includes all the tuples that are
either in R or in S or in both R and S. Duplicate
tuples are eliminated. - INTERSECTION of R and Sthe intersection of R and
S is a relation that includes all tuples that are
both in R and S. - DIFFERENCE of R and Sthe difference of R and S
is the relation that contains all the tuples that
are in R but that are not in S.
10SET Operations - requirements
- For set operations to function correctly the
relations R and S must be union compatible. Two
relations are union compatible if - they have the same number of attributes
- the domain of each attribute in column order is
the same in both R and S.
11UNION Example
12INTERSECTION Example
13DIFFERENCE Example
14CARTESIAN PRODUCT
- The Cartesian Product is also an operator which
works on two sets. It is sometimes called the
CROSS PRODUCT or CROSS JOIN. - It combines the tuples of one relation with all
the tuples of the other relation.
15CARTESIAN PRODUCT Example
16JOIN Operator
- JOIN is used to combine related tuples from two
relations - In its simplest form the JOIN operator is just
the cross product of the two relations. - As the join becomes more complex, tuples are
removed within the cross product to make the
result of the join more meaningful. - JOIN allows you to evaluate a join condition
between the attributes of the relations on which
the join is undertaken. - The notation used is
- R JOIN join condition S
17JOIN Example
18Natural Join
- Invariably the JOIN involves an equality test,
and thus is often described as an equi-join. Such
joins result in two attributes in the resulting
relation having exactly the same value. A
natural join will remove the duplicate
attribute(s). - In most systems a natural join will require that
the attributes have the same name to identify the
attribute(s) to be used in the join. This may
require a renaming mechanism. - If you do use natural joins make sure that the
relations do not have two attributes with the
same name by accident.
19OUTER JOINs
- Notice that much of the data is lost when
applying a join to two relations. In some cases
this lost data might hold useful information. An
outer join retains the information that would
have been lost from the tables, replacing missing
data with nulls. - There are three forms of the outer join,
depending on which data is to be kept. - LEFT OUTER JOIN - keep data from the left-hand
table - RIGHT OUTER JOIN - keep data from the right-hand
table - FULL OUTER JOIN - keep data from both tables
20OUTER JOIN Example 1
21OUTER JOIN Example 2
22SQL query optimisation
- Optimisation Concept
- Implementation of Rel. Algebra Operations
- Oracle Query Execution Plans.
- Btree Indexing
- Rtree Indexing
23SQL query optimisation
Parse and Translate
Relational algebra expression
SQL query
Optimisation using data statistics
Query result
Execution plan
Evaluate against Database
24Optimisation steps
- Parse
- check SQL syntax and columns tables valid
- Translate
- SQL into relational algebra expression
- Select most efficient query execution plan
- minimisation of the input/output and cpu
requirements - Evaluate expression
- Call required code modules
25Example SQL query
S(sno,sname,status,scity) SP(sno,pno,qty)
P(pno,pname,colour,weight,pcity) Get supplier
name for suppliers who supply red
parts. SQL select sname from S,P,SP where S.sno
SP.sno and P.pno Sp.pno and colour red
26Example SQL query
27Query as relational algebra expression and graph
(((((( P restrict colour red ) project pno
) join SP pno pno) project sno ) join S
sno sno ) project sname )
project join
project join
project restrict
P SP S
28Relational algebra transformations
- The following transformations can be applied
without regard to the actual data values stored
in the tables referenced in the SQL query. - They are being stated to justify some of the
common manipulations of the relational algebra
during optimisation.
29Distributive transformation
Distributive Law f(A o B) f(A) o
f(B) e.g. restrict(A join B) restrict (A)
join restrict (B) Restrict distributes over
union, intersection, difference and join. For
join the restriction must, at its most complex,
consist of the AND of two separate conditions one
for each operand. Project distributes over union,
intersection, join. For join the join attributes
must be included in the projection. Thus
restricts / projects can often be done before
joins.
30Commutative transformation
Commutative Law The operator o is commutative if
- A o B B o A for all values of A and
B Union, intersection and join are
commutative. E.g S join SP SP join S Thus
optimiser can choose best order typically the
smallest table is used to drive the join
processing.
31Associative transformation
Associative Law Operator o is associative if - A
o ( B o C ) ( A o B ) o C for all A,B,C
values e.g. A join ( B join C ) ( A join B )
join C Union, intersection and join are
associative. Thus for a 3 table join the
optimiser can choose any sequence of pairs of
joins.
32Semantic transformations
Referential Integrity Based. select pno from SP,S
where SP.sno S.sno If declared S.sno as PK
and SP.sno as FK then optimise to- select pno
from SP Application of Negator Functions where
not (status gt 20) this cannot use an index scan
so must use table scan but if transform to
status lt 20 - now can use index scan Assumes
that an index on the status column already exists
33Will the results be the same?
select sname from s,sp where s,sno sp.sno and
sp.pno p2 select sname from sp,s where
sp.pno p2 and sp.sno s.sno select sname
from s where s.sno in ( select sp.sno from
sp where sp.pno p2) Explain. What is
the implication for performance of the three?
34Will the results be the same?
35Yes - all produce the same result rows
- join s to sp can be transformed into join sp to s
as join operation is commutative - restrict can be distributed past the join thus
first two order of operations can be ignored - set membership can be converted to a join
- optimisation will convert all 3 queries the same
query so all must have the same performance -
user can chose which to use
36Break
37SQL query optimisation
38Algorithm selection
- The relational algebra operations typically have
multiple code implementations. It is necessary
for the optimiser to select the most appropriate
for the current circumstances. - For example the presence of a direct access path
to the data required will be preferred to a
complete table scan if only 10 -20 of the table
is to be accessed.
39 Restriction implementations
Full table scan or table scan with early
termination All blocks must be searched unless
an early termination condition is found e.g row
found and no duplicates or data is ordered and
already passed the required data location. Btree
Non Clustered Index Search Tree search to
starting key then linear scan of sorted
keys. Assume to access each record needs a disk
block read.
40Multiple restrictions
Conjunction Andd together If only some
columns are indexed then use most selective index
to locate the data record and check out the other
restriction conditions in the buffer. If all
columns are indexed then can get rowids of all
records meeting the restriction conditions and
perform an intersection. Use the result set of
rowids to locate the data records. Disjunction
Ord together If all columns indexed get all
rowids and perform union. Use the result set of
rowids to locate the data records.Else must scan
table.
41Join implementations
- All give the same final result set of rows as
defined by the join operation. - Algorithm and cost differ.
- Nested loop algorithm will function with any join
condition I.e lt, lt, , gt, gt - Hash algorithm only possible with
- Relation S 500 rows 50 rows per disk block
- Relation R 100000 rows 50 rows per disk block
42Join - Block Nested Loop Inner file pass 1
S Outer File Pass.Block Buffers Pass.Block
R Inner File
1.1 2.1 1.2 2.2 1.3 2.3
1.1 1.2
Join col values equal
43Join - Block Nested LoopInner file pass 2
S Outer File Pass.Block Buffers Pass.Block
R Inner File
1.1 2.1 1.2 2.2 1.3 2.3
1.1 1.2
Join col values equal
44Join - Block Nested Loop - Cost
- Cost
- No. Outer Loop Blocks No. Inner Loop blocks
No.
Outer Loop Blocks - Use smallest table as outer loop
- i.e. (500/50 100000/50 ) 500/50 20010
- Reduce cost of inner loop by having -
- index built on inner loop join column which is
also a Primary Key - lookup avoiding full file
scan assume needs 1 disk read per row in outer
file - (500 / 50 500 ) 510
45Equi Join - Hash example
Relation R 91 1 40 900
Disk Hash Buckets R 0 1
Disk Hash Buckets S 0 1
Relation S 40 101 90 1
1000 91
Value allocated to 0 if even else allocated to
1 if odd
46Equi Join - Hash example
Relation R 91 1 40 900
Disk Hash Buckets R 0 40 900
1 1 91
Disk Hash Buckets S 0 40 90 1000 1
101 1 91
Relation S 40 101 90 1
1000 91
Value allocated to 0 if even else allocated to
1 if odd
47Equi Join - Hash cost
- Only need to compare record values from matching
hash buckets to find which records need joining.
Assumes mechanism to retrieve required data rows. - Cost
- Read files in for hashing Write out to allocated
hash bucket Read in each pair of buckets for the
search for matching join values. - 3 ( No Blocks of R No of blocks of S)
- 3 ( 2000 10 ) 6030
48Join before restrict Example
S(sno,sname,status,scity) 100 rows 20 rows/disk
block SP(sno,pno,qty) 10000 rows 100/disk
block 50 with pno p2 select from S,SP where
pno p2 and S.sno SP.sno Approach 1 Cost
in Block I/O restrict write out after
restrict 1 join nested loop
5 (1005) 505 S SP Total 506
49Restrict before join Example
Approach 2 Cost in Block I/O write out
after join 1 join nested
loop in RAM full scan of S
5 restrict full scan of
SP 100 S SP Total 106
50Three table join query
branch(b_name, b_city, manager, b_address) 100
rows account(b_name, a_number, balance)
10 000 rows account_holder(h_name,a_number,h_addre
ss) 50 000 rows select h_name, balance from
account a, account_holder h, branch b where
a.a_number h.a_number and a.b_name
b.b_name and b.city Leicester and balance gt
10000
51Optimised query plan
project(h_name,balance)
join (a_number) project(a_number,balance)
join (b_name) project (b_name)
project(h_name,a_number) restrict (b_city
Leicester) restrict(balancegt10000) branc
h account account_holder
52Three table join query plan justification
53Three table join query plan justification
- Projects and restricts distributed past joins
- join branch and account as pair of smallest
tables and therefore assume that this will
provide the smallest result set.
54Oracle query demo tables
Tables arel ( col1 number(3) constraint pk_arel
primary key, col2 number(2), col3
varchar(400)) Btree index col1 col2 900
rows brel (col1 number(3) constraint pk_brel
primary key, col2 number(2)) Btree index col1
col2 900 rows
55Data distribution arel (1)
Check data distribution of arel by - select
colX , count() from arel group by colX for X
1 i.e col1 get 900 groups with count of 1
thus 900 unique values no duplicates for X 3
i.e. col3 get 1 group with count of 900
thus 1 value duplicated 900 times Both above
have a uniform distribution.
56Data distribution arel (2)
for X 2 i.e col2 get - COL2 COUNT() COL2
COUNT() 1 120 8 60 2
120 9 30 3 120 10
30 4 120 15 30 5 60 20
30 6 60 29 30 7
60 30 30 A non uniform distribution of
values in col2
57Analysis of arel table to provide Oracle with
statistics
analyze table arel compute statistics
COLUMN_NAME NUM_DISTINCT NUM_NULLS
NUM_BUCKETS COL1 900 0
1 COL2 14 0 1 COL3
1 0 1 As col2is known
to have a non uniform distribution best to
generate a histogram i.e need more than 1 bucket
58Analysis of arel col2
analyze table arel compute statistics for columns
col2 Now show endpoint data generated using a
query against a system table NUMBER
VALUE NUMBER VALUE 120 1 720
8 240 2 750 9 360 3 780
10 480 4 810 15 540 5 840
20 600 6 870 29 660
7 900 30
59Query execution plan
- Oracle uses the explain plan for select
statement to generate a query plan - Results stored in a table generated in your
schema by running a script utlxplan - Results are displayed from the table by running
the script utlxpls
60Oracle query Restrict Project via indexed PK
col1
Restrict Project explain plan for select
col1,col2 from arel where col1 56 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 1 39 2 TABLE
ACCESS BY INDEX ROWID AREL 1 39 2
INDEX UNIQUE SCAN PK_AREL 1 1
Query buffer reads 3
61Oracle query Restrict Project via non indexed
col3
Restrict Project explain plan for select
col1,col2 from arel where col3 x start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 900 200K 35 TABLE
ACCESS FULL AREL 900 200K 35 Query
buffer reads 229
62Oracle query Restrict Project via indexed col2
(1)
Restrict Project explain plan for select
col1,col2 from arel where col2 30 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 30 1K 31 TABLE
ACCESS BY INDEX ROWID AREL 30 1K 31
INDEX RANGE SCAN AREL_COL2 30
1 Query buffer reads 34
63Oracle query Restrict Project via indexed col2
(2)
Restrict Project explain plan for select
col1,col2 from arel where col2 1 start utlxpls
/ Operation Name Rows Bytes Cost SELECT
STATEMENT 120 4K 35 TABLE ACCESS
FULL AREL 120 4K 35 Query buffer
reads 236 Why has it changed just because
using a different row value?
64Oracle query Project Join all rows arel brel
Project Join explain plan for select count()
from arel a, brel b where a.col1 b.col1 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 1 23 2SORT
AGGREGATE 1 23 NESTED LOOPS
900 20K 2
INDEX FAST FULL SCAN PK_AREL 900 11K
INDEX UNIQUE SCAN PK_BREL 900 8K
Query buffer reads 912
65Oracle query Restrict Join via indexed cols (1)
Restrict Join Query buffer reads 4 explain
plan for select count() from arel a, brel b
where a.col1 b.col1 and a.col1 500 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 1 36 1 SORT
AGGREGATE 1 36 NESTED LOOPS 1 36 1
INDEX UNIQUE SCAN PK_BREL 1 10 1 INDEX
UNIQUE SCAN PK_AREL 1 26 1
66Oracle query Restrict Join via indexed cols (2)
explain plan for Query buffer reads 64 select
count() from arel a, brel b where a.col1
b.col1 and a.col2 10 Operation Name
Rows Bytes Cost SELECT STATEMENT 1 36
31 SORT AGGREGATE 1 36 NESTED LOOPS 30
1K 31 TABLE ACCESS BY INDEX
ROWID AREL INDEX RANGE SCAN AREL_COL2
30 1 INDEX UNIQUE SCAN PK_BREL 900
8K
67Oracle query Restrict Join via indexed cols (3)
Restrict Join Query buffer reads
351 explain plan for select count() from arel
a, brel b where a.col1 b.col1 and a.col2
1 Operation Name Rows Bytes Cost SELECT
STATEMENT 1 36 35 SORT AGGREGATE
1 36 NESTED LOOPS 120 4K 35
TABLE ACCESS FULL AREL 120 3K 35 INDEX
UNIQUE SCAN PK_BREL 900 8K
68Oracle query Restrict Project via indexed col2
(3)
Restrict Project explain plan for select
col1,col2 from brel where col2 30 start
utlxpls / Operation Name Rows Bytes
Cost SELECT STATEMENT 30 300 2 TABLE
ACCESS FULL BREL 30 300 2 Query buffer
reads 18 Previous query against
arel used indexed access. Why not for brel?
69Oracle query Restrict Project via indexed col2
(4)
analyze table brel delete statistics explain
plan for select col1,col2 from brel where col2
1 start utlxpls / Operation Name Rows
Bytes Cost SELECT STATEMENT TABLE ACCESS BY
INDEX ROWID BREL INDEX RANGE
SCAN BREL_COL2 Now no statistics
so chooses index access
70Optimiser estimation methods
- Hard coded values in the optimiser algorithm
- e.g restriction 1 rows on , 10 on gt
- No consideration of actual data values
- Statistics based on number of different distinct
values in column - assumes uniform distribution
- Statistics based on histogram of count of unique
values.
71Importance of estimates
- The use of an index is only efficient if less
than 20 of the data is to be returned. - Each row access may require a separate disk read.
Thus same disk block may be retrieved more than
once if buffer space limited. More efficient to
execute a full table scan. - Cost of statistics collection needs to be traded
off against improved performance of queries based
on more accurate estimates.
72Database Implementations
73Relational Databases
- The Conceptual model has become the Physical
model - Row based storage
- No ordering of data
- Row based searches
- Very many comparisons is required to find and
manipulate data
74Relational Databases/APL
- Current APL/K/J databases
- all relational
- all Column stores
- kdb
- Vstar
- flipdb
- TakeCare
75Relational Databases/APL
- The relational model holds
- Row stores are implemented with great engineering
skills - Use them if you can
76Relational Databases/APL
- but
- The physical laws still apply
- Some problems might need different solutions
77Semistructured data
78title year length filmType studioName starName
Star Wars 1977 124 color Fox Carrie Fisher
Star Wars 1977 124 color Fox Mark Hamill
Star Wars 1977 124 color Fox Harrison Ford
Mighty Ducks 1991 104 color Disney Emilio Estevez
Wayne's World 1992 95 color Paramount Dana Carvey
Wayne's World 1992 95 color Paramount Mike Meyers
79Root
star
movie
star
starIn
mh
sw
cf
starOf
year
name
title
city
address
address
name
street
Star Wars
1977
Mark Hamill
Oak
street
Brentwood
Carrie Fisher
street
city
city
starOf
Hollywood
Maple
Locust
Malibu
starIn
80Multidimensional data
81(No Transcript)