Principles of Query Processing - PowerPoint PPT Presentation

About This Presentation
Title:

Principles of Query Processing

Description:

... result = R * selectivity. Scan. Clustered index: Good. Non-clustered index: Good for low selectivity. Worse than scan for high selectivity. SELECT * FROM Reserves ... – PowerPoint PPT presentation

Number of Views:44
Avg rating:3.0/5.0
Slides: 56
Provided by: hhp7
Category:

less

Transcript and Presenter's Notes

Title: Principles of Query Processing


1
Principles of Query Processing
CS5226 Week 5
  • Pang Hwee Hwa
  • School of Computing, NUS

2
ApplicationProgrammer(e.g., business
analyst, Data architect)
Application
SophisticatedApplicationProgrammer(e.g., SAP
admin)
Query Processor
Indexes
Storage Subsystem
Concurrency Control
Recovery
DBA,Tuner
Operating System
HardwareProcessor(s), Disk(s), Memory
3
Overview of Query Processing
Database
Statistics
Cost Model
Query Optimizer
Query Evaluator
QEP
Parsed Query
Parser
High Level Query
Query Result
4
Outline
  • Processing relational operators
  • Query optimization
  • Performance tuning

5
Projection Operator
  • ?R.attrib, .. (R)
  • Implementation is straightforward

SELECT bid FROM Reserves R WHERE R.rname lt
C
6
Selection Operator
  • ?R.attr op value (R)
  • Size of result R selectivity
  • Scan
  • Clustered index Good
  • Non-clustered index
  • Good for low selectivity
  • Worse than scan for high selectivity

SELECT FROM Reserves R WHERE R.rname lt
C
7
Example of Join
SELECT FROM Sailors R, Reserve S WHERE
R.sidS.sid
8
Notations
  • R number of pages in outer table R
  • R number of tuples in outer table R
  • S number of pages in inner table S
  • S number of tuples in inner table S
  • M number of main memory pages allocated

9
Simple Nested Loop Join
R
S
Tuple
R tuples
10
Simple Nested Loop Join
  • Scan inner table S per R tuple R S
  • Each scan costs S pages
  • For R tuples
  • R pages for outer table R
  • Total cost R R S pages
  • Not optimal!

11
Block Nested Loop Join
R
S
M 2 pages
?R / (M 2)? blocks
12
Block Nested Loop Join
  • Scan inner table S per block of (M 2) pages of
    R tuples
  • Each scan costs S pages
  • R / (M 2) blocks of R tuples
  • R pages for outer table R
  • Total cost R ?R / (M 2)? S pages
  • R should be the smaller table

13
Index Nested Loop Join
R
S
Index
Tuple
R tuples
14
Index Nested Loop Join
  • Probe S index for matching S tuples per R tuple
  • Probe hash index 1.2 I/Os
  • Probe B tree 2-4 I/Os, plus retrieve matching S
    tuples 1 I/O
  • For R tuples
  • R pages for outer table R
  • Total cost R R index retrieval
  • Better than Block NL join only for small number
    of R tuples

15
Sort Merge Join
  • External sort R
  • External sort S
  • Merge sorted R and sorted S

16
External Sort R
(m-1)-way merge
Size of R0,i M, R0,is ?R/M?
merge passes ?logM-1 R/M? Cost per pass
R input R output 2 R Total cost 2
R (?logM-1 R/M? 1) including split pass
17
Sort Merge Join
  • External-sort R 2 R (?logM-1 R/M? 1)
  • Split R into R/M sorted runs each of size M 2
    R
  • Merge up to (M 1) runs repeatedly
  • ?logM-1 R/M? passes, each costing 2 R
  • External-sort S 2 S (?logM-1 S/M? 1)
  • Merge matching tuples from sorted R and S R
    S
  • Total cost 2 R (?logM-1 R/M? 1) 2 S
    (?logM-1 S/M? 1) R S
  • If R lt M(M-1), cost 5 (R S)

18
GRACE Hash Join
S
0 1 2 3
X X X X X X X X X X X X
X X X X X X
X X X X X X
X X X
X X X
X X X
X X X
0
bucketID X mod 4 Join on R.X S.X
1
R S R0 S0 R1
S1 R2 S2
R3 S3
R
2
3
19
GRACE Hash Join Partition Phase
R ? (M 1) partitions, each of size R / (M 1)
20
GRACE Hash Join Join Phase
Partition must fit in memory R / (M 1) lt M -1
21
GRACE Hash Join Algorithm
  • Partition phase 2 (R S)
  • Partition table R using hash function h1 2 R
  • Partition table S using hash function h1 2 S
  • R tuples in partition i will match only S tuples
    in partition I
  • R ? (M 1) partitions, each of size R / (M
    1)
  • Join phase R S
  • Read in a partition of R (R / (M 1) lt M -1)
  • Hash it using function h2 (ltgt h1!)
  • Scan corresponding S partition, search for
    matches
  • Total cost 3 (R S) pages
  • Condition M gt vfR, f 1.2 to account for hash
    table

22
Summary of Join Operator
  • Simple nested loop R R S
  • Block nested loop R ?R / (M 2)? S
  • Index nested loop R R index retrieval
  • Sort-merge 2 R (?logM-1 R/M? 1) 2 S
    (?logM-1 S/M? 1) R S
  • GRACE hash 3 (R S)
  • Condition M gt vfR

23
Overview of Query Processing
Database
Statistics
Cost Model
Query Optimizer
Query Evaluator
QEP
Parsed Query
Parser
High Level Query
Query Result
24
Query Optimization
  • Given An SQL query joining n tables
  • Dream Map to most efficient plan
  • Reality Avoid rotten plans
  • State of the art
  • Most optimizers follow System Rs technique
  • Works fine up to about 10 joins

SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
25
Complexity of Query Optimization
  • Many degrees of freedom
  • Selection scan versus (clustered, non-clustered)
    index
  • Join block nested loop, sort-merge, hash
  • Relative order of the operators
  • Exponential search space!
  • Heuristics
  • Push the selections down
  • Push the projections down
  • Delay Cartesian products
  • System R Only left-deep trees

26
Equivalences in Relational Algebra
  • Selection
    - cascade

  • - commutative
  • Projection
    - cascade
  • Join
    - associative

  • - commutative

R (S T) (R S) T
(R S) (S R)
27
Equivalences in Relational Algebra
  • A projection commutes with a selection that only
    uses attributes retained by the projection
  • Selection between attributes of the two arguments
    of a cross-product converts cross-product to a
    join
  • A selection on just attributes of R commutes with
    join R S (i.e., (R S)
    (R) S )
  • Similarly, if a projection follows a join R
    S, we can push it by retaining only attributes
    of R (and S) that are needed for the join or are
    kept by the projection

28
System R Optimizer
  • Find all plans for accessing each base table
  • For each table
  • Save cheapest unordered plan
  • Save cheapest plan for each interesting order
  • Discard all others
  • Try all ways of joining pairs of 1-table plans
    save cheapest unordered interesting ordered
    plans
  • Try all ways of joining 2-table with 1-table
  • Combine k-table with 1-table till you have full
    plan tree
  • At the top, to satisfy GROUP BY and ORDER BY
  • Use interesting ordered plan
  • Add a sort node to unordered plan

29
Source Selinger et al, Access Path Selection in
a Relational Database Management System
30
Note Only branches for NL join are shown here.
Additional branches for other join
methods (e.g. sort-merge) are not shown.
Source Selinger et al, Access Path Selection in
a Relational Database Management System
31
What is Cheapest?
  • Need information about the relations and indexes
    involved
  • Catalogs typically contain at least
  • tuples (NTuples) and pages (NPages) for each
    relation.
  • distinct key values (NKeys) and NPages for each
    index.
  • Index height, low/high key values (Low/High) for
    each tree index.
  • Catalogs updated periodically.
  • Updating whenever data changes is too expensive
    lots of approximation anyway, so slight
    inconsistency ok.
  • More detailed information (e.g., histograms of
    the values in some field) are sometimes stored.

32
Estimating Result Size
SELECT attribute list FROM relation list WHERE
term1 AND ... AND termk
  • Consider a query block
  • Maximum tuples in result is the product of the
    cardinalities of relations in the FROM clause.
  • Reduction factor (RF) associated with each termi
    reflects the impact of the term in reducing
    result size
  • Term colvalue has RF 1/NKeys(I)
  • Term col1col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
  • Term colgtvalue has RF (High(I)-value)/(High(I)-Low
    (I))
  • Result cardinality Max tuples product of
    all RFs.
  • Implicit assumption that terms are independent!

33
Cost Estimates for Single-Table Plans
  • Index I on primary key matches selection
  • Cost is Height(I)1 for a B tree, about 1.2 for
    hash index.
  • Clustered index I matching one or more selects
  • (NPages(I)NPages(R)) product of RFs of
    matching selects.
  • Non-clustered index I matching one or more
    selects
  • (NPages(I)NTuples(R)) product of RFs of
    matching selects.
  • Sequential scan of file
  • NPages(R).
  • Note Typically, no duplicate elimination on
    projections! (Exception Done on answers if user
    says DISTINCT.)

34
Counting the Costs
  • With 5 buffers, cost of plan
  • Scan Reserves (1000) write temp T1 (10 pages,
    if we have 100 boats, uniform distribution)
  • Scan Sailors (500) write temp T2 (250 pages, if
    we have 10 ratings).
  • Sort T1 (2102), sort T2 (22504), merge
    (10250), total2300
  • Total 4060 page I/Os
  • If we used BNL join, join cost 104250, total
    cost 2770
  • If we push projections, T1 has only sid, T2
    only sid and sname
  • T1 fits in 3 pages, cost of BNL drops to under
    250 pages, total lt 2000

SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
35
Exercise
  • Reserves 100,000 tuples, 100 tuples per page
  • With clustered index on bid of Reserves, we get
    100,000/100 1000 tuples on 1000/100 10 pages
  • Join column sid is a key for Sailors - at most
    one matching tuple
  • Decision not to push ratinggt5 before the join is
    based on availability of sid index on Sailors
  • Cost Selection of Reserves tuples (10 I/Os)
    for each tuple, must get matching Sailors tuple
    (10001.2) total 1210 I/Os

(On-the-fly)
sname
(On-the-fly)
rating gt 5
(Index Nested Loops,
with pipelining )
sidsid
(Use hash Index on sid)
Sailors
bid100
(Use clustered index on sid)
Reserves
36
Query Tuning
37
Avoid Redundant DISTINCT
  • DISTINCT usually entails a sort operation
  • Slow down query optimization because one more
    interesting order to consider
  • Remove if you know the result has no duplicates

SELECT DISTINCT ssnum FROM Employee WHERE dept
information systems
38
Change Nested Queries to Join
  • Might not use index on Employee.dept
  • Need DISTINCT if an employee might belong to
    multiple departments

SELECT ssnum FROM Employee WHERE dept IN (SELECT
dept FROM Techdept)
SELECT ssnum FROM Employee, Techdept WHERE
Employee.dept Techdept.dept
39
Avoid Unnecessary Temp Tables
  • Creating temp table causes update to catalog
  • Cannot use any index on original table

SELECT INTO Temp FROM Employee WHERE salary gt
40000 SELECT ssnum FROM Temp WHERE Temp.dept
information systems
SELECT ssnum FROM Employee WHERE Employee.dept
information systems AND salary gt 40000
40
Avoid Complicated Correlation Subqueries
  • Search all of e2 for each e1 record!

SELECT ssnum FROM Employee e1 WHERE salary
(SELECT MAX(salary) FROM Employee e2
WHERE e2.dept e1.dept
SELECT MAX(salary) as bigsalary, dept INTO
Temp FROM Employee GROUP BY dept SELECT
ssnum FROM Employee, Temp WHERE salary
bigsalary AND Employee.dept Temp.dept
41
Avoid Complicated Correlation Subqueries
  • SQL Server 2000 does a good job at handling the
    correlated subqueries (a hash join is used as
    opposed to a nested loop between query blocks)
  • The techniques implemented in SQL Server 2000 are
    described in Orthogonal Optimization of
    Subqueries and Aggregates by C.Galindo-Legaria
    and M.Joshi, SIGMOD 2001.

gt 10000
gt 1000
42
Join on Clustering and Integer Attributes
  • Employee is clustered on ssnum
  • ssnum is an integer

SELECT Employee.ssnum FROM Employee,
Student WHERE Employee.name Student.name
SELECT Employee.ssnum FROM Employee,
Student WHERE Employee.ssnum Student.ssnum
43
Avoid HAVING when WHERE is enough
  • May first perform grouping for all departments!

SELECT AVG(salary) as avgsalary, dept FROM
Employee GROUP BY dept HAVING dept information
systems
SELECT AVG(salary) as avgsalary FROM
Employee WHERE dept information systems GROUP
BY dept
44
Avoid Views with unnecessary Joins
  • Join with Techdept unnecessarily

CREATE VIEW Techlocation AS SELECT ssnum,
Techdept.dept, location FROM Employee,
Techdept WHERE Employee.dept Techdept.dept SELE
CT dept FROM Techlocation WHERE ssnum 4444
SELECT dept FROM Employee WHERE ssnum 4444
45
Aggregate Maintenance
  • Materialize an aggregate if needed frequently
  • Use trigger to update

create trigger updateVendorOutstanding on orders
for insert as update vendorOutstanding set amount
(select vendorOutstanding.amountsum(inserted.
quantityitem.price) from inserted,item where
inserted.itemnum item.itemnum ) where vendor
(select vendor from inserted)
46
Avoid External Loops
  • No loop
  • sqlStmt select from lineitem where l_partkey
    lt 200
  • odbc-gtprepareStmt(sqlStmt)
  • odbc-gtexecPrepared(sqlStmt)
  • Loop
  • sqlStmt select from lineitem where l_partkey
    ?
  • odbc-gtprepareStmt(sqlStmt)
  • for (int i1 ilt200 i)
  • odbc-gtbindParameter(1, SQL_INTEGER, i)
  • odbc-gtexecPrepared(sqlStmt)

47
Avoid External Loops
Let the DBMS optimize set operations
  • SQL Server 2000 on Windows 2000
  • Crossing the application interface has a
    significant impact on performance

48
Avoid Cursors
  • No cursor
  • select from employees
  • Cursor
  • DECLARE d_cursor CURSOR FOR select from
    employees
  • OPEN d_cursorwhile (_at__at_FETCH_STATUS 0)
  • BEGIN
  • FETCH NEXT from d_cursorEND
  • CLOSE d_cursor
  • go

49
Avoid Cursors
  • SQL Server 2000 on Windows 2000
  • Response time is a few seconds with a SQL query
    and more than an hour iterating over a cursor

50
Retrieve Needed Columns Only
  • All
  • Select from lineitem
  • Covered subset
  • Select l_orderkey, l_partkey, l_suppkey,
    l_shipdate, l_commitdate from lineitem
  • Avoid transferring unnecessary data
  • May enable use of a covering index.

51
Use Direct Path for Bulk Loading
  • sqlldr directpathtrue controlload_lineitem.ctl
    dataE\Data\lineitem.tbl
  • load data
  • infile "lineitem.tbl"
  • into table LINEITEM append
  • fields terminated by ''
  • (
  • L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
    L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX,
    L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE DATE
    "YYYY-MM-DD", L_COMMITDATE DATE "YYYY-MM-DD",
    L_RECEIPTDATE DATE "YYYY-MM-DD", L_SHIPINSTRUCT,
    L_SHIPMODE, L_COMMENT
  • )

52
Use Direct Path for Bulk Loading
  • Direct path loading bypasses the query engine and
    the storage manager. It is orders of magnitude
    faster than for conventional bulk load (commit
    every 100 records) and inserts (commit for each
    record).

53
Some Idiosyncrasies
  • OR may stop the index being used
  • break the query and use UNION
  • Order of tables may affect join implementation

54
Query Tuning Thou Shalt
  • Avoid redundant DISTINCT
  • Change nested queries to join
  • Avoid unnecessary temp tables
  • Avoid complicated correlation subqueries
  • Join on clustering and integer attributes
  • Avoid HAVING when WHERE is enough
  • Avoid views with unnecessary joins
  • Maintain frequently used aggregates
  • Avoid external loops

55
Query Tuning Thou Shalt
  • Avoid cursors
  • Retrieve needed columns only
  • Use direct path for bulk loading
Write a Comment
User Comments (0)
About PowerShow.com