Title: Speeding up Nested Iteration Exploiting Parameter Sort Orders
1Speeding up Nested Iteration Exploiting Parameter
Sort Orders
- Ravindra N. Guravannavar
- Indian Institute of Technology Bombay
- Advisor
- Prof. S. Sudarshan
2Motivation
- Complex nested queries are commonly used
- Queries using expensive user-defined functions
(UDFs) lead to nested iteration - Queries using nested constructs and UDFs are
sometimes more natural to formulate and easy to
read than equivalent flat queries
3Example 1 A Simple Nested Query
- List the orders none of whose line items have the
shipping address same as the - default shipping address of the order
- SELECT PO.order_id
- FROM PURCHASEORDER PO
- WHERE default_ship_to NOT IN (
- SELECT ship_to
- FROM ORDERITEM OI
- WHERE OI.order_id PO.order_id)
- - Nested Iteration
ORDERITEM
PURCHASEORDER
4Example 2 A Nested Aggregate Query
- List the days on which the sales exceeded the
- maximum daily sales seen in the past
- SELECT day, sales
- FROM DAILYSALES DS1
- WHERE sales gt (SELECT MAX(sales)
- FROM DAILYSALES DS2
- WHERE DS2.day lt DS1.day)
5Example 3 A Query Invoking a UDF
- Find the turn-around time for high priority
orders - SELECT orderid, TurnaroundTime(orderid,
totalprice, orderdate) - FROM ORDERS WHERE order_priorityHIGH
- DEFINE TurnaroundTime(_at_orderid, _at_totalprice,
_at_orderdate) - IF (_at_totalprice lt 2000)
- SELECT max(L.shipdate _at_orderdate) FROM
LINEITEM L - WHERE L.orderid_at_orderid
- ELSE
- SELECT MAX(L.commitdate _at_orderdate)
- FROM LINEITEM L WHERE
L.orderid_at_orderid - END
6Representing Nested Queries
A The Apply Operator Operation between the
outer tuple and result of the inner block
7Evaluation based on Tuple Iteration
- Inner subquery is evaluated for each outer tuple
- Cost Cost(OuterBlock) nCost(InnerBlock)
Where n tuples in the result of outer block - Optimizations Proposed in System R
- Cache the inner subquery result for each distinct
correlation binding - Sort the outer tuples so as to be able to cache a
single result at any given time
8Decorrelated Evaluation
- Rewrite nested query as an equivalent flat query
- Allows the choice of set-oriented evaluation
plans such as hash and merge-join
9Decorrelation Example
- Original Query
- SELECT PO.order_id, PO.order_date
- FROM PURCHASEORDER PO
- WHERE default_ship_to IN ( SELECT ship_to
- FROM ORDERITEM OI
- WHERE OI.order_id PO.order_id)
- Decorrelated Query
- SELECT PO.order_id, PO.order_date
- FROM PURCHASEORDER PO, ORDERITEM OI
- WHERE PO.order_idOI.order_id AND
- PO.default_ship_toOI.ship_to
- Queries are not equivalent when duplicates are
present
10Problems with Decorrelation
- Not always possible
- E.g., NOT IN predicate
- Many cases need duplicate elimination, an extra
outer-join or materialization - IN predicate
- COUNT aggregate in the inner sub-query
- Non-equality correlation predicates
- May not be applicable to UDFs unless their
structure is very simple
11Our Approach
- Optimize nested queries keeping their structure
intact - Exploit properties of parameters (such as sort
order, prior predicates applied etc.) to
efficiently evaluate the inner sub-query - Produces better plans than decorrelation for
certain types of queries - More generic and can be applied to a wider class
of queries (e.g., Queries invoking complex UDFs)
12Benefits of Sorting Outer Tuples
- For ti ? t1, t2, t3, tn do
- inRes Ø
- For ui ? u1, u2, u3, um do
- if (pred1(ti ,ui))
- Add to ui to inRes
- done
- process(ti ,inRes)
- done
- Sorting allows caching of a single inner result
(System R) - Advantageous buffer effects (Graefe)
- A clustered index scan in the inner block will
access each block at most once irrespective of
the buffer replacement policy - State retaining operators
13Benefits of Sorting for a Clustered Index
- Case-1
- Keys 50, 500,400,80,600,200
- Potential data block fetches6
- Assume a single data block
- can be held in memory
- Random I/O
- Case-2
- Keys 50,80,200,400,500,600
- Data block fetches3
- Sequential I/O
400
400
500
600
50
80
200
Data Block-1
Data Block-2
Data Block-3
14Restartable Table Scan
- SELECT PO.order_id
- FROM PURCHASEORDER PO
- WHERE default_ship_to NOT IN (
- SELECT ship_to
- FROM ORDERITEM OI
- WHERE OI.order_id PO.order_id)
PURCHASEORDER
ORDERITEM
15Incremental Computation of Aggregates
- SELECT day, sales
- FROM DAILYSALES DS1
- WHERE sales gt (SELECT MAX(sales)
- FROM DAILYSALES DS2
- WHERE DS2.day lt DS1.day)
- Aggregates that can be computed incrementally
- SUM, COUNT, MAX, MIN, AVG
16Extensions to the Volcano Optimizer
- We propose the following extensions to the
- Volcano optimizer
- Consider interesting parameter sort orders for
queries containing external parameters - Consider the total cost (cost for n evaluations)
for queries executed in a loop
17Extensions to the Volcano Optimizer
- Contract of the original algorithm for
optimization - Plan FindBestPlan(Expr e, PhysProp rpp, Cost cl)
- Contract of the modified algorithm for
optimization - Plan FindBestPlan(Expr e, PhysProp rpp, Cost c,
- Order pso, int callCount)
- Plans generated and cached for lte, rpp, pso,
callCountgt - Not all possible orderings of the parameters are
valid - Parameter Sort Order (a1, a2, an) is valid iff
level(ai) lt level(aj) for all i, j s.t. i lt j. - Not all valid orders may be interesting (we
consider only valid, interesting parameter sort
orders)
18The Basic Idea
A
Required Output Sort Order
Interesting Parameter Sort Order
A
Query Block-1
Binds a, b
Query Block-2
Query Block-3
Binds c Uses a,b
Uses a, b, c
19Sort Order Propagation for a Multi-Level
Multi-Branch Expression
20Search for the Best Plan
- The best plan is computed recursively by adding
the cost of each physical operator to the cost of
the best plans for its inputs and retaining the
cheapest combination. - At the Apply operator, the cost of the inner
(right) sub-expression is obtained for n
iterations where n is the estimated number of
distinct correlation values in the left (bind)
sub-expression - Memorization of best plans is done for
- ltexpr, o/p phy prop, call count, i/p param sort
ordergt - Operator cost function will take call count and
parameter sort order as inputs
21UDFs with Procedural Constructs
- DEFINE fn(p1, p2, pn) AS
- BEGIN
- fnQ1 ltp1, p2gt
- fnQ2 ltp1, p2, p3gt
- if (condition)
- fnQ3ltp2gt
- else
- fnQ4ltp3gt
- OPEN CURSOR ON fnQ5ltp2, p3gt // Binds v1, v2
- LOOP
- fnQ6ltp1, p2, v1, v2gt
- END LOOP
- END
Exactly-Once Query
At-Most-Once Query
Query Inside Cursor Loop
22Query with UDF Modeled as Nested Query
A
A
fnQ1
Qi
fnQ2
fnQ3
fnQ4
fnQ5
fnQ6
23Experiments
Query 1 A NOT IN query with no outer
predicates SELECT PO.order_id FROM
PURCHASEORDER PO WHERE default_ship_to NOT IN
(SELECT ship_to FROM ORDERITEM OI WHERE
OI.order_id PO.order_id)
24Experiments (Contd.)
Query 2 A Nested Aggregate Query with
Non-Equality Corrl. Predicate SELECT day, sales
FROM DAILYSALES DS1 WHERE sales gt (SELECT
MAX(sales) FROM DAILYSALES DS2 WHERE DS2.day lt
DS1.day)
25Experiments (Contd.)
Query 3 TPC-H MIN COST Supplier Query
26Conclusion
- Nested iteration cannot be completely avoided and
may be quite efficient if implemented properly - State-retention techniques can perform better
than decorrelation when the selectivity of the
outer predicates is low - We showed how a cost-based optimizer can be
extended to consider interesting parameter sort
orders
27Future Work
- Analyzing function body to find expected
execution count for each query - Study of parameter properties other than sort
order that would be interesting to nested queries
and functions - Implementing the proposed Optimizer extensions
and studying its performance
28References
- Won Kim. On Optimizing an SQL-like Nested Query.
In ACM Trans on Database Systems, 1982 - Ganski and Wong. Optimization of Nested SQL
Queries Revisited. In ACM SIGMOD, 1987 - U. Dayal. Of Nests and Trees A Unified Approach
to Processing Queries that Contain Nested
Subqueries, Aggregates and Quantifiers, In VLDB,
1987 - P. Seshadri et al. Complex Query Decorrelation.
In ICDE, 1996 - G. Graefe. Executing Nested Queries. In DBTW,
2003 - G. Graefe and McKenna. The Volcano Optimizer
Generator Extensibility and Efficient Search. In
ICDE 1993 - Selinger et al. Access Path Selection in a
Relational Database Management System. In ACM
SIGMOD, 1979 - P. Roy, Multi-Query Optimization and
Applications, Ph.D. Thesis, IIT Bombay, 2000.
29Questions?
30Extra Slides
31Physical Plan Space Generation
- PhysEqNode PhysDAGGen(LogEQNode e, PhyProp p,
ParamSortOrder s) - If a physical equivalence node np exists for e,
p, s - return np
- Create an equivalence node np for e, p, s
- For each logical operation node o below e
- If(o is an instance of ApplyOp)
- ProcApplyNode(o, s, np)
- else
- ProcLogOpNode(o, p, s, np)
- For each enforcer f that generates property p
- Create an enforcer node of under np
- Set the input of of PhysDAGGen(e, null, s)
- return np
- End
32Processing a Non-Apply Node
- void ProcLogOpNode(LogOpNode o, PhysProp
p,ParamSortOrder s, - PhysEqNode np)
- For each algorithm a for o that guarantees p and
- requires no stronger sort order than s
- Create an algorithm node oa under np
- For each input i of oa
- Let oi be the i th input of oa
- Let pi be the physical property required
- from input i by algorithm a
- Set input i of oa PhysDAGGen(oi, pi, s)
- End
33Processing the Apply Node
- void ProcApplyNode(LogOpNode o, ParamSortOrder s,
PhysEqNode np) - Initialize i_ords to be an empty set or sort
orders - For each use expression u under o
- uOrds GetInterestingOrders(u)
- i_ords i_ords Union uOrds
- l_ords GetLocalOrders(i ords, o.bindInput)
- For each order ord in l_ords and null
- leq PhysDAGGen(lop.bindInput, ord, s)
- Let newOrd concat(s, ord)
- applyOp create new applyPhysOp(o.TYPE)
- applyOp.lchild leq
- For each use expression u of o
- ueq PhysDAGGen(u, null, newOrd)
- Add ueq as a child node of applyOp
- np.addChild(applyOp)
- End
34Generating Interesting Parameter Orders
- SetltOrdergt GetInterestingOrders(LogEqNode e)
- if the set of interesting orders i_ords for e
is already found - return i_ords
- Create an empty set result of sort orders
- for each logical operation node o under e
- for each algorithm a for o
- Let sa be the sort order of interest
- to a on the unbound parameters in e
- if sa is a valid order and sa is not
in result - Add sa to result
- for each input logical equivalence
node ei of a - childOrd GetInterestingOrders(ei
) - if (o is an Apply operator AND ei
is a use input) - childOrd GetAncestorOrders(c
hildOrd, o.bindInput) - result result Union childOrd
- return result
- End
35Extracting Ancestor Orders
- SetltOrdergt GetAncestorOrders(SetltOrdergt i_ords,
LogEqNode e) - Initialize l_ords to be an empty set of
sort orders - for each order ord in i_ords
- newOrd Empty vector
- for (i 1 i ltlength(ord) i i
1) - if ordi is NOT bound by e
- append(ordi, newOrd)
- else
- break
- add newOrd to l_ords
- return l_ords
- End
36Extracting Local Orders
- SetltOrdergt GetLocalOrders(SetltOrdergt i_ords,
LogEqNode e) - Initialize l_ords to be an empty set or sort
orders - For each ord in i_ords
- newOrd Empty vector
- For (i length(ord) i gt 0 i i 1 )
- If ordi is bound by e
- prepend(ordi, newOrd)
- Else
- break
- add newOrd to l_ords
- return l_ords
- End
37Decorrelation Example
- Original
- SELECT ename, sal FROM EMP E1 WHERE age lt 40 AND
- sal (SELECT MAX(sal) FROM EMP E2 WHERE
- E2.depidE1.depid)
- Decorrelated
- Temp(C1, C2) SELECT depid as C1, max(sal) as C2
FROM EMP - GROUP BY EMP.depid
- SELECT ename, sal FROM EMP E, Temp T WHERE E.age
lt 40 - AND E.depidT.C1 AND E.salT.C2
38Optimization Steps
- Logical Plan Space LQDAG
- Physical Plan Space PQDAG
39An Example LQDAG
- Expression
- s(R3) Apply pf()(s(R4))
- The function f() uses
- parameters a, b, c