Title: Advanced Database Management Presentation 1 :Tzavala Polina EYO 617
1 Advanced Database Management: Presentation 1
Tzavala Polina EYO 617
- Capability-Sensitive Query Processing on Internet Sources, by Hector Garcia-Molina, Wilburt Labio, Ramana Yerneni
2 The problem
- Efficiently generating capability-sensitive plans for selection queries over Internet data sources.
- How does this differ from the traditional query optimization problem?
- 1. Large heterogeneity of Internet sources.
- 2. Selection queries.
- 3. Focus on efficiency:
- Slow response time of many Internet sources.
- Low bandwidth of many connections.
3 Example Query in Amazon
- Amazon forms do not support disjunctions.
- Full query in DNF:
- (author = "Sigmund Freud" AND title contains "dreams") OR (author = "Carl Jung" AND title contains "dreams")
- SubQuery 1:
- author = "Sigmund Freud" AND title contains "dreams"
- SubQuery 2:
- author = "Carl Jung" AND title contains "dreams"
- Query in CNF:
- (author = "Sigmund Freud" OR author = "Carl Jung") AND (title contains "dreams")
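The DNF decomposition above can be sketched in Python. Everything here is an assumption made for illustration: the catalog rows, the `conjunctive_source` function, and `run_dnf` are hypothetical stand-ins, not Amazon's real interface. The mock source accepts only conjunctions, so the mediator sends one source query per DNF term and unions the results.

```python
# Hypothetical in-memory catalog standing in for the remote source.
CATALOG = [
    {"author": "Sigmund Freud", "title": "The Interpretation of Dreams"},
    {"author": "Sigmund Freud", "title": "Totem and Taboo"},
    {"author": "Carl Jung", "title": "Dreams"},
    {"author": "Carl Jung", "title": "Man and His Symbols"},
]


def conjunctive_source(author, title_word):
    """Mock source: supports only `author = X AND title contains Y`."""
    return [
        row for row in CATALOG
        if row["author"] == author and title_word.lower() in row["title"].lower()
    ]


def run_dnf(terms):
    """Mediator side: one source query per DNF term, then a duplicate-free union."""
    seen, result = set(), []
    for author, title_word in terms:
        for row in conjunctive_source(author, title_word):
            key = (row["author"], row["title"])
            if key not in seen:
                seen.add(key)
                result.append(row)
    return result


# SubQuery 1 and SubQuery 2 from the slide, combined at the mediator.
books = run_dnf([("Sigmund Freud", "dreams"), ("Carl Jung", "dreams")])
```

Only the matching entries cross the network in this plan, which is the point of the comparison with the CNF plan on the following slides.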
4 Systems Inefficiencies
- How would existing systems respond to this query?
- 1. System R, Ingres, DB2, and NonStop SQL would send (through a wrapper) the full unsupported query to the Amazon source.
- Problem: They assume that sources have full relational capabilities.
- 2. The Information Manifold and the TSIMMIS systems would convert the query to CNF.
- Problem: Although they do take source capabilities into account, they consider only limited options.
5 Systems Inefficiencies
- How would existing systems respond to this query?
- 3. Garlic would send the second clause and apply the first one itself. This plan extracts over 2,000 entries from Amazon, instead of 9.
- Problem: Garlic always processes query conditions in CNF. It retrieves too much data for processing.
- 4. DISCO considers only plans in which the source supports the whole query or none of it; it does not split the target query condition into parts.
- Problem: It considers only limited options.
6 Challenge for a better solution: Overview
- Rather than blindly working in DNF or CNF, a query processing system that deals with Internet sources must
- carefully consider the space of available options, and
- select an efficient one that is supported by the source.
- 1. Framework for describing source capabilities and query plans.
- 2. Scheme for generating efficient feasible plans for selection queries.
- 3. GenCompact architecture.
- 4. GenCompact efficiency analysis.
7 1. Framework for describing source capabilities
- Heterogeneity of Internet sources
- Standardised Internet interfaces (i.e., forms) and Internet DB schemas impose strict query limitations:
- 1. On the conditions that can be specified.
- 2. On the number of conditions in the selection condition expression.
- 3. On the structure of the condition expression (e.g., conjunctive queries).
8 Describing Source Capabilities
- Is a source query supported by the source?
- Source capability description language based on context-free grammars (CFGs).
- A parser checks the supportability of a query against the capability description.
- Check(C): Given a condition expression C and a source R, it returns the set of attributes that can be exported by the source when evaluating the condition expression.
- The source query SP(C, A, R) (select with condition C and project onto attributes A at source R) is supported by R if A ⊆ Check(C).
9 Simple Source Description Language
- Example: database of cars for sale
- Attributes: make, model, year, color, price
- _s → _s1 | _s2
- _s1 → make = m ∧ price < p
- _s2 → make = m ∧ color = c
- attributes _s1: make, model, year, color
- attributes _s2: make, model, year
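A minimal sketch of the Check function for the car source follows, under loud simplifications: instead of a real CFG parser, each rule is reduced to the set of (attribute, operator) atoms it accepts, so atom order is ignored. The names `RULES`, `check`, and `is_supported` are hypothetical and chosen only for this illustration.

```python
# Each rule: the (attribute, operator) atoms of one supported conjunction,
# plus the attributes the source can export when that rule matches.
RULES = {
    "_s1": ({("make", "="), ("price", "<")}, {"make", "model", "year", "color"}),
    "_s2": ({("make", "="), ("color", "=")}, {"make", "model", "year"}),
}


def check(condition):
    """Return the attributes exportable for a conjunctive condition.

    `condition` is a list of (attribute, operator, constant) atoms. An empty
    result means that no rule matches, i.e. the condition is not supported.
    """
    shape = {(attr, op) for attr, op, _ in condition}
    exported = set()
    for pattern, attrs in RULES.values():
        if shape == pattern:
            exported |= attrs
    return exported


def is_supported(condition, projected):
    """SP(C, A, R) is supported if C matches a rule and A is a subset of Check(C)."""
    exported = check(condition)
    return bool(exported) and set(projected) <= exported


cheap_bmw = [("make", "=", "BMW"), ("price", "<", 40000)]
red_bmw = [("make", "=", "BMW"), ("color", "=", "red")]
```

With these definitions, `is_supported(cheap_bmw, ["model", "color"])` holds, while `is_supported(red_bmw, ["color"])` does not, because rule `_s2` does not export `color`.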
10 Framework for Describing Query Plans: Query trees
Target queries are submitted to a mediator that generates and executes source-capability-sensitive query plans.
The leaf nodes represent SP queries, called source queries, that are actually sent to R.
11 Query trees
- The non-leaf nodes represent selection (S), projection (P), intersection and union operations that are performed at the mediator to combine the results of the source queries.
- The non-leaf nodes can also represent a Choice operator.
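A query tree of this kind can be sketched as a small set of Python classes. This is an illustration under assumptions, not the mediator described in the paper: the class names, the `row` helper, and the sample rows (including the `price` attribute) are all hypothetical, and the Choice operator is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


def row(**attrs):
    """Hashable row: a tuple of (attribute, value) pairs sorted by attribute."""
    return tuple(sorted(attrs.items()))


@dataclass
class SourceQuery:
    """Leaf: an SP query sent to the source (mocked here by a callable)."""
    fetch: Callable[[], list]

    def run(self):
        return set(self.fetch())


@dataclass
class Union:
    children: list

    def run(self):
        return set().union(*(c.run() for c in self.children))


@dataclass
class Intersect:
    children: list  # assumed non-empty

    def run(self):
        return set.intersection(*(c.run() for c in self.children))


@dataclass
class Select:
    predicate: Callable[[dict], bool]
    child: object

    def run(self):
        return {r for r in self.child.run() if self.predicate(dict(r))}


@dataclass
class Project:
    attrs: List[str]
    child: object

    def run(self):
        return {tuple((k, v) for k, v in r if k in self.attrs)
                for r in self.child.run()}


# Two source queries combined at the mediator; the rows are made up.
freud = SourceQuery(lambda: [row(author="Sigmund Freud", title="The Interpretation of Dreams", price=12)])
jung = SourceQuery(lambda: [row(author="Carl Jung", title="Dreams", price=30)])
plan = Project(["title"], Select(lambda r: r["price"] < 20, Union([freud, jung])))
titles = plan.run()
```

Only the leaves touch the source; every other node runs at the mediator on results that have already been transferred.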
12 2. Scheme for generating efficient feasible plans: The GenModular Architecture
13 GenModular: The rewrite module
- 1. The rewrite module
- Considers various rewritings of the target query condition.
- Applies commutative, associative and distributive transformations to condition expressions.
- Produces a set of CTs (condition trees) that represent condition expressions equivalent to the target query condition.
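One of these transformations, the distributive law, can be sketched as follows. This is only an illustration: conditions are assumed to be binary trees written as nested tuples `(op, left, right)` with strings as atomic conditions, and the function name `to_dnf` is hypothetical. The actual rewrite module produces a whole set of equivalent CTs rather than a single normal form.

```python
def to_dnf(cond):
    """Push AND below OR using the distributive law (binary trees only)."""
    if isinstance(cond, str):          # atomic condition
        return cond
    op, left, right = cond
    left, right = to_dnf(left), to_dnf(right)
    if op == "and":
        # (a OR b) AND c  ->  (a AND c) OR (b AND c)
        if isinstance(left, tuple) and left[0] == "or":
            return ("or",
                    to_dnf(("and", left[1], right)),
                    to_dnf(("and", left[2], right)))
        # a AND (b OR c)  ->  (a AND b) OR (a AND c)
        if isinstance(right, tuple) and right[0] == "or":
            return ("or",
                    to_dnf(("and", left, right[1])),
                    to_dnf(("and", left, right[2])))
    return (op, left, right)


# The CNF form of the example book query, rewritten into its DNF form.
cnf = ("and", ("or", "author=Freud", "author=Jung"), "title~dreams")
dnf = to_dnf(cnf)
```

Applied to the CNF form of the book query, this yields the two conjunctive terms that a forms-based source can answer directly.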
14 GenModular: The mark module
- 2. The mark module
- Identifies parts of the condition that can be answered by the source. Calls the Check function on every subtree of the CT.
- Check(): the set of attributes returned by the source when evaluating a part of the CT.
- n.export() records the results of Check().
- Check() is called on all nodes.
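The marking pass can be sketched as a tree traversal that stores the result of Check on every node. The `Node` class, the `mark` and `supported_nodes` functions, and especially `toy_check` are assumptions made for this sketch; `toy_check` models a hypothetical source that rejects any subtree containing a disjunction.

```python
class Node:
    """A condition-tree node; `export` is filled in by the marking pass."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.export = None


def mark(node, check):
    """Call `check` on every subtree and record the result on its root node."""
    node.export = check(node)
    for child in node.children:
        mark(child, check)


def supported_nodes(node):
    """Labels of the nodes whose subtree the source can evaluate (preorder)."""
    found = [node.label] if node.export else []
    for child in node.children:
        found += supported_nodes(child)
    return found


def toy_check(node):
    """Hypothetical source: no subtree containing a disjunction is supported."""
    if node.label == "OR" or any(not toy_check(c) for c in node.children):
        return set()
    return {"author", "title"}


tree = Node("OR", [
    Node("AND", [Node("author = Freud"), Node("title contains dreams")]),
    Node("AND", [Node("author = Jung"), Node("title contains dreams")]),
])
mark(tree, toy_check)
```

After marking, the root carries an empty export set, while both AND subtrees and all leaves are marked as answerable by the source.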
15 GenModular: The generate module
- 3. The generate module
- Produces the set of feasible plans by invoking EPG on each node of the CTs.
- EPG (Exhaustive Plan Generator): Combines the source queries for the parts into query plans for the target query.
- Pure plans: evaluated completely at the source.
- Impure plans: evaluated partly remotely at the source and partly locally by the mediator.
- 4. The cost module chooses the least expensive among these plans.
16 GenCompact
- GenCompact versus GenModular
- 1. Intelligent plan generation: Reduces the number of CTs that need to be processed. Integrates the mark, generate and cost modules.
- 2. Pruning techniques: By using the cost model, it identifies strategies that significantly reduce the plan space explored without pruning the optimal plan.
17 GenCompact: The rewrite module
- GenCompact's rewrite module
- 1. Uses fewer rewrite rules and produces fewer CTs. The associativity and copy rules are not used.
- Result: Reduced costs.
- 2. Rewrites the CFG source description, and parses with a larger set of rules. The Check() function exports more variables, but the final query has to be fixed before being executed.
- The rewriting is done only when the source is integrated into the system.
- The parser contains more CFG rules and is more complex.
- Result: No cost increase.
18 GenCompact: Cost Model
- Given a query plan that uses a set of source queries SQ, the cost of the plan is
- $\sum_{sq \in SQ} \left(k_1 + k_2 \cdot (\text{result size of } sq)\right)$
- $k_1$: the overhead of the messages over the Internet and the overhead of starting a query at source R.
- This part depends on the number of source queries.
- $k_2$: the cost per tuple of computing the answer at source R and the cost per byte of transferring the answer over the network.
- This part depends on the size of the results of each query.
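The cost model can be evaluated with a few lines of Python. The constants `K1` and `K2` and the split of the 9 matching entries into 5 and 4 are hypothetical values picked for illustration; only the totals (over 2,000 entries versus 9) come from the Amazon example earlier in the slides.

```python
def plan_cost(result_sizes, k1, k2):
    """Cost of a plan: a fixed overhead k1 per source query plus k2 per result unit."""
    return sum(k1 + k2 * size for size in result_sizes)


K1, K2 = 100, 1  # hypothetical per-query overhead and per-entry cost

# CNF-style plan: one source query that returns about 2,000 entries.
cnf_cost = plan_cost([2000], K1, K2)

# DNF-style plan: two source queries returning 9 entries in total
# (the 5 + 4 split is an assumption).
dnf_cost = plan_cost([5, 4], K1, K2)
```

With these assumed constants the DNF plan is cheaper even though it pays the per-query overhead twice; with a much larger `K1` relative to `K2` the comparison could go the other way, which is why the plan generator consults the cost model rather than fixing a normal form.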
19 GenCompact: Plan Generation
- GenCompact limits the space of alternative plans explored by using heuristic pruning rules.
- The pruning rules reduce the number of plans explored without affecting the optimality of the result.
20 GenCompact: Plan Generation
- Pruning Rule 1
- Prune impure plans when a pure plan exists.
- A pure plan processes the target query entirely at the source (no mediator postprocessing is required).
- If the pure plan is feasible, no impure plan need be generated, because under our cost model it will never be cheaper than the pure plan.
- Impure plans use at least as many source queries and transfer at least as much data as the pure plan.
21 GenCompact: Plan Generation
- Pruning Rule 2
- Prune locally sub-optimal plans. To prune impure plans for a target query SP(n, A, R):
- Generate plans for the sub-queries and combine them to form the target query plans. When considering plans to be combined, it is safe to prune away all but the cheapest plan for each sub-query.
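Pruning Rule 2 amounts to keeping one minimum-cost plan per sub-query before plans are combined. In the sketch below the function name, the plan names, and the costs are all hypothetical.

```python
def cheapest_per_subquery(candidates):
    """Keep only the cheapest (plan_name, cost) pair for each sub-query.

    `candidates` maps a sub-query label to a non-empty list of
    (plan_name, cost) pairs.
    """
    return {sq: min(plans, key=lambda plan: plan[1])
            for sq, plans in candidates.items()}


# Hypothetical candidate plans for two sub-queries of one target query.
candidates = {
    "c1 AND c2": [("plan_a", 120), ("plan_b", 95), ("plan_c", 300)],
    "c3": [("plan_d", 40), ("plan_e", 55)],
}
best = cheapest_per_subquery(candidates)
combined_cost = sum(cost for _, cost in best.values())
```

Combining only the survivors means the combination step sees one plan per sub-query instead of every pairing of candidates.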
22 GenCompact: Plan Generation
- Pruning Rule 3
- Prune dominated plans for queries with conjunctive conditions.
- Example
- P: plan for the target query SP(c1 ∧ c2 ∧ c3 ∧ c4, A, R).
- P1: plan for the sub-query SP(c1 ∧ c2 ∧ c3, A, R).
- P2: plan for the sub-query SP(c1 ∧ c2, A, R).
- If P1 is cheaper than P2, P1 dominates P2.
- There is no need to consider P2 when combining plans to form the target query plan. The cost of any plan P for the target query that uses P2 can always be lowered by replacing P2 with P1 in P.
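The dominance test in this example can be sketched as follows, assuming that each sub-query plan is summarised by the set of conjuncts it covers and by its cost. The function name `prune_dominated` and the cost figures are hypothetical.

```python
def prune_dominated(plans):
    """Drop plans dominated by a cheaper plan that covers strictly more conjuncts.

    `plans` is a list of (name, covered_conditions, cost) triples, where
    `covered_conditions` is a set of condition labels.
    """
    kept = []
    for name, covered, cost in plans:
        dominated = any(
            other_covered > covered and other_cost < cost
            for _, other_covered, other_cost in plans
        )
        if not dominated:
            kept.append(name)
    return kept


plans = [
    ("P1", {"c1", "c2", "c3"}, 10),  # covers more conditions and is cheaper
    ("P2", {"c1", "c2"}, 12),        # dominated by P1
    ("P3", {"c4"}, 3),               # covers a different condition, so it stays
]
survivors = prune_dominated(plans)
```

Here `P2` is dropped because `P1` handles a superset of its conditions at a lower cost, matching the slide's argument that replacing P2 with P1 can only lower the cost of the overall plan.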
23 GenCompact: Plan Generation
- For each CT produced by the rewrite module:
- It considers all the plans of the CT.
- It applies the copy and association rules to find more plans that can be obtained.
- It generates a canonical tree.
- It calls IPG on each canonical CT.
- IPG implements the Minimum Cost Set Cover algorithm to combine the best feasible sub-plans at each node.
- It generates only a single query plan for the CT.
- The overall best plan is chosen.
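The set-cover step can be illustrated with an exhaustive minimum-cost set cover in Python. This is a sketch under assumptions, not the IPG procedure itself: each candidate sub-plan is assumed to cover a subset of a node's child conditions, and the names, covered sets, and costs below are hypothetical.

```python
from itertools import combinations


def min_cost_set_cover(universe, candidates):
    """Exhaustively find the cheapest group of candidates covering `universe`.

    `candidates` is a list of (name, covered_set, cost) triples. Returns
    (best_cost, names), or (None, None) if no group covers the universe.
    """
    target = set(universe)
    best_cost, best_names = None, None
    for r in range(1, len(candidates) + 1):
        for combo in combinations(candidates, r):
            covered = set().union(*(c[1] for c in combo))
            if covered >= target:
                cost = sum(c[2] for c in combo)
                if best_cost is None or cost < best_cost:
                    best_cost, best_names = cost, [c[0] for c in combo]
    return best_cost, best_names


# Hypothetical sub-plans for a node with three child conditions.
candidates = [
    ("sq12", {"c1", "c2"}, 5),
    ("sq3", {"c3"}, 2),
    ("sq123", {"c1", "c2", "c3"}, 9),
    ("sq23", {"c2", "c3"}, 3),
    ("sq1", {"c1"}, 6),
]
best_cost, best_names = min_cost_set_cover({"c1", "c2", "c3"}, candidates)
```

The loop visits every non-empty subset of the candidates, so the running time is exponential in the number of candidate sub-plans; that is the price of the exhaustive approach.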
24 Performance Evaluation: Overview
- T: number of CTs.
- W: complexity of the plan generation for each CT, comprising
- the cost of parsing (calls to the Check function), and
- the cost of the EPG and IPG calls.
- $T_{GM} \gg T_{GC}$
- $W_{GC} \gg W_{GM}$
- GenModular: more CTs.
- GenCompact: fewer CTs BUT much higher cost of processing a CT.
25 Performance Evaluation
26 Performance Evaluation: Framework
- CT: the target query condition, a balanced F-ary tree.
- H: height of the CT.
- |CT|: number of nodes in the CT.
- Cq: the selection condition of the target query q.
- Aq: the projected attributes of the target query q.
27 Performance Evaluation
- $|CT|^2$: the parsing cost. Each of the $|CT|$ nodes is parsed, for the subtree rooted there, in $O(|CT|)$ time.
- $|CT| \cdot 2^{|Att(C) - A|} \cdot 2^F$: the cost of the EPG calls. Procedure EPG is called on each node at least once.
- $2^{|Att(C) - A|}$: how many times EPG can be called on each node with different attributes. The Aq attributes are always included any time EPG is called.
- $2^F$: the actual cost of each EPG call, covering each of the possible subsets of the children of a node.
28 Performance Evaluation
- $2^F$ additional parser calls. Each subset of the children of a non-leaf node constitutes a condition that needs to be parsed.
- Each IPG invocation may take $O(2^{2^F})$ time because it needs to solve the MCSC (Minimum Cost Set Cover) problem using an exhaustive algorithm.
29 Performance Evaluation
GenCompact is much more efficient than GenModular. A simple DNF target query condition with 3 terms and 3 atomic conditions per term leads to a ratio of 224.