Advanced Database Management Presentation 1: Tzavala Polina EYO 617

Provided by: wimA3


1
Advanced Database Management: Presentation 1
Tzavala Polina EYO 617

Capability Sensitive Query Processing on Internet
Sources
Hector Garcia-Molina, Wilburt Labio, Ramana
Yerneni

2
The problem

Efficiently generating capability-sensitive
plans for selection queries over Internet data
sources.
Differences with the traditional query
optimization problem?
1. Large heterogeneity of Internet sources.
2. Selection queries.
3. Focus on efficiency:
Slow response time of many Internet sources.
Low bandwidth of many connections.

3
Example Query in Amazon

Amazon forms do not support disjunctions.
Full Query in DNF:
(author = 'Sigmund Freud' ^ title contains
'dreams') OR
(author = 'Carl Jung' ^ title contains
'dreams')
SubQuery 1:
author = 'Sigmund Freud' ^ title contains 'dreams'
SubQuery 2:
author = 'Carl Jung' ^ title contains 'dreams'
Query in CNF:
(author = 'Sigmund Freud' OR author = 'Carl
Jung') ^
(title contains 'dreams')
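
The DNF plan above can be sketched in a few lines: one conjunctive source query per term, with the union taken at the mediator. This is only an illustration. The in-memory catalogue, the ids, and the function names `conjunctive_source_query` and `answer_dnf` are assumptions made for the example, not part of the original presentation or of any real Amazon interface.

```python
# Toy catalogue standing in for the source; rows and ids are made up.
BOOKS = [
    {"id": 1, "author": "Sigmund Freud", "title": "The Interpretation of Dreams"},
    {"id": 2, "author": "Carl Jung", "title": "Dreams"},
    {"id": 3, "author": "Carl Jung", "title": "Psychological Types"},
    {"id": 4, "author": "Another Author", "title": "A Book of Dreams"},
]


def conjunctive_source_query(author, title_word):
    """Hypothetical form interface: one author AND one title word, no OR."""
    return {
        book["id"]
        for book in BOOKS
        if book["author"] == author and title_word.lower() in book["title"].lower()
    }


def answer_dnf(terms):
    """Send one source query per DNF term and union the results at the mediator."""
    result = set()
    for author, title_word in terms:
        result |= conjunctive_source_query(author, title_word)
    return result


dnf_terms = [("Sigmund Freud", "dreams"), ("Carl Jung", "dreams")]
matching_ids = answer_dnf(dnf_terms)
```

Each source query stays within what a conjunctive form can express, and only matching rows cross the network.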

4
Systems Inefficiencies

How would existing systems respond to this query?
1. System R, Ingres, DB2 and NonStop SQL would
send (through a wrapper) the full unsupported
query to the Amazon source.
Problem: They assume that sources have full
relational capabilities.
2. The Information Manifold and the TSIMMIS
systems would convert the query to CNF.
Problem: Although they do take source
capabilities into account, they consider only
limited options.

5
Systems Inefficiencies

How would existing systems respond to this
query?
3. Garlic would send the second clause and apply
the first one itself. This plan extracts over
2,000 entries from Amazon, instead of 9.
Problem: Garlic always processes query conditions
in CNF. It retrieves too much data for
processing.
4. DISCO considers plans that support the whole
query or none at all; it does not split the target
query condition into parts.
Problem: It considers limited options.

6
Challenge for a better solution: Overview

Rather than blindly working in DNF or CNF, a
query processing system that deals with Internet
sources must
carefully consider the space of available
options, and
select an efficient one that is supported by the
source.
1. Framework for describing source capabilities
and query plans.
2. Scheme for generating efficient feasible
plans for selection queries.
3. GenCompact Architecture.
4. GenCompact Efficiency Analysis.

7
1. Framework for describing source capabilities

Heterogeneity of Internet sources
Standardised Internet interfaces (e.g. forms) and
Internet DB schemas impose strict query
limitations:
1. On the conditions that can be specified.
2. On the number of conditions in the selection
condition expression.
3. On the structure of the condition expression
(e.g. conjunctive queries).

8
Describing Source Capabilities

Is a source query supported by the source?
Source capability description language based on
context free grammars (CFGs).
Parser checks for the supportability of a query
against the capability description.
Check(Condition n): Given a condition
expression and a source, it returns the set of
attributes that can be exported by the source
when evaluating the condition expression.
SP(C, A, R) is supported by R if
$A \subseteq Check(C)$.

9
Simple Source Description Language

Example: database of cars for sale
Attributes: make, model, year, color, price
_s → _s1 | _s2
_s1 → make = m ^ price < p
_s2 → make = m ^ color = c
attributes(_s1): make, model, year, color
attributes(_s2): make, model, year
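
A minimal sketch of how Check and the supportability test could behave for this car source. It is a deliberate simplification: instead of a real CFG parser, it matches a conjunctive condition against the two rule shapes. The names `RULES`, `check` and `is_supported`, and the encoding of a condition as (attribute, operator, value) triples, are assumptions made for this example.

```python
def _shape(pairs):
    """Order-insensitive shape of a conjunction: sorted (attribute, operator) pairs."""
    return tuple(sorted(pairs))


# One entry per grammar rule: the condition shape it accepts and the
# attributes the source exports for it (taken from the slide).
RULES = {
    "_s1": (_shape([("make", "="), ("price", "<")]), {"make", "model", "year", "color"}),
    "_s2": (_shape([("make", "="), ("color", "=")]), {"make", "model", "year"}),
}


def check(condition):
    """Return the attributes the source can export for this conjunctive condition.

    An empty set means the condition matches no rule, i.e. it is unsupported.
    """
    shape = _shape((attr, op) for attr, op, _value in condition)
    exported = set()
    for template, attributes in RULES.values():
        if shape == template:
            exported |= attributes
    return exported


def is_supported(condition, projected):
    """SP(C, A, R) is supported if C matches a rule and A is a subset of Check(C)."""
    exported = check(condition)
    return bool(exported) and set(projected) <= exported


cheap_hondas = [("make", "=", "Honda"), ("price", "<", 9000)]
red_hondas = [("make", "=", "Honda"), ("color", "=", "red")]
```

Here `is_supported(red_hondas, {"make", "year"})` holds, while asking the same condition for `price` does not, because rule `_s2` does not export it.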

10
Framework for Describing Query Plans: Query trees
Target queries are submitted to a mediator that
generates and executes source capability
sensitive query plans.
The leaf nodes represent SP queries, called
source queries, that are actually sent to R.
11
Query trees

The non-leaf nodes represent selection (S),
projection (P), intersection and union operations
that are performed at the mediator to combine the
results of the source queries.
The non-leaf nodes can also represent a Choice
operator.
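
One way to picture such a query tree is as a small recursive data structure that the mediator evaluates bottom-up. The sketch below is an assumption-laden illustration: the `PlanNode` class, the `execute` function and the sample rows are invented for the example, and projection and the Choice operator are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PlanNode:
    op: str  # "source" (leaf), "union", "intersect" or "select"
    children: List["PlanNode"] = field(default_factory=list)
    fetch: Optional[Callable[[], set]] = None  # used by "source" leaves only
    predicate: Optional[Callable[[tuple], bool]] = None  # used by "select" only


def execute(node):
    """Evaluate a plan tree: leaves query the source, inner nodes run at the mediator."""
    if node.op == "source":
        return set(node.fetch())
    results = [execute(child) for child in node.children]
    if node.op == "union":
        return set().union(*results)
    if node.op == "intersect":
        return set.intersection(*results)
    if node.op == "select":
        return {row for row in results[0] if node.predicate(row)}
    raise ValueError(f"unknown operator: {node.op}")


# Made-up (author, title) rows standing in for the source.
ROWS = [
    ("Sigmund Freud", "The Interpretation of Dreams"),
    ("Carl Jung", "Dreams"),
    ("Another Author", "A Book of Dreams"),
]

# A CNF-style plan: the source evaluates the title clause,
# the mediator applies the author clause afterwards.
title_query = PlanNode("source", fetch=lambda: {r for r in ROWS if "dreams" in r[1].lower()})
plan = PlanNode(
    "select",
    children=[title_query],
    predicate=lambda row: row[0] in {"Sigmund Freud", "Carl Jung"},
)
answer = execute(plan)
```

The leaf fetches every title containing "dreams", so the filtering cost lands on the mediator, which is the inefficiency the earlier slides attribute to CNF-only processing.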

12
2. Scheme for generating efficient feasible
plans: The GenModular Architecture
13
GenModular: The rewrite module

1. The rewrite module
Considers various rewritings of the target query
condition.
Applies commutative, associative and distributive
transformations to condition expressions.
Produces a set of condition trees (CTs) that
represent condition expressions equivalent to the
target query condition.
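
As an illustration of one such rewriting, the distributive law turns a conjunction of disjunctions into a disjunction of conjunctions. The sketch below assumes a condition in CNF is encoded as a list of clauses, each a list of atomic condition strings. The function name `cnf_to_dnf` and this encoding are choices made for the example, not the rewrite module's actual representation.

```python
from itertools import product


def cnf_to_dnf(clauses):
    """Distribute AND over OR: pick one atom from every clause to form each DNF term."""
    return [tuple(term) for term in product(*clauses)]


cnf = [
    ["author = 'Sigmund Freud'", "author = 'Carl Jung'"],
    ["title contains 'dreams'"],
]
dnf = cnf_to_dnf(cnf)
# dnf holds two terms, one per author, each paired with the title condition.
```

The number of terms is the product of the clause sizes, so repeated distribution can enlarge a condition quickly. That is one reason the number of equivalent CTs can grow large.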

14
GenModular: The mark module

2. The mark module
Identifies parts of the condition that can be
answered by the source. Calls the Check function
on every subtree of the CT.
Check(): the set of attributes returned by the
source when evaluating a part of the CT.
n.export() records the results of Check().
Check() is called on all nodes.
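
The marking step can be pictured as a bottom-up walk that stores the result of Check on every node. In this sketch the `CTNode` class, the `mark` function and the stand-in `toy_check` (a made-up source that accepts any OR-free condition and exports `id` and `title`) are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CTNode:
    label: str  # "AND", "OR", or an atomic condition such as "author = 'Carl Jung'"
    children: list = field(default_factory=list)
    export: frozenset = frozenset()  # filled in by mark()


def contains_or(node):
    """True if the subtree rooted at node contains a disjunction."""
    return node.label == "OR" or any(contains_or(child) for child in node.children)


def toy_check(node):
    """Hypothetical source: any OR-free condition is supported and exports id and title."""
    return frozenset() if contains_or(node) else frozenset({"id", "title"})


def mark(node, check=toy_check):
    """Call check on every subtree and record the result in node.export."""
    for child in node.children:
        mark(child, check)
    node.export = check(node)
    return node


freud = CTNode("AND", [CTNode("author = 'Sigmund Freud'"), CTNode("title contains 'dreams'")])
jung = CTNode("AND", [CTNode("author = 'Carl Jung'"), CTNode("title contains 'dreams'")])
root = mark(CTNode("OR", [freud, jung]))
```

After marking, the root has an empty export set (the source cannot answer the disjunction), while both AND subtrees are marked as answerable.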

15
GenModular: The generate module

3. The generate module
Produces the set of feasible plans by
invoking EPG on each node of the CTs.
EPG (Exhaustive Plan Generator): Combines the
source query parts into query plans for the
target query.
Pure plans: evaluated completely at the
source.
Impure plans: evaluated partly remotely at the
source and partly locally by the mediator.
4. The cost module chooses the least expensive
among these plans.

16
GenCompact

GenCompact versus GenModular
1. Intelligent plan generation: Reduces the number
of CTs that need to be processed. Integrates
the mark, generate and cost modules.
2. Pruning techniques: Using the cost
model, it identifies strategies that significantly
reduce the plan space explored without pruning
the optimal plan.

17
GenCompact: The rewrite module

GenCompact's rewrite module
1. Uses fewer rewrite rules and produces fewer
CTs. The associativity and copy rules are not
used.
Result: Reduced costs.
2. Rewrites the CFG source description, and
parses with a larger set of rules. The Check()
function exports more variables, but the final
query has to be fixed before being executed.
The rewriting is done only when the source
is integrated into the system.
The parser contains more CFG rules and is
more complex.
Result: No cost increase.

18
GenCompact: Cost Model

Given a query plan that uses a set of source
queries SQ, the cost of the plan is
$\sum_{sq \in SQ} (k_1 + k_2 \cdot |sq|)$, where $|sq|$ is the result size of sq.
$k_1$: the overhead of the messages over the
Internet and
the overhead of starting a query at source R.
This part depends on the number of source queries.
$k_2$: the cost per tuple of computing the answer
at source R and the cost per byte of
transferring the answer over the network.
This part depends on the size of the results of each
query.
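
The cost model is simple enough to evaluate by hand, and a short function makes the trade-off concrete. In the sketch below the function name `plan_cost`, the constants `k1 = 100` and `k2 = 1`, and the 5/4 split of the nine matching entries between the two sub-queries are illustrative assumptions. Only the totals (over 2,000 entries versus 9) come from the earlier slides.

```python
def plan_cost(result_sizes, k1, k2):
    """Sum over source queries of (k1 + k2 * result size)."""
    return sum(k1 + k2 * size for size in result_sizes)


K1 = 100  # assumed per-query overhead
K2 = 1    # assumed per-tuple cost

# CNF-style plan: one source query returning about 2,000 entries.
cnf_plan_cost = plan_cost([2000], K1, K2)

# DNF plan: two source queries returning 9 entries in total (split assumed).
dnf_plan_cost = plan_cost([5, 4], K1, K2)
```

With these assumed constants the two-query plan pays the fixed overhead twice but is still far cheaper, because the result-size term dominates.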

19
GenCompact: Plan Generation

GenCompact limits the space of alternative
plans explored by applying a set of pruning
rules.
The pruning rules reduce the number of plans
explored without affecting the optimality of the
result.

20
GenCompact: Plan Generation

Pruning Rule 1
Prune impure plans when pure plan exists.
A pure plan processes the target query entirely
at the source (no mediator postprocessing is
required).
If the pure plan is feasible, no impure plan
need be generated, because under our cost model
it will never be cheaper than the pure plan.
Impure plans use at least as many source
queries and transfer at least as much data as the
pure plan.

21
GenCompact: Plan Generation

Pruning Rule 2
Prune locally sub-optimal plans. When impure
plans are needed for a target query
SP(n, A, R),
generate plans for the sub-queries and combine
them to form the target query plans. When
considering plans to be combined, it is safe to
prune away all but the cheapest plan for each
sub-query.
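
Pruning Rule 2 amounts to keeping one minimum-cost entry per sub-query before any combination is attempted. The sketch below assumes candidate plans are given as (plan name, cost) pairs grouped by sub-query. The function name and the sample costs are made up for illustration.

```python
def prune_locally_suboptimal(plans_by_subquery):
    """Keep only the cheapest (plan_name, cost) pair for each sub-query."""
    return {
        subquery: min(plans, key=lambda plan: plan[1])
        for subquery, plans in plans_by_subquery.items()
    }


candidates = {
    "c1 ^ c2": [("pure", 120.0), ("split into c1 and c2", 310.0)],
    "c3": [("pure", 80.0)],
}
survivors = prune_locally_suboptimal(candidates)
```

Only the survivors are considered when plans are combined, so the number of combinations no longer multiplies with the number of alternatives per sub-query.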

22
GenCompact: Plan Generation

Pruning Rule 3
Prune dominated plans for queries with
conjunctive conditions.
Example
P: plan for the target query SP(c1 ^ c2 ^ c3 ^
c4, A, R).
P1: plan for the sub-query SP(c1 ^ c2 ^ c3, A, R).
P2: plan for the sub-query SP(c1 ^ c2, A, R).
If P1 is cheaper than P2, P1 dominates P2.
There is no need to consider P2 when
combining plans to form the target query plan. The
cost of any plan P for the target query that uses
P2 can always be lowered by replacing P2 with P1
in P.

23
GenCompact: Plan Generation

For each CT produced by the rewrite module:
It considers all the plans of the CT.
It applies the copy and association rules to
find more plans that can be obtained.
It generates a canonical tree.
It calls IPG on each canonical CT.
IPG implements the Minimum Cost Set Cover
algorithm to combine the best feasible sub-plans at
each node.
It generates only a single query plan for the
CT.
The overall best plan is chosen.
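
The combination step can be illustrated with a brute-force Minimum Cost Set Cover: among candidate sub-plans, each covering some of a node's child conditions at some cost, pick the cheapest collection that covers all children. This is a generic exhaustive sketch under that reading, not the IPG procedure itself. The function name, the input encoding and the sample costs are assumptions.

```python
from itertools import combinations


def min_cost_set_cover(universe, candidates):
    """Exhaustively find the cheapest set of candidates covering the universe.

    candidates is a list of (covered_conditions, cost) pairs.
    Returns (total_cost, chosen_indices), or None if no cover exists.
    """
    universe = frozenset(universe)
    best = None
    for size in range(1, len(candidates) + 1):
        for combo in combinations(range(len(candidates)), size):
            covered = frozenset().union(*(candidates[i][0] for i in combo))
            if covered >= universe:
                cost = sum(candidates[i][1] for i in combo)
                if best is None or cost < best[0]:
                    best = (cost, combo)
    return best


children = {"c1", "c2", "c3"}
sub_plans = [
    ({"c1", "c2"}, 5),
    ({"c3"}, 4),
    ({"c1", "c2", "c3"}, 12),
    ({"c2", "c3"}, 6),
    ({"c1"}, 2),
]
best_cover = min_cost_set_cover(children, sub_plans)
```

The search tries every subset of candidates, so its running time is exponential in the number of candidates. That matches the slides' remark that each IPG invocation is expensive.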

24
Performance Evaluation Overview

T: number of CTs.
W: complexity of the plan generation for each
CT, made up of
the cost of parsing (calls to the Check
function), and
the cost of the EPG and IPG calls.
$T_{GM} \gg T_{GC}$
$W_{GC} \gg W_{GM}$
GenModular: more CTs.
GenCompact: fewer CTs, BUT a much higher cost of
processing each CT.

25
Performance Evaluation
26
Performance Evaluation Framework

CT: the target query condition, a balanced F-ary tree.
H: height of the CT.
$|CT|$: number of nodes in the CT.
$C_q$: the selection condition of the target query
q.
$A_q$: the projected attributes of the target
query q.

27
Performance Evaluation

$|CT|^2$: the parsing cost. Each of the $|CT|$
nodes is parsed, with the subtree
rooted there, in $O(|CT|)$ time.
$|CT| \cdot 2^{|Att(C_q) - A_q|} \cdot 2^F$: the cost of the EPG calls.
Procedure EPG is called on each node at least
once.
$2^{|Att(C_q) - A_q|}$: how many times EPG can be called on
each node with different attributes. The $A_q$
attributes are always included any time EPG is
called.
$2^F$: the actual cost of each EPG call, one step for each of
the possible subsets of the children of a node.

28
Performance Evaluation

$2^F$ additional parser calls: each subset of the
children of a non-leaf node constitutes a
condition that needs to be parsed.
Each IPG invocation may take $O(2^{2^F})$ time
because it needs to solve the MCSC problem using
an exhaustive algorithm.

29
Performance Evaluation
GenCompact is much more efficient than
GenModular. A simple DNF target query condition
with 3 terms and 3 atomic conditions per term
leads to a ratio of 224.
