Advanced Database Management Presentation 1: Tzavala Polina EYO 617

Advanced Database Management: Presentation 1
Tzavala Polina EYO 617
  • Capability Sensitive Query Processing on Internet
  • Hector Garcia-Molina, Wilburt Labio, Ramana

The problem
  • Efficiently generating capability-sensitive
    plans for selection queries over Internet data
  • Differences from the traditional query
    optimization problem:
  • 1. Large heterogeneity of Internet sources.
  • 2. Selection queries.
  • 3. Focus on efficiency:
  • Slow response time of many Internet sources.
  • Low bandwidth of many connections.

Example Query in Amazon
  • Amazon forms do not support disjunctions
  • Full Query in DNF
  • (author = 'Sigmund Freud' AND title contains 'dreams') OR
  • (author = 'Carl Jung' AND title contains 'dreams')
  • SubQuery 1
  • author = 'Sigmund Freud' AND title contains 'dreams'
  • SubQuery 2
  • author = 'Carl Jung' AND title contains 'dreams'
  • Query in CNF
  • (author = 'Sigmund Freud' OR author = 'Carl Jung') AND
  • (title contains 'dreams')
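
As a rough illustration of why the form of the query matters, the sketch below runs a DNF-style plan and a CNF-style plan against a tiny in-memory catalog and counts the rows each plan pulls from the source. The catalog rows, the `source_query` helper and its conjunction-only restriction are assumptions made for this example; they stand in for a form without disjunctions and are not Amazon's real interface.

```python
# Toy catalog standing in for the remote source (hypothetical data).
CATALOG = [
    {"author": "Sigmund Freud", "title": "The Interpretation of Dreams"},
    {"author": "Sigmund Freud", "title": "Totem and Taboo"},
    {"author": "Carl Jung", "title": "Dreams"},
    {"author": "Carl Jung", "title": "Man and His Symbols"},
    {"author": "Someone Else", "title": "A Book of Dreams"},
]


def source_query(author=None, title_word=None):
    """Hypothetical source interface: accepts only a conjunction of the
    filled-in fields, like a search form without disjunctions."""
    rows = CATALOG
    if author is not None:
        rows = [r for r in rows if r["author"] == author]
    if title_word is not None:
        rows = [r for r in rows if title_word.lower() in r["title"].lower()]
    return rows


def dnf_plan():
    """Send one source query per DNF term and union the results."""
    sent = [
        source_query(author="Sigmund Freud", title_word="dreams"),
        source_query(author="Carl Jung", title_word="dreams"),
    ]
    transferred = sum(len(rows) for rows in sent)
    answer = [r for rows in sent for r in rows]
    return answer, transferred


def cnf_plan():
    """Send only the title clause; apply the author clause at the mediator."""
    rows = source_query(title_word="dreams")
    answer = [r for r in rows if r["author"] in ("Sigmund Freud", "Carl Jung")]
    return answer, len(rows)


dnf_answer, dnf_transferred = dnf_plan()
cnf_answer, cnf_transferred = cnf_plan()
```

Both plans return the same two books, but on this toy data the CNF-style plan transfers three rows while the DNF-style plan transfers two.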

Systems Inefficiencies
  • How would existing systems respond to this query?
  • 1. System R, Ingres, DB2 and NonStop SQL would
    send (through a wrapper) the full unsupported
    query to the Amazon source.
  • Problem: they assume that sources have full
    relational capabilities.
  • 2. The Information Manifold and the TSIMMIS
    systems would convert the query to CNF.
  • Problem: although they do take source
    capabilities into account, they only consider limited

Systems Inefficiencies
  • How would existing systems respond to this query?
  • 3. Garlic would send the second clause and apply
    the first one itself. This plan extracts over
    2,000 entries from Amazon, instead of 9.
  • Problem: Garlic always processes query
    conditions in CNF. It retrieves too much data for
  • 4. DISCO considers plans that send the source the
    whole query condition or none of it; it does not
    split the target query condition into parts.
  • Problem: it considers limited options.

Challenge for a better solution
  • Rather than blindly working in DNF or CNF, a
    query processing system that deals with Internet
    sources must
  • carefully consider the space of available
    options, and
  • select an efficient one that is supported by the

Overview
  • 1. Framework for describing source capabilities
    and query plans.
  • 2. Scheme for generating efficient feasible
    plans for selection queries.
  • 3. GenCompact architecture.
  • 4. GenCompact efficiency analysis.

1. Framework for describing source capabilities
  • Heterogeneity of Internet sources
  • Standardised Internet interfaces (i.e. forms) and
    Internet DB schemas impose strict Query
  • 1. On the conditions that can be specified.
  • 2. On the number of conditions in the selection
    condition expression.
  • 3. On the structure of the condition expression
    (e.g. conjunctive queries).

Describing Source Capabilities
  • Is a source query supported by the source?
  • Source capability description language based on
    context free grammars (CFGs).
  • Parser checks for the supportability of a query
    against the capability description.
  • Check(Condition n): given a condition
    expression and a source, it returns the set of
    attributes that can be exported by the source
    when evaluating the condition expression.
  • SP(C, A, R) is supported by R if

Simple Source Description Language
  • Example database of cars for sale
  • Attributes: make, model, year, color, price
  • _s -> _s1 | _s2
  • _s1 -> make = m AND price < p
  • _s2 -> make = m AND color = c
  • attributes _s1: make, model, year, color
  • attributes _s2: make, model, year
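
One way to picture the Check function for this description is simple template matching. The sketch below is not a CFG parser: it hard-codes the two rules, read as "make = m AND price < p" and "make = m AND color = c", as sets of (attribute, operator) pairs. The names `RULES` and `check`, and the triple encoding of conditions, are hypothetical choices made to keep the example short.

```python
# Each rule lists the (attribute, operator) pairs it accepts and the
# attributes the source exports when a query matches that rule.
RULES = [
    ({("make", "="), ("price", "<")}, {"make", "model", "year", "color"}),  # _s1
    ({("make", "="), ("color", "=")}, {"make", "model", "year"}),           # _s2
]


def check(condition):
    """Return the attributes exported for a conjunctive condition, given as
    a list of (attribute, operator, value) triples.
    An empty set means the source does not support the condition."""
    shape = {(attr, op) for attr, op, _value in condition}
    exported = set()
    for accepted, attributes in RULES:
        if shape == accepted:
            exported |= attributes
    return exported


supported = check([("make", "=", "BMW"), ("price", "<", 40000)])
unsupported = check([("color", "=", "red")])
```

Here `supported` holds the four attributes of rule _s1, while `unsupported` is empty because no rule accepts a condition on color alone.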

Framework for Describing Query Plans: Query trees
Target queries are submitted to a mediator that
generates and executes source-capability-sensitive
query plans.
The leaf nodes represent SP queries, called
source queries, that are actually sent to R.
Query trees
  • The non-leaf nodes represent selection (S),
    projection (P), intersection and union operations
    that are performed at the mediator to combine the
    results of the source queries.
  • The non-leaf nodes can also represent a Choice
    operator.
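
A query tree of this kind can be sketched as a small class hierarchy in which leaves fetch rows from the source and inner nodes combine them at the mediator. The class names, the `fetch` callables and the tuple rows below are assumptions for illustration; projection and the Choice operator are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, List

Row = tuple  # rows are plain tuples so they can be stored in sets


@dataclass
class SourceQuery:
    """Leaf node: an SP query that is actually sent to the source."""
    fetch: Callable[[], List[Row]]

    def evaluate(self):
        return set(self.fetch())


@dataclass
class Union:
    """Mediator-side union of the children's results."""
    children: list

    def evaluate(self):
        result = set()
        for child in self.children:
            result |= child.evaluate()
        return result


@dataclass
class Intersection:
    """Mediator-side intersection; expects at least one child."""
    children: list

    def evaluate(self):
        return set.intersection(*(c.evaluate() for c in self.children))


@dataclass
class Select:
    """Mediator-side selection applied to the child's result."""
    predicate: Callable[[Row], bool]
    child: object

    def evaluate(self):
        return {row for row in self.child.evaluate() if self.predicate(row)}


# Two source queries whose results are unioned at the mediator.
plan = Union([
    SourceQuery(lambda: [("Sigmund Freud", "The Interpretation of Dreams")]),
    SourceQuery(lambda: [("Carl Jung", "Dreams")]),
])
answer = plan.evaluate()
```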

2. Scheme for generating efficient feasible plans
The GenModular Architecture
GenModular: The rewrite module
  • 1. The rewrite module
  • Considers various rewritings of the target query.
  • Applies commutative, associative and distributive
    transformations to condition expressions.
  • Produces a set of condition trees (CTs) that
    represent condition expressions equivalent to the
    target query.
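
To make the distributive transformation concrete, the sketch below pushes AND below OR in a small condition tree. It shows only this one rewriting, not the full set of rewritings a rewrite module would consider, and the tuple encoding and the name `to_dnf` are assumptions made for the example.

```python
def to_dnf(tree):
    """Push AND below OR using the distributive law.
    A tree is an atom (str) or a binary tuple ("and" | "or", left, right)."""
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    left, right = to_dnf(left), to_dnf(right)
    if op == "or":
        return ("or", left, right)
    # op == "and": distribute over an OR found on either side
    if not isinstance(left, str) and left[0] == "or":
        return ("or",
                to_dnf(("and", left[1], right)),
                to_dnf(("and", left[2], right)))
    if not isinstance(right, str) and right[0] == "or":
        return ("or",
                to_dnf(("and", left, right[1])),
                to_dnf(("and", left, right[2])))
    return ("and", left, right)


cnf = ("and", ("or", "author=Freud", "author=Jung"), "title~dreams")
dnf = to_dnf(cnf)
```

Applied to the CNF form of the book query, this yields an OR of two conjunctions, one per author.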

GenModular: The mark module
  • 2. The mark module
  • Identifies parts of the condition that can be
    answered by the source. Calls the Check function
    on every subtree of the CT.
  • Check(): returns the set of attributes exported by
    the source when evaluating a part of the CT.
  • n.export() records the results of Check().
  • Check() is called on all nodes.
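
The marking step can be sketched as a post-order walk that calls a Check function on the subtree rooted at every node and stores the result on that node. The `Node` class, the `export` attribute and the `toy_check` capability (atoms and flat conjunctions only) are hypothetical stand-ins, not the actual source description machinery.

```python
class Node:
    """Condition-tree node; leaves carry an atomic condition as label."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.export = None  # filled in by mark()


def mark(node, check):
    """Call check on the subtree rooted at every node and record the result."""
    for child in node.children:
        mark(child, check)
    node.export = check(node)


def toy_check(node):
    """Hypothetical capability: the source supports atomic conditions and
    conjunctions of atomic conditions, exporting author and title."""
    is_leaf = not node.children
    flat_and = node.label == "and" and all(not c.children for c in node.children)
    return {"author", "title"} if is_leaf or flat_and else set()


root = Node("or", [
    Node("and", [Node("author=Freud"), Node("title~dreams")]),
    Node("and", [Node("author=Jung"), Node("title~dreams")]),
])
mark(root, toy_check)
```

After marking, the two AND subtrees are recorded as supported, while the OR root has an empty export set and must be handled at the mediator.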

GenModular: The generate module
  • 3. The generate module
  • Produces the set of feasible plans by
    invoking EPG on each node of the CTs.
  • EPG (Exhaustive Plan Generator): combines the
    source query parts into query plans for the
    target query.
  • Pure plans: evaluated completely at the
  • Impure plans: evaluated remotely and locally
    by the mediator.
  • 4. The cost module chooses the least expensive
    among these plans.

GenCompact versus GenModular
  • 1. Intelligent plan generation: reduces the number
    of CTs that need to be processed. Integrates
    the mark, generate and cost modules.
  • 2. Pruning techniques: using the cost
    model, it identifies strategies that significantly
    reduce the plan space explored without pruning
    the optimal plan.

GenCompact: The rewrite module
  • GenCompact's rewrite module
  • 1. Uses fewer rewrite rules and produces fewer
    CTs. The associativity and copy rules are not
  • Result: reduced costs.
  • 2. Rewrites the CFG source description, and
    parses with a larger set of rules. The Check()
    function exports more variables, but the final
    query has to be fixed before being executed.
  • The rewriting is done only when the source
    is integrated into the system.
  • The parser contains more CFG rules and is
    more complex.
  • Result: no cost increase.

GenCompact: Cost Model
  • Given a query plan that uses a set of source
    queries SQ, the cost of the plan is
  • Σ over sq in SQ of (k1 + k2 × (result size of sq))
  • k1: the overhead of the messages over the
    Internet and
  • the overhead in starting a query at source R.
  • Depends on the number of source queries.
  • k2: the cost per tuple of computing the answer
    at source R and the cost per byte of
    transferring the answer over the network.
  • Depends on the size of the results of each
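
Reading the cost as a fixed charge k1 per source query plus k2 times the size of each result, the model fits in a few lines. The function name `plan_cost`, the constants and the result sizes below are illustrative assumptions; in particular, the 5 and 4 are an arbitrary split of the 9 entries mentioned earlier, and 2,000 is used as a round stand-in for "over 2,000".

```python
def plan_cost(result_sizes, k1, k2):
    """Cost of a plan: k1 per source query plus k2 per unit of result size."""
    return sum(k1 + k2 * size for size in result_sizes)


K1, K2 = 10, 1  # hypothetical constants, chosen only for the comparison

one_big_query = plan_cost([2000], K1, K2)    # a single query with a large result
two_small_queries = plan_cost([5, 4], K1, K2)  # two queries with small results
```

With these numbers the two-query plan is far cheaper, because the extra per-query overhead is small compared with the data it avoids transferring.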

GenCompact Plan Generation
  • GenCompact limits the space of the alternative
    plans explored with the use of some heuristics
  • The pruning rules reduce the number of plans
    explored without affecting the optimality of the

GenCompact Plan Generation
  • Pruning Rule 1
  • Prune impure plans when pure plan exists.
  • A pure plan processes the target query entirely
    at the source (no mediator postprocessing is
  • If the pure plan is feasible, no impure plan
    need be generated, because under our cost model
    it will never be cheaper than the pure plan.
  • Impure plans use at least as many source
    queries and transfer at least as much data as the
    pure plan.

GenCompact Plan Generation
  • Pruning Rule 2
  • Prune locally sub-optimal plans. In order to
    prune impure plans for a target query:
  • Generate plans for the sub-queries and combine
    them to form the target query plans. When
    considering plans to be combined, it is safe to
    prune away all but the cheapest plan for each
GenCompact Plan Generation
  • Pruning Rule 3
  • Prune dominated plans for queries with
    conjunctive conditions.
  • Example
  • P: plan for the target query SP(c1 AND c2 AND c3
    AND c4, A, R).
  • P1: plan for the sub-query SP(c1 AND c2 AND c3, A, R).
  • P2: plan for the sub-query SP(c1 AND c2, A, R).
  • If P1 is cheaper than P2, P1 dominates P2.
  • There is no need to consider P2 when
    combining plans to form the target query plan. The
    cost of any plan P for the target query that uses
    P2 can always be lowered by replacing P2 with P1
    in P.
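
The dominance test in this example can be sketched as a filter over candidate sub-plans. The sketch assumes, following the example, that a plan is dominated when another plan handles a strict superset of its conditions at a strictly lower cost; the pair encoding of plans and the name `prune_dominated` are hypothetical.

```python
def prune_dominated(plans):
    """plans: list of (conditions, cost) pairs, with conditions a frozenset.
    Drop every plan for which some other plan covers a strict superset of
    its conditions at a strictly lower cost."""
    kept = []
    for conds, cost in plans:
        dominated = any(
            other_conds > conds and other_cost < cost
            for other_conds, other_cost in plans
        )
        if not dominated:
            kept.append((conds, cost))
    return kept


p1 = (frozenset({"c1", "c2", "c3"}), 5)  # cheaper and covers more conditions
p2 = (frozenset({"c1", "c2"}), 8)        # dominated by p1
p3 = (frozenset({"c4"}), 3)              # unrelated conditions, always kept
survivors = prune_dominated([p1, p2, p3])
```

Only `p1` and `p3` survive, so later combination steps never have to consider `p2`.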

GenCompact Plan Generation
  • For each CT produced by the rewrite module:
  • It considers all the plans of the CT.
  • It applies the copy and associativity rules to
    find more plans that can be obtained.
  • It generates a canonical tree.
  • It calls IPG on each canonical CT.
  • IPG implements the Minimum Cost Set Cover (MCSC)
    algorithm to combine the best feasible sub-plans at
    each node.
  • It generates only a single query plan for the
  • The overall best plan is chosen.
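
The set-cover step can be illustrated with a brute-force search: given the conditions at a node and a list of candidate sub-plans, each covering some of those conditions at a known cost, pick the cheapest combination that covers them all. This is a generic exhaustive minimum-cost set cover written for illustration, not the IPG procedure itself; the function name and the candidate encoding are assumptions.

```python
from itertools import combinations


def min_cost_set_cover(universe, candidates):
    """candidates: list of (covered, cost) pairs, with covered a frozenset.
    Try every non-empty combination of candidates and return
    (best_cost, chosen_indices), or None if nothing covers the universe."""
    best = None
    for size in range(1, len(candidates) + 1):
        for combo in combinations(range(len(candidates)), size):
            covered = frozenset().union(*(candidates[i][0] for i in combo))
            if covered >= universe:
                cost = sum(candidates[i][1] for i in combo)
                if best is None or cost < best[0]:
                    best = (cost, combo)
    return best


conditions = frozenset({"c1", "c2", "c3"})
sub_plans = [
    (frozenset({"c1", "c2"}), 4),
    (frozenset({"c3"}), 3),
    (frozenset({"c1", "c2", "c3"}), 9),
    (frozenset({"c2", "c3"}), 2),
    (frozenset({"c1"}), 1),
]
best = min_cost_set_cover(conditions, sub_plans)
```

The search looks at every subset of the candidate list, so its running time grows exponentially with the number of candidates; it is practical only when that number is small.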

Performance Evaluation Overview
  • T: number of CTs.
  • W: complexity of the plan generation for each
  • the cost of parsing (calls to the Check
    function), and
  • the cost of the EPG and IPG calls.
  • T_GM >> T_GC
  • W_GC >> W_GM
  • GenModular: more CTs.
  • GenCompact: fewer CTs, BUT a much higher cost of
    processing a CT.

Performance Evaluation
Performance Evaluation Framework
  • CT: the target query condition, a balanced F-ary tree.
  • H: height of CT.
  • |CT|: number of nodes in CT.
  • Cq: the selection condition of the target query q.
  • Aq: the projected attributes of the target
    query q.

Performance Evaluation
  • |CT|^2: the parsing cost; for each of the |CT|
    nodes, the subtree rooted there is parsed
    in O(|CT|) time.
  • |CT| × 2^|Att(C)-A| × 2^F: the cost of the EPG calls.
    Procedure EPG is called on each node at least
  • 2^|Att(C)-A|: how many times EPG can be called on
    each node with different attributes. The Aq
    attributes are always included any time EPG is
    called.
  • 2^F: the actual cost of each EPG call, for each of
    the possible subsets of the children of a node.

Performance Evaluation
  • 2^F additional parser calls. Each subset of
    children nodes of a non-leaf node constitutes a
    condition that needs to be parsed.
  • Each IPG invocation may take O(2^(2^F)) time
    because it needs to solve the MCSC problem using
    an exhaustive algorithm.

Performance Evaluation
GenCompact is much more efficient than
GenModular. A simple DNF target query condition
with 3 terms and 3 atomic conditions per term
leads to a ratio of 224.