Optimizing Recursive Information Gathering Plans

1
Optimizing Recursive Information Gathering Plans
Eric Lambrecht, Subbarao Kambhampati, Senthil Gnanaprakasam
Arizona State University, Tempe, USA
rakaposhi.eas.asu.edu/yochan.html
2
Information Gathering
user
Gatherer
wrapper
db
wrapper
<html>
cgi
3
EMERAC Query Planning System
Build query plan using source inversion
Execution Optimizations: Source call ordering
Logical Optimizations: Redundancy removal
Execute query plan
Duschka (with Genesereth & Levy) 97
Optimization steps
4
Organization
  • Optimization challenges in EMERAC
  • Building Source Complete Plans (Review)
  • Logical optimization
  • Minimization of recursive IG plans by removing
    redundant source calls
  • Execution optimization
  • Ordering source calls to minimize both access and
    tuple-transfer costs
  • Implementation and Results
  • Contributions

5
Modeling Information Gathering
  • Information sources
  • relational
  • answer select queries (possibly a restricted
    set of query patterns)
  • autonomous
  • World model
  • relational
  • Query on the world model
  • Reformulate the query as calls on information
    sources. Optimize. Execute.

Local as View model
6
Modeling Sources
Sources related to world model by describing them
as views over world model
movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
house-of-movies(X, Y) -> title-time(X, Y), title-actor(X, Z)
query(X, Y) :- title-time(X, Y)
Required binding..
7
Optimization challenges in EMERAC
Traditional
  • Each relation is exported in toto by a single
    database
  • All sources are assumed to be fully relational
Information Gathering
  • Multiple sources export partial and overlapping
    portions of a relation
  • Need to minimize plans to remove redundancy
  • Sources are rarely fully relational
  • Only limited types of queries allowed
  • Wrapped web-pages
  • Form-interfaced databases
  • Certain forms of join computation may be
    precluded
  • Need to model query capabilities

8
Optimization challenges in EMERAC (continued)
Traditional
  • Tuple-transfer costs are assumed to dominate the
    query-execution costs
  • Use of Bound-is-easier assumption
  • Assume availability of full source-statistics
  • Selectivity indices, histograms etc.
Information Gathering
  • Access costs (source latencies) tend to equal or
    dominate the transfer cost
  • Need to consider number of source calls
  • Need for considering bushy joins (instead of just
    left-linear join trees)
  • Full statistics are rarely available about
    internet sources
  • Sources are decentralized and autonomous
  • Difficult to do systematic optimization

9
Source Access Limitations
  • Sources can have a variety of access limitations
  • Form interfaced databases may require certain
    attributes to be bound
  • Whitepages may require the name of the person
  • To get the numbers of a set of n people, we will
    have to access the source n times
  • Sources may also be unable to handle bindings of
    other attributes
  • A Whitepages database may not take the address of
    a person as a bound attribute
  • To get the number of John Doe, who lives on Lemon
    St, we will have to get the numbers of all John
    Does, and locally filter the ones not living on
    Lemon Street
  • Wrapped web-pages cannot select over any
    attributes
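
The effect of these limitations can be sketched in a few lines of Python. The source here is simulated: `WHITEPAGES`, `whitepages_by_name`, and `lookup_numbers` are hypothetical names, and the records are made up for illustration.

```python
# Simulated Whitepages source (illustrative data, not from the talk).
WHITEPAGES = [
    {"name": "John Doe", "street": "Lemon St", "number": "555-0101"},
    {"name": "John Doe", "street": "Apple Ave", "number": "555-0102"},
    {"name": "Jane Roe", "street": "Lemon St", "number": "555-0103"},
]

def whitepages_by_name(name):
    """The only query form the source accepts: name bound, everything else free."""
    return [rec for rec in WHITEPAGES if rec["name"] == name]

def lookup_numbers(names, street=None):
    """One source access per name; the street condition is applied locally."""
    accesses, numbers = 0, []
    for name in names:
        accesses += 1                      # n names -> n accesses
        for rec in whitepages_by_name(name):
            if street is None or rec["street"] == street:
                numbers.append(rec["number"])
    return numbers, accesses
```

Asking for John Doe on Lemon St still transfers every John Doe record; the address filter only runs after the tuples arrive.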

10
Representing Source Access Limitations
  • Use annotations on the attributes of the source
    relation
  • $ annotation identifies attributes that must be
    bound
  • % annotation identifies un-selectable
    attributes
  • S($X,%Y,Z)
  • A form-interfaced web-page that requires bindings
    for X and is able to do selections only on Z.
  • $ and % annotations help identify feasible
    binding patterns for sources
  • S^b-- are feasible; S^f-- are infeasible
  • S^bbf must be modeled as S^bff filtered locally
    with binding on Y
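
A minimal sketch of how the two annotations could drive the feasibility check. It assumes the must-bind annotation is written `$` and the un-selectable annotation `%`; `is_feasible` and `split_call` are hypothetical helper names, and binding patterns are strings of `b` (bound) and `f` (free).

```python
def is_feasible(pattern, annotations):
    """A pattern is feasible when every '$' (must-bind) attribute is bound."""
    return all(p == "b" for p, a in zip(pattern, annotations) if a == "$")

def split_call(pattern, annotations):
    """Return the pattern actually sent to the source, plus the positions of
    bound '%' (un-selectable) attributes that must be filtered locally."""
    sent = "".join("f" if a == "%" else p for p, a in zip(pattern, annotations))
    local = [i for i, (p, a) in enumerate(zip(pattern, annotations))
             if p == "b" and a == "%"]
    return sent, local

# S(X, Y, Z): X must be bound, Y is un-selectable, Z is unrestricted.
S_ANNOTATIONS = ["$", "%", ""]
```

With these annotations, `split_call("bbf", S_ANNOTATIONS)` sends `bff` to the source and reports that position 1 (Y) is filtered locally.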

11
Properties of optimal information gathering plans
  • Source-complete: no other plan returns more
    information using the available sources
  • Source-minimal: a plan from which no information
    source can be removed while the plan still
    returns the same answer
  • Access-cost minimal: a plan which reduces the
    number of separate accesses to individual sources
  • Bandwidth-minimal: a plan that, when executed,
    transfers the smallest amount of data over the
    network yet is still source complete

12
Ensuring properties of optimal information
gathering plans
Build query plan: Source completeness
Logical Optimizations: Source-minimality
Execution Optimizations: Access cost and bandwidth minimality
Execute query plan
13
Building Source Complete Plans
Duschka, Genesereth 97
query(X, Y) :- title-time(X, Y)
movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
house-of-movies(X, Y) -> title-time(X, Y), title-actor(X, Z)
Source Inversion Rules
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
Binding restrictions lead to recursion in the plan
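
The recursion through dom can be illustrated with a small fixpoint loop. This is a sketch under stated assumptions, not EMERAC's evaluator: both sources are simulated, house-of-movies is assumed to accept only calls with its first attribute bound, and `run_plan`, `MOVIE_HUT`, and `HOUSE_OF_MOVIES` are hypothetical names with made-up data.

```python
# Simulated sources (illustrative data only).
MOVIE_HUT = {("Alien", "Midnight")}                 # callable with no bindings
HOUSE_OF_MOVIES = {                                 # needs its first attribute bound
    "Alien": {("Alien", "9pm")},
    "Midnight": {("Midnight", "10pm")},             # a film titled "Midnight"
}

def movie_hut():
    return set(MOVIE_HUT)

def house_of_movies(title):
    return set(HOUSE_OF_MOVIES.get(title, set()))

def run_plan():
    """Evaluate the inverse rules for title-time to a fixpoint."""
    title_time = movie_hut()                        # title-time :- movie-hut
    dom = {v for pair in title_time for v in pair}  # dom(X), dom(Y) :- movie-hut
    asked = set()
    while dom - asked:
        x = (dom - asked).pop()
        asked.add(x)
        for title, time in house_of_movies(x):      # title-time :- dom(X), house-of-movies
            title_time.add((title, time))
            dom.add(time)                           # dom(Y) :- dom(X), house-of-movies
    return title_time, len(asked)
```

Every value that ever enters dom triggers one call to house-of-movies, which is why redundant recursive sources are expensive to keep in a plan.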
14
Problems with Plans derived from source inversion
rules
  • Every source that is remotely relevant to the
    query is made part of the plan
  • Many of these sources may be overlapping
  • If both movie-hut and house-of-movies have same
    information
  • both sources are not necessary
  • the recursion is not necessary

title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
15
Minimizing information gathering plans
  • Model source overlaps
  • Use LCW statements
  • Rewrite the source-complete plan
  • Greedily remove rules from plan with uniform
    equivalence and LCW statements (to make the plan
    source-minimal)
  • Uniform containment checks [Sagiv, 88]
  • Use heuristics to guide removal and pull out
    recursion first

16
LCW Statements
View: movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
To check if one rule, r1, with information source predicates
contains another rule, r2, see if r1[s ↦ s ∨ l]
contains r2[s ↦ s ∧ v]
Inter-source subsumption relations (mirror
sources) can also be handled
Etzioni et al 97, Duschka 97
17
Uniform Equivalence
  • Equivalence
  • Two datalog programs X and Y are equivalent if,
    for every set of extensional facts, the two
    programs produce the same output.
  • Undecidable
  • Uniform Equivalence
  • X and Y are uniformly equivalent if, for every
    set of extensional and intensional facts, the
    two programs produce the same output
  • Decidable
  • Implies equivalence

[Sagiv 88]
18
Testing for Uniform Containment
Does
p(X, Y) :- q(X, Y)
q(X, Y) :- r(X, Y)
uniformly contain
p(W, X) :- r(W, X) ?
Assert r(W, X) and try to derive p(W, X)
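
The test above can be sketched with a tiny bottom-up datalog evaluator. This is an illustrative implementation, not EMERAC's: atoms are tuples like `("p", "X", "Y")`, variables are strings starting with an uppercase letter, and "asserting" the body means replacing each variable with a fresh constant (assumed not to clash with constants already in the program). Requires Python 3.8+.

```python
def is_var(term):
    return term[:1].isupper()

def match(atom, fact, env):
    """Extend env so that atom equals fact, or return None if impossible."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    env = dict(env)
    for a, f in zip(atom[1:], fact[1:]):
        if is_var(a):
            if env.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return env

def evaluate(program, facts):
    """Naive fixpoint: apply every rule until no new fact is derived."""
    facts, changed = set(facts), True
    while changed:
        changed = False
        for head, body in program:
            envs = [{}]
            for atom in body:
                envs = [e2 for e in envs for f in facts
                        if (e2 := match(atom, f, e)) is not None]
            for env in envs:
                new = (head[0],) + tuple(env.get(t, t) for t in head[1:])
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

def uniformly_contains(program, rule):
    """Freeze the rule body into facts and try to derive the frozen head."""
    head, body = rule
    def freeze(atom):
        return (atom[0],) + tuple(t.lower() + "0" if is_var(t) else t
                                  for t in atom[1:])
    return freeze(head) in evaluate(program, [freeze(a) for a in body])

PROGRAM = [(("p", "X", "Y"), [("q", "X", "Y")]),
           (("q", "X", "Y"), [("r", "X", "Y")])]
RULE = (("p", "W", "X"), [("r", "W", "X")])
```

Here `uniformly_contains(PROGRAM, RULE)` holds: from the asserted fact r(w0, x0) the program derives q(w0, x0) and then p(w0, x0).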
19
Greedily Minimizing Information Gathering Plans
  • Remove non-recursive IDB predicates
  • Sort the rules so those with dom predicates come
    before those without dom predicates
  • for each rule r do
  • let r be a rule of P that has not yet been
    considered
  • let P' be the program obtained by deleting rule r
    from P
  • if P'[s ↦ s ∨ l] uniformly contains
    r[s ↦ s ∧ v] then
  • replace P with P'. Prune unreachable rules.

Source costs can be used

Uniform containment check is exponential in the
worst case
20
Minimization example
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
21
Relating binding patterns
  • Generality of binding patterns
  • S^p is more general than S^q if every attribute
    without a % (un-selectable) annotation that is
    free in q is also free in p (but not vice versa)
  • Call to S with binding pattern p will subsume the
    results of call to S with binding pattern q
  • For S($X,%Y,Z), S^bbf is more general than S^bfb
  • Holds only because of % annotations
  • #(B) is the number of bound variables in the
    binding pattern B that are not %-annotated
  • #(.) is used to relate binding patterns of
    different sources (as in bound-is-easier
    assumption)
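
A small sketch of the generality test and the bound-variable count, assuming binding patterns are strings of `b`/`f` and un-selectable attributes are given as a set of positions. `more_general` and `bound_count` are hypothetical helper names.

```python
def more_general(p, q, unselectable=frozenset()):
    """True if pattern p is strictly more general than pattern q,
    ignoring attributes at un-selectable positions."""
    def covers(a, b):
        # every selectable attribute that is free in b is also free in a
        return all(a[i] == "f" for i in range(len(b))
                   if b[i] == "f" and i not in unselectable)
    return covers(p, q) and not covers(q, p)

def bound_count(pattern, unselectable=frozenset()):
    """Number of bound attributes that are not un-selectable."""
    return sum(1 for i, c in enumerate(pattern)
               if c == "b" and i not in unselectable)
```

For a three-attribute source whose second attribute is un-selectable, `more_general("bbf", "bfb", {1})` is true, while without the annotation the two patterns are incomparable.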

22
EMERAC
Build query plan: Source completeness
Logical Optimizations: Source-minimality
Execution Optimizations: Access cost and bandwidth minimality
Execute query plan
23
Issues in ordering source calls
  • Execution cost is a function of both access cost
    and the tuple-transfer cost (ignoring local
    processing costs)
  • Tension between access costs and traffic costs
  • E.g. execute S1(W,X), S2(X,Y) where the query
    binds W
  • Tuple-transfer cost reduction motivates calling
    sources with the least general binding patterns
    possible
  • Bound-is-easier (S1 first, and then feed X
    bindings to S2)
  • Access cost reduction motivates calling sources
    with the most general binding patterns possible
  • Feeding X bindings for S2 will generate many
    separate accesses, increasing the access cost
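
The tension can be made concrete with a toy cost model. All numbers below are hypothetical and chosen only for illustration: S1 with W bound is assumed to return 50 tuples, S2 is assumed to hold 10,000 tuples of which 200 join with those X values, and local processing is ignored as on the slide.

```python
def execution_cost(calls, tuples, access_cost, transfer_cost):
    """Total cost = accesses * per-access cost + tuples moved * per-tuple cost."""
    return calls * access_cost + tuples * transfer_cost

def bound_is_easier(access_cost, transfer_cost):
    # One call to S1, then one call to S2 per X binding (50 of them).
    return execution_cost(1 + 50, 50 + 200, access_cost, transfer_cost)

def most_general(access_cost, transfer_cost):
    # One call to S1 and one unbound call to S2; the join is done locally.
    return execution_cost(1 + 1, 50 + 10_000, access_cost, transfer_cost)
```

When each access is expensive relative to a tuple (say 1.0 versus 0.001), the most general call wins; when tuples are relatively expensive (say 0.1 versus 0.01), feeding bindings wins.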

24
Our Approach: Assumptions
  • Exact optimization is not worth it
  • Lack of full source statistics
  • NP-hardness of the optimization problem
  • Join-ordering, which is a special case, is
    already NP-Complete
  • Source access costs dominate tuple-transfer costs
    by default
  • Reasonable given the large setup and latency
    costs for internet sources

25
Our Approach: Overview
  • A greedy approach (along the lines of
    bound-is-easier type procedures)
  • By default, attempts to access each source with
    the most general feasible binding pattern
  • Reasonable given the assumption that access costs
    dominate transfer costs
  • The default is over-ridden if a binding pattern
    is known to produce too much traffic
  • Binding patterns producing high traffic are
    stored in a table called HTBP
  • Implicitly produces bushy join trees

26
The HTBP Table
  • The HTBP table contains, for every source S, the
    least general binding patterns of S which are
    known to produce high traffic
  • A call to source S with binding pattern B is
    considered high-traffic producing if HTBP
    contains S^B' and B is either equal to or more
    general than B'
  • E.g. Book(Author,Title,ISBN,Subj,Price,Pages)
  • HTBP may contain all binding patterns that do not
    bind at least one of the first four attributes
  • Book^ffffbb listed explicitly in HTBP
  • Book^fffffb, Book^ffffbf, Book^ffffff would be
    considered to be implicitly in HTBP
  • Advantage: HTBP should be easy to specify even if
    full source statistics are not available
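
A sketch of the implicit-membership test for the Book example, assuming HTBP is stored as a dictionary from source name to its listed (least general) patterns and that no attribute is un-selectable. `at_least_as_general` and `is_high_traffic` are hypothetical names.

```python
def at_least_as_general(p, q):
    """p is equal to or more general than q: whatever is free in q is free in p."""
    return all(a == "f" for a, b in zip(p, q) if b == "f")

def is_high_traffic(source, pattern, htbp):
    """A call is high-traffic if it is at least as general as a listed pattern."""
    return any(at_least_as_general(pattern, listed)
               for listed in htbp.get(source, ()))

# Book(Author, Title, ISBN, Subj, Price, Pages): only the least general
# high-traffic pattern is listed explicitly.
HTBP = {"Book": ["ffffbb"]}
```

Only one entry is stored, yet `ffffbf`, `fffffb`, and `ffffff` are all reported as high-traffic, while a pattern that binds Author, such as `bfffff`, is not.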

27
The Algorithm
For each stage i from 1 to m do
  For each unchosen subgoal S
    Pick the most general feasible BP B of S w.r.t. V and FBP
    such that S^B is not in HTBP.
    If such a B exists:
      Push S^B into Ci. Mark S chosen.
      Add all variables of S to V
    If no such B exists, but there is a feasible binding
    pattern for S:
      Pick the BP B with most bound variables (in terms of #(.))
      Push S^B into Pi
  If no subgoal has been chosen at this level (Ci is empty),
  and there are some postponed sources (Pi is non-empty):
    Choose Sk^B in Pi with the maximum #(B) value
    Push Sk^B into Ci
    Add all variables of Sk to V
Return the array C1..Cm

Default case: Reduce accesses
HTBP case: Reduce transfer costs
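
The staging loop can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the EMERAC code: sources carry no must-bind or un-selectable annotations (so every pattern over already-bound variables is feasible and the bound-variable count is just the number of `b`s), "most general" is approximated by the fewest bound attributes with ties broken alphabetically, and all function names are hypothetical.

```python
from itertools import product

def candidate_patterns(args, bound):
    """All binding patterns that only bind variables already in `bound`."""
    choices = ["fb" if a in bound else "f" for a in args]
    return ["".join(p) for p in product(*choices)]

def at_least_as_general(p, q):
    return all(a == "f" for a, b in zip(p, q) if b == "f")

def high_traffic(source, pattern, htbp):
    return any(at_least_as_general(pattern, listed)
               for listed in htbp.get(source, ()))

def order_source_calls(subgoals, bound_vars, htbp):
    """subgoals: dict source name -> tuple of variables. Returns the stages C1..Cm."""
    bound, unchosen, stages = set(bound_vars), dict(subgoals), []
    for _ in range(len(subgoals)):
        chosen, postponed = [], []
        for name, args in list(unchosen.items()):
            cands = candidate_patterns(args, bound)
            ok = [p for p in cands if not high_traffic(name, p, htbp)]
            if ok:                                   # default case: fewest bindings
                chosen.append((name, min(ok, key=lambda p: (p.count("b"), p))))
                del unchosen[name]
                bound.update(args)
            elif cands:                              # HTBP case: postpone
                postponed.append((name, max(cands, key=lambda p: p.count("b"))))
        if not chosen and postponed:                 # nothing chosen: take most bound
            name, pat = max(postponed, key=lambda c: c[1].count("b"))
            chosen.append((name, pat))
            bound.update(unchosen.pop(name))
        if not chosen:
            break
        stages.append(chosen)
    return stages

# Plan body DP(A,T,Y), SM98(T,U) with Y bound by the query (Y = 1998).
SUBGOALS = {"DP": ("A", "T", "Y"), "SM98": ("T", "U")}
```

With an empty HTBP both sources land in a single stage with fully free patterns; with `{"DP": ["ffb"]}` the call to SM98 comes first and DP is then called with the title bound.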
28
Example
  • Sources: DP(A:Author, T:Title, Y:Year)
  • SM98(T:Title, U:URL)
  • Query: Q(A,T,U,1998)
  • Plan: Q(A,T,U,1998) :- DP(A,T,1998),
    SM98(T,U)

HTBP = {DP^bbb, SM98^bb} (Bound-is-easier)
  Step 1. V = {Y}. Cand: DP^fff, DP^ffb, SM98^ff (all three crossed out).
          P1 = {DP^ffb, SM98^ff}. C1 = {DP^ffb}
  Step 2. V = {A,T,Y}. Cand: SM98^ff, SM98^bf (both crossed out).
          P2 = {SM98^bf}. C2 = {SM98^bf}
HTBP = {DP^ffb}
  Step 1. V = {Y}. Cand: DP^fff, DP^ffb, SM98^ff (two crossed out).
          C1 = {SM98^ff}
  Step 2. V = {Y,U,T}. Cand: DP^fff, DP^ffb, DP^fbf, DP^fbb (three crossed out).
          C2 = {DP^fbf}
HTBP = {}
  Step 1. V = {Y}. Cand: DP^fff, DP^ffb, SM98^ff.
          C1 = {SM98^ff, DP^fff}
29
Implementation
  • Implemented the technique in the Emerac
    Information Gatherer
  • Experimented with simulated sources derived from
    DBLP data
  • Our approach tended to reduce the total cost over
    the bound-is-easier approach whenever there was a
    significant number of binding patterns not
    subsumed by HTBP
  • A prototype Information Gatherer written in Java
  • Incorporates recursive plan minimization and
    execution ordering
  • Threaded execution; partial results returned
    asynchronously
30
Implementation
  • The Emerac Information Gatherer
  • written in Java
  • incorporates rewriting and execution ordering
    techniques
  • executes plans in parallel
  • returns partial results during plan execution
  • object oriented design makes it easy to modify

31
Experiments
  • Experimented with simulated sources derived from
    DBLP data
  • Our minimization approach reduces access costs by
    removing redundant recursive sources
  • Minimization cost is offset by the improvements
    in execution time
  • Our source ordering approach tended to reduce the
    total cost over the bound-is-easier approach
    whenever there was a significant number of
    binding patterns not subsumed by HTBP

32
LCW vs. Naïve: Artificial Sources
33
LCW vs. Naïve: DBLP Sources
34
Graceful degradation
35
Contributions
  • An approach for minimizing recursive information
    gathering plans
  • An approach for ordering source calls in
    information gathering plans
  • Attempts at minimizing both access cost and
    tuple-transfer cost
  • Implementation and evaluation in EMERAC

36
Current directions
  • Integrate the minimization and source-call
    ordering phases
  • Model cost-quality tradeoffs
  • Handling run-time exceptions
  • unavailability of sources etc.
  • Tracking time and solution quality statistics
  • Improve the granularity of the HTBP table