Title: Optimizing Recursive Information Gathering Plans
1. Optimizing Recursive Information Gathering Plans
Eric Lambrecht, Subbarao Kambhampati, Senthil Gnanaprakasam
Arizona State University, Tempe, USA
rakaposhi.eas.asu.edu/yochan.html
2. Information Gathering
[Figure: a user queries the Gatherer, which reaches its sources through wrappers (db, HTML pages, cgi)]
3. EMERAC Query Planning System
- Build query plan using source inversion [Duschka (with Genesereth, Levy) 97]
- Optimization steps
  - Logical optimizations: redundancy removal
  - Execution optimizations: source call ordering
- Execute query plan
4. Organization
- Optimization challenges in EMERAC
- Building source-complete plans (review)
- Logical optimization
  - Minimization of recursive IG plans by removing redundant source calls
- Execution optimization
  - Ordering source calls to minimize both access and tuple-transfer costs
- Implementation and results
- Contributions
5. Modeling Information Gathering
- Information sources
  - relational
  - answer select queries (possibly a restricted set of query patterns)
  - autonomous
- World model
  - relational
- Query on the world model
  - Reformulate the query as calls on information sources. Optimize. Execute.
Local as View model
6. Modeling Sources
Sources are related to the world model by describing them as views over the world model
movie-hut(X, Y) → title-time(X, Y), title-actor(X, Z)
house-of-movies(X, Y) → title-time(X, Y), title-actor(X, Z)
query(X, Y) :- title-time(X, Y)
Required binding: house-of-movies needs X to be bound
7. Optimization challenges in EMERAC
Traditional
- Each relation is exported in toto by a single database
- All sources are assumed to be fully relational
Information Gathering
- Multiple sources export partial and overlapping portions of a relation
  - Need to minimize plans to remove redundancy
- Sources are rarely fully relational
  - Only limited types of queries allowed
    - Wrapped web-pages
    - Form-interfaced databases
  - Certain forms of join computation may be precluded
  - Need to model query capabilities
8. Optimization challenges in EMERAC (continued)
Traditional
- Tuple-transfer costs are assumed to dominate the query-execution costs
  - Use of the bound-is-easier assumption
- Assume availability of full source statistics
  - Selectivity indices, histograms, etc.
Information Gathering
- Access costs (source latencies) tend to equal or dominate the transfer cost
  - Need to consider the number of source calls
  - Need to consider bushy joins (instead of just left-linear join trees)
- Full statistics are rarely available about internet sources
  - Sources are decentralized and autonomous
  - Difficult to do systematic optimization
9. Source Access Limitations
- Sources can have a variety of access limitations
  - Form-interfaced databases may require certain attributes to be bound
    - Whitepages may require the name of the person
    - To get the numbers of a set of n people, we will have to access the source n times
  - ...and may be unable to handle bindings of other attributes
    - A Whitepages database may not take the address of a person as a bound attribute
    - To get the number of John Doe, who lives on Lemon St, we will have to get the numbers of all John Does, and locally filter out the ones not living on Lemon Street
  - Wrapped web-pages cannot select over any attributes
10. Representing Source Access Limitations
- Use annotations on the attributes of the source relation
  - One annotation identifies attributes that must be bound
  - A second annotation identifies un-selectable attributes
- Example: S(X,Y,Z)
  - A form-interfaced web-page that requires bindings for X and is able to do selections only on Z
- The two annotations help identify feasible binding patterns for sources
  - S^b-- patterns are feasible; S^f-- patterns are infeasible
  - S^bbf must be modeled as S^bff, filtered locally with the binding on Y
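The annotation rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `feasible_patterns` and `plan_call`, and the representation of annotations as sets of attribute names, are assumptions and not part of EMERAC.

```python
from itertools import product

def feasible_patterns(attrs, must_bind):
    """Binding patterns ('b' = bound, 'f' = free) that bind every must-bind attribute."""
    return ["".join(p) for p in product("bf", repeat=len(attrs))
            if all(p[i] == "b" for i, a in enumerate(attrs) if a in must_bind)]

def plan_call(attrs, pattern, unselectable):
    """Split a requested pattern into the pattern actually sent to the source
    and the attributes whose bindings must be applied as a local filter."""
    sent = "".join("f" if a in unselectable else c for a, c in zip(attrs, pattern))
    local = [a for a, c in zip(attrs, pattern) if c == "b" and a in unselectable]
    return sent, local

# S(X, Y, Z): X must be bound, Y is un-selectable (assumed encoding of the slide's example).
attrs = ("X", "Y", "Z")
patterns = feasible_patterns(attrs, must_bind={"X"})       # every S^b-- pattern
sent, local = plan_call(attrs, "bbf", unselectable={"Y"})   # S^bbf becomes S^bff plus a local filter on Y
```

Keeping the sent pattern and the local filter separate mirrors the slide's point that the binding on Y is still used, just not by the source itself.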
11. Properties of optimal information gathering plans
- Source-complete: no other plan returns more information using the available sources
- Source-minimal: a plan from which no information source can be removed without changing the answer it returns
- Access-cost minimal: a plan which reduces the number of separate accesses to individual sources
- Bandwidth-minimal: a plan that, when executed, transfers the smallest amount of data over the network yet is still source complete
12. Ensuring properties of optimal information gathering plans
- Build query plan: source completeness
- Logical optimizations: source-minimality
- Execution optimizations: access cost and bandwidth minimality
- Execute query plan
13. Building Source Complete Plans
[Duschka, Genesereth 97]
query(X, Y) :- title-time(X, Y)
movie-hut(X, Y) → title-time(X, Y), title-actor(X, Z)
house-of-movies(X, Y) → title-time(X, Y), title-actor(X, Z)
Source Inversion Rules
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
Binding restrictions lead to recursion in the plan
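As a rough illustration of how inversion rules like those above can be generated mechanically, here is a small Python sketch. The function `invert_view`, its parameters, and the string representation of rules are hypothetical; the sketch only covers the simple case shown on the slide (one conjunctive view per source, dom guards for attributes that must be bound).

```python
def invert_view(source, head_vars, body, skolems, must_bind=()):
    """Return inverse rules (as strings) for one source description.

    body    : list of (predicate, argument-tuple) taken from the view definition
    skolems : maps each existential variable to a function-symbol name
    """
    call = f"{source}({', '.join(head_vars)})"
    # Sources with required bindings are guarded by dom() subgoals.
    rhs = ", ".join([f"dom({v})" for v in must_bind] + [call])
    rules = []
    for pred, args in body:
        # Variables that the source does not export become function terms.
        terms = [a if a in head_vars else f"{skolems[a]}({', '.join(head_vars)})"
                 for a in args]
        rules.append(f"{pred}({', '.join(terms)}) :- {rhs}")
    # Values returned by a source may later be fed to sources that need bindings.
    rules += [f"dom({v}) :- {rhs}" for v in head_vars if v not in must_bind]
    return rules

body = [("title-time", ("X", "Y")), ("title-actor", ("X", "Z"))]
plan = (invert_view("movie-hut", ("X", "Y"), body, {"Z": "f1"})
        + invert_view("house-of-movies", ("X", "Y"), body, {"Z": "f2"}, must_bind=("X",))
        + ["query(X, Y) :- title-time(X, Y)"])
print("\n".join(plan))
```

The rule `dom(Y) :- dom(X), house-of-movies(X, Y)` is where the recursion enters: dom appears on both sides once a source requires a binding.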
14. Problems with Plans derived from source inversion rules
- Every source that is remotely relevant to the query is made part of the plan
  - Many of these sources may be overlapping
- If both movie-hut and house-of-movies have the same information
  - both sources are not necessary
  - the recursion is not necessary
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
15. Minimizing information gathering plans
- Model source overlaps
  - Use LCW statements
- Rewrite the source-complete plan
  - Greedily remove rules from the plan using uniform equivalence and LCW statements (to make the plan source-minimal)
  - Uniform containment checks [Sagiv, 88]
  - Use heuristics to guide removal and pull out recursion first
16. LCW Statements
View: movie-hut(X, Y) → title-time(X, Y), title-actor(X, Z)
LCW: movie-hut(X, Y) ← title-time(X, Y), title-actor(X, Z)
To check if one rule, r1, with information source predicates contains another rule, r2, see if r1[s ↦ s ∨ l] contains r2[s ↦ s ∧ v]
Inter-source subsumption relations and mirror sources can also be handled
[Etzioni et al 97, Duschka 97]
17. Uniform Equivalence
- Equivalence
  - Two datalog programs X and Y are equivalent if, for every set of facts for the extensional predicates, the two programs produce the same output
  - Undecidable
- Uniform Equivalence
  - X and Y are uniformly equivalent if, for every set of facts for the extensional and intensional predicates, the two programs produce the same output
  - Decidable
  - Implies equivalence
[Sagiv 88]
18. Testing for Uniform Containment
Does
p(X, Y) :- q(X, Y)
q(X, Y) :- r(X, Y)
uniformly contain
p(W, X) :- r(W, X) ?
Test: assert r(W, X) and try to derive p(W, X)
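The test above (freeze the rule's variables into constants, assert its body, and try to derive its head) can be sketched with a tiny bottom-up evaluator. This is a minimal sketch for function-free datalog only; the data representation (atoms as `(predicate, args)` pairs, upper-case strings as variables) and all function names are assumptions made for the example.

```python
def is_var(term):
    return term[:1].isupper()          # convention assumed here: variables start upper-case

def match(atom, fact, env):
    """Extend env so that atom equals fact, or return None if impossible."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    env = dict(env)
    for a, v in zip(args, fargs):
        if is_var(a):
            if env.setdefault(a, v) != v:
                return None
        elif a != v:
            return None
    return env

def derive(program, facts):
    """Naive bottom-up evaluation to a fixpoint."""
    facts, changed = set(facts), True
    while changed:
        changed = False
        for head, body in program:
            envs = [{}]
            for atom in body:
                envs = [e2 for e in envs for f in facts
                        if (e2 := match(atom, f, e)) is not None]
            for e in envs:
                new = (head[0], tuple(e.get(a, a) for a in head[1]))
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

def uniformly_contains(program, rule):
    """Freeze the rule's variables, assert its body, and look for its head."""
    freeze = lambda atom: (atom[0], tuple("_c_" + a if is_var(a) else a for a in atom[1]))
    head, body = rule
    return freeze(head) in derive(program, [freeze(b) for b in body])

P = [(("p", ("X", "Y")), [("q", ("X", "Y"))]),
     (("q", ("X", "Y")), [("r", ("X", "Y"))])]
r = (("p", ("W", "X")), [("r", ("W", "X"))])
print(uniformly_contains(P, r))        # True: r(w, x) yields q(w, x), then p(w, x)
```

The evaluator is deliberately naive; it is meant to show the shape of the check, not to be efficient.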
19. Greedily Minimizing Information Gathering Plans
- Remove non-recursive IDB predicates
- Sort the rules so those with dom predicates come before those without dom predicates
- for each rule r do
  - let r be a rule of P that has not yet been considered
  - let P' be the program obtained by deleting rule r from P
  - if P'[s ↦ s ∨ l] uniformly contains r[s ↦ s ∧ v] then
    - replace P with P'. Prune unreachable rules.
Source costs can be used
Uniform containment check is exponential in the worst case
20. Minimization example
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
LCW: movie-hut(X, Y) ← title-time(X, Y), title-actor(X, Z)
21. Relating binding patterns
- Generality of binding patterns
  - S^p is more general than S^q if every attribute not annotated as un-selectable that is free in q is also free in p (but not vice versa)
  - A call to S with binding pattern p will subsume the results of a call to S with binding pattern q
  - For S(X,Y,Z), S^bbf is more general than S^bfb
    - Holds only because of the annotations
- (B) is the number of bound variables in the binding pattern B that are not annotated as un-selectable
  - (.) is used to relate binding patterns of different sources (as in the bound-is-easier assumption)
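The generality relation and the bound-variable count can be written down directly. The following is a sketch under the assumption that un-selectable attributes are given as a set of names; `more_general` and `bound_count` are illustrative names, not identifiers from EMERAC.

```python
def more_general(p, q, attrs, unselectable=()):
    """True if pattern p is strictly more general than pattern q,
    ignoring attributes annotated as un-selectable."""
    keep = [i for i, a in enumerate(attrs) if a not in unselectable]
    free_p = {i for i in keep if p[i] == "f"}
    free_q = {i for i in keep if q[i] == "f"}
    # Everything free in q is free in p, but not vice versa.
    return free_q.issubset(free_p) and free_q != free_p

def bound_count(b, attrs, unselectable=()):
    """Number of bound attributes in b that are not annotated as un-selectable."""
    return sum(1 for a, c in zip(attrs, b) if c == "b" and a not in unselectable)

attrs = ("X", "Y", "Z")
with_annotation = more_general("bbf", "bfb", attrs, unselectable={"Y"})
without_annotation = more_general("bbf", "bfb", attrs)
```

With Y marked un-selectable the comparison only looks at X and Z, which is why S^bbf comes out more general than S^bfb; without the annotation the two patterns are incomparable.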
22. EMERAC
- Build query plan: source completeness
- Logical optimizations: source-minimality
- Execution optimizations: access cost and bandwidth minimality
- Execute query plan
23. Issues in ordering source calls
- Execution cost is a function of both access cost and the tuple-transfer cost (ignoring local processing costs)
- Tension between access costs and traffic costs
  - E.g., execute the join of S1(W,X) and S2(X,Y) where the query binds W
  - Tuple-transfer cost reduction motivates calling sources with the least general binding patterns possible
    - Bound-is-easier (S1 first, and then feed X bindings to S2)
  - Access cost reduction motivates calling sources with the most general binding patterns possible
    - Feeding X bindings for S2 will generate many separate accesses, increasing the access cost
24. Our Approach: Assumptions
- Exact optimization is not worth it
  - Lack of full source statistics
  - NP-hardness of the optimization problem
    - Join-ordering, which is a special case, is already NP-Complete
- Source access costs dominate tuple-transfer costs by default
  - Reasonable given the large setup and latency costs for internet sources
25. Our Approach: Overview
- A greedy approach (along the lines of bound-is-easier type procedures)
- By default, attempts to access each source with the most general feasible binding pattern
  - Reasonable given the assumption that access costs dominate transfer costs
- The default is overridden if a binding pattern is known to produce too much traffic
  - Binding patterns producing high traffic are stored in a table called HTBP
- Implicitly produces bushy join trees
26. The HTBP Table
- The HTBP table contains, for every source S, the least general binding patterns of S which are known to produce high traffic
- A call to source S with binding pattern B is considered high-traffic producing if HTBP contains S^B' and B is either equal to or more general than B'
- E.g., Book(Author,Title,ISBN,Subj,Price,Pages)
  - HTBP may contain all binding patterns that do not bind at least one of the first four attributes
  - Book^ffffbb listed explicitly in HTBP
  - Book^fffffb, Book^ffffbf, Book^ffffff would be considered to be implicitly in HTBP
- Advantage: HTBP should be easy to specify even if full source statistics are not available
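A lookup against the HTBP table could look like the sketch below. The dictionary layout and the name `is_high_traffic` are assumptions for illustration; the sketch also ignores attribute annotations, so "more general" here simply means that every attribute free in the listed pattern is also free in the queried one.

```python
# Only the least general high-traffic patterns are stored explicitly.
HTBP = {"Book": ["ffffbb"]}   # Book(Author, Title, ISBN, Subj, Price, Pages)

def is_high_traffic(source, pattern, htbp=HTBP):
    """A pattern is high-traffic if it equals, or is more general than, a listed one."""
    def at_least_as_general(p, q):
        return all(pc == "f" for pc, qc in zip(p, q) if qc == "f")
    return any(at_least_as_general(pattern, listed) for listed in htbp.get(source, []))

for bp in ("ffffbb", "fffffb", "ffffbf", "ffffff", "bfffbb"):
    print(bp, is_high_traffic("Book", bp))
```

Only the last pattern, which binds Author, falls outside the table; the other four are covered by the single explicit entry.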
27. The Algorithm
For each stage i from 1 to m do
    For each unchosen subgoal S
        Pick the most general feasible BP B of S w.r.t. V (B in FBP) such that S^B is not in HTBP
        If such a B exists
            Push S^B into C_i. Mark S chosen. Add all variables of S to V
        If no such B exists, but there is a feasible binding pattern for S
            Pick the BP B' with the most bound variables (in terms of (.))
            Push S^B' into P_i
    If no subgoal has been chosen at this level (C_i is empty), and there are some postponed sources (P_i is non-empty)
        Choose S_k^B in P_i with the maximum (B) value
        Push S_k^B into C_i
        Add all variables of S_k to V
Return the array C[1..m]
Default case: reduce accesses
HTBP case: reduce transfer costs
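The greedy procedure above can be sketched in Python as follows. Several details are assumptions made to keep the sketch short and deterministic: V is only updated at the end of each stage, "most general" is approximated by "fewest bound attributes", attribute annotations are reduced to an optional hypothetical `must_bind` map, and the loop runs until every subgoal is placed rather than for a fixed m. The subgoals and HTBP contents in the example (DP, SM98) are illustrative.

```python
from itertools import product

def feasible(attrs, known, must_bind=()):
    """Patterns that bind only known variables and bind every must-bind attribute."""
    out = []
    for bits in product("fb", repeat=len(attrs)):
        if all(b == "f" or a in known for a, b in zip(attrs, bits)) and \
           all(b == "b" for a, b in zip(attrs, bits) if a in must_bind):
            out.append("".join(bits))
    return out

def high_traffic(name, pat, htbp):
    """pat equals, or is more general than, a pattern listed for this source."""
    return any(all(pc == "f" for pc, qc in zip(pat, h) if qc == "f")
               for h in htbp.get(name, ()))

def order_calls(subgoals, bound_vars, htbp, must_bind=None):
    must_bind, known = must_bind or {}, set(bound_vars)
    unchosen, stages = dict(subgoals), []
    while unchosen:
        chosen, postponed = [], []
        for name, attrs in unchosen.items():
            cands = feasible(attrs, known, must_bind.get(name, ()))
            if not cands:
                continue                       # not callable yet; retry in a later stage
            ok = [p for p in cands if not high_traffic(name, p, htbp)]
            if ok:                             # default case: reduce accesses
                chosen.append((name, min(ok, key=lambda p: p.count("b"))))
            else:                              # HTBP case: reduce transfer costs
                postponed.append((name, max(cands, key=lambda p: p.count("b"))))
        if not chosen:
            if not postponed:
                raise ValueError("remaining subgoals have no feasible binding pattern")
            chosen = [max(postponed, key=lambda c: c[1].count("b"))]
        for name, _ in chosen:
            known.update(unchosen.pop(name))   # their variables are bound from now on
        stages.append(chosen)
    return stages

subgoals = {"DP": ("A", "T", "Y"), "SM98": ("T", "U")}   # the query binds Y
print(order_calls(subgoals, {"Y"}, {"DP": ["ffb"]}))
# [[('SM98', 'ff')], [('DP', 'fbf')]]
```

Calls that land in the same stage have no data dependency on each other, which is how the ordering implicitly yields bushy rather than left-linear join trees.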
28. Example
- Sources: DP(A:Author, T:Title, Y:Year), SM98(T:Title, U:URL)
- Query: Q(A,T,U,1998)
- Plan: Q(A,T,U,1998) :- DP(A,T,1998), SM98(T,U)
(XX marks a rejected candidate)
HTBP = {DP^bbb, SM98^bb} (same ordering as bound-is-easier)
  Step 1. V = {Y}. Cand: DP^fff XX, DP^ffb XX, SM98^ff XX. P1 = {DP^ffb, SM98^ff}. C1 = {DP^ffb}
  Step 2. V = {A,T,Y}. Cand: SM98^ff XX, SM98^bf XX. P2 = {SM98^bf}. C2 = {SM98^bf}
HTBP = {DP^ffb}
  Step 1. V = {Y}. Cand: DP^fff XX, DP^ffb XX, SM98^ff. C1 = {SM98^ff}
  Step 2. V = {Y,U,T}. Cand: DP^fff XX, DP^ffb XX, DP^fbf, DP^fbb XX. C2 = {DP^fbf}
HTBP = {}
  Step 1. V = {Y}. Cand: DP^fff, DP^ffb, SM98^ff. C1 = {SM98^ff, DP^fff}
29. Implementation
- Implemented the technique in the Emerac Information Gatherer
- Experimented with simulated sources derived from DBLP data
  - Our approach tended to reduce the total cost over the bound-is-easier approach whenever there was a significant number of binding patterns not subsumed by HTBP
- A prototype Information Gatherer written in Java
  - Incorporates recursive plan minimization and execution ordering
  - Threaded execution
  - Partial results returned asynchronously
30. Implementation
- The Emerac Information Gatherer
  - written in Java
  - incorporates rewriting and execution ordering techniques
  - executes plans in parallel
  - returns partial results during plan execution
  - object-oriented design makes it easy to modify
31. Experiments
- Experimented with simulated sources derived from DBLP data
- Our minimization approach reduces access costs by removing redundant recursive sources
  - Minimization cost is offset by the improvements in execution time
- Our source ordering approach tended to reduce the total cost over the bound-is-easier approach whenever there was a significant number of binding patterns not subsumed by HTBP
32. LCW vs. Naïve: Artificial Sources
33. LCW vs. Naïve: DBLP Sources
34. Graceful degradation
35. Contributions
- An approach for minimizing recursive information gathering plans
- An approach for ordering source calls in information gathering plans
  - Attempts at minimizing both access cost and tuple-transfer cost
- Implementation and evaluation in EMERAC
36. Current directions
- Integrate minimization and source-call ordering phases
- Model cost-quality tradeoffs
- Handling run-time exceptions
  - unavailability of sources etc.
- Tracking time and solution quality statistics
- Improve the granularity of the HTBP table