Efficiently gathering information on the Internet using AI - PowerPoint PPT Presentation

1
Efficiently gathering information on the Internet
using AI and DB techniques: The EMERAC PROJECT
Subbarao Kambhampati, Arizona State University,
Tempe, USA
rakaposhi.eas.asu.edu/yochan.html
2
Motivation
  • A lot of data on the WWW is actually generated by
    databases
  • 80% of the web is hidden: dynamically generated
    in response to queries from users
  • Would be nifty if we can do database-style
    querying of the web.

3
Information Gathering
[Figure: user, Gatherer, wrappers, and sources (db, <html>, cgi)]
4
Meta-search engines
  • Text-based
  • No access to hidden web
5
Comparison shoppers (Junglee, Netbot, DealPilot.Com)
  • Call every source
  • Collate results
6
  • CORA allows more sophisticated queries
  • E.g. all papers that cite Rao, but are not written by
    Rao
  • Neither CORA nor DBLP is complete
  • CORA tends to be more complete for online papers
    and AI papers
  • DBLP sticks to published papers, and is
    more complete in DB coverage
  • Both sources provide Englishified-BIBTEX
    citations

7
Data representation
  • Global as View (GAV)
  • The global (mediated) schema is written as a view
    on the sources (databases)
  • Simpler query processing (no reformulation)
  • Less modular (schema changes
    when new sources are added)
  • Donaji

8
Data Representation --2
  • Local as View (LAV)
  • Modular
  • New sources can be added without changing the
    global schema
  • Needs more sophisticated query processing
  • User query needs to be reformulated into source
    calls
  • Compiling LAV into GAV

9
Data location
  • Warehousing vs. Virtual (on-line) sources
  • Warehousing avoids the problems of the net
  • Data may get stale.
  • There may be too much data.
  • You may not be allowed to shift the data over
  • Virtual Source method accesses the sources on
    demand
  • Has to handle internet problems such as
    bursty traffic, setup delays etc.
  • Hybrid
  • Selective materialization; caching etc.

10
Tricky issues
  • Sources are not really databases!
  • Legacy systems
  • Limited access patterns
  • (Can't ask a white-pages source for the list of
    all numbers)
  • Limited local processing power
  • Typically only selections (on certain attributes)
    are supported
  • Sources are autonomous
  • Unregulated data overlap
  • Lack of full statistics on the sources

11
EMERAC Query Planning System
  • Build query plan using source inversion
    [Duschka (with Genesereth, Levy) 97]
  • Optimization steps
  • Logical optimizations: redundancy removal
  • Execution optimizations: source call ordering
  • Execute query plan
12
Desirable Properties of information gathering
plans
  • Source-complete: no other plan returns more
    information using the available sources
  • Different from the traditional query
    equivalence requirement
  • Source-minimal: a plan from which no information
    source can be removed while the plan still returns the
    same answer.
  • Access-cost minimal: a plan which reduces the
    number of separate accesses to individual sources
  • Bandwidth-minimal: a plan that, when executed,
    transfers the smallest amount of data over the
    network yet is still source-complete

13
Ensuring properties of optimal information
gathering plans
  • Build query plan: source completeness
  • Logical optimizations: source-minimality
  • Execution optimizations: access cost and bandwidth minimality
  • Execute query plan
14
Modeling Information Gathering in EMERAC
  • Information sources
  • relational
  • answer select queries (possibly a restricted
    set of query patterns)
  • autonomous
  • World model
  • relational
  • Query on the world model
  • Reformulate the query as calls on information
    sources. Optimize. Execute.

Local as View model
15
Source Access Limitations
  • Sources can have a variety of access limitations
  • Form-interfaced databases may require certain
    attributes to be bound
  • Whitepages may require the name of the person
  • To get the numbers of a set of n people, we will
    have to access the source n times
  • Form-interfaced databases may also be unable to
    handle bindings of other attributes
  • A Whitepages database may not take the address of
    a person as a bound attribute
  • To get the number of John Doe, who lives on Lemon
    St, we will have to get the numbers of all John
    Does, and locally filter the ones not living on
    Lemon Street
  • Wrapped web-pages cannot select over any
    attributes
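
The white-pages case above can be sketched in a few lines. This is only an illustration: the `WhitePages` class, its rows and its `lookup` method are hypothetical stand-ins for a wrapped form interface that must be given a name and cannot select on address.

```python
class WhitePages:
    """Hypothetical wrapped source: the name MUST be bound, address cannot be selected on."""

    def __init__(self, rows):
        self.rows = rows      # (name, address, number) tuples held by the remote source
        self.calls = 0        # number of separate accesses made so far

    def lookup(self, name):
        self.calls += 1
        return [row for row in self.rows if row[0] == name]


def numbers_for(source, names):
    # n people means n separate accesses, because the name has to be bound each time.
    return {name: [number for _, _, number in source.lookup(name)] for name in names}


def number_at(source, name, street):
    # The source cannot take the address as a binding, so fetch every
    # matching name and filter on the street locally.
    return [number for _, address, number in source.lookup(name) if street in address]


pages = WhitePages([
    ("John Doe", "12 Lemon St", "555-0101"),
    ("John Doe", "9 Oak Ave", "555-0102"),
    ("Jane Roe", "3 Lemon St", "555-0103"),
])
lemon_numbers = number_at(pages, "John Doe", "Lemon St")
both = numbers_for(pages, ["John Doe", "Jane Roe"])
```

After these two requests `pages.calls` is 3: one access for the filtered lookup and one per name for the batch.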

16
Representing Source Access Limitations
  • Use annotations on the attributes of the source
    relation
  • One annotation identifies attributes that must be
    bound
  • Another annotation identifies un-selectable
    attributes
  • E.g. S(X,Y,Z) with X marked must-be-bound and Y
    marked un-selectable
  • A form-interfaced web-page that requires bindings
    for X and is able to do selections only on Z.
  • The two annotations help identify feasible
    binding patterns for sources (b = bound, f = free,
    one letter per attribute)
  • Sb-- are feasible; Sf-- are infeasible
  • Sbbf must be modeled as Sbff, filtered locally
    with the binding on Y
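
A minimal sketch of how such annotations could be turned into feasible binding patterns. The function names and the way annotations are passed (as two sets of attribute names) are assumptions made for illustration, not EMERAC's actual representation.

```python
from itertools import product


def feasible_patterns(attrs, must_bind=(), unselectable=()):
    """Enumerate b/f patterns that respect the annotations on a source relation."""
    patterns = []
    for bits in product("bf", repeat=len(attrs)):
        violates = any(
            (a in must_bind and b == "f") or (a in unselectable and b == "b")
            for a, b in zip(attrs, bits)
        )
        if not violates:
            patterns.append("".join(bits))
    return patterns


def split_request(attrs, requested, unselectable=()):
    """Rewrite a requested pattern into the pattern sent to the source
    plus the attributes that must be filtered locally."""
    sent = "".join("f" if a in unselectable else b for a, b in zip(attrs, requested))
    local = [a for a, b in zip(attrs, requested) if b == "b" and a in unselectable]
    return sent, local


attrs = ("X", "Y", "Z")
patterns = feasible_patterns(attrs, must_bind={"X"}, unselectable={"Y"})
sent, local = split_request(attrs, "bbf", unselectable={"Y"})
```

For S(X,Y,Z) with X must-be-bound and Y un-selectable, `patterns` holds only `bfb` and `bff`, and a request for `bbf` is sent as `bff` with Y filtered locally.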

17
Modeling Sources
Sources related to world model by describing them
as views over world model
  • Source description restricted to conjunctive queries (SPJ)
movie-hut(X, Y, Z) -> title-time(X, Y), title-actor(X, Z)
house-of-movies(X, Y, Z) -> title-time(X, Y), title-actor(X, Z)
query(X, Y) :- title-time(X, Y)
Required binding..
18
Computing source-complete plans
  • Invert the source descriptions
  • Plans for individual world relations
  • Concatenate the query and the source inversion
    rules
  • A datalog program which when executed will
  • return all accessible tuples

movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
title-time(X, Y) :- movie-hut(X, Y, Z)
house-of-movies(X, Y) -> title-time(X, Y), title-actor(X, Z)
title-time(X, Y) :- dom(X), house-of-movies(X, Y, Z)
19
Building Source Complete Plans
[Duschka, Genesereth 97]
query(X, Y) :- title-time(X, Y)
movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
house-of-movies(X, Y) -> title-time(X, Y), title-actor(X, Z)
Source Inversion Rules
title-time(X, Y) :- movie-hut(X, Y, Z)
title-actor(X, Z) :- movie-hut(X, Y, Z)
dom(X) :- movie-hut(X, Y, Z)
dom(Y) :- movie-hut(X, Y, Z)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
Binding restrictions lead to recursion in the plan
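
The recursion through dom can be made concrete with a small bottom-up evaluation. This is a sketch under stated assumptions: the two in-memory sets stand in for remote sources, the data is invented, and `house_of_movies` is modeled as requiring its first argument to be bound.

```python
# Hypothetical source contents; in a real gatherer these live behind wrappers.
MOVIE_HUT = {("Alien", "7pm"), ("Brazil", "9pm")}
HOUSE_OF_MOVIES = {("Alien", "10pm"), ("Brazil", "9pm"), ("Casablanca", "8pm")}


def movie_hut():
    """Source with no binding restrictions: returns everything."""
    return set(MOVIE_HUT)


def house_of_movies(title):
    """Source that must be called with its first attribute bound."""
    return {(x, y) for (x, y) in HOUSE_OF_MOVIES if x == title}


def run_plan():
    # title-time(X, Y) :- movie-hut(X, Y);  dom(X), dom(Y) :- movie-hut(X, Y)
    title_time = movie_hut()
    dom = {value for row in title_time for value in row}
    asked = set()
    # title-time(X, Y) :- dom(X), house-of-movies(X, Y)
    # dom(Y)           :- dom(X), house-of-movies(X, Y)   <- the recursive rule
    while dom - asked:
        x = (dom - asked).pop()
        asked.add(x)
        for title, time in house_of_movies(x):
            title_time.add((title, time))
            dom.add(time)          # new constants may enable further calls
    return title_time, asked       # query(X, Y) :- title-time(X, Y)


answers, bindings_tried = run_plan()
```

The Casablanca tuple is never returned because no known constant can be fed to `house_of_movies` to reach it, which is what "all accessible tuples" means here. The value `"10pm"` is tried as a binding only because an earlier call produced it.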
20
Complexity of finding maximally-contained plans
(Certain answers)
  • Source inversion approach has poly-time
    complexity for the case considered in EMERAC
  • Complexity doesn't depend on the query
  • Can handle recursive queries just as easily
  • Complexity does change if the sources are not
    conjunctive queries
  • Sources as unions of conjunctive queries
    (NP-hard)
  • Sources as recursive queries (Undecidable)
  • Comparison predicates
  • Complexity also changes based on Open vs. Closed
    world assumption

21
Practical Problems with Plans derived from source
inversion rules
  • Every source that is remotely relevant to the
    query is made part of the plan
  • Many of these sources may be overlapping

title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
  • If both movie-hut and house-of-movies have the same
    information
  • both sources are not necessary
  • the recursion is not necessary

title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
22
Optimization challenges in EMERAC
Traditional
  • Each relation is exported in toto by a single
    database
  • All sources are assumed to be fully relational
Information Gathering
  • Multiple sources export partial and overlapping
    portions of a relation
  • Need to minimize plans to remove redundancy
  • Sources are rarely fully relational
  • Only limited types of queries allowed
  • Wrapped web-pages
  • Form-interfaced databases
  • Certain forms of join computation may be
    precluded
  • Need to model query capabilities

23
Minimizing information gathering plans
  • Model source overlaps
  • Use LCW statements
  • Rewrite the source-complete plan
  • Greedily remove rules from plan with uniform
    equivalence and LCW statements (to make the plan
    source-minimal)
  • Uniform containment checks [Sagiv, 88]
  • Use heuristics to guide removal and pull out
    recursion first

24
LCW Statements
View: movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
To check if one rule, r1, with information source predicates
contains another rule, r2: see if r1, with each source predicate s
expanded using its LCW statement l, contains r2, with each source
predicate s expanded using its view description v
movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z), Z = "Allen"
Inter-source subsumption relations; mirror sources can also be handled
[Etzioni et al. 97, Duschka 97]
25
Testing for Uniform Containment
Does
  p(X, Y) :- q(X, Y)
  q(X, Y) :- r(X, Y)
uniformly contain
  p(W, X) :- r(W, X) ?
  • Assert r(W, X) and try to derive p(W, X) using
    bottom-up evaluation
  • Exponential complexity...
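
The test on this slide can be sketched as follows: freeze the variables of the candidate rule into fresh constants, assert its body as facts, and see whether its head is derivable bottom-up. The rule encoding (tuples of predicate name and argument tuple, uppercase strings as variables) is an assumption for illustration, and the naive evaluator handles only function-free rules.

```python
def is_var(term):
    return term[:1].isupper()


def match(atom, fact, env):
    """Extend env so that atom equals the ground fact, or return None."""
    pred, args = atom
    if pred != fact[0] or len(args) != len(fact[1]):
        return None
    env = dict(env)
    for arg, value in zip(args, fact[1]):
        if is_var(arg):
            if env.setdefault(arg, value) != value:
                return None
        elif arg != value:
            return None
    return env


def derive(rules, facts):
    """Naive bottom-up evaluation to a fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            envs = [{}]
            for atom in body:
                envs = [e2 for e in envs for f in facts
                        if (e2 := match(atom, f, e)) is not None]
            for env in envs:
                new = (head[0], tuple(env.get(a, a) for a in head[1]))
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts


def uniformly_contains(program, rule):
    head, body = rule
    # Freeze each variable into a lowercase constant, e.g. W -> "w0".
    freeze = lambda atom: (atom[0], tuple(a.lower() + "0" if is_var(a) else a
                                          for a in atom[1]))
    return freeze(head) in derive(program, {freeze(atom) for atom in body})


program = [(("p", ("X", "Y")), [("q", ("X", "Y"))]),
           (("q", ("X", "Y")), [("r", ("X", "Y"))])]
contained = uniformly_contains(program, (("p", ("W", "X")), [("r", ("W", "X"))]))
not_contained = uniformly_contains(program, (("p", ("W", "X")), [("s", ("W", "X"))]))
```

For the two-rule program on the slide, `contained` is true because r(w0, x0) yields q(w0, x0) and then p(w0, x0). The second rule is not contained because nothing can be derived from an s fact.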
26
Greedily Minimizing Information Gathering Plans
  • Remove non-recursive IDB predicates
  • Sort the rules so those with dom predicates come
    before those without dom predicates
  • for each rule r do
  • let r be a rule of P that has not yet been
    considered
  • let P' be the program obtained by deleting rule r
    from P
  • if P' (source predicates expanded with LCW
    statements) uniformly contains r (source
    predicates expanded with views) then
  • replace P with P'. Prune unreachable rules.

Source costs can be used

Uniform containment check is exponential in the
worst case
27
Minimization example
title-time(X, Y) -
movie-hut(X, Y)
ltX, f1(X, Y)gt
title-actor (X, X, Y) - movie-hut(X,
Y)
dom(X) - movie-hut(X, Y)
dom(Y) - movie-hut(X, Y)
title-time(X, Y) - dom(X),
house-of-movies(X, Y)
ltX, f2(X, Y)gt
title-actor (X, X, Y) - dom(X),
house-of-movies(X, Y)
dom(Y) - dom(X), house-of-movies(X, Y)
query(X, Y) - title-time(X, Y)
movie-hut(X, Y) lt- title-time(X, Y),
title-actor(X, Z)
28
LCW vs. Naïve Artificial Sources
29
EMERAC
  • Build query plan: source completeness
  • Logical optimizations: source-minimality
  • Execution optimizations: access cost and bandwidth minimality
  • Execute query plan
30
Optimization challenges in EMERAC
Traditional
  • Each relation is exported in toto by a single
    database
  • All sources are assumed to be fully relational
Information Gathering
  • Multiple sources export partial and overlapping
    portions of a relation
  • Need to minimize plans to remove redundancy
  • Sources are rarely fully relational
  • Only limited types of queries allowed
  • Wrapped web-pages
  • Form-interfaced databases
  • Certain forms of join computation may be
    precluded
  • Need to model query capabilities

31
Continued
Optimization challenges in EMERAC
Traditional
  • Tuple-transfer costs are assumed to dominate the
    query-execution costs
  • Use of Bound-is-easier assumption
  • Assume availability of full source-statistics
  • Selectivity indices, histograms etc.
Information Gathering
  • Access costs (source latencies) tend to equal or
    dominate the transfer cost
  • Need to consider number of source calls
  • Need for considering bushy joins (instead of just
    left-linear join trees)
  • Full statistics are rarely available about
    internet sources
  • Sources are decentralized and autonomous
  • Difficult to do systematic optimization

32
Issues in ordering source calls
  • Execution cost is a function of both access cost
    and the tuple-transfer cost (ignoring local
    processing costs)
  • Tension between access costs and traffic costs
  • E.g. execute S1(W,X) join S2(X,Y), where the query
    binds W
  • Tuple-transfer cost reduction motivates calling
    sources with the least general binding patterns
    possible
  • Bound-is-easier (S1 first, and then feed X
    bindings to S2)
  • Access cost reduction motivates calling sources
    with the most general binding patterns possible
  • Feeding X bindings for S2 will generate many
    separate accesses, increasing the access cost
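
The tension can be seen with a toy cost model. Everything numeric here is an invented assumption: the per-call access cost, the per-tuple transfer cost and the source sizes are illustrative, and local processing is ignored as on the slide.

```python
def plan_cost(calls, tuples, access_cost, transfer_cost):
    """Execution cost = accesses * access cost + tuples moved * transfer cost."""
    return calls * access_cost + tuples * transfer_cost


def compare(n_x, s2_size, s2_per_x, access_cost, transfer_cost):
    """Cost of two ways to run S1(W,X) join S2(X,Y) when W is bound.

    n_x       -- number of X values S1 returns for the bound W
    s2_size   -- total tuples in S2
    s2_per_x  -- S2 tuples matching one X value
    """
    # Bound-is-easier: one call to S1, then one S2 call per X binding.
    feed_bindings = plan_cost(1 + n_x, n_x + n_x * s2_per_x, access_cost, transfer_cost)
    # Most general pattern: one call each, ship all of S2, join locally.
    fetch_all = plan_cost(2, n_x + s2_size, access_cost, transfer_cost)
    return feed_bindings, fetch_all


latency_bound = compare(n_x=50, s2_size=1000, s2_per_x=2, access_cost=100, transfer_cost=1)
transfer_bound = compare(n_x=50, s2_size=1000, s2_per_x=2, access_cost=1, transfer_cost=1)
```

With an expensive access (100 per call) feeding bindings costs 5250 against 1250 for fetching all of S2. With cheap accesses the ordering flips to 201 against 1052.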

33
Our Approach: Assumptions
  • Exact optimization is not worth it
  • Lack of full source statistics
  • NP-hardness of the optimization problem
  • Join-ordering, which is a special case, is
    already NP-Complete
  • Source access costs dominate tuple-transfer costs
    by default
  • Reasonable given the large setup and latency
    costs for internet sources

34
Our Approach: Overview
  • A greedy approach (along the lines of
    bound-is-easier type procedures)
  • By default, attempts to access each source with
    the most general feasible binding pattern
  • Reasonable given the assumption that access costs
    dominate transfer costs
  • The default is over-ridden if a binding pattern
    is known to produce too much traffic
  • Binding patterns producing high traffic are
    stored in a table called HTBP
  • Implicitly produces bushy join trees

35
The HTBP Table
  • The HTBP table contains, for every source S, the
    least general binding patterns of S which are
    known to produce high traffic
  • A call to source S with binding pattern B is
    considered high-traffic producing if HTBP
    contains S with a pattern B' and B is either equal
    to or more general than B'
  • E.g. Book(Author,Title,ISBN,Subj,Price,Pages)
  • HTBP may contain all binding patterns that do not
    bind at least one of the first four attributes
  • Bookffffbb listed explicitly in HTBP
  • Bookfffffb, Bookffffbf and Bookffffff would be
    considered to be implicitly in HTBP
  • Advantage: HTBP should be easy to specify even if
    full source statistics are not available
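
A minimal sketch of the HTBP lookup, assuming binding patterns are strings of `b` and `f` and the table is a dictionary from source name to a set of patterns (both representation choices are assumptions for illustration).

```python
def more_general_or_equal(pattern, other):
    """True if every attribute bound in `pattern` is also bound in `other`."""
    return all(p == "f" or o == "b" for p, o in zip(pattern, other))


def is_high_traffic(source, pattern, htbp):
    """A call is high-traffic if it is equal to or more general than a listed pattern."""
    return any(more_general_or_equal(pattern, listed) for listed in htbp.get(source, ()))


# Book(Author, Title, ISBN, Subj, Price, Pages): only the least general
# high-traffic pattern is listed explicitly.
HTBP = {"Book": {"ffffbb"}}

explicit = is_high_traffic("Book", "ffffbb", HTBP)
implicit = [is_high_traffic("Book", p, HTBP) for p in ("fffffb", "ffffbf", "ffffff")]
author_bound = is_high_traffic("Book", "bfffff", HTBP)
```

Binding the author takes the call out of the high-traffic set, while the three patterns that bind only price or pages (or nothing) are covered implicitly by the single listed entry.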

36
The Algorithm
For each stage i from 1 to m do
  For each unchosen subgoal S
    pick the most general feasible BP B of S w.r.t. V and FBP
    such that S with B is not in HTBP.
    If such a B exists (default case: reduce accesses)
      Push S with B into Ci. Mark S chosen.
      Add all variables of S to V
    If no such B exists, but there is a feasible binding
    pattern for S (HTBP case: reduce transfer costs)
      Pick the BP B with the most bound variables
      Push S with B into Pi
  If no subgoal has been chosen at this level (Ci is empty),
  and there are some postponed sources (Pi is non-empty)
    Choose Sk with B in Pi with the maximum number of bound variables
    Push Sk with B into Ci
    Add all variables of Sk to V
Return the array C1..Cm
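
One way to read the algorithm as runnable code is sketched below. Several points are assumptions rather than part of the slide: patterns are `b`/`f` strings, a missing FBP entry means every pattern is feasible, ties are broken alphabetically, and the variables of the chosen subgoals are added to V only at the end of each stage so that calls within a stage do not depend on each other.

```python
from itertools import product


def candidates(variables, known, feasible=None):
    """Patterns that bind only attributes whose variable already has a value."""
    options = [("f", "b") if v in known else ("f",) for v in variables]
    patterns = ["".join(bits) for bits in product(*options)]
    return [p for p in patterns if feasible is None or p in feasible]


def high_traffic(source, pattern, htbp):
    return any(all(p == "f" or o == "b" for p, o in zip(pattern, listed))
               for listed in htbp.get(source, ()))


def order_source_calls(subgoals, bound_vars, htbp, fbp=None):
    fbp = fbp or {}
    known = set(bound_vars)
    remaining = dict(subgoals)
    stages = []
    while remaining:
        chosen, postponed = [], []
        for source, variables in remaining.items():
            cands = candidates(variables, known, fbp.get(source))
            ok = [p for p in cands if not high_traffic(source, p, htbp)]
            if ok:      # default case: most general pattern, fewest accesses
                chosen.append((source, min(ok, key=lambda p: (p.count("b"), p))))
            elif cands:  # HTBP case: postpone with the most bound pattern
                postponed.append((source, max(cands, key=lambda p: (p.count("b"), p))))
        if not chosen and postponed:
            chosen.append(max(postponed, key=lambda sp: sp[1].count("b")))
        if not chosen:
            raise ValueError("no feasible binding pattern for the remaining subgoals")
        for source, _ in chosen:
            known.update(remaining.pop(source))
        stages.append(chosen)
    return stages


SUBGOALS = {"DP": ("A", "T", "Y"), "SM98": ("T", "U")}
all_high = order_source_calls(SUBGOALS, {"Y"}, {"DP": {"bbb"}, "SM98": {"bb"}})
dp_high = order_source_calls(SUBGOALS, {"Y"}, {"DP": {"ffb"}})
none_high = order_source_calls(SUBGOALS, {"Y"}, {})
```

The three calls use the DP and SM98 sources with the query binding only the year. With every pattern marked high-traffic the result is DP with `ffb` followed by SM98 with `bf`; with only DP `ffb` listed, SM98 is called first and DP follows with `fbf`; with an empty table both sources are called in a single stage with fully free patterns.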
37
Example
  • Sources: DP(A:Author, T:Title, Y:Year)
  • SM98(T:Title, U:URL)
  • Query: Q(A,T,U,1998)
  • Plan: Q(A,T,U,1998) :- DP(A,T,1998),
    SM98(T,U)

HTBP = {DPbbb, SM98bb} (Bound-is-easier)
  Step 1. V = {Y}. Cand: DPfff (XX), DPffb (XX), SM98ff (XX).
          P1 = {DPffb, SM98ff}. C1 = {DPffb}
  Step 2. V = {A,T,Y}. Cand: SM98ff (XX), SM98bf (XX).
          P2 = {SM98bf}. C2 = {SM98bf}
HTBP = {DPffb}
  Step 1. V = {Y}. Cand: DPfff (XX), DPffb (XX), SM98ff.
          C1 = {SM98ff}
  Step 2. V = {Y,U,T}. Cand: DPfff (XX), DPffb (XX), DPfbf, DPfbb (XX).
          C2 = {DPfbf}
HTBP = {}
  Step 1. V = {Y}. Cand: DPfff, DPffb, SM98ff.
          C1 = {SM98ff, DPfff}
38
Implementation
  • The Emerac Information Gatherer
  • written in Java
  • incorporates rewriting and execution ordering
    techniques
  • executes plans in parallel
  • returns partial results during plan execution
  • object oriented design makes it easy to modify

39
EMERAC's Contributions
  • An approach for minimizing recursive information
    gathering plans
  • An approach for ordering source calls in
    information gathering plans
  • Attempts at minimizing both access cost and
    tuple-transfer cost
  • (Partial) implementation and evaluation

What next??
40
More capable sources
  • EMERAC assumes sources can only do selection
    processing. Real sources tend to provide more
    capabilities
  • Many sources can do union queries on attributes
  • E.g. CNN Stock quote tracker allows up to 8
    symbols at a time
  • Some support constraints
  • E.g. give me all flights with prices less than 300
  • Theoretically, such sources can be modeled as
    supplying a (possibly infinite) number of views.
  • Query optimization is harder when the
    capabilities are neither full nor highly limited..

41
More realistic overlap statistics
  • LCWs may not be available (or may not be
    advertised)
  • Statistics on coverage and overlap may be
    available
  • Source A and Source B have 70% overlap on tuples
  • How to use them?
  • Computing unions given partial information about
    intersections..
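
As a small worked example of using such statistics, inclusion-exclusion gives the union size and the marginal gain of calling a second source. The sizes are invented, and reading the overlap figure as "the percentage of B's tuples that also appear in A" is an assumption, since the slide does not say which source the percentage is relative to.

```python
def union_size(size_a, size_b, overlap):
    """|A union B| = |A| + |B| - |A intersect B|."""
    return size_a + size_b - overlap


def marginal_gain(size_b, overlap):
    """New tuples obtained by calling B after A has already been called."""
    return size_b - overlap


size_a, size_b = 1000, 800
overlap = size_b * 70 // 100        # assumed: 70% of B's tuples are also in A
total = union_size(size_a, size_b, overlap)
gain = marginal_gain(size_b, overlap)
```

Under these numbers B contributes only 240 new tuples on top of A's 1000, which is the kind of figure a planner could weigh against B's access cost.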

42
Optimizing for First n-tuples
  • Traditional techniques optimize time to get all
    tuples.
  • It is much better to optimize time to
    get first n-tuples.
  • Little theory available on such optimization
  • May be counter-intuitive from the point of view
    of traditional optimization
  • Use of double pipelined hash join in TUKWILA
  • Cost-quality tradeoffs (not all answers are
    equal..)

"Curtsy while you think. It saves time."
(Queen to Alice)
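
A double pipelined (symmetric) hash join emits matches as soon as both halves of a pair have arrived, rather than waiting for one input to be read in full. The generator below is a generic sketch of that technique and not TUKWILA's implementation. Alternating strictly between the two inputs is a simplifying assumption, since a real system would read from whichever source has data ready.

```python
from collections import defaultdict
from itertools import zip_longest


def double_pipelined_hash_join(left, right, left_key, right_key):
    """Yield (left_row, right_row) pairs with equal keys, incrementally."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    missing = object()   # marks the exhausted side once inputs differ in length
    for l_row, r_row in zip_longest(left, right, fillvalue=missing):
        if l_row is not missing:
            key = left_key(l_row)
            left_table[key].append(l_row)            # build on the left side
            for match in right_table.get(key, ()):   # probe the right side
                yield (l_row, match)
        if r_row is not missing:
            key = right_key(r_row)
            right_table[key].append(r_row)           # build on the right side
            for match in left_table.get(key, ()):    # probe the left side
                yield (match, r_row)


left_rows = [(1, "a"), (2, "b")]
right_rows = [(2, "x"), (1, "y"), (2, "z")]
first_key = lambda row: row[0]
joined = list(double_pipelined_hash_join(left_rows, right_rows, first_key, first_key))
```

Each matching pair is produced exactly once, at the moment the later of its two rows arrives, so the first answers can be returned while the slower source is still sending.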
43
XML .
  • Sources may give their output in XML format
  • Makes unwrapping easy
  • Sources may be based on XML
  • Semi-structured non-relational data
  • XML query processing languages
  • Labeled directed graphs
  • Navigational queries, path expressions etc..

44
XML
<Publication URL="ftp://db.stanford.edu/pub/papers/xml.ps" Authors="RG JM JW">
  <Title>From Semistructured Data to XML: Migrating the Lore Data Model and Query Language</Title>
  <Published>Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99)</Published>
  <Pages>25-30</Pages>
  <Location> <City>Philadelphia</City> <State>Pennsylvania</State> </Location>
  <Date> <Month>June</Month> <Year>1999</Year> </Date>
</Publication>
<Publication URL="ftp://db.stanford.edu/pub/papers/ozone.ps" Authors="TL SA JW">
  <Title>Ozone: Integrating Structured and Semistructured Data</Title>
  <Published>Technical Report</Published>
  <Institution>Stanford University Database Group</Institution>
  <Date> <Month>October</Month> <Year>1998</Year> </Date>
</Publication>
<Author ID="SA">S. Abiteboul</Author>
<Author ID="RG">R. Goldman</Author>
<Author ID="TL">T. Lahiri</Author>
<Author ID="JM">J. McHugh</Author>
<Author ID="JW">J. Widom</Author>
HTML
<UL>
<LI> R. Goldman, J. McHugh, and J. Widom.
  <A href="ftp://db.stanford.edu/pub/papers/xml.ps">From Semistructured Data to XML:
  Migrating the Lore Data Model and Query Language</A>.
  Proceedings of the 2nd International Workshop on the Web and
  Databases (WebDB '99), pages 25-30,
  Philadelphia, Pennsylvania, June 1999.
<LI> T. Lahiri, S. Abiteboul, and J. Widom.
  <A href="ftp://db.stanford.edu/pub/papers/ozone.ps">Ozone: Integrating Structured and
  Semistructured Data</A>. Technical Report,
  Stanford Database Group, October 1998.
</UL>
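
To illustrate why XML output makes unwrapping easy, the sketch below answers two simple queries over an abridged copy of the XML fragment using path expressions from Python's standard `xml.etree.ElementTree`. The enclosing `<Bib>` root element and the helper names are assumptions added so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's data, wrapped in a hypothetical <Bib> root.
DOC = """<Bib>
  <Publication URL="ftp://db.stanford.edu/pub/papers/xml.ps" Authors="RG JM JW">
    <Title>From Semistructured Data to XML</Title>
    <Date><Month>June</Month><Year>1999</Year></Date>
  </Publication>
  <Publication URL="ftp://db.stanford.edu/pub/papers/ozone.ps" Authors="TL SA JW">
    <Title>Ozone</Title>
    <Date><Month>October</Month><Year>1998</Year></Date>
  </Publication>
  <Author ID="SA">S. Abiteboul</Author>
  <Author ID="RG">R. Goldman</Author>
  <Author ID="TL">T. Lahiri</Author>
  <Author ID="JM">J. McHugh</Author>
  <Author ID="JW">J. Widom</Author>
</Bib>"""

root = ET.fromstring(DOC)
# Index the Author elements by ID so the Authors attribute can be resolved.
author_names = {a.get("ID"): a.text for a in root.findall("Author")}


def titles_in(year):
    """Titles of publications whose Date/Year path matches the given year."""
    return [p.findtext("Title") for p in root.findall("Publication")
            if p.findtext("Date/Year") == str(year)]


def authors_of(title):
    """Follow the Authors ID references of the publication with this title."""
    for pub in root.findall("Publication"):
        if pub.findtext("Title") == title:
            return [author_names[i] for i in pub.get("Authors").split()]
    return []


papers_1999 = titles_in(1999)
ozone_authors = authors_of("Ozone")
```

The same questions against the HTML version would need a hand-written wrapper to recover the year and the author list from free text.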
45
Current directions
  • Integrate minimization and source-call ordering
    phases
  • Model cost-quality tradeoffs
  • Handling run-time exceptions
  • unavailability of sources etc.
  • Tracking time and solution quality statistics
  • Improve the granularity of the HTBP table

46
The EMERAC Crowd
  • Eric Lambrecht
  • Senthil Gnanaprakasam
  • Zaiqing Nie
  • Yourself??
  • Sharp
  • Background in AI/DB
  • Good Java hacking