Searching and Integrating Information on the Web - PowerPoint PPT Presentation

1 / 55
About This Presentation
Title:

Searching and Integrating Information on the Web

Description:

Searching and Integrating Information on the Web Seminar 2: Data Integration Professor Chen Li UC Irvine – PowerPoint PPT presentation

Number of Views:78
Avg rating:3.0/5.0
Slides: 56
Provided by: Chen2166
Category:

less

Transcript and Presenter's Notes

Title: Searching and Integrating Information on the Web


1
Searching and Integrating Information on the Web
  • Seminar 2 Data Integration
  • Professor Chen Li
  • UC Irvine

2
Motivation
Biblio sever
Legacy database
Plain text files
Support seamless access to autonomous and
heterogeneous information sources.
3
Applications
  • Comparison shopping

Lowest price of the DVD The Matrix?
  • Supply-chain management

Buyer 1
Supplier 1
Buyer 2
Supplier 2
Integrator


Supplier M
Buyer M
4
Mediation architecture
Mediator
Wrapper
Wrapper
Wrapper
Source 1
Source 2
Source n
TSIMMIS (Stanford), Garlic (IBM), Infomaster
(Stanford), Disco (INRIA), Information Manifold
(ATT), Hermes(UMD), Tukwila (UW), InfoSleuth
(MCC),
5
Challenges
  • Sources are heterogeneous
  • Different data models relational,
    object-oriented, XML,
  • Different schemas and representations
  • Keanu Reeves or Reeves, Keanu or Reeves,
    K. etc.
  • Describe source contents
  • Use source data to answer queries
  • Sources have limited query capabilities
  • Data quality

6
Outline
  • Basics theories of conjunctive queries
  • Global-as-view (GAV) approach to data
    integration
  • Local-as-view (LAV) approach to data integration

7
Basics conjunctive queries
  • Reading Ashok K. Chandra and Philip M. Merlin,
    Optimal implementation of conjunctive queries in
    relational data bases, STOC, 77-90, 1977.
  • Fundamental for data integration
  • Source content description
  • Query description
  • Plan formulation

8
Conjunctive Queries (CQs)
  • Most common form of query equivalent to
    select-project-join (SPJ) queries
  • Useful for data integration
  • Form q(X) - p1(X1),p2(X2),,pn(Xn)
  • Head q(X) represents the query answers
  • Body p1(X1),p2(X2),,pn(Xn) represents the query
    conditions
  • Each pi(Xi) is called a subgoal
  • Shared variables represent join conditions
  • Constants represent Attributeconst selection
    conditions
  • A relation can appear in multiple predicates
    (subgoals)

9
Conjunctive Queries example
  • student(name,courseNum), course(number,instructor)
  • SELECT name
  • FROM student, course
  • WHERE student.courseNumcourse.number
    AND instructorLi
  • Equal to
  • ans(SN) - student(SN, CN), course(CN,Li)
  • Predicates student and course correspond to
    relations names
  • Two subgoals student(SN, CN) and course(CN,Li)
  • Variables SN, CN. Constant Li
  • Shared variable, CN, corresponds to
    student.courseNumcourse.number
  • Variable SN in the head the answer to the query

10
Answer to a CQ
  • For a CQ Q on database D, the answer Q(D) is set
    of heads of Q if we
  • Substitute constants for variables in the body of
    Q in all possible ways
  • Require all subgoals to be true
  • Example ans(SN) - student(SN, CN),
    course(CN,Li)
  • Tuples are also called EDB (external database)
    facts student(Jack, 184), student(Tom,215), ,
    course(184,Li), course(215,Li),
  • Answer Jack SN?Jack,CN?184
  • Answer Tom SN?Tom,CN?215
  • Answer Jack SN?Jack,CN?215 (duplicate
    eliminated)

Course
Student
11
Query containment
  • For two queries Q1 and Q2, we say Q1 is contained
    in Q2, denoted Q1?Q2, if any database D, we have
    Q1(D) ?Q2(D).
  • We say Q1 and Q2 are equivalent, denoted Q1?Q2,
    if Q1(D) ?Q2(D) and Q1(D) ? Q2(D).
  • Example
  • Q1 ans(SN) - student(SN, CN), course(CN,Li)
  • Q2 ans(SN) - student(SN, CN), course(CN,INS)
  • We have Q1(D) ? Q2(D).

12
Another example
  • Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y)
  • Q2 p(X,Y) - r(X,W), b(W,W), r(W,Y)
  • We have Q2 ?Q1
  • Proof
  • For any DB D, suppose p(x,y) is in Q2(D). Then
    there is a w such that r(x,w), b(w,w), and r(w,y)
    are in D.
  • For Q1, consider the substitution X? x, W? w, Z?
    w, Y? y.
  • Thus the head of Q1 becomes p(x,y), meaning that
    p(x,y) is also in Q1(D).
  • In general, how to test containment of CQs?
  • Containment mappings
  • Canonical databases

13
Containment mappings
  • Mapping from variables of CQ Q2 to variables of
    CQ Q1, such that
  • Head of Q2 becomes head of Q1
  • Each subgoal of Q2 becomes some subgoal of Q2
  • It is not necessary that every subgoal of Q1 is
    the target of some subgoal of Q2.
  • Example
  • Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y)
  • Q2 p(X,Y) - r(X,W), b(W,W), r(W,Y)
  • Containment mapping from Q1 to Q2 X ? X, Y ? Y,
    W ? W, Z ? W
  • No containment mapping from Q2 to Q1
  • For b(W,W) in Q2, its only possible target in Q1
    is b(W,Z)
  • However, we cannot have a mapping W?W and W?Z,
    since each variable cannot be mapped to two
    different variables

14
Example of containment mappings
  • Example C1 p(X) - a(X,Y), a(Y,Z), a(Z,W)
  • C2 p(X) - a(X,Y), a(Y,X)
  • Containment mapping from C1 to C2 X ? X, Y ? Y,
    Z ? X, W ? Y
  • No containment mapping from C2 to C1. Proof
  • For the two heads, the mapping must have X ? X
  • For a(X,Y) in C2, its target in C1 can only be
    a(X,Y) (since X?X). Thus Y?Y.
  • However, for a(Y,X) in C2, its target, which must
    be a(Y,X), does not exist in C1.

15
Theorem of Containment Mappings
  • Theorem Q1 ?Q2 iff there is a containment
    mapping from Q2 to Q1.
  • Notice the direction is the opposite
  • Proof (If)
  • Suppose ? is a containment mapping from Q2 to Q1
  • For any DB D, let tuple t is in Q1(D)
  • t is produced by a substitution ? on the
    variables of Q1 that makes all Q1s subgoals
    facts in D.
  • Therefore, ? ? ? is a substitution for variables
    of Q2 that produces t
  • Thus each t in Q1(D) must be in Q2(D)

Q1 p(X) - G1, G2, Gk
?
Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y) Q2 p(X,Y)
- r(X,W), b(W,W), r(W,Y)
?
Q2 p(X) - H1, H2, Hj
16
Proof (only if)
  • Key idea frozen CQ
  • Use a unique constant to replace a variable
  • Frozen Q is a DB consisting of all the subgoals
    of Q, with the chosen constants substituted for
    variables
  • This DB is called a canonical database of the
    query.
  • Example
  • Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y)
  • Frozen Q1 X replaced by constant x0, W by
    constant w0, Z by z0, Y by y0
  • Result DB with r(x0, w0), b(w0, z0), r(z0, y0)

17
Proof (only if) -- cont
  • Let Q1 ?Q2. Let D be the frozen Q1. Let ? be the
    substitution from those constants to the
    variables in Q1.
  • Since we chose a unique constant for each
    variable, this substitution exists.
  • Since Q1 ?Q2 the frozen head of Q1 must be in
    Q2(D). Thus there is a substitution ? from Q2 to
    D.
  • We can show that ? ? ? is a containment mapping
    from Q2 to Q1
  • The head of Q2 is mapped to the head of Q1.
  • Each subgoal in Q2 is mapped to a subgoal in Q2.

Q1 p(X) - G1, G2, Gk
?
Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y) Q2 p(X,Y)
- r(X,W), b(W,W), r(W,Y)
?
Q2 p(X) - H1, H2, Hj
18
Testing query containment
  • To test Q1 ?Q2.
  • Get a canonical DB D of Q1.
  • Compute Q2(D)
  • If Q2(D) contains the frozen head of Q1, then Q1
    ?Q2. otherwise not.
  • Testing containment between CQs is NP-complete.
  • Some polynomial-time algorithms exist in special
    cases.

19
Extending CQs
  • CQs with built-in predicates
  • We can add more conditions to variables in a CQ.
  • Example
  • student(name, GPA, courseNum),
    course(number,instructor,year)
  • ans(SN) - student(SN, G, CN), course(CN,Li),
    Ggt3.5
  • ans(SN) - student(SN, G, CN), course(CN,Li,
    Y), Ggt3.5, Y lt 2002
  • More results on CQs with built-in predicates
  • Datalog queries
  • a (possibly infinite) set of CQs with (possibly)
    recursion
  • Example r(Parent, Child)
  • Query finding all ancestors of Tom
  • ancestor(P,C) - r(P, C)
  • ancestor(P,C) - ancestor(P,X), r(X, C)
  • result(P) - ancestor(P, tom)

20
Further Reading
  • Jeff Ullman, Principles of Database and
    Knowledge Systems, Computer Science Press, 1988,
    Volume 2.

21
Outline
  • Basics theories of conjunctive queries
  • Global-as-view (GAV) approach to data
    integration
  • Local-as-view (LAV) approach to data integration

22
GAV approach to data integration
  • Readings
  • Jeffrey Ullman, Information Integration Using
    Logical Views, ICDT 1997.
  • Ramana Yerneni, Chen Li, Hector Garcia-Molina,
    and Jeffrey Ullman, Computing Capabilities of
    Mediators, SIGMOD 1999.

23
Global-as-view Approach
med(Dealer,City,Make,Year) R S
Mediator
R1(Dealer,City)
R2(Dealer, Make, Year)
  • Mediator exports views defined on source
    relations
  • med(Dealer,City,Make,Year) R1 R2
  • A query is posted on mediator views
  • SELECT FROM med
  • WHERE Year 2001 ans(D,C,M) -
    med(D,C,M,2001)
  • Mediator expands query to source queries
  • SELECT FROM R1, R2
  • WHERE Year 2001 ans(D,C,M,Y) - R1(D,C),
    R2(D,M,2001)

24
GAV Approach (cont)
  • Project TSIMMIS at Stanford
  • Advantages
  • User queries easy to define
  • Plan generation is straightforward
  • Disadvantages
  • Not all source information is exported
  • What if users want to get dealers that may not
    the city information?
  • Those dealers are not visible.
  • Not easily scalable every time a new source is
    added, mediator views need to be changed
  • Research issues
  • Efficient query execution?
  • Deal with limited source capabilities?

25
Limited source capabilities
  • Complete scans of relations not possible
  • Reasons
  • Legacy databases or structured files limited
    interfaces
  • Security/Privacy
  • Performance concerns
  • Example 1 legacy databases with restrictive
    interfaces

title
author
Given an author, return the books.
Ullman
DBMS
Knuth
TeX


26
Another example Web search forms
www.imdb.com
27
Problems
  • How to describe source restrictions?
  • How to compute mediator restrictions from
    sources?
  • How to answer queries efficiently given these
    restrictions?
  • How to compute as many answers as possible to a
    query?

28
Describe source capabilities using attribute
adornments. f free b bound u
unspecified cS chosen from a list S of
constants, e.g., state oS optional if
chosen, must be from a list S of constants A
search form is represented as multiple
templates (Title, Author, ISBN, Format,
Subject) b f u u
u ? 1 f b u u
u ? 1 u u u o
o ? 2 u u b u
u ? 3
1
2
3
29
Computing mediator restrictions
  • Motivation do not want users to be frustrated by
    submitting a query that cannot be answerable by
    the mediator
  • Example
  • Source 1 book(author, title, price)
  • Capability bff
  • I.e., we must provide a title, and can get author
    and price info
  • Source 2 review(title, reviewer, rate)
  • Capability bff
  • I.e., we must provide a book title, and can get
    other info
  • Mediator view
  • MedView(A,T,P,RV,RT) - book(A,T,P),review(T,RV,RT
    )
  • Query on the mediator view
  • Ans(RT) - MedView(A, db, P, RV, RT).
  • I.e., find the review rates of DB books
  • But the mediator cannot answer this query, since
    we do not know the authors.
  • We want to tell the user beforehand what queries
    can be answered

30
Solutions Compute mediator capabilities
  • Need algorithms that do the following
  • Given
  • Source relations with restrictions.
  • Mediator views defined on source relations
  • Union
  • Join
  • Selection
  • Projection
  • Main idea of the algorithms
  • compute restrictions on mediator views
  • minimize number of view templates

31
Union views
  • Assumption
  • MedView - V1?V2
  • We want to get all tuples from two sources that
    satisfy a query condition
  • No mediator post-processing power
  • Table to compute view adornments
  • E.g., f, os3 ? os3
  • cs2, os3 ? cs2?s3
  • Invalid combination b,u ? -

V2
V1
32
Union views with postprocessing
  • Mediator can postprocess results from a source,
    and check if the results satisfy certain
    conditions
  • Thus some entries are more relaxing
  • Essentially o can be treated as f, and u
    can be treated as f
  • E.g., f, os3 ? f instead of os3
  • cs2, os3 ? cs2 instead of cs2?s3
  • b,u ? b instead of invalid combination

V2
V1
33
Join views with passing bindings
  • Assumption
  • MedView - V1 JOIN V2
  • The mediator can pass bindings from V1 to V2
  • So the join order matters

V2
V1
34
Other views
  • Union
  • Join
  • Selection
  • Projection
  • Multiple views

35
Concise template description
  • Some adornments subsume other adornments
  • E.g. f subsumes b, since every query
    supported by b is also supported by f
  • Adornment graph subsumption relationships
  • Use the graph to compress templates
    experiments shrank 26 ? 8 templates

f
n1
n2
Adornment n1 is at least as restrictive as
adornment n2
b
o
n1
n2
Adornment n1 is at least as restrictive as
adornment n2, if the constant set of n1 is a
subset of that of n2
u
c
Adornment graph
36
Outline
  • Basics theories of conjunctive queries
  • Global-as-view (GAV) approach to data
    integration
  • Local-as-view (LAV) approach to data integration

37
Local-as-view (LAV) approach
Mediator
sources
  • There are global predicates, e.g., car,
    person, book, etc.
  • They can been seen as mediator views
  • The content of each source is described using
    these global predicates
  • A query to the mediator is also defined on the
    global predicates
  • The mediator finds a way to answer the query
    using the source contents

38
Example
Mediator
S1(Dealer,City)
S2(Dealer,Make,Year)
  • Global predicates Loc(Dealer,City),Sell(Dealer,Ma
    ke,Year)
  • Source content defined on global predicates
  • S1(Dealer,City) - Loc(Dealer,City)
  • S2(Dear,Make,Year) - Sell(Dear,Make,Year)
  • In general, each definition could be more
    complicated, rather than direct copies.
  • Queries defined on global predicates.
  • Q ans(D,M,Y) - Loc(D,irvine), Sell(D,M,Y)
  • Users do not know source views.
  • The mediator decides how to use source views to
    answer queries.
  • Answering queries using views
  • ans(D,M,Y) - S1(D,irvine), S2(D,M,Y)

39
Another LAV Example
  • Mediator predicates car(C), sell(Car, Dealer),
    loc(dealer, city)
  • Views
  • v1(x) - car(x)
  • v2(x) - car(x), sell(x, d)
  • v3(x,d) - sell(x, d), loc(d, la)
  • v4(x) - sell(x, d), loc(d, la)
  • Query q(x) - car(x), sell(x, d), loc(d, la)

40
Open-world assumption (OWA) and Close-world
assumption (CWA)
W1(Make, Dealer) - car(Make, Dealer) W2(Make,
Dealer) - car(Make, Dealer)
  • W1 and W2 have some car tuples.
  • E.g. W1 and W2 are from two different web sites.
  • W1 and W2 have all car tuples.
  • E.g. W1 and W2 are computed from the same car
    table in a database.

41
Projects using the LAV approach
  • Projects Information Manifold, Infomaster,
    Tukwila,
  • Advantages
  • Scalable new sources easy to add without
    modifying the mediator views
  • All we need to do is to define the new source
    using the existing mediator views (predicates)
  • Disadvantages
  • Hard to decide how to answer a query using views

42
Reading
  • Alon Halevy, Answering Queries Using Views A
    Survey.

43
Answering queries using views
Mediator
Query
V(D,C,M,Y) - Loc(D,C),Sell(D,M,Y)
  • Source views can be complicated SPJs, arithmetic
    comparisons,
  • Not easy to decide how to answer a query using
    source views
  • Query ans(D,M) - Loc(D,'irvine'),
    Sell(D,M,Y).
  • Rewriting ans(D,M) - V(D,irvine, M,Y)
  • Equivalent rewriting compute the same answer
    as the query
  • A rewriting can join multiple source views
  • This problem exists in many other applications
  • data warehousing
  • web caching
  • query optimizations

44
Arithmetic comparisons
Mediator
V(D,C,M,Y)- Loc(D,C),Sell(D,M,Y),Ylt1970
  • Comparisons can make the problem even trickier
  • Query ans(D,M) - Loc(D,'irvine'),
    Sell(D,M,Y).
  • Rewriting ans(D,M) - V(D,irvine, M,Y)
  • Contained rewriting only retrieve cars before
    1970.
  • Query ans(D,M) - Loc(D, 'irvine'), Sell(D,M,Y),
    Y lt 1960
  • Rewriting ans(D,M) - V(D,irvine,M,Y), Y lt 1960

45
Dropping attributes in views
Mediator
Drop Year in the view V(D,C,M)-
Loc(D,C),Sell(D,M,Y),Ylt1970
  • A variable in a CQ is called
  • distinguished if it appears in the querys
    head
  • nondistinguished otherwise
  • The problem becomes even harder when we have
    nondistinguished variables.
  • Query ans(D,M) - Loc(D,'irvine'), Sell(D,M,Y),
    Ylt1960
  • No rewriting! Since we do not have Year
    information.
  • Query ans(D,M) - Loc(D,'irvine'), Sell(D,M,Y),
    Ylt1980
  • Contained rewriting ans(D,M) - V(D, irvine,
    M)

46
Problems
Query
Source views
  • How to answer a query using views?
  • We will focus on the case where both the query
    and views are simply conjunctive.

47
Query Expansion
  • For each query P on views, we can expand P using
    the view definitions, and get a new query,
    denoted as Pexp, on the base tables.
  • Pexp can be considered to be the real meaning
    of the query.
  • Example
  • View V(D,C,M) - Loc(D,C), Sell(D,M,Y)
  • A query P using V ans(D,M) - V(D,la,M)
  • Expansion ans(D,M) - Loc(D,la), Sell(D,M,Y)

Query P ans() - v1(), v2(), , vk()
Expansion Pexp ans()- p1,1(),,p1,i1(),,
pk,1(),,pk,ik()
48
Rewritings
  • Given a query Q and a set of views V
  • A conjunctive query P is called a rewriting of
    Q using V if P only uses views in V, and P
    computes a partial answer of Q. That is Pexp ?Q.
    A rewriting is also called a contained
    rewriting (CR).
  • A conjunctive query P is called an equivalent
    rewriting (ER) of Q using V if P only uses views
    in V, and P computes the exact answer of Q. That
    is Pexp ? Q.
  • A query P is called a maximally-contained
    rewriting of Q using V if P is a union of CRs of
    Q using V, and for any CR P1of Q, the answer to P
    contains the answer to query P1, that is, P1exp ?
    Pexp.
  • See earlier slides for examples
  • Notice that all these definitions depend on the
    language of the rewriting considered. Here we
    consider conjunctive queries.

49
Focus MiniCon algorithm
  • MiniCon Algorithm Rachel Pottinger and Alon
    Levy, A scalable algorithm for answering queries
    using views, VLDB 2000.
  • See also The Shared-variable-bucket algorithm by
    Prasenjit Mitra "An Algorithm for Answering
    Queries Efficiently Using Views" in Proceedings
    of the Australasian Database Conference, Jan
    2001.
  • Formulation
  • Input a conjunctive query Q and a set V of
    conjunctive views
  • Output an maximally-contained rewriting (MCR) of
    Q using V
  • Main idea
  • For each query subgoal and for each view
  • Check if the view can be used to answer the
    query subgoal, and if so, in what form
  • Some shared variables are treated carefully
  • Combine views to answer all query subgoals
  • Reduced to a set-cover problem

50
Example
  • Query q(x) - car(x), sell(x, d), loc(d, la)
  • Views
  • v1(x) - car(x)
  • v2(x) - car(x), sell(x, d)
  • v3(x,d) - sell(x, d), loc(d, la)
  • v4(x) - sell(x, d), loc(d, la)

51
MCDs (enhanced Buckets)
  • For query subgoal car(x), its MCD includes all
    views that can answer this subgoal
  • v1(x), v2(x)
  • MCD of query subgoal sell(x,d)
  • v3(x,d) only
  • but not v2(x)! Because
  • Variable d is nondistinguished, i.e., it is not
    exported.
  • Variable d is shared by another query subgoal,
    loc(d,la). If we were to use v2(x) to answer
    query subgoal sell(x,d), we cannot get the dealer
    info to join with the other view to answer
    loc(d,la).
  • MCD of query subgoal loc(d,la)
  • v3(x,d)

52
Multi-subgoal MCD
  • MCD of query subgoals sell(x,d),loc(d,la)
  • v4(x)
  • If v4(x) is used to answer query subgoal
    sell(x,d), then the query subgoal loc(d,la)
    must be answered using v4(x) as well.
  • The reason is that d is shared by two query
    subgoals, and the corresponding variable in v4(x)
    is not exported.

53
General rules
  • For a query subgoal G and a view subgoal H in
    view W, the MiniCon algorithm considers a mapping
    from G to H
  • In this mapping, a query variable X is mapped to
    a view variable A
  • Four possible cases
  • Case 1 X is dist., A is dist.. OK.
  • A is exported, so can join with other views.
  • Case 2 X is nondist., A is dist.. OK.
  • Same as above
  • Case 3 X is dist., A is nondist.. NOT OK.
  • X needs to be in the answer, but A is not
    exported.
  • Case 4 X is nondist., A is nondist..
  • Then all the query subgoals using X must be able
    to be mapped to other subgoals in view W.
  • Reason since A is not exported in W, its
    impossible for W to join with other views to
    answer conditions involving X.
  • I.e., either NONE or ALL.

54
Combine MCDs to cover query subgoals
  • Problem
  • q(x) - car(x), sell(x,d), loc (d,la")
  • v1(x) - car(x)
  • v2(x) - car(x), sell(x,d)
  • v3(x,d) - sell(x,d), loc(d,la")
  • v4(x) - sell(x,d), loc (d,la")
  • MCDs
  • car(x) v1(x), v2(x)
  • sell(x,d) v3(x,d)
  • loc(d,"ca") v3(x,d)
  • sell(x,d),loc(d,la") v4(x)
  • Contained rewritings - using MCDs to cover all
    query subgoals, without overlap
  • P1 q(x) - v1(x), v3(x,d), v3(x,d)
  • P2 q(x) - v2(x), v3(x,d), v3(x,d)
  • P3 q(x) - v1(x), v4(x)
  • P4 q(x) - v2(x), v4(x)
  • MCR union of these four contained rewritings.

55
Related references
Query
Source views
  • Other algorithms on AQUV
  • Bucket, Inverse-rule
  • Generating efficient equivalent rewritings of
    queries using views
  • CoreCover algorithm Afrati, Li, Ullman,
    SIGMOD01
  • Handling arithmetic comparisons and dropped
    attributes
  • Afrati, Li, Mitra, PODS02
  • Afrati, Li, Mitra, EDBT04
Write a Comment
User Comments (0)
About PowerShow.com