Title: Data%20Exchange:%20Semantics%20and%20Query%20Answering
1Data Exchange Semantics and Query Answering
- Ronald Fagin -- IBM Almaden Research
Center - Phokion G. Kolaitis -- UC Santa Cruz
- Renee J. Miller -- University of Toronto
- Lucian Popa -- IBM Almaden Research
Center
IBM Almaden - November 12, 2002
(To appear in ICDT 2003)
2Motivation and Overview
- Data exchange problem
- How to restructure data from a source schema to a
target schema, according to a given specification - Main motivation for this work
- Understanding of fundamental issues that lie
underneath data exchange systems such as EXPRESS
and Clio - Main challenge
- Inherent under-specification
- Specification (as we shall see) must be simple
and intuitive, but - There are many ways in which the restructuring
can be performed ! - Question did we make the right choice in the
design of Clio ? - Our approach (only relational case, so far)
- Define and study universal solutions
- Show this is the best way of performing data
exchange - Study computational aspects
- Study what happens after data exchange query
answering
3 The Data Exchange Problem
Source schema S
Target schema T
?t
?st
I
J
- Assume a data exchange setting
- source schema S,
- target schema T with a set ?t dependencies (see
next) - set ?st source-to-target dependencies (see next)
- The data exchange problem is the following
- Input
- source instance I
- Output
- target instance J such that ltI, Jgt ? ?st and
J ? ?t - (call such J a solution for I )
4 Source-to-target Dependencies
- For most practical purposes, ?st contains
- source-to-target tuple-generating dependencies
(tgds) - ?S(x) ? ?y ?T(x, y)
- e.g.
- DeptEmp(did, mgr_name, eid)
- ? ?M. Dept (did, M, mgr_name) ? Emp (eid, did)
-
- (Move data from source table DeptEmp into two
target tables, Dept and Emp. The existential
variable M is an unspecified manager id)
Dept did mgr_id mgr_name
DeptEmp did mgr_name eid
Emp eid did
5 Target Dependencies
- The second, equally important, part of the
specification, are the target dependencies ?t - tgds ?T(x) ? ?y ?T(x, y)
- e.g.
- Dept (did, mgr_id, mgr_name)
- ? ?D. Emp (mgr_id, D)
- (A foreign key constraint in the target)
-
- equality generating dependencies (egds)
- ?T(x) ? (x1x2)
- e.g.
- Emp (e, d1) ? Emp (e, d2) ? (d1 d2)
-
- (A target key constraint)
Dept did mgr_id mgr_name
Emp eid did
6Questions (To be Answered Next)
- When more than one solution exists, how do we
choose a best one ? - How do we compute a best solution ?
- Is there always a solution ? Is there always a
best solution ? - How does query answering on the chosen solution
behave ?
7Universal Solutions Best Solutions
8Existence of Multiple Solutions
source
target
X0 , Y0 , Z0 represent unknown values (or
nulls)
P
A B C
T
Q
A B C
A B C
R
A B C
- There may be many solutions for the target
instance (J, J1, J2, etc.) - However, J seems to be more general
- there exist homomorphisms h1 J ? J1 and h2 J ?
J2 (see definition next) - but none from J1 or J2 to J
- intuitively, J1 and J2 have extra information
9Homomorphisms
- As we have seen, the values of a target instance
can be either - constants (i.e. values coming from the source
instance), or - nulls (unknown values)
- Definition. Assume J1 and J2 are such target
instances. A homomorphism h J1 -gt J2 is a
mapping from values of J1 to values of J2 such
that - h(c) c, for constants c
- (nulls of J1 can be mapped to
any values of J2) - for every tuple lta1, , angt in relation T of
instance J1 - lt h(a1), h(an) gt must be a tuple in
relation T of instance J2 - Example
10 Universal Solution
- Definition. Assume a data exchange setting (S, T,
?st, ?t). Given source instance I, a universal
solution for I is a target instance J such that
- (1) J is a solution for I
- (2) for every solution J for I,
- there exists homomorphism h J ? J
- For the previous example, J is a universal
solution. J1 and J2 are not. - Among all solutions, universal solutions are
special - They contain no more and no less than the amount
of information given by the specification
11- Fact
- Uniqueness up to homomorphic equivalence
- If J1 and J2 are universal for I then there are
homomorphisms between J1 and J2 in both
directions - Representation of the space of solutions
- Sol(I1) Sol(I2) iff J1 and J2 are
homomorphically equivalent - We adopt the universal solution as the notion of
best solution. - Later we will see another justification for
universal solutions in terms of query answering.
12- When do universal solutions exist ?
- How do we compute a universal solution ?
13Computing Universal Solutions
14Chase
- We canonically generate a universal solution by
using the chase - Given source instance I, start with an empty
target instance J - Generate tuples in J by applying the
dependencies in ?st and ?t. - Example
Dept did mgr_id mgr_name
DeptEmp did mgr_name eid
Emp eid did
15- This process is repeatedly applied
- for all the source tuples and for the generated
tuples, - as long as there are dependencies that are not
yet satisfied - The chase may be infinite (cyclic ?t )
- or it may fail (e.g. target key constraints
that are not possible to satisfy for the given
source data) - (details in the paper)
- However, if the chase successfully terminates,
the resulting target instance is a solution.
16Canonical Generation of Universal Solutions
- Theorem. Assume a data exchange setting (S, T,
?st, ?t). Given source instance I - If the chase is finite and successful then its
result is a universal solution. - If the chase fails then there is no solution.
- Thus, the chase is a procedure for computing
universal solutions, provided that - Solutions exist, and
- The chase is finite
- We call universal solutions computed by the chase
canonical universal solutions
When can we guarantee that the chase is finite ?
17Weakly Acyclic Sets of Dependencies
- Some cyclic sets of dependencies may cause
infinite chases - In such case no universal solution may exist, and
the semantics of the data exchange is undefined - Still there are cyclic sets of dependencies that
behave well and are quite useful - Weakly acyclic sets of dependencies (defined in
the paper) - Cover many practical cases of target constraints
- Allow for restricted cyclicity
- The chase is guaranteed to be finite
18Polynomial-Time Chase
Theorem. Let ? be a weakly acyclic set of
dependencies. For every instance K, the chase of
K with ? can be computed in polynomial time.
- Corollary. Assume a data exchange setting (S, T,
?st, ?t) such that ?t is a weakly acyclic set of
dependencies. - For every source instance I, the existence of a
solution can be checked in polynomial time - For every source instance I, if a solution
exists then a universal solution can be produced
in polynomial time.
19- Next what happens after data exchange ?
- In particular, how is subsequent query answering
affected by our choice of a solution (universal
solution) ?
20Query Answering
21Query Evaluation on a Solution
q
Target schema T
Source schema S
?t
?st
I
J
- Assume a fixed data exchange setting with a
source instance I. Suppose that a System 1
chooses a solution J for data exchange. - A query q can now be asked against the target.
- The evaluation of q, in System 1, is q(J).
- However, a System 2 materializing a different
solution J may give a different evaluation
q(J). - Different choices of J (for the same I) imply
possibly different query evaluations. - Is there a notion of the right set of answers
to q with respect to I ?
22Certain Answers
- We will use a notion that has been around in the
context of data integration and incomplete
databases, where queries are asked against a set
of possible databases.
- Definition. Given I and q, a tuple t is a certain
answer if - t ? q(J), for every solution J
- Notation certain(q, I) the set of all
certain answers
- Thus, t is certain if it is in the answer of q on
every solution. - The certain answers provide well-defined
semantics to query answering because they are
independent of the choice of a solution.
23- Can we compute the certain answers based just on
our chosen (universal) solution ?
24Positive Queries
- Proposition. Assume a data exchange setting (S,
T, ?st, ?t) and a source instance I. - Let q be a positive query. If J is a universal
solution, then certain(q, I) q(J)? . - Let J be a solution such that for every positive
query q we have that certain(q, I) q(J)? . Then
J is a universal solution. - Note In the above
- Positive query means union of SPJ queries
- q(J)? means evaluate q on J and then throw away
tuples that contain nulls)
- Thus, the certain answers of positive queries can
be computed by evaluating them on any universal
solution. - Moreover, this property characterizes universal
solutions.
25Conjunctive Queries with Inequalities
- The situation changes when negation is involved
(even in the very simple form of ?). - Example
- It can be verified that
- lta0,a0gt ? q(J)?, but lta0,a0gt ? q(J2) (thus, not
a certain answer). - Hence certain(q, I) ? q(J)?
- The universal solution gives extra answers
26- For conjunctive queries with inequalities, we
have seen that simple query evaluation on a
universal solution is not enough for computing
the certain answers. - Question Can we find a different SQL query q
such that when evaluated on a universal solution
gives the set of certain answers of q ? - There are examples for which such query q
exists. - However, we show next that the answer is no, in
general.
27Complexity Two or More Inequalities
Theorem. Computing the certain answers of unions
of conjunctive queries with at most two
inequalities per term is coNP-hard, even in a
restricted data exchange setting (LAV).
- AD98 proved a similar result for the case of
conjunctive queries with six or more
inequalities. - The coNP-hardness implies
- the certain answers cannot be computed by
evaluating the query q (or any other SQL query
q) on a polynomial-time generated universal
solution (unless P NP).
28Complexity One Inequality
Theorem. Assume a data exchange setting (S, T,
?st, ?t) such that ?t is a weakly acyclic set of
dependencies. Let q be a union of conjunctive
queries with at most one inequality per term. Let
I be a source instance and let J be an arbitrary
universal solution for I. Then there exists a
polynomial-time algorithm with input J that
computes certain(q, I).
- Thus, computing the certain answers for such
queries is a tractable problem. - Moreover, this computation can take place on any
universal solution. - The universal solution has all the information
needed to compute the certain answers. - We show next that the problem of computing the
certain answers, even for this tractable case,
cannot be solved by means of SQL query
evaluation.
29First-Order Inexpressibility
Theorem. There exists a data exchange setting and
a boolean conjunctive query q with one
inequality, for which there is no first-order
query q over the canonical universal solution
such that certain(q, I) q(J) .
- This is a strong inexpressibility result that
shows that in data exchange we cannot use the
notion of certain answers for answering queries
with inequalities. - (In practice, instead of going for certain
answers, we should just use query evaluation on
the universal solution) - The proof uses an original combination of finite
model theory techniques and the chase.
30Conclusions
- Universal solutions are a good candidate for
using in data exchange - Clio produces such universal solution (in the
relational case) - All universal solutions are equally good for
answering positive queries. - Simple query evaluation has the same semantics as
that of the certain answers. - For queries with inequalities, different
universal solutions may give different query
evaluations which may yet be different from the
certain answers. - There is no hope to find the certain answers by
means of SQL query evaluation on a universal
solution
31Future Work
- Among all universal solutions, is there a
universal solution that approximates, in a best
way, the certain answers ? If yes, can this be
computed efficiently ? - Extension to semantics and query answering for
data exchange in the nested (XML) case.
32The End
33 Relationship to Clio
- Source-to-target tgds are the same formalism that
Clio uses internally (for the relational case) - ?st is generated by Clio in the semantic
translation phase VLDB02 from correspondences.
- User input
- Then Clio generates, based on ?st , a set of
queries in the data translation phase VLDB02 - These queries compute a solution
- Is Clios solution a good one ? (Since other
solutions are also possible) - Here we try to understand, formally, the concept
of good solutions - ?t is more general than the target constraints
that Clio can currently handle.