Title: CHAPTER%202:%20MANIPULATING%20QUERY%20EXPRESSIONS
1CHAPTER 2 MANIPULATING QUERY EXPRESSIONS
PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN ALON HALEVY ZACHARY IVES
2Introduction
- How does a data integration system decide which
sources are relevant to a query? Which are
redundant? How to combine multiple sources to
answer a query? - Answer by reasoning about the contents of data
sources. - Data sources are often described by queries /
views. - This chapter describes the fundamental tools for
manipulating query expressions and reasoning
about them.
3Outline
- Review of basic database concepts
- Query unfolding
- Query containment
- Answering queries using views
4Basic Database Concepts
- Relational data model
- Integrity constraints
- Queries and answers
- Conjunctive queries
- Datalog
5Relational Terminology
- Relational schemas
- Tables, attributes
- Relation instances
- Sets (or multi-sets) of tuples
- Integrity constraints
- Keys, foreign keys, inclusion dependencies
6 Attribute names
Table/relation name
Product
PName Price Category Manufacturer
Gizmo 19.99 Gadgets GizmoWorks
Powergizmo 29.99 Gadgets GizmoWorks
SingleTouch 149.99 Photography Canon
MultiTouch 203.99 Household Hitachi
Tuples or rows
7SQL (very basic)
Interview candidate, date, recruiter,
hireDecision, grade
EmployeePerf empID, name, reviewQuarter,
grade, reviewer
select recruiter, candidate from Interview,
EmployeePerf where recruitername AND
grade lt 2.5
8Query Answers
- Q(D) the set (or multi-set) of rows resulting
from applying the query Q on the database D. - Unless otherwise stated, we will consider sets
rather than multi-sets.
9SQL (w/aggregation)
EmployeePerf empID, name, reviewQuarter,
grade, reviewer
select reviewer, Avg(grade) from
EmployeePerf where reviewQuarter1/2007
10Integrity Constraints (Keys)
- A key is a set of columns that uniquely determine
a row in the database - There do not exist two tuples, t1 and t2 such
that t1 ? t2 and t1 and t2 have the same values
for the key columns. - (EmpID, reviewQuarter) is a key for EmployeePerf
11Integrity Constraints (Functional Dependencies)
- A set of attribute A functionally determines a
set of attributes B if whenever , t1 and t2
agree on the values of A , they must also agree
on the values of B. - For example, (EmpID, reviewQuarter) functionally
determine (grade). - Note a key dependency is a functional dependency
where the key determines all the other columns.
12Integrity Constraints (Foreign Keys)
- Given table T with key B and table S with key A
A is a foreign key of B in T if whenever a S has
a row where the value of A is v, then T must have
a row where the value of B is v. - Example the empID attribute of EmployeePerf is a
foreign key for attribute emp of Employee.
13General Integrity Constraints
Tuple generating dependencies (TGDs)
Equality generating dependencies (EGDs) right
hand side contains only equalities.
Exercise express the previous constraints using
general integrity constraints.
14Conjunctive Queries
Q(X,T) - Interview(X,D,Y,H,F),
EmployeePerf(E,Y,T,W,Z), W lt 2.5.
Joins are expressed with multiple occurrences of
the same variable
select recruiter, candidate from Interview,
EmployeePerf where recruitername AND
grade lt 2.5
15Conjunctive Queries (interpreted predicates)
Q(X,T) - Interview(X,D,Y,H,F),
EmployeePerf(E,Y,T,W,Z), W lt 2.5.
Interpreted (or comparison) predicates. Variables
must also appear in regular atoms.
select recruiter, candidate from Interview,
EmployeePerf where recruitername AND
grade lt 2.5
16Conjunctive Queries (negated subgoals)
Q(X,T) - Interview(X,D,Y,H,F),
EmployeePerf(E,Y,T,W,Z), ?OfferMade(X,
date).
Safety every head variable must appear in a
positive subgoal.
17Unions of Conjunctive Queries
Multiple rules with the same head predicate
express a union
Q(R,C) - Interview(X,D,Y,H,F),
EmployeePerf(E,Y,T,W,Z), W lt 2.5.
Q(R,C) - Interview(X,D,Y,H,F),
EmployeePerf(E,Y,T,W,Z), Manager(y), W gt 3.9.
18Datalog (recursion)
Database edge(X,Y) describing edges in a
graph. Recursive query finds all paths in the
graph.
Path(X,Y) - edge(X,Y)
Path(X,Y) - edge(X,Z), path(Z,Y)
19Outline
- Review of basic database concepts
- Query unfolding
- Query containment
- Answering queries using views
20Query Unfolding
- Query composition is an important mechanism for
writing complex queries. - Build query from views in a bottom up fashion.
- Query unfolding unwinds query composition.
- Important for
- Comparing between queries expressed with views
- Query optimization (to examine all possible join
orders) - Unfolding may even discover that the composition
of two satisfiable queries is unsatisfiable!
(exercise find such an example).
21Query Unfolding Example
The unfolding of Q3 is
22Query Unfolding Algorithm
- Find a subgoal p(X1 ,,Xn) such that p is defined
by a rule r. - Unify p(X1 ,,Xn) with the head of r.
- Replace p(X1 ,,Xn) with the result of applying
the unifier to the subgoals of r (use fresh
variables for the existential variables of r). - Iterate until no unifications can be found.
- If p is defined by a union of r1, , rn, create n
rules, for each of the rs.
23Query Unfolding Summary
- Unfolding does not necessarily create a more
efficient query! - Just lets the optimizer explore more evaluation
strategies. - Unfolding is the opposite of rewriting queries
using views (see later). - The size of the resulting query can grow
exponentially (exercise show how).
24Outline
- Review of basic database concepts
- Query unfolding
- Query containment
- Answering queries using views
25Query Containment Motivation (1)
Intuitively, the unfolding of Q3 is equivalent to
Q4
How can we justify this intuition formally?
26Query Containment Motivation (2)
Furthermore, the query Q5 that requires going
through two hubs is contained in Q3
We need algorithms to detect these relationships.
27Query Containment and Equivalence Definitions
Query Q1 contained in query Q2 if for every
database D Q1(D) ? Q2(D)
Query Q1 is equivalent to query Q2 if Q1(D) ?
Q2(D) and Q2(D) ? Q1(D)
Note containment and equivalence are properties
of the queries, not of the database!
28Notation Apology
- Powerpoint does not have square
- The book uses the square notation for
containment, but the slides use the rounded
version.
29Reality Check 1
30Reality Check 2
31Why Do We Need It?
- When sources are described as views, we use
containment to compare among them. - If we can remove sub-goals from a query, we can
evaluate it more efficiently. - Actually, containment arises everywhere
32Reconsidering the Example
Relations Flight(source, destination)
Hub(city)
Views Q1(X,Y) - Flight(X,Z), Hub(Z),
Flight(Z,Y) Q2(X,Y) - Hub(Z), Flight(Z,X),
Flight(X,Y)
Query Q3(X,Z) - Q1(X,Y), Q2(Y,Z)
Unfolding Q3(X,Z) - Flight(X,U), Hub(U),
Flight(U,Y), Hub(W),
Flight(W,Y), Flight(Y,Z)
33Remove Redundant Subgoals
Redundant subgoals? Q3(X,Z) - Flight(X,U),
Hub(U), Flight(U,Y), Hub(W),
Flight(W,Y), Flight(Y,Z) ? Q3(X,Z) -
Flight(X,U), Hub(U), Flight(U,Y),
Flight(Y,Z)
Is Q3 equivalent to Q3? Q3(X,Z) -
Flight(X,U), Hub(U), Flight(U,Y)
Hub(W), Flight(W,Y), Flight(Y,Z)
34Containment Conjunctive Queries
No interpreted predicates (?,?) or negation for
now.
Recall semantics if ? maps the body subgoals to
tuples in D then, is an answer.
35Containment Mappings
- ? Vars(Q1) ?Vars(Q2)
- is a containment mapping if
-
and
36Example Containment Mapping
Q3(X,Z) - Flight(X,U), Hub(U), Flight(U,Y),
Hub(W), Flight(W,Y),
Flight(Y,Z) Q3(X,Z) - Flight(X,U), Hub(U),
Flight(U,Y), Flight(Y,Z)
Identity mapping on all variables, except
37TheoremChandra and Merlin, 1977
Q1 contains Q2 if and only if there is
a containment mapping from Q1 to Q2.
Deciding whether Q1 contains Q2 is NP-complete.
38Proof (sketch)
(if) assume ? exists
Let t be an answer to Q2 over database D (i.e.,
there is a mapping ? from Vars(Q2) to D)
Consider ? ?.
39Proof (only-if direction)
Assume containment holds.
Consider the frozen database D of Q2.
(variables of Q2 are constants in D).
The mapping from Q1 to D containment map.
40Frozen Database
Q(X,Z) - Flight(X,U), Hub(U), Flight(U,Y),
Flight(Y,Z) Frozen database for
Q(X,Z) is Flight (X,U), (U,Y), (Y,Z)
Hub (U)
41Two Views of this Result
- Variable mapping
- a condition on variable mappings that guarantees
containment - Representative (canonical) databases
- We found a (single) database that would offer a
counter example if there was one - Containment results typically fall into one of
these two classes.
42Union of Conjunctive Queries
Theorem a CQ is contained in a union of CQs if
and only if it is contained in one of the
conjunctive queries. Corollary containment is
still NP-complete.
43CQs with Comparison Predicates
A tweak on containment mappings provides a
sufficient condition
and
44Example Containment Mapping
45Containment Mappings are not Sufficient
No containment mapping, but
46Query Refinements
We consider the refinements of Q2
The red containment mapping applies for the first
refinement and the blue to the second.
47Constructing Query Refinements
- Consider all complete orderings of the variables
and constants in the query. - For each complete ordering, create a conjunctive
query. - The result is a union of conjunctive queries.
48Complete Orderings
Given a conjunction C of interpreted atoms
over a set of variables X1,,Xn a set
of constants a1,,am CT is a complete ordering
if CT C, and
49Query Refinements
Let CT be a complete ordering of C1
Then
is a refinement of Q1
50Theorem Queries with Interpreted
PredicatesKlug, 88, van der Meyden, 92
Q1 contains Q2 if and only if there is
a containment mapping from Q1 to every refinement
of Q2.
Deciding whether Q1 contains Q2 is
In practice, you can do much better (see
CQIPContainment Algorithm in Sec. 2.3.4
51Queries with Negation
Queries assumed safe every head variable
appears in a positive sub-goal in the body.
Revised containment mappings map negative
subgoals in Q1 to negative subgoals in Q2. ?
Sufficient condition, but not necessary.
52Containment with no Containment Mapping
x
z
y
a
b
c
d
53Theorem Queries with Negation
B total number of variables and constants
in Q2. Q1 contains Q2 if and only if Q1(D)?Q2(D)
for all databases D with at most B constants.
Deciding whether Q1 contains Q2 is
54Bag Semantics
Origin Destination Departure Time
SF Seattle 8AM
SF Seattle 10AM
Seattle Anchorage 1PM
Set semantics (SF, Anchorage)
Bags (SF, Anchorage), (SF, Anchorage)
55Theorem Conjunctive Queries, Bag Semantics
Q1 is equivalent to Q2 if and only if there is a
1-1 containment mapping.
Trivial example on non-equivalence
Query containment?
56Grouping and Aggregation
- Count queries are sensitive to multiplicity
- Max queries are not sensitive to multiplicity
- and many more results (see Section 2.3.6).
57Outline
- Review of basic database concepts
- Query unfolding
- Query containment
- Answering queries using views
58Motivating Example (Part 1)
Movie(ID,title,year,genre) Director(ID,director) A
ctor(ID, actor)
Containment is enough to show that V1 can be used
to answer Q.
59Motivating Example (Part 2)
Containment does not hold, but intuitively, V2
and V3 are useful for answering Q.
How do we express that intuition? Answering
queries using views!
60Problem Definition
Input Query Q View definitions V1,,Vn
A rewriting a query Q that refers only to the
views and interpreted predicates
An equivalent rewriting of Q using V1,,Vn a
rewriting Q, such that Q ? Q.
61Motivating Example (Part 3)
Movie(ID,title,year,genre) Director(ID,director) A
ctor(ID, actor)
maximally-contained rewriting
62Maximally-Contained Rewritings
Input Query Q Rewriting query language
L View definitions V1,,Vn
Q is a maximally-contained rewriting of Q given
V1,,Vn and L if
- 1. Q ? L,
- 2. Q ? Q, and
- 3. there is no Q in L such that
- Q ? Q and Q? Q
63Motivation (in words)
- LAV-style data integration
- Need maximally-contained rewritings
- Query optimization
- Need equivalent rewritings
- Implemented in most commercial DBMS
- Physical database design
- Describe storage structures as views
64Exercise which of these views can be used to
answer Q?
65Algorithms for answering queries using views
- Step 1 well bound the space of possible query
rewritings we need to consider (no interpreted
predicates) - Step 2 well find efficient methods for
searching the space of rewritings - Bucket Algorithm, MiniCon Algorithm
- Step 2b we consider logical approaches to the
problem - The Inverse-Rules Algorithm
- Well consider interpreted predicates,
66Bounding the Rewriting Length
Theorem if there is an equivalent erwriting,
there is one with at most n subgoals.
Proof Only n subgoals in Q can contribute to the
image of the containment mapping ?
67Complexity ResultLMSS, 1995
- Applies to queries with no interpreted
predicates. - Finding an equivalent rewriting of a query using
views is NP-complete - Need only consider rewritings of query length or
less. - Maximally-contained rewriting
- Union of all conjunctive rewritings of length n
or less.
68The Bucket Algorithm
- Key idea
- Create a bucket for each subgoal g in the query.
- The bucket contains view atoms that contribute to
g. - Create rewritings from the Cartesian product of
the buckets.
69Bucket Algorithm in Action
View atoms that can contribute to Movie
V1(ID,year), V2(ID,A), V4(ID,D,year)
70The Buckets and Cartesian product
Movie(ID,title, year,genre) Revenues(ID, amount) Director(ID,dir)
V1(ID,year) V1(ID,Y) V4(ID,Dir,Y)
V2(ID,A) V2(ID,amount)
V4(ID,D,year)
Consider first candidate rewriting first V1
subgoal is redundant, and V1 and V4 are mutually
exclusive.
71Next Candidate Rewriting
Movie(ID,title, year,genre) Revenues(ID,amount) Director(ID,dir)
V1(ID,year) V1(ID,Y) V4(ID,Dir,Y)
V2(ID,A) V2(ID,amount)
V4(ID,D,year)
72The Bucket Algorithm Summary
- Cuts down the number of rewriting that need to be
considered, especially if views apply many
interpreted predicates. - The search space can still be large because the
algorithm does not consider the interactions
between different subgoals. - See next example.
73The MiniCon Algorithm
Intuition The variable I is not in the head of
V5, hence V5 cannot be used in a
rewriting. MiniCon discards this option early on,
while the Bucket algorithm does not notice the
interaction.
74MinCon Algorithm Steps
- Create MiniCon descriptions (MCDs)
- Homomorphism on view heads
- Each MCD covers a set of subgoals in the query
with a set of subgoals in a view - Combination step
- Any set of MCDs that covers the query subgoals
(without overlap) is a rewriting - No need for an additional containment check!
75MiniCon Descriptions (MCDs)An atomic fragment of
the ultimate containment mapping
MCD mapping covered subgoals of Q 2,3
76MCDs Detail 1
Need to specialize the view first
MCD mapping covered subgoals of Q 2,3
77MCDs Detail 2
Note the third subgoal of the view is not
included in the MCD.
MCD mapping covered subgoals of Q still
2,3
78Inverse-Rules Algorithm
- A logical approach to AQUV
- Produces maximally-contained rewriting in
polynomial time - To check whether the rewriting is equivalent to
the query, you still need a containment check. - Conceptually simple and elegant
- Depending on your comfort with Skolem functions
79Inverse Rules by Example
Given the following view
And the following tuple in V7
V7(79,Manhattan,1979,Comedy) Then we can infer
the tuple Movie(79,Manhattan,1979,Come
dy) Hence, the following rule is sound IN1
Movie(I,T,Y,G) - V7(I,T,Y,G)
80Skolem Functions
- Now suppose we have the tuple
- V7(79,Manhattan,1979,Comedy)
- Then we can infer that there exists some
director. Hence, the following rules hold (note
that they both use the same Skolem function) - IN2 Director(I,f1(I,T,Y,G))- V7(I,T,Y,G)
- IN3 Actor(I,f1(I,T,Y,G))- V7(I,T,Y,G)
81Inverse Rules in GeneralRewriting Inverse
Rules Query
Given Q2, the rewriting would include IN1, IN2,
IN3, Q2.
Given input V7(79,Manhattan,1979,Comedy) Inverse
rules produce Movie(79,Manhattan,1979,Comedy)
Director(79,f1(79,Manhattan,1979,Comedy))
Actor(79,f1(79,Manhattan,1979,Comedy))
Movie(Manhattan,1979,Comedy) (the last tuple is
produced by applying Q2).
82Comparing Algorithms
- Bucket algorithm
- Good if there are many interpreted predicates
- Requires containment check. Cartesian product can
be big - MiniCon
- Good at detecting interactions between subgoals
83Algorithm Comparison (Continued)
- Inverse-rules algorithm
- Conceptually clean
- Can be used in other contexts (see later)
- But may produce inefficient rewritings because it
undoes the joins in the views (see next slide) - Experiments show MiniCon is most efficient.
- Recently MiniCon improved by using constraint
satisfaction techniques.
84Inverse Rules Inefficiency Example
Now we need to re-compute the join
85View-Based Query Answering
- Maximally-contained rewritings are parameterized
by query language. - More general question
- Given a set of view definitions, view instances
and a query, what are all the answers we can
find? - We introduce certain answers as a mechanism for
providing a formal answer.
86View Instances Possible DBs
Consider the two views
And suppose the extensions of the views are
V8 Allen, Copolla
V9 Keaton, Pacino
87Possible Databases
There are multiple databases that satisfy the
above view definitions (we ignore the first
argument of Movie below) DB1. (Allen, Keaton),
(Coppola, Pacino) DB2. (Allen, Pacino),
(Coppola, Keaton) If we ask whether Allen
directed a movie in which Keaton acted, we cant
be sure.
Certain answers are those true in all databases
that are consistent with the views and their
extensions.
88Certain Answers Formal Definition Open-world
Assumption
- Given
- Views V1,,Vn
- View extensions v1,vn
- A query Q
- A tuple t is a certain answer to Q under the
open-world assumption if t ? Q(D) for all
databases D such that - Vi(D) ? vi for all i.
89Certain AnswersClosed-world Assumption
- Given
- Views V1,,Vn
- View extensions v1,vn
- A query Q
- A tuple t is a certain answer to Q under the
open-world assumption if t ? Q(D) for all
databases D such that - Vi(D) vi for all i.
90Certain Answers Example
V8 Allen
V9 Keaton
Under closed-world assumption single DB
possible ? (Allen, Keaton) Under open-world
assumption no certain answers.
91The Good News
- The MiniCon and Inverse-rules algorithms produce
all certain answers - Assuming no interpreted predicates in the query
(ok to have them in the views) - Under open-world assumption
- Corollary they produce a maximally-contained
rewriting
92In Other News
- Under closed-world assumption finding all certain
answers is co-NP hard!
Proof encode a graph -- G (V,E)
q has a certain tuple iff G is not 3-colorable
93Interpreted Predicates
- In the views no problem (all results hold)
- In the query Q
- If the query contains interpreted predicates,
finding all certain answers is co-NP-hard even
under open-world assumption - Proof reduction to CNF.
94Summary of Chapter 2
- Query containment and answering queries using
views are fundamental tools in our arsenal. - In general, they are NP-complete (or worse), but
in practice they are not the bottleneck. - Certain answers are the formalism we use to model
answers in data integration systems - They capture our knowledge about what tuples must
be in the instance of the mediated schema.