Title: Querying Big Data by Accessing Small Data
1. Querying Big Data by Accessing Small Data
- Wenfei Fan (University of Edinburgh, Beihang University)
- Floris Geerts (University of Antwerp)
- Yang Cao (University of Edinburgh, Beihang University)
- Ting Deng (Beihang University)
- Ping Lu (Beihang University)
2. Challenges introduced by big data
- Traditional computational complexity theory of the past 50 years
  - The ugly: PSPACE-hard, EXPTIME-hard, ..., undecidable
  - The bad: NP-hard (intractable)
  - The good: polynomial-time computable (PTIME)
What happens when it comes to big data?
- Using an SSD of 6 GB/s, a linear scan of a data set D would take
  - 1.9 days when D is of 1 PB (10^15 B)
  - 5.28 years when D is of 1 EB (10^18 B)
- O(n) time is already beyond reach on big data in practice!
Can we still answer queries on big data with limited resources?
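The scan times quoted on this slide follow from simple arithmetic. A minimal sketch, assuming that "6G/s" means 6 × 10^9 bytes per second and that a year has 365.25 days (both are assumptions, not stated on the slide):

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumed length of a year


def scan_seconds(size_bytes, bytes_per_second=6e9):
    """Time for one linear scan at the assumed SSD throughput."""
    return size_bytes / bytes_per_second


pb_days = scan_seconds(1e15) / SECONDS_PER_DAY    # 1 PB = 10^15 bytes
eb_years = scan_seconds(1e18) / SECONDS_PER_YEAR  # 1 EB = 10^18 bytes

print(f"1 PB: {pb_days:.1f} days")    # 1 PB: 1.9 days
print(f"1 EB: {eb_years:.2f} years")  # 1 EB: 5.28 years
```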
3. Bounded evaluability
- Input: A class L of queries
- Question: Can we find, for any query Q ∈ L and any (possibly big) dataset D, a fraction D_Q of D such that
  - Q(D) = Q(D_Q), and
  - D_Q can be identified in time determined by Q?
[Figure: Q applied to D and Q applied to the fraction D_Q of D]
- Scales with D, no matter how big D grows
Making the cost of computing Q(D) independent of D!
4. Graph Search (Facebook)
- Find me restaurants in New York my friends have been to in 2014
1.38 billion person tuples, and over 140 billion friend tuples
  select rid
  from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
  where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2014
Data semantics in constraints
- Facebook: at most 5000 friends per person
- Each year has at most 366 days
- Each person dines at most once per day
- pid is a key for relation person
Build an index from pid1 to pid2 for friend(pid1, pid2)
Boundedly evaluable with indices under constraints?
5. Bounded query evaluation
- Find me restaurants in New York my friends have been to in 2014
- Q(rid) = ∃ p, p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014)
A query plan under the constraints and indices
- Fetch 5000 pids for friends of p0: 5000 friends per person
- For each pid, check whether she lives in NYC: 5000 person tuples
- For pids living in NYC, find restaurants where they dined in 2014: 5000 × 366 tuples at most
Accessing at most 5000 + 5000 + 5000 × 366 = 1,840,000 tuples in total,
in contrast to 1.38 billion person tuples and over 140 billion friend tuples
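The three-step plan on this slide can be sketched in a few lines of Python. This is only an illustration on a hypothetical toy instance: the tuples, the dictionary-based indices and the function name `restaurants` are assumptions made for the example, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical toy instance; a real D would hold billions of tuples.
friend = [("p0", "a"), ("p0", "b"), ("p0", "c"), ("a", "b")]
person = [("p0", "Ann", "NYC"), ("a", "Bob", "NYC"),
          ("b", "Cat", "LA"), ("c", "Dan", "NYC")]
dine = [("a", "r1", 1, 5, 2014), ("a", "r2", 2, 5, 2013),
        ("b", "r3", 3, 6, 2014), ("c", "r1", 4, 7, 2014),
        ("c", "r4", 9, 9, 2014)]

# Indices mirroring the constraints: pid1 -> pid2, pid -> city, (pid, yy) -> rid.
friend_idx = defaultdict(set)
for pid1, pid2 in friend:
    friend_idx[pid1].add(pid2)
city_idx = {pid: city for pid, _name, city in person}
dine_idx = defaultdict(set)
for pid, rid, _dd, _mm, yy in dine:
    dine_idx[(pid, yy)].add(rid)


def restaurants(p0, city="NYC", year=2014):
    """Answer the query through index lookups only, counting entries touched."""
    accessed = 0
    friends = friend_idx.get(p0, set())        # step 1: friends of p0
    accessed += len(friends)
    in_city = set()
    for pid in friends:                        # step 2: one person lookup each
        accessed += 1
        if city_idx.get(pid) == city:
            in_city.add(pid)
    rids = set()
    for pid in in_city:                        # step 3: dining in the given year
        found = dine_idx.get((pid, year), set())
        accessed += len(found)
        rids |= found
    return rids, accessed


rids, accessed = restaurants("p0")
print(sorted(rids), accessed)
```

The number of index entries touched depends only on the friends of p0 and their dining records, never on the total size of `friend`, `person` or `dine`.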
6. Overview
- Formalization of bounded query plans and queries
- The complexity of deciding bounded evaluability for
  - CQ (SPJ), UCQ, ∃FO+ (SPJU), FO
- Effective syntax for boundedly evaluable queries
- Approximate query answering with bounded evaluability
  - Bounded envelopes
  - Bounded query specialization
- We only know that bounded evaluability is
  - undecidable for FO [PODS 2014]
  - in PTIME for CQ with very restricted query plans [VLDB 2014]
Previous work: bounded query plans are not properly defined
7. Boundedly evaluable queries: formulation
8. Access constraints to capture data semantics
Combining cardinality constraints and indices
On a relation schema R: X → (Y, N)
- X, Y: sets of attributes of R
- for any X-value, there exist at most N distinct Y-values
- Index on X for Y: given an X-value, find the relevant Y-values
Examples
- friend(pid1, pid2): pid1 → (pid2, 5000), 5000 friends per person
- dine(pid, rid, dd, mm, yy): (pid, yy) → (rid, 366), each year has at most 366 days and each person dines at most once per day
- person(pid, name, city): pid → (city, 1), pid is a key for person
Discovery: functional dependencies, simple aggregate queries
Access schema A: a set of access constraints
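As a rough illustration, an access constraint X → (Y, N) can be modelled as a cardinality check plus a hash index. The class name `AccessConstraint`, its fields, and the encoding of attributes as tuple positions are assumptions made for this sketch, and the bound of 2 in the usage lines is a toy value.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessConstraint:
    relation: str
    x: tuple  # positions of the X attributes
    y: tuple  # positions of the Y attributes
    n: int    # cardinality bound N

    def build_index(self, tuples):
        """Index on X for Y: maps each X-value to its set of Y-values."""
        index = defaultdict(set)
        for t in tuples:
            x_value = tuple(t[i] for i in self.x)
            index[x_value].add(tuple(t[i] for i in self.y))
        return index

    def satisfied_by(self, tuples):
        """True if every X-value has at most N distinct Y-values."""
        return all(len(ys) <= self.n
                   for ys in self.build_index(tuples).values())


# Hypothetical friend(pid1, pid2) instance with a toy bound.
friend = [("p0", "a"), ("p0", "b"), ("a", "b")]
at_most_two = AccessConstraint("friend", x=(0,), y=(1,), n=2)
at_most_one = AccessConstraint("friend", x=(0,), y=(1,), n=1)
index = at_most_two.build_index(friend)
```

Here `at_most_two.satisfied_by(friend)` holds, while `at_most_one.satisfied_by(friend)` does not, because p0 has two friends.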
9. Bounded plans for query Q
In the presence of access schema A
ξ(Q, R): T1 = δ1, ..., Tn = δn, where each δi is one of the following
(for Fetch, Y ⊆ X ∪ Y′ when the access constraint used is R: X → (Y′, N))
- {a}: a constant a in query Q
- Fetch(X ∈ Tj, R, Y): via access constraint R: X → (Y, N), j < i
- π_Y(Tj), σ_C(Tj), ρ(Tj): projection, selection, renaming
- Tj × Tk, Tj ∪ Tk, Tj - Tk: Cartesian product, union, set difference, for j < i, k < i
- The length of ξ(Q, R) is bounded by an exponential in |R|, |Q| and |A| (longer plans are not very practical)
Fetch data by making use of indices in A
Independent of the size of instances D of R
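A minimal sketch of how a plan of this shape might be executed, one step at a time. Assumptions: tables are sets of positional tuples (so renaming is a no-op and is omitted), steps are numbered from 0, each index is a plain dictionary standing in for an access constraint, and the step names (`const`, `fetch`, `project`, ...) are hypothetical labels for the operations of a bounded plan.

```python
def run_plan(steps, indices):
    """Evaluate T0 = d0, ..., Tn = dn; a step may only refer to earlier tables."""
    tables, fetched = [], 0
    for op, *args in steps:
        if op == "const":        # {a}: a constant of the query
            result = {(args[0],)}
        elif op == "fetch":      # Fetch(X in Tj, R, Y) through an index
            j, name = args
            result = set()
            for x in tables[j]:
                ys = indices[name].get(x, set())
                fetched += len(ys)
                result |= {x + y for y in ys}
        elif op == "project":
            j, cols = args
            result = {tuple(t[c] for c in cols) for t in tables[j]}
        elif op == "select":
            j, pred = args
            result = {t for t in tables[j] if pred(t)}
        elif op == "product":
            j, k = args
            result = {s + t for s in tables[j] for t in tables[k]}
        elif op == "union":
            j, k = args
            result = tables[j] | tables[k]
        elif op == "diff":
            j, k = args
            result = tables[j] - tables[k]
        else:
            raise ValueError(f"unknown operation: {op}")
        tables.append(result)
    return tables[-1], fetched


# Hypothetical indices for the restaurant query on a toy instance.
indices = {
    "friend": {("p0",): {("a",), ("b",)}},
    "person": {("a",): {("NYC",)}, ("b",): {("LA",)}},
    "dine": {("a", 2014): {("r1",)}, ("b", 2014): {("r2",)}},
}
steps = [
    ("const", "p0"),                        # T0
    ("fetch", 0, "friend"),                 # T1: (p0, friend)
    ("project", 1, (1,)),                   # T2: friend pids
    ("fetch", 2, "person"),                 # T3: (pid, city)
    ("select", 3, lambda t: t[1] == "NYC"), # T4
    ("project", 4, (0,)),                   # T5: pids living in NYC
    ("const", 2014),                        # T6
    ("product", 5, 6),                      # T7: (pid, 2014)
    ("fetch", 7, "dine"),                   # T8: (pid, 2014, rid)
    ("project", 8, (2,)),                   # T9: rid
]
answer, fetched = run_plan(steps, indices)
```

Data is reached only through `fetch`, so the amount of data touched is governed by the plan and the cardinality bounds rather than by the size of the underlying instance.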
10. Boundedly evaluable queries Q
Q is boundedly evaluable under an access schema A if it has a bounded query plan ξ(Q, R) under A
- CQ: only {a}, Fetch(X ∈ Tj, R, Y), π_Y(Tj), σ_C(Tj), ρ(Tj), Tj × Tk
- UCQ: ∪ at the end only
- ∃FO+: {a}, Fetch, π, σ, ρ, ×, ∪
- FO: {a}, Fetch, π, σ, ρ, ×, ∪, -
Coping with big data
11. Deciding bounded evaluability
12. The bounded evaluability problem (BEP(L))
- Input: A relational schema R, an access schema A, and a query Q in a query language L
- Question: Is Q boundedly evaluable under A,
  - that is, does Q have a bounded query plan under A?
Undecidable for FO [PODS 2014]
- Is BEP decidable for CQ? UCQ? ∃FO+?
- If so, what is the complexity?
The bounded evaluability analysis is nontrivial
13. Example of boundedly evaluable queries
- Schema: R(A, B, C)
- Access schema: A = {R(∅ → C, 1), R(AB → C, N)}
- A CQ query:
  - Q(x, y) = ∃ x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1)
Is Q boundedly evaluable?
- Yes, Q is A-equivalent to Q(x, x) = R(1, 1, x), which is boundedly evaluable
  - x = y = z3, since R(∅ → C, 1) allows at most one C-value
  - ∃ z1, z2 (R(1, 1, x) ∧ R(z1, z2, y)) is entailed by R(1, 1, x)
- With indices in A,
  - nontrivial variables are fetchable
  - combinations are indexed
We need to reason about A-equivalence and
nontrivial variables
14. The complexity of BEP
- BEP is EXPSPACE-complete for CQ, UCQ and ∃FO+
  - good news: decidable
  - bad news: too expensive to be practical
Lower bound: by reduction from the non-emptiness problem for parameterized regular expressions
Upper bound: a characterization based on A-equivalence and nontrivial variables for boundedly evaluable queries
Can we make practical use of bounded evaluability?
15. Effective syntax for boundedly evaluable queries
16. An effective syntax for bounded CQ
- A form of queries covered by an access schema A
  - A CQ is boundedly evaluable under A iff it is A-equivalent to a CQ covered by A
  - All CQ queries covered by A are boundedly evaluable under A
  - It is in PTIME in |Q|, |A| and |R| to syntactically check whether a CQ Q is covered by A
A CQ Q is covered by A if
- all free variables and variables that participate in selection / join of Q are accessible via indices in A
- the combination of such variables in each atom R(x) is indexed by a single access constraint
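One simplified reading of these two conditions can be written as a polynomial-time check. This is a sketch under stated assumptions, not the paper's formal definition: "accessible" is interpreted as a fixpoint that starts from constants and follows access constraints from X to Y, join variables are taken to be variables occurring more than once, and the names `Const` and `is_covered` are hypothetical.

```python
from collections import Counter


class Const:
    """Marks a constant, to tell it apart from variable names (plain str)."""
    def __init__(self, value):
        self.value = value


def is_var(term):
    return isinstance(term, str)


def is_covered(free_vars, atoms, constraints):
    """atoms: [(relation, terms)]; constraints: [(relation, x_pos, y_pos)]."""
    counts = Counter(t for _, terms in atoms for t in terms if is_var(t))
    # Variables that matter: free variables and join variables.
    needed = set(free_vars) | {v for v, c in counts.items() if c > 1}

    def x_known(terms, xs, accessible):
        return all(not is_var(terms[i]) or terms[i] in accessible for i in xs)

    # Fixpoint: a Y-variable becomes accessible once all X-terms are known.
    accessible, changed = set(), True
    while changed:
        changed = False
        for rel, terms in atoms:
            for crel, xs, ys in constraints:
                if crel == rel and x_known(terms, xs, accessible):
                    for i in ys:
                        if is_var(terms[i]) and terms[i] not in accessible:
                            accessible.add(terms[i])
                            changed = True
    if not needed <= accessible:
        return False

    # Each atom: constants and needed variables sit under one usable constraint.
    for rel, terms in atoms:
        positions = {i for i, t in enumerate(terms)
                     if not is_var(t) or t in needed}
        if not any(crel == rel
                   and positions <= set(xs) | set(ys)
                   and x_known(terms, xs, accessible)
                   for crel, xs, ys in constraints):
            return False
    return True


# The restaurant query, with constants written directly into the atoms.
facebook_atoms = [
    ("friend", (Const("p0"), "p1")),
    ("person", ("p1", "n", Const("NYC"))),
    ("dine", ("p1", "rid", "dd", "mm", Const(2014))),
]
facebook_constraints = [("friend", (0,), (1,)), ("person", (0,), (2,)),
                        ("dine", (0, 4), (1,))]
covered = is_covered({"rid"}, facebook_atoms, facebook_constraints)

# Q(x) = R(1, x), R(y, 1), R(x, z) with only the constraint A -> B on R.
r_atoms = [("R", (Const(1), "x")), ("R", ("y", Const(1))), ("R", ("x", "z"))]
not_covered = is_covered({"x"}, r_atoms, [("R", (0,), (1,))])
```

Under this reading the restaurant query is covered, while the second query is not: the atom R(y, 1) has no index that reaches it from a known value.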
A syntactic characterization of boundedly
evaluable CQ
17. More on covered queries
- Schema: R(A, B, C)
- Access schema: A = {R(∅ → C, 1), R(AB → C, N)}
- Q(x, y) = ∃ x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1)
  (covered)
A query Q in ∃FO+ is covered by A if for each CQ-subquery Qi of Q
- either Qi is covered by A,
- or for each A-instance θ(Ti) of Qi, there exists a CQ-subquery Qj of Q such that Qi(θ(Ti)) ⊆ Qj(θ(Ti)) and Qj is covered
Π2p-complete to decide whether a query in ∃FO+ is covered
18. Bounded envelopes
19. Bounded envelopes
What can we do if query Q in L is not boundedly evaluable under A?
- We find QL and QU in the same language L such that
  - QL and QU are boundedly evaluable under A
  - for all instances D that satisfy A,
    - QL(D) ⊆ Q(D) ⊆ QU(D), and
    - |Q(D) - QL(D)| ≤ NL and |QU(D) - Q(D)| ≤ NU,
    - where NL and NU are constants
QL and QU: lower and upper envelopes of Q
S. Chaudhuri and P. G. Kolaitis. Can Datalog be approximated? JCSS 55(2), 1997
QL(D) and QU(D) are not too far from Q(D)
Approximate query answering
20. Example: bounded envelopes
- Schema: R(A, B)
- Access schema: A = {R(A → B, N)}
- Q(x) = ∃ y, z, w (R(w, x) ∧ R(y, w) ∧ R(x, z) ∧ w = 1): not boundedly evaluable
- Bounded envelopes
  - Upper (relaxation): QU(x) = ∃ z (R(1, x) ∧ R(x, z))
  - Lower (expansion): QL(x) = ∃ y, z (R(1, x) ∧ R(y, 1) ∧ R(x, y) ∧ R(x, z))
Q(x, y) = ∃ w (R(w, x) ∧ R(y, w) ∧ w = 1)
Bounded envelopes may not exist
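The containment QL(D) ⊆ Q(D) ⊆ QU(D) for the query Q(x) of this example and its two envelopes can be checked by brute force on small instances. A minimal sketch, assuming a three-element domain and naive evaluation of each query over a set of pairs; the function names are hypothetical, and the check covers containment only, not the constant bounds NL and NU.

```python
from itertools import product


def q(R):
    """Q(x): R(1, x), R(y, 1), R(x, z)."""
    return {x for (w, x) in R
            if w == 1
            and any(b == 1 for (_, b) in R)
            and any(a == x for (a, _) in R)}


def q_upper(R):
    """QU(x): R(1, x), R(x, z); the atom R(y, 1) is dropped."""
    return {x for (w, x) in R if w == 1 and any(a == x for (a, _) in R)}


def q_lower(R):
    """QL(x): R(1, x), R(y, 1), R(x, y), R(x, z); R(x, z) follows from R(x, y)."""
    return {x for (w, x) in R
            if w == 1 and any(a == x and (y, 1) in R for (a, y) in R)}


# Enumerate every instance of R over the domain {1, 2, 3} (2^9 instances).
pairs = list(product([1, 2, 3], repeat=2))
sandwiched = all(
    q_lower(R) <= q(R) <= q_upper(R)
    for mask in range(2 ** len(pairs))
    for R in [{p for i, p in enumerate(pairs) if mask >> i & 1}]
)

# An instance on which the lower envelope is strictly smaller than Q.
example = {(1, 2), (2, 2), (3, 1)}
print(sandwiched, q_lower(example), q(example), q_upper(example))
```

On `example`, the lower envelope misses the answer 2 because it additionally requires R(x, y) for the same y that satisfies R(y, 1).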
21. The bounded envelope problems
- UEP(L)
  - Input: A relational schema R, an access schema A, and a query Q in a query language L
  - Question: Does Q have a bounded upper envelope under A?
- Similarly LEP(L) for lower envelopes.
- We consider covered envelopes when Q is in CQ, UCQ or ∃FO+
Complexity bounds
- For CQ, UEP and LEP are NP-complete
- For UCQ, UEP is Π2p-complete and LEP is NP-complete
- For ∃FO+, UEP is Π2p-complete and LEP is DP-complete
- For FO, UEP and LEP are undecidable
22. Bounded specialized queries
23. Bounded query specialization
- Access schema A, and query Q with a set X of parameters (variables)
- Q(x = c): Q ∧ x = c, where x ⊆ X and the valuation c is a constant tuple
- boundedly evaluable under A for all valuations c
- Consider covered queries when Q is in CQ, UCQ or ∃FO+
- Find me restaurants in New York my friends have been to in 2014
Q(p, rid) = ∃ p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014)
For all valuations p0 of p
Instantiate a minimum set of parameters and make Q bounded
24. The bounded specialization problem (QSP(L))
- Input: A relational schema R, an access schema A, a query Q in a query language L, a set X of parameters of Q, and a positive integer k
- Question: Does Q have a bounded specialization Q(x = c) with k ≥ |x|?
Complexity bounds
- NP-complete for CQ
- Π2p-complete for UCQ and ∃FO+
- undecidable for FO
25. Summing up
26. Bounded evaluability of queries
- Challenges: querying big data is cost-prohibitive
- Bounded evaluability allows us to make big data small
- However, the bounded evaluability analysis is expensive
- Nonetheless, we can make practical use of bounded evaluability
  - Effective syntax: covered queries for CQ, UCQ and ∃FO+
  - Approximate query answering
    - Bounded envelopes with a constant bound
    - Bounded specialization for parameterized queries
Decidability and complexity
An approach to effectively querying big data