Querying Big Data by Accessing Small Data - PowerPoint PPT Presentation

1
Querying Big Data by Accessing Small Data
  • Wenfei Fan, University of Edinburgh & Beihang University
  • Floris Geerts, University of Antwerp
  • Yang Cao, University of Edinburgh & Beihang University
  • Ting Deng, Beihang University
  • Ping Lu, Beihang University

2
Challenges introduced by big data
  • Traditional computational complexity theory of 50 years
  • The ugly: PSPACE-hard, EXPTIME-hard, ..., undecidable
  • The bad: NP-hard (intractable)
  • The good: polynomial-time computable (PTIME)

What happens when it comes to big data?
  • Using an SSD at 6 GB/s, a linear scan of a data set D would take
  • 1.9 days when D is of 1 PB (10^15 B)
  • 5.28 years when D is of 1 EB (10^18 B)
  • O(n) time is already beyond reach on big data in practice!
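
The scan times above follow from dividing the data size by the scan rate. A minimal sketch of that arithmetic, assuming "6G/s" means 6 × 10^9 bytes per second and a year of 365.25 days (both are assumptions, not stated on the slide):

```python
# Back-of-the-envelope check of the linear-scan times quoted on the slide.
# Assumption: the SSD rate "6G/s" is read as 6 * 10^9 bytes per second.
SCAN_RATE_BYTES_PER_SEC = 6e9
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumed year length


def scan_seconds(size_bytes: float, rate: float = SCAN_RATE_BYTES_PER_SEC) -> float:
    """Time in seconds to read size_bytes once at the given rate."""
    return size_bytes / rate


pb_days = scan_seconds(1e15) / SECONDS_PER_DAY    # 1 PB = 10^15 B
eb_years = scan_seconds(1e18) / SECONDS_PER_YEAR  # 1 EB = 10^18 B

print(f"1 PB: {pb_days:.1f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Rounded to the precision used on the slide, the two results agree with the quoted 1.9 days and 5.28 years.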

Can we still answer queries on big data with limited resources?
3
Bounded evaluability
  • Input: A class L of queries
  • Question: Can we find, for any query Q ∈ L and any (possibly big) dataset D, a fraction D_Q of D such that
  • Q(D) = Q(D_Q), and
  • D_Q can be identified in time determined by Q?

[Figure: evaluating Q on D versus on the small fraction D_Q of D]
  • Scales with D no matter how big D grows

Making the cost of computing Q(D) independent of
D!
4
Graph Search (Facebook)
  • Find me restaurants in New York my friends have
    been to in 2014

1.38 billion person tuples, and over 140 billion
friend tuples
  • select rid
  • from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
  • where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2014

Data semantics in constraints
  • Facebook: at most 5000 friends per person
  • Each year has at most 366 days
  • Each person dines at most once per day
  • pid is a key for relation person

Build an index from pid1 to pid2 for friend(pid1,
pid2)
Boundedly evaluable with indices under
constraints?
5
Bounded query evaluation
  • Find me restaurants in New York my friends have
    been to in 2014
  • Q(rid) = ∃ p, p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014)

A query plan under the constraints and indices
  • Fetch 5000 pids for friends of p0 (5000 friends per person)
  • For each pid, check whether she lives in NYC: 5000 person tuples
  • For pids living in NYC, find restaurants where they dined in 2014: 5000 × 366 tuples at most

In contrast to 1.38 billion person tuples and over 140 billion friend tuples
Accessing 5000 + 5000 + 5000 × 366 tuples in total
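
The three steps of the plan above can be sketched with in-memory dictionaries standing in for the indices. This is only an illustration under stated assumptions: the toy tuples, the helper names build_index and restaurants, and the way accessed values are counted are all hypothetical and not taken from the talk.

```python
from collections import defaultdict

# Toy instance (hypothetical data, for illustration only).
friend = [("p0", "a"), ("p0", "b"), ("x", "a")]
person = [("a", "Ann", "NYC"), ("b", "Bob", "LA"), ("p0", "Me", "NYC")]
dine = [("a", "r1", 1, 5, 2014), ("a", "r2", 2, 5, 2013), ("b", "r3", 3, 3, 2014)]


def build_index(rows, key_cols, val_cols):
    """Map each key (X-value) to the set of associated Y-values."""
    index = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in key_cols)
        index[key].add(tuple(row[i] for i in val_cols))
    return index


friend_idx = build_index(friend, [0], [1])   # pid1 -> pid2
person_idx = build_index(person, [0], [2])   # pid -> city
dine_idx = build_index(dine, [0, 4], [1])    # (pid, yy) -> rid


def restaurants(p0, city="NYC", year=2014):
    """Follow the three-step plan; return the answers and the number of fetched values."""
    accessed = 0
    friends = friend_idx.get((p0,), set())       # step 1: friends of p0
    accessed += len(friends)
    result = set()
    for (pid,) in friends:
        cities = person_idx.get((pid,), set())   # step 2: where does pid live?
        accessed += len(cities)
        if (city,) not in cities:
            continue
        rids = dine_idx.get((pid, year), set())  # step 3: restaurants in that year
        accessed += len(rids)
        result |= {rid for (rid,) in rids}
    return result, accessed


rids, accessed = restaurants("p0")
# Worst case under the constraints quoted on the slide.
WORST_CASE = 5000 + 5000 + 5000 * 366
print(rids, accessed, WORST_CASE)
```

The point of the sketch is that the function never scans friend, person or dine; it only follows index lookups, so the amount of data touched is bounded by the constraints rather than by the size of the relations.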
6
Overview
  • Formalization of bounded query plans and queries
  • The complexity of deciding bounded evaluability for
  • CQ (SPJ), UCQ, ∃FO+ (SPJU), FO
  • Effective syntax for boundedly evaluable queries
  • Approximate query answering with bounded evaluability
  • Bounded envelopes
  • Bounded query specialization
  • We only know that bounded evaluability is
  • undecidable for FO [PODS 2014]
  • in PTIME for CQ with very restricted query plans [VLDB 2014]

Previous work: bounded query plans are not properly defined
7
Boundedly evaluable queries formulation
8
Access constraints to capture data semantics
Combining cardinality constraints and indices
On a relation schema R: X → (Y, N)
  • X, Y: sets of attributes of R
  • for any X-value, there exist at most N distinct Y-values
  • Index on X for Y: given an X-value, find the relevant Y-values
Examples
  • friend(pid1, pid2): pid1 → (pid2, 5000); 5000 friends per person
  • dine(pid, rid, dd, mm, yy): pid, yy → (rid, 366); each year has at most 366 days and each person dines at most once per day
  • person(pid, name, city): pid → (city, 1); pid is a key for person

Discovery: functional dependencies, simple aggregate queries
Access schema A: a set of access constraints
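
The cardinality part of an access constraint X → (Y, N) can be checked on a given instance by grouping tuples on their X-value and counting distinct Y-values. A minimal sketch, assuming relations are lists of tuples and attributes are addressed by column position; the function name satisfies and the sample tuples are hypothetical:

```python
from collections import defaultdict


def satisfies(rows, x_cols, y_cols, n):
    """True if every X-value in rows has at most n distinct Y-values."""
    seen = defaultdict(set)
    for row in rows:
        x_value = tuple(row[i] for i in x_cols)
        seen[x_value].add(tuple(row[i] for i in y_cols))
    return all(len(y_values) <= n for y_values in seen.values())


# dine(pid, rid, dd, mm, yy), hypothetical tuples
dine = [("a", "r1", 1, 5, 2014), ("a", "r2", 2, 5, 2014), ("b", "r3", 3, 3, 2014)]

# pid, yy -> (rid, 366) holds on this instance ...
ok_366 = satisfies(dine, x_cols=[0, 4], y_cols=[1], n=366)
# ... while the much tighter bound N = 1 does not: ("a", 2014) has two rids.
ok_1 = satisfies(dine, x_cols=[0, 4], y_cols=[1], n=1)
print(ok_366, ok_1)
```

The same grouping also yields the index itself (the dictionary from X-values to Y-values), which is why a constraint and its index are naturally built together.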
9
Bounded plans for query Q
In the presence of access schema A
ξ(Q, R): T1 = δ1, ..., Tn = δn, where δi is one of
Y ? X ? Y
  • {a}: a constant in query Q
  • Fetch(X ∈ Tj, R, Y): via access constraint R: X → (Y, N), j < i
  • π_Y(Tj), σ_C(Tj), ρ(Tj): projection, selection, renaming
  • Tj × Tk, Tj ∪ Tk, Tj − Tk: Cartesian product, union, set difference, for j < i, k < i
  • The length of ξ(Q, R) is bounded by an exponential in |R|, |Q| and |A|

not very practical for plans beyond exponential
Fetch data by making use of indices in A
Independent of the size of instances D of R
10
Boundedly evaluable queries Q
Q has a bounded query plan ξ(Q, R) under an access schema A
  • CQ: only {a}, Fetch(X ∈ Tj, R, Y), π_Y(Tj), σ_C(Tj), ρ(Tj), Tj × Tk
  • UCQ: ∪ at the end only
  • ∃FO+: {a}, Fetch, π, σ, ρ, ×, ∪
  • FO: {a}, Fetch, π, σ, ρ, ×, ∪, −

Coping with big data
11
Deciding bounded evaluability
12
The bounded evaluability problem (BEP(L))
  • Input: A relational schema R, an access schema A, and a query Q in a query language L
  • Question: Is Q boundedly evaluable under A?
  • That is, when Q has a bounded query plan under A.

Undecidable for FO [PODS 2014]
  • Is BEP decidable for CQ? UCQ? ∃FO+?
  • If so, what is the complexity?

The bounded evaluability analysis is nontrivial
13
Example of bounded evaluable queries
  • Schema R(A, B, C)
  • Access schema A: R(∅ → C, 1), R(AB → C, N)
  • A CQ query
  • Q(x, y) = ∃ x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1)

Is Q boundedly evaluable?
  • Yes, Q is A-equivalent to Q(x, x) = R(1, 1, x), which is boundedly evaluable
  • x = y = z3, since R(∅ → C, 1) allows only one C-value
  • ∃ z1, z2 (R(1, 1, x) ∧ R(z1, z2, y)) is entailed by R(1, 1, x)
  • With indices in A,
  • nontrivial variables are fetchable
  • combinations are indexed

We need to reason about A-equivalence and
nontrivial variables
14
The complexity of BEP
  • BEP is EXPSPACE-complete for CQ, UCQ and ∃FO+
  • good news: decidable
  • bad news: too expensive to be practical

Lower bound: by reduction from the non-emptiness problem for parameterized regular expressions
Upper bound: a characterization based on A-equivalence and nontrivial variables for boundedly evaluable queries
Can we make practical use of bounded evaluability?
15
Effective syntax for boundedly evaluable queries
16
An effective syntax for bounded CQ
  • A form of queries covered by an access schema A
  • A CQ is boundedly evaluable under A iff it is
    A-equivalent to a CQ covered by A
  • All CQ queries covered by A are boundedly
    evaluable under A
  • Checking syntactically whether a CQ is covered by A is in PTIME in |Q|, |A| and |R|

A CQ Q is covered by A if
  • all free variables and variables that participate
    in selection / join of Q are accessible via
    indices in A
  • combination of such variables in each atom R(x)
    is indexed by a single access constraint

A syntactic characterization of boundedly
evaluable CQ
17
More on covered queries
  • Schema R(A, B, C)
  • Access schema A: R(∅ → C, 1), R(AB → C, N)
  • Q(x, y) = ∃ x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1)

covered
A query in ∃FO+ is covered by A if for each CQ-subquery Qi
  • either Qi is covered by A,
  • or for each A-instance θ(Ti) of Qi, there exists a CQ-subquery Qj of Q such that Qi(θ(Ti)) ⊆ Qj(θ(Ti)) and Qj is covered

Π₂ᵖ-complete to decide whether a query in ∃FO+ is covered
18
Bounded envelopes
19
Bounded envelopes
What can we do if query Q in L is not boundedly
evaluable under A?
  • We find QL and QU in the same language L such that
  • QL and QU are boundedly evaluable under A
  • for all instances D that satisfy A
  • QL(D) ⊆ Q(D) ⊆ QU(D), and
  • |Q(D) − QL(D)| ≤ NL, and |QU(D) − Q(D)| ≤ NU,
  • where NL and NU are constants

QL and QU: lower and upper envelopes of Q
S. Chaudhuri and P. G. Kolaitis. Can Datalog be approximated? JCSS 55(2), 1997
QL(D) and QU(D) are not too far from Q(D)
Approximate query answering
20
Example bounded envelopes
  • Schema R(A, B)
  • Access schema A: R(A → B, N)
  • Q(x) = ∃ y, z, w (R(w, x) ∧ R(y, w) ∧ R(x, z) ∧ w = 1): not boundedly evaluable

  • Bounded envelopes
  • Upper (by relaxation): QU(x) = ∃ y, z (R(1, x) ∧ R(x, z))
  • Lower (by expansion): QL(x) = ∃ y, z (R(1, x) ∧ R(y, 1) ∧ R(x, y) ∧ R(x, z))

Q(x, y) = ∃ w (R(w, x) ∧ R(y, w) ∧ w = 1)
Bounded envelopes may not exist
21
The bounded envelope problems
  • UEP(L)
  • Input: A relational schema R, an access schema A, and a query Q in a query language L
  • Question: Does Q have a bounded upper envelope under A?
  • Similarly LEP(L) for lower envelopes.
  • We consider covered envelopes when Q is in CQ, UCQ or ∃FO+

Complexity bounds
  • For CQ, UEP and LEP are NP-complete
  • For UCQ, UEP is Π₂ᵖ-complete and LEP is NP-complete
  • For ∃FO+, UEP is Π₂ᵖ-complete and LEP is DP-complete
  • For FO, UEP and LEP are undecidable

22
Bounded specialized queries
23
Bounded query specialization
  • Access schema A, and query Q with a set X of parameters (variables)
  • Q(x = c) = Q ∧ x = c, where x ⊆ X and the valuation c is a constant tuple
  • Q(x = c) is boundedly evaluable under A for all valuations c
  • Consider covered queries when Q is in CQ, UCQ or ∃FO+
  • Find me restaurants in New York my friends have
    been to in 2014

Q(p, rid) = ∃ p, p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014)
All valuations p0
Instantiate a minimum set of parameters and make
Q bounded
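
The idea of instantiating a minimum set of parameters can be illustrated by brute force on the restaurant query. The sketch below makes strong simplifying assumptions: access constraints are restated over the query's variables, and "bounded" is approximated by a simple accessibility closure (every needed variable is reachable from known values through the constraints). This closure is a stand-in chosen for illustration, not the covered-query check or the decision procedure from the talk, and the parameter set X is hypothetical.

```python
from itertools import combinations

# Access constraints restated over query variables (illustrative encoding):
# each pair (X, Y) means "given values for X, values for Y can be fetched".
CONSTRAINTS = [
    ({"p"}, {"p1"}),          # friend: pid1 -> (pid2, 5000)
    ({"p1"}, {"c"}),          # person: pid -> (city, 1)
    ({"p1", "yy"}, {"rid"}),  # dine: pid, yy -> (rid, 366)
]
CONSTANTS = {"c", "yy"}                  # fixed in Q by c = NYC and yy = 2014
NEEDED = {"p", "p1", "c", "yy", "rid"}   # free, join and selection variables
PARAMETERS = ["p", "p1", "rid"]          # hypothetical parameter set X


def accessible(known):
    """Close a set of known variables under the access constraints."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for xs, ys in CONSTRAINTS:
            if xs <= known and not ys <= known:
                known |= ys
                changed = True
    return known


def smallest_specialization(k):
    """Smallest set of at most k parameters whose instantiation makes all
    needed variables accessible, or None if no such set exists."""
    for size in range(k + 1):
        for chosen in combinations(PARAMETERS, size):
            if NEEDED <= accessible(CONSTANTS | set(chosen)):
                return set(chosen)
    return None


best = smallest_specialization(1)
none_found = smallest_specialization(0)
print(best, none_found)
```

Under these assumptions, instantiating p alone suffices: once p is fixed to some p0, the friend index yields p1, and the person and dine indices cover the rest, which matches the "all valuations p0" remark on the slide.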
24
The bounded specialization problem (QSP(L))
  • Input: A relational schema R, an access schema A, a query Q in a query language L, a set X of parameters of Q, and a positive integer k
  • Question: Does Q have a bounded specialization Q(x = c) with |x| ≤ k?

Complexity bounds
  • NP-complete for CQ
  • Π₂ᵖ-complete for UCQ and ∃FO+
  • undecidable for FO

25
Summing up
26
Bounded evaluability of queries
  • Challenges: querying big data is cost-prohibitive
  • Bounded evaluability allows us to make big data
    small
  • However, the bounded evaluability analysis is
    expensive
  • Nonetheless, we can make practical use of bounded
    evaluability
  • Effective syntax: covered queries for CQ, UCQ and ∃FO+
  • Approximate query answering
  • Bounded envelopes with a constant bound
  • Bounded specialization for parameterized queries

Decidability and complexity
An approach to effectively querying big data