Querying Big Data - PowerPoint PPT Presentation

About This Presentation

Title:

Querying Big Data

Description:

TDD: Research Topics in Distributed Databases Schema matching, schema mapping, data integration Schema matching Schema mapping: XML Data integration – PowerPoint PPT presentation

Slides: 37

Provided by: homepage91

Transcript and Presenter's Notes

1

TDD: Research Topics in Distributed Databases

Querying Big Data
Tractability revisited for querying big data
BD-tractability
Reductions, complete problems, separation results
Querying big data
Scale independence
Making big data small
Approximate query answering
Relaxing query semantics
Data-driven approximation

2
Big data

Volume: in PB (10^15 B) or EB (10^18 B) or ...
Variety: heterogeneous, semi-structured or
unstructured
Velocity: dynamic
Veracity: trust in its quality

The new challenges introduced by big data?

Computer science is the topic about
the computation of a function f(x)

x is big: PB (10^15 B) or EB (10^18 B);
in fact, any data that cannot be handled with
your available resources

3
A new complexity theory for big data
4
The good, the bad and the ugly

Traditional computational complexity theory of
almost 50 years:
The good: polynomial time computable (PTIME)
The bad: NP-hard (intractable)
The ugly: PSPACE-hard, EXPTIME-hard, undecidable

What happens when it comes to big data?

Assume an SSD that scans 6 GB/s. A linear scan of
a dataset D would take
1.9 days when D is of 1 PB (10^15 B)
5.28 years when D is of 1 EB (10^18 B)
O(n) time is already beyond reach on big data in
practice!

Polynomial time queries become intractable on big
data
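
The scan times quoted on this slide follow from dividing the data size by the scan rate. A quick check in Python, assuming that "6G/s" means 6 x 10^9 bytes per second and that a year has 365 days (both are assumptions; the slide does not spell them out):

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: 365-day year


def linear_scan_seconds(size_bytes, rate_bytes_per_s=6e9):
    """Time to read size_bytes once at a constant scan rate."""
    return size_bytes / rate_bytes_per_s


pb_days = linear_scan_seconds(1e15) / SECONDS_PER_DAY
eb_years = linear_scan_seconds(1e18) / SECONDS_PER_YEAR

print(f"1 PB: {pb_days:.2f} days")    # 1 PB: 1.93 days
print(f"1 EB: {eb_years:.2f} years")  # 1 EB: 5.28 years
```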
5
Tractability revisited for queries on big data

A class Q of queries is BD-tractable if there
exists a PTIME preprocessing function Π such
that
for any database D on which queries of Q are
defined,
D' = Π(D), and
for all queries Q in Q defined on D, Q(D) can be
computed by evaluating Q on D' in parallel
polylog time (NC)

hence D' is of polynomial size
possible rewriting
parallel log^k(|D|, |Q|)
[Figure: D is preprocessed once into Π(D); queries
Q1, Q2, ... are then evaluated on it as Q1(Π(D)),
Q2(Π(D)), ...]

Does it work? If a linear scan of D could be done
in log(|D|) time:
15 seconds when D is of 1 PB instead of 1.9 days
18 seconds when D is of 1 EB rather than 5.28
years

BD-tractable queries are feasible on big data
6
BD-tractable queries

A class Q of queries is BD-tractable if there
exists a PTIME preprocessing function Π such
that
for any database D on which queries of Q are
defined,
D' = Π(D), and
for all queries Q in Q defined on D, Q(D) can be
computed by evaluating Q on D' in parallel
polylog time (NC)

ΠT^0_Q: the set of all BD-tractable query classes
in parallel with more resources

Preprocessing
one-time process, offline, once for all queries
in Q
indices, compression, views, incremental
computation, ...

does not necessarily reduce the size of D
Preprocessing: a common practice of database
people
7
What query classes are BD-tractable?

Boolean selection queries
Input: A dataset D
Query: Does there exist a tuple t in D such that
t[A] = c?
Build a B-tree on the A-column values in D. Then
all such selection queries can be answered in
O(log |D|) time.
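
The preprocessing idea can be sketched in a few lines of Python. This is only an illustration: a sorted list with binary search stands in for the B-tree, and the names `preprocess` and `exists` as well as the sample rows are made up for the example.

```python
from bisect import bisect_left


def preprocess(rows, attr):
    """One-time, offline step: sort the values of column `attr`."""
    return sorted(row[attr] for row in rows)


def exists(index, c):
    """Boolean selection: is there a tuple t with t[A] = c? O(log |D|)."""
    i = bisect_left(index, c)
    return i < len(index) and index[i] == c


D = [{"id": 1, "A": 42}, {"id": 2, "A": 7}, {"id": 3, "A": 19}]
index = preprocess(D, "A")

print(exists(index, 19))  # True
print(exists(index, 20))  # False
```

The index is built once; every later query touches only a logarithmic number of entries instead of scanning D.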

Graph reachability queries
Input: A directed graph G
Query: Does there exist a path from node s to t
in G?

NL-complete
What else?
Relational algebra + set recursion on ordered
relational databases
Some natural query classes are BD-tractable
8
Deal with queries that are not BD-tractable
Many query classes are not BD-tractable.

Breadth-Depth Search (BDS)
Input: An undirected graph G = (V, E) with a
numbering on its nodes, and a pair (u, v) of
nodes in V
Question: Is u visited before v in the
breadth-depth search of G?

Breadth-depth search starts at a node s and visits
all its children, pushing them onto a stack in the
reverse order induced by the vertex numbering.
After all of s's children are visited, it
continues with the node on the top of the stack,
which plays the role of s.

Is this problem (query class) BD-tractable?
D is empty, Q is (G, (u, v))

No. The problem is well known to be P-complete!
We need PTIME to process each query (G, (u, v))!
Preprocessing does not help us answer such
queries.

Can we make it BD-tractable?
9
Make queries BD-tractable
Factorization: partition instances to identify a
data part D for preprocessing, and a query part Q
for operations

Breadth-Depth Search (BDS)
Input: An undirected graph G = (V, E) with a
numbering on its nodes, and a pair (u, v) of
nodes in V
Question: Is u visited before v in the
breadth-depth search of G?

Factorization: D is G = (V, E), Q is (u, v)

Preprocessing: Π(G) performs BDS on G, and
returns a list M consisting of nodes in V in the
same order as they are visited
For all queries (u, v), whether u occurs before v
can be decided by a binary search on M, in
log(|M|) time
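
A minimal Python sketch of this factorization, under a few assumptions: the graph is given as an adjacency dictionary, the search starts at a caller-supplied node (the slide does not say which), and only the component of that node is visited. The lookup uses a dictionary of positions in place of the binary search on M mentioned above; the function names are made up for the example.

```python
def breadth_depth_order(adj, start):
    """Return the nodes in the order a breadth-depth search visits them."""
    visited = {start}
    order = [start]
    stack = []
    current = start
    while True:
        # Visit all unvisited children of `current`, lowest number first.
        children = [w for w in sorted(adj[current]) if w not in visited]
        for w in children:
            visited.add(w)
            order.append(w)
        # Push in reverse numbering so the lowest-numbered child is on top.
        stack.extend(reversed(children))
        if not stack:
            return order
        current = stack.pop()


def preprocess(adj, start):
    """One-time step: map every visited node to its visit position."""
    order = breadth_depth_order(adj, start)
    return {node: rank for rank, node in enumerate(order)}


def visited_before(position, u, v):
    """Query part (u, v): answered from the preprocessed positions only."""
    return position[u] < position[v]


G = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2, 6], 5: [3], 6: [4]}
position = preprocess(G, 1)

print(breadth_depth_order(G, 1))       # [1, 2, 3, 4, 6, 5]
print(visited_before(position, 6, 5))  # True
```

In this example a plain breadth-first search would visit 5 before 6; the breadth-depth order differs because node 4's child is expanded before the search returns to node 3.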

ΠT_Q: the set of all query classes that can be
made BD-tractable after proper factorization
10
Fundamental problems for BD-tractability
BD-tractable queries help practitioners determine
what query classes are tractable on big data.
Are we done yet?

No, a number of questions arise in connection with
a complexity class!
Reductions: how to transform a problem to
another in the class that we know how to solve,
and hence make it BD-tractable?
Complete problems: Is there a natural problem (a
class of queries) that is the hardest one in the
complexity class, i.e., a problem to which all
problems in the complexity class can be reduced?
How large is ΠT_Q? ΠT^0_Q? Compared to P? NC?

Analogous to our familiar NP-complete problems
Why do we care?
Fundamental to any complexity class: P, NP, ...
11
Reductions
transformations for making queries BD-tractable
Departing from our familiar polynomial-time
reductions, we need reductions that are in NC,
and deal with both data D and query Q!

NC-factor reductions ≤NC: a pair of NC functions
that allow re-factorizations (repartitioning the
data and query parts), for ΠT_Q
F-reductions ≤F: a pair of NC functions that
do not allow re-factorizations, for ΠT^0_Q

to determine whether a query class is BD-tractable

Properties
transitivity: if Q1 ≤NC Q2 and Q2 ≤NC Q3, then
Q1 ≤NC Q3 (also for ≤F)
compatibility:
if Q1 ≤NC Q2 and Q2 is in ΠT_Q, then so is Q1.
if Q1 ≤F Q2 and Q2 is in ΠT^0_Q, then so is Q1.

transform a given problem to one that we know how
to solve
12
Complete problems

A query class Q is complete for ΠT_Q if Q is in
ΠT_Q, and moreover, for any query class Q' in
ΠT_Q, Q' ≤NC Q
A query class Q is complete for ΠT^0_Q if Q is in
ΠT^0_Q, and for any query class Q' in ΠT^0_Q,
Q' ≤F Q

Is there a complete problem for ΠT_Q (ΠT^0_Q)?

There exists a natural query class Q that is
complete for ΠT_Q

Not for ΠT^0_Q
Unless P = NC, a query class complete for ΠT^0_Q
is a witness for P \ NC (as hard as the big open
question of whether P = NC)

Whether P = NC is as hard as whether P = NP

It is hard to find a complete problem for ΠT^0_Q
13
Comparing with P and NC
How large is ΠT_Q? How large is ΠT^0_Q?

NC ⊆ ΠT_Q = P
All PTIME query classes can be made
BD-tractable!
Unless P = NC, NC ⊊ ΠT^0_Q ⊊ P
Unless P = NC, not all PTIME query classes are
BD-tractable

separation
need proper factorizations to answer PTIME
queries on big data
[Figure: inside PTIME, the BD-tractable query
classes are properly contained in P; the remaining
classes are not BD-tractable]
14
What can we get from BD-tractability?
Guidelines for the following.

What query classes are feasible on big data?
ΠT^0_Q
What query classes can be made feasible to answer
on big data? ΠT_Q
How to determine whether it is feasible to answer
a class Q of queries on big data?
Reduce Q to a complete problem Qc for ΠT_Q via
≤NC
If so, how to answer queries in Q?
Identify factorizations (≤NC reductions) such
that Q ≤NC Qc
Compose the reduction and the algorithm for
answering queries of Qc

A revision of the classical computational
complexity theory
15
Making big data small
16
Scale independence

The scale independence problem
Input: A dataset D, a query Q, and a bound M
Query: Does there exist a subset DQ of D such
that
|DQ| ≤ M, and
Q(D) = Q(DQ)?
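
To make the problem statement concrete, here is a brute-force Python check for a tiny dataset. It simply tries every subset of at most M tuples, so it is exponential and only illustrates the definition; it is not an algorithm for big data. The function name, the sample data and the two example queries are assumptions made for the illustration.

```python
from itertools import combinations


def scale_independent(D, Q, M):
    """Return a subset DQ with |DQ| <= M and Q(DQ) == Q(D), or None."""
    target = Q(D)
    for size in range(M + 1):
        for subset in combinations(D, size):
            if Q(list(subset)) == target:
                return list(subset)
    return None


D = [("ann", "Edinburgh"), ("bob", "Glasgow"), ("cat", "Edinburgh")]


def someone_in_edinburgh(rows):
    """Boolean query: does anyone live in Edinburgh?"""
    return any(city == "Edinburgh" for _, city in rows)


def count_in_edinburgh(rows):
    """Counting query: how many people live in Edinburgh?"""
    return sum(1 for _, city in rows if city == "Edinburgh")


print(scale_independent(D, someone_in_edinburgh, 1))  # [('ann', 'Edinburgh')]
print(scale_independent(D, count_in_edinburgh, 1))    # None
```

The Boolean query needs a single witness tuple, so it is scale independent for M = 1; the counting query needs both Edinburgh tuples and is not.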

A more general setting
Input: A query Q defined over a schema R, and a
bound M
Query: Is it the case that for all instances D of
R, there exists a subset DQ of D such that
|DQ| ≤ M, and
Q(D) = Q(DQ)?

The cost of query processing is independent of
D!
Scalable with big data D, when D grows!

Why do we care?
17
Scale independent queries in practice?
Personalized social search queries (Facebook
Graph Search)

Find me all my friends who live in Edinburgh and
like cycling
Find all restaurants rated A that are in King
of Prussia Mall
Find me all restaurants in Edinburgh where my
friends dined in 2013.

Bounded number of tuples

Why bounded?
Facebook: at most 5000 friends per person
At most K restaurants in a mall
At most 5000 friends, 365 days each year, and each
person dines out at most once per day (a normal
person)

To answer a query, we need to access a bounded
amount of data
18
Query processing

Access schemas (R, X, N)
an index on X for instances D of R
there exist at most N tuples sharing the same X
values in D (e.g., 365 days per year), and these
tuples can be fetched efficiently
find a query plan that visits a bounded amount of
data
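
A small Python sketch of a bounded query plan built on access schemas. Everything named here is hypothetical: the `AccessIndex` class, the `friend` and `person` relations and their attribute names are invented for the illustration, with the bound of 5000 friends taken from the earlier slide.

```python
from collections import defaultdict


class AccessIndex:
    """Index on attribute x of one relation, with cardinality bound n."""

    def __init__(self, rows, x, n):
        self.buckets = defaultdict(list)
        for row in rows:
            self.buckets[row[x]].append(row)
        # The access schema promises at most n tuples per x-value.
        if any(len(b) > n for b in self.buckets.values()):
            raise ValueError("data violates the cardinality bound")
        self.fetched = 0  # number of tuples read through this index

    def fetch(self, value):
        rows = self.buckets.get(value, [])
        self.fetched += len(rows)
        return rows


friend = [
    {"pid": "me", "fid": "ann"},
    {"pid": "me", "fid": "bob"},
    {"pid": "zed", "fid": "ann"},
]
person = [
    {"pid": "ann", "city": "Edinburgh", "likes": "cycling"},
    {"pid": "bob", "city": "Glasgow", "likes": "cycling"},
    {"pid": "zed", "city": "Edinburgh", "likes": "cycling"},
]

friend_idx = AccessIndex(friend, "pid", 5000)  # (friend, pid, 5000)
person_idx = AccessIndex(person, "pid", 1)     # (person, pid, 1)


def cycling_friends_in(me, city):
    """Plan: fetch my friends, then one person tuple per friend."""
    result = []
    for f in friend_idx.fetch(me):
        for p in person_idx.fetch(f["fid"]):
            if p["city"] == city and p["likes"] == "cycling":
                result.append(p["pid"])
    return result


answer = cycling_friends_in("me", "Edinburgh")
print(answer)                                    # ['ann']
print(friend_idx.fetched + person_idx.fetched)   # 4
```

The plan reads at most 5000 friend tuples plus one person tuple per friend, no matter how many people the relations contain.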

decide whether a query is scale independent

Complexity: the scale independence problem is
Σ_3^p-complete for conjunctive queries (SPC)
PSPACE-complete for first-order logic queries
(SQL), but
in O(1) time for Boolean conjunctive queries if
|Q| ≤ M!

there are sufficient conditions for this, based
on rules
Incremental scale independence? Using views?
19
How to make a query tractable on big data?

Querying big data
Input: Query Q, and big data G
Output: Q(G), the set of answers to Q in G

A number of techniques
Distributed query processing
Query preserving data compression
Query answering using views
Bounded incremental evaluation
Top-k query answering with early termination

Too costly
The cost of query processing: a function of |G|
and |Q|
O(|G|) time is already beyond reach in practice!
Can we effectively query big data?

Approximate or inexact algorithms
Exact algorithms?

Make the cost of query processing independent
of |G|!
MapReduce is not the only solution, and is not
even the best one!
20
Distributed query processing
O(n^2) or O(n^3) is too costly
The cost of the evaluation algorithm: f(|G|, |Q|)
It is unlikely that we can lower its complexity,
but can we reduce the size of its parameter G?
manageable sizes

Divide and conquer
partition G into fragments (G1, ..., Gn),
distributed to various sites

evaluate Q on smaller Gi

upon receiving a query Q,
evaluate Q(Gi) in parallel
collect partial answers at a coordinator site,
and assemble them to find the answer Q(G) in
the entire G
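
The divide-and-conquer scheme can be sketched for plain reachability queries. This is a simplified, single-process illustration and not the algorithm of the paper listed under further reading: the sites are simulated by a loop, and the fragment layout and all function names are assumptions. Each site reports which exits (nodes owned by other sites, or t) each of its entries (nodes with an incoming cross-site edge, or s) can reach; the coordinator only looks at these partial answers.

```python
from collections import deque


def make_fragments(edges, site_of):
    """Split G by the site of each edge's source; record entry nodes."""
    frags = {site: {"nodes": set(), "adj": {}, "in_nodes": set()}
             for site in set(site_of.values())}
    for v, site in site_of.items():
        frags[site]["nodes"].add(v)
    for u, v in edges:
        frags[site_of[u]]["adj"].setdefault(u, []).append(v)
        if site_of[u] != site_of[v]:
            frags[site_of[v]]["in_nodes"].add(v)  # v is entered from outside
    return frags


def local_eval(frag, s, t):
    """Partial evaluation at one site: exits reachable from each entry."""
    entries = set(frag["in_nodes"])
    if s in frag["nodes"]:
        entries.add(s)
    partial = {}
    for entry in entries:
        seen, queue, exits = {entry}, deque([entry]), set()
        while queue:
            u = queue.popleft()
            if u == t:
                exits.add(t)
            for v in frag["adj"].get(u, []):
                if v not in frag["nodes"]:
                    exits.add(v)      # node owned by another site
                elif v not in seen:
                    seen.add(v)
                    queue.append(v)
        partial[entry] = exits
    return partial


def reachable(frags, s, t):
    """Coordinator: assemble partial answers and search the small graph."""
    dep = {}
    for frag in frags.values():       # independent per site, parallelisable
        for entry, exits in local_eval(frag, s, t).items():
            dep.setdefault(entry, set()).update(exits)
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in dep.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


edges = [(1, 2), (2, 3), (3, 4), (4, 5), (6, 1)]
site_of = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C"}
frags = make_fragments(edges, site_of)

print(reachable(frags, 1, 5))  # True
print(reachable(frags, 5, 6))  # False
```

The coordinator's graph contains only entry and exit nodes, so its size depends on how G is fragmented rather than on the size of the fragments themselves.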

Performance guarantees for evaluating regular
reachability queries based on partial evaluation
Network traffic and response time: independent of
|G|
21
Query preserving data compression
The cost of query processing: f(|G|, |Q|)
reduce the parameter?

Query preserving compression <R, P> for a class L
of queries
For any data collection G, Gc = R(G)
For any Q in L, Q(G) = P(Q, Gc)

Compress big G into a smaller Gc
22
What is new about query preserving compression?

Query preserving compression <R, P> for a class L
of queries
For any dataset G, Gc = R(G)
For any Q in L, Q(G) = P(Q, Gc)

Relative to a class L of queries of users' choice
Better compression ratio: only information about
L queries is kept

no need to decompress Gc

For any Q in L, Q(Gc) can be directly computed
Any algorithms and indexing structures for G can
be used for Gc

In contrast to lossless compression, no need to
restore the original graph G

Gc is computed once for all queries Q in L

Incrementally maintained
Reduction: 95% on average for reachability queries
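
A minimal Python sketch of a pair <R, P> for reachability queries. As a stand-in for the compression scheme of the paper listed under further reading, it uses a simpler, well-known technique: collapsing each strongly connected component into one node (Kosaraju's algorithm). The function names and the example graph are assumptions.

```python
def _finish_order(adj, nodes):
    """Iterative DFS; nodes are appended when their search finishes."""
    seen, order = set(), []
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj.get(root, ())))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()
    return order


def compress(adj):
    """R: map each node to its component and build the component graph Gc."""
    nodes = sorted(set(adj) | {v for vs in adj.values() for v in vs})
    radj = {}
    for u, vs in adj.items():
        for v in vs:
            radj.setdefault(v, []).append(u)
    comp = {}
    for root in reversed(_finish_order(adj, nodes)):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            u = stack.pop()
            for w in radj.get(u, ()):
                if w not in comp:
                    comp[w] = root
                    stack.append(w)
    gc = {}
    for u, vs in adj.items():
        for v in vs:
            if comp[u] != comp[v]:
                gc.setdefault(comp[u], set()).add(comp[v])
    return comp, gc


def reach(adj, s, t):
    """Plain reachability by depth-first search."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def query_on_compressed(comp, gc, s, t):
    """P: rewrite (s, t) to component ids and evaluate on Gc only."""
    return reach(gc, comp[s], comp[t])


G = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
comp, Gc = compress(G)

print(len(set(comp.values())))             # 2
print(query_on_compressed(comp, Gc, 2, 5)) # True
print(query_on_compressed(comp, Gc, 4, 1)) # False
```

The same `reach` routine runs on G and on Gc, and Gc is never decompressed; only the query endpoints are rewritten.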
23
Answering queries using views
The cost of query processing: f(|G|, |Q|)
can we compute Q(G) without accessing G, i.e.,
independent of |G|?

Query answering using views: given a query Q in a
language L and a set V of views, find another
query Q' such that
Q and Q' are equivalent
Q' only accesses V(G)

for any G, Q(G) = Q'(G)

Answering graph pattern queries on big social
graphs
Regardless of how big G is, the cost is
independent of |G|
V(G) is often much smaller than G (4% -- 12%
on real-life data)

Improvement: 97% for graph pattern matching
The complexity is no longer a function of |G|
24
Incremental query answering
5%/week in Web graphs

Real-life data is dynamic: it constantly changes,
ΔG
Re-compute Q(G ⊕ ΔG) starting from scratch?

Changes ΔG are typically small

Compute Q(G) once, and then incrementally
maintain it

Incremental query processing
Input: Q, G, Q(G), ΔG
Output: ΔM such that Q(G ⊕ ΔG) = Q(G) ⊕ ΔM

[Figure labels: changes to the input, old output,
changes to the output, new output]
When changes ΔG to the data G are small,
typically so are the changes ΔM to the output
Q(G ⊕ ΔG)
Minimizing unnecessary recomputation
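
A small Python sketch of the idea, using a deliberately simple query: Q(G) is the set of nodes reachable from a fixed source, and the update is a single edge insertion. Both choices, and the function names, are assumptions for the illustration; deletions are harder and are not covered.

```python
def initial_answer(adj, s):
    """Q(G): all nodes reachable from s, computed once from scratch."""
    reached, stack = {s}, [s]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in reached:
                reached.add(y)
                stack.append(y)
    return reached


def insert_edge(adj, reached, u, v):
    """Apply the change 'insert (u, v)'; return the newly reachable nodes."""
    adj.setdefault(u, []).append(v)
    if u not in reached or v in reached:
        return set()                  # the output does not change
    delta, stack = {v}, [v]
    reached.add(v)
    while stack:                      # explore only the new part
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in reached:
                reached.add(y)
                delta.add(y)
                stack.append(y)
    return delta


G = {"s": ["a"], "b": ["c"]}
answer = initial_answer(G, "s")

d1 = insert_edge(G, answer, "a", "b")
d2 = insert_edge(G, answer, "c", "s")

print(sorted(d1))      # ['b', 'c']
print(sorted(d2))      # []
print(sorted(answer))  # ['a', 'b', 'c', 's']
```

The work done by `insert_edge` touches only the newly reachable nodes and their outgoing edges, rather than the whole graph.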
25
Complexity of incremental problems

Incremental query answering
Input: Q, G, Q(G), ΔG
Output: ΔM such that Q(G ⊕ ΔG) = Q(G) ⊕ ΔM

Incremental algorithms?
The cost of query processing: a function of |G|
and |Q|

incremental algorithms: |CHANGED|, the size of
changes in
the input ΔG, and
the output ΔM

The updating cost that is inherent to the
incremental problem itself

Bounded: the cost is expressible as f(|CHANGED|)?
Optimal: in O(|CHANGED|)?

The amount of work absolutely necessary to
perform for any incremental algorithm
Effective on graph pattern matching
Complexity analysis in terms of the size of
changes
26
Top-k query answering
Traditional query answering: compute Q(G)

It is expensive to compute when G is large
The result Q(G) is excessively large for the
users to inspect: it may be larger than G

Top-k query answering
Input: Query Q, dataset G and a positive
integer k.
Output: A top-ranked set of k elements in Q(G)

Improvement: 65% on graph pattern matching
Early termination: return top-k matches without
computing Q(G)
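
Early termination can be sketched in Python under one strong assumption: candidates can be read in non-increasing score order (for example from a precomputed index), so the first k candidates that pass the match test are the top-k answers. The function names and the toy data are made up; this is not the algorithm of the paper listed under further reading.

```python
from itertools import islice


def top_k(candidates_by_score, is_match, k):
    """Return the first k matching (score, item) pairs, best first."""
    matches = (pair for pair in candidates_by_score if is_match(pair[1]))
    return list(islice(matches, k))   # stops pulling candidates after k hits


inspected = []


def is_match(item):
    """Stand-in for an expensive match test; records what was examined."""
    inspected.append(item)
    return item in {"a", "c", "d", "e"}


candidates = [(9, "a"), (8, "b"), (7, "c"), (6, "d"), (5, "e")]
result = top_k(candidates, is_match, 2)

print(result)     # [(9, 'a'), (7, 'c')]
print(inspected)  # ['a', 'b', 'c']
```

Only three of the five candidates are examined; the rest of Q(G) is never computed.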
27
Answering queries on big data
Yes, MapReduce is useful, but it is not the only
way!

Partial evaluation for distributed query
processing: can we get performance guarantees?
Query preserving compression: convert big data to
small data
Query answering using views: make big data small
Bounded incremental query answering: depending on
the size of the changes rather than the size of
the original big data
Top-k query answering and early termination: find
answers without traversing the entire data set

Preprocessing methods
Make big data small
Combinations of these can do better than
MapReduce!
28

Further reading
W. Fan, J. Li, X. Wang, and Y. Wu. Query
Preserving Graph Compression, SIGMOD, 2012.
W. Fan. Graph Pattern Matching Revised for Social
Network Analysis, ICDT 2012 (invited).
W. Fan, X. Wang, and Y. Wu. Performance
Guarantees for Distributed Reachability Queries,
VLDB, 2012.
W. Fan, X. Wang, and Y. Wu. Diversified Top-k
Graph Pattern Matching, VLDB, 2014.
W. Fan, J. Li, X. Wang, and Y. Wu. Incremental
Graph Pattern Matching, SIGMOD, 2011 (TODS 38(3),
2013).
W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu.
Graph Homomorphism Revisited for Graph Matching,
VLDB, 2010.
W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu.
Graph Pattern Matching: From Intractable to
Polynomial Time, VLDB, 2010.

29
Approximate query answering
30
Graph Pattern Matching

Given a pattern graph Q and a data graph G, find
all the matches of Q in G.
subgraph isomorphism: a bijective function f on
nodes such that (u, u') ∈ Q iff
(f(u), f(u')) ∈ G

Applications
pattern recognition
knowledge discovery
intelligence analysis
transportation network analysis
Web site classification
social position detection
user targeted advertising

Widely used in social network analysis
31
Problems
Facebook: 1B users, 140B links

Real-life social graphs are typically large
Subgraph isomorphism
What is the complexity of determining whether
there exists a match of a pattern Q in a graph G?
NP-complete
Given a pattern Q and a graph G, how many matches
of Q can possibly exist in G?
Possibly exponential
O(|G|) time is already beyond reach in practice!

Nonetheless, we need to conduct graph pattern
matching on social networks, among other things

What can we do if a class of queries is NOT
BD-tractable?
subgraph isomorphism is too costly for social
network analysis
32
Relaxing the semantics of queries

Graph simulation

Much cheaper
Complexity of computing the set of matches:
quadratic time
The number of matches of Q in G: 1; there exists a
unique maximum match relation S

a binary relation S on nodes such that
for each node u in Q, there exists v in G such
that (u, v) ∈ S, and
for each pair (u, v) ∈ S, each edge (u, u') in Q
is mapped to an edge (v, v') in G, such that
(u', v') ∈ S
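
The maximum match relation can be computed by a simple fixpoint refinement, sketched below in Python. Two caveats: the sketch also requires matched nodes to carry the same label, which is part of the standard definition of graph simulation but is not stated on this slide, and it is a naive refinement loop rather than an optimized quadratic-time algorithm. All names and the example graphs are assumptions.

```python
def graph_simulation(q_adj, q_label, g_adj, g_label):
    """Return the maximum simulation {u: set of v}, or None if no match."""
    # Start from all label-compatible pairs, then prune.
    sim = {u: {v for v in g_label if g_label[v] == q_label[u]}
           for u in q_label}
    changed = True
    while changed:
        changed = False
        for u in q_label:
            for u2 in q_adj.get(u, ()):
                # Keep v only if some edge (v, v2) leads to a match of u2.
                keep = {v for v in sim[u]
                        if any(v2 in sim[u2] for v2 in g_adj.get(v, ()))}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    if any(not vs for vs in sim.values()):
        return None
    return sim


# Pattern: a two-node cycle a -> b -> a.
q_label = {"a": "A", "b": "B"}
q_adj = {"a": ["b"], "b": ["a"]}

# Data graph: a four-node cycle 1 -> 2 -> 3 -> 4 -> 1 with labels A, B, A, B.
g_label = {1: "A", 2: "B", 3: "A", 4: "B"}
g_adj = {1: [2], 2: [3], 3: [4], 4: [1]}

S = graph_simulation(q_adj, q_label, g_adj, g_label)
print(S)  # {'a': {1, 3}, 'b': {2, 4}}

# Without the edge 4 -> 1 the pruning cascades and nothing matches.
S_broken = graph_simulation(q_adj, q_label, {1: [2], 2: [3], 3: [4]}, g_label)
print(S_broken)  # None
```

The four-node cycle contains no two-node cycle, so subgraph isomorphism finds no match here, while simulation matches all four nodes.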

Effective
Social position detection
User targeted advertising,

Quadratic time is still too expensive! How to
deal with it?
A variety of extensions to capture topology, with
low complexity
So, graph simulation for social data analysis,
instead of subgraph isomorphism
33
The approximation theory revisited

If a query class is not BD-tractable and its
semantics can't be relaxed, is it still feasible
to answer such queries on big data?

Yes, approximation

When exact algorithms are infeasible, we find
inexact algorithms with performance guarantees
can't be too far!
feasible on big data: reducing big data to small
data
performance guarantees whenever possible

The need for revising the traditional
approximation theory, for querying big data
Data-driven approximation
34
Data-driven approximation

Resource-bounded query answering
Input: A dataset D, a class Q of queries, a
resource ratio α ∈ [0, 1)
Question: Develop an algorithm that, given any
query Q ∈ Q, computes Q(D) by accessing at most
α|D| amount of data

Make big data small!

Personalized social searches and reachability
queries
Find me all my friends who live in Nanjing and
like cycling
Does Michael connect to Lady Gaga through
social links?

We can do personalized social search with α =
0.0015%!

1.5 × 10^-5 × 1 PB (10^15 B) = 15 × 10^9 B = 15 GB
We are making big data of PB size as small as
15 GB!
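
The 15 GB figure can be checked with a short calculation. The sketch assumes the resource ratio is 0.0015 percent, i.e. 1.5 x 10^-5; this reading is an assumption chosen because it is the one that yields 15 GB for 1 PB of data.

```python
alpha = 0.0015 / 100        # assumed: ratio of 0.0015 percent
size_bytes = 10**15         # 1 PB

budget_bytes = alpha * size_bytes
print(f"{budget_bytes / 1e9:.1f} GB")  # 15.0 GB
```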

We can make big data of PB size fit into our
memory!
35
Summing up
36
Summary and Review

What is BD-tractability? Why do we care about
it?
What is scale independence?
How to make big data small?
Is MapReduce the only way for querying big data?
Can we do better than it?
What is query preserving data compression? Query
answering using views? Bounded incremental query
answering? Top-k query answering?
If a class of queries is known not to be
BD-tractable, how can we process the queries in
the context of big data?
Develop an algorithm for processing a class of
queries on big data, by combining various methods
discussed