Title: Join Synopses for Approximate Query Answering
1Join Synopses for Approximate Query Answering
- Swarup Acharya, Philip B. Gibbons, Viswanath
Poosala, Sridhar Ramaswamy
By Vladimir Gamaley
2Abstract
- In large data environments its difficult to
provide fast and reliable answers. - In this paper we demonstrate the difficulties in
the traditional approach and propose the new
technics for evaluation and maintenance of Join
Synopses.
3Introduction
- Tradition query processing approach (exact answer
though minimal time use) - Not always an exact answer is needed
- Sometimes appropriate answer is enough
4Introduction (continued)
- Schemes for providing approximate answers that
rely on basic relations alone suffer from serious
disadvantages. - Use of precomputed small sets of distinguished
joins.
5Introduction (continued)
- Careful allocation of place
- Allocation heuristics
- Providing approximate bounds
- Join synopses maintenance
- Experimental study results
6AQUA System
- The goal of AQUA system is to improve response
times by avoiding accesses to the original data. - Maintenance of small synopses of various samples
and histograms.
7AQUA System (Components)
- Statistics Collection
- Query Rewriting
- Maintenance
8AQUA System (Architecture)
9Problems with joins
- Uniform random samples provide
- Non uniform result samples
- Small join results sizes
10Problems with joins (example)
Base probability for tuple to be selected 1/r a1
and a2 - 1/r3 a1 and b1 - 1/r4 for k way foreign
join - 1/rk
11Join Synopses
Foreign Key Join A two way join r1 r2 is a
foreign key join if the join attribute is a
foreign key in r1 (a key in r2). For k gt 2, a
k-way foreign join if there is an ordering
r1,r2..rk and for j 1,2,.. K, si-1 ri is a
2-way foreign join where si-1 is a relation
obtained joining r1, r2, ri-1
12Join synopses
TPC-D scheme
13Join synopses (continued)
Lemma 1 The subgraph of G on the k nodes in any
k - way foreign key join must be a connected
graph with a single root node
Lemma 2 There is a 1-1 correspondence between
tuples in a relation r1 and tuple in any k-way
foreign key join.
14Join Synopses (continued)
Join Synopses For each node u in G,
corresponding to a relation r1, define J(u) to be
the output of the maximum foreign key join
r1,r2..rk with source k1. Let Su be a uniform
random sample of r1. Define a join synopses J(Su)
to be the output of join Su, r2, ..rk. The join
synopses for scheme consists of join synopses for
all us.
15Join Synopses (continued)
Theorem Let r1,r2rk, kgt3 be an arbitrary k-way
foreign join, with source relation r1. Let u be
the node in G corresponding to r1 and let Su be a
uniform random sample of r1. Let A be the set of
attributes in r1,r1rk Then 1. J(Su) is a
uniform random sample of J(u) with Su tuples 2.
Join r1, r2rk is a projection of J(u) on the
attributes in r1, r2rk
16Join Synopses (continued)
Lemma From a single join synopses for a node
whose maximum foreign key has k relations we can
extract uniform random sample of between k-1 to
2k-1 -1 distinct foreign key joins
Lemma For any node u whose maximum foreign key
join is a k-way join, number of tuples in its
renormalized join synopsis J(Su) is at most kSu
17Space allocation strategies
ni - numbr of tuples allocated to the join fi-
fraction of queries for which the join is a
relation or the source of foreign key
join Theorem ni N N N/ si - size
of join tuple
18Space allocation strategies
Heuristics EqJoin Equally between
relations CubeJoin In proportion to the cube
root of their join synopses tuple size PropJoin
In proportion to their join synopses size.
19Maintenance of Join Synopses
Adding a tuple Deleting a tuple
20Experiments Results
TestBed
TPC-D decision benchmark. Scale factor 0.3
(database of about 300 megabytes). 296MHz
UltraSparc-II. I/O - 5 MB/sec
21Experiment 1accuracy - summary size
EquiBase, PropBase - produce answers only when
the summary size exceeds 1.5 of the
database. EquiJoin, PropJoin - good results even
for 0.1 of the database.
22Experiment 2execution timing
Actual execution time - 122 seconds. The
response time increases with the summary
size. Query using Join Synopses needs in two
orders less time!
23Related Work
Approximate query answering Statistical techniques
24Conclusions
Schemes based on join synopses provide better
answer than those, based only on the basic
relations samples. Approximate answering is
becoming extremely important in new application
of data warehouses. However, there are still more
problems group-bys, ranks etc...