Title: Space-Efficient Online Computation of Quantile Summaries
1Space-Efficient Online Computation of Quantile
Summaries
- Michael Greenwald Sanjeev Khanna
- University of Pennsylvania
- Presented by nir levy
2Introduction
- The problem
- We are given a very large data set and wish to compute $\phi$-quantiles in a single pass, using space-efficient computation.
- Def: The $\phi$-quantile of an ordered sequence of $N$ data items is the value with rank $\phi N$ (the element in position $\phi N$).
- We are going to see an online algorithm for computing $\epsilon$-approximate quantile summaries of a very large data sequence.
- Def: An $\epsilon$-approximate quantile summary of a sequence of $N$ elements is a data structure that can answer quantile queries about the sequence to within a precision of $\epsilon N$.
- Def: A quantile summary consists of a small number of points from the input data sequence, and uses those quantile estimates to give approximate responses to any arbitrary quantile query.
3Introduction cont
- EXAMPLE
- Input data: 14, 2, 12, 5, 6, 19, 1, 14, 4, 9, 12, 3, 8, 11, 15, 4
- Ordered: 19, 15, 14, 14, 12, 12, 11, 9, 8, 6, 5, 4, 4, 3, 2, 1
- Rank: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
- What is the 2nd biggest number? (15)
- What is the 25% quantile? (14)
- Summary: 19, 14, 11, 6, 4, 1
- Rank: 1, 4, 7, 10, 13, 16
- What is the 2nd biggest number? 2nd → 1st (19)
- What is the 25% quantile? $16 \cdot 0.25 = 4$ → 4th (14)
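The example can be replayed in a few lines. This is only an illustrative sketch: it builds the summary by sorting the full input (which the online algorithm never does), and the helper name `approx_rank_query` is made up for this illustration.

```python
# Input from the slide; rank 1 is the largest element.
data = [14, 2, 12, 5, 6, 19, 1, 14, 4, 9, 12, 3, 8, 11, 15, 4]
ordered = sorted(data, reverse=True)

# Keep every third rank (1, 4, 7, ...), as in the slide's summary.
summary = [(rank, ordered[rank - 1]) for rank in range(1, len(ordered) + 1, 3)]


def approx_rank_query(summary, rank):
    """Return the stored (rank, value) pair whose rank is closest to `rank`."""
    return min(summary, key=lambda pair: abs(pair[0] - rank))


print(summary)
print(approx_rank_query(summary, 2))                         # 2nd biggest number
print(approx_rank_query(summary, int(0.25 * len(ordered))))  # 25% quantile
```

Every stored rank is at most one position away from any queried rank, which is the kind of guarantee the rest of the talk makes precise.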
4Quantile estimation for Database Applications
- Estimate the size of intermediate results, to allow query optimizers to estimate the cost of competing plans to resolve database queries.
- Partition data into roughly equal partitions for parallel databases.
- Prevent expensive and incorrect queries from being issued, by estimating result sizes and giving feedback to the users.
- Characterize the distribution of real-world data sets for database users.
5Properties
- Properties for quantile estimators
- 1. Provide tunable and explicit guarantees on the precision of the approximation. That is, for any given rank $r$, an $\epsilon$-approximate quantile summary returns a value whose rank $r'$ is guaranteed to be within the interval $[r - \epsilon N, r + \epsilon N]$.
- 2. Be data independent. That is, it should neither be affected by the arrival order or distribution of the values, nor require a priori knowledge of the size of the dataset.
- 3. Execute in a single pass over the data.
- 4. Have as small a memory footprint as possible (this applies to temporary storage during the computation).
6Previous Work
- Manku, Rajagopalan and Lindsay (MRL) presented a single-pass algorithm for an $\epsilon$-approximate quantile summary. It requires $O(\frac{1}{\epsilon} \log^2(\epsilon N))$ space but needs advance knowledge of $N$ (otherwise it provides only a probabilistic guarantee on the precision).
- Gibbons, Matias and Poosala presented a multiple-pass algorithm with a probabilistic guarantee.
- Munro and Paterson showed that any algorithm that exactly computes $\phi$-quantiles in only $p$ passes requires $\Omega(N^{1/p})$ space.
7This algorithm
- Presents a worst-case space requirement of $O(\frac{1}{\epsilon} \log(\epsilon N))$, thus improving upon the previous best result of $O(\frac{1}{\epsilon} \log^2(\epsilon N))$.
- In contrast to earlier algorithms, the algorithm doesn't require a priori knowledge of the length of the input sequence.
- Based on a novel data structure that effectively maintains the range of possible ranks for each quantile that it stores.
- The behavior is based on the fact that no input sequence can be "bad" across the entire distribution; that is, the input sequence cannot keep presenting new observations that must be stored without allowing old stored observations to be deleted.
8The Data Structure
- Assume w.l.o.g. that a new observation arrives after each unit of time.
- Denote by $n$ the number of observations seen so far, as well as the current time.
- Denote by $\epsilon$ the given precision requirement.
- Denote by $S = S(n)$ the summary data structure at time $n$.
- $S(n)$ consists of an ordered sequence of elements corresponding to a subset of the observations seen thus far.
- For each observation $v$ in $S$, maintain implicit bounds on the minimum and the maximum possible rank of $v$ among the first $n$ observations (denoted by $R_{\min}(v)$ and $R_{\max}(v)$).
9Data structure cont
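- More formally
- Let $S(n)$ be the set of tuples $t_0, t_1, \ldots, t_{s-1}$ where $t_i = (v_i, g_i, \Delta_i)$
- $v_i$ is one of the elements of the data stream
- $g_i = R_{\min}(v_i) - R_{\min}(v_{i-1})$
- $\Delta_i = R_{\max}(v_i) - R_{\min}(v_i)$
- $\sum_{j \le i} g_j = (R_{\min}(v_i) - R_{\min}(v_{i-1})) + (R_{\min}(v_{i-1}) - R_{\min}(v_{i-2})) + \ldots + (R_{\min}(v_1) - R_{\min}(v_0)) + R_{\min}(v_0) = R_{\min}(v_i)$, taking $g_0 = R_{\min}(v_0)$
- $(\sum_{j \le i} g_j) + \Delta_i = R_{\min}(v_i) + R_{\max}(v_i) - R_{\min}(v_i) = R_{\max}(v_i)$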
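The two identities above mean the rank bounds never have to be stored; they can be recovered by a prefix sum over the $g$ values. A minimal sketch, assuming the summary is held as a Python list of `(v, g, delta)` tuples sorted by value (the function name and the sample tuples are hypothetical):

```python
from itertools import accumulate


def rank_bounds(tuples):
    """Map (v, g, delta) tuples to (v, r_min, r_max) using prefix sums of g."""
    r_mins = accumulate(g for _, g, _ in tuples)
    return [(v, r_min, r_min + delta)
            for (v, _, delta), r_min in zip(tuples, r_mins)]


# Hypothetical summary of n = 8 observations, sorted by value.
example = [(1, 1, 0), (4, 3, 1), (9, 2, 2), (19, 2, 0)]
print(rank_bounds(example))
```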
10Data structure cont
- At all times ensure that $v_0$ and $v_{s-1}$ correspond to the minimum and maximum elements seen so far.
- $g_i + \Delta_i - 1$ is an upper bound on the total number of observations that may have fallen between $v_{i-1}$ and $v_i$.
- $\sum_i g_i = n$ is the number of observations seen so far.
11Answering Quantile Queries
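- Proposition 1
- Given a quantile summary $S$ in the above form, a $\phi$-quantile can always be identified to within an error of $\max_i(g_i + \Delta_i)/2$.
- Proof.
- Let $r = \lceil \phi n \rceil$ and let $e = \max_i(g_i + \Delta_i)/2$.
- Search for an index $i$ such that $r - e \le R_{\min}(v_i)$ and $R_{\max}(v_i) \le r + e$.
- [Figure: number line from $v_0$ to $v_{s-1}$ showing $\lceil \phi n \rceil$ between $R_{\min}(v_i)$ and $R_{\max}(v_i)$, inside a window of width $\max_i(g_i + \Delta_i)$]
- ⇒ $v_i$ approximates the $\phi$-quantile within the claimed error bound.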
12Answering Quantile Queries cont
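- All that is left to see is that such an index $i$ must always exist.
- Consider the case $r > n - e$:
- [Figure: number line from $v_0$ to $v_{s-1}$ with $n - e$ and $r$ marked]
- We have $R_{\min}(v_{s-1}) = R_{\max}(v_{s-1}) = n$ and therefore $i = s-1$ is valid.
- Otherwise $r \le n - e$. Choose the smallest $j$ such that $R_{\max}(v_j) > r + e$.
- It follows that $R_{\min}(v_{j-1}) \ge r - e$, since for $R_{\min}(v_{j-1}) < r - e$ we get $R_{\max}(v_j) = R_{\min}(v_{j-1}) + g_j + \Delta_j > R_{\min}(v_{j-1}) + 2e$
- [Figure: number line from $v_0$ to $v_{s-1}$ with $R_{\min}(v_{j-1})$, $r - e$, $r$, $r + e$ and $R_{\max}(v_j)$ marked]
- ⇒ Contradiction to the assumption that $e = \max_i(g_i + \Delta_i)/2$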
13Answering Quantile Queries cont
- By the choice of $j$, $R_{\max}(v_{j-1}) \le r + e$; therefore $j-1$ is an example of an index $i$ with the desired property.
- Corollary 1
- If at any time $n$ the summary $S(n)$ satisfies the property that $\max_i(g_i + \Delta_i) \le 2\epsilon n$, then we can answer any $\phi$-quantile query to within an $\epsilon n$ precision.
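The search in the proof of Proposition 1 translates directly into a linear scan. The sketch below assumes the summary is a list of `(v, g, delta)` tuples sorted by value with the maximum stored last (so its $\Delta$ is 0); the function name and the sample data are hypothetical.

```python
import math


def query(tuples, phi, n):
    """Return a stored value whose rank bounds lie within e of rank ceil(phi * n)."""
    r = math.ceil(phi * n)
    e = max(g + delta for _, g, delta in tuples) / 2
    r_min = 0
    for v, g, delta in tuples:
        r_min += g                      # R_min(v_i) is the running sum of g
        if r - e <= r_min and r_min + delta <= r + e:
            return v
    raise LookupError("no tuple within the error bound")


# Hypothetical summary of n = 8 observations; here e = 4 / 2 = 2.
example = [(1, 1, 0), (4, 3, 1), (9, 2, 2), (19, 2, 0)]
print(query(example, 0.5, 8))
print(query(example, 1.0, 8))
```

For $\phi = 1$ the scan returns 9 rather than the true maximum 19: the first tuple that satisfies both inequalities is accepted, which is all the proposition promises.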
14Data structure cont
- At a high level
- On a new observation, insert into the summary a tuple corresponding to this observation.
- Periodically, perform a sweep over the summary to merge some of the tuples into their neighbors so as to free space.
- Maintain several conditions in order to bound the space used by $S$ at any time.
- By Corollary 1 it suffices to ensure that at all times $\max_i(g_i + \Delta_i) \le 2\epsilon n$.
- Def: An individual tuple is full if $g_i + \Delta_i = \lfloor 2\epsilon n \rfloor$.
- Def: The capacity of an individual tuple is the maximum number of observations that can be counted by $g_i$ before the tuple becomes full.
15BANDS
- General strategy: delete tuples with small capacities and preserve tuples with large capacities.
- In the merge phase, free up space by merging tuples with small capacities into tuples with similar or larger capacities.
- We say two tuples $t_i$ and $t_j$ have similar capacities if $\log capacity(t_i) \approx \log capacity(t_j)$
- This notion of similarity partitions the possible values of $\Delta$ into bands
- We try to divide the $\Delta$s into bands that lie between elements of $0, \frac{1}{2}(2\epsilon n), \frac{3}{4}(2\epsilon n), \ldots, \frac{2^i - 1}{2^i}(2\epsilon n), \ldots, 2\epsilon n - 1, 2\epsilon n$
- These boundaries correspond to capacities of $2\epsilon n, \epsilon n, \frac{1}{2}\epsilon n, \ldots, \frac{1}{2^i}\epsilon n, \ldots, 8, 4, 2, 1$
16BANDS cont
- Define $band_\alpha$ to be the set of all $\Delta$ such that
- $p - 2^{\alpha} - (p \bmod 2^{\alpha}) < \Delta \le p - 2^{\alpha-1} - (p \bmod 2^{\alpha-1})$
- where $p = \lfloor 2\epsilon n \rfloor$ and $\alpha = 1, \ldots, \lceil \log(2\epsilon n) \rceil$
- The above definition ensures that if two $\Delta$s are ever in the same band, they never appear in different bands as $n$ increases
- Define $band_0$ simply to be $p$
- Consider the first $1/(2\epsilon)$ observations, with $\Delta = 0$, to be in a band of their own.
17BANDS cont
- Example
- Consider $\epsilon = 1/8$.
- Observations are labelled in groups of four by their $\Delta$ value:
  Label:  a     b     c      d       e       f       g
  Δ:      0     1     2      3       4       5       6
  N:      1..4  5..8  9..12  13..16  17..20  21..24  25..28
- Band membership of each group as N grows:
  N:      25..28   21..24  17..20  13..16  9..12  5..8
  Band0:  g        f       e       d       c      b
  Band1:  f        d,e     d       b,c     b      -
  Band2:  b,c,d,e  b,c     b,c     -       -      -
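The band table can be recomputed from the definition. The sketch below is an illustration under two assumptions: the upper end of each band is inclusive, and $\Delta = 0$ (group a) is left out because the slides treat it as a band of its own. Groups b..g correspond to $\Delta = 1..6$, and each column of the table corresponds to one value of $p$ from 6 down to 1. The helper names are hypothetical.

```python
def band(delta, p):
    """Band index of `delta` for p = floor(2 * eps * n), following the band definition."""
    if not 0 <= delta <= p:
        raise ValueError("expected 0 <= delta <= p")
    if delta == p:
        return 0
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1


def members(p, alpha):
    """All delta values in 1..p that fall in band `alpha` (delta = 0 is left out)."""
    return [d for d in range(1, p + 1) if band(d, p) == alpha]


for p in (6, 5, 4, 3, 2, 1):
    print(p, [members(p, a) for a in (0, 1, 2)])
```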
18BANDS cont
19BANDS cont
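- Proposition 2
- At any point in time $n$ and for any $\alpha \ge 1$, $band_\alpha(n)$ contains either $2^{\alpha}$ or $2^{\alpha-1}$ distinct values of $\Delta$.
- PROOF
- According to the upper and lower bounds of $band_\alpha$, with $p = \lfloor 2\epsilon n \rfloor$:
- $p - 2^{\alpha} - (p \bmod 2^{\alpha}) < \Delta \le p - 2^{\alpha-1} - (p \bmod 2^{\alpha-1})$
- If $(p \bmod 2^{\alpha}) < 2^{\alpha-1}$ then $(p \bmod 2^{\alpha}) = (p \bmod 2^{\alpha-1})$
- ⇒ $band_\alpha$ contains $2^{\alpha} - 2^{\alpha-1} = 2^{\alpha-1}$ distinct values of $\Delta$
- If $(p \bmod 2^{\alpha}) \ge 2^{\alpha-1}$ then $(p \bmod 2^{\alpha}) = 2^{\alpha-1} + (p \bmod 2^{\alpha-1})$
- ⇒ $band_\alpha$ contains $2^{\alpha-1} + 2^{\alpha-1} = 2^{\alpha}$ distinct values of $\Delta$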
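The two cases can be checked exhaustively for small parameters. This sketch counts the integers in the half-open interval that defines each band, without clipping at $\Delta \ge 0$ (an assumption about how the proposition is meant); the function name is hypothetical.

```python
def band_width(p, alpha):
    """Number of integers in (lower, upper] for band `alpha` at p = floor(2 * eps * n)."""
    lower = p - 2**alpha - (p % 2**alpha)
    upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
    return upper - lower


# Every band should hold either 2**alpha or 2**(alpha - 1) values.
widths_ok = all(
    band_width(p, alpha) in (2**alpha, 2**(alpha - 1))
    for p in range(2000)
    for alpha in range(1, 12)
)
print(widths_ok, band_width(6, 2), band_width(5, 2))
```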
20A tree representation
- For $S = \langle t_0, t_1, \ldots, t_{s-1} \rangle$, impose a tree structure $T$ over the tuples of $S$.
- Assign a special root node $R$
- For every tuple $t_i$ assign a node $V_i$
- The parent of every node $V_i$ is the node $V_j$ such that $j$ is the least index greater than $i$ with $band(t_j) > band(t_i)$. If no such $j$ exists, then set $R$ to be the parent.
- All children (and all descendants) of a given node $V_i$ have $\Delta$ values larger than $\Delta_i$.
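The parent rule needs only the list of band values. A small sketch, where `None` stands for the special root $R$ and the band list is invented for illustration:

```python
def parents(bands):
    """bands[i] is the band of tuple t_i; return each node's parent index (None = root R)."""
    result = []
    for i, b in enumerate(bands):
        # Least index j > i whose band is strictly larger than band(t_i).
        higher = (j for j in range(i + 1, len(bands)) if bands[j] > b)
        result.append(next(higher, None))
    return result


bands = [0, 1, 0, 2, 1, 3]  # hypothetical band values for t_0 .. t_5
print(parents(bands))
```

In this example node 3 has children 1 and 2 (bands 1 and 0), and its descendants 0, 1, 2 sit directly to its left in $S$.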
21A tree representation
- Proposition 4
- For any node $V$, the set of all its descendants in $T$ forms a contiguous segment in $S$
- Proposition 3
- The children of any node in $T$ are always arranged in non-increasing order of band in $S$
22Operations
- To compute an $\epsilon$-approximate $\phi$-quantile from $S(n)$ after $n$ observations, the following operations are used
- During the operations we wish to maintain the correct relationship between $g_i$, $\Delta_i$, $R_{\min}$ and $R_{\max}$
- QUANTILE($\phi$): compute the rank $r = \lceil \phi n \rceil$; find $i$ such that $r - R_{\min}(v_i) \le \epsilon n$ and $R_{\max}(v_i) - r \le \epsilon n$; return $v_i$.
- INSERT($v$): find the smallest $i$ such that $v_{i-1} \le v < v_i$ and insert the tuple $(v, 1, \lfloor 2\epsilon n \rfloor)$ between $t_{i-1}$ and $t_i$. If $v$ is the new minimum or maximum seen, insert $(v, 1, 0)$ instead.
23Operations Cont
- INSERT($v$) maintains the correct relationship between $g_i$, $\Delta_i$, $R_{\min}$ and $R_{\max}$
- If $v$ is inserted before $v_i$, the value of $R_{\min}(v)$ may be as small as $R_{\min}(v_{i-1}) + 1$
- Similarly, $R_{\max}(v)$ may be as large as the current $R_{\max}(v_i)$, which in turn is within $\lfloor 2\epsilon n \rfloor$ of $R_{\min}(v_{i-1})$.
- Note that $R_{\min}(v_i)$ and $R_{\max}(v_i)$ get increased by 1 after insertion.
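INSERT can be sketched with a binary search over the stored values. The list-of-tuples layout and the function signature are assumptions made for illustration; `n` is the number of observations seen before this insertion.

```python
import bisect
import math


def insert(tuples, v, n, eps):
    """Insert observation v into a value-sorted list of (v, g, delta) tuples."""
    values = [t[0] for t in tuples]
    i = bisect.bisect_right(values, v)       # smallest i with v < v_i
    if i == 0 or i == len(tuples):
        new = (v, 1, 0)                      # new minimum or maximum: exact rank
    else:
        new = (v, 1, math.floor(2 * eps * n))
    tuples.insert(i, new)
    return tuples


summary = []
for n, v in enumerate([5, 9, 7, 1]):
    insert(summary, v, n, eps=0.25)
print(summary)
```

Because only $g$ and $\Delta$ are stored, the shift by one of every later $R_{\min}$ and $R_{\max}$ happens implicitly: the new tuple's $g = 1$ enters all later prefix sums.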
24Operations Cont
- DELETE($v_i$): replace the tuples $(v_i, g_i, \Delta_i)$ and $(v_{i+1}, g_{i+1}, \Delta_{i+1})$ with the new tuple $(v_{i+1}, g_i + g_{i+1}, \Delta_{i+1})$.
- Deleting $v_i$ has no effect on $R_{\min}(v_{i+1})$ and $R_{\max}(v_{i+1})$, so DELETE should simply preserve them.
- The relationship between $R_{\min}(v_{i+1})$ and $R_{\max}(v_{i+1})$ is preserved as long as $\Delta_{i+1}$ is unchanged.
- Since $R_{\min}(v_{i+1}) = \sum_{j \le i+1} g_j$ and we deleted $g_i$, we must increase $g_{i+1}$ by $g_i$ to keep $R_{\min}(v_{i+1})$.
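A sketch of DELETE on the same assumed list-of-tuples layout; the sample summary is hypothetical.

```python
def delete(tuples, i):
    """Merge tuple i into its successor, keeping the successor's value and delta."""
    _, g, _ = tuples[i]
    v_next, g_next, delta_next = tuples[i + 1]
    tuples[i:i + 2] = [(v_next, g + g_next, delta_next)]
    return tuples


summary = [(1, 1, 0), (4, 3, 1), (9, 2, 2), (19, 2, 0)]
delete(summary, 1)
print(summary)
```

Before the call the tuple for 9 has $R_{\min} = 1 + 3 + 2 = 6$ and $R_{\max} = 8$; afterwards it has $R_{\min} = 1 + 5 = 6$ and $R_{\max} = 8$, so both bounds are preserved.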
25COMPRESS
- The operation COMPRESS tries to merge a node and all its descendants into either its parent node or its right sibling (by deleting them).
- During COMPRESS we must ensure that the tuple resulting from the merge is not full.
- Two adjacent tuples $t_i$, $t_{i+1}$ are mergeable if the resulting tuple is not full and $band(t_i, n) \le band(t_{i+1}, n)$.
- Note that a pair of tuples that is not mergeable at some point in time may become so at a later point, as the term $\lfloor 2\epsilon n \rfloor$ increases over time.
- Let $g^*_i$ denote the sum of the $g$-values of tuple $t_i$ and all its descendants in $T$.
26Operations Cont
- COMPRESS()
-   for $i$ from $s-2$ down to 0 do
-     if $BAND(\Delta_i, 2\epsilon n) \le BAND(\Delta_{i+1}, 2\epsilon n)$ and $g^*_i + g_{i+1} + \Delta_{i+1} < 2\epsilon n$ then
-       DELETE all descendants of $t_i$ and the tuple $t_i$ itself
-     end if
-   end for
- COMPRESS inspects tuples from right (highest index) to left. It first combines children (and their entire subtrees of descendants) into their parents, and only when the parent is full does it combine siblings.
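One COMPRESS sweep can be sketched without building the tree explicitly, using the fact that the descendants of $t_i$ are the contiguous run of tuples to its left with strictly smaller bands. Two choices here are assumptions of this sketch rather than statements from the slides: the sweep stops at index 1 so that the stored minimum is never deleted, and any $\Delta \ge p$ is treated as band 0.

```python
import math


def band(delta, p):
    """Band of `delta` relative to p = floor(2 * eps * n); band 0 is delta == p."""
    if delta >= p:
        return 0
    alpha = 1
    while not (p - 2**alpha - p % 2**alpha < delta
               <= p - 2**(alpha - 1) - p % 2**(alpha - 1)):
        alpha += 1
    return alpha


def compress(tuples, n, eps):
    """One right-to-left COMPRESS sweep over a list of (v, g, delta) tuples."""
    t = list(tuples)
    bound = 2 * eps * n
    p = math.floor(bound)
    i = len(t) - 2
    while i >= 1:  # assumption: index 0 (the minimum) is kept
        v_next, g_next, d_next = t[i + 1]
        b = band(t[i][2], p)
        if b <= band(d_next, p):
            # g_star sums g over t_i and its descendants (run of smaller bands).
            start, g_star = i, t[i][1]
            while start > 1 and band(t[start - 1][2], p) < b:
                start -= 1
                g_star += t[start][1]
            if g_star + g_next + d_next < bound:
                t[start:i + 2] = [(v_next, g_star + g_next, d_next)]
                i = start
        i -= 1
    return t


# 40 observations that arrived in sorted order: each was a new maximum.
sorted_input = [(v, 1, 0) for v in range(40)]
compressed = compress(sorted_input, n=40, eps=0.125)
print(compressed)
```

With $2\epsilon n = 10$, each surviving tuple absorbs neighbours until one more merge would reach the bound, so the 40 tuples shrink to 6.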
27Operations Cont
- Initial State
- $S \leftarrow \emptyset$; $s = 0$; $n = 0$.
- Algorithm
- To add the $(n+1)$st observation, $v$, to summary $S(n)$:
- if $n \equiv 0 \pmod{1/(2\epsilon)}$ then
-   COMPRESS()
- end if
- INSERT($v$)
- $n = n + 1$
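Putting the pieces together gives the following end-to-end sketch. It is not the authors' implementation: the class and method names are invented, COMPRESS leaves index 0 alone so the minimum survives, and INSERT uses $\Delta = g_i + \Delta_i - 1$ (the variant the talk mentions for the experiments) instead of $\lfloor 2\epsilon n \rfloor$, which keeps $g_i + \Delta_i \le 2\epsilon n$ in this sketch once $2\epsilon n \ge 1$.

```python
import bisect
import math
import random


class GKSummary:
    """Sketch of the summary; tuples are [value, g, delta], sorted by value."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []
        self.period = max(1, int(1 / (2 * eps)))  # COMPRESS every 1/(2*eps) inserts

    def _band(self, delta, p):
        if delta >= p:
            return 0
        alpha = 1
        while not (p - 2**alpha - p % 2**alpha < delta
                   <= p - 2**(alpha - 1) - p % 2**(alpha - 1)):
            alpha += 1
        return alpha

    def _compress(self):
        t = self.tuples
        bound = 2 * self.eps * self.n
        p = math.floor(bound)
        i = len(t) - 2
        while i >= 1:  # index 0 (the minimum) is never deleted
            b = self._band(t[i][2], p)
            if b <= self._band(t[i + 1][2], p):
                # Descendants of t[i]: the run to its left with smaller bands.
                start, g_star = i, t[i][1]
                while start > 1 and self._band(t[start - 1][2], p) < b:
                    start -= 1
                    g_star += t[start][1]
                if g_star + t[i + 1][1] + t[i + 1][2] < bound:
                    t[i + 1][1] += g_star
                    del t[start:i + 1]
                    i = start
            i -= 1

    def insert(self, v):
        if self.n % self.period == 0:
            self._compress()
        t = self.tuples
        i = bisect.bisect_right([x[0] for x in t], v)
        if i == 0 or i == len(t):
            delta = 0  # new minimum or maximum: rank known exactly
        else:
            delta = t[i][1] + t[i][2] - 1  # experimental variant, see lead-in
        t.insert(i, [v, 1, delta])
        self.n += 1

    def query(self, phi):
        r = math.ceil(phi * self.n)
        e = max(g + d for _, g, d in self.tuples) / 2
        r_min = 0
        for v, g, d in self.tuples:
            r_min += g
            if r - e <= r_min and r_min + d <= r + e:
                return v
        raise LookupError("no tuple within the error bound")


N, EPS = 2048, 1 / 64
stream = random.Random(0).sample(range(N), N)  # the values 0..N-1, shuffled
summary = GKSummary(EPS)
for x in stream:
    summary.insert(x)
print(len(summary.tuples), summary.query(0.5))
```

Because the stream is a permutation of `0..N-1`, the true rank of a returned value `v` is `v + 1`, which makes the $\epsilon N$ guarantee easy to check against the answers.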
28Analysis
- The INSERT and COMPRESS operations always ensure that $g_i + \Delta_i \le 2\epsilon n$
- We will now see that the total number of tuples in the summary $S(n)$ is bounded by $\frac{11}{2\epsilon} \log(2\epsilon n)$.
- Def (coverage): we say that a tuple $t_i$ in $S(n)$ covers an observation $v$ at any time $n$ if either the tuple for $v$ has been directly merged into $t_i$, or a tuple $t$ that covered $v$ has been merged into $t_i$.
- A tuple always covers itself.
- It is easy to see that the number of observations covered by $t_i$ is exactly given by $g_i = g_i(n)$
29Analysis Cont
- Lemma 1
- At no point in time does a tuple from band $\alpha$ cover an observation from a band $> \alpha$.
- Lemma 2
- At any point in time $n$, and for any integer $\alpha$, the total number of observations covered cumulatively by all tuples with band value in $[0..\alpha]$ is bounded by $2^{\alpha}/\epsilon$.
- Lemma 3
- At any time $n$ and for any given $\alpha$, there are at most $3/(2\epsilon)$ nodes in $T(n)$ that have a child with band value $\alpha$. That is, there are at most $3/(2\epsilon)$ parents of nodes from $band_\alpha(n)$.
30Analysis Cont
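- PROOF of Lemma 3
- Let $m_{\min}$, $m_{\max}$ denote the earliest and the latest time at which a node from $band_\alpha$ could be seen.
- $m_{\min} = (2\epsilon n - 2^{\alpha} - (2\epsilon n \bmod 2^{\alpha})) / (2\epsilon)$
- $m_{\max} = (2\epsilon n - 2^{\alpha-1} - (2\epsilon n \bmod 2^{\alpha-1})) / (2\epsilon)$
- Choose a child-parent pair $(V_i, V_j)$ where $V_j$ is in $band_\alpha$
- Since $V_j$ exists we can show that [inequality not preserved in this transcript]
- Since at time $m_j$ (when $V_j$ showed up) we had $g_i(m_j) + \Delta_i < 2\epsilon m_j$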
31Analysis Cont
- For all pairs $(V_i, V_j)$ we have distinct observations
- The number of observations that came after $m_{\min}$ is $n - m_{\min}$
- We get at most $(n - m_{\min}) / (2\epsilon (n - m_{\max})) \le 3/(2\epsilon)$ such pairs
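The last inequality can be checked from the definitions of $m_{\min}$ and $m_{\max}$. The derivation below is a sketch that writes $p$ for $2\epsilon n$ and treats it as an integer, which is an assumption made here for readability.

```latex
\begin{align*}
n - m_{\min} &= \frac{2^{\alpha} + (p \bmod 2^{\alpha})}{2\epsilon}, \qquad
n - m_{\max} = \frac{2^{\alpha-1} + (p \bmod 2^{\alpha-1})}{2\epsilon}, \\
\frac{n - m_{\min}}{n - m_{\max}}
  &= \frac{2^{\alpha} + (p \bmod 2^{\alpha})}{2^{\alpha-1} + (p \bmod 2^{\alpha-1})}
  \le \frac{3 \cdot 2^{\alpha-1} + (p \bmod 2^{\alpha-1})}{2^{\alpha-1} + (p \bmod 2^{\alpha-1})}
  \le 3, \\
\frac{n - m_{\min}}{2\epsilon\,(n - m_{\max})} &\le \frac{3}{2\epsilon}.
\end{align*}
```

The middle step uses $(p \bmod 2^{\alpha}) \le 2^{\alpha-1} + (p \bmod 2^{\alpha-1})$, the same case split as in the proof of Proposition 2.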
32Analysis Cont
- Def: Given a full pair of tuples $(t_{i-1}, t_i)$, we say that $t_{i-1}$ is the left partner and $t_i$ is the right partner in this full pair.
- Lemma 4
- At any time $n$ and for any given $\alpha$, there are at most $4/\epsilon$ tuples from $band_\alpha(n)$ that are right partners in a full tuple pair.
- PROOF
- Let $t_i, t_{i+1}, \ldots, t_{i+p-1}$ be the longest contiguous segment of tuples from $band_\alpha(n)$ in $S(n)$.
- Since they existed after the COMPRESS operation, it must be the case that $g^*_{j-1} + g_j + \Delta_j > 2\epsilon n$ for all $i \le j < i+p$
33Analysis Cont
- According to Lemma 2 the first term is bounded by $2^{\alpha+1}/\epsilon$
- The second term is bounded by $p(2\epsilon n - 2^{\alpha-1})$
- Combining the two bounds we get $p < 4/\epsilon$
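The summed inequality itself did not survive in this transcript, so the following is a reconstruction under an assumption: the "first term" is taken to be the sum of the $g$ contributions and the "second term" the sum of the $\Delta_j$ over the $p$ tuples of the segment.

```latex
\begin{align*}
\sum_{j=i}^{i+p-1} \left( g^*_{j-1} + g_j \right) + \sum_{j=i}^{i+p-1} \Delta_j &> 2 p \epsilon n, \\
\frac{2^{\alpha+1}}{\epsilon} + p \left( 2\epsilon n - 2^{\alpha-1} \right) &> 2 p \epsilon n, \\
p \cdot 2^{\alpha-1} &< \frac{2^{\alpha+1}}{\epsilon}, \\
p &< \frac{4}{\epsilon}.
\end{align*}
```

The bound on the second term follows from the upper end of $band_\alpha$, since every $\Delta_j$ in the band is at most $2\epsilon n - 2^{\alpha-1}$.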
34Analysis Cont
- For non-contiguous segments, just apply the above summation over all such segments
- Lemma 5
- At any time $n$ and for any given $\alpha$, the maximum number of tuples possible from each $band_\alpha(n)$ is $11/(2\epsilon)$.
- Proof
- Each node of $band_\alpha(n)$ is either
- 1. a right partner in a full pair
- 2. a left partner in a full pair
- 3. not participating in any full pair
- The first case is bounded by $4/\epsilon$ (Lemma 4)
- The last two are bounded by $3/(2\epsilon)$ (Lemma 3)
- And the claim follows: $4/\epsilon + 3/(2\epsilon) = 11/(2\epsilon)$.
35Analysis Cont
- Theorem 1
- At any time $n$, the number of tuples stored in $S(n)$ is at most $\frac{11}{2\epsilon} \log(2\epsilon n)$.
- PROOF
- There are at most $1 + \lfloor \log(2\epsilon n) \rfloor$ bands at time $n$
- Summing over their sizes we get $\le \frac{11}{2\epsilon} \log(2\epsilon n)$.
36Experiments results
- The experiments were done on 3 different classes of input data
- 1. Hard case: an adversarial data sequence; that is, the next observation is placed in the largest current gap of the quantile summary.
- 2. Sorted input data: the data arrives in sorted order.
- 3. Random input data: each datum is selected (without replacement) from a uniform distribution over all remaining elements in the data set.
37Experiments results cont
- Sorted and random input data are used following the MRL experimental results.
- Random input data can give insight into the behavior of the algorithm on average inputs.
- In general, the algorithm used less space than indicated by the analysis, and turned out to need less space than MRL.
38Experiments results cont
- For each case we have 2 different kinds of experiment
- 1. Adaptive: the regular algorithm (with a slight variation)
- 2. Pre-allocated: uses the same space as used in the MRL
- We will see that in the latter case the observed error is significantly better than that of the MRL.
- Differences in the algorithm used for the experiment
- 1. An observation is inserted as a tuple $(v, 1, g_i + \Delta_i - 1)$ and not $(v, 1, \lfloor 2\epsilon n \rfloor)$.
- The latter is strictly to simplify theoretical analysis.
- 2. Rather than running COMPRESS after every $1/(2\epsilon)$ observations, for each observation inserted one tuple was deleted, when possible.
- ⇒ If no tuple could be deleted without making its successor full, the size of $S$ grew by 1.
39Experiments results cont
- We apply the following measurements
- 1. The maximum space used to produce the summary, counting the number of stored tuples (multiplied by 3 for comparison with MRL, to account for the $R_{\min}$ and $R_{\max}$ values stored in each tuple)
- 2. The observed precision of the results.
40Experiments results cont
- The number of stored quantiles required is approximately a factor of 11 less than the worst-case bound of the analysis.
- We almost always require less space than the MRL.
- The only exception is for $\epsilon = .001$ and $N = 10^5$, where MRL requires less space.
41Experiments results cont
- SORTED INPUT
- Fix $\epsilon = .001$ and construct summaries of sorted sequences of size $10^5$, $10^6$ and $10^7$
- Sample 15 quantiles at $(q_i/16)N$ for $q_i = 1..15$ and compute the maximum error over all possible quantile queries.
- Compare 3 algorithms
- 1. MRL: preallocates the storage required by MRL as a function of $N$ and $\epsilon$.
- 2. Pre-allocated: uses 1/3 as many stored quantiles as MRL.
- 3. Adaptive: storage is allocated for a new quantile only if no quantile could be deleted without exceeding a precision of $.001n$
42Experiments results cont
- S: the number of stored quantiles needed to achieve the desired precision
- Max $\epsilon$: the maximum error over all possible quantile queries of the summary
- The remaining rows list the approximation error of the response to the query for the $(q_i/16)$th quantile.
43Experiments results cont
- RANDOM INPUT
- Same measurements as in the sorted input ($\epsilon$ and sequence length)
- Run each experiment 50 times and report the max, min, mean and standard deviation for every measurement.
44Experiments results cont
45Experiments results cont
46Conclusions
- Improves upon the earlier results in two significant ways
- 1. It improves the space complexity by a factor of $O(\log(\epsilon N))$.
- 2. It doesn't require a priori knowledge of the parameter $N$; that is, it allocates more space dynamically as the data sequence grows in size.