Space-Efficient Online Computation of Quantile Summaries

1
Space-Efficient Online Computation of Quantile
Summaries

Michael Greenwald and Sanjeev Khanna
University of Pennsylvania
Presented by Nir Levy

2
Introduction

The problem
We are given a very large data set and wish to compute φ-quantiles
in a single pass using a space-efficient computation.
Def: The φ-quantile of an ordered sequence of N data items is the
value with rank ⌈φN⌉ (the element in position ⌈φN⌉).
We are going to see an online algorithm for computing ε-approximate
quantile summaries of a very large data sequence.
Def: An ε-approximate quantile summary of a sequence of N elements
is a data structure that can answer quantile queries about the
sequence to within a precision of εN.
Def: A quantile summary consists of a small number of points from
the input data sequence, and uses those quantile estimates to give
approximate responses to any arbitrary quantile query.

3
Introduction cont

EXAMPLE
Input data: 14, 2, 12, 5, 6, 19, 1, 14, 4, 9, 12, 3, 8, 11, 15, 4
Ordered:  19 15 14 14 12 12 11  9  8  6  5  4  4  3  2  1
Rank:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
What is the 2nd biggest number? (15)
What is the 25% quantile? (14)
Summary:  19 14 11  6  4  1
Rank:      1  4  7 10 13 16
What is the 2nd biggest number? Rank 2 ≈ rank 1 (19)
What is the 25% quantile? 16 × 0.25 = 4, so rank 4 (14)
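
The example above can be replayed with a short Python sketch. It only illustrates what a summary is, not the algorithm presented later. The helper names exact and approx, and the rule of answering from the nearest stored rank, are assumptions made for this sketch.

```python
data = [14, 2, 12, 5, 6, 19, 1, 14, 4, 9, 12, 3, 8, 11, 15, 4]
ordered = sorted(data, reverse=True)   # rank 1 is the largest value, as on the slide

# Keep every third rank (1, 4, 7, 10, 13, 16), as in the example summary.
summary = {rank: ordered[rank - 1] for rank in range(1, len(ordered) + 1, 3)}


def exact(rank):
    """Value at the given rank in the full ordered data."""
    return ordered[rank - 1]


def approx(rank):
    """Value stored at the summary rank closest to the requested rank (assumed rule)."""
    nearest = min(summary, key=lambda stored: abs(stored - rank))
    return summary[nearest]


print(exact(2), approx(2))   # 2nd biggest: exact answer vs. summary answer
print(exact(4), approx(4))   # 25% quantile: rank 16 * 0.25 = 4
```

Because the stored ranks are three apart, the nearest stored rank is never more than one position away from the requested one.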

4
Quantile estimation for Database Applications

Estimate the size of intermediate results, so that query optimizers
can estimate the cost of competing plans for resolving database
queries.
Partition data into roughly equal partitions for parallel databases.
Prevent expensive and incorrect queries from being issued, by
estimating result sizes and giving feedback to the users.
Characterize the distribution of real-world data sets for database
users.

5
Properties

Properties for quantile estimators
1. Provide tunable and explicit guarantees on the precision of the
approximation. That is, for any given rank r, an ε-approximate
quantile summary returns a value whose rank r' is guaranteed to be
within the interval [r - εN, r + εN].
2. Be data independent. That is, the guarantees should be affected
neither by the arrival order nor by the distribution of the values,
and the estimator should not require a priori knowledge of the size
of the data set.
3. Execute in a single pass over the data.
4. Have as small a memory footprint as possible (this applies to
temporary storage during the computation).

6
Previous Work

Manku, Rajagopalan and Lindsay (MRL) presented a single-pass
algorithm that builds an ε-approximate quantile summary in
O((1/ε) log²(εN)) space, but it needs advance knowledge of N
(otherwise it provides only a probabilistic guarantee on the
precision).
Gibbons, Matias and Poosala presented a multiple-pass algorithm
with a probabilistic guarantee.
Munro and Paterson showed that any algorithm that computes
φ-quantiles exactly in only p passes requires Ω(N^(1/p)) space.

7
This algorithm

Presents a worst-case space requirement of O((1/ε) log(εN)), thus
improving upon the previous best result of O((1/ε) log²(εN)).
In contrast to earlier algorithms, the algorithm doesn't require a
priori knowledge of the length of the input sequence.
Based on a novel data structure that effectively maintains the
range of possible ranks for each quantile that it stores.
The behavior is based on the fact that no input sequence can be bad
across the entire distribution; that is, the input sequence cannot
present new observations that must be stored without allowing old
stored observations to be deleted.

8
The Data Structure

Assume w.l.o.g. that a new observation arrives after each unit of
time.
Denote by n the number of observations seen so far, which is also
the current time.
Denote by ε the given precision requirement.
Denote by S = S(n) the summary data structure at any time n.
S(n) consists of an ordered sequence of elements corresponding to a
subset of the observations seen thus far.
For each observation v in S, maintain implicit bounds on the
minimum and the maximum possible rank of v among the first n
observations (denoted by Rmin(v) and Rmax(v)).

9
Data structure cont

More formally
Let S(n) be the set of tuples t_0, t_1, ..., t_(s-1),
where t_i = (v_i, g_i, Δ_i):
v_i is one of the elements of the data stream
g_i = Rmin(v_i) - Rmin(v_(i-1)), with g_0 = Rmin(v_0)
Δ_i = Rmax(v_i) - Rmin(v_i)
Σ_(j≤i) g_j = [Rmin(v_i) - Rmin(v_(i-1))] + [Rmin(v_(i-1)) - Rmin(v_(i-2))]
              + ... + [Rmin(v_1) - Rmin(v_0)] + Rmin(v_0) = Rmin(v_i)
(Σ_(j≤i) g_j) + Δ_i = Rmin(v_i) + [Rmax(v_i) - Rmin(v_i)] = Rmax(v_i)
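
As a minimal sketch of the two identities above, the rank bounds can be recovered from the stored (v, g, Δ) tuples by a running sum. The function name rank_bounds and the sample summary are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def rank_bounds(summary):
    """Return (v, Rmin, Rmax) for each tuple (v, g, delta) of the summary."""
    rmins = accumulate(g for _, g, _ in summary)      # Rmin(v_i) = sum of g_j for j <= i
    return [(v, rmin, rmin + delta)                   # Rmax(v_i) = Rmin(v_i) + delta_i
            for (v, _, delta), rmin in zip(summary, rmins)]


# Hypothetical summary of n = 6 observations.
example = [(1, 1, 0), (4, 3, 2), (9, 2, 0)]
print(rank_bounds(example))
```

Storing g and Δ instead of Rmin and Rmax means an insertion touches only one tuple, while the bounds of all later tuples shift implicitly.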

10
Data structure cont

At all times ensure that v_0 and v_(s-1) correspond to the minimum
and maximum elements seen so far.
g_i + Δ_i - 1 is an upper bound on the total number of observations
that may have fallen between v_(i-1) and v_i.
Σ_i g_i = n, the number of observations seen so far.

11
Answering Quantile Queries

Proposition 1
Given a quantile summary S in the above form, a φ-quantile can
always be identified to within an error of max_i(g_i + Δ_i)/2.
Proof.
Let r = ⌈φn⌉ and let e = max_i(g_i + Δ_i)/2.
Search for an index i such that r - e ≤ Rmin(v_i) and
Rmax(v_i) ≤ r + e.
[Figure: rank axis from v_0 to v_(s-1) showing Rmin(v_i), ⌈φn⌉ and
Rmax(v_i) inside a window of width max_i(g_i + Δ_i)]
⇒ v_i approximates the φ-quantile within the claimed error bound.
12
Answering Quantile Queries cont

All that is left to show is that such an index i must always exist.
Consider the case r > n - e.
[Figure: rank axis from v_0 to v_(s-1) with r to the right of n - e]
We have Rmin(v_(s-1)) = Rmax(v_(s-1)) = n, and therefore i = s-1 is
valid.
Otherwise r ≤ n - e. Choose the smallest j such that
Rmax(v_j) > r + e. It follows that Rmin(v_(j-1)) ≥ r - e, since for
Rmin(v_(j-1)) < r - e we would get
Rmax(v_j) = Rmin(v_(j-1)) + g_j + Δ_j > Rmin(v_(j-1)) + 2e
[Figure: rank axis from v_0 to v_(s-1) marking Rmin(v_(j-1)), r - e,
r, r + e and Rmax(v_j)]
⇒ a contradiction to the assumption that e = max_i(g_i + Δ_i)/2.
13
Answering Quantile Queries cont

By the choice of j, Rmax(v_(j-1)) ≤ r + e; therefore j-1 is an
example of an index i with the desired property.
Corollary 1
If at any time n the summary S(n) satisfies the property that
max_i(g_i + Δ_i) ≤ 2εn, then we can answer any φ-quantile query to
within εn precision.
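
The search in the proof of Proposition 1 can be sketched as a single left-to-right scan. The function name quantile and the sample summary are hypothetical. The sketch assumes the summary has the form described above (tuples (v, g, Δ), with the minimum and the maximum stored with Δ = 0) and returns the first index that satisfies the condition.

```python
import math


def quantile(summary, n, phi):
    """Return a stored value whose rank is within e of ceil(phi * n)."""
    r = max(1, math.ceil(phi * n))
    e = max(g + delta for _, g, delta in summary) / 2
    rmin = 0
    for v, g, delta in summary:
        rmin += g                                      # Rmin(v_i)
        if r - e <= rmin and rmin + delta <= r + e:    # Rmax(v_i) = Rmin(v_i) + delta_i
            return v
    raise ValueError("summary does not have the form assumed by Proposition 1")


# Hypothetical summary of n = 6 observations: (value, g, delta).
example = [(1, 1, 0), (3, 2, 1), (5, 2, 1), (9, 1, 0)]
print(quantile(example, 6, 0.5), quantile(example, 6, 1.0))
```

Here e = 1.5, so each answer has a true rank within 1.5 positions of the requested rank.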

14
Data structure cont

At a high level
On a new observation, insert into the summary a tuple corresponding
to this observation.
Periodically, perform a sweep over the summary to merge some of the
tuples into their neighbors so as to free space.
Maintain several conditions in order to bound the space used by S
at any time.
By Corollary 1 it suffices to ensure that at all times
max_i(g_i + Δ_i) ≤ 2εn.
Def: An individual tuple is full if g_i + Δ_i = ⌊2εn⌋.
Def: The capacity of an individual tuple is the maximum number of
observations that can be counted by g_i before the tuple becomes
full.

15
BANDS

General strategy: delete tuples with small capacities and preserve
tuples with large capacities.
In the merge phase, free up space by merging tuples with small
capacities into tuples with similar or larger capacities.
We say that two tuples t_i and t_j have similar capacities if
log capacity(t_i) ≈ log capacity(t_j)
This notion of similarity partitions the possible values of Δ into
bands.
We try to divide the Δs into bands that lie between the elements of
0, ½(2εn), ¾(2εn), ..., ((2^i - 1)/2^i)(2εn), ..., 2εn - 1, 2εn
These boundaries correspond to capacities of
2εn, εn, ½εn, ..., (1/2^i)εn, ..., 8, 4, 2, 1

16
BANDS cont

Define band_α to be the set of all Δ such that
p - 2^α - (p mod 2^α) < Δ ≤ p - 2^(α-1) - (p mod 2^(α-1))
where p = ⌊2εn⌋ and α = 1, ..., ⌈log(2εn)⌉.
The above definition ensures that if two Δs are ever in the same
band, they never appear in different bands as n increases.
Define band_0 simply to be p.
Consider the first 1/(2ε) observations, with Δ = 0, to be in a band
of their own.

17
BANDS cont

Example
Consider ε = 1/8.
Observations are labelled in groups of four by the Δ they receive:
group  a     b     c      d       e       f       g
N      1..4  5..8  9..12  13..16  17..20  21..24  25..28
Δ      0     1     2      3       4       5       6

Band membership of the groups as n grows:
n      25..28   21..24  17..20  13..16  9..12  5..8
Band0  g        f       e       d       c      b
Band1  f        d,e     d       b,c     b      -
Band2  b,c,d,e  b,c     b,c     -       -      -
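
The band definition can be checked against this example with a small Python sketch. The function name band is hypothetical. As an assumption of the sketch, the loop over α runs up to the bit length of p rather than ⌈log(2εn)⌉, so that Δ = 0 always falls into some band.

```python
def band(delta, p):
    """Band index of delta for p = floor(2 * eps * n); band 0 holds only delta == p."""
    if delta == p:
        return 0
    # Loop slightly further than ceil(log(2*eps*n)) so that delta = 0 is always
    # covered (a choice made for this sketch).
    for alpha in range(1, p.bit_length() + 1):
        lower = p - 2 ** alpha - (p % 2 ** alpha)
        upper = p - 2 ** (alpha - 1) - (p % 2 ** (alpha - 1))
        if lower < delta <= upper:
            return alpha
    raise ValueError("delta must not exceed p")


# eps = 1/8 with p = 6: deltas 1..6 correspond to groups b..g of the example.
print([band(delta, 6) for delta in range(1, 7)])
```

With p = 6 the groups b, c, d, e fall in band 2, f in band 1 and g in band 0, matching the example for n in 25..28.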
18
BANDS cont
19
BANDS cont

Proposition 2
At any point in time n and for any α ≥ 1, band_α(n) contains either
2^α or 2^(α-1) distinct values of Δ.
PROOF
According to the lower and upper bounds of band_α:
2εn - 2^α - (2εn mod 2^α) < Δ ≤ 2εn - 2^(α-1) - (2εn mod 2^(α-1))
If (2εn mod 2^α) < 2^(α-1) then (2εn mod 2^α) = (2εn mod 2^(α-1))
⇒ band_α contains 2^α - 2^(α-1) = 2^(α-1) distinct values of Δ.
If (2εn mod 2^α) ≥ 2^(α-1) then
(2εn mod 2^α) = 2^(α-1) + (2εn mod 2^(α-1))
⇒ band_α contains 2^(α-1) + 2^(α-1) = 2^α distinct values of Δ.
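
The arithmetic of this proof can be spot-checked numerically. The sketch below only measures the width of the interval in the band definition, with p standing for ⌊2εn⌋; the function name band_size is hypothetical.

```python
def band_size(alpha, p):
    """Number of integers delta with lower < delta <= upper in the band definition."""
    lower = p - 2 ** alpha - (p % 2 ** alpha)
    upper = p - 2 ** (alpha - 1) - (p % 2 ** (alpha - 1))
    return upper - lower


# Every band width is either 2^(alpha-1) or 2^alpha, as Proposition 2 states.
ok = all(band_size(alpha, p) in (2 ** (alpha - 1), 2 ** alpha)
         for p in range(0, 2000) for alpha in range(1, 12))
print(ok)
```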

20
A tree representation

For S = t_0, t_1, ..., t_(s-1), impose a tree structure T over the
tuples of S.
Assign a special root node R.
For every tuple t_i assign a node V_i.
The parent of every node V_i is the node V_j such that j is the
least index greater than i with band(t_j) > band(t_i). If no such j
exists, then set R to be the parent.
All children (and all descendants) of a given node V_i have Δ
values larger than Δ_i.
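
A minimal sketch of the parent rule, given only the band of each tuple. The function name parents and the sample band values are hypothetical, and None stands for the root R.

```python
def parents(bands):
    """For each tuple index i, return the index of its parent, or None for the root R."""
    result = []
    for i, b in enumerate(bands):
        # Parent: least j > i whose band is strictly greater than band(t_i).
        j = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(j)
    return result


# Hypothetical band values of tuples t_0..t_5.
print(parents([2, 1, 0, 1, 3, 0]))
```

In this sample, node 4 has children 0, 1 and 3 with bands 2, 1, 1, and node 3 has the single child 2.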

21
A tree representation

Proposition 4
for any node V, the set of all its descendants
in T form a contiguous segment in S
Proposition 3
the children of any node in T are always
arranged in non-increasing order of band in S

22
Operations

To compute an ε-approximate φ-quantile from S(n) after n
observations:
During the operations we wish to maintain the correct relationship
between g_i, Δ_i, Rmin and Rmax.
QUANTILE(φ): compute the rank r = ⌈φn⌉, find i such that
r - Rmin(v_i) ≤ εn and Rmax(v_i) - r ≤ εn, and return v_i.
INSERT(v): find the smallest i such that v_(i-1) ≤ v < v_i and
insert the tuple (v, 1, ⌊2εn⌋) between t_(i-1) and t_i. If v is the
new minimum or maximum seen, then insert (v, 1, 0).

23
Operations Cont

INSERT(v) maintains the correct relationship between g_i, Δ_i, Rmin
and Rmax.
If v is inserted before v_i, the value of Rmin(v) may be as small
as Rmin(v_(i-1)) + 1.
Similarly, Rmax(v) may be as large as the current Rmax(v_i), which
is at most ⌊2εn⌋ above Rmin(v_(i-1)).
Note that Rmin(v_i) and Rmax(v_i) are increased by 1 after the
insertion.

24
Operations Cont

DELETE(v_i): replace the tuples (v_i, g_i, Δ_i) and
(v_(i+1), g_(i+1), Δ_(i+1)) with the new tuple
(v_(i+1), g_i + g_(i+1), Δ_(i+1)).
Deleting v_i has no effect on Rmin(v_(i+1)) and Rmax(v_(i+1)), so
the operation should simply preserve them.
The relationship between Rmin(v_(i+1)) and Rmax(v_(i+1)) is
preserved as long as Δ_(i+1) is unchanged.
Since Rmin(v_(i+1)) = Σ_(j≤i+1) g_j and we deleted g_i, we must
increase g_(i+1) by g_i to keep Rmin(v_(i+1)) unchanged.

25
COMPRESS

The operation COMPRESS tries to merge a node together with all its
descendants into either its parent node or its right sibling (by
deleting them).
During COMPRESS we must ensure that the tuple that results from the
merge is not full.
Two adjacent tuples t_i, t_(i+1) are mergeable if the resulting
tuple is not full and band(t_i, n) ≤ band(t_(i+1), n).
Note that a pair of tuples that is not mergeable at some point in
time may become so at a later point, as the term ⌊2εn⌋ increases
over time.
Let g_i* denote the sum of the g-values of tuple t_i and all its
descendants in T.

26
Operations Cont

COMPRESS()
  for i from s-2 to 0 do
    if (BAND(Δ_i, 2εn) ≤ BAND(Δ_(i+1), 2εn)) and
       (g_i* + g_(i+1) + Δ_(i+1) < 2εn) then
      DELETE all descendants of t_i and the tuple t_i itself
    end if
  end for
COMPRESS inspects tuples from right (highest index) to left. It
first combines children (and their entire subtrees of descendants)
into their parents, and only when no more children can be merged
into the parent does it combine siblings.

27
Operations Cont

Initial State
S ← ∅; s = 0; n = 0.
Algorithm
To add the (n+1)st observation, v, to summary S(n):
  if (n ≡ 0 mod 1/(2ε)) then
    COMPRESS()
  end if
  INSERT(v)
  n = n + 1
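
The INSERT, DELETE and COMPRESS steps can be put together in a short Python sketch. It is a sketch under stated assumptions, not the authors' implementation. The class and method names are hypothetical, and tuples are stored in a plain list. The descendants of t_i are taken to be the run of tuples directly to its left with a strictly smaller band. COMPRESS stops at index 1 so that the minimum is never deleted, and the band loop runs to the bit length of p so that Δ = 0 is always covered.

```python
import bisect
import math


def band(delta, p):
    """Band index of delta for p = floor(2 * eps * n); band 0 holds only delta == p."""
    if delta == p:
        return 0
    for alpha in range(1, p.bit_length() + 1):
        lower = p - 2 ** alpha - (p % 2 ** alpha)
        upper = p - 2 ** (alpha - 1) - (p % 2 ** (alpha - 1))
        if lower < delta <= upper:
            return alpha
    raise ValueError("delta must not exceed p")


class QuantileSummary:
    """Sketch of the summary S(n); each tuple is a list [v, g, delta]."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []
        self.period = max(1, math.floor(1 / (2 * eps)))   # COMPRESS every 1/(2*eps) steps

    def add(self, v):
        if self.n % self.period == 0:
            self.compress()
        self.insert(v)
        self.n += 1

    def insert(self, v):
        i = bisect.bisect_right([t[0] for t in self.tuples], v)
        is_extreme = i == 0 or i == len(self.tuples)       # new minimum or maximum
        delta = 0 if is_extreme else math.floor(2 * self.eps * self.n)
        self.tuples.insert(i, [v, 1, delta])

    def compress(self):
        limit = 2 * self.eps * self.n
        p = math.floor(limit)
        i = len(self.tuples) - 2
        while i >= 1:                                      # index 0 (the minimum) is kept
            b = band(self.tuples[i][2], p)
            if b <= band(self.tuples[i + 1][2], p):
                # Descendants of t_i: the run of tuples to its left with a smaller band.
                j = i
                while j - 1 >= 1 and band(self.tuples[j - 1][2], p) < b:
                    j -= 1
                g_star = sum(t[1] for t in self.tuples[j:i + 1])
                right = self.tuples[i + 1]
                if g_star + right[1] + right[2] < limit:
                    right[1] += g_star                     # DELETE t_j..t_i into t_(i+1)
                    del self.tuples[j:i + 1]
                    i = j                                  # the merged tuple now sits at j
            i -= 1


summary = QuantileSummary(eps=1 / 8)
for value in range(1, 201):
    summary.add(value)
print(len(summary.tuples), "tuples kept for", summary.n, "observations")
```

The g-values always sum to n, because INSERT adds a tuple with g = 1 and each merge moves the deleted g-values into the surviving tuple.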

28
Analysis

The INSERT and COMPRESS operations always ensure that
g_i + Δ_i ≤ 2εn.
We will now see that the total number of tuples in the summary S(n)
is bounded by (11/(2ε)) log(2εn).
Def (coverage): We say that a tuple t_i in S(n) covers an
observation v at time n if either the tuple for v has been directly
merged into t_i, or a tuple t that covered v has been merged into
t_i.
A tuple always covers itself.
It is easy to see that the number of observations covered by t_i is
exactly g_i = g_i(n).

29
Analysis Cont

Lemma 1
At no point in time does a tuple from band α cover an observation
from a band > α.
Lemma 2
At any point in time n, and for any integer α, the total number of
observations covered cumulatively by all tuples with band value in
[0..α] is bounded by 2^α/ε.
Lemma 3
At any time n and for any given α, there are at most 3/(2ε) nodes
in T(n) that have a child with band value α. That is, there are at
most 3/(2ε) parents of nodes from band_α(n).

30
Analysis Cont

PROOF of Lemma 3
Let m_min and m_max denote the earliest and the latest time at
which a node from band_α could have been seen:
m_min = (2εn - 2^α - (2εn mod 2^α)) / (2ε)
m_max = (2εn - 2^(α-1) - (2εn mod 2^(α-1))) / (2ε)
Choose a parent-child pair (V_i, V_j), where the child V_j is in
band_α.
Since V_j exists we can show that
[equation not preserved in the transcript]
Since at time m_j (when V_j showed up) we had
g_i(m_j) + Δ_i ≤ 2ε·m_j

31
Analysis Cont

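Since m_j is at most m_max
[equation not preserved in the transcript]
Since the observations counted are distinct for all pairs
(V_i, V_j), and the number of observations that came after m_min is
n - m_min, the number of such parents is at most
(n - m_min) / (2ε(n - m_max)) ≤ 3/(2ε).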

32
Analysis Cont

Def: Given a full pair of tuples (t_(i-1), t_i), we say that
t_(i-1) is the left partner and t_i is the right partner in this
full pair.
Lemma 4
At any time n and for any given α, there are at most 4/ε tuples
from band_α(n) that are right partners in a full tuple pair.
PROOF
Let t_i, t_(i+1), ..., t_(i+p-1) be the longest contiguous segment
of tuples from band_α(n) in S(n).
Since they survived the COMPRESS operation, it must be the case that
g_(j-1)* + g_j + Δ_j ≥ 2εn for all i ≤ j < i+p

33
Analysis Cont

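Summing g_(j-1)* + g_j + Δ_j ≥ 2εn over all p values of j gives
Σ_j (g_(j-1)* + g_j) + Σ_j Δ_j ≥ 2pεn
According to Lemma 2 the first term is bounded by 2^(α+1)/ε.
The second term is bounded by p(2εn - 2^(α-1)).
Combining the two bounds gives p·2^(α-1) ≤ 2^(α+1)/ε, so p ≤ 4/ε.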

34
Analysis Cont

For non-contiguous segments, just apply the above summation over
all such segments.
Lemma 5
At any time n and for any given α, the maximum number of tuples
possible from each band_α(n) is 11/(2ε).
Proof
Each node of band_α(n) is either
1. a right partner in a full pair,
2. a left partner in a full pair, or
3. not a participant in any full pair.
The first case is bounded by 4/ε (Lemma 4).
The last two are bounded by 3/(2ε) (Lemma 3).
The claim follows, since 4/ε + 3/(2ε) = 11/(2ε).

35
Analysis Cont

Theorem 1
At any time n, the number of tuples stored in S(n) is at most
(11/(2ε)) log(2εn).
PROOF
There are at most 1 + ⌊log(2εn)⌋ bands at time n.
Summing over their sizes (Lemma 5) we get at most
(11/(2ε)) log(2εn).
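
To get a feel for the size of this bound, it can be evaluated for concrete values. The sketch below assumes the logarithm is base 2 (the slides do not state the base); the function name tuple_bound and the chosen values of ε and n are for illustration only.

```python
import math


def tuple_bound(eps, n):
    """Theorem 1 bound (11 / (2*eps)) * log(2*eps*n), taking log as base 2 (assumed)."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)


for n in (10**5, 10**6, 10**7):
    print(n, round(tuple_bound(0.001, n)))
```

The bound grows only logarithmically in n: multiplying n by 10 adds a fixed amount, (11/(2ε)) · log2(10), to it.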

36
Experiments results

The experiments were done on 3 different classes of input data:
1. Hard case.
- An adversarially generated data sequence: each new observation is
placed in the largest current gap of the quantile summary.
2. Sorted input data.
- The data arrives in sorted order.
3. Random input data.
- Each datum is selected (without replacement) from a uniform
distribution over all remaining elements in the data set.

37
Experiments results cont

Sorted and random input data are used following the MRL
experimental setup.
Random input data can give insight into the behavior of the
algorithm on average inputs.
In general, the algorithm used less space than indicated by the
analysis, and its space requirement turned out to be lower than
MRL's.

38
Experiments results cont

For each case we have 2 different kinds of experiment:
1. Adaptive: the regular algorithm (with a slight variation).
2. Pre-allocated: uses the same space as MRL.
We will see that in the latter case the observed error is
significantly better than that of MRL.
Differences in the algorithm used for the experiments:
1. An observation is inserted as a tuple (v, 1, g_i + Δ_i - 1) and
not (v, 1, ⌊2εn⌋). The latter is used strictly to simplify the
theoretical analysis.
2. Rather than running COMPRESS after every 1/(2ε) observations,
for each observation inserted one tuple was deleted when possible.
⇒ If no tuple could be deleted without making its successor full,
the size of S grew by 1.

39
Experiments results cont

We apply the following measurements:
1. The maximum space used to produce the summary, counted as the
number of stored tuples (multiplied by 3 for comparison with MRL,
to account for the Rmin and Rmax values stored with each tuple).
2. The observed precision of the results.

40
Experiments results cont

HARD INPUT

The number of stored quantiles required is approximately a factor
of 11 less than the worst-case bound of the analysis.
We almost always require less space than MRL. The only exception is
ε = 0.001 and N = 10^5, where MRL requires less space.

41
Experiments results cont

SORTED INPUT
Fix ε = 0.001 and construct summaries of sorted sequences of size
10^5, 10^6 and 10^7.
Sample 15 quantiles at rank (q/16)N for q = 1..15 and compute the
maximum error over all possible quantile queries.
Compare 3 algorithms:
1. MRL: preallocates the storage required by MRL as a function of N
and ε.
2. Pre-allocated: uses 1/3 as many stored quantiles as MRL.
3. Adaptive: storage is allocated for a new quantile only if no
quantile could be deleted without exceeding a precision of 0.001n.

42
Experiments results cont

S: the number of stored quantiles needed to achieve the desired
precision.
Max ε: the maximum error over all possible quantile queries of the
summary.
The remaining rows list the approximation error of the response to
the query for the (q/16)th quantile.

43
Experiments results cont

RANDOM INPUT
Same measurements as for the sorted input (same ε and sequence
lengths).
Run each experiment 50 times and report the max, min, mean and
standard deviation for every measurement.

44
Experiments results cont
45
Experiments results cont
46
Conclusions

Improves upon the earlier results in two significant ways:
1. It improves the space complexity by a factor of O(log(εN)).
2. It doesn't require a priori knowledge of the parameter N; that
is, it allocates more space dynamically as the data sequence grows
in size.
