
1
Space-Efficient Online Computation of Quantile
Summaries
  • Michael Greenwald and Sanjeev Khanna
  • University of Pennsylvania
  • Presented by Nir Levy

2
Introduction
  • The problem:
  • We are given a very large data set and wish to compute φ-quantiles in a single pass using space-efficient computation.
  • Def: The φ-quantile of an ordered sequence of N data items is the value with rank ⌈φN⌉ (the element in position ⌈φN⌉).
  • We are going to see an online algorithm for computing ε-approximate quantile summaries of a very large data sequence.
  • Def: An ε-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of εN.
  • Def: A quantile summary consists of a small number of points from the input data sequence, and uses those quantile estimates to give approximate responses to any arbitrary quantile query.

3
Introduction cont
  • EXAMPLE
  • Input data: 14, 2, 12, 5, 6, 19, 1, 14, 4, 9, 12, 3, 8, 11, 15, 4.
  • Ordered: 19, 15, 14, 14, 12, 12, 11, 9, 8, 6, 5, 4, 4, 3, 2, 1
  • Rank: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
  • What is the 2nd biggest number? (15)
  • What is the 25% quantile? (14)
  • Summary: 19, 14, 11, 6, 4, 1
  • Rank: 1, 4, 7, 10, 13, 16
  • What is the 2nd biggest number? 2nd ≈ 1st (19)
  • What is the 25% quantile? 16 · 0.25 = 4 → 4th (14)
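
The example can be replayed in a few lines of Python. This is only an illustrative sketch: the helper name `nearest_stored` and the nearest-rank lookup rule are assumptions made here, not something the slides specify.

```python
# Summary from the example: every third element of the descending order.
# Entries are (rank, value) pairs; ranks are 1-based positions.
data = [14, 2, 12, 5, 6, 19, 1, 14, 4, 9, 12, 3, 8, 11, 15, 4]
ordered = sorted(data, reverse=True)
summary = [(rank, ordered[rank - 1]) for rank in range(1, len(ordered) + 1, 3)]


def nearest_stored(rank):
    """Hypothetical lookup: return the stored value whose rank is closest."""
    return min(summary, key=lambda pair: abs(pair[0] - rank))[1]


second_biggest = nearest_stored(2)                      # approximated by rank 1
quarter = nearest_stored(round(0.25 * len(ordered)))    # rank 16 * 0.25 = 4
```

With six stored points out of sixteen, every query rank is at most one position away from a stored rank.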

4
Quantile estimation for Database Applications
  • Estimate the size of intermediate results, to
    allow query optimizers to estimate the cost of
    competing plans to resolve database queries.
  • Partition data into roughly equal partitions for parallel databases.
  • Prevent expensive and incorrect queries from being issued, by estimating result sizes and giving feedback to the users.
  • Characterize the distribution of real world data
    sets for database users.

5
Properties
  • Properties for quantile estimators:
  • 1. Provide tunable and explicit guarantees on the precision of the approximation. That is, for any given rank r, an ε-approximate quantile summary returns a value whose rank r' is guaranteed to be within the interval [r − εN, r + εN].
  • 2. Be data independent. That is, it should be affected neither by the arrival order nor by the distribution of the values, nor should it require a priori knowledge of the size of the dataset.
  • 3. Execute in a single pass over the data.
  • 4. Have as small a memory footprint as possible (this applies to temporary storage during the computation).

6
Previous Work
  • Manku, Rajagopalan and Lindsay (MRL) presented a single-pass algorithm for an ε-approximate quantile summary that requires O((1/ε) log²(εN)) space but needs advance knowledge of N (otherwise it provides only a probabilistic guarantee on the precision).
  • Gibbons, Matias and Poosala presented a multiple-pass algorithm with a probabilistic guarantee.
  • Munro and Paterson showed that any algorithm that exactly computes φ-quantiles in only p passes requires space Ω(N^(1/p)).

7
This algorithm
  • Presents a worst-case space requirement of O((1/ε) log(εN)), thus improving upon the previous best result of O((1/ε) log²(εN)).
  • In contrast to earlier algorithms, the algorithm doesn't require a priori knowledge of the length of the input sequence.
  • Based on a novel data structure that effectively maintains the range of possible ranks for each quantile that it stores.
  • The behavior is based on the fact that no input sequence can be bad across the entire distribution; that is, the input sequence cannot keep presenting new observations that must be stored without allowing old stored observations to be deleted.

8
The Data Structure
  • Assume w.l.o.g. that a new observation arrives after each unit of time.
  • Denote by n the number of observations seen so far, as well as the current time.
  • Denote by ε the given precision requirement.
  • Denote by S = S(n) the summary data structure at time n.
  • S(n) consists of an ordered sequence of elements corresponding to a subset of the observations seen thus far.
  • For each observation v in S, maintain implicit bounds on the minimum and the maximum possible rank of v among the first n observations (denoted Rmin(v) and Rmax(v)).

9
Data structure cont
  • More formally:
  • Let S(n) be the set of tuples t_0, t_1, …, t_{s−1}, where t_i = (V_i, g_i, Δ_i)
  • V_i is one of the elements of the data stream
  • g_i = Rmin(V_i) − Rmin(V_{i−1})
  • Δ_i = Rmax(V_i) − Rmin(V_i)
  • Σ_{j≤i} g_j = [Rmin(V_i) − Rmin(V_{i−1})] + [Rmin(V_{i−1}) − Rmin(V_{i−2})] + … + [Rmin(V_1) − Rmin(V_0)] + g_0 = Rmin(V_i), taking g_0 = Rmin(V_0)
  • (Σ_{j≤i} g_j) + Δ_i = Rmin(V_i) + [Rmax(V_i) − Rmin(V_i)] = Rmax(V_i)

10
Data structure cont
  • At all times ensure that V_0 and V_{s−1} correspond to the minimum and maximum elements seen so far.
  • g_i + Δ_i − 1 is an upper bound on the total number of observations that may have fallen between V_{i−1} and V_i.
  • Σ_i g_i = n, the number of observations seen so far.
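
To make the (V_i, g_i, Δ_i) encoding concrete, the sketch below recovers Rmin and Rmax from the prefix sums of g. The six tuples are a hypothetical summary chosen here for the 16-item example sequence in ascending order; they are not taken from the slides.

```python
from itertools import accumulate

# Hypothetical summary of the 16-item example sequence, ascending order.
# Each tuple is (value, g, delta).
summary = [(1, 1, 0), (4, 3, 1), (6, 3, 0), (11, 3, 0), (14, 3, 1), (19, 3, 0)]


def rank_bounds(tuples):
    """Return [(rmin, rmax)]: rmin is the sum of g_j for j <= i, rmax = rmin + delta_i."""
    rmins = accumulate(g for _, g, _ in tuples)
    return [(rmin, rmin + delta) for rmin, (_, _, delta) in zip(rmins, tuples)]


bounds = rank_bounds(summary)
n = sum(g for _, g, _ in summary)    # the g values add up to n
```

Only g and Δ are stored; the rank bounds are implicit and follow from one pass over the tuples.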

11
Answering Quantile Queries
  • Proposition 1
  • Given a quantile summary S in the above form, a φ-quantile can always be identified to within an error of max_i(g_i + Δ_i)/2.
  • Proof.
  • Let r = ⌈φn⌉ and let e = max_i(g_i + Δ_i)/2.
  • Search for an index i such that r − e ≤ Rmin(V_i) and Rmax(V_i) ≤ r + e.
  • (Figure: the interval [Rmin(V_i), Rmax(V_i)] around ⌈φn⌉ on the rank axis from V_0 to V_{s−1}, with width at most max_i(g_i + Δ_i).)
  • ⇒ V_i approximates the φ-quantile within the claimed error bound.

12
Answering Quantile Queries cont
  • All that is left to see is that such an index i must always exist.
  • Consider the case r > n − e: we have Rmin(V_{s−1}) = Rmax(V_{s−1}) = n, and therefore i = s−1 is valid.
  • Otherwise r ≤ n − e: choose the smallest j such that Rmax(V_j) > r + e. It follows that Rmin(V_{j−1}) ≥ r − e,
  • since for Rmin(V_{j−1}) < r − e we get Rmax(V_j) = Rmin(V_{j−1}) + g_j + Δ_j > Rmin(V_{j−1}) + 2e,
  • ⇒ a contradiction to the assumption that e = max_i(g_i + Δ_i)/2.
  • (Figures: the rank axis from V_0 to V_{s−1}, marking r and n − e in the first case, and r − e, r, r + e, Rmin(V_{j−1}) and Rmax(V_j) in the second.)

13
Answering Quantile Queries cont
  • By the choice of j, Rmax(V_{j−1}) ≤ r + e; therefore j−1 is an example of an index i with the desired property.
  • Corollary 1
  • If at any time n the summary S(n) satisfies the property that max_i(g_i + Δ_i) ≤ 2εn, then we can answer any φ-quantile query to within an εn precision.
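
The search in Proposition 1 can be sketched as a linear scan over the tuples. The function name `quantile` and the sample summary (a hypothetical summary of the 16-item example sequence, ascending) are assumptions made for illustration.

```python
import math


def quantile(tuples, phi):
    """Return a stored value whose rank interval lies within e of r = ceil(phi * n)."""
    n = sum(g for _, g, _ in tuples)
    e = max(g + delta for _, g, delta in tuples) / 2
    r = math.ceil(phi * n)
    rmin = 0
    for value, g, delta in tuples:
        rmin += g                      # Rmin(V_i) is the prefix sum of g
        rmax = rmin + delta            # Rmax(V_i) = Rmin(V_i) + delta_i
        if r - e <= rmin and rmax <= r + e:
            return value
    raise ValueError("no index found; the summary violates its invariants")


# Tuples are (value, g, delta).
summary = [(1, 1, 0), (4, 3, 1), (6, 3, 0), (11, 3, 0), (14, 3, 1), (19, 3, 0)]
```

Here max(g + Δ) = 4, so e = 2: the query for φ = 0.5 (rank 8) returns 6, whose true rank 7 is within 2 of the target.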

14
Data structure cont
  • At a high level:
  • On a new observation, insert into the summary a tuple corresponding to this observation.
  • Periodically, perform a sweep over the summary to merge some of the tuples into their neighbors so as to free space.
  • Maintain several conditions in order to bound the space used by S at any time.
  • By Corollary 1 it suffices to ensure that at all times max_i(g_i + Δ_i) ≤ 2εn.
  • Def: An individual tuple is full if g_i + Δ_i = ⌊2εn⌋.
  • Def: The capacity of an individual tuple is the maximum number of observations that can be counted by g_i before the tuple becomes full.

15
BANDS
  • General strategy: delete tuples with small capacities and preserve tuples with large capacities.
  • In the merge phase, free up space by merging tuples with small capacities into tuples with similar or larger capacities.
  • We say two tuples t_i and t_j have similar capacities if
  • log capacity(t_i) ≈ log capacity(t_j)
  • This notion of similarity partitions the possible values of Δ into bands.
  • We try to divide the Δs into bands that lie between elements of
  • 0, ½(2εn), ¾(2εn), …, ((2^i − 1)/2^i)(2εn), …, 2εn − 1, 2εn
  • These boundaries correspond to capacities of 2εn, εn, ½εn, …, (1/2^i)εn, …, 8, 4, 2, 1

16
BANDS cont
  • Define band_α to be the set of all Δ such that
  • p − 2^α − (p mod 2^α) < Δ ≤ p − 2^(α−1) − (p mod 2^(α−1))
  • where p = ⌊2εn⌋ and α = 1, …, ⌈log(2εn)⌉
  • The above definition ensures that if two Δs are ever in the same band, they never appear in different bands as n increases.
  • Define band_0 simply to be p.
  • Consider the first 1/(2ε) observations, with Δ = 0, to be in a band of their own.

17
BANDS cont
  • Example
  • Consider ε = 1/8. The observations fall into groups a..g according to their Δ value:
  • a: N = 1..4, Δ = 0; b: N = 5..8, Δ = 1; c: N = 9..12, Δ = 2; d: N = 13..16, Δ = 3; e: N = 17..20, Δ = 4; f: N = 21..24, Δ = 5; g: N = 25..28, Δ = 6
  • Band membership of each group as N grows:

  N       25..28    21..24   17..20   13..16   9..12   5..8
  Band0   g         f        e        d        c       b
  Band1   f         d,e      d        b,c      b
  Band2   b,c,d,e   b,c      b,c
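
The band definition can be checked against the example table with a short sketch. The function names `band` and `band_table` are hypothetical, and the code only handles 1 ≤ Δ ≤ p; the Δ = 0 group, which the slides place in a band of its own, is left out.

```python
def band(delta, p):
    """Band index of delta when p = floor(2 * eps * n); valid for 1 <= delta <= p."""
    if delta == p:
        return 0                                    # band_0 holds only delta = p
    alpha = 1
    while True:
        lower = p - 2 ** alpha - (p % 2 ** alpha)
        upper = p - 2 ** (alpha - 1) - (p % 2 ** (alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1


def band_table(p, labels="abcdefg"):
    """Group the labels of delta = 1..p by band, mirroring one column of the table."""
    table = {}
    for delta in range(1, p + 1):
        table.setdefault(band(delta, p), []).append(labels[delta])
    return table
```

For instance, `band_table(6)` corresponds to the column N = 25..28 and `band_table(3)` to the column N = 13..16.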
18
BANDS cont
19
BANDS cont
  • Proposition 2
  • At any point in time n and for any α ≥ 1, band_α(n) contains either 2^α or 2^(α−1) distinct values of Δ.
  • PROOF
  • According to the upper and lower bounds of band_α, with p = ⌊2εn⌋:
  • p − 2^α − (p mod 2^α) < Δ ≤ p − 2^(α−1) − (p mod 2^(α−1))
  • If (p mod 2^α) < 2^(α−1), then (p mod 2^α) = (p mod 2^(α−1))
  • ⇒ |band_α| = 2^α − 2^(α−1) = 2^(α−1) distinct values of Δ
  • If (p mod 2^α) ≥ 2^(α−1), then (p mod 2^α) = 2^(α−1) + (p mod 2^(α−1))
  • ⇒ |band_α| = 2^(α−1) + 2^(α−1) = 2^α distinct values of Δ
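
Proposition 2 can also be checked by brute force. The sketch below counts the integers in each band interval directly from the definition (without clipping at Δ = 0, which is a simplification made here); `band_size` is a hypothetical helper name.

```python
def band_size(alpha, p):
    """Number of integers delta with lower < delta <= upper for band alpha."""
    lower = p - 2 ** alpha - (p % 2 ** alpha)
    upper = p - 2 ** (alpha - 1) - (p % 2 ** (alpha - 1))
    return upper - lower


# Every band should hold either 2^(alpha-1) or 2^alpha values.
sizes_ok = all(
    band_size(alpha, p) in (2 ** (alpha - 1), 2 ** alpha)
    for p in range(1, 1025)
    for alpha in range(1, 11)
)
```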

20
A tree representation
  • For S = t_0, t_1, …, t_{s−1}, impose a tree structure T over the tuples of S.
  • Assign a special root node R.
  • For every tuple t_i assign a node V_i.
  • The parent of every node V_i is the node V_j such that j is the least index greater than i with band(t_j) > band(t_i). If no such j exists, then set R to be the parent.
  • All children (and all descendants) of a given node V_i have Δ values larger than Δ_i.

21
A tree representation
  • Proposition 4
  • for any node V, the set of all its descendants
    in T form a contiguous segment in S
  • Proposition 3
  • the children of any node in T are always
    arranged in non-increasing order of band in S
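
The parent rule can be sketched directly from its definition. The band values below are made up for illustration, and `None` stands in for the special root R.

```python
def parents(bands):
    """parent[i] = least j > i with bands[j] > bands[i], or None for the root R."""
    result = []
    for i, b in enumerate(bands):
        later = (j for j in range(i + 1, len(bands)) if bands[j] > b)
        result.append(next(later, None))
    return result


bands = [0, 1, 0, 0, 2, 1, 3]          # hypothetical band values of t_0 .. t_6
parent = parents(bands)
children_of_4 = [i for i, p in enumerate(parent) if p == 4]
```

In this example the children of node 4 are nodes 1, 2, 3 with bands 1, 0, 0 (non-increasing), and its descendants 0..3 form a contiguous segment, in line with the two propositions.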

22
Operations
  • To compute an ε-approximate φ-quantile from S(n) after n observations:
  • During the operations we wish to maintain the correct relationship between g_i, Δ_i, Rmin and Rmax.
  • QUANTILE(φ): compute the rank r = ⌈φn⌉; find i such that r − Rmin(V_i) ≤ εn and Rmax(V_i) − r ≤ εn; return V_i.
  • INSERT(V): find the smallest i such that V_{i−1} ≤ V < V_i and insert the tuple (V, 1, ⌊2εn⌋) between t_{i−1} and t_i. If V is the new minimum or maximum seen, then insert (V, 1, 0).

23
Operations Cont
  • INSERT(V) maintains the correct relationship between g_i, Δ_i, Rmin and Rmax.
  • If V is inserted before V_i, the value of Rmin(V) may be as small as Rmin(V_{i−1}) + 1.
  • Similarly, Rmax(V) may be as large as the current Rmax(V_i), which exceeds Rmin(V_{i−1}) by at most ⌊2εn⌋.
  • Note that Rmin(V_i) and Rmax(V_i) get increased by 1 after insertion.
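
A minimal sketch of INSERT, assuming the summary is kept as a Python list of (value, g, delta) tuples sorted by value; the function signature is an illustrative choice, with n passed in as the number of observations seen before this one.

```python
import math


def insert(summary, v, n, eps):
    """Insert v into a value-sorted list of (value, g, delta) tuples."""
    i = 0
    while i < len(summary) and summary[i][0] <= v:
        i += 1                                    # smallest i with v < v_i
    is_extreme = i == 0 or i == len(summary)      # new minimum or maximum
    delta = 0 if is_extreme else math.floor(2 * eps * n)
    summary.insert(i, (v, 1, delta))


summary = []
insert(summary, 5, 0, 0.25)    # first observation: minimum and maximum at once
insert(summary, 9, 1, 0.25)    # new maximum, so delta = 0
insert(summary, 7, 2, 0.25)    # interior value, delta = floor(2 * 0.25 * 2) = 1
```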

24
Operations Cont
  • DELETE(V_i): replace the tuples (V_i, g_i, Δ_i) and (V_{i+1}, g_{i+1}, Δ_{i+1}) with the new tuple (V_{i+1}, g_i + g_{i+1}, Δ_{i+1}).
  • Deleting V_i has no effect on Rmin(V_{i+1}) and Rmax(V_{i+1}), so DELETE should simply preserve them.
  • The relationship between Rmin(V_{i+1}) and Rmax(V_{i+1}) is preserved as long as Δ_{i+1} is unchanged.
  • Since Rmin(V_{i+1}) = Σ_{j≤i+1} g_j and we deleted g_i, we must increase g_{i+1} by g_i to keep Rmin(V_{i+1}).
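
The effect of DELETE on the rank bounds can be checked with a small sketch. The list-of-tuples representation and the helper `rank_bounds` are assumptions made for illustration.

```python
def delete(summary, i):
    """Remove tuple i, folding its g into tuple i + 1."""
    _, g_i, _ = summary[i]
    v_next, g_next, delta_next = summary[i + 1]
    summary[i:i + 2] = [(v_next, g_i + g_next, delta_next)]


def rank_bounds(summary):
    """Map each stored value to its (Rmin, Rmax) pair."""
    bounds, rmin = {}, 0
    for v, g, delta in summary:
        rmin += g
        bounds[v] = (rmin, rmin + delta)
    return bounds


summary = [(1, 1, 0), (4, 3, 1), (6, 3, 0), (11, 3, 0)]
before = rank_bounds(summary)
delete(summary, 1)                 # drop the tuple for value 4
after = rank_bounds(summary)
```

The surviving values keep exactly the bounds they had before the deletion.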

25
COMPRESS
  • The operation COMPRESS tries to merge a node and all its descendants into either its parent node or its right sibling (by deleting them).
  • During COMPRESS we must ensure that the tuple resulting from the merge is not full.
  • Two adjacent tuples t_i, t_{i+1} are mergeable if the resulting tuple is not full and band(t_i, n) ≤ band(t_{i+1}, n).
  • Note that a pair of tuples that is not mergeable at some point in time may become so at a later point, as the term ⌊2εn⌋ increases over time.
  • Let g_i* denote the sum of the g-values of tuple t_i and all its descendants in T.

26
Operations Cont
  • COMPRESS()
  •   for i from s−2 to 0 do
  •     if (BAND(Δ_i, 2εn) ≤ BAND(Δ_{i+1}, 2εn)) and (g_i* + g_{i+1} + Δ_{i+1} < 2εn) then
  •       DELETE all descendants of t_i and the tuple t_i itself
  •     end if
  •   end for
  • COMPRESS inspects tuples from right (highest index) to left. It first combines children (and their entire subtrees of descendants) into their parents, and combines siblings only when no more children can be combined into the parent.

27
Operations Cont
  • Initial state
  • S ← ∅; s = 0; n = 0.
  • Algorithm
  • To add the (n+1)st observation, v, to summary S(n):
  •   if (n ≡ 0 mod 1/(2ε)) then
  •     COMPRESS()
  •   end if
  •   INSERT(v)
  •   n = n + 1
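
Putting INSERT, COMPRESS and the main loop together gives roughly the following sketch. Several choices here are assumptions rather than something the slides state: the class name `QuantileSummary`, never deleting t_0 so that the minimum survives, and finding the descendants of t_i as the run of lower-band tuples directly to its left. Because a tuple inserted as (v, 1, ⌊2εn⌋) has g + Δ one above ⌊2εn⌋, the query uses the error bound e = max(g + Δ)/2 from Proposition 1 instead of εn.

```python
import math


def band(delta, p):
    """Band of delta for p = floor(2*eps*n); bands are contiguous, so the first hit wins."""
    if delta == p:
        return 0
    alpha = 1
    while delta <= p - 2 ** alpha - (p % 2 ** alpha):
        alpha += 1
    return alpha


class QuantileSummary:
    def __init__(self, eps):
        self.eps, self.n, self.tuples = eps, 0, []    # tuples: [value, g, delta]

    def add(self, v):
        if self.n % math.ceil(1 / (2 * self.eps)) == 0:
            self.compress()
        i = 0
        while i < len(self.tuples) and self.tuples[i][0] <= v:
            i += 1                                    # smallest i with v < v_i
        extreme = i == 0 or i == len(self.tuples)     # new minimum or maximum
        delta = 0 if extreme else math.floor(2 * self.eps * self.n)
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

    def compress(self):
        t, p = self.tuples, math.floor(2 * self.eps * self.n)
        i = len(t) - 2
        while i >= 1:                     # assumption: t[0] (the minimum) is never deleted
            b = band(t[i][2], p)
            if b <= band(t[i + 1][2], p):
                j, g_star = i, t[i][1]
                while j > 1 and band(t[j - 1][2], p) < b:
                    j -= 1                # descendants sit directly to the left of t_i
                    g_star += t[j][1]
                if g_star + t[i + 1][1] + t[i + 1][2] < 2 * self.eps * self.n:
                    t[i + 1][1] += g_star # fold t_j .. t_i into t_{i+1}
                    del t[j:i + 1]
                    i = j
            i -= 1

    def query(self, phi):
        r = max(1, math.ceil(phi * self.n))
        e = max(g + d for _, g, d in self.tuples) / 2
        rmin = 0
        for v, g, d in self.tuples:
            rmin += g
            if r - e <= rmin and rmin + d <= r + e:
                return v
```

A typical use is to create `QuantileSummary(eps)`, call `add(v)` once per observation, and call `query(phi)` at any point; the linear scans keep the sketch short and are not meant to be efficient.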

28
Analysis
  • The insert and compress operations always ensure that g_i + Δ_i ≤ 2εn.
  • We will now see that the total number of tuples in the summary S(n) is bounded by (11/(2ε)) log(2εn).
  • Def (coverage): we say that a tuple t_i in S(n) covers an observation v at any time n if either the tuple for v has been directly merged into t_i, or a tuple t that covered v has been merged into t_i.
  • A tuple always covers itself.
  • It is easy to see that the number of observations covered by t_i is exactly given by g_i = g_i(n).

29
Analysis Cont
  • Lemma 1
  • At no point in time does a tuple from band α cover an observation from a band > α.
  • Lemma 2
  • At any point in time n, and for any integer α, the total number of observations covered cumulatively by all tuples with band value in [0..α] is bounded by 2^α/ε.
  • Lemma 3
  • At any time n and for any given α, there are at most 3/(2ε) nodes in T(n) that have a child with band value α. That is, there are at most 3/(2ε) parents of nodes from band_α(n).

30
Analysis Cont
  • Proof of Lemma 3
  • Let m_min and m_max denote the earliest and the latest times at which a node from band_α could be seen:
  • m_min = (2εn − 2^α − (2εn mod 2^α))/(2ε)
  • m_max = (2εn − 2^(α−1) − (2εn mod 2^(α−1)))/(2ε)
  • Choose a child-parent pair (V_i, V_j), where V_j is in band_α.
  • Since V_j exists, we can show that (inequality not captured in the transcript)
  • Since at time m_j (when V_j showed up) we had
  • g_i(m_j) + Δ_i < 2εm_j

31
Analysis Cont
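  • Since m_j is at most m_max (inequality not captured in the transcript)
  • Since for all pairs (V_i, V_j) we have distinct observations,
  • and the number of observations that came after m_min is n − m_min,
  • we get that the number of such pairs is at most (n − m_min)/(2ε(n − m_max)) ≤ 3/(2ε).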

32
Analysis Cont
  • Def: Given a full pair of tuples (t_{i−1}, t_i), we say that t_{i−1} is the left partner and t_i is the right partner in this full pair.
  • Lemma 4
  • At any time n and for any given α, there are at most 4/ε tuples from band_α(n) that are right partners in a full tuple pair.
  • PROOF
  • Let t_i, t_{i+1}, …, t_{i+p−1} be the longest contiguous segment of tuples from band_α(n) in S(n).
  • Since they existed after the compress operation, it must be the case that
  • g*_{j−1} + g_j + Δ_j > 2εn for all i < j < i+p

33
Analysis Cont
  • Summing over all j (summation not captured in the transcript):
  • According to Lemma 2, the first term is bounded by 2^(α+1)/ε.
  • The second term is bounded by p(2εn − 2^(α−1)).
  • Summing the two bounds we get p ≤ 4/ε.

34
Analysis Cont
  • For non-contiguous segments, just consider the above summations over all such segments.
  • Lemma 5
  • At any time n and for any given α, the maximum number of tuples possible from each band_α(n) is 11/(2ε).
  • Proof
  • Each node of band_α(n) is one of:
  • 1. a right partner in a full pair
  • 2. a left partner in a full pair
  • 3. not a participant in any full pair
  • The first case is bounded by 4/ε (Lemma 4).
  • The last two are bounded by 3/(2ε).
  • Since 4/ε + 3/(2ε) = 11/(2ε), the claim follows.

35
Analysis Cont
  • Theorem 1
  • At any time n, the number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).
  • PROOF
  • There are at most 1 + ⌊log(2εn)⌋ bands at time n.
  • Summing over their sizes we get at most (11/(2ε)) log(2εn).
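
To get a feel for the size of this bound, it can be evaluated numerically. The sketch assumes the logarithm is base 2 (which matches the power-of-two bands but is not stated on the slide); `tuple_bound` is a hypothetical helper name.

```python
import math


def tuple_bound(eps, n):
    """Worst-case tuple count (11 / (2*eps)) * log2(2*eps*n) from Theorem 1."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)


example = tuple_bound(0.125, 32)     # 44 * log2(8) = 132
```

The bound grows only logarithmically in n for a fixed ε, so ten times more data adds a constant number of tuples per band rather than ten times the storage.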

36
Experiments results
  • The experiments were done on 3 different classes
    of input data
  • 1. Hard case.
  • - an adversarially generated data sequence: each new observation is placed in the largest current gap of the quantile summary.
  • 2. Sorted input data.
  • - the data arrives in sorted order.
  • 3. Random input data.
  • - each datum is chosen by selecting an element (without replacement) from a uniform distribution over all remaining elements in the data set.

37
Experiments results cont
  • Sorted and random input data are used to follow the MRL experimental setup.
  • Random input data can give insight into the behavior of the algorithm on average inputs.
  • In general, the algorithm used less space than indicated by the analysis, and turned out to need less space than MRL's requirement.

38
Experiments results cont
  • For each case we have 2 different kinds of experiment:
  • 1. Adaptive: the regular algorithm (with a slight variation).
  • 2. Pre-allocated: uses the same space as used in MRL.
  • We will see that in the latter case the observed error is significantly better than that of MRL.
  • Differences in the algorithm used for the experiments:
  • 1. An observation is inserted as a tuple (v, 1, g_i + Δ_i − 1) and not (v, 1, ⌊2εn⌋).
  • The latter is used strictly to simplify the theoretical analysis.
  • 2. Rather than running COMPRESS after every 1/(2ε) observations,
  • for each observation inserted, one tuple was deleted when possible.
  • ⇒ If no tuple could be deleted without making its successor full, the size of S grew by 1.

39
Experiments results cont
  • We apply the following measurements:
  • 1. The maximum space used to produce the summary, counting the number of stored tuples (multiplied by 3 for comparison with MRL, to account for the Rmin and Rmax values stored in each tuple).
  • 2. The observed precision of the results.

40
Experiments results cont
  • HARD INPUT
  • The required number of stored quantiles is approximately a factor of 11 less than the worst-case bound of the analysis.
  • We almost always require less space than MRL.
  • The only exception is ε = 0.001 and N = 10^5, where MRL requires less space.

41
Experiments results cont
  • SORTED INPUT
  • Fix ε = 0.001 and construct summaries of sorted sequences of size 10^5, 10^6 and 10^7.
  • Sample 15 quantiles at (q_i/16)N for q_i = 1..15 and compute the maximum error over all possible quantile queries.
  • Compare 3 algorithms:
  • 1. MRL: preallocates the storage required by MRL as a function of N and ε.
  • 2. Pre-allocated: uses 1/3 as many stored quantiles as MRL.
  • 3. Adaptive: storage is allocated for a new quantile only if no quantile could be deleted without exceeding a precision of 0.001n.

42
Experiments results cont
  • |S|: the number of stored quantiles needed to achieve the desired precision.
  • Max ε: the maximum error over all possible quantile queries of the summary.
  • The remaining rows list the approximation error of the response to the query for the (q_i/16)th quantile.

43
Experiments results cont
  • RANDOM INPUT
  • Same measurements as for the sorted input (ε and sequence length).
  • Run each experiment 50 times and report the max,
    min, mean and std for every measurement.

44
Experiments results cont
45
Experiments results cont
46
Conclusions
  • Improves upon the earlier results in two significant ways:
  • 1. It improves the space complexity by a factor of O(log(εN)).
  • 2. It doesn't require a priori knowledge of the parameter N; that is, it allocates more space dynamically as the data sequence grows in size.