Space-Efficient Online Computation of Quantile Summaries - PowerPoint PPT Presentation

Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery – PowerPoint PPT presentation

Provided by: emoryEdu

Learn more at: http://www.mathcs.emory.edu

1
Space-Efficient Online Computation of Quantile Summaries

SIGMOD 01
Michael Greenwald & Sanjeev Khanna
Presented by ellery

2
Outline

Introduction
The summary data structure
Operation and algorithm
Tree representation
Analysis and experimental result
Conclusion

3
Introduction

Space-efficient computation of quantile summaries
of very large data sets in a single pass.
Quantile queries: given a quantile φ, return the value whose rank is ⌈φN⌉.

4
Time   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
Value  12  10  11  10  1   10  11  9   6   7   8   11  4   5   2   3

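
As a baseline for the quantile query defined earlier, the following minimal sketch answers a query exactly by sorting the 16 observations above. It stores every observation, which is precisely what a quantile summary tries to avoid. The function name `exact_quantile` and the small floating-point guard are assumptions made for illustration.

```python
import math

def exact_quantile(values, phi):
    """Return the value whose 1-based rank is ceil(phi * N), using a full sort."""
    ordered = sorted(values)
    # The tiny subtraction guards against phi * N landing just above an integer.
    rank = max(1, math.ceil(phi * len(ordered) - 1e-9))
    return ordered[rank - 1]

# The observations t0..t15 from the slide above.
stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
median = exact_quantile(stream, 0.5)   # rank 8 of 16
print(median)
```
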
5
Requirements

Explicit, tunable a priori guarantees on the precision of the approximation.
As small a memory footprint as possible.
Online: single pass over the data.
Data-independent performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations.
Data-independent setup: no a priori knowledge required about the data set (size, range, distribution, order).

6
ε-approximate

A quantile summary for a data sequence is ε-approximate if, for any given rank r, it returns a value whose rank r' is guaranteed to be within the interval [r - εN, r + εN].
Example: a data stream with 100 elements; a 0.5-quantile query with ε = 0.1 returns a value v.
The true rank of v is within [40, 60].
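
The rank window in the example can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names `rank_window` and `answer_is_acceptable` are made up here, and ranks are assumed to be 1-based.

```python
import bisect
import math

def rank_window(phi, n, eps):
    """Ranks that an eps-approximate answer to a phi-quantile query may have."""
    r = math.ceil(phi * n - 1e-9)      # target rank, guarded against float noise
    return r - eps * n, r + eps * n

def answer_is_acceptable(sorted_data, value, phi, eps):
    """True if `value` occurs in the data with some rank inside the window."""
    lo, hi = rank_window(phi, len(sorted_data), eps)
    first = bisect.bisect_left(sorted_data, value) + 1   # lowest rank of value
    last = bisect.bisect_right(sorted_data, value)       # highest rank of value
    return first <= last and first <= hi and last >= lo

data = list(range(1, 101))             # 100 elements; the value k has rank k
print(rank_window(0.5, 100, 0.1))
print(answer_is_acceptable(data, 45, 0.5, 0.1))
```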

7
The Summary Data Structure

Let rmin(v) and rmax(v) denote the lower and upper bounds on the rank of v.
Each tuple t_i = (v_i, g_i, Δ_i), where g_i = rmin(v_i) - rmin(v_{i-1}) and Δ_i = rmax(v_i) - rmin(v_i).

8
Example
ε = .01, N = 1750

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
201    (28, 7)       [529, 536]
204    (10, 1)       [539, 540]

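
The rank bounds in this example follow from the tuples if g_i is read as rmin(v_i) - rmin(v_{i-1}) and Δ_i as rmax(v_i) - rmin(v_i), as in the published algorithm. The sketch below recomputes them under that reading. Since the slide shows only a slice of the summary, the offset `rmin_before=486` (the sum of g over all earlier tuples) is an assumption chosen so that the first bound comes out as 501.

```python
def rank_bounds(tuples, rmin_before=0):
    """Turn (v, g, delta) tuples into (v, rmin, rmax) triples.

    rmin_before is the sum of g over the tuples that precede this slice.
    """
    bounds = []
    rmin = rmin_before
    for v, g, delta in tuples:
        rmin += g                       # rmin(v_i) = sum of g_j for j <= i
        bounds.append((v, rmin, rmin + delta))
    return bounds

slice_of_summary = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
for row in rank_bounds(slice_of_summary, rmin_before=486):
    print(row)
```
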
9
Query

Sketch S is ε-approximate. That is, for each φ in (0, 1], there is a (v_i, rmin(v_i), rmax(v_i)) in S such that
⌈φN⌉ - εN ≤ rmin(v_i) and rmax(v_i) ≤ ⌈φN⌉ + εN.
v_i is our answer for the φ-quantile.

10
Corollary

If at any time n, the summary S(n) satisfies the property that
max_i (g_i + Δ_i) ≤ 2εn,
then we can answer any φ-quantile query to within an εn precision.

11
Overview of Summary Data Structure
φ = .29
r = φN = 522
ε = .01, N = 1800

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
201    (28, 7)       [529, 536]
204    (10, 1)       [539, 540]

Quantile φ = .29? Compute r and choose the best v_i.
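
A query over the three tuples shown here can be sketched as follows. The selection rule (return the first tuple whose rank bounds both lie within εN of r) is one reasonable reading of "choose best v_i", not necessarily the presenter's exact rule, and the function name `query` is hypothetical.

```python
import math

def query(bounds, phi, n, eps):
    """bounds: list of (v, rmin, rmax). Return a v whose rank is within eps*n of r."""
    r = math.ceil(phi * n - 1e-9)       # target rank, guarded against float noise
    slack = eps * n
    for v, rmin, rmax in bounds:
        if r - rmin <= slack and rmax - r <= slack:
            return v
    return None                          # no tuple in this slice qualifies

bounds = [(192, 501, 503), (201, 529, 536), (204, 539, 540)]
answer = query(bounds, phi=0.29, n=1800, eps=0.01)
print(answer)
```

With r = 522 and εN = 18, the tuple for 192 is rejected because its rmin of 501 lies more than 18 below r, while the tuple for 201 qualifies.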

12
Overview of Summary Data Structure
ε = .01, N = 1800, 2εN = 36

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
201    (28, 7)       [529, 536]
204    (10, 1)       [539, 540]

If (rmax(v_{i+1}) - rmin(v_i)) ≤ 2εN for every i, then the summary is ε-approximate.
Our goal: always maintain this property.
Tuple formulation of this rule: g_i + Δ_i ≤ 2εN.
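
The tuple form of the rule is easy to check mechanically. The following sketch does so for the three tuples on this slide; the function names are illustrative only.

```python
def max_gap(tuples):
    """Largest g + delta over all (v, g, delta) tuples."""
    return max(g + delta for _, g, delta in tuples)

def satisfies_invariant(tuples, eps, n):
    """True if every tuple obeys g + delta <= 2 * eps * n."""
    return max_gap(tuples) <= 2 * eps * n

summary = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
print(max_gap(summary), satisfies_invariant(summary, eps=0.01, n=1800))
```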

13
Overview of Summary Data Structure
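ε = .01, N = 1800, 2εN = 36

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
201    (28, 7)       [529, 536]
204    (10, 1)       [539, 540]

Goal: always maintain an ε-approximate summary: (rmax(v_{i+1}) - rmin(v_i)) ≤ 2εN, i.e. (g_i + Δ_i) ≤ 2εN for every i.
Insert new observations into the summary.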

14
Overview of Summary Data Structure
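ε = .01, N = 1800, 2εN = 36

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
197    (new)         [502, 536]
201    (28, 7)       [529, 536]
204    (10, 1)       [539, 540]

Goal: always maintain an ε-approximate summary: (rmax(v_{i+1}) - rmin(v_i)) ≤ 2εN, i.e. (g_i + Δ_i) ≤ 2εN for every i.
Insert new observations into the summary.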

15
Overview of Summary Data Structure
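ε = .01, N = 1801, 2εN = 36.02

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
197    (1, 34)       [502, 536]
201    (28, 7)       [530, 537]
204    (10, 1)       [540, 541]

Goal: always maintain an ε-approximate summary: (rmax(v_{i+1}) - rmin(v_i)) ≤ 2εN, i.e. (g_i + Δ_i) ≤ 2εN for every i.
Insert new observations into the summary.
Insert the new tuple before the i-th tuple: g_new = 1, Δ_new = g_i + Δ_i - 1.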
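
The insertion rule on this slide can be sketched as below. Two things are assumptions rather than slide content: the three tuples are treated as if they were the whole summary, and a new minimum or maximum is given Δ = 0, as is usual for this algorithm. The function name `insert` is illustrative.

```python
import bisect

def insert(summary, v):
    """Insert observation v into a sorted list of (value, g, delta) tuples."""
    values = [t[0] for t in summary]
    i = bisect.bisect_right(values, v)      # smallest i with v < v_i
    if i == 0 or i == len(summary):
        new_tuple = (v, 1, 0)               # new minimum or maximum: rank is exact
    else:
        _, g_i, delta_i = summary[i]
        new_tuple = (v, 1, g_i + delta_i - 1)
    summary.insert(i, new_tuple)
    return i

summary = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
insert(summary, 197)
print(summary)
```

Inserting 197 places it before the tuple for 201, so Δ_new = 28 + 7 - 1 = 34, matching the (1, 34) shown on the slide.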

16
Overview of Summary Data Structure
ε = .01, N = 1801, 2εN = 36.02

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
197    (1, 34)       [502, 536]
201    (28, 7)       [530, 537]
204    (10, 1)       [540, 541]

Goal: always maintain an ε-approximate summary: (rmax(v_{i+1}) - rmin(v_i)) ≤ 2εN, i.e. (g_i + Δ_i) ≤ 2εN for every i.
Insert new observations into the summary.
Delete all superfluous entries.

17
Overview of Summary Data Structure
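ε = .01, N = 1801, 2εN = 36.02

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
       (1, 34)       (tuple for 197 being deleted)
201    (28, 7)       [530, 537]
204    (10, 1)       [540, 541]

Goal: always maintain an ε-approximate summary: (rmax(v_{i+1}) - rmin(v_i)) ≤ 2εN, i.e. (g_i + Δ_i) ≤ 2εN for every i.
Insert new observations into the summary.
Delete all superfluous entries.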

18
Overview of Summary Data Structure
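ε = .01, N = 1801, 2εN = 36.02

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
201    (29, 7)       [530, 537]
204    (10, 1)       [540, 541]

Goal: always maintain an ε-approximate summary: (rmax(v_{i+1}) - rmin(v_i)) ≤ 2εN, i.e. (g_i + Δ_i) ≤ 2εN for every i.
Insert new observations into the summary.
Delete all superfluous entries: g_i = g_i + g_{i-1}.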
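
Deletion can be sketched in the same style. On the slide, i is the index of the surviving right-hand tuple, so its g absorbs g_{i-1}; in the code below the argument `i` is instead the index of the tuple being removed, which is a naming choice made for this illustration.

```python
def delete(summary, i):
    """Remove tuple i by folding its g into its right neighbour, tuple i + 1."""
    _, g, _ = summary[i]
    v_next, g_next, delta_next = summary[i + 1]
    # The neighbour keeps its value and delta; only its g grows.
    summary[i:i + 2] = [(v_next, g + g_next, delta_next)]

summary = [(192, 15, 2), (197, 1, 34), (201, 28, 7), (204, 10, 1)]
delete(summary, 1)                      # drop the tuple for 197
print(summary)
```

The tuple for 201 becomes (29, 7), as shown on the slide, and the sum of all g values is unchanged.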

19
Overview of Summary Data Structure
ε = .01, N = 1801, 2εN = 36.02

v_i    (g_i, Δ_i)    [rmin(v_i), rmax(v_i)]
192    (15, 2)       [501, 503]
201    (29, 7)       [530, 537]
204    (10, 1)       [540, 541]

Insert: g_new = 1, Δ_new = g_i + Δ_i - 1
Delete: g_i = g_i + g_{i-1}

20
Terminology

Full tuple: a tuple is full if g_i + Δ_i = 2εN.
Full tuple pair: a pair of tuples is full if deleting the left-hand tuple would overfill the right one.
Capacity: the number of observations that can be counted by g_i before the tuple becomes full (= 2εN - Δ_i).

General strategy: delete tuples with small capacity and preserve tuples with large capacity.

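
These three terms translate directly into small predicates. In the sketch below, `limit` stands for 2εN rounded down to an integer (36 for ε = .01 and N = 1801); passing it in precomputed, and the function names themselves, are assumptions made for illustration.

```python
def capacity(delta, limit):
    """How many more observations g may count before the tuple is full."""
    return limit - delta

def is_full(g, delta, limit):
    """A tuple is full once g + delta reaches the limit."""
    return g + delta >= limit

def pair_is_full(left, right, limit):
    """True if deleting `left` (merging its g into `right`) would overfill `right`."""
    _, g_left, _ = left
    _, g_right, delta_right = right
    return g_left + g_right + delta_right > limit

limit = 36
print(capacity(7, limit), is_full(29, 7, limit))
print(pair_is_full((197, 1, 34), (201, 28, 7), limit))
```

For the pair (197, 201) from the earlier slides, 1 + 28 + 7 = 36 does not exceed the limit, which is consistent with 197 being deleted there.
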
21
Operations

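Insert(v): find the smallest i such that v_{i-1} ≤ v < v_i, and insert the new tuple (v, g_new, Δ_new) between t_{i-1} and t_i.
Delete(v_i): to delete (v_i, g_i, Δ_i) from S, replace (v_i, g_i, Δ_i) and (v_{i+1}, g_{i+1}, Δ_{i+1}) by the new tuple (v_{i+1}, g_i + g_{i+1}, Δ_{i+1}).
Compress(): from right to left, merge all mergeable pairs.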

22
GK Algorithm
To add the (n+1)st observation, v, to summary S(n):
[Flowchart: a test with two branches. yes: COMPRESS(), then INSERT. no: INSERT.]

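
Putting the operations together, the following is a simplified end-to-end sketch, not the presented algorithm itself. It assumes that compression runs once every 1/(2ε) insertions, as in the published algorithm, and it swaps the band and tree test for a plain "does the merged tuple still fit" check. That keeps the ε guarantee but not the proven space bound. The class and method names are hypothetical.

```python
import bisect
import math
import random

class GKSummary:
    """Simplified Greenwald-Khanna style summary (no bands, no tree)."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                          # sorted list of [v, g, delta]
        self.period = max(1, math.floor(1 / (2 * eps)))

    def add(self, v):
        if self.n > 0 and self.n % self.period == 0:
            self.compress()
        values = [t[0] for t in self.tuples]
        i = bisect.bisect_right(values, v)        # smallest i with v < v_i
        if i == 0 or i == len(self.tuples):
            delta = 0                             # new minimum or maximum
        else:
            delta = self.tuples[i][1] + self.tuples[i][2] - 1
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

    def compress(self):
        limit = math.floor(2 * self.eps * self.n)
        i = len(self.tuples) - 2                  # right to left; index 0 is kept
        while i >= 1:
            g = self.tuples[i][1]
            right = self.tuples[i + 1]
            if g + right[1] + right[2] <= limit:  # merged tuple is not over-full
                right[1] += g                     # fold g into the right neighbour
                del self.tuples[i]
            i -= 1

    def query(self, phi):
        r = max(1, math.ceil(phi * self.n - 1e-9))
        slack = self.eps * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if r - rmin <= slack and rmin + delta - r <= slack:
                return v
        return self.tuples[-1][0]

random.seed(0)
data = list(range(1, 10001))                      # the value k has rank k
random.shuffle(data)
summary = GKSummary(eps=0.01)
for x in data:
    summary.add(x)
print(len(summary.tuples), summary.query(0.5))
```

Because the input is a permutation of 1..10000, the returned value is its own rank, so the error of any answer can be read off directly.
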
23
Tree Representation
Δ-range    Capacity    Band
0-7        8-15        3
8-11       4-7         2
12-13      2-3         1
14         1           0

ε = .001, N = 7,000, 2εN = 14

[Figure: tree of tuples, each node labeled with its band (six nodes in band 0, eight in band 1, three in band 2, four in band 3).]

Group tuples with similar capacities into bands.
The first (least index) node to the right with a higher capacity band becomes the parent.
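
The band table and the parent rule can be sketched as follows. The band formula, floor(log2(2εN - Δ + 1)), is inferred from the table on this slide and reproduces it for 2εN = 14; it is not claimed to be the exact formula of the published algorithm. The names `band` and `parents` are illustrative.

```python
def band(delta, limit):
    """Band of a tuple with the given delta, where limit = 2 * eps * N (an integer)."""
    capacity = limit - delta + 1          # capacity as listed in the table
    return capacity.bit_length() - 1      # floor(log2(capacity))

def parents(bands):
    """parent[i] = least j > i with bands[j] > bands[i], or None for a root."""
    result = []
    for i, b in enumerate(bands):
        parent = None
        for j in range(i + 1, len(bands)):
            if bands[j] > b:
                parent = j
                break
        result.append(parent)
    return result

print([band(d, 14) for d in range(15)])
print(parents([0, 1, 0, 0, 2, 1, 3]))
```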

27
Operation (compress)

General strategy: delete tuples with small capacity and preserve tuples with large capacity.
1) Deletion cannot leave descendants unmerged; it must delete entire subtrees.
2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity.
3) Deletion cannot create an over-full tuple (i.e. one with g + Δ > floor(2εN)).
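
The three rules can be combined into a single test that decides whether a tuple, together with its whole subtree, may be merged into its right neighbour. This is a sketch under assumptions: `g_subtree` is taken to be the sum of g over the tuple and all its descendants (rule 1), bands are assumed to be precomputed, and all names are hypothetical.

```python
def can_delete(band_i, band_next, g_subtree, g_next, delta_next, limit):
    """True if tuple i and its descendants may be merged into tuple i + 1."""
    # Rule 2: only merge into a tuple of similar or larger capacity (band).
    similar_or_larger = band_i <= band_next
    # Rule 3: the merged tuple must not be over-full.
    not_overfull = g_subtree + g_next + delta_next <= limit
    return similar_or_larger and not_overfull

# Small illustrative values with limit = floor(2 * eps * N) = 14.
print(can_delete(band_i=1, band_next=2, g_subtree=3, g_next=2, delta_next=8, limit=14))
```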

28
Analysis

Theorem
At any time n, the total number of tuples stored in S(n) is at most (11 / (2ε)) · log(2εn).

29
Experimental Result

Measurement
|S|
Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles
Optimal max observed ε
Compared 3 algorithms
MRL
Preallocated (1/3 the number of stored observations as MRL)
Adaptive: allocate a new quantile only when the observed error is about to exceed the desired ε

30
Conclusion