1
Approximating Quantiles
  • Jason Rydberg
  • April 13, 2006

2
References
  • Approximate Medians and other Quantiles in One
    Pass and with Limited Memory
  • Gurmeet Singh Manku, Sridhar Rajagopalan, Bruce
    G. Lindsay
  • ACM SIGMOD '98, 1998
  • Continuously Maintaining Quantile Summaries of
    the Most Recent N Elements over a Data Stream
  • Xuemin Lin, Hongjun Lu, Jian Xu, Jeffrey Xu Yu
  • Proceedings of the 20th International Conference
    on Data Engineering, 2004

3
Background
  • Quantiles are points at specific positions in
    sorted input data.
  • In a sequence of N ordered data elements, the
    φ-quantile is the element with rank ⌈φN⌉.
  • Due to memory constraints, quantiles for
    streaming data must be approximated.
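
For reference, when all of the data fits in memory the quantile can be read directly off the sorted sequence. The sketch below assumes 1-based ranks and the rank convention ⌈φN⌉; the function name exact_quantile is illustrative and does not come from either paper.

```python
import math

def exact_quantile(data, phi):
    """Return the element of rank ceil(phi * N) (1-based) in sorted order."""
    if not data or not 0 < phi <= 1:
        raise ValueError("need non-empty data and 0 < phi <= 1")
    ordered = sorted(data)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

# Rank ceil(2/3 * 8) = 6 in the sorted data 1, 9, 10, 10, 10, 11, 11, 12.
print(exact_quantile([12, 10, 11, 10, 1, 10, 11, 9], 2 / 3))  # 11
```

Sorting needs the whole input at once, which is exactly what a stream does not allow.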

4
Approximate Medians and other Quantiles in One
Pass and with Limited Memory
  • Algorithm
      • Overview
      • Levels
      • NEW
      • COLLAPSE
      • OUTPUT
  • Example

5
Algorithm Overview
  • Inputs
      • b: number of buffers
      • k: number of elements per buffer
  • New Algorithm
      • While there are outstanding elements
          • Call NEW and assign the buffer the desired
            level.
          • If all b buffers are full, call COLLAPSE on
            the set of buffers with level l and assign
            the output buffer level l + 1.
      • Finally, call OUTPUT.

6
Levels
  • Each buffer X has an integer, l(X), that
    identifies its level.
  • The global integer l equals the smallest l(X)
    among the full buffers.

7
Levels
  • Tree representation of levels

8
NEW
  • NEW
      • Fills a buffer with the next k elements from
        the input stream. If fewer than k elements are
        left, pads the buffer with alternating −∞ and
        +∞ until it holds k elements.
      • Labels the buffer as full.
      • Assigns the buffer a weight of 1.
  • NEW and levels
      • If more than one buffer is empty, the new
        buffer gets level 0.
      • Otherwise, the new buffer gets level l.
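
A minimal sketch of NEW, assuming the stream is a Python iterator and a buffer is a plain dict. The dict layout, the name new_buffer, and the choice to keep the items sorted are illustrative assumptions, not the paper's data structures.

```python
import math
from itertools import islice

def new_buffer(stream, k, level):
    """Fill one buffer with up to k items, padding with -inf / +inf."""
    items = list(islice(stream, k))
    if not items:
        return None  # input exhausted, nothing to buffer
    pads = [-math.inf, math.inf]
    i = 0
    while len(items) < k:
        items.append(pads[i % 2])  # alternate -inf, +inf
        i += 1
    # Sorting here is a convenience for later merging (an assumption).
    return {"items": sorted(items), "weight": 1, "level": level, "full": True}

stream = iter([8, 1, 5])
print(new_buffer(stream, 2, 0)["items"])  # [1, 8]
print(new_buffer(stream, 2, 0)["items"])  # [-inf, 5]
```

The caller decides the level (0 or l), following the rule on the slide.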

9
COLLAPSE
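  • COLLAPSE
      • Collapses c ≥ 2 full buffers (X_i, i = 1, …, c)
        into a buffer Y.
  • Weight of Y
      • w(Y) = w(X_1) + … + w(X_c)
  • Selecting elements for buffer Y (naively)
      • Duplicate the elements of each X_i according to
        its weight w(X_i) and sort.
      • Select the elements at positions
        j·w(Y) + offset, for j = 0, 1, …, k−1.
  • Offset is selected according to w(Y)
      • w(Y) odd: offset = (w(Y) + 1)/2
      • w(Y) even: offset = w(Y)/2 or (w(Y) + 2)/2,
        alternating between successive even-weight
        calls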
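
The selection step can be done without physically duplicating elements, by walking the merged buffers and counting positions by weight. This is a sketch under assumptions: buffers are dicts with sorted "items" and a "weight"; the offset rule used is (w + 1)/2 for odd w and w/2 or (w + 2)/2 for even w, as I understand it from the cited paper; the even_toggle flag, which the caller would flip between even-weight calls, is a hypothetical name.

```python
import heapq

def collapse(buffers, k, even_toggle=False):
    """Merge full buffers into one buffer of k elements."""
    w = sum(b["weight"] for b in buffers)
    if w % 2 == 1:
        offset = (w + 1) // 2
    else:
        offset = (w + 2) // 2 if even_toggle else w // 2
    # 1-based target positions in the weighted, sorted sequence.
    targets = [j * w + offset for j in range(k)]
    merged = heapq.merge(
        *[[(x, b["weight"]) for x in b["items"]] for b in buffers]
    )
    out, pos, t = [], 0, 0
    for value, weight in merged:
        pos += weight  # value occupies positions pos - weight + 1 .. pos
        while t < k and targets[t] <= pos:
            out.append(value)
            t += 1
    return {"items": out, "weight": w}

bufs = [{"items": [1, 8], "weight": 1},
        {"items": [5, 7], "weight": 1},
        {"items": [3, 9], "weight": 1}]
print(collapse(bufs, k=2))  # {'items': [3, 8], 'weight': 3}
```

Here w(Y) = 3 is odd, so the offset is 2 and positions 2 and 5 of the sorted sequence 1, 3, 5, 7, 8, 9 are kept.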

10
OUTPUT
  • OUTPUT
  • Similar to COLLAPSE, but outputs the appropriate
    quantile instead of combining buffers.
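
A sketch of OUTPUT along the same lines, assuming it returns the element at position ⌈φ·k·W⌉ of the weighted, sorted sequence, where W is the total weight of the remaining buffers. It ignores the adjustment needed when the last buffer was padded with ±∞, and the buffer layout (dicts with sorted "items" and a "weight") is an assumption.

```python
import heapq
import math

def output(buffers, k, phi):
    """Return the phi-quantile estimate from the remaining full buffers."""
    total = sum(b["weight"] for b in buffers)
    target = math.ceil(phi * k * total)  # 1-based weighted position
    merged = heapq.merge(
        *[[(x, b["weight"]) for x in b["items"]] for b in buffers]
    )
    pos = 0
    for value, weight in merged:
        pos += weight
        if pos >= target:
            return value
    return None  # only reached if buffers are empty

# Hypothetical final state: one buffer [3, 8] carrying weight 6.
print(output([{"items": [3, 8], "weight": 6}], k=2, phi=0.5))  # 3
```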

11
Example
  • Data: 8, 1, 5, 7, 3, 9, 6, 2, 4, 1, 8, 3
  • b = 3, k = 2
  • φ = 0.5
  • Estimated 0.5-quantile = 3. Actual 0.5-quantile = 6

12
Continuously Maintaining Quantile Summaries of
the Most Recent N Elements over a Data Stream
  • Two approaches
  • n-of-N Algorithm
      • Inputs
      • EH Partitioning
      • Algorithm Overview
      • Dropping Sketches
      • Querying Quantiles
      • LIFT
  • Example

13
Two Approaches
  • Sliding window model
      • Maintains a quantile sketch of the N most
        recently seen elements.
  • n-of-N model
      • Also maintains a quantile sketch of the N most
        recently seen elements. Unlike the sliding
        window model, quantile queries can be issued
        against the most recent n elements for any
        n ≤ N.
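
To make the query semantics concrete, here is an exact baseline for the n-of-N model that simply stores the last N elements. It uses O(N) memory, which is what the sketch-based algorithm is designed to avoid; the class name ExactNofN and the rank convention ⌈φn⌉ are assumptions for illustration.

```python
import math
from collections import deque

class ExactNofN:
    """Keep the N most recent elements; answer quantiles over the last n."""

    def __init__(self, N):
        self.window = deque(maxlen=N)  # oldest element falls off at N + 1

    def add(self, x):
        self.window.append(x)

    def quantile(self, phi, n):
        if not 1 <= n <= len(self.window):
            raise ValueError("n must be between 1 and the elements seen")
        recent = sorted(list(self.window)[-n:])
        return recent[math.ceil(phi * n) - 1]

w = ExactNofN(N=8)
for x in [12, 10, 11, 10, 1, 10, 11, 9]:
    w.add(x)
print(w.quantile(2 / 3, n=8))  # 11
print(w.quantile(1 / 3, n=5))  # 9
```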

14
Inputs
  • User inputs
      • ε: desired accuracy
      • N: number of elements to maintain quantile
        summaries for
      • n: number of elements to use for quantile
        queries
      • φ: desired quantile
  • Then, λ = ε/(ε + 2)

15
EH Partitioning
  • Uses buckets to store information.
  • Each bucket b contains
      • A quantile sketch S_b of its earliest data
        element and all successive data elements
      • The number of elements added after the
        earliest data element, N_b
      • The time stamp of the earliest data element
  • Maintains at most a bounded number of i-buckets
    for each i, with the bound set by λ

16
Algorithm Overview
  • When a new element e arrives at time t
      • Create a new sketch
          • Record a new 1-bucket with N_b = 0 and
            timestamp t. Initialize a sketch containing
            e and place it in the bucket.
      • Drop sketches
      • Update sketches
          • Add e into each remaining sketch using the
            GK-algorithm, with its approximation
            parameter derived from ε.
          • Increase each N_b by 1.

17
Dropping Sketches
  • If the number of 1-buckets is full, then, for
    i = 1 to j, if the number of i-buckets is full
      • Find the two oldest buckets, b1 and b2, among
        the i-buckets.
      • Drop the newer of the two (assume b2).
      • Update b1 so that it becomes an (i + 1)-bucket.
  • Scan the buckets, beginning with the oldest, and
    delete expired buckets, i.e. those whose N_b no
    longer fits within the window of N elements.
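
The merge cascade can be sketched as follows. Buckets are assumed to be dicts held in a list ordered oldest to newest, and max_per_level is a hypothetical stand-in for the per-level limit on i-buckets. The expiry scan is left out.

```python
def merge_full_levels(buckets, max_per_level):
    """Cascade merges over buckets ordered oldest -> newest (in place)."""
    level = 1
    while True:
        same = [b for b in buckets if b["level"] == level]
        if len(same) <= max_per_level:
            break
        b1, b2 = same[0], same[1]  # the two oldest buckets at this level
        # Drop the newer bucket b2 (compare by identity, not by value).
        buckets[:] = [b for b in buckets if b is not b2]
        b1["level"] = level + 1    # b1 is promoted one level
        level += 1                 # the promotion may overfill the next level
    return buckets

buckets = [{"level": 1, "timestamp": t} for t in (0, 1, 2)]
merge_full_levels(buckets, max_per_level=2)
print([(b["level"], b["timestamp"]) for b in buckets])  # [(2, 0), (1, 2)]
```

No sketch needs to be recomputed on a merge: b1's sketch already covers its earliest element and everything after it, including all of b2's elements.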

18
Querying Quantiles
  • Input: n, r
  • Algorithm
      • Scan the buckets, beginning with the oldest,
        and find the first bucket b with N_b ≤ n.
      • Apply the LIFT algorithm to S_b, with its
        parameter ζ set from ε, to generate S_lift.
      • For rank r, find the first tuple i in S_lift
        such that r − εn ≤ r_i^- ≤ r_i^+ ≤ r + εn.
        Return v_i.
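
The final lookup step is a linear scan over the lifted sketch. The sketch below assumes each tuple is (v_i, r_i^-, r_i^+) and that the acceptance test is r − εn ≤ r_i^- ≤ r_i^+ ≤ r + εn; the function name query_rank is illustrative. The tuples are the ones listed in the worked example for n = 8.

```python
def query_rank(s_lift, r, n, eps):
    """Return the first value whose rank bounds lie within r +/- eps * n."""
    lo, hi = r - eps * n, r + eps * n
    for value, r_min, r_max in s_lift:
        if lo <= r_min and r_max <= hi:
            return value
    return None  # no tuple is tight enough for this rank

s_lift = [(1, 1, 1), (9, 2, 2), (10, 3, 3), (10, 4, 4),
          (10, 5, 5), (11, 6, 6), (11, 7, 7), (12, 8, 8)]
print(query_rank(s_lift, r=6, n=8, eps=0.1))  # 11
```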

19
LIFT
  • Input: S, ζ
  • Algorithm
      • Initialize S_lift to ∅
      • For each tuple a_i in S
          • Update a_i by raising its upper rank bound
            r_i^+ by a term determined by ζ
          • Insert a_i into S_lift
      • Return S_lift

20
Example
  • Input data: 12, 10, 11, 10, 1, 10, 11, 9
  • ε = 0.1, N = 8.
  • So, λ = ε/(ε + 2) ≈ 0.047619
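
The value above is consistent with λ = ε/(ε + 2), which gives 1/21 for ε = 0.1. A quick arithmetic check, assuming that formula (the function name eh_lambda is illustrative):

```python
def eh_lambda(eps):
    """Partitioning parameter, assuming lambda = eps / (eps + 2)."""
    return eps / (eps + 2)

print(round(eh_lambda(0.1), 6))  # 0.047619
```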

21
Example

22
Example
  • n = 8, φ = 2/3
      • r = ⌈φn⌉ = 6, lift term = 0
      • S_lift result
      • ((1,1,1),(9,2,2),(10,3,3),(10,4,4),(10,5,5),
        (11,6,6),(11,7,7),(12,8,8))
      • 5.2 ≤ r_i^- ≤ r_i^+ ≤ 6.8
      • v_i = 11
  • n = 5, φ = 1/3
      • r = ⌈φn⌉ = 2, lift term = 0
      • S_lift result
      • ((1,1,1),(9,2,2),(10,3,3),(10,4,4),(11,5,5),
        (11,6,6))
      • 1.5 ≤ r_i^- ≤ r_i^+ ≤ 2.5
      • v_i = 9

23
Space complexity
  • Manku's approach: O((1/ε) log²(εN))
  • n-of-N approach: O((1/ε²) log²(εN))
  • For equal N, n-of-N requires more space.

24
Summary
  • If all of the data is important, Manku's approach
    is preferable.
  • If earlier data elements can become outdated (or
    aren't important), or if queries are to be
    executed against a variable number of elements,
    the n-of-N approach should be used.