Statistic estimation over data stream

Transcript and Presenter's Notes
1
Statistic estimation over data stream
Slides modified from Minos Garofalakis (Yahoo!
Research) and S. Muthukrishnan (Rutgers
University)
2
Outline
  • Introduction
  • Frequency moment estimation
  • Element frequency estimation

3
Data Stream Processing Algorithms
  • Generally, algorithms compute approximate answers
  • Provably difficult to compute answers accurately
    with limited memory
  • Approximate answers with deterministic bounds
  • Algorithms compute only an approximate answer,
    but with bounds on the error
  • Approximate answers with probabilistic bounds
  • Algorithms compute an approximate answer with
    high probability
  • With probability at least 1 − δ, the computed
    answer is within a factor of 1 ± ε of the actual
    answer

4
Sampling Basics
  • Idea: a small random sample S of the data often
    well-represents all the data
  • For a fast approximate answer, apply a modified
    query to S
  • Example: select agg from R (n = 12)
  • If agg is avg, return the average of the elements
    in S
  • Number of odd elements? (a code sketch follows
    the example)

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
Answer: 11.5
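
A minimal Python sketch of this sampling idea, using the stream and sample from the slide (the n/|S| scale-up for the odd-element count is the standard estimator, an assumption since the slide leaves it implicit):

```python
stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # n = 12
sample = [9, 5, 1, 8]                           # the sample S from the slide

# agg = avg: the sample mean directly estimates the stream mean
est_avg = sum(sample) / len(sample)

# number of odd elements: scale the sample count by n / |S|
est_odd = sum(x % 2 for x in sample) * len(stream) / len(sample)
print(est_avg, est_odd)   # 5.75 and 9.0 for this particular sample
```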
5
Probabilistic Guarantees
  • Example: actual answer is within 11.5 ± 1 with
    prob ≥ 0.9
  • Randomized algorithms: the answer returned is a
    specially-built random variable
  • Use tail inequalities to give probabilistic
    bounds on the returned answer
  • Markov inequality
  • Chebyshev's inequality
  • Chernoff/Hoeffding bound

6
Basic Tools Tail Inequalities
  • General bounds on the tail probability of a random
    variable (that is, the probability that a random
    variable deviates far from its expectation)
  • Basic inequalities: let X be a random variable
    with expectation μ and variance Var[X]. Then
    for any ε > 0:

Markov (X non-negative): Pr[X ≥ ε] ≤ μ/ε
Chebyshev: Pr[|X − μ| ≥ ε] ≤ Var[X]/ε²
7
Tail Inequalities for Sums
  • Possible to derive even stronger bounds on tail
    probabilities for the sum of independent
    Bernoulli trials
  • Chernoff bound: let X1, ..., Xm be independent
    Bernoulli trials such that Pr[Xi = 1] = p
    (Pr[Xi = 0] = 1 − p). Let X = Σi Xi and μ = mp be
    the expectation of X. Then, for any δ > 0,
    Pr[|X − μ| ≥ δμ] ≤ 2·exp(−μδ²/3)

Do not need to compute Var(X), but do need the
independence assumption!
  • Application to count queries (a numeric check
    follows)
  • m is the size of sample S (4 in the example)
  • p is the fraction of odd elements in the stream
    (2/3 in the example)
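
As a quick numeric check of the bound on this running example (μ = mp = 8/3; δ = 0.5 is an arbitrary illustrative choice, not from the slide):

```python
import math

m, p = 4, 2 / 3               # sample size and odd-element fraction from the example
mu = m * p                    # expected number of odd elements in the sample
delta = 0.5
bound = 2 * math.exp(-mu * delta ** 2 / 3)
print(f"Pr[|X - mu| >= {delta} * mu] <= {bound:.2f}")   # ~1.60
```

For m = 4 the bound exceeds 1, i.e., it is vacuous: the exponential decay only bites once the sample is larger.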

8
The Streaming Model
  • Underlying signal: a one-dimensional array A[1..N]
    with values A[i], all initially zero
  • Multi-dimensional arrays as well (e.g.,
    row-major)
  • Signal is implicitly represented via a stream of
    updates
  • j-th update is <k, c[j]>, implying
  • A[k] := A[k] + c[j] (c[j] can be >0 or <0; see
    the sketch below)
  • Goal: compute functions on A subject to
  • Small space
  • Fast processing of updates
  • Fast function computation
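
A tiny sketch of these update semantics (the explicit array is for intuition only; the sketches that follow exist precisely to avoid storing A):

```python
N = 8
A = [0] * N                            # underlying signal, initially all zero
updates = [(3, 2), (5, 1), (3, -1)]    # hypothetical stream of <k, c_j> pairs
for k, c in updates:
    A[k] += c                          # turnstile: c_j may be positive or negative
```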

9
Streaming Model Special Cases
  • Time-Series Model
  • Only the j-th update updates A[j] (i.e.,
    A[j] := c[j])
  • Cash-Register Model
  • c[j] is always > 0 (i.e., increment-only)
  • Typically c[j] = 1, so we see a multi-set of
    items in one pass
  • Turnstile Model
  • Most general streaming model
  • c[j] can be >0 or <0 (i.e., increment or
    decrement)
  • Problem difficulty varies depending on the model
  • E.g., MIN/MAX in Time-Series vs. Turnstile!

10
Frequency moment computation
  • Problem
  • Data arrives online (a1, a2, a3, ..., am)
  • Let f(i) = |{ j : aj = i }| (represented by
    A[i])
  • Fk = Σi f(i)^k
  • Example

F0 = 5 <distinct elements>, F1 = 7, F2 = 11
(= 1·1 + 2·2 + 2·2 + 1·1 + 1·1) (the surprise index)
What is F∞?
11
Frequency moment computation
  • Easy for F1: just count the stream elements
  • How about the others?
  • - Focus on F2 and F0
  • - Estimation of Fk

12
Linear-Projection (AMS) Sketch Synopses
  • Goal: build a small-space summary for the
    distribution vector f(i) (i = 1, ..., N) seen as a
    stream of i-values
  • Basic construct: a randomized linear projection of
    f() = the inner/dot product of the f-vector with a
    vector ξ of random values from an appropriate
    distribution: Z = <f, ξ> = Σi f(i)·ξi
  • Simple to compute over the stream: add ξi
    whenever the i-th value is seen
  • Generate the ξi's in small O(log N) space using
    pseudo-random generators
  • Tunable probabilistic guarantees on the
    approximation error
  • Delete-proof: just subtract ξi to delete an
    occurrence of the i-th value
13
AMS (sketch) cont.
  • Key intuition: use randomized linear projections
    of f() to define a random variable X such that
  • X is easily computed over the stream (in small
    space)
  • E[X] = F2
  • Var[X] is small
  • Basic idea
  • Define a family of 4-wise independent {−1, +1}
    random variables ξ1, ..., ξN
  • Pr[ξi = +1] = Pr[ξi = −1] = 1/2
  • Expected value of each ξi: E[ξi] = 0,
    E[ξi²] = 1
  • Variables are 4-wise independent
  • Expected value of the product of any 4 distinct
    ξ's is 0: E[ξi·ξj·ξk·ξl] = 0
  • Variables ξi can be generated using a
    pseudo-random generator using only O(log N) space
    (for seeding)! (a construction sketch follows)

Probabilistic error guarantees (e.g., actual
answer is 10 ± 1 with probability 0.9)
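
One standard way to realize such a family in O(log N) space is to evaluate a random degree-3 polynomial over a prime field, keeping only the four seed coefficients; a minimal sketch of that construction (the choice of prime and the low-bit sign mapping are implementation assumptions, not from the slides):

```python
import random

P = 2 ** 61 - 1   # Mersenne prime; random cubic polynomials over F_P are 4-wise independent

class FourWise:
    """xi(i) in {-1, +1}; only the four seed coefficients are stored."""
    def __init__(self):
        self.coeffs = [random.randrange(P) for _ in range(4)]

    def xi(self, i):
        h = 0
        for c in self.coeffs:        # Horner evaluation of the cubic polynomial at i
            h = (h * i + c) % P
        return 1 if h & 1 else -1    # map the low bit of the hash to a sign
```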
14
AMS (sketch) cont.
  • Example: [frequency histogram over domain values
    1..4, with bar heights 3 and 2]
  • Suppose ξ1, ξ2 → +1 and ξ3, ξ4 → −1; then Z = ?
  • ξ4 → +1 and ξ1, ξ2, ξ3 → −1; then Z = ?
15
AMS (sketch) cont.
  • Expected value of X = Z²:
    E[X] = E[Z²]
         = Σi f(i)²·E[ξi²] + Σi≠j f(i)·f(j)·E[ξi·ξj]
         = F2,
    since E[ξi²] = 1 and E[ξi·ξj] = 0
  • Using 4-wise independence, it is possible to show
    that Var[X] ≤ 2·F2²
16
Boosting Accuracy
  • Chebyshev's inequality:
    Pr[|Y − F2| ≥ ε·F2] ≤ Var[Y]/(ε²·F2²)
  • Boost accuracy to ε by averaging over s1 = 16/ε²
    independent copies of X (reduces variance):
    Y = (X1 + ... + Xs1)/s1, so
    Var[Y] = Var[X]/s1 ≤ ε²·F2²/8
  • By Chebyshev:
    Pr[|Y − F2| ≥ ε·F2] ≤ 1/8
17
Boosting Confidence
  • Boost confidence to 1 − δ by taking the median of
    2·log(1/δ) independent copies of Y
  • Each Y is a Bernoulli trial:
    FAILURE means |Y − F2| ≥ ε·F2, which happens with
    probability ≤ 1/8

Pr[|median(Y) − F2| ≥ ε·F2]
= Pr[at least half of the 2·log(1/δ) copies fail] ≤ δ
(by the Chernoff bound)
18
Summary of AMS Sketching for F2
  • Step 1: compute random variables Z1, Z2, ...,
    where each Z = Σi f(i)·ξi
  • Step 2: define X = Z²
  • Steps 3 & 4: average s1 = 16/ε² independent
    copies of X; return the median of s2 = 2·log(1/δ)
    such averages
  • Main theorem: sketching approximates F2 to
    within a relative error of ε with probability
    ≥ 1 − δ using O((log(1/δ)/ε²)·(log N + log m))
    space
  • Remember: O(log N) space for seeding the
    construction of each X (a full code sketch
    follows)

[s2 groups of s1 copies each: average within each
group, take the median across groups]
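
Putting the steps together, a minimal tug-of-war estimator (it reuses the FourWise family sketched earlier; the default s1 and s2 are small illustrative values, with the theorem dictating 16/ε² and 2·log(1/δ)):

```python
import statistics

class AMSF2:
    """Median of s2 averages of s1 independent copies of X = Z^2."""
    def __init__(self, s1=16, s2=5):
        self.xis = [[FourWise() for _ in range(s1)] for _ in range(s2)]
        self.Z = [[0] * s1 for _ in range(s2)]

    def update(self, i, c=1):                    # stream update <i, c>; c = -1 deletes
        for g in range(len(self.Z)):
            for k in range(len(self.Z[g])):
                self.Z[g][k] += c * self.xis[g][k].xi(i)

    def estimate(self):
        return statistics.median(
            sum(z * z for z in group) / len(group) for group in self.Z
        )
```

Each update touches all s1·s2 counters, so update time and space are both O(s1·s2), plus the O(log N) seeds per copy.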
19
Binary-Join COUNT Query
  • Problem: compute the answer for the query
    COUNT(R ⋈A S)
  • Example

Data stream R.A: 4 1 2 4 1 4
→ fR(1) = 2, fR(2) = 1, fR(4) = 3 (domain 0..4)
Data stream S.A: 3 1 2 4 2 4
→ fS(1) = 1, fS(2) = 2, fS(3) = 1, fS(4) = 2
COUNT = Σi fR(i)·fS(i) = 10 (= 2 + 2 + 0 + 6)
  • Exact solution too expensive, requires O(N)
    space!
  • N = sizeof(domain(A))

20
Basic AMS Sketching Technique [AMS96]
  • Key intuition: use randomized linear projections
    of f() to define a random variable X such that
  • X is easily computed over the stream (in small
    space)
  • E[X] = COUNT(R ⋈A S)
  • Var[X] is small
  • Basic idea
  • Define a family of 4-wise independent {−1, +1}
    random variables ξ1, ..., ξN

21
AMS Sketch Construction
  • Compute random variables XR = Σi fR(i)·ξi and
    XS = Σi fS(i)·ξi
  • Simply add ξi to XR (resp. XS) whenever the i-th
    value is observed in the R.A (resp. S.A) stream
  • Define X = XR·XS to be the estimate of the COUNT
    query
  • Example

Data stream R.A: 4 1 2 4 1 4
→ XR = 2·ξ1 + ξ2 + 3·ξ4
Data stream S.A: 3 1 2 4 2 4
→ XS = ξ1 + 2·ξ2 + ξ3 + 2·ξ4
22
Binary-Join AMS Sketching Analysis
  • Expected value of X = COUNT(R ⋈A S):
    E[X] = E[XR·XS]
         = Σi fR(i)·fS(i)·E[ξi²] + Σi≠j fR(i)·fS(j)·E[ξi·ξj]
         = COUNT(R ⋈A S),
    since E[ξi²] = 1 and E[ξi·ξj] = 0
  • Using 4-wise independence, it is possible to show
    that Var[X] ≤ 2·SJ(R)·SJ(S), where
    SJ(R) = Σi fR(i)² is the self-join size of R
    (its second/L2 moment)
23
Boosting Accuracy
  • Chebyshev's inequality:
    Pr[|Y − COUNT| ≥ ε·COUNT] ≤ Var[Y]/(ε²·COUNT²)
  • Boost accuracy to ε by averaging over
    s1 = O(SJ(R)·SJ(S)/(ε²·COUNT²)) independent copies
    of X (reduces variance): Var[Y] = Var[X]/s1
  • By Chebyshev:
    Pr[|Y − COUNT| ≥ ε·COUNT] ≤ 1/8
24
Boosting Confidence
  • Boost confidence to 1 − δ by taking the median of
    2·log(1/δ) independent copies of Y
  • Each Y is a Bernoulli trial:
    FAILURE means |Y − COUNT| ≥ ε·COUNT, which happens
    with probability ≤ 1/8

Pr[|median(Y) − COUNT| ≥ ε·COUNT] ≤ δ
(by the Chernoff bound)
25
Summary of Binary-Join AMS Sketching
  • Step 1: compute random variables XR = Σi fR(i)·ξi
    and XS = Σi fS(i)·ξi
  • Step 2: define X = XR·XS
  • Steps 3 & 4: average s1 independent copies of X;
    return the median of s2 = 2·log(1/δ) such averages
  • Main theorem (AGMS99): sketching approximates
    COUNT to within a relative error of ε with
    probability ≥ 1 − δ using space
    O(SJ(R)·SJ(S)·log(1/δ)·log N / (ε²·COUNT²))
  • Remember: O(log N) space for seeding the
    construction of each X (a code sketch follows)

[s2 groups of s1 copies each: average within each
group, take the median across groups]
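
The same machinery with two parallel projections yields the COUNT estimator; a minimal sketch (the crucial detail is that XR and XS in each copy share the same ξ family):

```python
import statistics

class AMSJoinCount:
    """Median of averages of X = XR * XS, with shared xi families."""
    def __init__(self, s1=16, s2=5):
        self.xis = [[FourWise() for _ in range(s1)] for _ in range(s2)]
        self.XR = [[0] * s1 for _ in range(s2)]
        self.XS = [[0] * s1 for _ in range(s2)]

    def update(self, rel, i, c=1):               # rel is 'R' or 'S'
        X = self.XR if rel == 'R' else self.XS
        for g in range(len(X)):
            for k in range(len(X[g])):
                X[g][k] += c * self.xis[g][k].xi(i)

    def estimate(self):
        return statistics.median(
            sum(r * s for r, s in zip(gr, gs)) / len(gr)
            for gr, gs in zip(self.XR, self.XS)
        )
```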
26
Distinct Value Estimation ( F0 )
  • Problem: find the number of distinct values in a
    stream of values with domain [0, ..., N−1]
  • Zeroth frequency moment F0
  • In statistics: the number of species or classes
    in a population
  • Important for query optimizers
  • Network monitoring: distinct destination IP
    addresses, source/destination pairs, requested
    URLs, etc.
  • Example (N = 64): number of distinct values = 5
  • Hard problem for random sampling! Must sample
    almost the entire table to guarantee the estimate
    is within a factor of 10 with probability > 1/2,
    regardless of the estimator used!
27
Hash (aka FM) Sketches for Distinct Value
Estimation [FM85]
  • Assume a hash function h(x) that maps incoming
    values x in [0, ..., N−1] uniformly across
    [0, ..., 2^L − 1], where L = O(log N)
  • Let lsb(y) denote the position of the
    least-significant 1 bit in the binary
    representation of y
  • A value x is mapped to lsb(h(x))
  • Maintain a Hash Sketch: a BITMAP array of L bits,
    initialized to 0
  • For each incoming value x, set
    BITMAP[lsb(h(x))] = 1
  • Prob[lsb(h(x)) = i] = ?
28
Hash (FM) Sketches for Distinct Value Estimation
[FM85]
  • By uniformity through h(x):
    Prob[BITMAP[k] = 1] = 1/2^(k+1)
  • Assuming d distinct values: expect d/2 to map
    to BITMAP[0], d/4 to map to BITMAP[1], . . .
  • Let R = position of the rightmost zero in BITMAP
  • Use R as an indicator of log(d)
  • [FM85] prove that E[R] = log(φ·d), where
    φ ≈ 0.7735
  • Estimate d = 2^R / φ
  • Average several i.i.d. instances (different hash
    functions) to reduce estimator variance (a code
    sketch follows)
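
A minimal single-bitmap FM sketch along these lines (the 64-bit linear hash stands in for the slide's ideal h(x) and is an assumption; φ is the FM85 constant):

```python
import random

PHI = 0.77351                    # FM85 correction constant

class FMSketch:
    def __init__(self, L=32):
        self.L = L
        self.a = random.getrandbits(2 * L) | 1   # random odd multiplier (assumed hash)
        self.b = random.getrandbits(2 * L)
        self.bitmap = 0

    def add(self, x):
        h = (self.a * x + self.b) % (1 << self.L)
        lsb = (h & -h).bit_length() - 1 if h else self.L - 1
        self.bitmap |= 1 << lsb                  # set BITMAP[lsb(h(x))]

    def estimate(self):
        r = 0
        while (self.bitmap >> r) & 1:            # R = position of rightmost zero
            r += 1
        return (1 << r) / PHI                    # d ~ 2^R / phi
```

Because the state is an OR-accumulated bitmap, two sketches built with the same hash compose by bitwise OR, which is what the set-union estimation on a later slide exploits.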
29
Accuracy of FM
[Figure: BITMAP 1 through BITMAP m, i.e., m
independent bitmaps built with different hash
functions]
Averaging the m instances gives an approximation
with probability at least 1 − δ
30
Hash (FM) Sketches for Distinct Value Estimation
  • [FM85] assume ideal hash functions h(x)
    (N-wise independence)
  • In practice
  • h(x) = a·x + b, where a, b are random binary
    vectors in [0, ..., 2^L − 1]
  • Composable: component-wise OR/add distributed
    sketches together
  • Estimate |S1 ∪ S2 ∪ ... ∪ Sk| = set-union
    cardinality

31
Cash Register Sketch (AMS)
A more general algorithm for Fk
Choose a random position p from 1..n (by sampling
the stream) and let r = |{ q : q ≥ p, aq = ap }|,
the number of occurrences of ap from position p on.

Estimator: X = n·(r^k − (r−1)^k)

Using F2 (k = 2) as an example:
If we choose the first element a1, then r = 2 and
X = 7·(2² − 1²) = 21. And for a2: r = ?, X = ?
For a5: r = ?, X = ?
32
Cash Register Sketch (AMS)
  • Y = average of A copies of X, and Z = median of
    B copies of Y

[B groups of A copies each: average within each
group, take the median across groups]
Claim: this is a (1 ± ε) approximation to F2, the
space used is O(A·B) words of size
O(log n + log m), and it holds with probability at
least 1 − δ.
33
Analysis: Cash Register Sketch
E(X) = F2
V(X) = E(X²) − (E(X))²
Using a² − b² ≤ 2·(a − b)·a, we have
V(X) ≤ 2·F1·F3.
Hence,
E(Y) = E(X) = Σi fi² = F2,   V(Y) = V(X)/A.
34
Analysis Contd.
Applying Chebyshev's inequality:
Pr[|Y − F2| > ε·F2] ≤ V(Y)/(ε²·F2²)
Hence, by Chernoff bounds, the probability that
more than B/2 of the Yi's deviate by far is at most
δ if we take B = O(log(1/δ)) copies of the Yi's.
Hence, the median gives the correct approximation.
35
Computation of Fk
  • E(X) = Fk
  • When A = O(k·n^(1−1/k)/ε²)
  • and B = O(log(1/δ)):
  • Get a (1 ± ε) approximation with probability at
    least 1 − δ (a code sketch follows)
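
A minimal offline rendering of this estimator (shown over a stored list for clarity; a true one-pass version would pick p by reservoir sampling and count matches of a_p in the remaining suffix as it streams by):

```python
import random
import statistics

def ams_fk(stream, k, A=16, B=5):
    """Median of B averages of A copies of X = n * (r^k - (r-1)^k)."""
    n = len(stream)

    def one_copy():
        p = random.randrange(n)                  # uniform random position
        r = sum(1 for q in range(p, n) if stream[q] == stream[p])
        return n * (r ** k - (r - 1) ** k)

    ys = [sum(one_copy() for _ in range(A)) / A for _ in range(B)]
    return statistics.median(ys)
```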

36
Estimate the element frequency
  • Ask for f(1) = ? f(4) = ?
  • - AMS-based algorithm
  • - Count-Min sketch

37
AMS (sketch) based algorithm
  • Key intuition: use randomized linear projections
    of f() to define a random variable Z such that
  • For a given element i:
  • E(Z·ξi) = fi
  • Similarly, we have E(Z·ξj) = fj
  • Basic idea
  • Define a family of 4-wise independent {−1, +1}
    random variables (same as before)
  • Pr[ξi = +1] = Pr[ξi = −1] = 1/2
  • Let Z = Σi fi·ξi
  • So E(Z·ξi) = fi·E[ξi²] + Σj≠i fj·E[ξj·ξi] = fi,
    since E[ξi²] = 1 and E[ξj·ξi] = 0
38
AMS cont.
  • Keep an array of w × d counters Z[i, j]
  • Use d hash functions to map an element x to 1..w
    (a code sketch follows)

[Element a is hashed by h1, ..., hd to one counter
Z[i, hi(a)] per row]

Est(fa) = median_i ( Z[i, hi(a)] · ξ_i(a) ),
where ξ_i is row i's sign family
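
A minimal sketch of this counter array (the pairwise-independent bucket hashes and the per-row reuse of the FourWise sign family from earlier are assumptions the slide leaves implicit):

```python
import random
import statistics

class AMSPointQuery:
    """d x w counters Z; Est(f_a) = median_i( Z[i][h_i(a)] * xi_i(a) )."""
    def __init__(self, w=64, d=5):
        self.w = w
        self.h = [(random.randrange(1, P), random.randrange(P)) for _ in range(d)]
        self.sign = [FourWise() for _ in range(d)]    # one sign family per row
        self.Z = [[0] * w for _ in range(d)]

    def _bucket(self, i, x):
        a, b = self.h[i]
        return (a * x + b) % P % self.w

    def update(self, x, c=1):
        for i in range(len(self.Z)):
            self.Z[i][self._bucket(i, x)] += c * self.sign[i].xi(x)

    def estimate(self, x):
        return statistics.median(
            self.Z[i][self._bucket(i, x)] * self.sign[i].xi(x)
            for i in range(len(self.Z))
        )
```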
39
The Count-Min (CM) Sketch
  • Simple sketch idea; can be used for point queries
    (fi), range queries, quantiles, and join size
    estimation
  • Creates a small summary as an array of w × d
    counters C
  • Uses d hash functions to map elements to 1..w

[Array diagram: width w, depth d]
40
CM Sketch Structure
[Element xi is hashed by h1, ..., hd to one counter
per row: depth d, width w]
  • Each element xi is mapped to one counter per row
  • C[k, hk(xi)] := C[k, hk(xi)] + 1 (−1 if deletion)
  • or + cj if the update is <j, cj>
  • Estimate A[j] by taking min_k C[k, hk(j)]

41
CM Sketch Summary
  • The CM sketch guarantees an approximation error on
    point queries of less than ε·‖A‖1 in size
    O((1/ε)·log(1/δ))
  • The probability of a larger error is less than δ
  • Hints
  • Counts are biased! Can you limit the expected
    amount of extra mass at each bucket? (Use
    Markov) (a code sketch follows)
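
A minimal Count-Min implementation matching the update and query rules above (in the usual analysis w ≈ e/ε and d ≈ ln(1/δ); the linear hashes reuse the prime P from earlier and are an assumed pairwise-independent family):

```python
import random

class CountMin:
    """d x w counters; point query is min_k C[k][h_k(j)], a one-sided overestimate."""
    def __init__(self, w=256, d=5):
        self.w = w
        self.h = [(random.randrange(1, P), random.randrange(P)) for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _bucket(self, k, j):
        a, b = self.h[k]
        return (a * j + b) % P % self.w

    def update(self, j, c=1):                    # stream update <j, c_j>
        for k in range(len(self.C)):
            self.C[k][self._bucket(k, j)] += c

    def query(self, j):                          # estimate A[j]
        return min(self.C[k][self._bucket(k, j)] for k in range(len(self.C)))
```

The Markov-based hint above is exactly why the min works: each row's extra mass is nonnegative with small expectation, so the minimum over d independent rows is rarely much too large.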