Title: CS 361A (Advanced Data Structures and Algorithms)
1CS 361A (Advanced Data Structures and Algorithms)
- Lectures 16 & 17 (Nov 16 and 28, 2005)
- Synopses, Samples, and Sketches
- Rajeev Motwani
2Game Plan for Week
- Last Class
- Models for Streaming/Massive Data Sets
- Negative results for Exact Distinct Values
- Hashing for Approximate Distinct Values
- Today
- Synopsis Data Structures
- Sampling Techniques
- Frequency Moments Problem
- Sketching Techniques
- Finding High-Frequency Items
3Synopsis Data Structures
- Synopses
- Webster: a condensed statement or outline (as of a narrative or treatise)
- CS 361A: succinct data structure that lets us answer queries efficiently
- Synopsis Data Structures
- Lossy Summary (of a data stream)
- Advantages: fits in memory, easy to communicate
- Disadvantage: lossiness implies approximation error
- Negative Results ⇒ best we can do
- Key Techniques: randomization and hashing
4Numerical Examples
- Approximate Query Processing [AQUA/Bell Labs]
- Database Size: 420 MB
- Synopsis Size: 420 KB (0.1%)
- Approximation Error: within 10%
- Running Time: 0.3% of time for exact query
- Histograms/Quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Khanna-Greenwald]
- Data Size: $10^9$ items
- Synopsis Size: 1249 items
- Approximation Error: within 1%
- Desiderata
- Small Memory Footprint
- Quick Update and Query
- Provable, low-error guarantees
- Composable for distributed scenario
- Applicability?
- General-purpose: e.g. random samples
- Specific-purpose: e.g. distinct values estimator
- Granularity?
- Per database: e.g. sample of entire table
- Per distinct value: e.g. customer profiles
- Structural: e.g. GROUP-BY or JOIN result samples
6Examples of Synopses
- Synopses need not be fancy!
- Simple Aggregates e.g. mean/median/max/min
- Variance?
- Random Samples
- Aggregates on small samples represent entire data
- Leverage extensive work on confidence intervals
- Random Sketches
- structured samples
- Tracking High-Frequency Items
7Random Samples
8Types of Samples
- Oblivious sampling: at item level
- Limitations [Bar-Yossef-Kumar-Sivakumar, STOC 01]
- Value-based sampling: e.g. distinct-value samples
- Structured samples: e.g. join sampling
- Naïve approach: keep samples of each relation
- Problem: sample-of-join ≠ join-of-samples
- Foreign-Key Join [Chaudhuri-Motwani-Narasayya]: what if A sampled from L and B from R?
9Basic Scenario
- Goal: maintain uniform sample of item-stream
- Sampling Semantics?
- Coin flip
- select each item with probability p
- easy to maintain
- undesirable: sample size is unbounded
- Fixed-size sample without replacement
- Our focus today
- Fixed-size sample with replacement
- Show: can generate from previous sample
- Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]
10Reservoir Sampling [Vitter]
- Input: stream of items $X_1, X_2, X_3, \ldots$
- Goal: maintain uniform random sample S of size n (without replacement) of stream so far
- Reservoir Sampling
- Initialize: include first n elements in S
- Upon seeing item $X_t$
- Add $X_t$ to S with probability n/t
- If added, evict random previous item
- Correctness?
- Fact: At each instant, $|S| = n$
- Theorem: At time t, any $X_i \in S$ with probability n/t
- Exercise: prove via induction on t
- Efficiency?
- Let N be stream size
- Remark: Verify this is optimal.
- Naïve implementation ⇒ N coin flips ⇒ time O(N)
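The naïve reservoir procedure above (one coin flip per item) can be sketched as follows. This is a minimal illustration, not Vitter's optimized algorithm; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random

def reservoir_sample(stream, n, rng=None):
    """Uniform random sample of n items (without replacement) from an iterable."""
    rng = rng if rng is not None else random.Random()
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= n:
            # Initialize: the first n items always enter the sample.
            sample.append(item)
        elif rng.random() < n / t:
            # Add X_t with probability n/t, evicting a random previous item.
            sample[rng.randrange(n)] = item
    return sample

if __name__ == "__main__":
    print(reservoir_sample(range(1000), 5, random.Random(0)))
```

If the stream has fewer than n items, the sketch simply returns all of them.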
12Improving Efficiency
Stream $X_1, X_2, \ldots, X_{14}$; the figure marks the items inserted into sample S (where n = 3)
- Random variable $J_t$: number jumped over after time t
- Idea: generate $J_t$ and skip that many items
- Cumulative Distribution Function: $F(s) = P[J_t \le s]$, for $t > n$ and $s \ge 0$
- Number of calls to RANDOM()?
- one per insertion into sample
- this is optimal!
- Generating $J_t$?
- Pick random number $U \in [0,1]$
- Find smallest j such that $U \le F(j)$
- How?
- Linear scan ⇒ O(N) time
- Binary search with Newton's interpolation ⇒ $O(n^2 (1 + \mathrm{polylog}(N/n)))$ time
- Remark: see paper for optimal algorithm
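The linear-scan option can be sketched as below. The closed form used for the CDF is an assumption derived from the reservoir rule (item j is inserted with probability n/j), namely $P[J_t > s] = \prod_{j=t+1}^{t+s+1} (1 - n/j)$; the function names are hypothetical.

```python
import random

def skip_cdf(t, n, s):
    """F(s) = P[J_t <= s] for reservoir size n after t items (t >= n)."""
    miss = 1.0
    for j in range(t + 1, t + s + 2):
        miss *= 1.0 - n / j          # item j is NOT inserted
    return 1.0 - miss

def generate_skip(t, n, rng):
    """Smallest j with U <= F(j), found by linear scan; one random number per skip."""
    u = rng.random()
    j = 0
    miss = 1.0 - n / (t + 1)         # P[J_t > 0]
    while 1.0 - miss < u:            # F(j) < U, so keep scanning
        j += 1
        miss *= 1.0 - n / (t + j + 1)
    return j

if __name__ == "__main__":
    rng = random.Random(1)
    print([generate_skip(10, 3, rng) for _ in range(10)])
```

Only one call to the random number generator is made per insertion into the sample; the cost moves into the scan itself.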
14Sampling over Sliding Windows [Babcock-Datar-Motwani]
- Sliding Window W: last w items in stream
- Model: item $X_t$ expires at time t+w
- Why?
- Applications may require ignoring stale data
- Type of approximation
- Only way to define JOIN over streams
- Goal: Maintain uniform sample of size n of sliding window
15Reservoir Sampling?
- Observe
- any item in sample S will expire eventually
- must replace with random item of current window
- Problem
- no access to items in W - S
- storing entire window requires O(w) memory
- Oversampling
- Backing sample B: select each item with a fixed probability
- sample S: select n items from B at random
- upon expiry in S ⇒ replenish from B
- Claim: $n < |B| < n \log w$ with high probability
16Index-Set Approach
- Pick random index set $I = \{i_1, \ldots, i_n\} \subseteq \{0, 1, \ldots, w-1\}$
- Sample S: items $X_i$ with $i \in \{i_1, \ldots, i_n\} \pmod{w}$ in current window
- Example
- Suppose w = 2, n = 1, and I = {1}
- Then sample is always $X_i$ with odd i
- Memory: only O(n)
- Observe
- S is uniform random sample of each window
- But sample is periodic (union of arithmetic progressions)
- Correlation across successive windows
- Problems
- Correlation may hurt in some applications
- Some data (e.g. time-series) may be periodic
17Chain-Sample Algorithm
- Idea
- Fix expiry problem in Reservoir Sampling
- Advance planning for expiry of sampled items
- Focus on sample size 1: keep n independent such samples
- Chain-Sampling
- Add $X_t$ to S with probability $1/\min(t, w)$, evicting earlier sample
- Initially: standard Reservoir Sampling up to time w
- Pre-select $X_t$'s replacement $X_r \in W_{t+w} = \{X_{t+1}, \ldots, X_{t+w}\}$
- $X_t$ expires ⇒ must replace from $W_{t+w}$
- At time r, save $X_r$ and pre-select its own replacement ⇒ building chain of potential replacements
- Note: if evicting earlier sample, discard its chain as well
Example stream: 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
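A minimal sketch of chain-sampling for sample size 1, under the model that item $X_t$ expires at time t+w. The class name `ChainSample` and its fields are hypothetical; a sample of size n would keep n independent instances.

```python
import random
from collections import deque

class ChainSample:
    """Maintains one uniform sample of the last w items using a replacement chain."""

    def __init__(self, w, rng=None):
        self.w = w
        self.rng = rng if rng is not None else random.Random()
        self.t = 0
        self.chain = deque()      # (index, value); chain[0] is the current sample
        self.next_index = None    # pre-selected index replacing chain[-1]

    def add(self, value):
        self.t += 1
        t, w = self.t, self.w
        if self.rng.random() < 1.0 / min(t, w):
            # X_t becomes the sample: discard the earlier sample and its chain.
            self.chain.clear()
            self.chain.append((t, value))
            self.next_index = self.rng.randint(t + 1, t + w)
        elif t == self.next_index:
            # Save the pre-selected replacement and pre-select its own replacement.
            self.chain.append((t, value))
            self.next_index = self.rng.randint(t + 1, t + w)
        # Drop the head once it has expired (index <= t - w).
        while self.chain and self.chain[0][0] <= t - w:
            self.chain.popleft()

    def sample(self):
        return self.chain[0][1] if self.chain else None

if __name__ == "__main__":
    cs = ChainSample(w=5, rng=random.Random(0))
    for x in [3, 5, 1, 4, 6, 2, 8, 5, 2, 3]:
        cs.add(x)
        print(cs.sample(), len(cs.chain))
```

The chain is never empty after the first item: a sample at index i has its replacement stored no later than time i+w, which is exactly when it expires.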
19Expectation for Chain-Sample
- $T(x) = E[\text{chain length for } X_t \text{ at time } t+x]$
- $E[\text{chain length}] = T(w) \le e \approx 2.718$
- $E[\text{memory required for sample size } n] = O(n)$
20Tail Bound for Chain-Sample
- Chain: hops of total length at most w
- Chain of h hops ⇒ ordered (h+1)-partition of w
- h hops of total length less than w
- plus, remainder
- Each partition has probability $w^{-h}$
- Number of partitions
- $h = O(\log w)$ ⇒ probability of a partition is $O(w^{-c})$
- Thus memory O(n log w) with high probability
21Comparison of Algorithms
Algorithm       Expected      High-Probability
Periodic        O(n)          O(n)
Oversample      O(n log w)    O(n log w)
Chain-Sample    O(n)          O(n log w)
- Chain-Sample beats Oversample
- Expected memory: O(n) vs O(n log w)
- High-probability memory bound: both O(n log w)
- Oversample may have sample size shrink below n!
22Sketches and Frequency Moments
23Generalized Stream Model
- Input Element (i,a)
- a copies of domain-value i
- increment to ith dimension of m by a
- a need not be an integer
- Negative value captures deletions
On seeing element (i,a) = (1,-1): decrement $m_1$ by 1 (figure: frequency vector $m_0, m_1, m_2, m_3, m_4$)
25Frequency Moments
- Input Stream
- values from $U = \{0, 1, \ldots, N-1\}$
- frequency vector $m = (m_0, m_1, \ldots, m_{N-1})$
- Kth Frequency Moment: $F_k(m) = \sum_i m_i^k$
- $F_0$: number of distinct values (Lecture 15)
- $F_1$: stream size
- $F_2$: Gini index, self-join size, Euclidean norm
- $F_k$ for k > 2: measures skew, sometimes useful
- $F_\infty$: maximum frequency
- Problem: estimation in small space
- Sketches: randomized estimators
26Naive Approaches
- Space N: counter $m_i$ for each distinct value i
- Space O(1)
- if input sorted by i
- single counter recycled when new i value appears
- Goal
- Allow arbitrary input
- Use small (logarithmic) space
- Settle for randomization/approximation
27Sketching F2
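- Random Hash $h(i): \{0, 1, \ldots, N-1\} \to \{-1, +1\}$
- Define $Z_i = h(i)$
- Maintain $X = \sum_i m_i Z_i$
- Easy for update streams (i,a): just add $a Z_i$ to X
- Claim: $X^2$ is unbiased estimator for $F_2$
- Proof: $E[X^2] = E[(\sum_i m_i Z_i)^2] = E[\sum_i m_i^2 Z_i^2] + E[\sum_{i \ne j} m_i m_j Z_i Z_j] = \sum_i m_i^2 E[Z_i^2] + \sum_{i \ne j} m_i m_j E[Z_i] E[Z_j] = \sum_i m_i^2 + 0 = F_2$
- Last Line? $Z_i^2 = 1$ and $E[Z_i] = 0$ as $Z_i$ is uniform on $\{-1, +1\}$; $E[Z_i Z_j] = E[Z_i] E[Z_j]$ from independence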
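A minimal sketch of the single-counter $F_2$ estimator. For clarity it stores each sign $Z_i$ explicitly in a dictionary, which costs O(N) memory and is an assumption of this illustration only (the hash-function question is taken up later in the lecture); the class name `F2Sketch` is hypothetical.

```python
import random

class F2Sketch:
    """Maintains X = sum_i m_i * Z_i; X**2 is an unbiased estimate of F2."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._sign = {}   # Z_i drawn lazily; explicit storage is for illustration only
        self.x = 0

    def update(self, i, a=1):
        # Element (i, a): add a * Z_i to X. Negative a captures deletions.
        if i not in self._sign:
            self._sign[i] = self._rng.choice((-1, 1))
        self.x += a * self._sign[i]

    def estimate(self):
        return self.x ** 2

if __name__ == "__main__":
    stream = [1, 1, 2, 3]                      # m = (2, 1, 1), so F2 = 6
    estimates = []
    for seed in range(4000):
        sk = F2Sketch(seed)
        for v in stream:
            sk.update(v)
        estimates.append(sk.estimate())
    print(sum(estimates) / len(estimates))     # close to 6 on average
```

A single copy can be far from $F_2$; only the average over independent copies concentrates, which is what the error analysis quantifies.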
28Estimation Error?
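- Chebyshev bound
- Define $Y = X^2$ ⇒ $E[Y] = E[X^2] = \sum_i m_i^2 = F_2$
- Observe (each sum ranges over distinct indices, counting each monomial once): $E[X^4] = E[(\sum m_i Z_i)^4] = E[\sum m_i^4 Z_i^4] + 4E[\sum m_i m_j^3 Z_i Z_j^3] + 6E[\sum m_i^2 m_j^2 Z_i^2 Z_j^2] + 12E[\sum m_i m_j m_k^2 Z_i Z_j Z_k^2] + 24E[\sum m_i m_j m_k m_l Z_i Z_j Z_k Z_l] = \sum m_i^4 + 6\sum m_i^2 m_j^2$
- By definition: $\mathrm{Var}[Y] = E[Y^2] - E[Y]^2 = E[X^4] - E[X^2]^2 = [\sum m_i^4 + 6\sum m_i^2 m_j^2] - [\sum m_i^4 + 2\sum m_i^2 m_j^2] = 4\sum m_i^2 m_j^2 \le 2E[X^2]^2 = 2F_2^2$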
29Estimation Error?
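- Chebyshev bound
- $P[\text{relative estimation error} > \lambda] = P[|Y - F_2| > \lambda F_2] \le \mathrm{Var}[Y] / (\lambda^2 F_2^2) \le 2/\lambda^2$
- Problem: What if we want λ really small?
- Solution
- Compute $s = 8/\lambda^2$ independent copies of X
- Estimator $Y = \mathrm{mean}(X_i^2)$
- Variance reduces by factor s
- $P[\text{relative estimation error} > \lambda] \le 2F_2^2 / (s \lambda^2 F_2^2) = 1/4$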
30Boosting Technique
- Algorithm A: Randomized λ-approximate estimator f (true value $f^*$)
- $P[(1-\lambda) f^* \le f \le (1+\lambda) f^*] \ge 3/4$
- Heavy Tail Problem: $P[f^* - z,\ f^*,\ f^* + z] = [1/16,\ 3/4,\ 3/16]$
- Boosting Idea
- $O(\log 1/\varepsilon)$ independent estimates from A(X)
- Return median of estimates
- Claim: $P[\text{median is } \lambda\text{-approximate}] > 1 - \varepsilon$
- Proof
- $P[\text{specific estimate is } \lambda\text{-approximate}] \ge 3/4$
- Bad event only if > 50% estimates not λ-approximate
- Binomial tail probability less than ε
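The two-stage combination (average within a group to reduce variance, then take the median across groups to boost confidence) can be sketched as below. The function name `median_of_means` is hypothetical; `group_size` plays the role of s and the number of groups the role of $O(\log 1/\varepsilon)$.

```python
import statistics

def median_of_means(estimates, group_size):
    """Average consecutive groups of estimates, then return the median of the averages."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates) - group_size + 1, group_size)]
    return statistics.median([statistics.fmean(g) for g in groups])

if __name__ == "__main__":
    # One wild estimate ruins a plain mean but not the median of group means.
    ests = [1, 1, 1, 100, 1, 1, 1, 1, 1]
    print(statistics.fmean(ests), median_of_means(ests, 3))
```

The sketch needs at least `group_size` estimates; any incomplete trailing group is ignored.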
31Overall Space Requirement
- Observe
- Let $m = \sum_i m_i$
- Each hash needs O(log m)-bit counter
- $s = 8/\lambda^2$ hash functions for each estimator
- $O(\log 1/\varepsilon)$ such estimators
- Total: $O(\lambda^{-2} \log(1/\varepsilon) \log m)$ bits
- Question: Space for storing hash function?
32Sketching Paradigm
- Random Sketch: inner product
- frequency vector $m = (m_0, m_1, \ldots, m_{N-1})$
- random vector Z (currently, uniform on $\{-1, +1\}$)
- Observe
- Linearity ⇒ Sketch($m_1$) ± Sketch($m_2$) = Sketch($m_1 \pm m_2$)
- Ideal for distributed computing
- Observe
- Suppose: Given i, can efficiently generate $Z_i$
- Then can maintain sketch for update streams
- Problem
- Must generate $Z_i = h(i)$ on first appearance of i
- Need O(N) memory to store h explicitly
- Need O(N) random bits
33Two birds, One stone
- Pairwise Independent $Z_1, Z_2, \ldots, Z_n$
- for all $Z_i$ and $Z_k$: $P[Z_i = x, Z_k = y] = P[Z_i = x] \cdot P[Z_k = y]$
- property: $E[Z_i Z_k] = E[Z_i] \cdot E[Z_k]$
- Example: linear hash function
- Seed $S = \langle a, b \rangle$ from [0..p-1], where p is prime
- $Z_i = h(i) = ai + b \pmod{p}$
- Claim: $Z_1, Z_2, \ldots, Z_n$ are pairwise independent
- $Z_i = x$ and $Z_k = y$ ⇔ $x = ai + b \pmod{p}$ and $y = ak + b \pmod{p}$
- fixing i, k, x, y ⇒ unique solution for a, b
- $P[Z_i = x, Z_k = y] = 1/p^2 = P[Z_i = x] \cdot P[Z_k = y]$
- Memory/Randomness: n log p → 2 log p
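The pairwise-independence claim can be checked exhaustively for a small prime: enumerating all $p^2$ seeds (a, b) and counting how often each pair (x, y) occurs. The helper names below are hypothetical, and the check assumes i and k are distinct modulo p.

```python
from itertools import product

def linear_hash(a, b, p, i):
    """h(i) = a*i + b (mod p) for seed (a, b)."""
    return (a * i + b) % p

def joint_counts(p, i, k):
    """For every seed (a, b) in [0..p-1]^2, count occurrences of the pair (h(i), h(k))."""
    counts = {}
    for a, b in product(range(p), repeat=2):
        key = (linear_hash(a, b, p, i), linear_hash(a, b, p, k))
        counts[key] = counts.get(key, 0) + 1
    return counts

if __name__ == "__main__":
    counts = joint_counts(7, 2, 5)
    # Every one of the 49 pairs (x, y) arises from exactly one seed, so P = 1/p^2.
    print(len(counts), set(counts.values()))
```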
34Wait a minute!
- Doesn't pairwise independence screw up proofs?
- No: $E[X^2]$ calculation only has degree-2 terms
- But what about $\mathrm{Var}[X^2]$?
- Need 4-wise independence
35Application Join-Size Estimation
- Given
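- Join attribute frequencies $f_1$ and $f_2$
- Join size = $f_1 \cdot f_2$
- Define $X_1 = f_1 \cdot Z$ and $X_2 = f_2 \cdot Z$
- Choose Z as 4-wise independent and uniform on $\{-1, +1\}$
- Exercise: Show, as before,
- $E[X_1 X_2] = f_1 \cdot f_2$
- $\mathrm{Var}[X_1 X_2] \le 2 |f_1|^2 |f_2|^2$
- Hint: $|a \cdot b| \le |a| \cdot |b|$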
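A minimal sketch of the basic join-size estimator $X_1 X_2$. It draws fully independent signs from a seeded generator, which is stronger than the 4-wise independence the analysis needs and is an assumption made for simplicity; the function name is hypothetical.

```python
import random

def join_size_estimate(f1, f2, seed):
    """One basic estimate X1 * X2, with X1 = f1 . Z and X2 = f2 . Z sharing the same Z."""
    rng = random.Random(seed)
    z = [rng.choice((-1, 1)) for _ in range(len(f1))]   # shared random sign vector
    x1 = sum(f * s for f, s in zip(f1, z))
    x2 = sum(f * s for f, s in zip(f2, z))
    return x1 * x2

if __name__ == "__main__":
    f1 = [3, 0, 2, 1]
    f2 = [1, 4, 0, 2]            # true join size f1 . f2 = 5
    ests = [join_size_estimate(f1, f2, seed) for seed in range(5000)]
    print(sum(ests) / len(ests))  # close to 5 on average
```

Both sketches must use the same Z; with independent sign vectors the product would have expectation 0 rather than $f_1 \cdot f_2$.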
36Bounding Error Probability
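- Using s copies of X's and taking their mean Y
- $\Pr[|Y - f_1 \cdot f_2| \ge \lambda\, f_1 \cdot f_2] \le \mathrm{Var}(Y) / (\lambda^2 (f_1 \cdot f_2)^2) \le 2|f_1|^2 |f_2|^2 / (s \lambda^2 (f_1 \cdot f_2)^2) = 2 / (s \lambda^2 \cos^2\theta)$
- Bounding error probability?
- Need $s > 2 / (\lambda^2 \cos^2\theta)$
- Memory? $O(\log(1/\varepsilon) \cos^{-2}\theta\ \lambda^{-2} (\log N + \log m))$
- Problem
- To choose s need a-priori lower bound on $\cos\theta = f_1 \cdot f_2 / (|f_1| |f_2|)$
- What if cos θ really small?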
37Sketch Partitioning
Idea for dealing with the $|f_1|^2 |f_2|^2 / (f_1 \cdot f_2)^2$ issue: partition domain into regions where self-join size is smaller, to compensate for small join size (cos θ)
self-join(R1.A) × self-join(R2.B) = 205 × 205 ≈ 42K
self-join(R1.A) × self-join(R2.B) = 200×5 + 200×5
38Sketch Partitioning
- Idea
- intelligently partition join-attribute space
- need coarse statistics on stream
- build independent sketches for each partition
- Estimate = Σ partition sketches
- Variance = Σ partition variances
39Sketch Partitioning
- Partition Space Allocation?
- Can solve optimally, given domain partition
- Optimal Partition: Find K-partition to minimize
- Results
- Dynamic Programming: optimal solution for single join
- NP-hard for queries with multiple joins
40Fk for k > 2
- Assume stream length m is known (Exercise: Show can fix with log m space overhead by repeated-doubling estimate of m.)
- Choose random stream item $a_p$, with p uniform from $\{1, 2, \ldots, m\}$
- Suppose $a_p = v \in \{0, 1, \ldots, N-1\}$
- Count subsequent frequency of v
- $r = |\{q : q \ge p,\ a_q = v\}|$
- Define $X = m(r^k - (r-1)^k)$
- Stream
- 7,8,5,1,7,5,2,1,5,4,5,10,6,5,4,1,4,7,3,8
- m = 20
- p = 9
- $a_p$ = 5
- r = 3
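The worked example above can be reproduced directly. The sketch below computes the basic estimator for a fixed position p and also checks, by summing over all m choices of p, that its average equals $F_k$ exactly; function names are hypothetical.

```python
from collections import Counter

def fk_basic_estimator(stream, p, k):
    """X = m * (r^k - (r-1)^k), where r counts occurrences of a_p at positions q >= p (1-indexed)."""
    m = len(stream)
    v = stream[p - 1]
    r = sum(1 for x in stream[p - 1:] if x == v)
    return m * (r ** k - (r - 1) ** k)

def fk_exact(stream, k):
    """Exact k-th frequency moment, for comparison."""
    return sum(c ** k for c in Counter(stream).values())

stream = [7, 8, 5, 1, 7, 5, 2, 1, 5, 4, 5, 10, 6, 5, 4, 1, 4, 7, 3, 8]

if __name__ == "__main__":
    print(fk_basic_estimator(stream, p=9, k=2))   # m=20, r=3: 20 * (9 - 4) = 100
    mean_x = sum(fk_basic_estimator(stream, p, 2) for p in range(1, 21)) / 20
    print(mean_x, fk_exact(stream, 2))            # the average over p equals F2
```

The telescoping sum $\sum_{r=1}^{m_v} (r^k - (r-1)^k) = m_v^k$ for each value v is what makes the estimator unbiased.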
42Fk for k > 2
- $\mathrm{Var}(X) \le k N^{1-1/k} F_k^2$
- Bounded Error Probability ⇒ $s = O(k N^{1-1/k} / \lambda^2)$
- Boosting ⇒ memory bound
- $O(k N^{1-1/k} \lambda^{-2} (\log 1/\varepsilon)(\log N + \log m))$
Summing over m choices of stream elements
43Frequency Moments
- $F_0$: distinct values problem (Lecture 15)
- $F_1$: sequence length
- for case with deletions, use Cauchy distribution
- $F_2$: self-join size/Gini index (Today)
- $F_k$ for k > 2
- omitting grungy details
- can achieve space bound
- $O(k N^{1-1/k} \lambda^{-2} (\log 1/\varepsilon)(\log N + \log m))$
- $F_\infty$: maximum frequency
44Communication Complexity
- Cooperatively compute function f(A,B)
- Minimize bits communicated
- Unbounded computational power
- Communication Complexity C(f): bits exchanged by optimal protocol Π
- Protocols?
- 1-way versus 2-way
- deterministic versus randomized
- $C_\delta(f)$: randomized complexity for error probability δ
ALICE: input A
BOB: input B
45Streaming Communication Complexity
- Stream Algorithm ⇒ 1-way communication protocol
- Simulation Argument
- Given algorithm S computing f over streams
- Alice initiates S, providing A as input stream prefix
- Communicates to Bob S's state after seeing A
- Bob resumes S, providing B as input stream suffix
- Theorem: Stream algorithm's space requirement is at least the communication complexity C(f)
46Example: Set Disjointness
- Set Disjointness (DIS)
- A, B subsets of $\{1, 2, \ldots, N\}$
- Output: whether $A \cap B = \emptyset$
- Theorem: $C_\delta(\mathrm{DIS}) = \Omega(N)$, for any δ < 1/2
47Lower Bound for $F_\infty$
- Theorem: Fix ε < 1/3, δ < 1/2. Any stream algorithm S with
- $P[(1-\varepsilon) F_\infty < S < (1+\varepsilon) F_\infty] > 1 - \delta$
- needs Ω(N) space
- Proof
- Claim: S ⇒ 1-way protocol for DIS (on any sets A and B)
- Alice streams set A to S
- Communicates S's state to Bob
- Bob streams set B to S
- Observe: $F_\infty = 1$ if A and B are disjoint, 2 otherwise
- Relative Error ε < 1/3 ⇒ DIS solved exactly!
- $P[\text{error}] < \delta < 1/2$ ⇒ Ω(N) space
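The reduction in the proof can be illustrated with exact computation: streaming A followed by B gives maximum frequency 1 exactly when the sets are disjoint. This is only a sanity check of the observation, not a streaming algorithm; the function names are hypothetical.

```python
from collections import Counter

def f_infinity(stream):
    """Maximum frequency of any value in the stream (0 for an empty stream)."""
    return max(Counter(stream).values(), default=0)

def disjoint_via_f_infinity(a, b):
    """Alice streams set A, Bob streams set B; the sets are disjoint iff F_inf <= 1."""
    return f_infinity(list(a) + list(b)) <= 1

if __name__ == "__main__":
    print(disjoint_via_f_infinity({1, 2}, {3, 4}))   # disjoint
    print(disjoint_via_f_infinity({1, 2}, {2, 3}))   # intersecting
```

Because the only possible answers are 1 and 2, any estimate with relative error below 1/3 separates the two cases.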
- Observe
- Used only 1-way communication in proof
- $C_\delta(\mathrm{DIS})$ bound was for arbitrary communication
- Exercise: extend lower bound to multi-pass algorithms
- Lower Bound for $F_k$, k > 2
- Need to increase gap beyond 2
- Multiparty Set Disjointness: t players
- Theorem: Fix ε, δ < 1/2 and k > 5. Any stream algorithm S with
- $P[(1-\varepsilon) F_k < S < (1+\varepsilon) F_k] > 1 - \delta$
- needs $\Omega(N^{1-(2+\delta)/k})$ space
- Implies $\Omega(N^{1/2})$ even for multi-pass algorithms
49Tracking High-Frequency Items
50Problem 1: Top-K List [Charikar-Chen-Farach-Colton]
- The Google Problem
- Return list of k most frequent items in stream
- Motivation
- search engine queries, network traffic, …
- Remember
- Saw lower bound recently!
- Solution
- Data structure Count-Sketch ⇒ maintaining count-estimates of high-frequency elements
- Notation
- Assume $\{1, 2, \ldots, N\}$ in order of frequency
- $m_i$ is frequency of ith most frequent element
- $m = \sum_i m_i$ is number of elements in stream
- FindCandidateTop
- Input: stream S, int k, int p
- Output: list of p elements containing top k
- Naive sampling gives solution with $p = O(m \log k / m_k)$
- FindApproxTop
- Input: stream S, int k, real ε
- Output: list of k elements, each of frequency $m_i > (1-\varepsilon) m_k$
- Naive sampling gives no solution
52Main Idea
- Consider
- single counter X
- hash function $h(i): \{1, 2, \ldots, N\} \to \{-1, +1\}$
- Input element i ⇒ update counter $X \mathrel{+}= Z_i = h(i)$
- For each r, use $X Z_r$ as estimator of $m_r$
- Theorem: $E[X Z_r] = m_r$
- Proof
- $X = \sum_i m_i Z_i$
- $E[X Z_r] = E[\sum_i m_i Z_i Z_r] = \sum_{i \ne r} m_i E[Z_i Z_r] + m_r E[Z_r^2] = m_r$
- Cross-terms cancel
53Finding Max Frequency Element
- Problem: $\mathrm{var}[X] = F_2 = \sum_i m_i^2$
- Idea: t counters, independent 4-wise hashes $h_1, \ldots, h_t$
- Use $t = O(\log m \cdot \sum_i m_i^2 / (\varepsilon m_1)^2)$
- Claim: New Variance $< \sum_i m_i^2 / t = (\varepsilon m_1)^2 / \log m$
- Overall Estimator
- repeat + median of averages
- with high probability, approximate $m_1$
$h_1: i \to \{+1, -1\}$
$h_t: i \to \{+1, -1\}$
54Problem with Array of Counters
- Variance dominated by highest frequency
- Estimates for less-frequent elements like k
- corrupted by higher frequencies
- variance >> $m_k$
- Avoiding Collisions?
- spread out high frequency elements
- replace each counter with hashtable of b counters
55Count Sketch
- Hash Functions
- 4-wise independent hashes $h_1, \ldots, h_t$ and $s_1, \ldots, s_t$
- hashes independent of each other
- Data structure: hashtables of counters X(r,c)
(figure: t hashtables, each with counters 1, 2, …, b)
56Overall Algorithm
- $s_r(i)$: one of b counters in rth hashtable
- Input i ⇒ for each r, update $X(r, s_r(i)) \mathrel{+}= h_r(i)$
- Estimator($m_i$) = $\mathrm{median}_r \{ X(r, s_r(i)) \cdot h_r(i) \}$
- Maintain heap of k top elements seen so far
- Observe
- Not completely eliminated collision with high frequency items
- Few of estimates $X(r, s_r(i)) \cdot h_r(i)$ could have high variance
- Median not sensitive to these poor estimates
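A minimal sketch of the Count-Sketch update and estimate rules. The bucket and sign functions are built from Python's built-in `hash` on salted tuples, a stand-in assumed for illustration in place of true 4-wise independent hash families; the class name and the heap of top-k items (omitted here) are not taken from the slides' code.

```python
import random
import statistics

class CountSketch:
    """t hashtables of b counters; estimate(i) = median over r of X(r, s_r(i)) * h_r(i)."""

    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.table = [[0] * b for _ in range(t)]
        # One (bucket salt, sign salt) pair per row; stand-ins for s_r and h_r.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(t)]

    def _bucket(self, r, i):
        return hash((self.salts[r][0], i)) % self.b           # s_r(i)

    def _sign(self, r, i):
        return 1 if hash((self.salts[r][1], i)) & 1 else -1   # h_r(i)

    def update(self, i, a=1):
        for r in range(self.t):
            self.table[r][self._bucket(r, i)] += a * self._sign(r, i)

    def estimate(self, i):
        return statistics.median(
            [self.table[r][self._bucket(r, i)] * self._sign(r, i) for r in range(self.t)]
        )

if __name__ == "__main__":
    cs = CountSketch(t=5, b=64)
    for item in [1] * 100 + list(range(2, 50)):
        cs.update(item)
    print(cs.estimate(1))   # near 100 unless most rows collide badly
```

With a single distinct item in the stream there are no collisions, so every row returns the exact count.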
57Avoiding Large Items
- b > O(k) ⇒ with probability Ω(1), no collision with top-k elements
- t hashtables represent independent trials
- Need $\log(m/\delta)$ trials to estimate with probability 1-δ
- Also need small variance for colliding small elements
- Claim
- $P[\text{variance due to small items in each estimate} < (\sum_{i>k} m_i^2)/b] = \Omega(1)$
- Final bound: $b = O(k + \sum_{i>k} m_i^2 / (\varepsilon m_k)^2)$
58Final Results
- Zipfian Distribution: $m_i \propto 1/i^{\alpha}$ [Power Law]
- FindApproxTop
- $[k + (\sum_{i>k} m_i^2) / (\varepsilon m_k)^2] \log(m/\delta)$
- Roughly: sampling bound with frequencies squared
- Zipfian gives improved results
- FindCandidateTop
- Zipf parameter 0.5
- O(k log N log m)
- Compare: sampling bound $O((kN)^{0.5} \log k)$
59Problem 2: Elephants-and-Ants [Manku-Motwani]
- Identify items whose current frequency exceeds support threshold s = 0.1%.
- [Jacobson 2000, Estan-Verghese 2001]
60Algorithm 1 Lossy Counting
Step 1: Divide the stream into windows
Window-size w is a function of support s (specified later)
61Lossy Counting in Action ...
62Lossy Counting (continued)
63Error Analysis
How much do we undercount?
If current size of stream = N and window-size w = 1/ε, then #windows = εN ⇒ frequency error ≤ #windows = εN
Rule of thumb: Set ε = 10% of support s
Example: Given support frequency s = 1%, set error frequency ε = 0.1%
64Putting it all together
Output: Elements with counter values exceeding (s-ε)N
Approximation guarantees: Frequencies underestimated by at most εN. No false negatives. False positives have true frequency at least (s-ε)N
- How many counters do we need?
- Worst case bound: (1/ε) log(εN) counters
- Implementation details
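A minimal sketch of lossy counting consistent with the error analysis above: count every item, and at each window boundary decrement all counters by one, dropping those that reach zero. The per-window decrement is an assumption about the steps shown in the missing figures; function names are hypothetical.

```python
import math

def lossy_count(stream, eps):
    """Return (counters, N). Each surviving counter undercounts by at most eps * N."""
    w = math.ceil(1 / eps)            # window size w = 1/eps
    counts = {}
    n = 0
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if n % w == 0:
            # Window boundary: decrement every counter, drop the zeros.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts, n

def frequent_items(counts, n, s, eps):
    """Items whose counter is at least (s - eps) * N; no false negatives."""
    return {x for x, c in counts.items() if c >= (s - eps) * n}

if __name__ == "__main__":
    stream = []
    for i in range(50):
        stream += ["a", i]            # "a" has frequency 50%, everything else 1%
    counts, n = lossy_count(stream, eps=0.1)
    print(frequent_items(counts, n, s=0.3, eps=0.1))
```

An item is decremented at most once per window, so its undercount is bounded by the number of windows, εN.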
65Number of Counters?
- Window size $w = 1/\varepsilon$
- Number of windows $m = \varepsilon N$
- $n_i$: number of counters alive over last i windows
- Fact
- Claim
- Counter must average 1 increment/window to survive
- Number of active counters
Frequency Errors: For counter (X, c), true frequency in [c, c+εN]
Trick: Track number of windows t counter has been active. For counter (X, c, t), true frequency in [c, c+t-1]
If (t = 1), no error!
Batch Processing: Decrements after k windows
67Algorithm 2 Sticky Sampling
- Create counters by sampling
- Maintain exact counts thereafter
What is sampling rate?
68Sticky Sampling (continued)
For finite stream of length N: Sampling rate = $(2/(\varepsilon N)) \log(1/(s\delta))$
δ = probability of failure
Output: Elements with counter values exceeding (s-ε)N
Same Rule of thumb: Set ε = 10% of support s
Example: Given support threshold s = 1%, set error threshold ε = 0.1%, set failure probability δ = 0.01%
69Number of counters?
Finite stream of length N: Sampling rate $(2/(\varepsilon N)) \log(1/(s\delta))$
Infinite stream with unknown N: Gradually adjust sampling rate
In either case, Expected number of counters = $(2/\varepsilon) \log(1/(s\delta))$
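A minimal sketch of sticky sampling for a finite stream of known length N: an item without a counter gets one with the sampling rate given above, and an item with a counter is counted exactly from then on. Taking the logarithm as natural and capping the rate at 1 are assumptions of this sketch; the function name is hypothetical, and the gradual rate adjustment for unknown N is not shown.

```python
import math
import random

def sticky_sample(stream, n_total, s, eps, delta, rng=None):
    """Return items whose counter is at least (s - eps) * N, for a stream of known length N."""
    rng = rng if rng is not None else random.Random()
    # Sampling rate (2 / (eps * N)) * log(1 / (s * delta)), capped at 1.
    rate = min(1.0, (2.0 / (eps * n_total)) * math.log(1.0 / (s * delta)))
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1                 # exact counting once a counter exists
        elif rng.random() < rate:
            counts[item] = 1                  # create counter by sampling
    return {x for x, c in counts.items() if c >= (s - eps) * n_total}

if __name__ == "__main__":
    stream = []
    for i in range(10000):
        stream += ["h", i]                    # "h" makes up half of the stream
    print(sticky_sample(stream, len(stream), s=0.2, eps=0.05, delta=0.01,
                        rng=random.Random(0)))
```

A frequent item is missed only if it fails to be sampled during its first εN occurrences, which is where the failure probability δ enters.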
70References: Synopses
- Synopsis data structures for massive data sets. Gibbons and Matias. DIMACS 1999.
- Tracking Join and Self-Join Sizes in Limited Storage. Alon, Gibbons, Matias, and Szegedy. PODS 1999.
- Join Synopses for Approximate Query Answering. Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD 1999.
- Random Sampling for Histogram Construction: How much is enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
- Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. Manku, Rajagopalan, and Lindsay. SIGMOD 1999.
- Space-efficient online computation of quantile summaries. Greenwald and Khanna. SIGMOD 2001.
71References: Sampling
- Random Sampling with a Reservoir. Vitter. Transactions on Mathematical Software 11(1):37-57 (1985).
- On Sampling and Relational Operators. Chaudhuri and Motwani. Bulletin of the Technical Committee on Data Engineering (1999).
- On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya. SIGMOD 1999.
- Congressional Samples for Approximate Answering of Group-By Queries. Acharya, Gibbons, and Poosala. SIGMOD 2000.
- Overcoming Limitations of Sampling for Aggregation Queries. Chaudhuri, Das, Datar, Motwani, and Narasayya. ICDE 2001.
- A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries. Chaudhuri, Das, and Narasayya. SIGMOD 2001.
- Sampling From a Moving Window Over Streaming Data. Babcock, Datar, and Motwani. SODA 2002.
- Sampling algorithms: lower bounds and applications. Bar-Yossef, Kumar, and Sivakumar. STOC 2001.
72References: Sketches
- Probabilistic counting algorithms for data base applications. Flajolet and Martin. JCSS (1985).
- The space complexity of approximating the frequency moments. Alon, Matias, and Szegedy. STOC 1996.
- Approximate Frequency Counts over Streaming Data. Manku and Motwani. VLDB 2002.
- Finding Frequent Items in Data Streams. Charikar, Chen, and Farach-Colton. ICALP 2002.
- An Approximate L1-Difference Algorithm for Massive Data Streams. Feigenbaum, Kannan, Strauss, and Viswanathan. FOCS 1999.
- Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Indyk. FOCS 2000.