Title: G td
1 Approximate Query Evaluation in Large-Scale Computing Systems
Yannis Kotidis, AT&T Labs-Research, http://www.research.att.com/info/kotidis
2 Outline
- Introduction
- applications of approximate query evaluation
- problem definition
- Haar Wavelets
- definition, examples
- a simple on-line algorithm
- Approximate Computation of Wavelets (VLDB2001)
- JL-embeddings, sketches
- computing wavelets via sketches
- applications with real data
- Newer algorithms (STOC2002, VLDB2002)
- Conclusions
3 Data → Analysis → Knowledge
- Collecting and analyzing information gives businesses a strategic advantage
- market-share analysis, analysis of customer purchases
- correlation with demographic characteristics, targeted marketing
- Manageable volumes, relatively slow rate of data movement
- The widespread growth of telecommunications has revolutionized the rate at which data is created and moved
- it often exceeds the storage capacity of existing systems
4 Telephone network (AT&T)
- Central control and analysis system
- 200-300 million calls per day (>60 GB)
- 200 billion records (50 TB)
- Query evaluation is time-consuming
- Communities Of Interest: which are the 10 numbers with the highest frequency of calls from 9733340865?
- what was the distribution of long-distance calls per geographic region over the last six months?
- what is the average duration of a phone call in the 10 largest cities of the country?
5 IP network
Backbone router
Gateway router
Access router
- More data
- Faster rate of data movement
- e.g. CISCO NetFlow: 150 records/day/router
- Shipping the data is uneconomical or impossible
- as much as 97% of the data is lost in transit
6 Approximate query evaluation
Query
Exact answer
GB/TB
- Exact answers are not always necessary!
- for an initial analysis we mainly care about the strong trends
- in group-by queries, accuracy in the leading significant digits is enough
- What percentage of all phone calls are made in Attica?
7 Simplified data model
- Array a_i, 1 ≤ i ≤ N
- number of calls from number i (N = 10^10)
- number of packets from IP address i (N = 2^32)
Data stream of updates (i, d_i):
(973) 360-7212, 6
(973) 360-8347, 7
(973) 360-8408, 1
(973) 360-7212, 1
(973) 360-8404, 9
(973) 360-8404, 1
(973) 360-7212, 7
(973) 360-8347, 1
8 The problem
- Describe the array a in space << N
- Process the data in one pass
- Update in real time
- Approximate query evaluation within predetermined bounds
[Figure: at an observation point, the data is summarized into a sketch (KB/MB)]
9 Outline
- Introduction
- applications of approximate query evaluation
- problem definition
- Haar Wavelets
- definition, examples
- a simple on-line algorithm
- Approximate Computation of Wavelets (VLDB2001)
- JL-embeddings, sketches
- computing wavelets via sketches
- applications with real data
- Newer algorithms (STOC2002, VLDB2002)
- Conclusions
10 Introduction to Wavelets
- Wavelets: a mathematical transform that uses a predetermined basis (e.g. Haar, Daubechies-4, Daubechies-6, Coifman, Morlet, Gabor)
- Haar wavelets: the simplest to implement
- recursive computation of differences and sums over dyadic segments
Resolution   Averages                       Wavelets
3            a = 2, 2, 0, 6, 4, 2, 2, 0     ----
2            2, 3, 3, 1                     0, 3, -1, -1
1            2.5, 2                         0.5, -1
0            2.25                           -0.25
- The definition extends easily to multidimensional data
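The resolution table can be reproduced with a short script. This is a minimal sketch, assuming the unnormalized convention the example appears to use (pairwise average (x+y)/2 and detail (y-x)/2); the function name `haar_transform` is illustrative and does not come from the slides.

```python
def haar_transform(values):
    """Unnormalized Haar transform of a list whose length is a power of two."""
    averages = [float(v) for v in values]
    details = []  # detail coefficients, coarsest level first
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level = [(y - x) / 2 for x, y in pairs]      # half-differences
        averages = [(x + y) / 2 for x, y in pairs]   # pairwise averages
        details = level + details
    return averages + details


coeffs = haar_transform([2, 2, 0, 6, 4, 2, 2, 0])
print(coeffs)  # [2.25, -0.25, 0.5, -1.0, 0.0, 3.0, -1.0, -1.0]
```

The output lists the overall average first, followed by the detail coefficients from the coarsest to the finest resolution.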
11 Compression via Wavelets
- Keep B << N values (the largest)
- With B = 2:
2.25, -0.25, 0.5, -1, 0, 3, -1, -1
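To see what keeping B = 2 coefficients does to the data, the sketch below zeroes all but the B largest coefficients and inverts the transform. It assumes coefficients are ranked by raw magnitude in the unnormalized convention of the example (an orthonormal basis would rescale each level before ranking; for this example both rankings pick the same two coefficients). All function names are illustrative.

```python
def haar_transform(values):
    """Unnormalized Haar transform: averages (x+y)/2, details (y-x)/2."""
    averages, details = [float(v) for v in values], []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(y - x) / 2 for x, y in pairs] + details
        averages = [(x + y) / 2 for x, y in pairs]
    return averages + details


def inverse_haar(coeffs):
    """Rebuild the data: each (average, detail) expands to (avg - d, avg + d)."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        n = len(values)
        level = coeffs[pos:pos + n]
        pos += n
        values = [v for avg, d in zip(values, level) for v in (avg - d, avg + d)]
    return values


def keep_top_b(coeffs, b):
    """Zero every coefficient except the b largest in absolute value."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:b])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


data = [2, 2, 0, 6, 4, 2, 2, 0]
compressed = keep_top_b(haar_transform(data), 2)
approx = inverse_haar(compressed)
print(compressed)  # [2.25, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0]
print(approx)      # [2.25, 2.25, -0.75, 5.25, 2.25, 2.25, 2.25, 2.25]
```

With only two values kept, the reconstruction preserves the overall level and the single sharpest local change (between the third and fourth entries).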
12 On-line algorithm (simple model)
- We scan the array from left to right
- We keep the B largest wavelets in a heap, plus the log N active ones in memory
[Figure: array a1 ... a8 read left to right, feeding a heap (top-B)]
N >> available memory
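The one-pass scheme can be sketched as follows. This is an illustration under stated assumptions, not the slides' implementation: values arrive in index order, N is a power of two, coefficients use the unnormalized (y-x)/2 convention and are ranked by raw magnitude. The pending partial averages play the role of the "active" coefficients; there are never more than log2(N) + 1 of them. The name `streaming_top_b` is hypothetical.

```python
import heapq


def streaming_top_b(stream, b):
    heap = []    # min-heap of (|coefficient|, arrival order, coefficient)
    stack = []   # pending (level, average) pairs, at most log2(N) + 1 entries
    offered = 0

    def offer(coef):
        nonlocal offered
        offered += 1
        item = (abs(coef), offered, coef)
        if len(heap) < b:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)  # drop the smallest of the b + 1

    for x in stream:
        stack.append((0, float(x)))
        # Two pending averages at the same level complete a dyadic segment.
        while len(stack) >= 2 and stack[-1][0] == stack[-2][0]:
            level, right = stack.pop()
            _, left = stack.pop()
            offer((right - left) / 2)
            stack.append((level + 1, (left + right) / 2))
    if stack:
        offer(stack[-1][1])  # overall average (one entry left when N = 2^m)
    return sorted((c for _, _, c in heap), key=abs, reverse=True)


print(streaming_top_b([2, 2, 0, 6, 4, 2, 2, 0], 2))  # [3.0, 2.25]
```

Memory use is b heap entries plus the stack, which matches the "B largest plus log N active" budget described on the slide.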
13 Good and bad news
- The B largest wavelets can be computed with O(B + log N) memory
- IEEE TKDE
- Any deterministic algorithm that computes the largest wavelet (other than the overall average) in the general model requires Ω(N/polylog(N)) memory
14 General direction
- We will use randomized algorithms
- they approximate the solution with high probability of success
- Approximate computation of the wavelets with
- (cumulative) error 1±ε (e.g. 10%)
- success probability 1-δ (e.g. 99%)
- poly-logarithmic memory requirements (and complexity)
- poly(log N, log(1/δ), ε)
15 Observation 1
- The wavelet w_l is the inner product of the data a with a basis vector ψ_l
[Figure: orthonormal wavelet basis, vectors ψ1 ... ψ8]
16 Observation 2
- The inner product of 2 unit vectors can be computed from their distance
<a,b> = cos(a,b) = 1 - dist^2(a,b)/2
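The identity follows from dist^2(a,b) = |a|^2 + |b|^2 - 2<a,b>, which equals 2 - 2<a,b> for unit vectors. A quick numerical check, using the running example and one normalized Haar basis vector as inputs (both chosen here purely for illustration):

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


a = unit([2, 2, 0, 6, 4, 2, 2, 0])
b = unit([0, 0, -1, 1, 0, 0, 0, 0])

inner = sum(x * y for x, y in zip(a, b))
dist2 = sum((x - y) ** 2 for x, y in zip(a, b))
from_distance = 1 - dist2 / 2

print(inner, from_distance)  # the two values agree up to rounding
```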
17 Another view
- Map the data and the wavelet basis into R^N (N+1 points)
18 JL-embeddings
- Johnson, Lindenstrauss 84
- N points can be mapped into O((log N)/ε^2) dimensions so that the distances between them are preserved with error ±ε
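19 Sketches
- e.g. Alon96: inner products of a with O(log(N/δ)/ε^2) pseudorandom {-1,1} vectors
[Figure: the data a_i and sketch(a), built from pseudorandom vectors]
r1i = 1, -1, -1, 1, -1, 1, 1, 1
r2i = -1, 1, 1, -1, 1, 1, -1, -1
r3i = 1, 1, -1, 1, -1, -1, 1, -1
Values shown in the figure: 2, 8, -2, 0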
20 Properties of Sketches
sketch(a)
X1 X2 X3 X4 X5 . . . . . . . Xk
X_i^2 is an unbiased estimate of the squared 2-norm of a
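The unbiasedness claim can be checked exactly on a small array by averaging X^2 over every possible sign vector. This sketch assumes fully independent ±1 signs, which is stronger than what a practical sketch needs; the data array is the running example.

```python
from itertools import product

a = [2, 2, 0, 6, 4, 2, 2, 0]
squared_norm = sum(x * x for x in a)  # 68

total = 0
count = 0
for signs in product((-1, 1), repeat=len(a)):
    x = sum(s * v for s, v in zip(signs, a))  # one sketch entry X = <a, r>
    total += x * x
    count += 1

mean_of_squares = total / count
print(squared_norm, mean_of_squares)  # 68 68.0
```

The cross terms a_i a_j cancel in expectation, which is why the average of X^2 lands exactly on the sum of squares.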
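21 Boosting: median of means
X1 X2 X3 X4 X5 . . . . . . . Xk
[Figure: the X values grouped in blocks of µ; k = λµ]
22 Boosting: median of means
X1 X2 X3 X4 X5 . . . . . . . Xk
[Figure: the X values grouped in blocks of µ; k = λµ]
Prob[ |median - ||a||^2| ≤ (4/µ^(1/2)) ||a||^2 ] > 1 - 2^(-λ/2), with ε = 4/µ^(1/2) and δ = 2^(-λ/2)
The length of a can be computed with accuracy ε, with success probability 1-δ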
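The boosting step can be sketched as: average µ independent copies of X^2 to shrink the variance, then take the median of λ such averages to shrink the failure probability. The code below is an illustration only; it draws fully independent signs from Python's `random` module with a fixed seed, and the function name and parameter values are hypothetical.

```python
import random
import statistics


def median_of_means_norm2(a, mu, lam, seed=0):
    """Estimate the squared 2-norm of a from lam groups of mu sketch entries."""
    rng = random.Random(seed)
    group_means = []
    for _ in range(lam):
        squares = []
        for _ in range(mu):
            x = sum(rng.choice((-1, 1)) * v for v in a)  # X = <a, r>
            squares.append(x * x)
        group_means.append(sum(squares) / mu)
    return statistics.median(group_means)


a = [2, 2, 0, 6, 4, 2, 2, 0]
truth = sum(v * v for v in a)                 # 68
estimate = median_of_means_norm2(a, mu=64, lam=9)
epsilon = 4 / 64 ** 0.5                       # 0.5, from the bound on the slide
print(truth, estimate, epsilon)
```

With µ = 64 the bound allows a relative error of up to 0.5; in practice the estimate is usually much closer than that.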
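23 Sketch of a Wavelet
[Figure: a basis vector ψl (values 1, -1; positions labeled B, A, C) against a pseudorandom vector rk]
rk = 1, 1, -1, 1, -1, -1, -1, 1, 1, 1, -1, -1, 1, -1, 1, -1
- 2nd order Reed-Muller codes for the sums in O(log^3(N))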
24 Wavelets from Sketches
array
N
wavelet basis vector
25 The algorithm (VLDB2001)
- Input: sketch(a), i
- Output: wavelet w_i
- Compute the sketch(ψ_i) of the basis vector
- Compute Y = ||a||^2 from sketch(a)
- Compute sketch(a' = a/Y^(1/2)) = sketch(a)/Y^(1/2)
- Compute cos(a,ψ_i) = 1 - dist^2(a',ψ_i)/2 via sketch(a' - ψ_i)
- return w_i = Y^(1/2) cos(a,ψ_i)
- Memory: O(B log^2(N) log(N/δ)/η/ε^3)
pseudorandom variables
sketch
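The steps above can be sketched end to end on the running example. This is a simplified illustration, not the VLDB2001 implementation: it uses fully independent ±1 signs from Python's `random` module and plain means instead of the median-of-means boosting, and every function name is hypothetical. The basis vector is the normalized Haar vector for the third and fourth entries, so the exact coefficient is 6/sqrt(2).

```python
import math
import random


def make_signs(k, n, seed=0):
    """k pseudorandom ±1 vectors of length n (the shared 'seeds')."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]


def sketch(vec, signs):
    return [sum(s * v for s, v in zip(row, vec)) for row in signs]


def estimate_wavelet(sketch_a, psi, signs):
    k = len(signs)
    sketch_psi = sketch(psi, signs)                  # sketch of the basis vector
    y = sum(x * x for x in sketch_a) / k             # Y ~ ||a||^2
    root_y = math.sqrt(y)
    # sketch(a' - psi) = sketch(a)/sqrt(Y) - sketch(psi), by linearity
    diff = [xa / root_y - xp for xa, xp in zip(sketch_a, sketch_psi)]
    dist2 = sum(d * d for d in diff) / k             # ~ dist^2(a', psi)
    cos = 1 - dist2 / 2
    return root_y * cos


a = [2, 2, 0, 6, 4, 2, 2, 0]
psi = [0, 0, -1 / math.sqrt(2), 1 / math.sqrt(2), 0, 0, 0, 0]
signs = make_signs(k=4096, n=len(a))

estimate = estimate_wavelet(sketch(a, signs), psi, signs)
exact = sum(x * y for x, y in zip(a, psi))
print(exact, estimate)
```

Only sketch(a) and the seeds are needed at query time; the array a itself appears here just to build the sketch and to compute the exact value for comparison.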
26 Overall architecture
data stream
seeds
sketch
wavelets
Queries
27 Outline
- Introduction
- applications of approximate query evaluation
- problem definition
- Haar Wavelets
- definition, examples
- a simple on-line algorithm
- Approximate Computation of Wavelets (VLDB2001)
- JL-embeddings, sketches
- computing wavelets via sketches
- applications with real data
- Newer algorithms (STOC2002, VLDB2002)
- Conclusions
28 Experiments (telephone network)
- CDRs from 7 days of February 2001
- a_i = calls from npa-nxx i
- N = 65,536
- Sketch size: 3,952 words
29 Comparison with the off-line algorithm
- The top-7 wavelets contain 90% of the energy
- The remaining 65529 wavelets are very small
30 Comparison with static pre-selection
31 Mapping into R^N
data
wavelets
32 Linearity of sketches
- Observations from different systems can be combined
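Linearity means that sketches built at different sites with the same pseudorandom vectors can simply be added. A minimal check, assuming both sites share one seed; the second array and all names are made up for illustration:

```python
import random


def make_signs(k, n, seed):
    """k pseudorandom ±1 vectors of length n; every site must use the same seed."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]


def sketch(vec, signs):
    return [sum(s * v for s, v in zip(row, vec)) for row in signs]


signs = make_signs(k=16, n=8, seed=42)
site_a = [2, 2, 0, 6, 4, 2, 2, 0]
site_b = [1, 0, 3, 0, 0, 5, 1, 1]  # hypothetical second observation point

combined = [x + y for x, y in zip(sketch(site_a, signs), sketch(site_b, signs))]
direct = sketch([x + y for x, y in zip(site_a, site_b)], signs)
print(combined == direct)  # True
```

The equality is exact, not approximate, because each sketch entry is a linear function of the data.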
33 Combining distributed measurements
Total flow through the backbone network
34 Extensions
- STOC 2002 paper: hierarchy of sketches
- Computing histograms and wavelets via sketches in sub-linear time and space with minimum relative error
- applications: Exploratory Data Analysis, visualization, databases, etc.
35 Random Subset Sums (VLDB2002)
- Select each a_i with probability 50%
a_i ≈ 2 rss_j - ||a||, if r_ji = 1
[Figure: random subsets of the a_i]
36 Construction of the RSS
seed (log(N)+1 bits): 1 0 1 1
matrix:
1 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1
seed x matrix = 1 2 2 3 1 2 2 3
(mod 2) = 1 0 0 1 1 0 0 1
rss = a0 + a3 + a4 + a7
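The construction on this slide can be replayed in a few lines. The sketch below assumes the matrix has the form shown (a row of ones followed by the binary digits of the column index, most significant bit first); the function name is illustrative and the data array is the running example from the wavelet slides.

```python
def membership_vector(seed_bits, n):
    """Bit j tells whether a_j belongs to the random subset; n is a power of two."""
    log_n = n.bit_length() - 1
    assert len(seed_bits) == log_n + 1
    members = []
    for j in range(n):
        # Column j of the matrix: a leading 1, then the bits of j (MSB first).
        column = [1] + [(j >> (log_n - 1 - t)) & 1 for t in range(log_n)]
        members.append(sum(s * c for s, c in zip(seed_bits, column)) % 2)
    return members


members = membership_vector([1, 0, 1, 1], 8)
print(members)  # [1, 0, 0, 1, 1, 0, 0, 1]

a = [2, 2, 0, 6, 4, 2, 2, 0]
rss = sum(v for v, m in zip(a, members) if m)  # a0 + a3 + a4 + a7
print(rss)  # 12
```

Only the log(N)+1 seed bits need to be stored; membership of any index is recomputed on demand.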
37 Newer algorithms (VLDB2002)
- Write any interval as a sum of O(log N) dyadic intervals
- each dyadic interval is approximated within ±ε||a||_1 via the RSS
a_i
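The decomposition into dyadic intervals can be sketched as a greedy left-to-right cover. This is a standard construction written here for illustration (half-open ranges, zero-based indices); the function name is hypothetical.

```python
def dyadic_cover(lo, hi):
    """Split the half-open range [lo, hi) into maximal dyadic intervals."""
    pieces = []
    while lo < hi:
        if lo > 0:
            size = lo & -lo  # largest power of two that divides lo
        else:
            size = 1 << ((hi - lo).bit_length() - 1)  # any size is aligned at 0
        while size > hi - lo:
            size //= 2
        pieces.append((lo, lo + size))
        lo += size
    return pieces


print(dyadic_cover(1, 7))  # [(1, 2), (2, 4), (4, 6), (6, 7)]
```

Each piece has a power-of-two length and starts at a multiple of that length, so a range sum can be assembled from one estimate per piece.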
38 Deciles of on-going Calls
39 Conclusions
- Approximate query evaluation is desirable in many applications
- very fast response in analysis applications
- the only solution when collecting the data is impossible
- Two methods (sketches/RSS) for a synoptic description with
- small space, one pass, fidelity guarantees (ε,δ)
- fast evaluation
- Combining distributed measurements in large-scale systems
- lossless for any linear combination
40 Multidimensional Data Analysis
- Cubetrees, Dwarf, SIGMOD-97, 98, 02
- efficient organization structures
- DynaMat, best paper award SIGMOD-99, TODS-01
- automatic selection and organization based on the available resources (storage space, computation time), updating
- Data mining (VLDB-98, 01)
- Data exchange via XML, ICDE-03, etc.
Viewproduct,store
41 Thank you!
42 Exponential fading
- Exp fading: b' = λa + (1-λ)b
- By linearity: h(b') = λh(a) + (1-λ)h(b)
43 Conventional View (Haar Wavelets)
44 Other applications
- They can be used instead of the original signals in many data analysis algorithms
- e.g. SVD (Information Retrieval, LSI)
45 Wavelet Transform
- JPEG-2000
- Physiology (image perception in mammals)
- Many applications: Data Compression, Noise Reduction, Edge Detection (image processing)
- Databases: selectivity estimation (Matias98-00, Chakrabarti00, Gilbert00), aggregate OLAP queries (Vitter99), etc.
- Fast Transform: O(N) space/time
- Few good-terms phenomenon
- Just a few coefficients retain most of the energy
46 IP Example
47 Main Result
- Parameters
- ε: seek inner products within (1±ε)
- δ: failure probability
- η: guarantees hold only when the cosine is greater than η
- if w_l^2 ≥ (εη/B)||a||^2, w_l can be estimated reliably
- If there is a top-B wavelet representation with pseudo-energy at least η||a||^2, then with probability (1-δ) we can find an approximate B-term representation with pseudo-energy at least (1-ε)η||a||^2, with space and per-item time cost
- O(B log^2(N) log(N/δ)/η/ε^3)
48 Other examples
- Parallel systems, e.g.
- statistics for distributing the data evenly
- Statistics for query optimization
?e?t??????
SQL Query
Optimizer
Synoptic description
49 Chebyshev's Inequality
- P[|X - E[X]| > k] < VAR[X]/k^2
- (apply with X^2 in place of X)
- In our case E[X^2] = A^2, VAR[X^2] = E[(X^2 - E[X^2])^2] < (A^2)^2
- Average: Y = (X1^2 + X2^2 + ... + Xµ^2)/µ
- E[Y] = A^2
- VAR[Y] < (A^2)^2/µ
- So P[|Y - A^2| > εA^2] < ((A^2)^2/µ)/(ε^2 (A^2)^2) = 1/(µε^2)
- or P[|Y - A^2| < εA^2] > 1 - 1/(µε^2)
50 1st Chernoff Bound
- Let V be a random variable equal to 1 if Y is within the bounds
- P[V=1] = p is constant for fixed ε, µ: Poisson trials
- take λ trials of V
- Let X be the number of successes, X = SUM(V)
- The failure probability:
- P[X < (1-δ)λp] < e^(-λpδ^2/2)
- For 1-δ = 1/2 (the case relevant to the median)
- P[X > ½ λp] > 1 - e^(-λp/8)
- Let p = 7/8, i.e. 1/(µε^2) = 1/8; then P[|Y - A^2| < εA^2] > 1-δ
- where ε = sqrt(8/µ) and δ = e^(-λp/8) with p = 7/8 (the other bounds are similar)