1
Density Estimation for Spatial Data Streams
  • Cecilia M. Procopiuc and Octavian Procopiuc
  • AT&T Shannon Labs
  • SSTD'05
  • Presented by Huiping Cao

2
Outline
  • Background
  • Related work
  • Problem definition
  • Online algorithm
  • Experiments
  • References

3
Background
  • Streaming data
  • Large volume
  • Continuous arrival
  • Data stream algorithms
  • One pass
  • Small amount of space
  • Fast updating

4
Background
  • Data stream model
  • Operations of elements
  • Insertion (most common case)
  • Deletion
  • Updating (most difficult case)
  • Valid time of elements
  • Whole history
  • Landmark window [Geh01]; cash register model (this paper)
  • Partial recent history
  • Sliding window [Geh01]; turnstile model (this paper)

5
Related work
  • Classified according to operations
  • Aggregation
  • avg, max, min, sum [Dob02]
  • count of distinct values [Gib01]
  • Quantiles in 1D data [Gre01]
  • Frequent items (heavy hitters) [Man02]
  • Query estimation
  • Join size estimation [Dob02]
  • K-nearest-neighbor searching
  • e-KNN [Kou04], RNN aggregation [Kor02]
  • Techniques
  • Histograms [Tha02], samples [Joh05], special synopses, ...

6
Problem definition
  • D = {p1, p2, ...}, where pi ∈ R^d
  • Cash register model
  • Query Qi arrives at time step i
  • Ri: a d-dimensional hyper-rectangle
  • Selectivity of Qi: sel(Qi) = |{pj | j < i, pj ∈ Ri}|
  • These points arrive before time step i
  • They lie in Ri
  • Problem: estimate sel(Qi)
  • Measurement: relative error
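As a point of reference, the following minimal Python sketch computes the exact selectivity that the online algorithm approximates; the function name and data are illustrative only.

```python
# Exact selectivity of an axis-parallel range query over the stream
# prefix seen so far; the online algorithm approximates this count.
def selectivity(points, rect):
    """points: d-dim tuples seen before the query; rect: (a_j, b_j) per dimension."""
    return sum(
        1 for p in points
        if all(a <= x <= b for x, (a, b) in zip(p, rect))
    )

# Example: a 2D stream prefix and a query rectangle covering two of the points.
prefix = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.8)]
print(selectivity(prefix, [(0.0, 0.6), (0.0, 0.6)]))  # 2
```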

7
Online algorithm: rough steps
  • Get a random sample S of D
  • using the reservoir sampling method [Vit85]
  • Index these sample points with a kd-tree [kd75]-like structure
  • Maintain the sample and the kd-tree-like structure online
  • Compute the range selectivity estimated_sel(Q) using a kernel density estimator

8
Random sampling
  • Theorem 1
  • Let T be the data stream seen so far, of size |T|;
  • let S ⊆ T be a random sample chosen via the reservoir sampling technique, such that |S| = Ω((d/ε²) log(1/ε) log(1/δ)), where 0 < ε, δ < 1 and |S| is the size of S.
  • Then with probability 1 − δ, for any axis-parallel hyper-rectangle Q, |sel(Q) − sel(Q, S)| ≤ ε, where
  • sel(Q) = |Q ∩ T| / |T| is the selectivity of Q with respect to the data stream seen so far, and
  • sel(Q, S) = |Q ∩ S| / |S| is the selectivity of Q with respect to the random sample.
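To get a feel for the bound, this sketch evaluates the sample-size expression above for concrete d, ε, δ; the constant hidden in the Ω(·) is not stated on the slide, so c = 1 is an arbitrary placeholder and the output is indicative only.

```python
import math

# Illustrative evaluation of |S| = Omega((d / eps^2) * log(1/eps) * log(1/delta));
# the hidden constant c is unknown, so c = 1 is an arbitrary placeholder.
def sample_size(d, eps, delta, c=1.0):
    return math.ceil(c * (d / eps ** 2) * math.log(1 / eps) * math.log(1 / delta))

print(sample_size(d=2, eps=0.05, delta=0.01))  # order of magnitude only
```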

9
Sampling
  • Random sampling
  • Problem: when sel(Q) is small, the relative error is large
  • A better selectivity estimator: the kernel density estimator

10
Kernel density estimator
  • S = {s1, ..., sm}: a random subset of D
  • f(x) = (1/m) Σi=1..m Πj=1..d (1/Bj) k((xj − sij)/Bj)
  • where x = (x1, ..., xd) and si = (si1, ..., sid) are d-dimensional points
  • Bj: kernel bandwidth along dimension j
  • set by Scott's rule [Sco92]
  • a global parameter
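A minimal sketch of this estimator, assuming the Epanechnikov kernel that is commonly paired with Scott's rule (the slide does not fix the kernel function); all names are illustrative.

```python
import math

def scott_bandwidths(samples, d):
    """Per-dimension bandwidths in the spirit of Scott's rule [Sco92]:
    B_j proportional to sigma_j * m^(-1/(d+4))."""
    m = len(samples)
    bw = []
    for j in range(d):
        col = [s[j] for s in samples]
        mean = sum(col) / m
        sigma = math.sqrt(sum((x - mean) ** 2 for x in col) / m)
        bw.append(sigma * m ** (-1.0 / (d + 4)))
    return bw

def kde(x, samples, bandwidths):
    """f(x) = (1/m) * sum_i prod_j (1/B_j) * k((x_j - s_ij) / B_j)."""
    def k(u):  # Epanechnikov kernel, support [-1, 1]
        return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0
    total = 0.0
    for s in samples:
        prod = 1.0
        for xj, sj, bj in zip(x, s, bandwidths):
            prod *= k((xj - sj) / bj) / bj
        total += prod
    return total / len(samples)

samples = [(0.2, 0.3), (0.6, 0.7)]
bw = scott_bandwidths(samples, d=2)
print(kde((0.3, 0.4), samples, bw))  # positive density near the first sample
```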

11
One-dimensional kernels
  • (a) Kernel function, B = 1
  • (b) Contribution of multiple kernels to the estimate of a range query

12
Local kernel density estimator
  • kd-tree structure T(S): an index of the sample data
  • Each leaf contains one point: si ∈ leaf(si)
  • Any two leaves are disjoint
  • The union of all leaves is R^d
  • Each leaf maintains d+1 values: αi, σi1, σi2, ..., σid
  • σij approximates the standard deviation of the points in the cell centered at si along dimension j
  • R = [a1, b1] × ... × [ad, bd]
  • Ti: the subset of points in tree leaf leaf(si)

13
Update T(S)
  • Purpose: maintain αi and σij (1 ≤ j ≤ d)
  • αi is the number of stream points contained in leaf(si)
  • Assume p is the current point in the data stream
  • If p is not selected into S by the sampling algorithm (this case is sketched below):
  • Find the leaf that contains p, leaf(si)
  • Increment αi
  • Add (pj − sij)² to σij
  • If p is selected into S:
  • A point q is deleted from S
  • Delete leaf(q)
  • Add a new leaf corresponding to p
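A minimal sketch of the maintenance step for the common case (p is not sampled), with illustrative names; σij is kept here as a running sum of squared deviations from the kernel center.

```python
class Leaf:
    def __init__(self, center):
        self.center = center                # kernel center s_i
        self.alpha = 0                      # stream points in leaf(s_i)
        self.sq_dev = [0.0] * len(center)   # per-dim sum of (p_j - s_ij)^2

def update_leaf(leaf, p):
    # Called when p is NOT added to the sample: after locating the leaf
    # that contains p, increment alpha and fold p into the deviations.
    leaf.alpha += 1
    for j in range(len(p)):
        leaf.sq_dev[j] += (p[j] - leaf.center[j]) ** 2

leaf = Leaf(center=(0.5, 0.5))
update_leaf(leaf, (0.6, 0.4))
print(leaf.alpha, leaf.sq_dev)  # alpha = 1, squared deviations near 0.01 each
```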

14
Delete leaf(q)
  • u: parent node of leaf(q)
  • v: sibling of leaf(q)
  • box(u): axis-parallel hyper-rectangle of node u
  • h(u): hyper-plane orthogonal to a coordinate axis that divides box(u) into the two smaller boxes associated with the children of u
  • N(q): neighbors of leaf(q)
  • leaves in the subtree of v that have one boundary contained in h(u)

15
Delete leaf(q)
  • Redistribute the points of leaf(q) to N(q)
  • Extend the bounding box of each neighbor of leaf(q) past h(u), until it hits the left boundary of leaf(q)
  • Update the α and σ values for all leaves in N(q)
  • Notation:
  • leaf(r) ∈ N(q)
  • box_e(r): the expanded box of r

16
Update α of leaf(r)
  • Update the α value for every leaf leaf(r) ∈ N(q):
  • compute the selectivity sel(box_e(r)) of box_e(r) w.r.t. leaf(q)
  • αr ← αr + sel(box_e(r))

17
Update σ of leaf(r)
  • Let [γj, δj] be the intersection of box_e(r) and the kernel function of q along dimension j
  • Discretize it by λ equidistant points (λ is a large constant):
  • γj = v1 < v2 < ... < vλ = δj
  • Update σrj as follows (sketched after the next slide):
  • wti is the approximate number of points of leaf(q) whose j-th coordinate lies in the interval [vi, vi+1]

18
Update σ of leaf(r)
  • Update σrj by discretizing the intersection of box_e(r) and the kernel of q along dimension j (in the figure, the gray area represents wt2)
  • All points in an interval are approximated by its midpoint
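A sketch of this discretized update along one dimension, under the stated assumptions: the weights wti are already computed, σrj is maintained as a sum of squared deviations from the kernel center of r, and each interval's mass is collapsed to its midpoint. Names are illustrative.

```python
# gamma, delta: endpoints of the intersection along dimension j;
# wt[i]: approximate number of leaf(q)'s points in [v_i, v_{i+1}];
# lam: the (large) number of equidistant discretization points.
def update_sigma_j(sq_dev_rj, r_center_j, gamma, delta, wt, lam):
    step = (delta - gamma) / (lam - 1)
    v = [gamma + i * step for i in range(lam)]        # v_1 ... v_lam
    for i in range(lam - 1):
        mid = 0.5 * (v[i] + v[i + 1])                 # midpoint of [v_i, v_{i+1}]
        sq_dev_rj += wt[i] * (mid - r_center_j) ** 2  # weighted squared deviation
    return sq_dev_rj
```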

19
Insert a leaf
  • p: newly inserted point
  • q: existing sample point such that p ∈ leaf(q)
  • Split leaf(q) by a hyperplane (see the sketch below)
  • passing through the midpoint (p + q)/2
  • Direction: the alternating rule of the kd-tree
  • If i is the splitting dimension for the parent of q, then the splitting dimension for q is (i+1) mod d
  • Update the α and σ values for p and q using a procedure similar to the updates above
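A small sketch of the split rule, with illustrative names; it returns the splitting dimension chosen by the alternating rule and the cut coordinate through the midpoint (p + q)/2.

```python
# Split leaf(q) when a new sample point p lands in it: the hyperplane
# passes through (p + q)/2, and its orientation follows the alternating
# kd-tree rule (parent split dimension i -> (i + 1) mod d).
def split_leaf(p, q, parent_split_dim, d):
    dim = (parent_split_dim + 1) % d
    cut = (p[dim] + q[dim]) / 2.0   # midpoint coordinate along that dimension
    return dim, cut

# Example in 2D: the parent split on dimension 0, so leaf(q) splits on
# dimension 1, at roughly 0.6.
print(split_leaf(p=(0.4, 0.9), q=(0.2, 0.3), parent_split_dim=0, d=2))
```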

20
Extension
  • Allow deletion of a point p from the data stream
  • If p is not a kernel center:
  • Compute leaf(si) such that p ∈ leaf(si)
  • αi ← αi − 1
  • σij ← σij − (pj − sij)²
  • If p is a kernel center:
  • Delete leaf(p)
  • Replace p with a newly arriving point p′
  • This does not follow the sampling procedure, so it may make the sample non-uniform w.r.t. the points in D

21
Experiments
  • Different numbers of dimensions
  • Different query loads
  • Range selectivity
  • Measurements:
  • Accuracy
  • Trade-off between accuracy and space usage

22
Data
  • Synthetic data, from the projected-clustering generator of [Agg99]
  • SD2 (2D), SD4 (4D)
  • 1 million points; 90% are contained in clusters, 10% are uniformly distributed
  • Real data
  • NM2
  • 1 million 2D points with real-valued attributes
  • Each point is an aggregate of measurements taken in a 15-minute interval, reflecting minimum and maximum delay times between pairs of servers on AT&T's backbone network

23
Query loads
  • Two query workloads for each dataset
  • Queries are chosen randomly in the attribute space
  • Each workload contains 200 queries
  • Every query in a workload has the same selectivity:
  • 0.5% for the first workload (low selectivity)
  • 10% for the second (high selectivity)

24
Accuracy measure
  • For query Qi, the relative error is Erri = |sel(Qi) − estimated_sel(Qi)| / sel(Qi)
  • Let Qi1, ..., Qik be the query workload; the average relative error of this workload is avg_err = (1/k) Σj=1..k Errij (see the sketch below)
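These two definitions translate directly into code; a minimal sketch with illustrative inputs.

```python
# Relative error of one query and the average over a workload,
# as defined on this slide (assumes sel(Q_i) > 0).
def relative_error(true_sel, est_sel):
    return abs(true_sel - est_sel) / true_sel

def avg_err(true_sels, est_sels):
    errs = [relative_error(t, e) for t, e in zip(true_sels, est_sels)]
    return sum(errs) / len(errs)

print(avg_err([0.005, 0.10], [0.004, 0.12]))  # ~0.2 (20% average error)
```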

25
Validating local kernels in an off-line setting (1)
  • MPLKernels (Multi-Pass Local Kernels)
  • Scan the data once to get random sample points
  • Compute the kd-tree on them
  • Scan the data a second time to compute α and σ
  • Only useful in an off-line setting
  • GKernels (Global Kernels) [Gun00]
  • Kernel bandwidth is a function of the global standard deviation of the data along each dimension
  • One-pass approximation or two-pass accurate computation
  • Sample: random sampling
  • LKernels: one-pass local kernels

26
Validating local kernels in an off-line setting (2)
27
Validating local kernels in an off-line setting (3)
28
Comparison with histogram methods (1)
  • Histogram method [Tha02]
  • and its faster heuristic, EGreedy

29
General online setting (1)
  • Queries arrive interleaved with points
  • Compared methods:
  • Sample
  • LKernels
  • MPLKernels

30
General online setting (2)
31
General online setting (3)
32
General online setting (4)
33
References
  • [kd75] J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Communications of the ACM, 18(9), September 1975.
  • [Vit85] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985.
  • [Sco92] D. W. Scott. Multivariate Density Estimation. Wiley-Interscience, 1992.
  • [Agg99] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast Algorithms for Projected Clustering. In SIGMOD'99, pages 61-72.
  • [Gun00] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating Multidimensional Aggregate Range Queries over Real Attributes. In SIGMOD'00, pages 463-474.
  • [Geh01] J. Gehrke, F. Korn, and D. Srivastava. On Computing Correlated Aggregates over Continual Data Streams. In SIGMOD'01.
  • [Gib01] P. Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. In VLDB'01.
  • [Gre01] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD'01.

34
References
  • [Dob02] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In SIGMOD'02.
  • [Kor02] F. Korn, S. Muthukrishnan, and D. Srivastava. Reverse Nearest Neighbor Aggregates over Data Streams. In VLDB'02.
  • [Tha02] N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic Multidimensional Histograms. In SIGMOD'02, pages 428-439.
  • [Man02] G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB'02, pages 346-357.
  • [Kou04] N. Koudas, B. C. Ooi, K.-L. Tan, and R. Zhang. Approximate NN Queries on Streams with Guaranteed Error/Performance Bounds. In VLDB'04, pages 804-815.
  • [Joh05] T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling Algorithms in a Stream Operator. In SIGMOD'05.

35
Appendix reservoir sampling
  • This algorithm (called Algorithm X in Vitter's paper) obtains a random sample of size n during a single pass through the relation.
  • The number of tuples in the relation does not need to be known beforehand. The algorithm proceeds by inserting the first n tuples into a reservoir.
  • Then a random number of records is skipped, and the next tuple replaces a randomly selected tuple in the reservoir.
  • Another random number of records is then skipped, and so forth, until the last record has been scanned.
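For concreteness, here is the simpler Algorithm R variant of reservoir sampling in Python; it yields the same uniform sample as the skip-based Algorithm X described above, just with one random draw per record.

```python
import random

# Basic reservoir sampling (Vitter's Algorithm R): after the reservoir
# fills, the t-th record (0-based) survives with probability n / (t + 1).
def reservoir_sample(stream, n):
    reservoir = []
    for t, item in enumerate(stream):
        if t < n:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, t)    # uniform over the t + 1 items seen
            if j < n:
                reservoir[j] = item     # replace a random reservoir slot
    return reservoir

print(reservoir_sample(range(1_000_000), 5))  # a uniform sample of size 5
```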

36
Appendix kd-tree
  • Start from the root cell and recursively bisect the cells through their longest axis, so that an equal number of points lies in each sub-volume
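A minimal sketch of this construction, assuming a median split along the cell's longest axis so the two halves hold equal point counts; names are illustrative.

```python
# Recursively bisect each cell through its longest axis at the median,
# so an equal number of points lies in each sub-volume.
def build_kdtree(points, bounds):
    if len(points) <= 1:
        return {"points": points, "bounds": bounds}
    axis = max(range(len(bounds)), key=lambda j: bounds[j][1] - bounds[j][0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    cut = points[mid][axis]             # median coordinate along the axis
    lo, hi = list(bounds), list(bounds)
    lo[axis] = (bounds[axis][0], cut)
    hi[axis] = (cut, bounds[axis][1])
    return {"axis": axis, "cut": cut,
            "left": build_kdtree(points[:mid], lo),
            "right": build_kdtree(points[mid:], hi)}

pts = [(0.1, 0.9), (0.4, 0.2), (0.8, 0.5), (0.6, 0.7)]
tree = build_kdtree(pts, [(0.0, 1.0), (0.0, 1.0)])
print(tree["axis"], tree["cut"])  # first split axis and cut coordinate
```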