1
Density Estimation for Spatial Data Streams
  • Cecilia M. Procopiuc and Octavian Procopiuc
  • AT&T Shannon Labs
  • SSTD'05
  • Presented by Huiping Cao

2
Outline
  • Background
  • Related work
  • Problem definition
  • Online algorithm
  • Experiments
  • References

3
Background
  • Streaming data
  • Large volume
  • Continuous arrival
  • Data stream algorithms
  • One pass
  • Small amount of space
  • Fast updating

4
Background
  • Data stream model
  • Operations of elements
  • Insertion (most common case)
  • Deletion
  • Updating (most difficult case)
  • Valid time of elements
  • Whole history
  • Landmark window [Geh01]; cash register model (this paper)
  • Partial recent history
  • Sliding window [Geh01]; turnstile model (this paper)

5
Related work
  • Classified according to operations
  • Aggregation
  • avg, max, min, sum [Dob02]
  • count of distinct values [Gib01]
  • Quantiles in 1D data [Gre01]
  • Frequent items (heavy hitters) [Man02]
  • Query estimation
  • Join size estimation [Dob02]
  • K-nearest-neighbor searching
  • e-KNN [Kou04], RNN aggregation [Kor02]
  • Techniques
  • Histograms [Tha02], samples [Joh05], special synopses, ...

6
Problem definition
  • D = {p1, p2, ...}, where pi ∈ R^d
  • Cash register model
  • Query Qi arrives at time step i
  • Ri: a d-dimensional hyper-rectangle
  • Selectivity of Qi: sel(Qi) = |{pj | j < i, pj ∈ Ri}|
  • These points arrive before time step i
  • They lie in Ri
  • Problem: estimate sel(Qi)
  • Measurement: relative error
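As a point of reference, the following minimal Python sketch computes the exact selectivity that the online algorithm approximates; the function name and data are illustrative only.

```python
# Exact selectivity of an axis-parallel range query over the stream
# prefix seen so far; the online algorithm approximates this count.
def selectivity(points, rect):
    """points: d-dim tuples seen before the query; rect: (a_j, b_j) per dimension."""
    return sum(
        1 for p in points
        if all(a <= x <= b for x, (a, b) in zip(p, rect))
    )

# Example: a 2D stream prefix and a query rectangle covering two of the points.
prefix = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.8)]
print(selectivity(prefix, [(0.0, 0.6), (0.0, 0.6)]))  # 2
```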

7
Online algorithm: rough steps
  • Get a random sample S of D
  • using the reservoir sampling method [Vit85]
  • Index these sample points with a kd-tree [kd75]-like structure
  • Maintain the sample and the kd-tree-like structure online
  • Compute the range selectivity estimated_sel(Q) using a kernel density estimator

8
Random sampling
  • Theorem 1
  • Let T be the data stream seen so far, of size |T|;
  • let S ⊆ T be a random sample chosen via the reservoir sampling technique, such that |S| = Ω((d/ε²) log(1/ε) log(1/δ)), where 0 < ε, δ < 1 and |S| is the size of S.
  • Then with probability 1 − δ, for any axis-parallel hyper-rectangle Q, |sel(Q) − sel(Q, S)| ≤ ε, where
  • sel(Q) = |Q ∩ T| / |T| is the selectivity of Q with respect to the data stream seen so far, and
  • sel(Q, S) = |Q ∩ S| / |S| is the selectivity of Q with respect to the random sample.
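To get a feel for the bound, this sketch evaluates the sample-size expression above for concrete d, ε, δ; the constant hidden in the Ω(·) is not stated on the slide, so c = 1 is an arbitrary placeholder and the output is indicative only.

```python
import math

# Illustrative evaluation of |S| = Omega((d / eps^2) * log(1/eps) * log(1/delta));
# the hidden constant c is unknown, so c = 1 is an arbitrary placeholder.
def sample_size(d, eps, delta, c=1.0):
    return math.ceil(c * (d / eps ** 2) * math.log(1 / eps) * math.log(1 / delta))

print(sample_size(d=2, eps=0.05, delta=0.01))  # order of magnitude only
```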

9
Sampling
  • Random sampling
  • Problem: when sel(Q) is small, the relative error is large
  • A better selectivity estimator: the kernel density estimator

10
Kernel density estimator
  • S = {s1, ..., sm}: a random subset of D
  • f(x) = (1/m) Σi=1..m Πj=1..d (1/Bj) k((xj − sij)/Bj)
  • where x = (x1, ..., xd) and si = (si1, ..., sid) are d-dimensional points
  • Bj: kernel bandwidth along dimension j
  • set by Scott's rule [Sco92]
  • a global parameter
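A minimal sketch of this estimator, assuming the Epanechnikov kernel that is commonly paired with Scott's rule (the slide does not fix the kernel function); all names are illustrative.

```python
import math

def scott_bandwidths(samples, d):
    """Per-dimension bandwidths in the spirit of Scott's rule [Sco92]:
    B_j proportional to sigma_j * m^(-1/(d+4))."""
    m = len(samples)
    bw = []
    for j in range(d):
        col = [s[j] for s in samples]
        mean = sum(col) / m
        sigma = math.sqrt(sum((x - mean) ** 2 for x in col) / m)
        bw.append(sigma * m ** (-1.0 / (d + 4)))
    return bw

def kde(x, samples, bandwidths):
    """f(x) = (1/m) * sum_i prod_j (1/B_j) * k((x_j - s_ij) / B_j)."""
    def k(u):  # Epanechnikov kernel, support [-1, 1]
        return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0
    total = 0.0
    for s in samples:
        prod = 1.0
        for xj, sj, bj in zip(x, s, bandwidths):
            prod *= k((xj - sj) / bj) / bj
        total += prod
    return total / len(samples)

samples = [(0.2, 0.3), (0.6, 0.7)]
bw = scott_bandwidths(samples, d=2)
print(kde((0.3, 0.4), samples, bw))  # positive density near the first sample
```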

11
One-dimensional kernels
  • (a) Kernel function, B = 1
  • (b) Contribution of multiple kernels to the estimate of a range query

12
Local kernel density estimator
  • kd-tree structure T(S): an index of the sample data
  • Each leaf contains one point: si ∈ leaf(si)
  • Any two leaves are disjoint
  • The union of all leaves is R^d
  • Each leaf maintains d+1 values: αi, σi1, σi2, ..., σid
  • σij approximates the standard deviation of the points in the cell centered at si along dimension j
  • R = [a1, b1] × ... × [ad, bd]
  • Ti: the subset of points in tree leaf leaf(si)

13
Update T(S)
  • Purpose: maintain αi and σij (1 ≤ j ≤ d)
  • αi is the number of stream points contained in leaf(si)
  • Assume p is the current point in the data stream
  • If p is not selected into S by the sampling algorithm (this case is sketched below):
  • Find the leaf that contains p, leaf(si)
  • Increment αi
  • Add (pj − sij)² to σij
  • If p is selected into S:
  • A point q is deleted from S
  • Delete leaf(q)
  • Add a new leaf corresponding to p
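A minimal sketch of the maintenance step for the common case (p is not sampled), with illustrative names; σij is kept here as a running sum of squared deviations from the kernel center.

```python
class Leaf:
    def __init__(self, center):
        self.center = center                # kernel center s_i
        self.alpha = 0                      # stream points in leaf(s_i)
        self.sq_dev = [0.0] * len(center)   # per-dim sum of (p_j - s_ij)^2

def update_leaf(leaf, p):
    # Called when p is NOT added to the sample: after locating the leaf
    # that contains p, increment alpha and fold p into the deviations.
    leaf.alpha += 1
    for j in range(len(p)):
        leaf.sq_dev[j] += (p[j] - leaf.center[j]) ** 2

leaf = Leaf(center=(0.5, 0.5))
update_leaf(leaf, (0.6, 0.4))
print(leaf.alpha, leaf.sq_dev)  # alpha = 1, squared deviations near 0.01 each
```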

14
Delete leaf(q)
  • u: parent node of leaf(q)
  • v: sibling of leaf(q)
  • box(u): axis-parallel hyper-rectangle of node u
  • h(u): hyper-plane orthogonal to a coordinate axis that divides box(u) into the two smaller boxes associated with the children of u
  • N(q): neighbors of leaf(q)
  • leaves in the subtree of v that have one boundary contained in h(u)

15
Delete leaf(q)
  • Redistribute the points of leaf(q) to N(q)
  • Extend the bounding box of each neighbor of leaf(q) past h(u), until it hits the left boundary of leaf(q)
  • Update the α and σ values for all leaves in N(q)
  • Notation:
  • leaf(r) ∈ N(q)
  • box_e(r): the expanded box of r

16
Update α of leaf(r)
  • Update the α value for every leaf leaf(r) ∈ N(q):
  • compute the selectivity sel(box_e(r)) of box_e(r) w.r.t. leaf(q)
  • αr ← αr + sel(box_e(r))

17
Update σ of leaf(r)
  • Let [γj, δj] be the intersection of box_e(r) and the kernel function of q along dimension j
  • Discretize it by λ equidistant points (λ is a large constant):
  • γj = v1 < v2 < ... < vλ = δj
  • Update σrj as follows (sketched after the next slide):
  • wti is the approximate number of points of leaf(q) whose j-th coordinate lies in the interval [vi, vi+1]

18
Update σ of leaf(r)
  • Update σrj by discretizing the intersection of box_e(r) and the kernel of q along dimension j (in the figure, the gray area represents wt2)
  • All points in an interval are approximated by its midpoint
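A sketch of this discretized update along one dimension, under the stated assumptions: the weights wti are already computed, σrj is maintained as a sum of squared deviations from the kernel center of r, and each interval's mass is collapsed to its midpoint. Names are illustrative.

```python
# gamma, delta: endpoints of the intersection along dimension j;
# wt[i]: approximate number of leaf(q)'s points in [v_i, v_{i+1}];
# lam: the (large) number of equidistant discretization points.
def update_sigma_j(sq_dev_rj, r_center_j, gamma, delta, wt, lam):
    step = (delta - gamma) / (lam - 1)
    v = [gamma + i * step for i in range(lam)]        # v_1 ... v_lam
    for i in range(lam - 1):
        mid = 0.5 * (v[i] + v[i + 1])                 # midpoint of [v_i, v_{i+1}]
        sq_dev_rj += wt[i] * (mid - r_center_j) ** 2  # weighted squared deviation
    return sq_dev_rj
```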

19
Insert a leaf
  • p: newly inserted point
  • q: existing sample point such that p ∈ leaf(q)
  • Split leaf(q) by a hyperplane (see the sketch below)
  • passing through the midpoint (p + q)/2
  • Direction: the alternating rule of the kd-tree
  • If i is the splitting dimension for the parent of q, then the splitting dimension for q is (i+1) mod d
  • Update the α and σ values for p and q using a procedure similar to the updates above
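A small sketch of the split rule, with illustrative names; it returns the splitting dimension chosen by the alternating rule and the cut coordinate through the midpoint (p + q)/2.

```python
# Split leaf(q) when a new sample point p lands in it: the hyperplane
# passes through (p + q)/2, and its orientation follows the alternating
# kd-tree rule (parent split dimension i -> (i + 1) mod d).
def split_leaf(p, q, parent_split_dim, d):
    dim = (parent_split_dim + 1) % d
    cut = (p[dim] + q[dim]) / 2.0   # midpoint coordinate along that dimension
    return dim, cut

# Example in 2D: the parent split on dimension 0, so leaf(q) splits on
# dimension 1, at roughly 0.6.
print(split_leaf(p=(0.4, 0.9), q=(0.2, 0.3), parent_split_dim=0, d=2))
```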

20
Extension
  • Allow deletion of a point p from the data stream
  • If p is not a kernel center:
  • Compute leaf(si) such that p ∈ leaf(si)
  • αi ← αi − 1
  • σij ← σij − (pj − sij)²
  • If p is a kernel center:
  • Delete leaf(p)
  • Replace p with a newly arriving point p′
  • This does not follow the sampling procedure, so it may make the sample non-uniform w.r.t. the points in D

21
Experiments
  • Different numbers of dimensions
  • Different query loads
  • Range selectivity
  • Measurements:
  • Accuracy
  • Trade-off between accuracy and space usage

22
Data
  • Synthetic data, from the projected-clustering generator of [Agg99]
  • SD2 (2D), SD4 (4D)
  • 1 million points; 90% are contained in clusters, 10% are uniformly distributed
  • Real data
  • NM2
  • 1 million 2D points with real-valued attributes
  • Each point is an aggregate of measurements taken in a 15-minute interval, reflecting minimum and maximum delay times between pairs of servers on AT&T's backbone network

23
Query loads
  • Two query workloads for each dataset
  • Queries are chosen randomly in the attribute space
  • Each workload contains 200 queries
  • Every query in a workload has the same selectivity:
  • 0.5% for the first workload (low selectivity)
  • 10% for the second (high selectivity)

24
Accuracy measure
  • For query Qi, the relative error is Erri = |sel(Qi) − estimated_sel(Qi)| / sel(Qi)
  • Let Qi1, ..., Qik be the query workload; the average relative error of this workload is avg_err = (1/k) Σj=1..k Errij (see the sketch below)
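These two definitions translate directly into code; a minimal sketch with illustrative inputs.

```python
# Relative error of one query and the average over a workload,
# as defined on this slide (assumes sel(Q_i) > 0).
def relative_error(true_sel, est_sel):
    return abs(true_sel - est_sel) / true_sel

def avg_err(true_sels, est_sels):
    errs = [relative_error(t, e) for t, e in zip(true_sels, est_sels)]
    return sum(errs) / len(errs)

print(avg_err([0.005, 0.10], [0.004, 0.12]))  # ~0.2 (20% average error)
```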

25
Validating local kernels in an off-line setting (1)
  • MPLKernels (Multi-Pass Local Kernels)
  • Scan the data once to get random sample points
  • Compute the kd-tree on them
  • Scan the data a second time to compute α and σ
  • Only useful in an off-line setting
  • GKernels (Global Kernels) [Gun00]
  • Kernel bandwidth is a function of the global standard deviation of the data along each dimension
  • One-pass approximation or two-pass accurate computation
  • Sample: random sampling
  • LKernels: one-pass local kernels

26
Validating local kernels in an off-line setting (2)
27
Validating local kernels in an off-line setting (3)
28
Comparison with histogram methods (1)
  • Histogram method [Tha02]
  • and its faster heuristic, EGreedy

29
General online setting (1)
  • Queries arrive interleaved with points
  • Compared methods:
  • Sample
  • LKernels
  • MPLKernels

30
General online setting (2)
31
General online setting (3)
32
General online setting (4)
33
References
  • [kd75] J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Communications of the ACM, 18(9), September 1975.
  • [Vit85] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985.
  • [Sco92] D. W. Scott. Multivariate Density Estimation. Wiley-Interscience, 1992.
  • [Agg99] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast Algorithms for Projected Clustering. In SIGMOD'99, pages 61-72.
  • [Gun00] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating Multidimensional Aggregate Range Queries over Real Attributes. In SIGMOD'00, pages 463-474.
  • [Geh01] J. Gehrke, F. Korn, and D. Srivastava. On Computing Correlated Aggregates over Continual Data Streams. In SIGMOD'01.
  • [Gib01] P. Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. In VLDB'01.
  • [Gre01] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD'01.

34
References
  • [Dob02] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In SIGMOD'02.
  • [Kor02] F. Korn, S. Muthukrishnan, and D. Srivastava. Reverse Nearest Neighbor Aggregates over Data Streams. In VLDB'02.
  • [Tha02] N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic Multidimensional Histograms. In SIGMOD'02, pages 428-439.
  • [Man02] G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB'02, pages 346-357.
  • [Kou04] N. Koudas, B. C. Ooi, K.-L. Tan, and R. Zhang. Approximate NN Queries on Streams with Guaranteed Error/Performance Bounds. In VLDB'04, pages 804-815.
  • [Joh05] T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling Algorithms in a Stream Operator. In SIGMOD'05.

35
Appendix reservoir sampling
  • This algorithm (called Algorithm X in Vitter's paper) obtains a random sample of size n during a single pass through the relation.
  • The number of tuples in the relation does not need to be known beforehand. The algorithm proceeds by inserting the first n tuples into a reservoir.
  • Then a random number of records is skipped, and the next tuple replaces a randomly selected tuple in the reservoir.
  • Another random number of records is then skipped, and so forth, until the last record has been scanned.
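For concreteness, here is the simpler Algorithm R variant of reservoir sampling in Python; it yields the same uniform sample as the skip-based Algorithm X described above, just with one random draw per record.

```python
import random

# Basic reservoir sampling (Vitter's Algorithm R): after the reservoir
# fills, the t-th record (0-based) survives with probability n / (t + 1).
def reservoir_sample(stream, n):
    reservoir = []
    for t, item in enumerate(stream):
        if t < n:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, t)    # uniform over the t + 1 items seen
            if j < n:
                reservoir[j] = item     # replace a random reservoir slot
    return reservoir

print(reservoir_sample(range(1_000_000), 5))  # a uniform sample of size 5
```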

36
Appendix kd-tree
  • Start from the root cell and recursively bisect the cells through their longest axis, so that an equal number of points lies in each sub-volume
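A minimal sketch of this construction, assuming a median split along the cell's longest axis so the two halves hold equal point counts; names are illustrative.

```python
# Recursively bisect each cell through its longest axis at the median,
# so an equal number of points lies in each sub-volume.
def build_kdtree(points, bounds):
    if len(points) <= 1:
        return {"points": points, "bounds": bounds}
    axis = max(range(len(bounds)), key=lambda j: bounds[j][1] - bounds[j][0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    cut = points[mid][axis]             # median coordinate along the axis
    lo, hi = list(bounds), list(bounds)
    lo[axis] = (bounds[axis][0], cut)
    hi[axis] = (cut, bounds[axis][1])
    return {"axis": axis, "cut": cut,
            "left": build_kdtree(points[:mid], lo),
            "right": build_kdtree(points[mid:], hi)}

pts = [(0.1, 0.9), (0.4, 0.2), (0.8, 0.5), (0.6, 0.7)]
tree = build_kdtree(pts, [(0.0, 1.0), (0.0, 1.0)])
print(tree["axis"], tree["cut"])  # first split axis and cut coordinate
```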