A Sampling-based Estimator for Top-k Selection Query

1
A Sampling-based Estimator for Top-k Selection
Query
  • Chung-Min Chen, Yibei Ling
  • ICDE 2002
  • Presented by Kan Kin Fai

2
Outline
  • Introduction
  • Histogram-based Method
  • Sampling-based Method
  • Experimental Results
  • Conclusion

3
Introduction
  • Given a distance function and a query point q,
    the top-k query is to find the top k points from
    the dataset that are closest to q.
  • Example: searching for an apartment by specifying a
    price and a location

4
Introduction
  • Goal: find a good approximation of the top-k
    points quickly
  • Approach: translate a top-k query into a range
    query
  • Distance functions:
  • Euclidean distance (L2-norm distance)
  • Summation distance (L1-norm distance)
  • Maximum distance (L∞-norm distance)
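
The three distance functions and the exact top-k query they define can be sketched as follows. This is a minimal illustration, not code from the paper; the function names and the sample points are hypothetical.

```python
import heapq

def l1_distance(p, q):
    """Summation distance: sum of per-dimension absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l2_distance(p, q):
    """Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def linf_distance(p, q):
    """Maximum distance: largest per-dimension absolute difference."""
    return max(abs(a - b) for a, b in zip(p, q))

def exact_top_k(points, q, k, dist):
    """Brute-force top-k: the k points closest to q under dist."""
    return heapq.nsmallest(k, points, key=lambda p: dist(p, q))

# Hypothetical 2-D data (for example, price and location coordinates).
points = [(1.0, 1.0), (4.0, 5.0), (2.0, 0.0), (9.0, 9.0)]
q = (0.0, 0.0)
nearest = exact_top_k(points, q, 2, l2_distance)
```

The brute-force scan touches every point, which is the cost the range-query translation is meant to avoid.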

5
Histogram-based Method
  • Determines the range query for a top-k query
    with query point q using histograms
  • Drawbacks:
  • poor scalability of histograms with data
    dimensionality
  • non-trivial maintenance overhead of
    multidimensional histograms

6
Histogram-based Method
  • Strategies: NoRestart, Start, Inter1 and Inter2

7
Sampling-based Method
  • Main idea
  • take a random sample S of size s from the dataset
    D of size n (sampling rate r = s / n).
  • given a query point q, compute the distances
    between q and all the points in S; sort the
    sample points in ascending order of the computed
    distance.
  • take the first l points from the sorted sequence,
    where l = k · r, and determine the range query
    from them.
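
A minimal sketch of these three steps, assuming Euclidean distance and an in-memory dataset. The function name, the fixed seed, and the rounding of k · r up to at least one point are assumptions made for illustration; the slide does not state how a fractional k · r is handled.

```python
import math
import random

def closest_sample_points(dataset, q, k, s, seed=0):
    """Return the l sample points closest to q, with l derived from k * r."""
    rng = random.Random(seed)            # fixed seed only for repeatability
    sample = rng.sample(dataset, s)      # random sample S of size s
    r = s / len(dataset)                 # sampling rate r = s / n
    # Assumption: round k * r up and keep at least one point.
    l = max(1, math.ceil(k * r))
    sample.sort(key=lambda p: math.dist(p, q))  # ascending distance to q
    return sample[:l]

# Hypothetical dataset: a 10 x 10 grid of points.
dataset = [(float(x), float(y)) for x in range(10) for y in range(10)]
q = (4.5, 4.5)
anchors = closest_sample_points(dataset, q, k=10, s=20)
```

With n = 100, s = 20 and k = 10, the sampling rate is 0.2, so two sample points are kept to derive the range query.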

8
Sampling-based Method
  • Determining the range query:
  • the Minimum Bounding Rectangle (MBR)
  • Sym: set the side length on the ith dimension to
    2·d_i, where d_i = max(|q_i − x_i|) over all
    (x_1, ..., x_m) ∈ the l points.
  • Squ: set the side length on the ith dimension to
    2d, where d = max(d_i) for 1 ≤ i ≤ m.
  • the Minimum Bounding Square on Shape (MBSS)
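
The Sym and Squ rules can be sketched as below. This assumes the per-dimension extent is the absolute difference |q_i − x_i| and that the range is centered at q; the function names and the example points are hypothetical.

```python
def sym_range(q, nearest):
    """Sym: per-dimension interval [q_i - d_i, q_i + d_i], side length 2 * d_i."""
    m = len(q)
    # d_i = largest absolute offset from q on dimension i among the l points
    d = [max(abs(q[i] - x[i]) for x in nearest) for i in range(m)]
    return [(q[i] - d[i], q[i] + d[i]) for i in range(m)]

def squ_range(q, nearest):
    """Squ: same half-side d = max_i d_i on every dimension, side length 2 * d."""
    d = max(abs(q[i] - x[i]) for x in nearest for i in range(len(q)))
    return [(qi - d, qi + d) for qi in q]

# Hypothetical query point and its l = 2 closest sample points.
q = (5.0, 5.0)
nearest = [(4.0, 7.0), (6.0, 5.5)]
sym = sym_range(q, nearest)   # rectangle symmetric around q
squ = squ_range(q, nearest)   # square symmetric around q
```

Here d_1 = 1 and d_2 = 2, so Sym yields a 2 × 4 rectangle while Squ widens both dimensions to side length 4.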

9
Sampling-based Method

10
Sampling-based Method
  • Para:
  • use L∞ to sort the sample points regardless of
    the distance function
  • take l = c · r · k + 1 points from the sorted
    sequence; c is the magnification factor (MF)
  • set the range query to be the smallest square
    centered at q that encloses the l points.
  • Pros: gives an accurate result size
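
A minimal sketch of the Para strategy. The count of points taken is read here as l = floor(c · r · k) + 1, which is an assumption about a formula that is not fully legible in the slide; the function names and the sample are hypothetical.

```python
import math

def linf(p, q):
    """Maximum (L-infinity) distance between p and q."""
    return max(abs(a - b) for a, b in zip(p, q))

def para_range(sample, q, k, r, c):
    """Smallest square centered at q enclosing the l nearest sample points."""
    # Assumption: l = floor(c * r * k) + 1, capped at the sample size.
    l = min(math.floor(c * r * k) + 1, len(sample))
    ordered = sorted(sample, key=lambda p: linf(p, q))  # always sort by L-infinity
    # The l-th point's L-infinity distance is the half-side of that square.
    half_side = linf(ordered[l - 1], q)
    return [(qi - half_side, qi + half_side) for qi in q]

# Hypothetical sample drawn at rate r = 0.25, query for k = 4 with MF c = 2.
sample = [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
q = (0.0, 0.0)
square = para_range(sample, q, k=4, r=0.25, c=2.0)
```

Sorting by L∞ makes the enclosing square trivial to compute: its half-side is simply the L∞ distance of the last point taken.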

11
Sampling-based Method
  • Let Q(D) be the result of the range query Q and
    top(D,q,k) be the set containing the k closest
    points to q.
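
With these two sets, recall and precision can be computed as in the sketch below. It assumes the standard definitions (recall = |Q(D) ∩ top(D,q,k)| / k, precision = |Q(D) ∩ top(D,q,k)| / |Q(D)|), which this transcript does not spell out; the function name and the example sets are hypothetical.

```python
def recall_precision(range_result, true_top_k):
    """Recall and precision of a range query result against the true top-k set."""
    result, truth = set(range_result), set(true_top_k)
    hits = len(result & truth)                 # true top-k points that were retrieved
    recall = hits / len(truth)                 # fraction of the top-k that was found
    precision = hits / len(result) if result else 0.0  # fraction of result that is top-k
    return recall, precision

# Hypothetical example: the range query returns 3 points, 2 of them in the true top-4.
range_result = [(1.0, 1.0), (2.0, 0.0), (4.0, 5.0)]
true_top_k = [(1.0, 1.0), (2.0, 0.0), (0.0, 3.0), (3.0, 0.0)]
recall, precision = recall_precision(range_result, true_top_k)
```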

12
Sampling-based Method
  • Deciding the magnification factor c for a given
    recall:
  • fix k and plot recall against MF
  • use linear interpolation to compute the needed
    magnification factor c from the graph
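
The interpolation step can be sketched as follows, assuming the recall-vs-MF curve is available as a list of measured (MF, recall) pairs with non-decreasing recall. The function name is hypothetical and the curve values are made-up illustration numbers, not results from the paper.

```python
def magnification_for_recall(curve, target_recall):
    """Linearly interpolate the MF needed to reach target_recall.

    curve: list of (mf, recall) pairs sorted by mf, recall non-decreasing.
    """
    for (c0, r0), (c1, r1) in zip(curve, curve[1:]):
        if r0 <= target_recall <= r1:
            if r1 == r0:                  # flat segment: smaller MF suffices
                return c0
            # Straight line between the two neighbouring measurements.
            return c0 + (target_recall - r0) * (c1 - c0) / (r1 - r0)
    raise ValueError("target recall lies outside the measured curve")

# Made-up curve for one fixed k: (magnification factor, observed recall).
curve = [(1.0, 0.5), (2.0, 0.75), (4.0, 1.0)]
c = magnification_for_recall(curve, 0.875)
```

A target recall of 0.875 falls halfway between the measurements at MF 2 and MF 4, so the interpolated factor is 3.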

13
Experimental Results

14
Experimental Results

15
Experimental Results

16
Experimental Results

17
Conclusions
  • This paper presents a sampling-based method to
    process approximate top-k queries.
  • Experimental results show that
  • the proposed method outperforms the
    histogram-based method
  • the mapping scheme scales well for
    high-dimensional data.
  • Easy to implement and maintain!