Group Nearest Neighbor Queries

1
Group Nearest Neighbor Queries
Dimitris Papadias (1), Qiongmao Shen (1), Yufei Tao (2),
Kyriakos Mouratidis (1)
(1) Department of Computer Science, Hong Kong
University of Science and Technology
(2) Department of Computer Science, City University of Hong
Kong
2
Conventional NN search with R-trees
  • Depth-first [Roussopoulos et al., SIGMOD 95]
  • Best-first traversal [Hjaltason and Samet, TODS
    99]: incremental and optimal

3
Group NN queries
  • Input: a set $P = \{p_1, \dots, p_N\}$ of static data points
    indexed by an R-tree, and a group of query points
    $Q = \{q_1, \dots, q_n\}$.
  • Output: the $k$ ($\ge 1$) data point(s) with the
    smallest sum of distances to all points in Q.
  • The distance between a data point p and Q is
    defined as $dist(p, Q) = \sum_{i=1}^{n} |pq_i|$, where $|pq_i|$ is
    the Euclidean distance between p and query point
    $q_i$, and n is the cardinality of Q.
  • Example: three users at locations q1, q2, q3 want
    to find a meeting point (e.g., a restaurant); the
    corresponding query returns the data point p that
    minimizes the sum of Euclidean distances $|pq_i|$
    for $1 \le i \le 3$.
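
The definition above can be checked against a linear scan. The following is a minimal sketch, not one of the R-tree methods from the slides: it evaluates dist(p, Q) for every data point. The function names and the sample coordinates are hypothetical.

```python
import math

def dist_to_group(p, Q):
    """Sum of Euclidean distances from data point p to every query point in Q."""
    return sum(math.dist(p, q) for q in Q)

def gnn_brute_force(P, Q, k=1):
    """Return the k points of P with the smallest dist(p, Q), by linear scan."""
    return sorted(P, key=lambda p: dist_to_group(p, Q))[:k]

# Hypothetical data: three users (Q) and four candidate restaurants (P).
Q = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
P = [(2.0, 1.0), (0.0, 0.0), (5.0, 5.0), (2.0, 3.0)]
best = gnn_brute_force(P, Q, k=1)[0]
print(best, round(dist_to_group(best, Q), 3))
```

A scan like this costs one pass over all of P per query, which is the baseline the index-based methods on the following slides try to beat.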

4
Multiple Query Method (MQM)
  • Motivated by the threshold algorithm [Fagin et
    al., PODS 01] for top-k queries.
  • Idea: Perform incremental NN queries for each
    point in Q and combine their results.
  • Example: <p10, 7>, <p11, 6>, T = 5 (2 + 3)
  • <p11, 6>
  • T = 6 (3 + 3)
  • MQM terminates
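
The idea above can be sketched for k = 1 as follows. This is an illustration under stated assumptions, not the authors' implementation: the incremental NN search is simulated by sorting P around each query point instead of traversing an R-tree, and the names `incremental_nn` and `mqm` are hypothetical. The threshold T is the sum of the distances of the last neighbor retrieved for each query point; no unseen point can have a group distance below T.

```python
import math

def incremental_nn(P, q):
    """Yield (distance, point) in ascending distance from q.
    Brute-force stand-in for an incremental R-tree NN search (assumption)."""
    for p in sorted(P, key=lambda p: math.dist(p, q)):
        yield math.dist(p, q), p

def mqm(P, Q):
    streams = [incremental_nn(P, q) for q in Q]
    thresholds = [0.0] * len(Q)      # t_i: distance of the last NN retrieved for q_i
    best_point, best_dist = None, math.inf
    while sum(thresholds) < best_dist:
        progressed = False
        for i, stream in enumerate(streams):     # round-robin over query points
            item = next(stream, None)
            if item is None:
                continue                         # this stream is exhausted
            progressed = True
            thresholds[i], p = item
            d = sum(math.dist(p, q) for q in Q)  # global distance of the candidate
            if d < best_dist:
                best_point, best_dist = p, d
            if sum(thresholds) >= best_dist:     # T >= best_dist: stop early
                break
        if not progressed:                       # every point has been examined
            break
    return best_point, best_dist
```

With a real index, each call to `next` would resume a best-first traversal rather than read from a pre-sorted list.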

5
Single Point Method (SPM)
  • We stop when we find the first point p such that
    $n \cdot |pq| - dist(q, Q) \ge dist(best\_NN, Q)$. This is
    because $dist(p, Q) \ge n \cdot |pq| - dist(q, Q)$ and,
    therefore, $dist(p, Q) \ge dist(best\_NN, Q)$.
  • Heuristic 1: Let q be the centroid of Q and
    best_dist be the distance of the best GNN found
    so far. Node N can be pruned if
    $mindist(N, q) \ge (best\_dist + dist(q, Q)) / n$.
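
The stopping rule can be sketched as below, assuming the lower bound dist(p, Q) >= n·|pq| - dist(q, Q), which follows from the triangle inequality for any choice of q. Two simplifications are assumptions of this sketch: q is taken as the arithmetic mean of Q, and the NN search around q is simulated by sorting P. The name `spm` and the `examined` counter are hypothetical.

```python
import math

def spm(P, Q):
    n = len(Q)
    q = tuple(sum(c) / n for c in zip(*Q))       # arithmetic mean of Q (assumption)
    dist_q_Q = sum(math.dist(q, qi) for qi in Q)
    best_point, best_dist = None, math.inf
    examined = 0
    # Visit data points in ascending distance from q (stand-in for an NN search).
    for p in sorted(P, key=lambda p: math.dist(p, q)):
        # Lower bound on dist(p, Q); it only grows as |pq| grows.
        if n * math.dist(p, q) - dist_q_Q >= best_dist:
            break
        examined += 1
        d = sum(math.dist(p, qi) for qi in Q)
        if d < best_dist:
            best_point, best_dist = p, d
    return best_point, best_dist, examined
```

Because points arrive in ascending |pq|, the first point that fails the bound guarantees that all later points fail it too, which is why a single `break` suffices.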

6
Minimum Bounding Method (MBM)
  • Applies the MBR of Q to prune the search space.
  • Heuristic 2: Let M be the MBR of Q, and best_dist
    be the distance of the best GNN found so far. A
    node N cannot contain qualifying points if
    $n \cdot mindist(N, M) \ge best\_dist$.
  • Heuristic 3: A node N cannot contain qualifying
    points if
    $\sum_{i=1}^{n} mindist(N, q_i) \ge best\_dist$.
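
The two pruning tests can be sketched as follows. The sketch takes Heuristic 2 to be n·mindist(N, M) >= best_dist and Heuristic 3 to be the sum of mindist(N, q_i) over all q_i being >= best_dist; both are valid lower bounds on dist(p, Q) for any p inside N, but their exact form is an assumption here. Rectangles are represented as hypothetical `(lo, hi)` corner tuples.

```python
import math

def mindist_point_rect(p, rect):
    """Minimum distance between point p and axis-aligned rectangle (lo, hi)."""
    lo, hi = rect
    return math.sqrt(sum(max(l - c, 0.0, c - h) ** 2
                         for c, l, h in zip(p, lo, hi)))

def mindist_rect_rect(a, b):
    """Minimum distance between two axis-aligned rectangles."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(bl - ah, 0.0, al - bh) ** 2
                         for al, ah, bl, bh in zip(alo, ahi, blo, bhi)))

def mbr(points):
    """Minimum bounding rectangle of a list of points."""
    return (tuple(min(c) for c in zip(*points)),
            tuple(max(c) for c in zip(*points)))

def can_prune(N, Q, M, best_dist):
    """True if node rectangle N cannot contain a point better than best_dist."""
    if len(Q) * mindist_rect_rect(N, M) >= best_dist:   # cheap test (one mindist)
        return True
    # Tighter but costlier test: one mindist per query point.
    return sum(mindist_point_rect(q, N) for q in Q) >= best_dist
```

Trying the cheap test first reflects the usual trade-off: the second test prunes more nodes but costs n distance computations per node.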

7
Group Closest Pairs Method (GCP)
  • The data points P and the query points Q are each
    indexed by an R-tree.
  • GCP outputs closest pairs <pi, qj> incrementally and
    keeps the count(pi) of pairs in which pi has
    appeared, as well as the accumulated distance
    (curr_dist(pi)) of pi over all these pairs.
  • When the count of pi equals the cardinality n of
    Q, the global distance of pi, with respect to all
    query points, has been computed.
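
The bookkeeping described above can be sketched as follows. The incremental closest-pair join over two R-trees is replaced by sorting all pairs, which is a brute-force stand-in. The termination test is also an assumption of this sketch rather than something stated on the slide: since every remaining pair has distance at least d (the current pair's distance), a point with count c can do no better than curr_dist + (n - c)·d, and an unseen point no better than n·d.

```python
import math
from itertools import product

def gcp(P, Q):
    """Group closest pairs sketch; assumes the points of P are distinct tuples."""
    n = len(Q)
    # Stand-in for an incremental closest-pair algorithm over two R-trees.
    pairs = sorted(product(P, Q), key=lambda pq: math.dist(pq[0], pq[1]))
    count, curr_dist = {}, {}     # per data point: pairs seen, accumulated distance
    best_point, best_dist = None, math.inf
    for p, q in pairs:
        d = math.dist(p, q)
        count[p] = count.get(p, 0) + 1
        curr_dist[p] = curr_dist.get(p, 0.0) + d
        if count[p] == n:         # global distance of p is now complete
            if curr_dist[p] < best_dist:
                best_point, best_dist = p, curr_dist[p]
            del count[p], curr_dist[p]
        # Stop once neither unseen nor partially seen points can beat best_dist.
        if n * d >= best_dist and all(
                curr_dist[x] + (n - c) * d >= best_dist for x, c in count.items()):
            break
    return best_point, best_dist
```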

8
Comments on the performance of GCP
9
File Multiple Query Method (F-MQM)
  • Adaptation of MQM for the case where Q does not fit in memory.
  • F-MQM sorts query points according to their
    Hilbert value and splits Q into blocks Q1, ...,
    Qm that fit in memory.
  • For each block, it incrementally computes the GNN
    using one of the main-memory algorithms.
  • It finally combines their results using MQM.
  • Complication: once a NN of a group has been
    retrieved, we cannot compute its global distance
    (i.e., with respect to all query points)
    immediately.

10
F-MQM (cont)
  • Solution: lazy evaluation.
  • First, we find the GNN p1 of the first group Q1.
  • Then, we load in memory the second group Q2 and
    retrieve its NN p2. At the same time, we also
    compute the distance between p1 and Q2.
  • Similarly, when we load Q3, we update the current
    distances of p1 and p2 taking into account the
    objects of the third group.
  • After the end of the first round, we only have
    one data point (p1), whose global distance with
    respect to all query points has been computed.
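
The lazy evaluation described above can be sketched as follows, under several assumptions: the blocks are already given (Hilbert sorting is omitted), the per-block incremental GNN search is simulated by sorting P, and candidates still pending when the threshold test fires are completed in one final pass over the blocks. The names `f_mqm`, `absorb`, and `pending` are hypothetical.

```python
import math

def group_dist(p, block):
    """Sum of Euclidean distances from p to the query points of one block."""
    return sum(math.dist(p, q) for q in block)

def f_mqm(P, blocks):
    m = len(blocks)
    # One incremental GNN stream per block; sorting is a brute-force stand-in.
    streams = [iter(sorted(P, key=lambda p, b=b: group_dist(p, b))) for b in blocks]
    thresholds = [0.0] * m     # distance of the last GNN retrieved per block
    pending = {}               # candidate -> [partial distance, blocks counted]
    done = set()               # candidates whose global distance is complete
    best_point, best_dist = None, math.inf

    def absorb(j):
        """Add block j's contribution to every pending candidate missing it."""
        nonlocal best_point, best_dist
        for c in list(pending):
            acc = pending[c]
            if j not in acc[1]:
                acc[0] += group_dist(c, blocks[j])
                acc[1].add(j)
            if len(acc[1]) == m:           # global distance now known
                if acc[0] < best_dist:
                    best_point, best_dist = c, acc[0]
                done.add(c)
                del pending[c]

    j = 0
    while sum(thresholds) < best_dist:
        p = next(streams[j], None)         # block j is "loaded" at this step
        if p is None:
            break
        thresholds[j] = group_dist(p, blocks[j])
        if p not in done and p not in pending:
            pending[p] = [0.0, set()]
        absorb(j)
        j = (j + 1) % m
    for j in range(m):                     # complete candidates still pending
        absorb(j)
    return best_point, best_dist
```

As on the slide, the candidate retrieved from the first block is the only one whose global distance is complete after the first round; later candidates complete one block load at a time.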

11
File Minimum Bounding Method (F-MBM)
  • First, the points of Q are sorted by their
    Hilbert value and are assigned to groups (that
    fit in memory) according to this order.
  • For each group Qi, F-MBM keeps in memory its MBR
    Mi and cardinality ni (but not its contents).
  • F-MBM descends the R-tree of P (in depth-first or
    best-first traversal), only following nodes that
    may contain qualifying points.

  • Heuristic: Let best_dist be the distance of the
    best GNN found so far. A node N can be safely
    pruned if
    $\sum_{i=1}^{m} n_i \cdot mindist(N, M_i) \ge best\_dist$.
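
A sketch of this pruning test, assuming the condition is that the sum of n_i·mindist(N, M_i) over all groups is at least best_dist. Only the MBR and cardinality of each group are needed, which matches what F-MBM keeps in memory. The `(lo, hi)` rectangle representation and the function names are hypothetical.

```python
import math

def mindist_rect_rect(a, b):
    """Minimum distance between two axis-aligned rectangles given as (lo, hi)."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(bl - ah, 0.0, al - bh) ** 2
                         for al, ah, bl, bh in zip(alo, ahi, blo, bhi)))

def can_prune_fmbm(N, group_summaries, best_dist):
    """group_summaries holds one (M_i, n_i) pair per group Q_i.
    Each of the n_i points of Q_i is at least mindist(N, M_i) from any point in N."""
    bound = sum(n_i * mindist_rect_rect(N, M_i) for M_i, n_i in group_summaries)
    return bound >= best_dist

# Hypothetical node rectangle and two group summaries.
N = ((10.0, 0.0), (11.0, 1.0))
groups = [(((0.0, 0.0), (2.0, 1.0)), 3), (((4.0, 0.0), (6.0, 1.0)), 2)]
print(can_prune_fmbm(N, groups, best_dist=30.0))
```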
12
Experiments - Settings
  • Two real datasets: (i) PP, with 24493 populated
    places in North America, and (ii) TS, which
    contains the centroids of 194971 MBRs
    representing streams (poly-lines) of Iowa,
    Kansas, Missouri and Nebraska.
  • For all experiments we use a Pentium 2.4GHz CPU
    with 1GByte memory.
  • The page size of the R-trees [BKSS00] is set to
    1KByte, resulting in a capacity of 50 entries per
    node.
  • All implementations are based on the best-first
    traversal.
  • For memory-resident queries we use workloads of
    100 queries.
  • Each query has a number n of points, distributed
    uniformly in an MBR of area M, which is randomly
    generated in the workspace of P.
  • The values of n and M are identical for all
    queries in the same workload (i.e., the only
    change between two queries in the same workload
    is the position of the query MBR).
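
A workload of this kind could be generated roughly as follows. This is a sketch under assumptions not stated on the slide: M is treated as a fraction of the workspace area, the query MBR has the same aspect ratio as the workspace, and the workspace defaults to the unit square. The function names and the seed are hypothetical.

```python
import random

def make_query(n, area_fraction, rng, workspace=((0.0, 0.0), (1.0, 1.0))):
    """n points uniform in a randomly placed MBR covering area_fraction of the workspace."""
    (x0, y0), (x1, y1) = workspace
    scale = area_fraction ** 0.5             # same aspect ratio as the workspace
    mw, mh = (x1 - x0) * scale, (y1 - y0) * scale
    ox = rng.uniform(x0, x1 - mw)            # lower-left corner of the query MBR
    oy = rng.uniform(y0, y1 - mh)
    return [(rng.uniform(ox, ox + mw), rng.uniform(oy, oy + mh)) for _ in range(n)]

def make_workload(num_queries, n, area_fraction, seed=0):
    """All queries share n and the MBR size; only the MBR position changes."""
    rng = random.Random(seed)
    return [make_query(n, area_fraction, rng) for _ in range(num_queries)]

workload = make_workload(num_queries=100, n=64, area_fraction=0.08)
```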

13
Cost vs. query cardinality n (memory-resident Q)
Figure: Cost vs. cardinality n of Q (M = 8, k = 8, TS dataset)
14
Cost vs. size of MBR M of Q (memory-resident Q)
Figure: Cost vs. size of MBR of Q (n = 64, k = 8, TS dataset)
15
Cost vs. size of MBR M of Q (disk-resident Q)
  • The workspaces of P and Q have the same centroid.
  • F-MQM uses MBM for processing each group.
  • Each group contains 10,000 query points.
Figure: Cost vs. size of MBR of Q (k = 8, P = TS, Q = PP)
Figure: Cost vs. size of MBR of Q (k = 8, P = PP, Q = TS)
16
Cost vs. overlap area (disk-resident Q)
Figure: Cost vs. overlap area (k = 8, P = TS, Q = PP)
Figure: Cost vs. overlap area (k = 8, P = PP, Q = TS)
17
Conclusions
  • A novel problem and several interesting
    processing techniques.
  • Future Work
  • Other aggregate functions (e.g., max)
  • Weighted queries (e.g., some users are more
    important than others)
  • Aggregate access methods for more efficient
    processing (e.g., aR-trees)
  • Sub-group queries (e.g., find the data object
    that minimizes the sum of distances with respect
    to g out of the n query points, where g < n)
  • Application to related problems (e.g.,
    discovering k-medoids in large spatial databases)