Title: Group Nearest Neighbor Queries
1 Group Nearest Neighbor Queries
Dimitris Papadias (1), Qiongmao Shen (1), Yufei Tao (2), Kyriakos Mouratidis (1)
(1) Department of Computer Science, Hong Kong University of Science and Technology
(2) Department of Computer Science, City University of Hong Kong
2 Conventional NN search with R-trees
- Depth-first traversal [Roussopoulos et al., SIGMOD 95]
- Best-first traversal [Hjaltason and Samet, TODS 99]: incremental and optimal
3 Group NN queries
- Input: a set $P=\{p_1,\dots,p_N\}$ of static data points indexed by an R-tree, and a group of query points $Q=\{q_1,\dots,q_n\}$.
- Output: the k ($\ge 1$) data point(s) with the smallest sum of distances to all points in Q.
- The distance between a data point p and Q is defined as $\mathrm{dist}(p,Q)=\sum_{i=1}^{n}|pq_i|$, where $|pq_i|$ is the Euclidean distance between p and query point $q_i$, and n is the cardinality of Q.
- Example: three users at locations $q_1$, $q_2$, $q_3$ want to find a meeting point (e.g., a restaurant); the corresponding query returns the data point p that minimizes the sum of Euclidean distances $|pq_i|$ for $1\le i\le 3$.
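As a baseline for the definition above, the following minimal sketch evaluates dist(p,Q) for every data point with a linear scan and no R-tree. The function names (`group_dist`, `gnn_brute_force`) and the sample coordinates are illustrative assumptions, not part of the original slides.

```python
import heapq
import math

def group_dist(p, Q):
    """Sum of Euclidean distances from data point p to every query point in Q."""
    return sum(math.dist(p, q) for q in Q)

def gnn_brute_force(P, Q, k=1):
    """Return the k points of P with the smallest group_dist to Q (linear scan)."""
    return heapq.nsmallest(k, P, key=lambda p: group_dist(p, Q))

# Hypothetical data: three users and four candidate meeting points.
Q = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
P = [(2.0, 1.0), (10.0, 10.0), (0.0, 0.0), (5.0, 5.0)]
best = gnn_brute_force(P, Q, k=1)[0]  # (2.0, 1.0)
```

Such a scan touches every data point, which is what the R-tree based methods on the following slides try to avoid; it is still useful as a correctness reference.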
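4 Multiple Query Method (MQM)
- Motivated by the threshold algorithm [Fagin et al., PODS 01] for top-k queries.
- Idea: perform incremental NN queries for each point in Q and combine their results.
- Example (from the slide animation): <p10, 7>, <p11, 6>, threshold T = 5 (2+3)
- Best GNN so far: <p11, 6>
- Threshold becomes T = 6 (3+3)
- MQM terminates, since the best distance found (6) no longer exceeds the threshold T.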
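The idea behind MQM can be sketched as follows. This is a simplified illustration rather than the authors' implementation: a sorted list stands in for the incremental R-tree NN search of each query point, and the names `mqm`, `group_dist`, and `thresholds` are assumptions made for the example. The threshold T is the sum of the distances of the most recently retrieved neighbor of each query point; any data point not yet retrieved must have a group distance of at least T.

```python
import math

def group_dist(p, Q):
    """Sum of Euclidean distances from p to all query points."""
    return sum(math.dist(p, q) for q in Q)

def mqm(P, Q):
    """Threshold-style GNN search: round-robin incremental NN per query point."""
    if not Q:
        raise ValueError("Q must contain at least one query point")
    # Assumption: sorting simulates an incremental NN stream for each q.
    streams = [iter(sorted(P, key=lambda p, q=q: math.dist(p, q))) for q in Q]
    thresholds = [0.0] * len(Q)          # distance of the last NN seen per q
    best, best_dist = None, math.inf
    while best_dist > sum(thresholds):   # T = sum of per-point thresholds
        for i, q in enumerate(Q):
            p = next(streams[i], None)
            if p is None:                # every data point has been examined
                return best, best_dist
            thresholds[i] = math.dist(p, q)
            d = group_dist(p, Q)
            if d < best_dist:
                best, best_dist = p, d
            if best_dist <= sum(thresholds):
                break                    # no unseen point can beat best
    return best, best_dist
```

A real implementation would also avoid recomputing the group distance of a point that has already been returned by another stream; the sketch omits this for brevity.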
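5 Single Point Method (SPM)
- We stop when we find the first point p such that $n\cdot|pq| - \mathrm{dist}(q,Q) \ge \mathrm{dist}(best\_NN,Q)$. This is because $\mathrm{dist}(p,Q) \ge n\cdot|pq| - \mathrm{dist}(q,Q)$ (by the triangle inequality) and, therefore, $\mathrm{dist}(p,Q) \ge \mathrm{dist}(best\_NN,Q)$.
- Heuristic 1: Let q be the centroid of Q and best_dist be the distance of the best GNN found so far. Node N can be pruned if $\mathrm{mindist}(N,q) \ge (best\_dist + \mathrm{dist}(q,Q))/n$.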
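The stopping rule can be illustrated with a small sketch. It makes two assumptions that are not in the slides: the arithmetic mean is used as the centroid q (the bound holds for any choice of q, only its tightness changes), and a sorted list replaces the incremental NN search around q. The names `spm` and `group_dist` are hypothetical.

```python
import math

def group_dist(p, Q):
    """Sum of Euclidean distances from p to all query points."""
    return sum(math.dist(p, q) for q in Q)

def spm(P, Q):
    """Scan data points by distance from a single point q and stop early."""
    n = len(Q)
    # Assumption: arithmetic mean as the centroid of Q.
    q = tuple(sum(c) / n for c in zip(*Q))
    dq = group_dist(q, Q)                      # dist(q, Q)
    best, best_dist = None, math.inf
    for p in sorted(P, key=lambda p: math.dist(p, q)):
        # Lower bound from the triangle inequality: dist(p,Q) >= n*|pq| - dist(q,Q).
        if n * math.dist(p, q) - dq >= best_dist:
            break                              # all later points are even farther from q
        d = group_dist(p, Q)
        if d < best_dist:
            best, best_dist = p, d
    return best, best_dist
```

Because points are visited in ascending distance from q, the lower bound only grows, so the first point that fails the test ends the search.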
6 Minimum Bounding Method (MBM)
- Applies the MBR of Q to prune the search space.
- Heuristic 2: Let M be the MBR of Q, and best_dist be the distance of the best GNN found so far. A node N cannot contain qualifying points if $\mathrm{mindist}(N,M) \ge best\_dist/n$.
- Heuristic 3: A node N cannot contain qualifying points if $\sum_{i=1}^{n}\mathrm{mindist}(N,q_i) \ge best\_dist$.
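The two pruning tests can be sketched in code. The sketch assumes that Heuristic 2 takes the form n · mindist(N, M) ≥ best_dist and Heuristic 3 the form Σ mindist(N, qi) ≥ best_dist, both of which follow from the definition of dist(p,Q) because every query point lies inside M. Rectangles are represented as `(xmin, ymin, xmax, ymax)` tuples, and all function names are hypothetical.

```python
import math

def mindist_rect_point(rect, q):
    """Minimum distance between an axis-aligned rectangle and a point."""
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - q[0], 0.0, q[0] - xmax)
    dy = max(ymin - q[1], 0.0, q[1] - ymax)
    return math.hypot(dx, dy)

def mindist_rect_rect(a, b):
    """Minimum distance between two axis-aligned rectangles."""
    dx = max(a[0] - b[2], 0.0, b[0] - a[2])
    dy = max(a[1] - b[3], 0.0, b[1] - a[3])
    return math.hypot(dx, dy)

def mbr(points):
    """Minimum bounding rectangle of a list of 2D points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def can_prune(node_rect, Q, best_dist):
    """Apply the cheap MBR test first, then the tighter per-point test."""
    n = len(Q)
    if n * mindist_rect_rect(node_rect, mbr(Q)) >= best_dist:   # Heuristic 2 (assumed form)
        return True
    # Heuristic 3 (assumed form): costs n distance computations but prunes more.
    return sum(mindist_rect_point(node_rect, q) for q in Q) >= best_dist

# Hypothetical node and query group.
Q = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
node = (5.0, 0.0, 6.0, 1.0)
```

Ordering the tests this way reflects a common design choice: the MBR test needs a single distance computation, so the per-point test is only evaluated for nodes that survive it.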
7 Group Closest Pairs Method (GCP)
- The data points P and the query points Q are each indexed by an R-tree.
- GCP outputs closest pairs <pi, qj> incrementally and keeps, for each pi, the number count(pi) of pairs in which pi has appeared, as well as the accumulated distance curr_dist(pi) of pi over all these pairs.
- When count(pi) equals the cardinality n of Q, the global distance of pi with respect to all query points has been computed.
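The bookkeeping described above can be sketched as follows. Two points are assumptions of this sketch rather than statements from the slides: a fully sorted list of pairs stands in for an incremental closest-pair join over the two R-trees, and the termination test uses the lower bound curr_dist(pj) + (n − count(pj)) · d for every incomplete point pj, where d is the distance of the last pair output. The names `gcp`, `count`, and `curr_dist` mirror the slide where possible.

```python
import math

def gcp(P, Q):
    """GNN via incrementally reported closest pairs <p_i, q_j>."""
    n = len(Q)
    # Assumption: sorting all pairs simulates an incremental closest-pair join.
    pairs = sorted((math.dist(p, q), i) for i, p in enumerate(P) for q in Q)
    count = [0] * len(P)          # pairs seen so far for each data point
    curr_dist = [0.0] * len(P)    # accumulated distance for each data point
    best, best_dist = None, math.inf
    for d, i in pairs:
        count[i] += 1
        curr_dist[i] += d
        if count[i] == n and curr_dist[i] < best_dist:
            best, best_dist = P[i], curr_dist[i]   # global distance now known
        # Every pair not yet output has distance >= d, so each incomplete
        # point j has global distance >= curr_dist[j] + (n - count[j]) * d.
        if all(curr_dist[j] + (n - count[j]) * d >= best_dist
               for j in range(len(P)) if count[j] < n):
            break
    return best, best_dist
```

The termination check here scans all data points after every pair, which is acceptable for illustration only; it also shows why GCP may need to keep partial state for many points at once.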
8 Comments on the performance of GCP
9 File Multiple Query Method (F-MQM)
- Adaptation of MQM for the case where Q does not fit in memory.
- F-MQM sorts the query points according to their Hilbert value and splits Q into blocks Q1, ..., Qm that fit in memory.
- For each block, it incrementally computes the GNN using one of the main-memory algorithms.
- It finally combines their results using MQM.
- Complication: once a NN of a group has been retrieved, we cannot compute its global distance (i.e., with respect to all query points) immediately.
10 F-MQM (cont.)
- Solution: lazy evaluation.
- First, we find the GNN p1 of the first group Q1.
- Then, we load in memory the second group Q2 and retrieve its NN p2. At the same time, we also compute the distance between p1 and Q2.
- Similarly, when we load Q3, we update the current distances of p1 and p2, taking into account the objects of the third group.
- After the end of the first round, we only have one data point (p1) whose global distance with respect to all query points has been computed.
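The lazy evaluation described above can be sketched as a simplified, in-memory simulation. It is not the authors' implementation: each block is an ordinary Python list that is "loaded" in round-robin order, a sorted list replaces the incremental per-block GNN search, and the names `f_mqm`, `block_dist`, and `partial` are assumptions. A candidate retrieved while block j is loaded accumulates its distance block by block and becomes complete after the next m − 1 block loads.

```python
import math

def block_dist(p, block):
    """Sum of distances from p to the query points of one block."""
    return sum(math.dist(p, q) for q in block)

def f_mqm(P, blocks):
    """GNN over query blocks with lazily accumulated global distances."""
    m = len(blocks)
    if m == 0:
        raise ValueError("at least one query block is required")
    # Assumption: sorting simulates the incremental GNN stream of each block.
    streams = [iter(sorted(range(len(P)), key=lambda i, b=b: block_dist(P[i], b)))
               for b in blocks]
    thresholds = [0.0] * m     # distance of the last candidate per block
    partial = {}               # candidate index -> [accumulated dist, blocks counted]
    seen = set()
    best, best_dist = None, math.inf
    exhausted = False
    j = 0
    while True:
        retrieving = not exhausted and best_dist > sum(thresholds)
        if not retrieving and not partial:
            return best, best_dist
        block = blocks[j]                      # "load" block j into memory
        if retrieving:
            i = next(streams[j], None)
            if i is None:
                exhausted = True               # every data point is a candidate
            else:
                thresholds[j] = block_dist(P[i], block)
                if i not in seen:
                    seen.add(i)
                    partial[i] = [0.0, 0]
        for i in list(partial):                # lazy update of pending candidates
            acc = partial[i]
            acc[0] += block_dist(P[i], block)
            acc[1] += 1
            if acc[1] == m:                    # global distance is now complete
                del partial[i]
                if acc[0] < best_dist:
                    best, best_dist = P[i], acc[0]
        j = (j + 1) % m
```

Once the best complete distance no longer exceeds the threshold, the sketch stops retrieving new candidates but keeps cycling through blocks until every pending candidate has been completed, since a pending candidate could still be the answer.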
11 File Minimum Bounding Method (F-MBM)
- First, the points of Q are sorted by their Hilbert value and are assigned to groups (that fit in memory) according to this order.
- For each group Qi, F-MBM keeps in memory its MBR Mi and cardinality ni (but not its contents).
- F-MBM descends the R-tree of P (in depth-first or best-first traversal), only following nodes that may contain qualifying points.
- Heuristic: Let best_dist be the distance of the best GNN found so far. A node N can be safely pruned if $\sum_{i=1}^{m} n_i\cdot\mathrm{mindist}(N,M_i) \ge best\_dist$, where m is the number of groups.
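A minimal sketch of this pruning test is given below, assuming the condition is the weighted sum Σ ni · mindist(N, Mi) ≥ best_dist over all groups (each of the ni points of group Qi lies inside Mi, so it is at least mindist(N, Mi) away from any point in N). Rectangles are `(xmin, ymin, xmax, ymax)` tuples; the function names and sample values are hypothetical.

```python
import math

def mindist_rect_rect(a, b):
    """Minimum distance between two axis-aligned rectangles."""
    dx = max(a[0] - b[2], 0.0, b[0] - a[2])
    dy = max(a[1] - b[3], 0.0, b[1] - a[3])
    return math.hypot(dx, dy)

def fmbm_can_prune(node_rect, groups, best_dist):
    """groups is a list of (M_i, n_i): the MBR and cardinality of each query group."""
    bound = sum(n_i * mindist_rect_rect(node_rect, M_i) for M_i, n_i in groups)
    return bound >= best_dist

# Hypothetical example: two query groups kept in memory as (MBR, cardinality).
groups = [((0.0, 0.0, 2.0, 2.0), 100), ((8.0, 0.0, 9.0, 1.0), 50)]
node = (5.0, 0.0, 6.0, 1.0)
```

Only the MBRs and cardinalities are needed for this test, which is why the contents of the groups can stay on disk while the R-tree of P is traversed.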
12 Experiments - Settings
- Two real datasets: (i) PP, with 24493 populated places in North America, and (ii) TS, which contains the centroids of 194971 MBRs representing streams (poly-lines) of Iowa, Kansas, Missouri and Nebraska.
- For all experiments we use a Pentium 2.4 GHz CPU with 1 GByte of memory.
- The page size of the R-trees [BKSS00] is set to 1 KByte, resulting in a capacity of 50 entries per node.
- All implementations are based on the best-first traversal.
- For memory-resident queries we use workloads of 100 queries.
- Each query has a number n of points, distributed uniformly in an MBR of area M, which is randomly generated in the workspace of P.
- The values of n and M are identical for all queries in the same workload (i.e., the only change between two queries in the same workload is the position of the query MBR).
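A workload of this kind could be generated along the following lines. The sketch rests on assumptions that the slides do not state: the workspace is normalized to the unit square, M is given as a fraction of the workspace area, and every query MBR is a square. The function name `make_workload` and its parameters are hypothetical.

```python
import random

def make_workload(n, M, num_queries=100, seed=0):
    """Generate num_queries query groups of n points each.

    Assumptions: unit-square workspace, square query MBRs, and M expressed
    as a fraction of the workspace area (0 < M <= 1).
    """
    rng = random.Random(seed)
    side = M ** 0.5
    workload = []
    for _ in range(num_queries):
        # Only the position of the MBR changes between queries.
        x0 = rng.uniform(0.0, 1.0 - side)
        y0 = rng.uniform(0.0, 1.0 - side)
        query = [(x0 + rng.uniform(0.0, side), y0 + rng.uniform(0.0, side))
                 for _ in range(n)]
        workload.append(query)
    return workload

workload = make_workload(n=64, M=0.08)
```

Fixing the seed keeps the workload reproducible across the compared algorithms.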
13 Cost vs. query cardinality n (memory-resident Q)
[Figure: cost vs. cardinality n of Q (M=8, k=8, TS dataset)]
14 Cost vs. size of MBR M of Q (memory-resident Q)
[Figure: cost vs. size of MBR of Q (n=64, k=8, TS dataset)]
15 Cost vs. size of MBR M of Q (disk-resident Q)
- The workspaces of P and Q have the same centroid.
- F-MQM uses MBM for processing each group.
- Each group contains 10,000 query points.
[Figure: cost vs. size of MBR of Q (k=8, P=TS, Q=PP)]
[Figure: cost vs. size of MBR of Q (k=8, P=PP, Q=TS)]
16 Cost vs. overlap area (disk-resident Q)
[Figure: cost vs. overlap area (k=8, P=TS, Q=PP)]
[Figure: cost vs. overlap area (k=8, P=PP, Q=TS)]
17 Conclusions
- A novel problem and several interesting processing techniques.
- Future work:
  - Other aggregate functions (e.g., max)
  - Weighted queries (e.g., some users are more important than others)
  - Aggregate access methods for more efficient processing (e.g., aR-trees)
  - Sub-group queries (e.g., find the data object that minimizes the sum of distances with respect to g out of the n query points, where g < n)
  - Application to related problems (e.g., discovering k-medoids in large spatial databases)