Title: Reverse Nearest Neighbor Aggregates
1. Reverse Nearest Neighbor Aggregates
Over Data Streams
Flip Korn, S. Muthukrishnan and Divesh Srivastava.
VLDB 2002
Alexander Izbinsky
2. Background
- RNN(q) returns the set of data points that have
the query point q as their nearest neighbor.
- Advanced database applications
- Fixed wireless telephone access application
- Load detection problem: count how many users
are currently using a specific base station q.
If q's load is too heavy, activate an inactive
base station to lighten the load of the
overloaded base station.
- Asymmetric property
- The nearest neighbor relation is not symmetric:
the set of points that are closest to a query
point (i.e., the nearest neighbors) differs from
the set of points that have the query point as
their nearest neighbor (called the reverse
nearest neighbors).
3. Nonsymmetrical Property of RNN Queries
- NN(q) = p does not imply NN(p) = q
- If p is the nearest neighbor of q, then q need
not be the nearest neighbor of p (in this case
the nearest neighbor of p is r).
- Hence efficient NN algorithms cannot be directly
applied to solve RNN problems; dedicated RNN
algorithms are needed.
- A straightforward solution: check for each
point whether it has q as its nearest neighbor.
Not suitable for large data sets!
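The straightforward solution and the asymmetry of the NN relation can be illustrated with a minimal brute-force sketch. The function names and the three sample points are assumptions made for illustration only; they do not come from the paper.

```python
from math import dist


def nearest_neighbor(points, idx):
    """Index of the point closest to points[idx], excluding itself."""
    return min(
        (j for j in range(len(points)) if j != idx),
        key=lambda j: dist(points[idx], points[j]),
    )


def reverse_nearest_neighbors(points, idx):
    """Indices of all points whose nearest neighbor is points[idx].

    Brute force: one NN computation per point, so quadratic work overall.
    """
    return [
        j for j in range(len(points))
        if j != idx and nearest_neighbor(points, j) == idx
    ]


# Hypothetical collinear points q, p, r (illustrative data only).
points = [(0.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
q, p, r = 0, 1, 2

nn_of_q = nearest_neighbor(points, q)            # p is closest to q
nn_of_p = nearest_neighbor(points, p)            # but r is closest to p
rnn_of_q = reverse_nearest_neighbors(points, q)  # nobody has q as its NN
```

Here NN(q) = p while NN(p) = r, so RNN(q) is empty even though q has a nearest neighbor. Every query scans all points and recomputes their nearest neighbors, which is why this approach does not scale.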
4. Two Versions of RNN Problem
- Bichromatic version
- The data points are of two categories, say red
and blue. The RNN query point q is in one of the
categories, say blue. RNN(q) must then determine
the red points that have the query point q as
their closest blue point.
- E.g., in the fixed wireless telephone access application,
clients are red (e.g. call initiation or
termination) and servers are blue (e.g. fixed wireless base stations).
- Monochromatic version
- All points are of the same color.
5. Introduction
- RNN queries have been studied for finite, stored
data sets
- RNN can identify the "influence" of a data point on
the database
- F. Korn and S. Muthukrishnan, Influence Sets
Based on Reverse Nearest Neighbor Queries
- I. Stanoi, M. Riedewald, D. Agrawal, A. El Abbadi,
Discovery of Influence Sets in Frequently Updated
Databases
- C. Yang, King-Ip Lin, An Index Structure for
Efficient Reverse Nearest Neighbor Queries
6. Determining the Influence Set
- Finding the set of customers affected by the
opening of a new store outlet location
- Notifying the subset of subscribers to a digital
library who will find a newly added document most
relevant
- Finding the set of users whose profiles are more
similar to the new service offering than to any
other service
The interest is not in the exact RNN set, but in
aggregates on this set: RNNA!
7. RNNA Application 1
Fixed Wireless Telephony Access
- Fixed physical position
- Defined coverage area
- Calls arrive in streams
- Worst-case signal strength: RNN-MAXDIST
- Load on base station: RNN-COUNT
- Optimization: RNNA problems
8. RNNA Application 2
Highway Traffic Monitoring
- Fixed physical position
- Detect vehicles, estimate speed and length
- User queries arrive in streams
- Periodic updates of closest sensor
- Load on sensor: RNN-COUNT
- Accuracy of information: RNN-MAXDIST
- Optimization: RNNA problems
9. RNNA Computations
- Max-RNNA: given K servers, return the maximum
RNNA over all clients to any of the servers
- List-RNNA: given K servers, return the RNNA
over all clients to each of the servers
- Opt-RNNA: find a set of at most K servers for
which their RNNAs are below a given threshold
Exact computation is not possible
10. RNNA Approximations
- Max-RNN-COUNT
- Insertion and deletion: 3-approximation
- Insertion only: $(1+\epsilon)$-approximation
- Max-RNN-MAXDIST
- $(1+\epsilon)$-approximation
- List-RNN-COUNT and List-RNN-MAXDIST
- Lower and upper bounds as a function of the true
counts
- Opt-RNN-COUNT
- 8-approximation
- Opt-RNN-MAXDIST
- $(1+\epsilon)$-approximation
Space is near-linear in the number of available
servers
11. Related Work
- No previous work on RNNA over data streams
- Algorithms over data streams
- Algorithms for computing RNN over a
conventional DB
12. Algorithms over Data Streams
- Space requirements of selection and sorting as a
function of the number of passes over the data
- J. I. Munro and M. S. Paterson. Selection and
Sorting with Limited Storage
- Formalization of the data stream model
- A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J.
Strauss. Surfing Wavelets on Streams: One-Pass
Summaries for Approximate Aggregate Queries
- M. R. Henzinger, P. Raghavan, S. Rajagopalan.
Computing on Data Streams
13. Algorithms over Data Streams
- Computing the approximate median and other
quantiles in a single pass over the data set
- R. Agrawal, A. Swami. A One-Pass Space-Efficient
Algorithm for Finding Quantiles
- G.S. Manku, S. Rajagopalan, B.G. Lindsay.
Approximate Medians and other Quantiles in One
Pass and with Limited Memory
- G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random
Sampling Techniques for Space Efficient Online
Computation of Order Statistics of Large
Datasets
- M. Greenwald and S. Khanna. Space-Efficient
Online Computation of Quantile Summaries
14. Algorithms over Data Streams
- Computing approximate online quantiles with
probabilistic guarantees over a data stream
- A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J.
Strauss. How to Summarize the Universe: Dynamic
Maintenance of Quantiles
- Histogram construction over a data stream
- A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S.
Muthukrishnan, M.J. Strauss. Fast, Small-Space
Algorithms for Approximate Histogram Maintenance
15. Algorithms over Data Streams
- Maintaining summary structures for
approximate aggregates over a data stream
- A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J.
Strauss. Surfing Wavelets on Streams: One-Pass
Summaries for Approximate Aggregate Queries
- M. R. Henzinger, P. Raghavan, S. Rajagopalan.
Computing on Data Streams
- J. Gehrke, F. Korn, and D. Srivastava. On
Computing Correlated Aggregates over Continual
Data Streams
16. Algorithms over Data Streams: Mining Data Streams
- Construction of decision trees
- P. Domingos, G. Hulten. Mining High-Speed Data
Streams
- J. Gehrke, V. Ganti, R. Ramakrishnan, W.-Y. Loh.
BOAT: Optimistic Decision Tree Construction
- Association rules
- C. Hidber. Online Association Rule Mining
- Similarity matching
- G. Cormode, M. Datar, P. Indyk, S.
Muthukrishnan. Comparing Data Streams Using
Hamming Norms
17. Algorithms over Data Streams: Mining Data Streams
- Clustering algorithms (k-median clustering
problem)
- M. Charikar, C. Chekuri, T. Feder, R. Motwani.
Incremental Clustering and Dynamic Information
Retrieval
- S. Guha, N. Mishra, R. Motwani, L. O'Callaghan.
Clustering Data Streams
18. Algorithms over Data Streams: Dynamic Maintenance
- Lp norms
- P. Indyk. Stable Distributions, Pseudorandom
Generators, Embeddings and Data Stream
Computation
- Hamming norms
- G. Cormode, M. Datar, P. Indyk, S.
Muthukrishnan. Comparing Data Streams Using
Hamming Norms
- Quantiles
- A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J.
Strauss. How to Summarize the Universe: Dynamic
Maintenance of Quantiles
- Sliding window
- M. Datar. Maintaining Stream Statistics over
Sliding Windows
19. Algorithms for Computing RNN over a Conventional DB
- Study of RNN in databases
- F. Korn and S. Muthukrishnan, Influence Sets
Based on Reverse Nearest Neighbor Queries
- Efficient access methods for indexing RNN
- I. Stanoi, M. Riedewald, D. Agrawal, A. El Abbadi,
Discovery of Influence Sets in Frequently Updated
Databases
- C. Yang, King-Ip Lin, An Index Structure for
Efficient Reverse Nearest Neighbor Queries
20. Problem Definition
- A collection of n available servers (not necessarily
active); $l_i$ is the location of server i
- Clients arrive and depart; $L_j$ is the location of
client j
- The RNN of server i is the set of all clients that
have i as their NN server
21. Instances of Aggregates
- RNN-COUNT(i): the number of clients currently in the
system for which i is the NN. Measures LOAD for active
servers.
- RNN-MAXDIST(i): the largest distance to a client
that has i as its NN. Measures QUALITY for active
servers.
- Streams of clients are too large to be stored in
memory, so approximate RNNA values are computed.
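As a point of reference, both aggregates can be computed exactly by brute force when all client locations fit in memory, which is precisely what the stream setting rules out. The sketch below assumes servers and clients on a line; the function name and sample locations are illustrative assumptions, not part of the paper.

```python
def rnn_aggregates(server_locs, client_locs):
    """Exact RNN-COUNT and RNN-MAXDIST per server, by brute force.

    Returns {server index: (count, maxdist)}. Stores nothing per client,
    but needs the full client list, unlike a streaming summary.
    """
    count = {i: 0 for i in range(len(server_locs))}
    maxdist = {i: 0.0 for i in range(len(server_locs))}
    for loc in client_locs:
        # Nearest server of this client (ties go to the lower index).
        i = min(range(len(server_locs)),
                key=lambda s: abs(loc - server_locs[s]))
        count[i] += 1
        maxdist[i] = max(maxdist[i], abs(loc - server_locs[i]))
    return {i: (count[i], maxdist[i]) for i in count}


# Hypothetical locations on a line (illustrative data only).
servers = [0.0, 10.0, 20.0]
clients = [1.0, 2.0, 4.0, 9.0, 12.0, 19.0]
agg = rnn_aggregates(servers, clients)

max_count = max(c for c, _ in agg.values())    # Max-RNNA with COUNT
max_maxdist = max(d for _, d in agg.values())  # Max-RNNA with MAXDIST
```

With these sample values, server 0 serves three clients at distance up to 4.0, server 1 serves two at distance up to 2.0, and server 2 serves one at distance 1.0.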
22. Focus of the Problem
- Max-RNNA: given K active servers, return the
maximum RNNA over all clients to their closest
active server (worst-case load or quality).
- List-RNNA: given K active servers, return a
list of the RNNA over all clients to each of the
K active servers (maximum load or worst-case
quality).
- Opt-RNNA: find a set of at most K servers from
the available ones to be active, for which their
RNNAs are below a given threshold (optimization).
23. Algorithm
Assumption: servers are on a straight line.
Counters for servers i, j and client k:
- $C^L_{ij}$ counts the clients with $L_k \in [l_i, (l_i + l_j)/2)$
- $C^R_{ij}$ counts the clients with $L_k \in ((l_i + l_j)/2, l_j]$
24. Algorithm for RNN-COUNT(i)
The algorithm: let l be the closest active server
to the left of i and r the closest active server to the
right. Then RNN-COUNT(i) = $C^L_{il} + C^R_{ir}$.
Requires $O(n^2)$ space and $O(n^2)$ updates.
We want near-linear space and fewer updates, so
approximation is needed.
25. Data Structure
Definitions: $s_1, \dots, s_K$ are the K servers
designated to be active.
Assumption: servers are sorted, $l_1 \le \dots \le l_n$.
Counter (number of clients for server i): $C(i)$ counts
the clients with $L_k \in [l_i, l_{i+1})$, i.e. at the right side of
server i; $C(0)$ counts the clients at the left side of server 1.
Requires $O(n)$ space and $O(\log n)$ per update (to look up the
wanted server).
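The counter structure can be sketched as follows: one counter per gap between adjacent servers, with a binary search to locate the gap on each client arrival or departure. This is a minimal sketch; the class name `CellCounters` and its methods are hypothetical, and only the maintenance step is shown, not the query estimators.

```python
from bisect import bisect_right


class CellCounters:
    """C(i) = number of clients in [l_i, l_{i+1}); C(0) = clients left of l_1."""

    def __init__(self, server_locs):
        self.locs = sorted(server_locs)       # l_1 <= ... <= l_n
        self.C = [0] * (len(self.locs) + 1)   # C(0), C(1), ..., C(n)

    def _cell(self, client_loc):
        # Number of servers at or to the left of the client: O(log n).
        return bisect_right(self.locs, client_loc)

    def insert(self, client_loc):
        """A client arrives: increment the counter of its cell."""
        self.C[self._cell(client_loc)] += 1

    def delete(self, client_loc):
        """A client departs: decrement the counter of its cell."""
        self.C[self._cell(client_loc)] -= 1


# Hypothetical server locations and client stream (illustrative only).
cc = CellCounters([0.0, 10.0, 20.0, 30.0])
for x in [-1.0, 1.0, 9.0, 10.0, 25.0, 31.0]:
    cc.insert(x)
cc.delete(9.0)
```

The structure keeps n + 1 integers regardless of how many clients have been seen. The price is that a counter no longer records where inside its cell each client lies.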
26. Answering Queries
Max-RNNA($s_1, \dots, s_K$)
Max-RNNA($s_1, \dots, s_K$) = $\max_i$ RNN-COUNT($s_i$)
27. Example: Max-RNNA($s_1, \dots, s_K$)
28. Max-RNNA($s_1, \dots, s_K$)
29. Answering Queries
List-RNNA($s_1, \dots, s_K$)
$M_i$ for each $s_i$
The proof is similar to that of the previous theorem.
30. Answering Queries
Opt-RNNA
- Greedy algorithm finds
- the minimal number of active servers K
- such that $\max_i$ RNN-COUNT($s_i$) $\le C$
31. Answering Queries
Opt-RNNA
32. Opt-RNNA
33. Opt-RNNA "Dual" Problem
Minimize $\max_i$ RNN-COUNT($s_i$),
given an upper bound K on the number of servers.
- Algorithm
- Choose different values of C
- Run the greedy algorithm of Opt-RNNA
- Repeat until solved with a number of servers $K' \le K$
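One way to organize the "choose different values of C" loop is a binary search over the threshold. The slides do not say how C is chosen, so the binary search is an assumption, as is the monotonicity it relies on (a larger threshold never requires more servers). The greedy algorithm itself is represented by a hypothetical callback `servers_needed(C)`, and `toy_servers_needed` is a stand-in used only to exercise the search.

```python
import math


def min_feasible_threshold(servers_needed, K, c_max):
    """Smallest integer C in [0, c_max] with servers_needed(C) <= K.

    servers_needed is a hypothetical callback standing in for the greedy
    Opt-RNNA run; it is assumed to be non-increasing in C.
    Returns None if even c_max needs more than K servers.
    """
    if servers_needed(c_max) > K:
        return None
    lo, hi = 0, c_max
    while lo < hi:
        mid = (lo + hi) // 2
        if servers_needed(mid) <= K:
            hi = mid        # feasible: try a smaller threshold
        else:
            lo = mid + 1    # infeasible: the threshold must grow
    return lo


def toy_servers_needed(C):
    """Toy stand-in: 100 clients split evenly, at most C per server."""
    return math.inf if C == 0 else math.ceil(100 / C)


best_c = min_feasible_threshold(toy_servers_needed, K=4, c_max=100)
```

In the toy setting, four servers can each take 25 of the 100 clients, so the search settles on C = 25 after a logarithmic number of greedy runs.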
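34. Insert-Only Clients
Data Structure
Assumption: servers are sorted, $l_1 \le \dots \le l_n$.
Counter (number of clients for server i): $C(i)$ counts
the clients with $L_k \in [l_i, l_{i+1})$, i.e. at the right side of
server i; $C(0)$ counts the clients at the left side of server 1.
Count Partitioning
Maintain $\ell$-quantiles (Greenwald-Khanna) $c_{i1}, \dots, c_{i\ell}$:
the number of clients lying between $l_i$ and $L_{c_{ik}}$ is within
$(1 \pm \epsilon)\, k\, C(i)/\ell$, where $1 \le k \le \ell$.
Requires $O(\log C(i) / \epsilon)$ space.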
35. Answering Queries
Max-RNNA($s_1, \dots, s_K$)
Max-RNNA($s_1, \dots, s_K$) = $\max_i$ RNN-COUNT($s_i$)
36. Max-RNNA($s_1, \dots, s_K$)
37. Insert-Only Clients
List-RNNA($s_1, \dots, s_K$)
Implemented in the same way.
Opt-RNNA
How can the data structure be maintained under deletions?
38. Algorithm for RNN-MAXDIST(i)
The algorithm: a histogram based on space
partitioning.
Assumption: servers are sorted, $l_1 \le \dots \le l_n$.
Exponentially sized buckets.
Domain size $U$, where the domain is
$[\min(L_j, l_i), \max(L_j, l_i)]$.
Dividers between servers i and i+1: $g_{ij}$ at distance
$(1+\epsilon)^j$ from $l_i$.
The number of dividers is $O(\log_{1+\epsilon}(l_{i+1} - l_i))$.
39. Data Structure
Counter: the number of clients between $g_{ik}$ and $g_{i,k+1}$
is stored with $g_{ik}$.
- For updates of client j
- Find i such that $L_j \in [l_i, l_{i+1})$
- Find k such that $L_j \in [g_{ik}, g_{i,k+1})$
- Update the counter of $g_{ik}$
Requires $O(n \log_{1+\epsilon} U)$ space and $O(\log_{1+\epsilon} U)$ per update.
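The exponential bucketing can be sketched for a single server as a histogram over distances. This is a simplified illustration, not the paper's exact structure: the class name `ExpHistogram` is hypothetical, distances are measured from one server only, and distances below 1 are assumed to share the first bucket.

```python
from collections import Counter


class ExpHistogram:
    """Client counts per bucket [(1+eps)^k, (1+eps)^(k+1)) of distance."""

    def __init__(self, eps):
        self.base = 1.0 + eps
        self.counts = Counter()   # bucket index k -> number of clients

    def _bucket(self, d):
        # Largest k with base^k <= d; distances below base land in bucket 0.
        k = 0
        while self.base ** (k + 1) <= d:
            k += 1
        return k

    def insert(self, d):
        self.counts[self._bucket(d)] += 1

    def delete(self, d):
        k = self._bucket(d)
        self.counts[k] -= 1
        if self.counts[k] <= 0:
            del self.counts[k]    # keep only non-empty buckets

    def max_dist_estimate(self):
        """Upper edge of the farthest non-empty bucket (0.0 if empty)."""
        if not self.counts:
            return 0.0
        return self.base ** (max(self.counts) + 1)


# Hypothetical distances with eps = 1, i.e. buckets [1,2), [2,4), [4,8), ...
h = ExpHistogram(eps=1.0)
for d in [3.0, 5.0, 12.0]:
    h.insert(d)
est_before = h.max_dist_estimate()   # farthest client is in [8, 16)
h.delete(12.0)
est_after = h.max_dist_estimate()    # farthest client is now in [4, 8)
```

For a true maximum distance d of at least 1, the estimate lies between d and (1 + eps) d, and because each bucket keeps a count rather than a flag, departures can be handled as well as arrivals.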
40. Answering Queries
Max-RNNA($s_1, \dots, s_K$)
Max-RNNA($s_1, \dots, s_K$) = $\max_i$ RNN-MAXDIST($s_i$)
41. Max-RNNA($s_1, \dots, s_K$)
Details of the proof will be given in a future
paper.
42. Answering Queries
List-RNNA($s_1, \dots, s_K$)
$D_i = \max(RD_i, LD_i)$ for each $s_i$
The proof is similar to that of the previous theorem.
43. Answering Queries
Opt-RNNA
- Greedy algorithm with limited backtracking finds
- the minimal number of active servers K
- such that $\max_i$ RNN-MAXDIST($s_i$) $\le D$
44. Opt-RNNA
The proof will be given in a future paper.
45. Opt-RNNA "Dual" Problem
Minimize $\max_i$ RNN-MAXDIST($s_i$),
given an upper bound K on the number of servers.
- Algorithm
- Choose different values of D
- Run the greedy algorithm of Opt-RNNA
- Repeat until solved with a number of servers $K' \le K$
46. Extensions
- Nearest Neighbor and Reverse Nearest Neighbor
Queries for Moving Objects. R. Benetis,
C.S. Jensen, G. Karciauskas, S. Saltenis
- Reverse Nearest Neighbor Queries for Dynamic
Databases. SHOU Yu Tao
Assumption: the clients are on the same axis as
the servers.
47. Summary of Results
48. Experiments
The following aspects were tested
Experimental data: CALIFORNIA (latitude of 63k
buildings in California), uniform and binomial
distributions
49. Average Error of List-RNN-Count
Test: $\mathrm{AVG}_i(\hat{C}_i / C_i)$
50. Average Error of List-RNN-Maxdist
Test: $\mathrm{AVG}_i(\hat{D}_i / D_i)$
51. Maximum Error of Max-RNN-Count
Test: $\max_i \hat{C}_i / \max_i C_i$
52. Maximum Error of List-RNN-Maxdist
Test: $\max_i \hat{D}_i / \max_i D_i$
53. Conclusions
- RNNA supports computations based on geographical
distances or vector-space similarity between
servers and clients
- Applications of RNNA
- Classical: facility location
- Emerging: fixed wireless telephony access and
sensor-based traffic monitoring
- Data for RNNA arrives in streams
- RNNA performs online computations
54. Conclusions
- We study three problems
- Max-RNNA
- List-RNNA
- Opt-RNNA
- Two aggregates
- COUNT
- MAXDIST
- Approximate algorithms with near-linear space
usage
55. Questions and Answers
Any Questions?