NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms presentation

1
NNH: Improving Performance of Nearest-Neighbor
Searches Using Histograms

Liang Jin (UC Irvine)
Nick Koudas (AT&T)
Chen Li (UC Irvine)

2
Outline

Motivation: NN search
NNH: proposed histogram structure
Main idea
Utilizing NNH in a search (KNN, join)
Constructing NNH
Incremental maintenance
Experiments

3
NN (nearest-neighbor) search

KNN: find the k nearest neighbors of an object.

4
Example: image search
Query image

Images represented as features (color histogram,
texture moments, etc.)
Similarity search using these features
Find 10 most similar images for the query image

5
Other Applications

Web-page search
Find 100 most similar pages for a given page
Page represented as word-frequency vector
Similarity: vector distance
GIS: find the 5 cities closest to Irvine
CAD, information retrieval, molecular biology,
data cleansing, ...
Challenges: efficiency, scalability

6
NN Algorithms

Distance measurement
When objects are points, distance is well defined
Usually Euclidean
Other distances possible
For arbitrary-shaped objects, assume we have a
distance function between them
Most algorithms assume a high-dimensional tree
structure exists for the datasets.

7
Example: R-Trees
Take 2-d space as an example.
8
Minimal Bounding Rectangle

MBR is an n-dimensional rectangle that bounds its
corresponding objects.
MBR face property: every face of any MBR contains
at least one point of some object

9
Search process (1-NN for example)

Most algorithms traverse the structure (e.g.,
R-tree) top down, and follow a branch-and-bound
approach
Keep a priority queue of nodes (mbrs) to be
visited
Sorted based on the minimum distance between q
and each node
Improvement
Use MINDIST and MINMAXDIST
Reduce the queue size
Avoid unnecessary disk IOs to access MBRs
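
A minimal sketch of the branch-and-bound traversal for 1-NN, assuming a toy in-memory tree of dictionaries rather than a real R-tree. The node layout (`lo`, `hi`, `children`, `points`) and the function names are hypothetical and chosen only for illustration.

```python
import heapq
import math

def mindist(q, lo, hi):
    """Smallest possible distance from point q to the rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def nn_search(q, root):
    """Best-first 1-NN search; returns (distance, point)."""
    best = (math.inf, None)
    counter = 0                      # tie-breaker so heapq never compares nodes
    queue = [(0.0, counter, root)]   # priority queue ordered by MINDIST
    while queue:
        d, _, node = heapq.heappop(queue)
        if d > best[0]:
            break                    # all remaining nodes are at least this far
        if "points" in node:         # leaf: compare against actual objects
            for p in node["points"]:
                dp = math.dist(q, p)
                if dp < best[0]:
                    best = (dp, p)
        else:                        # internal node: enqueue promising children
            for child in node["children"]:
                dc = mindist(q, child["lo"], child["hi"])
                if dc <= best[0]:
                    counter += 1
                    heapq.heappush(queue, (dc, counter, child))
    return best

leaf_a = {"lo": (0, 0), "hi": (1, 1), "points": [(0, 0), (1, 1)]}
leaf_b = {"lo": (5, 5), "hi": (6, 6), "points": [(5, 5), (6, 6)]}
root = {"children": [leaf_a, leaf_b]}
print(nn_search((4, 4), root))
```

In this sketch the second leaf is never opened once a closer object has been found, which is the effect the priority queue ordering is meant to achieve.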

Priority queue
10
MINDIST and MINMAXDIST
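
As a rough illustration, both measures can be computed for a query point and an axis-aligned MBR as follows. This follows the usual definitions from the R-tree nearest-neighbor literature; the function names and the `(lo, hi)` corner representation are assumptions of this sketch. MINMAXDIST relies on the MBR face property: since every face touches some object, at least one object lies within that distance.

```python
import math

def mindist(q, lo, hi):
    """Lower bound: smallest possible distance from q to anything in the MBR."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def minmaxdist(q, lo, hi):
    """Upper bound on the distance from q to its nearest object in the MBR."""
    # Squared distance to the farther and nearer edge in each dimension.
    far = [max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(q, lo, hi)]
    near = [min(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(q, lo, hi)]
    total_far = sum(far)
    # For each dimension k: nearer edge in k, farther edge in all others.
    return math.sqrt(min(total_far - far[k] + near[k] for k in range(len(q))))

q, lo, hi = (0, 0), (1, 1), (2, 3)
print(mindist(q, lo, hi), minmaxdist(q, lo, hi))
```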
11
Pruning in NN search
1. Discard mbr1 if MINDIST(q,mbr1) >
MINMAXDIST(q,mbr2)

3. Discard mbr1 if MINDIST(q,mbr1) > dist(q,o)

12
Problem

Queue size may be large
Example: 60,000 32-d (image) vectors, 50 NNs
Max queue size: 15K entries
Avg queue size: about half (7.5K entries)
If the queue can't fit in memory, more disk IOs!
Problem is worse for k-NN joins
E.g., 1500 x 1500 join
Max queue size: 1.7M entries (> 1GB memory)!
750 seconds to run
Couldn't scale up to 2000 objects!
Disk thrashing

13
Our Solution: Nearest-Neighbor Histogram (NNH)

Main idea
Utilizing NNH in a search (KNN, join)
Constructing NNH
Incremental maintenance

14
NNH: Nearest-Neighbor Histograms
[Figure: m pivots p1, p2, ..., pm (m = # of pivots), each with the distances of its nearest neighbors r1, r2, ...]
15
Main idea

Keep a histogram of NN distances of a
pre-selected collection of objects (pivots).
They are not part of the database
They give a big picture of the objects' locations
Use the histogram to estimate the NN distance of
each query object.
Use these estimated NN distances to do more
pruning in an NN search

16
Structure

Nearest Neighbor Vectors

For a pivot p, the vector holds r1, r2, ..., rT:
each ri is the distance from p to its i-th NN; T is
the length of each vector

Nearest Neighbor Histogram
Collection of m pivots with their NN vectors

17
Estimate NN distance for query object

NNH does not give exact NN information for an
object
But we can estimate an upper bound δ_q^est on the
k-NN distance of q

Triangle inequality
18
Estimate NN for query object (cont)

Apply the triangle inequality to all pivots
Upper bound estimate of NN distance of q

Complexity O(m)
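
A small sketch of the estimate, assuming the NNH is held as a list of (pivot, NN vector) pairs and that Euclidean distance is used. By the triangle inequality, the k nearest neighbors of a pivot p all lie within dist(q, p) + H(p, k) of q, where H(p, k) is the k-th entry of p's NN vector; taking the minimum over all m pivots gives the O(m) upper bound. The function name and data layout are hypothetical.

```python
import math

def estimate_knn_distance(q, nnh, k):
    """Upper bound on the k-NN distance of q from an NNH.

    nnh: list of (pivot, nn_vector), where nn_vector[i] is the distance
    from the pivot to its (i+1)-th nearest neighbor in the database.
    """
    return min(math.dist(q, pivot) + vec[k - 1] for pivot, vec in nnh)

nnh = [
    ((0.0, 0.0), [1.0, 2.0, 3.0]),
    ((10.0, 0.0), [0.5, 0.6, 0.7]),
]
print(estimate_knn_distance((9.0, 0.0), nnh, k=2))
```

The nearby pivot with tightly packed neighbors gives the useful bound here; the distant pivot contributes a much looser one.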

19
Utilizing estimates in NN search

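More pruning: prune an mbr if MINDIST(q, mbr)
exceeds the estimated k-NN distance of q

[Figure: query q, an mbr, and the MINDIST between them]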
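
To illustrate the effect on queue size, here is a sketch of a best-first k-NN search whose pruning threshold starts at an externally supplied estimate instead of infinity. The toy tree layout and the names `knn_search` and `est` are assumptions of this sketch; `est` must be a valid upper bound on the true k-NN distance, such as one derived from an NNH.

```python
import heapq
import math

def mindist(q, lo, hi):
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def knn_search(q, root, k, est=math.inf):
    """Return (sorted k-NN distances, maximum queue size reached)."""
    found = []                 # max-heap via negated distances: k best so far
    queue = [(0.0, 0, root)]   # (MINDIST, tie-breaker, node)
    pushed, max_queue = 0, 1

    def bound():
        kth = -found[0] if len(found) == k else math.inf
        return min(est, kth)

    while queue:
        d, _, node = heapq.heappop(queue)
        if d > bound():
            break
        if "points" in node:
            for p in node["points"]:
                dp = math.dist(q, p)
                if len(found) < k:
                    heapq.heappush(found, -dp)
                elif dp < -found[0]:
                    heapq.heapreplace(found, -dp)
        else:
            limit = bound()
            for child in node["children"]:
                dc = mindist(q, child["lo"], child["hi"])
                if dc <= limit:          # prune MBRs beyond the estimate
                    pushed += 1
                    heapq.heappush(queue, (dc, pushed, child))
            max_queue = max(max_queue, len(queue))
    return sorted(-x for x in found), max_queue

leaves = [
    {"lo": (0, 0), "hi": (1, 1), "points": [(0, 0), (1, 1)]},
    {"lo": (10, 10), "hi": (11, 11), "points": [(10, 10), (11, 11)]},
    {"lo": (20, 20), "hi": (21, 21), "points": [(20, 20), (21, 21)]},
]
root = {"children": leaves}
print(knn_search((0, 0), root, k=2))
print(knn_search((0, 0), root, k=2, est=2.0))
```

Both calls return the same neighbor distances, but with the estimate the two far leaves never enter the queue.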
20
Utilizing estimates in NN join

K-NN join: for each object o1 in D1, find its
k nearest neighbors in D2.
Preliminary algorithm by Hjaltason and Samet
[HS98]
Traverse the two trees top down; keep a queue of pairs

21
Utilizing estimates in NN join (cont)

Construct NNH for D2.
For each object o1 in D1, keep its estimated NN
radius δ_o1^est using the NNH of D2.
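Similar to a k-NN query, ignore an mbr for o1 if
MINDIST(o1, mbr) exceeds the estimated NN radius of o1

[Figure: object o1, an mbr, and the MINDIST between them]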
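
A hedged sketch of the per-object pruning for the join, assuming D2's MBRs are given as a flat list of (lo, hi) corners and its NNH as (pivot, NN vector) pairs. It only shows which MBRs of D2 survive for each o1; the tree traversal and pair queue of the join algorithm are not modeled, and all names are hypothetical.

```python
import math

def mindist(q, lo, hi):
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def estimate_knn_radius(o1, nnh, k):
    """Upper bound on o1's k-NN distance in D2, from the NNH of D2."""
    return min(math.dist(o1, p) + vec[k - 1] for p, vec in nnh)

def join_candidates(d1, mbrs, nnh, k):
    """For each o1 in D1, indices of the D2 MBRs that are not pruned."""
    result = {}
    for o1 in d1:
        radius = estimate_knn_radius(o1, nnh, k)
        result[o1] = [i for i, (lo, hi) in enumerate(mbrs)
                      if mindist(o1, lo, hi) <= radius]
    return result

nnh_d2 = [((0.0, 0.0), [1.0, 1.5])]
mbrs_d2 = [((0, 0), (1, 1)), ((10, 10), (11, 11))]
print(join_candidates([(0.5, 0.5)], mbrs_d2, nnh_d2, k=2))
```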
22
More powerful: prune MBR pairs
23
Prune MBR pairs (cont)
[Figure: mbr1 and mbr2, with the MINDIST between them]
Prune this MBR pair if
24
How to construct an NNH?

If we have selected the m pivots
Just run KNN queries for them to construct NNH
Time is O(m)
Offline
Important: selecting pivots
Size-Constraint NNH Construction
Error-Constraint NNH Construction
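
Once the pivots are fixed, construction amounts to one KNN query per pivot. The sketch below uses a brute-force scan as a stand-in for those queries (the slides presumably use the index), with hypothetical names and an assumed Euclidean distance.

```python
import math

def build_nnh(pivots, data, t):
    """Return a list of (pivot, nn_vector) with nn_vector of length at most t."""
    nnh = []
    for p in pivots:
        # Brute-force stand-in for a T-NN query on the database.
        dists = sorted(math.dist(p, x) for x in data)
        nnh.append((p, dists[:t]))
    return nnh

data = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]
print(build_nnh([(0.0, 0.0)], data, t=2))
```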

25
Size-constraint NNH construction

# of pivots (m) determines
Storage size
Initial construction cost
Incremental-maintenance cost
Choose m best pivots

26
Size-constraint NNH construction

Given m, the # of pivots
Assuming
query objects are from the database D
H(pi,k) doesn't vary too much
Goal: find pivots p1, p2, ..., pm to minimize
object distances to the pivots
Clustering problem
Many algorithms available
Use K-means for its simplicity and efficiency
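
A minimal pure-Python sketch of pivot selection with Lloyd-style k-means, to make the clustering step concrete. The function name, the `init` and `seed` parameters, and the stopping rule are assumptions of this sketch, not details given in the slides. Note that the resulting pivots are centroids and so need not be database objects.

```python
import math
import random

def kmeans_pivots(data, m, iters=20, init=None, seed=0):
    """Return m cluster centers of `data` (a list of equal-length tuples)."""
    centers = list(init) if init is not None else random.Random(seed).sample(data, m)
    for _ in range(iters):
        groups = [[] for _ in range(m)]
        for x in data:
            # Assign each object to its closest center.
            j = min(range(m), key=lambda c: math.dist(x, centers[c]))
            groups[j].append(x)
        new_centers = []
        for j, g in enumerate(groups):
            if g:
                new_centers.append(tuple(sum(col) / len(g) for col in zip(*g)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break                               # assignments have stabilized
        centers = new_centers
    return centers

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans_pivots(data, 2, init=[(0, 0), (10, 10)]))
```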

27
Error-constraint NNH construction

Assumptions
A threshold r is set a priori
Any estimate to the k-NN distance less than r is
considered good enough.
I.e., a maximum error of r is tolerated for any
distance estimate.

28
Error-constraint NNH construction (cont)

Find a set of points S = {p1, p2, ..., pm} from the
dataset D
For each point pi, its kNNs are within distance
r/2
Then, for any point q within distance r/2 from
pi, we get a distance estimate for the KNN of q

29
Error-constraint NNH construction (cont)

Problem: find points such that
They cover the entire data set with spheres of
radius r/2
The sum of distances of all points in each sphere
to its center is minimized
An instance of the k-center problem
Efficient 2-approximation algorithm using a
single pass over the dataset
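
As a rough illustration of a single-pass cover, the sketch below uses a simple leader-style scan: each object either falls inside an existing sphere of radius r/2 or becomes a new center. This is a swapped-in stand-in to show the covering idea and is not necessarily the 2-approximation algorithm the slide refers to; it does not try to minimize the sum of distances. The function name is hypothetical.

```python
import math

def cover_pivots(data, r):
    """Pick centers so that every object lies within r/2 of some center."""
    radius = r / 2.0
    pivots = []
    for x in data:                       # single pass over the dataset
        if all(math.dist(x, p) > radius for p in pivots):
            pivots.append(x)             # x is not covered yet: open a new sphere
    return pivots

data = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
print(cover_pivots(data, r=4))
```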

30
Incremental Maintenance

How to update the NNH when inserting or deleting
objects?
Need to shift each vector
Associate a valid length Ei with each NN vector.

31
Insertion

Locate the position j in each NN vector where

32
Insertion (cont)

If j not found, we don't need to update this
pivot's NN vector (why?)
If found
insert the new radius
shift the vector to the right
increment Ei by 1.
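
A sketch of the insertion update for one pivot, under stated assumptions: the NN vector is a sorted list of fixed length T, only its first `valid_len` entries are trusted, and j is taken to be the first valid position whose radius is larger than the new object's distance d to the pivot. Dropping the last entry when the vector is full and capping the valid length at T are choices made for this sketch.

```python
import bisect

def nnh_insert(vector, valid_len, d):
    """Return (vector, valid_len) after adding an object at distance d from the pivot."""
    t = len(vector)
    # First valid position j with vector[j] > d.
    j = bisect.bisect_right(vector, d, 0, valid_len)
    if j == valid_len:
        return vector, valid_len          # not closer than any known neighbor
    # Insert d at j and shift the rest right, keeping the length at T.
    updated = vector[:j] + [d] + vector[j:t - 1]
    return updated, min(valid_len + 1, t)

print(nnh_insert([1.0, 2.0, 3.0], 3, 1.5))
print(nnh_insert([1.0, 2.0, 3.0], 3, 5.0))
```

When j is not found, the new object is no closer than every trusted entry, so none of the recorded NN distances change.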

33
Deletion

Similar to the Insertion
Locate position of
If not found, no update for this vector
If found
remove rj
shift the rest to the left
decrement Ei by 1
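
The deletion update can be sketched the same way, assuming the position to locate is that of the deleted object's distance d to the pivot within the valid part of the vector. Padding the freed slot with infinity and matching distances with a small tolerance are assumptions of this sketch.

```python
import math

def nnh_delete(vector, valid_len, d, tol=1e-9):
    """Return (vector, valid_len) after removing an object at distance d from the pivot."""
    for j in range(valid_len):
        if math.isclose(vector[j], d, abs_tol=tol):
            # Remove r_j, shift the rest left, and mark the last slot as unknown.
            updated = vector[:j] + vector[j + 1:] + [math.inf]
            return updated, valid_len - 1
    return vector, valid_len              # the object was not among the known NNs

print(nnh_delete([1.0, 2.0, 3.0], 3, 2.0))
print(nnh_delete([1.0, 2.0, 3.0], 3, 7.0))
```

The shrinking valid length records that the vector no longer knows the pivot's T-th neighbor until it is rebuilt.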

34
Experiments

Dataset
Corel image database
Contains 60,000 images
Each image represented by a 32-dimensional float
vector
Test bed
PC: 1.5GHz Athlon, 512MB memory, 80GB HD, Windows 2000.
GNU C in CYGWIN

35
Questions to be answered

Is the pruning using NNH estimates powerful?
KNN queries
NN-join queries
Is it cheap to have such a structure?
Storage
Initial construction
Incremental maintenance

36
Improvement in k-NN search

Run k-means algorithm to generate 400 pivots, and
construct the NNH
Perform 10-NN queries on 100 randomly selected
query objects.
Queue size as the benchmark for memory usage.
Max queue size
Average queue size

37
Reduced Memory Requirement
38
Reduced running time
39
Effects of different # of pivots
40
Improvement in k-NN joins

Selected two subsets from the Corel dataset. Each
contains 1500 objects.
Unfortunately, couldn't run on the PC due to the large
memory requirement
Ran on a SUN Ultra 4 workstation with four 300MHz
CPUs and 3GB memory.
Constructed NNH (400 pivots) for D2.

41
Join: reduced memory requirement
42
Join: reduced running time
43
Join: effects of different # of pivots
44
Join: running time for different data sizes
45
Cost/Benefit of NNH
For 60,000 32-d float vectors.
# of pivots (m)              10    50    100   150   200   250   300   350   400
Construction time (sec)      0.7   3.59  6.6   9.4   11.5  13.7  15.7  17.8  20.4
Storage space (kB)           2     10    20    30    40    50    60    70    80
Incr. maintenance time (ms)  0     0     0     0     0     0     0     0     0
Improved q-size, kNN (%)     40    30    28    24    24    24    23    20    18
Improved q-size, join (%)    45    34    28    26    26    25    24    24    22
0 means almost zero.
46
Conclusion
