Title: NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
1NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
- Liang Jin (UC Irvine)
- Nick Koudas (AT&T)
- Chen Li (UC Irvine)
2Outline
- Motivation: NN search
- NNH: proposed histogram structure
- Main idea
- Utilizing NNH in a search (KNN, join)
- Constructing NNH
- Incremental maintenance
- Experiments
3NN (nearest-neighbor) search
- KNN: find the k nearest neighbors of an object.
4Example: image search
- Images are represented as feature vectors (color histogram, texture moments, etc.)
- Similarity search uses these features
- Example: find the 10 images most similar to the query image
5Other Applications
- Web-page search
- Find 100 most similar pages for a given page
- Page represented as word-frequency vector
- Similarity: vector distance
- GIS: find the 5 cities closest to Irvine
- CAD, information retrieval, molecular biology, data cleansing, etc.
- Challenges: efficiency, scalability
6NN Algorithms
- Distance measurement
- When objects are points, distance is well defined
- Usually Euclidean
- Other distances possible
- For arbitrary-shaped objects, assume we have a distance function between them
- Most algorithms assume a high-dimensional tree structure exists for the datasets.
7Example: R-Trees
Take 2-d space as an example.
8Minimal Bounding Rectangle
- MBR is an n-dimensional rectangle that bounds its corresponding objects.
- MBR face property: every face of any MBR contains at least one point of some object
9Search process (1-NN for example)
- Most algorithms traverse the structure (e.g., R-tree) top down, and follow a branch-and-bound approach
- Keep a priority queue of nodes (mbrs) to be visited
- Sorted based on the minimum distance between q and each node
- Improvement
- Use MINDIST and MINMAXDIST
- Reduce the queue size
- Avoid unnecessary disk I/Os to access MBRs
10MINDIST and MINMAXDIST
11Pruning in NN search
1. Discard mbr1 if MINDIST(q, mbr1) > MINMAXDIST(q, mbr2)
3. Discard mbr1 if MINDIST(q, mbr1) > dist(q, o), where o is an object already found
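The two bounds used in these rules can be sketched as follows. This is a minimal illustration, assuming axis-aligned MBRs given as a pair of corner tuples `(lo, hi)` and Euclidean distance; the function names are hypothetical, and squared distances are used to avoid square roots.

```python
def mindist_sq(q, lo, hi):
    """Squared MINDIST: squared distance from q to the nearest point of the box [lo, hi]."""
    total = 0.0
    for x, l, h in zip(q, lo, hi):
        if x < l:
            total += (l - x) ** 2
        elif x > h:
            total += (x - h) ** 2
    return total


def minmaxdist_sq(q, lo, hi):
    """Squared MINMAXDIST: for each dimension, take the face nearer to q and the
    farthest point on that face; return the smallest such squared distance."""
    near_sq = []  # squared gap to the nearer edge, per dimension
    far_sq = []   # squared gap to the farther edge, per dimension
    for x, l, h in zip(q, lo, hi):
        mid = (l + h) / 2
        near = l if x <= mid else h
        far = l if x >= mid else h
        near_sq.append((x - near) ** 2)
        far_sq.append((x - far) ** 2)
    total_far = sum(far_sq)
    # Swap the "far" term for the "near" term in one dimension at a time.
    return min(total_far - f + n for n, f in zip(near_sq, far_sq))


def discard_by_rule_1(q, mbr1, mbr2):
    """Rule 1: mbr1 can be discarded if MINDIST(q, mbr1) > MINMAXDIST(q, mbr2)."""
    return mindist_sq(q, *mbr1) > minmaxdist_sq(q, *mbr2)
```

Rule 1 relies on the MBR face property: since every face touches some object, mbr2 is guaranteed to contain an object no farther than MINMAXDIST(q, mbr2).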
12Problem
- Queue size may be large
- Example: 60,000 32-d (image) vectors, 50 NNs
- Max queue size: 15K entries
- Avg queue size: about half (7.5K entries)
- If the queue can't fit in memory, more disk I/Os!
- Problem is worse for k-NN joins
- E.g., 1500 x 1500 join
- Max queue size: 1.7M entries, > 1GB memory!
- 750 seconds to run
- Couldn't scale up to 2000 objects!
- Disk thrashing
13Our Solution: Nearest-Neighbor Histogram (NNH)
- Main idea
- Utilizing NNH in a search (KNN, join)
- Constructing NNH
- Incremental maintenance
14NNH: Nearest-Neighbor Histograms
- m: # of pivots (p1, p2, ..., pm)
- For each pivot: distances of its nearest neighbors r1, r2, ...
15Main idea
- Keep a histogram of NN distances of a pre-selected collection of objects (pivots).
- They are not part of the database
- They give a big picture of the objects' locations
- Use the histogram to estimate the NN distance of a given query object.
- Use these estimated NN distances to do more pruning in an NN search
16Structure
- Nearest Neighbor Histogram
- Collection of m pivots with their NN vectors
- Each ri is the distance from pivot p to its i-th NN; T is the length of each vector
17Estimate NN distance for query object
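- NNH does not give exact NN information for an object
- But we can estimate an upper bound δ_q^est for the k-NN distance of q
- Triangle inequality: for a pivot pi, the k nearest neighbors of pi are all within dist(q, pi) + H(pi, k) of q
18Estimate NN for query object (cont)
- Apply the triangle inequality to all pivots
- Upper-bound estimate of the k-NN distance of q: the minimum of dist(q, pi) + H(pi, k) over all pivots pi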
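The estimate can be sketched in a few lines. This is an illustration only, assuming Euclidean distance and an NNH stored as one sorted distance list per pivot; the function name, the toy data, and the variable names are hypothetical.

```python
import math


def estimate_knn_upper_bound(q, pivots, nn_vectors, k):
    """Upper bound on q's k-NN distance: min over pivots of dist(q, p) + H(p, k).

    nn_vectors[i][j] holds the distance from pivots[i] to its (j+1)-th nearest
    neighbor, so H(p_i, k) is nn_vectors[i][k - 1].
    """
    return min(math.dist(q, p) + r[k - 1] for p, r in zip(pivots, nn_vectors))


# Toy data (hypothetical): four 2-d objects and two pivots outside the dataset.
data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
pivots = [(0.5, 0.0), (5.0, 1.0)]
T = 3  # length of each NN vector
nn_vectors = [sorted(math.dist(p, o) for o in data)[:T] for p in pivots]

q, k = (1.0, 1.0), 2
estimate = estimate_knn_upper_bound(q, pivots, nn_vectors, k)
exact = sorted(math.dist(q, o) for o in data)[k - 1]  # brute force, for comparison
```

In this toy case the nearby pivot gives the tighter bound, and the estimate stays above the exact 2-NN distance, as the triangle inequality guarantees.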
19Utilizing estimates in NN search
- More pruning: prune an mbr if MINDIST(q, mbr) is greater than the estimated upper bound on the k-NN distance of q (δ_q^est)
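A rough sketch of how such a bound shrinks the priority queue is shown below. It is deliberately simplified: the "tree" is a single level of leaf MBRs, each given as a hypothetical `(lo, hi, points)` tuple, and `bound` stands in for an NNH estimate computed elsewhere. A real R-tree search would apply the same test each time a node is expanded.

```python
import heapq
import math


def mindist(q, lo, hi):
    """Distance from q to the nearest point of the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def knn_search(q, leaves, k, bound=math.inf):
    """Best-first k-NN over leaf MBRs; returns (k sorted distances, max queue size)."""
    queue = []
    for idx, (lo, hi, _) in enumerate(leaves):
        d = mindist(q, lo, hi)
        if d <= bound:  # extra pruning: this MBR could still hold one of the k NNs
            heapq.heappush(queue, (d, idx))
    max_queue = len(queue)
    best = []  # max-heap (negated distances) of the k best objects so far
    while queue:
        d, idx = heapq.heappop(queue)
        if len(best) == k and d > -best[0]:
            break  # no remaining MBR can improve the current k-th distance
        for p in leaves[idx][2]:
            dp = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, -dp)
            elif dp < -best[0]:
                heapq.heapreplace(best, -dp)
    return sorted(-x for x in best), max_queue


# Ten small leaves spaced along the x-axis (toy data).
leaves = [((10.0 * i, 0.0), (10.0 * i + 1.0, 0.0),
           [(10.0 * i, 0.0), (10.0 * i + 1.0, 0.0)]) for i in range(10)]
q = (0.5, 0.0)
plain = knn_search(q, leaves, k=2)              # no estimate available
pruned = knn_search(q, leaves, k=2, bound=3.0)  # 3.0: an assumed upper bound
```

Both calls return the same neighbors; only the number of queued MBRs differs, which is the effect the slides measure as queue size.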
20Utilizing estimates in NN join
- K-NN join: for each object o1 in D1, find its k-nearest neighbors in D2.
- Preliminary algorithm by Hjaltason and Samet [HS98]
- Traverse the two trees top down; keep a queue of pairs
21Utilizing estimates in NN join (cont)
- Construct NNH for D2.
- For each object o1 in D1, keep its estimated NN radius δ_o1^est using the NNH of D2.
- Similar to a k-NN query, ignore an mbr for o1 if MINDIST(o1, mbr) > δ_o1^est
22More powerful: prune MBR pairs
23Prune MBR pairs (cont)
(Figure: mbr1 and mbr2, with the MINDIST between them)
Prune this MBR pair if
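One condition that is safe under the triangle inequality is sketched below. It is an assumption for illustration and not necessarily the formula on the original slide: prune the pair when MINDIST(mbr1, mbr2) exceeds an upper bound on the k-NN distance of every object in mbr1, obtained by replacing dist(o1, pi) with the largest possible distance from pi to mbr1. All function names are hypothetical, and `nn_vectors` is assumed to be the NNH of D2.

```python
import math


def mindist_boxes(lo1, hi1, lo2, hi2):
    """Smallest possible distance between a point of box 1 and a point of box 2."""
    return math.sqrt(sum(max(l2 - h1, l1 - h2, 0.0) ** 2
                         for l1, h1, l2, h2 in zip(lo1, hi1, lo2, hi2)))


def maxdist_point_box(p, lo, hi):
    """Largest possible distance from p to any point of the box [lo, hi]."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, lo, hi)))


def mbr_knn_upper_bound(lo, hi, pivots, nn_vectors, k):
    """Bound on the k-NN distance (in D2) of ANY object inside the box [lo, hi]."""
    return min(maxdist_point_box(p, lo, hi) + r[k - 1]
               for p, r in zip(pivots, nn_vectors))


def can_prune_pair(mbr1, mbr2, pivots, nn_vectors, k):
    """True if no object in mbr2 can be a k-NN of any object in mbr1."""
    return mindist_boxes(*mbr1, *mbr2) > mbr_knn_upper_bound(*mbr1, pivots, nn_vectors, k)


# Toy setup: D2 has three objects near the origin, summarized by one pivot.
d2 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
pivots = [(0.5, 0.5)]
nn_vectors = [sorted(math.dist(p, o) for o in d2)[:2] for p in pivots]
mbr1 = ((0.0, 0.0), (1.0, 1.0))
far_pair = can_prune_pair(mbr1, ((10.0, 10.0), (11.0, 11.0)), pivots, nn_vectors, k=2)
near_pair = can_prune_pair(mbr1, ((1.5, 0.0), (2.0, 1.0)), pivots, nn_vectors, k=2)
```

Pruning at the MBR-pair level removes whole subtrees from the join queue at once, instead of testing each object of D1 separately.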
24How to construct an NNH?
- If we have selected the m pivots
- Just run KNN queries for them to construct NNH
- Time is O(m) KNN queries
- Offline
- Important: selecting pivots
- Size-Constraint NNH Construction
- Error-Constraint NNH Construction
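Once the pivots are fixed, construction amounts to one KNN query per pivot. The sketch below uses a brute-force scan in place of an index-based KNN query, purely to keep the example self-contained; `build_nnh` and the toy data are hypothetical.

```python
import math


def build_nnh(data, pivots, T):
    """Return one sorted NN-distance vector (length up to T) per pivot."""
    return [sorted(math.dist(p, o) for o in data)[:T] for p in pivots]


# Toy data: one pivot at the origin, three objects at distances 5, 10 and 15.
data = [(3.0, 4.0), (6.0, 8.0), (9.0, 12.0)]
nnh = build_nnh(data, pivots=[(0.0, 0.0)], T=2)
```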
25Size-constraint NNH construction
- # of pivots (m) determines
- Storage size
- Initial construction cost
- Incremental-maintenance cost
- Choose m best pivots
26Size-constraint NNH construction
- Given m, the # of pivots
- Assuming
- query objects are from the database D
- H(pi,k) doesn't vary too much
- Goal: find pivots p1, p2, ..., pm to minimize object distances to the pivots
- Clustering problem
- Many algorithms available
- Use K-means for its simplicity and efficiency
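A plain Lloyd-style k-means loop is enough to illustrate pivot selection. The sketch below is an assumption about how this step could look, not the authors' code: `kmeans_pivots`, the iteration cap, and the seeding by random sampling are all choices made for the example.

```python
import math
import random


def kmeans_pivots(data, m, iters=20, seed=0):
    """Pick m pivots as k-means cluster centers (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(data, m)  # start from m distinct objects
    for _ in range(iters):
        groups = [[] for _ in range(m)]
        for o in data:
            nearest = min(range(m), key=lambda i: math.dist(o, centers[i]))
            groups[nearest].append(o)
        new_centers = []
        for center, group in zip(centers, groups):
            if group:  # move the center to the mean of its group
                new_centers.append(tuple(sum(xs) / len(group) for xs in zip(*group)))
            else:      # keep a center that attracted no objects
                new_centers.append(center)
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


# Two well-separated toy clusters of four objects each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
pivots = sorted(kmeans_pivots(data, m=2))
```

The resulting centers are generally not objects of the dataset, which matches the earlier remark that pivots are not part of the database.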
27Error-constraint NNH construction
- Assumptions
- A threshold r is set a priori
- Any estimate of the k-NN distance that is less than r is considered good enough.
- I.e., a maximum error of r is tolerated for any distance estimate.
28Error-constraint NNH construction (cont)
- Find a set of points S = {p1, p2, ..., pm} from the dataset D
- For each point pi, its kNNs are within distance r/2
- Then, for any point q within distance r/2 from pi, we get a distance estimate for the KNN of q of at most r/2 + r/2 = r
29Error-constraint NNH construction (cont)
- Problem: find points such that
- They cover the entire data set with spheres of radius r/2
- The sum of distances of all points in each sphere to its center is minimized
- An instance of the k-center problem
- Efficient 2-approximation algorithm using a single pass over the dataset
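The covering requirement can be illustrated with a simple single-pass greedy scan. This is a hedged sketch, not necessarily the approximation algorithm referred to above: it only guarantees that every object lies within r/2 of some chosen pivot, and it does not check the k-NN condition on the pivots themselves. The name `cover_pivots` is hypothetical.

```python
import math


def cover_pivots(data, r):
    """One pass over the data: open a new pivot whenever an object is farther
    than r/2 from every pivot chosen so far."""
    pivots = []
    for o in data:
        if all(math.dist(o, p) > r / 2 for p in pivots):
            pivots.append(o)
    return pivots


# Toy 2-d objects on a line; with r = 1 each sphere has radius 0.5.
data = [(0.0, 0.0), (0.4, 0.0), (1.0, 0.0), (1.3, 0.0), (5.0, 0.0)]
pivots = cover_pivots(data, r=1.0)
```

Here the number of pivots m is an output of the procedure and depends on r, in contrast to the size-constraint construction where m is fixed in advance.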
30Incremental Maintenance
- How to update the NNH when inserting or deleting objects?
- Need to shift each vector
- Associate a valid length Ei with each NN vector.
31Insertion
- Locate the position j in each NN vector where the distance from the pivot to the new object fits in sorted order
32Insertion (cont)
- If j is not found, we don't need to update this pivot's NN vector (why?)
- If found
- insert the new radius
- shift the vector to the right
- increment Ei by 1.
33Deletion
- Similar to the Insertion
- Locate the position of the deleted object's distance in each NN vector
- If not found, no update for this vector
- If found
- remove rj
- shift the rest to the left
- decrement Ei by 1
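The insertion and deletion steps above can be sketched for a single pivot as follows. The class and function names are hypothetical, and two details are assumptions rather than statements from the slides: the valid length is capped at T on insertion, and an object at exactly the last valid distance is treated as "not found".

```python
import bisect
import math
from dataclasses import dataclass


@dataclass
class NNVector:
    pivot: tuple
    r: list     # sorted NN distances; only the first `valid` entries are trusted
    T: int      # maximum vector length
    valid: int  # the valid length E_i


def insert_object(vec, obj):
    """Update one pivot's NN vector for a newly inserted object."""
    d = math.dist(vec.pivot, obj)
    known = vec.r[:vec.valid]
    if not known or d >= known[-1]:
        return False  # position not found: the trusted prefix is unaffected
    known.insert(bisect.bisect_right(known, d), d)  # shifts the tail right
    vec.r = known[:vec.T]                           # entry T+1 falls off the end
    vec.valid = len(vec.r)                          # E_i + 1, capped at T
    return True


def delete_object(vec, obj):
    """Update one pivot's NN vector for a deleted object."""
    d = math.dist(vec.pivot, obj)
    known = vec.r[:vec.valid]
    j = bisect.bisect_left(known, d)
    if j == len(known) or known[j] != d:
        return False  # the object was not among the trusted nearest neighbors
    del known[j]      # shifts the rest left
    vec.r = known
    vec.valid = len(known)  # E_i - 1: the next neighbor's distance is unknown
    return True


vec = NNVector(pivot=(0.0, 0.0), r=[1.0, 2.0, 3.0], T=3, valid=3)
insert_object(vec, (1.5, 0.0))  # falls between r1 and r2; the old r3 drops off
delete_object(vec, (1.5, 0.0))  # removes it again; valid length shrinks to 2
```

The valid length is what makes deletion cheap: instead of running a new KNN query to refill the vector, the structure simply remembers that only a shorter prefix can still be trusted.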
34Experiments
- Dataset
- Corel image database
- Contains 60,000 images
- Each image represented by a 32-dimensional float vector
- Test bed
- PC: 1.5GHz Athlon, 512MB memory, 80GB HD, Windows 2000.
- GNU C in CYGWIN
35Questions to be answered
- Is the pruning using NNH estimates powerful?
- KNN queries
- NN-join queries
- Is it cheap to have such a structure?
- Storage
- Initial construction
- Incremental maintenance
36Improvement in k-NN search
- Run the k-means algorithm to generate 400 pivots, and construct the NNH
- Perform 10-NN queries on 100 randomly selected query objects.
- Queue size as the benchmark for memory usage.
- Max queue size
- Average queue size
37Reduced Memory Requirement
38Reduced running time
39Effects of different # of pivots
40Improvement in k-NN joins
- Selected two subsets from the Corel dataset. Each contains 1500 objects.
- Unfortunately couldn't run on the PC due to the large memory requirement
- Ran on a SUN Ultra 4 workstation with four 300MHz CPUs and 3GB memory.
- Constructed NNH (400 pivots) for D2.
41Join: Reduced memory requirement
42Join: Reduced running time
43Join: Effects of different # of pivots
44Join: Running time for different data sizes
45Cost/Benefit of NNH
For 60,000 32-d float vectors. A value of 0 means almost zero.

# of pivots (m)                 10    50   100   150   200   250   300   350   400
Construction time (sec)        0.7  3.59   6.6   9.4  11.5  13.7  15.7  17.8  20.4
Storage space (kB)               2    10    20    30    40    50    60    70    80
Incr. maintenance time (ms)      0     0     0     0     0     0     0     0     0
Improved q-size, kNN (%)        40    30    28    24    24    24    23    20    18
Improved q-size, join (%)       45    34    28    26    26    25    24    24    22
46Conclusion
- NNH: an efficient, effective approach to improving NN-search performance.
- Can be easily embedded into current implementations of NN algorithms.
- Can be efficiently constructed and maintained.
- Offers substantial performance advantages.
47Work conducted in the Flamingo Project on Data
Cleansing at UC Irvine