NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms

1
NNH: Improving Performance of Nearest-Neighbor
Searches Using Histograms
  • Liang Jin (UC Irvine)
  • Nick Koudas (AT&T)
  • Chen Li (UC Irvine)

2
Outline
  • Motivation: NN search
  • NNH: Proposed histogram structure
  • Main idea
  • Utilizing NNH in a search (KNN, join)
  • Constructing NNH
  • Incremental maintenance
  • Experiments

3
NN (nearest-neighbor) search
  • KNN: find the k nearest neighbors of an object.

4
Example: image search
[Figure: query image]
  • Images represented as features (color histogram,
    texture moments, etc.)
  • Similarity search using these features
  • Find 10 most similar images for the query image

5
Other Applications
  • Web-page search
  • Find 100 most similar pages for a given page
  • Page represented as a word-frequency vector
  • Similarity: vector distance
  • GIS: find the 5 closest cities to Irvine
  • CAD, information retrieval, molecular biology,
    data cleansing, etc.
  • Challenges: efficiency, scalability

6
NN Algorithms
  • Distance measurement
  • For objects that are points, distance is well defined
  • Usually Euclidean
  • Other distances possible
  • For arbitrarily-shaped objects, assume we have a
    distance function between them
  • Most algorithms assume a high-dimensional tree
    structure exists for the datasets.

7
Example: R-trees
Take 2-d space as an example.
8
Minimal Bounding Rectangle (MBR)
  • An MBR is an n-dimensional rectangle that bounds its
    corresponding objects.
  • MBR face property: every face of any MBR contains
    at least one point of some object.
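As an illustration (not part of the original slides), a minimal Python sketch of computing an MBR, representing a rectangle by its lower and upper corner vectors:

```python
# Minimal sketch: computing the MBR of a set of n-dimensional points.
# A rectangle is stored as a (lower, upper) pair of corner tuples.

def mbr(points):
    """Return the minimal bounding rectangle of a list of points."""
    dims = len(points[0])
    lower = tuple(min(p[d] for p in points) for d in range(dims))
    upper = tuple(max(p[d] for p in points) for d in range(dims))
    return lower, upper

# Example in 2-d: every face of the MBR touches at least one point.
box = mbr([(1, 4), (3, 1), (2, 5)])
# box == ((1, 1), (3, 5))
```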

9
Search process (1-NN for example)
  • Most algorithms traverse the structure (e.g.,
    R-tree) top down, and follow a branch-and-bound
    approach
  • Keep a priority queue of nodes (MBRs) to be
    visited
  • Sorted based on the minimum distance between q
    and each node
  • Improvement
  • Use MINDIST and MINMAXDIST
  • Reduce the queue size
  • Avoid unnecessary disk IOs to access MBRs

[Figure: priority queue]
10
MINDIST and MINMAXDIST
[Figure: MINDIST and MINMAXDIST from a query point q to an MBR]
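As an illustrative sketch (not the slides' code), the two bounds of Roussopoulos et al., assuming a rectangle is a (lower, upper) pair of corner tuples:

```python
import math

def mindist(q, rect):
    """MINDIST(q, R): minimum distance from point q to rectangle R;
    zero if q lies inside R."""
    lower, upper = rect
    s = 0.0
    for qi, li, ui in zip(q, lower, upper):
        if qi < li:
            s += (li - qi) ** 2
        elif qi > ui:
            s += (qi - ui) ** 2
    return math.sqrt(s)

def minmaxdist(q, rect):
    """MINMAXDIST(q, R): by the MBR face property, R is guaranteed to
    contain an object within this distance of q."""
    lower, upper = rect
    dims = len(q)
    # rm[k]: nearer face in dimension k; rM[i]: farther face in dimension i
    rm = [lower[i] if q[i] <= (lower[i] + upper[i]) / 2 else upper[i]
          for i in range(dims)]
    rM = [lower[i] if q[i] >= (lower[i] + upper[i]) / 2 else upper[i]
          for i in range(dims)]
    total = sum((q[i] - rM[i]) ** 2 for i in range(dims))
    return math.sqrt(min(
        (q[k] - rm[k]) ** 2 + (total - (q[k] - rM[k]) ** 2)
        for k in range(dims)))
```

For a query point at the origin and the unit square with corners (1, 1) and (2, 2), MINDIST is sqrt(2) (distance to the nearest corner) and MINMAXDIST is sqrt(5) (distance to the far end of the nearest face).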
11
Pruning in NN search
  • 1. Discard mbr1 if MINDIST(q, mbr1) >
    MINMAXDIST(q, mbr2)
  • 2. Discard an object o if dist(q, o) >
    MINMAXDIST(q, mbr)
  • 3. Discard mbr1 if MINDIST(q, mbr1) > dist(q, o)

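To make the branch-and-bound traversal concrete, here is a minimal sketch (an assumption-laden toy, not the talk's implementation) of a best-first k-NN search over a one-level "tree" whose leaves carry MBRs, pruning a leaf whenever its MINDIST exceeds the current k-th best distance:

```python
import heapq
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mindist(q, rect):
    """Minimum distance from point q to a (lower, upper) rectangle."""
    lower, upper = rect
    return math.sqrt(sum(max(l - x, 0.0, x - u) ** 2
                         for x, l, u in zip(q, lower, upper)))

def knn_search(q, leaves, k):
    """Branch-and-bound k-NN over leaves of the form (mbr, points).
    Leaves are visited in order of MINDIST; once k candidates are
    found, any leaf whose MINDIST exceeds the k-th best distance is
    pruned, and the sorted MINDIST order lets us stop entirely."""
    heap = [(mindist(q, mbr), i) for i, (mbr, _) in enumerate(leaves)]
    heapq.heapify(heap)
    best = []  # max-heap of (-distance, point), size <= k
    while heap:
        d, i = heapq.heappop(heap)
        if len(best) == k and d > -best[0][0]:
            break  # every remaining node is at least this far away
        for p in leaves[i][1]:
            dp = dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-dp, p))
            elif dp < -best[0][0]:
                heapq.heapreplace(best, (-dp, p))
    return sorted((-nd, p) for nd, p in best)
```

A real R-tree would recurse through internal nodes with the same priority queue; this flat version only shows the queue-plus-pruning pattern.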
12
Problem
  • Queue size may be large
  • Example: 60,000 32-d (image) vectors, 50 NNs
  • Max queue size: 15K entries
  • Avg queue size: about half (7.5K entries)
  • If the queue can't fit in memory, more disk IOs!
  • Problem worse for k-NN joins
  • E.g., 1500 x 1500 join
  • Max queue size: 1.7M entries > 1 GB memory!
  • 750 seconds to run
  • Couldn't scale up to 2,000 objects!
  • Disk thrashing

13
Our Solution: Nearest-Neighbor Histograms (NNH)
  • Main idea
  • Utilizing NNH in a search (KNN, join)
  • Constructing NNH
  • Incremental maintenance

14
NNH: Nearest-Neighbor Histograms
[Figure: m pivots p1, p2, …, pm, each storing the distances
r1, r2, … of its nearest neighbors]
15
Main idea
  • Keep a histogram of NN distances of a
    pre-selected collection of objects (pivots).
  • They are not part of the database
  • They give a big picture of the objects' locations
  • Use the histogram to estimate the NN distance of
    a given query object.
  • Use these estimated NN distances to do more
    pruning in an NN search

16
Structure
  • Nearest-neighbor vectors: for each pivot p, a vector
    (r1, r2, …, rT), where ri is the distance of p's i-th
    NN and T is the length of each vector
  • Nearest-neighbor histogram: the collection of m pivots
    with their NN vectors

17
Estimate NN distance for a query object
  • NNH does not give exact NN information for an
    object
  • But we can estimate an upper bound r_q^est for the
    k-NN distance of q
  • Triangle inequality: dist(q, o) <= dist(q, p) + dist(p, o),
    so for any pivot p, dist(q, p) + r_k(p) is an upper
    bound on q's k-NN distance
18
Estimate NN for query object (cont.)
  • Apply the triangle inequality to all pivots
  • Upper bound estimate of the NN distance of q:
    r_q^est = min over pivots pi of ( dist(q, pi) + r_k(pi) )
  • Complexity: O(m)
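The estimate above can be sketched in a few lines of Python (an illustration, with a brute-force NNH build standing in for the offline k-NN queries the talk describes):

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_nnh(pivots, database, T):
    """NNH: for each pivot, the sorted distances to its T nearest
    neighbors in the database (brute force here; the talk runs
    k-NN queries offline)."""
    return [(p, sorted(dist(p, o) for o in database)[:T])
            for p in pivots]

def knn_upper_bound(q, nnh, k):
    """Triangle inequality: every k-NN of pivot p lies within
    dist(q, p) + r_k(p) of q, so q has at least k objects within
    that radius.  Take the minimum over all m pivots -- O(m)."""
    return min(dist(q, p) + radii[k - 1] for p, radii in nnh)
```

During a search, an MBR can then be pruned whenever MINDIST(q, mbr) exceeds this estimate.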
19
Utilizing estimates in NN search
  • More pruning: prune an MBR if
    MINDIST(q, mbr) > r_q^est
[Figure: query point q, an MBR, and MINDIST(q, mbr)]
20
Utilizing estimates in NN join
  • K-NN join: for each object o1 in D1, find its
    k nearest neighbors in D2.
  • Preliminary algorithm by Hjaltason and Samet
    [HS98]
  • Traverse the two trees top down; keep a queue of pairs

21
Utilizing estimates in NN join (cont.)
  • Construct the NNH for D2.
  • For each object o1 in D1, keep its estimated NN
    radius r_o1^est using the NNH of D2.
  • As in a k-NN query, ignore an MBR for o1 if
    MINDIST(o1, mbr) > r_o1^est
[Figure: object o1, an MBR from D2's tree, and MINDIST(o1, mbr)]
22
More powerful: prune MBR pairs
23
Prune MBR pairs (cont.)
[Figure: mbr1, mbr2, and MINDIST(mbr1, mbr2)]
Prune this MBR pair if MINDIST(mbr1, mbr2) is greater than
the largest estimated NN radius r_o1^est among the objects
o1 in mbr1
24
How to construct an NNH?
  • If we have selected the m pivots:
  • Just run a KNN query for each pivot to construct the NNH
  • Time: O(m) KNN queries
  • Offline
  • Important: selecting the pivots
  • Size-constraint NNH construction
  • Error-constraint NNH construction

25
Size-constraint NNH construction
  • The number of pivots m determines
  • Storage size
  • Initial construction cost
  • Incremental-maintenance cost
  • Choose m best pivots

26
Size-constraint NNH construction
  • Given the number of pivots m
  • Assuming:
  • query objects are from the database D
  • H(pi, k) doesn't vary too much
  • Goal: find pivots p1, p2, …, pm to minimize
    object distances to the pivots
  • Clustering problem
  • Many algorithms available
  • Use K-means for its simplicity and efficiency

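The slide's k-means step can be sketched as follows (a toy pure-Python Lloyd's iteration; function name and parameters are illustrative, not from the talk). Note the resulting centroids need not be database objects, matching the earlier note that pivots are not part of the database:

```python
import math
import random

def kmeans_pivots(data, m, iters=20, seed=0):
    """Pick m pivots as k-means centroids of the dataset
    (Lloyd's algorithm, illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(data, m)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(m)]
        for p in data:
            j = min(range(m),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # update step: move each center to its cluster's mean
        for j, cl in enumerate(clusters):
            if cl:  # keep the old center for an empty cluster
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers
```

On two well-separated clusters the centers converge to the cluster means; in practice one would use a tuned library implementation on the 32-d feature vectors.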
27
Error-constraint NNH construction
  • Assumptions
  • A threshold r is set a priori
  • Any estimate to the k-NN distance less than r is
    considered good enough.
  • I.e., a maximum error of r is tolerated for any
    distance estimate.

28
Error-constraint NNH construction (cont)
  • Find a set of points S = {p1, p2, …, pm} from the
    dataset D
  • For each point pi, its kNNs are within distance
    r/2
  • Then, for any point q within distance r/2 of pi, the
    triangle inequality gives a k-NN distance estimate
    with error at most r/2 + r/2 = r

29
Error-constraint NNH construction (cont)
  • Problem: find points such that
  • They cover the entire data set with spheres of
    radius r/2
  • The sum of distances of all points in each sphere
    to its center is minimized
  • An instance of the k-center problem
  • Efficient 2-approximation algorithm using a
    single pass over the dataset

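The slides refer to a single-pass 2-approximation; as an illustration of k-center approximation (not the talk's exact algorithm), here is the classic farthest-first traversal of Gonzalez, which is also a 2-approximation but makes m passes:

```python
import math

def farthest_first(data, m):
    """Gonzalez's farthest-first traversal: repeatedly pick the point
    farthest from all centers chosen so far.  A classic
    2-approximation for the k-center problem (illustrative; the
    talk's single-pass variant differs)."""
    centers = [data[0]]
    # distance from each point to its nearest chosen center so far
    d = [math.dist(p, centers[0]) for p in data]
    while len(centers) < m:
        i = max(range(len(data)), key=d.__getitem__)
        centers.append(data[i])
        d = [min(dj, math.dist(p, data[i])) for dj, p in zip(d, data)]
    return centers
```

After it finishes, every point lies within twice the optimal k-center radius of some chosen center.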
30
Incremental Maintenance
  • How to update the NNH when inserting or deleting
    objects?
  • Need to shift each vector
  • Associate a valid length Ei to each NN vector.

31
Insertion
  • On inserting a new object o, locate the position j in
    each pivot pi's NN vector where
    r_(j-1) <= dist(pi, o) < r_j (within the valid length Ei)
32
Insertion (cont)
  • If no such j exists, we don't need to update this
    pivot's NN vector (why?)
  • If found
  • insert the new radius
  • shift the vector to the right
  • increment Ei by 1.

33
Deletion
  • Similar to insertion
  • On deleting object o, locate the position j where
    r_j = dist(pi, o)
  • If not found, no update for this vector
  • If found
  • remove rj
  • shift the rest to the left
  • decrement Ei by 1

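The insertion and deletion steps above can be sketched for a single pivot's NN vector (a minimal illustration, assuming exact distance matches on deletion; class and method names are mine, not the talk's):

```python
import bisect
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class NNVector:
    """One pivot's NN-distance vector with a valid length E:
    only the first E entries are guaranteed correct."""
    def __init__(self, pivot, radii):
        self.pivot = pivot
        self.radii = list(radii)   # sorted NN distances r1 <= ... <= rT
        self.T = len(self.radii)   # fixed vector length
        self.E = self.T            # valid length

    def insert(self, obj):
        d = dist(self.pivot, obj)
        j = bisect.bisect_left(self.radii[:self.E], d)
        if j >= self.E:
            return                 # j not found: no update needed
        self.radii.insert(j, d)    # insert the new radius, shift right
        del self.radii[self.T:]    # keep the fixed length T
        self.E = min(self.E + 1, self.T)

    def delete(self, obj):
        d = dist(self.pivot, obj)
        j = bisect.bisect_left(self.radii[:self.E], d)
        if j < self.E and abs(self.radii[j] - d) < 1e-12:
            del self.radii[j]      # shift the rest to the left
            self.E -= 1            # last entry is no longer valid
```

Both operations cost one distance computation and a binary search per pivot, which is why the maintenance overhead in the experiments is near zero.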
34
Experiments
  • Dataset
  • Corel image database
  • Contains 60,000 images
  • Each image represented by a 32-dimensional float
    vector
  • Test bed
  • PC: 1.5 GHz Athlon, 512 MB memory, 80 GB HD,
    Windows 2000
  • GNU C in Cygwin

35
Questions to be answered
  • Is the pruning using NNH estimates powerful?
  • KNN queries
  • NN-join queries
  • Is it cheap to have such a structure?
  • Storage
  • Initial construction
  • Incremental maintenance

36
Improvement in k-NN search
  • Ran the k-means algorithm to generate 400 pivots and
    constructed the NNH
  • Performed 10-NN queries on 100 randomly selected
    query objects.
  • Queue size as the benchmark for memory usage.
  • Max queue size
  • Average queue size

37
Reduced Memory Requirement
38
Reduced running time
39
Effects of different # of pivots
40
Improvement in k-NN joins
  • Selected two subsets from the Corel dataset, each
    containing 1,500 objects.
  • Unfortunately, couldn't run on the PC due to the
    large memory requirement
  • Ran on a Sun Ultra 4 workstation with four 300 MHz
    CPUs and 3 GB of memory.
  • Constructed NNH (400 pivots) for D2.

41
Join: Reduced memory requirement
42
Join: Reduced running time
43
Join: Effects of different # of pivots
44
Join: Running time for different data sizes
45
Cost/Benefit of NNH
For 60,000 32-d float vectors:

Pivots (m)                    10    50   100   150   200   250   300   350   400
Construction time (sec)      0.7  3.59   6.6   9.4  11.5  13.7  15.7  17.8  20.4
Storage space (KB)             2    10    20    30    40    50    60    70    80
Incr. maintenance time (ms)   ~0    ~0    ~0    ~0    ~0    ~0    ~0    ~0    ~0
Improved q-size, kNN (%)      40    30    28    24    24    24    23    20    18
Improved q-size, join (%)     45    34    28    26    26    25    24    24    22

(~0 means almost zero.)
46
Conclusion
  • NNH: an efficient, effective approach to improving
    NN-search performance.
  • Can be easily embedded into current
    implementation of NN algorithms.
  • Can be efficiently constructed and maintained.
  • Offers substantial performance advantages.

47
Work conducted in the Flamingo Project on Data
Cleansing at UC Irvine