Similarity Search in High Dimensions via Hashing

1
Similarity Search in High Dimensions via Hashing
  • Aristides Gionis, Piotr Indyk, Rajeev Motwani
  • Presented by Srujana Merugu and Yousuf Ahmed

2
Outline
  • Introduction
  • Problem Description
  • Key Idea
  • Experiments and Results
  • Conclusions

3
Introduction
  • Similarity search over high-dimensional data
  • Image databases, document collections, etc.
  • Curse of dimensionality
  • All space-partitioning techniques degrade to
    linear search in high dimensions
  • Exact vs. approximate answers
  • An approximate answer may be good enough and
    much faster
  • Time-quality trade-off

4
Problem Description
  • ? - Nearest Neighbor Search (? - NNS)
  • Given a set P of points in a normed space ,
    preprocess P so as to efficiently return a point
    p ? P for any given query point q, such that
  • dist(q,p) ? (1 ? ) ? min r ? P dist(q,r)
  • Generalizes to K- nearest neighbor search ( K gt1)
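The guarantee in this definition can be checked directly by brute force on small data. A minimal sketch in Python (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def satisfies_eps_nns(q, p, P, eps):
    """True iff p is a valid answer to the eps-NNS query q over point
    set P: dist(q, p) <= (1 + eps) * min over r in P of dist(q, r)."""
    dists = np.linalg.norm(P - q, axis=1)   # distance from q to every point
    return np.linalg.norm(q - p) <= (1 + eps) * dists.min()

# Toy check: the exact nearest neighbor always satisfies the guarantee,
# while a distant point does not (for small eps).
P = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
q = np.array([1.0, 1.0])
nearest = P[np.argmin(np.linalg.norm(P - q, axis=1))]
assert satisfies_eps_nns(q, nearest, P, eps=0.0)
assert not satisfies_eps_nns(q, P[2], P, eps=0.1)
```

Any point within a (1 + ε) factor of the true nearest-neighbor distance is an acceptable answer, which is what gives LSH its room to trade accuracy for speed.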

5
Problem Description
6
Key Idea
  • Locality Sensitive Hashing ( LSH ) instead of
    space partitioning to get sub-linear dependence
    on the data-size for high-dimensional data
  • Preprocessing
  • Hash the data-point using several LSH functions
    so that probability of collision is higher for
    closer objects
  • Querying
  • Hash query point and retrieve elements in the
    buckets containing the query point

7
Algorithm: Preprocessing
  • Input
  • Set of n points p1, ..., pn
  • L (number of hash tables)
  • Output
  • Hash tables Ti, i = 1, 2, ..., L
  • For each i = 1, 2, ..., L
  • Initialize Ti with a random hash function
    gi(·)
  • For each i = 1, 2, ..., L
  • For each j = 1, 2, ..., n
  • Store point pj in bucket gi(pj) of hash table
    Ti
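The preprocessing loop above can be sketched in Python. The function name is illustrative, and random-hyperplane sign hashes stand in for the hash family (the paper itself samples bit coordinates in Hamming space):

```python
import numpy as np

def build_tables(points, L=4, k=8, seed=0):
    """Preprocessing: hash every point into L hash tables. Each g_i
    concatenates k single-bit hashes; here the bits are signs of random
    projections (an illustrative stand-in for the paper's family)."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    planes = [rng.standard_normal((k, d)) for _ in range(L)]  # defines g_1..g_L
    tables = [{} for _ in range(L)]
    for i in range(L):
        keys = points @ planes[i].T > 0       # g_i(p_j): k sign bits per point
        for j, key in enumerate(map(tuple, keys)):
            tables[i].setdefault(key, []).append(j)  # store p_j in bucket g_i(p_j)
    return planes, tables
```

Each of the n points lands in exactly one bucket per table, so preprocessing costs O(nL) hash evaluations and O(nL) extra storage, matching the storage overhead noted in the conclusions.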

8
LSH - Algorithm
(figure: each point pi of P is hashed by g1, g2, ..., gL into buckets of hash tables T1, T2, ..., TL)
9
Algorithm: ε-NNS Query
  • Input
  • Query point q
  • K (number of approximate nearest neighbors)
  • Access
  • Hash tables Ti, i = 1, 2, ..., L
  • Output
  • Set S of K (or fewer) approximate nearest neighbors
  • S ← ∅
  • For each i = 1, 2, ..., L
  • S ← S ∪ {points found in bucket gi(q) of hash
    table Ti}
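Putting the preprocessing and query slides together, an end-to-end sketch (the class name is illustrative, and random-hyperplane hashes again stand in for the paper's bit-sampling family):

```python
import numpy as np

class LSHIndex:
    """Minimal sketch of the two slide algorithms: build L hash tables,
    then answer a query from the union of the buckets g_i(q)."""

    def __init__(self, points, L=8, k=6, seed=0):
        rng = np.random.default_rng(seed)
        self.points = points
        # One set of k random hyperplanes per table defines g_i
        self.planes = [rng.standard_normal((k, points.shape[1]))
                       for _ in range(L)]
        self.tables = []
        for W in self.planes:
            table = {}
            for j, key in enumerate(map(tuple, points @ W.T > 0)):
                table.setdefault(key, []).append(j)
            self.tables.append(table)

    def query(self, q, K=1):
        # S <- union over i of the points in bucket g_i(q) of table T_i
        S = set()
        for W, table in zip(self.planes, self.tables):
            S.update(table.get(tuple(W @ q > 0), []))
        # Rank candidates by true distance and return up to K of them;
        # fewer than K may be found (this is the "miss" case).
        return sorted(S, key=lambda j: np.linalg.norm(self.points[j] - q))[:K]

# Querying with a stored point always finds it: q collides with itself
# in every table, so its own index is returned first (distance 0).
pts = np.random.default_rng(3).standard_normal((100, 16))
index = LSHIndex(pts)
assert index.query(pts[0], K=1) == [0]
```

Note that only candidates from the L retrieved buckets are ever examined, which is where the sub-linear query time comes from.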

10
LSH - Analysis
  • Family H of (r1, r2, p1, p2) sensitive functions,
    hi(.)
  • dist(p,q) lt r1 ? ProbH h(q) h(p) ? p1
  • dist(p,q) ? r2 ? ProbH h(q) h(p) ? p2
  • LSH functions gi(.) h1(.) hk(.)
  • For a proper choice of k and l, a simpler
    problem, (r,?)-Neighbor, and hence the actual
    problem can be solved
  • Query Time O(d ?n1/(1?) )
  • d dimensions , n data size
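The sensitivity condition can be illustrated empirically. For the random-hyperplane family h_a(x) = sign(a·x), again a stand-in for the paper's family, a single hash collides with probability 1 - angle(p,q)/π, so nearby points should collide noticeably more often than distant ones:

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials = 16, 2000

def collision_rate(p, q):
    """Fraction of random-hyperplane hashes h_a(x) = sign(a . x)
    on which p and q collide; estimates Prob_H[h(p) = h(q)]."""
    a = rng.standard_normal((trials, d))
    return np.mean((a @ p > 0) == (a @ q > 0))

p = rng.standard_normal(d)
close = p + 0.1 * rng.standard_normal(d)   # small angle to p
far = -p + 0.1 * rng.standard_normal(d)    # nearly opposite direction
p1_hat, p2_hat = collision_rate(p, close), collision_rate(p, far)
assert p1_hat > p2_hat   # closer pair collides more often: locality-sensitive
```

Concatenating k such hashes into gi drives the far-pair collision rate down to roughly p2^k, while using L independent tables keeps the near-pair retrieval probability high; balancing the two yields the n^(1/(1+ε)) query bound.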

11
LSH - Analysis
12
Experiments
  • Data Sets
  • Color images from COREL Draw library (20,000
    points,dimensions up to 64)
  • Texture information of aerial photographs
    (270,000 points, dimensions 60)
  • Evaluation
  • Speed, Miss Ratio, Error () for various data
    sizes, dimensions, and K values
  • Compare Performance with SR-Tree ( Spatial Data
    Structure )

13
Performance Measures
  • Speed
  • Number of disk block accesses in order to answer
    the query ( hash tables)
  • Miss Ratio
  • Fraction of cases when less than K points are
    found for K-NNS
  • Error
  • Average of fractional error in distance to point
    found by LSH as compared to nearest neighbor
    distance taken over entire set of queries
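Under one plausible reading of these definitions, the two quality measures can be computed as follows (helper names and the toy numbers are illustrative):

```python
import numpy as np

def miss_ratio(results, K):
    """Fraction of queries for which LSH returned fewer than K points."""
    return np.mean([len(r) < K for r in results])

def avg_error(lsh_dists, nn_dists):
    """Average fractional error of the LSH answer's distance versus the
    true nearest-neighbor distance, over all queries."""
    return np.mean([(l - n) / n for l, n in zip(lsh_dists, nn_dists)])

# Toy example with hypothetical per-query results, K = 2
results = [[3, 7], [12], []]            # indices returned for each query
assert abs(miss_ratio(results, K=2) - 2/3) < 1e-12
assert abs(avg_error([1.1, 2.0], [1.0, 2.0]) - 0.05) < 1e-12
```

An error of 0 means LSH found the exact nearest neighbor; the experiments trade a small positive error for large speedups over the SR-tree.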

14
Speed vs. Data Size
15
Speed vs. Dimension
16
Speed vs. Nearest Neighbors
17
Speed vs. Error
18
Miss Ratio vs. Data Size
19
Conclusion
  • Better Query Time than Spatial Data Structures
  • Scales well to higher dimensions and larger data
    size ( Sub-linear dependence )
  • Predictable running time
  • Extra storage over-head
  • Inefficient for data with distances concentrated
    around average

20
Future Work
  • Investigate hybrid data structures obtained by
    merging tree-based and hash-based structures
  • Make use of the structure of the data set to
    systematically obtain LSH functions
  • Explore other applications of LSH-type techniques
    to data mining

21
Questions?