Similarity Search in High Dimensions via Hashing

1
Similarity Search in High Dimensions via Hashing
  • Aristides Gionis, Piotr Indyk, Rajeev Motwani
  • Presented by
  • Fatih Uzun

2
Outline
  • Introduction
  • Problem Description
  • Key Idea
  • Experiments and Results
  • Conclusions

3
Introduction
  • Similarity Search over High-Dimensional Data
  • Image databases, document collections, etc.
  • Curse of Dimensionality
  • All space-partitioning techniques degrade to
    linear search in high dimensions
  • Exact vs. Approximate Answer
  • An approximate answer may be good enough and much faster
  • Time-quality trade-off

4
Problem Description
  • ε-Nearest Neighbor Search (ε-NNS)
  • Given a set P of points in a normed space,
    preprocess P so as to efficiently return a point
    p ∈ P for any given query point q, such that
  • dist(q,p) ≤ (1 + ε) · min_{r ∈ P} dist(q,r)
  • Generalizes to K-nearest neighbor search (K > 1)
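
The ε-NNS condition above can be checked directly against a brute-force scan. The following is a minimal sketch, assuming Euclidean distance (the slide only requires some normed space); the function names are hypothetical and chosen for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance (an assumption; any norm would do)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_eps_nearest_neighbor(p, q, points, eps):
    """True if dist(q, p) <= (1 + eps) * min over r in points of dist(q, r)."""
    best = min(dist(q, r) for r in points)  # exact nearest-neighbor distance
    return dist(q, p) <= (1 + eps) * best

# Toy data: the exact nearest neighbor of q is (0, 0) at distance 1.
points = [(0.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
q = (1.0, 0.0)

# (3, 0) is at distance 2, so it qualifies for eps = 1 but not for eps = 0.5.
ok_loose = is_eps_nearest_neighbor((3.0, 0.0), q, points, eps=1.0)
ok_tight = is_eps_nearest_neighbor((3.0, 0.0), q, points, eps=0.5)
```

The linear scan inside the check is exactly the cost that the hashing scheme tries to avoid at query time; it is used here only to state the definition.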

5
Problem Description

6
Key Idea
  • Locality Sensitive Hashing (LSH) to get
    sub-linear dependence on the data size for
    high-dimensional data
  • Preprocessing
  • Hash the data points using several LSH functions
    so that the probability of collision is higher for
    closer objects

7
Algorithm Preprocessing
  • Input
  • Set of N points p1, ..., pN
  • L (number of hash tables)
  • Output
  • Hash tables Ti, i = 1, 2, ..., L
  • Foreach i = 1, 2, ..., L
  • Initialize Ti with a random hash function
    gi(.)
  • Foreach i = 1, 2, ..., L
  • Foreach j = 1, 2, ..., N
  • Store point pj in bucket gi(pj) of hash table
    Ti
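
The preprocessing loop above might look as follows in Python. This is a sketch under stated assumptions: points are binary tuples, and each gi samples k random coordinates (bit sampling, a common LSH family for Hamming distance; the slide does not name the family). The names `make_g`, `preprocess`, `k`, and `seed` are hypothetical.

```python
import random
from collections import defaultdict

def make_g(dim, k, rng):
    """Return g(p) = (p[i1], ..., p[ik]) for k randomly sampled coordinates."""
    idx = [rng.randrange(dim) for _ in range(k)]
    return lambda p: tuple(p[i] for i in idx)

def preprocess(points, L, k, seed=0):
    """Build L hash tables; table i maps bucket gi(p) to a list of point indices."""
    rng = random.Random(seed)
    dim = len(points[0])
    gs = [make_g(dim, k, rng) for _ in range(L)]      # one random gi per table
    tables = [defaultdict(list) for _ in range(L)]
    for g, table in zip(gs, tables):                  # Foreach i = 1..L
        for j, p in enumerate(points):                # Foreach j = 1..N
            table[g(p)].append(j)                     # store pj in bucket gi(pj)
    return gs, tables

points = [(0, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 0)]
gs, tables = preprocess(points, L=3, k=2)
```

Every point is stored once per table, which is the source of the extra storage cost: N entries in each of the L tables.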

8
LSH - Algorithm
  • [Figure: each point pi in P is hashed by g1, g2, ..., gL
    into buckets g1(pi), g2(pi), ..., gL(pi) of hash tables
    T1, T2, ..., TL]

9
Algorithm ε-NNS Query
  • Input
  • Query point q
  • K (number of approx. nearest neighbors)
  • Access
  • Hash tables Ti, i = 1, 2, ..., L
  • Output
  • Set S of K (or fewer) approx. nearest neighbors
  • S ← ∅
  • Foreach i = 1, 2, ..., L
  • S ← S ∪ points found in gi(q) bucket of hash
    table Ti
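
The query procedure above can be sketched end to end as below. The assumptions are the same as for any toy version: binary points, bit-sampling hash functions, and Hamming distance. The final step that ranks the candidates and keeps the K closest is an addition for illustration; the slide only shows the union of buckets. All function and parameter names are hypothetical.

```python
import random
from collections import defaultdict

def hamming(a, b):
    """Number of coordinates in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def build(points, L, k, seed=0):
    """Build L tables, each keyed by k randomly sampled coordinates."""
    rng = random.Random(seed)
    dim = len(points[0])
    index_sets = [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for idx, table in zip(index_sets, tables):
        for j, p in enumerate(points):
            table[tuple(p[i] for i in idx)].append(j)
    return index_sets, tables

def query(q, points, index_sets, tables, K):
    """Collect candidates from bucket gi(q) of every table, keep the K closest."""
    S = set()                                          # S <- empty set
    for idx, table in zip(index_sets, tables):         # Foreach i = 1..L
        S.update(table.get(tuple(q[i] for i in idx), []))
    # Assumed final step: rank candidates by distance to q (ties by index).
    return sorted(S, key=lambda j: (hamming(q, points[j]), j))[:K]

points = [(0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1),
          (0, 0, 0, 0, 0, 1), (1, 1, 1, 0, 0, 0)]
index_sets, tables = build(points, L=4, k=3)
result = query(points[1], points, index_sets, tables, K=2)
```

Because the query here equals a stored point, it collides with that point in every table, so index 1 is always the first result. For other queries the result may hold fewer than K points, which is what the miss ratio measures.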

10
LSH - Analysis
  • Family H of (r1, r2, p1, p2)-sensitive functions,
    {hi(.)}
  • dist(p,q) < r1 ⇒ ProbH[h(q) = h(p)] ≥ p1
  • dist(p,q) ≥ r2 ⇒ ProbH[h(q) = h(p)] ≤ p2
  • p1 > p2 and r1 < r2
  • LSH functions gi(.) = (h1(.), ..., hk(.))
  • For a proper choice of k and L, a simpler
    problem, (r,ε)-Neighbor, and hence the actual
    problem can be solved
  • Query Time: O(d · n^(1/(1+ε)))
  • d: dimensions, n: data size
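
To see why k and L matter, it helps to compute how concatenating k functions and repeating over L tables changes the collision probabilities. The sketch below assumes the k functions in each g are drawn independently, so a pair that collides under one h with probability p collides under g with probability p^k, and shares a bucket in at least one of L tables with probability 1 - (1 - p^k)^L. The exponent ln(1/p1) / ln(1/p2) is the quantity usually written ρ in analyses of LSH; the numeric values of p1, p2, k, and L are illustrative only.

```python
import math

def collision_prob_g(p, k):
    """Probability that all k independent h's agree, given each agrees with prob p."""
    return p ** k

def prob_found(p, k, L):
    """Probability of sharing a bucket in at least one of L independent tables."""
    return 1.0 - (1.0 - collision_prob_g(p, k)) ** L

def rho(p1, p2):
    """Exponent ln(1/p1) / ln(1/p2); below 1 whenever p1 > p2."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)

p1, p2 = 0.9, 0.5        # illustrative values, not from the slides
k, L = 10, 20

near = prob_found(p1, k, L)   # chance a close point (dist < r1) is retrieved
far = prob_found(p2, k, L)    # chance a far point (dist >= r2) is retrieved
exponent = rho(p1, p2)
```

With these illustrative numbers a close point is retrieved with probability above 0.99 while a far point is retrieved with probability below 0.05: larger k suppresses far collisions, and larger L recovers the near collisions that k would otherwise lose.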

11
Experiments
  • Data Sets
  • Color images from COREL Draw library (20,000
    points, dimensions up to 64)
  • Texture information of aerial photographs
    (270,000 points, 60 dimensions)
  • Evaluation
  • Speed, Miss Ratio, Error (%) for various data
    sizes, dimensions, and K values
  • Compare performance with SR-Tree (a spatial data
    structure)

12
Performance Measures
  • Speed
  • Number of disk block accesses in order to answer
    the query ( hash tables)
  • Miss Ratio
  • Fraction of cases when fewer than K points are
    found for K-NNS
  • Error
  • Average of fractional error in distance to point
    found by LSH as compared to nearest neighbor
    distance taken over entire set of queries
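
The miss ratio and error measures above could be computed as in the following sketch. It reads "fractional error" as (d_LSH - d_exact) / d_exact averaged over all queries, which is one interpretation of the slide's wording, and it assumes every exact nearest-neighbor distance is non-zero. The function names and sample numbers are hypothetical.

```python
def miss_ratio(results, K):
    """Fraction of queries for which fewer than K points were returned."""
    return sum(len(r) < K for r in results) / len(results)

def average_error(lsh_dists, exact_dists):
    """Mean of (d_lsh - d_exact) / d_exact over all queries (d_exact > 0 assumed)."""
    pairs = list(zip(lsh_dists, exact_dists))
    return sum((d - e) / e for d, e in pairs) / len(pairs)

# Three queries with K = 2: the second one returned only one point.
results = [[1, 2], [3], [4, 5]]
mr = miss_ratio(results, K=2)

# Two queries: the first found the true nearest neighbor, the second was 50% farther.
err = average_error([2.0, 3.0], [2.0, 2.0])
```

Multiplying `err` by 100 gives the error as a percentage.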

13
Speed vs. Data Size

14
Speed vs. Dimension

15
Speed vs. Nearest Neighbors

16
Speed vs. Error

17
Miss Ratio vs. Data Size

18
Conclusion
  • Better Query Time than Spatial Data Structures
  • Scales well to higher dimensions and larger data
    size ( Sub-linear dependence )
  • Predictable running time
  • Extra storage overhead
  • Inefficient for data with distances concentrated
    around the average

19
Future Work
  • Investigate hybrid data structures obtained by
    merging tree-based and hash-based structures
  • Make use of the structure of the data-set to
    systematically obtain LSH functions
  • Explore other applications of LSH-type techniques
    to data mining

20
Questions ?