1
Similarity Search in High Dimensions via Hashing
  • Aristides Gionis, Piotr Indyk and Rajeev Motwani
  • Department of Computer Science
  • Stanford University
  • presented by Jiyun Byun
  • Vision Research Lab in ECE at UCSB

2
Outline
  • Introduction
  • Locality Sensitive Hashing
  • Analysis
  • Experiments
  • Concluding Remarks

3
Introduction
  • Nearest neighbor search (NNS)
  • The curse of dimensionality
  • experimental approaches use heuristics
  • analytical approaches
  • Approximate approach
  • ε-Nearest Neighbor Search (ε-NNS)
  • Goal: for any given query q ∈ R^d, return the
    points p ∈ P with d(q,p) ≤ (1+ε)·d(q,P),
    where d(q,P) is the distance from q to its
    closest point in P
  • right answers are much closer than irrelevant
    ones
  • time/quality trade-off
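To make the guarantee concrete, here is a minimal Python sketch (the names dist and eps_nn_ok are illustrative, not from the paper) that checks a returned point against the ε-NNS condition using a brute-force scan:

    import math

    def dist(p, q):
        # Euclidean distance between two points given as coordinate sequences
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def eps_nn_ok(q, p, P, eps):
        # p satisfies the eps-NNS guarantee if d(q,p) <= (1 + eps) * d(q,P),
        # where d(q,P) is the exact nearest-neighbor distance (brute force)
        d_star = min(dist(q, x) for x in P)
        return dist(q, p) <= (1 + eps) * d_star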

4
Locality Sensitive Hashing (LSH)
  • Collision probability depends on distance between
    points
  • higher collision probability for close objects
  • lower collision probability for objects that are
    far apart
  • Given a query point,
  • hash it using a set of hash functions
  • inspect the entries in each bucket

5
  • Locality Sensitive Hashing

6
Locality Sensitive Hashing (LSH): Setting
  • C: the largest coordinate value among all points
    in the given dataset P of dimension d (P ⊂ R^d)
  • Embed P into the Hamming cube {0,1}^d'
  • dimension d' = C·d
  • v(p) = UnaryC(x1) … UnaryC(xd)
  • use the unary code for each point along each
    dimension
  • isometric embedding
  • d1(p,q) = dH(v(p),v(q))
  • the embedding preserves the distance between
    points
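A minimal sketch of this embedding, assuming positive integer coordinates bounded by C (function names are illustrative):

    def unary(x, C):
        # Unary code of integer x: x ones followed by (C - x) zeros
        return [1] * x + [0] * (C - x)

    def embed(p, C):
        # Concatenate the unary codes of all coordinates: maps p into {0,1}^(C*d)
        return [b for x in p for b in unary(x, C)]

    def hamming(u, v):
        # Hamming distance; for embedded points it equals the l1 distance
        # of the originals, i.e. the embedding is isometric
        return sum(a != b for a, b in zip(u, v))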

7
Locality Sensitive Hashing (LSH): Hash functions (1/2)
  • Build hash functions on the Hamming cube in d'
    dimensions
  • Choose L subsets of the dimensions, I1, I2, …, IL
  • each Ij consists of k elements of {1, …, d'}
  • found by sampling uniformly at random with
    replacement
  • Project each point onto each Ij
  • gj(p): the projection of p onto Ij, obtained by
    concatenating the bit values of p for the
    dimensions in Ij
  • Store p in buckets gj(p), j = 1, …, L
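A sketch of constructing the L index sets and the projection function gj (names are illustrative):

    import random

    def make_projections(d_prime, k, L, seed=0):
        # Each I_j is k coordinates of {0, ..., d'-1}, sampled uniformly
        # at random with replacement
        rng = random.Random(seed)
        return [[rng.randrange(d_prime) for _ in range(k)] for _ in range(L)]

    def g(p_bits, I_j):
        # Bucket key: the bits of p concatenated over the sampled coordinates
        return tuple(p_bits[i] for i in I_j)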

8
Locality Sensitive Hashing (LSH): Hash functions (2/2)
  • Two levels of hashing
  • LSH function
  • maps a point p to the bucket gj(p)
  • standard hash function
  • maps the contents of the buckets into a hash table
    of size M
  • B: bucket capacity; α: memory utilization
    parameter
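A minimal sketch of the two-level scheme, continuing the previous sketch; Python's built-in hash stands in for the standard second-level hash function, and the parameter values are illustrative:

    d_prime, k, L, M = 1024, 64, 10, 2 ** 20
    projections = make_projections(d_prime, k, L)
    tables = [{} for _ in range(L)]          # one size-M table per index

    def insert(p_bits, point_id):
        # First level: LSH bucket key g_j(p); second level: standard hash
        # of the key into a table of size M
        for j, I_j in enumerate(projections):
            slot = hash(g(p_bits, I_j)) % M
            tables[j].setdefault(slot, []).append(point_id)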

9
Query processing
  • Search the buckets gj(q), j = 1, …, L
  • until c·L points are found or all L indices are
    searched
  • Approximate K-NNS
  • output the K points closest to q
  • fewer if less than K points are found
  • (r, ε)-neighbor with parameter r
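A sketch of the query procedure, continuing the previous sketches; points is assumed to map point ids to their Hamming-cube bit vectors, and the stopping constant c is illustrative:

    def query(q_bits, points, K, c=4):
        # Probe the buckets g_j(q), j = 1..L, collecting candidates and
        # stopping early once c*L candidates have been inspected
        candidates = []
        for j, I_j in enumerate(projections):
            slot = hash(g(q_bits, I_j)) % M
            candidates.extend(tables[j].get(slot, []))
            if len(candidates) >= c * L:
                break
        # Rank distinct candidates by true distance; return the K closest
        # (fewer if fewer than K candidates were found)
        return sorted(set(candidates),
                      key=lambda pid: hamming(points[pid], q_bits))[:K]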

10
Analysis
  • A family of hash functions is (r1, r2, P1, P2)-
    sensitive if, for any points p and q,
  • d(p,q) ≤ r1 implies Pr[h(p) = h(q)] ≥ P1
  • d(p,q) ≥ r2 implies Pr[h(p) = h(q)] ≤ P2
  • where r1 < r2 and P1 > P2
  • The family of single-coordinate projections on the
    Hamming cube {0,1}^d' is
    (r, r(1+ε), 1 − r/d', 1 − r(1+ε)/d')-sensitive
  • if dH(q,p) = r (r bits on which p and q differ),
    then Pr[h(q) = h(p)] = 1 − r/d'
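The collision probability follows directly: a single-coordinate projection h picks one of the d' coordinates uniformly at random and collides on p and q unless it lands on one of the r differing bits:

    \Pr[h(q) = h(p)] = 1 - \frac{d_H(q,p)}{d'} = 1 - \frac{r}{d'}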

11
LSH solves the (r, ε)-neighbor problem
  • Determine whether
  • there exists a point within distance r of the
    query point q,
  • or all points are at least a distance r(1+ε)
    away from q
  • In the former case,
  • return a point within distance r(1+ε) of q
  • Repeat the construction to boost the probability
    of success
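Why repetition boosts the probability (a standard calculation, not spelled out on the slide): each gj concatenates k independent single-coordinate projections, and L independent indices are built, so for a point p at distance r from q,

    \Pr[g_j(p) = g_j(q)] = \left(1 - \frac{r}{d'}\right)^{k},
    \qquad
    \Pr[\exists\, j : g_j(p) = g_j(q)] = 1 - \left(1 - \left(1 - \tfrac{r}{d'}\right)^{k}\right)^{L}.

Increasing L drives the failure probability down geometrically.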

12
ε-NN problem
  • For a given query point q,
  • return a point p from the dataset P
  • using multiple instances of the (r, ε)-neighbor
    solution
  • (r0, ε)-neighbor, (r0(1+ε), ε)-neighbor,
    (r0(1+ε)^2, ε)-neighbor, …, rmax-neighbor
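A sketch of the geometric radius schedule used to reduce ε-NN to a sequence of (r, ε)-neighbor structures (names are illustrative):

    def radius_schedule(r0, eps, r_max):
        # Radii r0, r0(1+eps), r0(1+eps)^2, ..., capped at r_max; one
        # (r, eps)-neighbor structure is built per radius
        radii = []
        r = r0
        while r < r_max:
            radii.append(r)
            r *= 1 + eps
        radii.append(r_max)
        return radii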

13
Experiments (1/3)
  • Datasets
  • color histograms (Corel Draw)
  • n = 20,000; d = 8, …, 64
  • texture features (aerial photos)
  • n = 270,000; d = 60
  • Query sets
  • Disk
  • each second-level bucket is directly mapped to a
    disk block

14
Experiments (2/3)
  • Interpoint distance profiles

[Figure: interpoint distance profiles (normalized frequency vs. interpoint distance) for the color histogram and texture feature datasets]
15
Experiments (3/3)
  • Performance measures
  • speed: average number of disk blocks accessed
  • effective error
  • the ratio dLSH(q) / d*(q) averaged over all
    queries, where dLSH(q) is the distance to the
    nearest neighbor found by LSH and d*(q) is the
    true nearest-neighbor distance
  • miss ratio
  • the fraction of queries for which no answer was
    found
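A sketch of computing the effective error from per-query distances (assuming two parallel lists; names are illustrative):

    def effective_error(d_lsh, d_true):
        # Mean ratio of the LSH nearest-neighbor distance to the true
        # nearest-neighbor distance over the query set
        return sum(a / b for a, b in zip(d_lsh, d_true)) / len(d_true)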

16
Experiments: color histograms (1/4)
  • Error vs. number of indices (L)

17
Experiments: color histograms (2/4)
  • Dependence on n

[Figure: disk accesses vs. number of database points, for approximate 1-NNS and approximate 10-NNS]
18
Experiments: color histograms (3/4)
  • Miss ratios

[Figure: miss ratio vs. number of database points, for approximate 1-NNS and approximate 10-NNS]
19
Experiments: color histograms (4/4)
  • Dependence on d

[Figure: disk accesses vs. number of dimensions, for approximate 1-NNS and approximate 10-NNS]
20
Experiments: texture features (1/2)
  • Number of indices (L) vs. error

21
Experiments: texture features (2/2)
  • Number of indices (L) vs. size

22
Concluding remarks
  • Locality Sensitive Hashing
  • fast approximate nearest-neighbor search
  • dynamic/join versions
  • Future work
  • hybrid techniques combining tree-based and
    hashing-based approaches